Commit 35fac1d

Net-new file and TOC entry
1 parent 19b087e commit 35fac1d

2 files changed

Lines changed: 220 additions & 0 deletions

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
1+
---
2+
title: Troubleshoot Rolling Upgrade Issues
3+
description: Describes how to troubleshoot rolling upgrade issues.
4+
ms.date: 12/05/2025
5+
manager: dcscontentpm
6+
audience: itpro
7+
ms.topic: troubleshooting
8+
ms.author: jeffhugh
9+
ms.reviewer: kaushika, v-ryanberg, v-gsitser
10+
ms.custom:
11+
- sap:rolling upgrade and high availability\rolling upgrade issues
12+
- pcy:WinComm Storage High Avail
13+
---
14+
# Troubleshoot rolling upgrade issues
15+
16+
## Summary
17+
18+
This article provides a structured troubleshooting approach for addressing common issues encountered during rolling upgrades in Windows Server Failover Clustering (WSFC), Storage Spaces Direct, SQL Server Always On availability groups, and Hyper-V.
19+
20+
Rolling upgrades let you maintain and upgrade systems with minimal downtime. However, compatibility problems and configuration errors can reduce availability and potentially cause data loss.
21+
22+
## Prerequisites
23+
24+
Before starting a rolling upgrade:
25+
26+
- Verify that the rolling upgrade feature is supported for your workload and operating system (OS) versions.
27+
- Confirm that all cluster nodes are healthy (in the `Up` state) by running the `Get-ClusterNode` PowerShell cmdlet.
28+
- Ensure you have up-to-date backups, including:
29+
- System state
30+
- Cluster configuration
31+
- User data
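
The node health check in the preceding list could be scripted along these lines. This is a minimal sketch, not a definitive implementation: `Get-NodeNotUp` is a hypothetical helper name (it isn't part of the FailoverClusters module), and the sketch assumes that a healthy node reports the `Up` state.

```powershell
# Sketch: return only the cluster nodes that aren't in the Up state.
# Get-NodeNotUp is a hypothetical helper, not a built-in cmdlet.
function Get-NodeNotUp {
    param(
        # Node objects, for example the output of Get-ClusterNode.
        [Parameter(Mandatory, ValueFromPipeline)]
        [object[]]$Node
    )
    process {
        foreach ($n in $Node) {
            # Compare as text so that enum values and plain strings both work.
            if ("$($n.State)" -ne 'Up') { $n }
        }
    }
}
```

For example, `Get-ClusterNode | Get-NodeNotUp` returns nothing when every node is up.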
32+
33+
## Potential workarounds
34+
35+
### Address rolling upgrade failures
36+
37+
1. Move core resources to another node using Failover Cluster Manager or the `Move-ClusterGroup` PowerShell command.
38+
2. Use `Suspend-ClusterNode -Drain` to migrate roles and resources off the node.
39+
3. Check cluster logs for dependencies or errors blocking the operation.
40+
41+
## Troubleshooting checklist
42+
43+
1. **Review prerequisites**: Ensure the environment meets all the prerequisites listed in the [Prerequisites](#prerequisites) section.
44+
45+
2. **Validate cluster status**: Run `Test-Cluster` and resolve any validation warnings or errors.
46+
- Verify the current cluster functional level using `Get-Cluster | Select-Object ClusterFunctionalLevel`.
47+
- Validate network connectivity among all nodes.
48+
49+
3. **Plan and sequence upgrades**: Document the sequence of node upgrades (one node at a time).
50+
- Move cluster roles (like virtual machines (VMs), availability groups, or file shares) off the node being upgraded.
51+
- Update all nodes with the latest supported patches or hotfixes for the current OS.
52+
53+
4. **Communicate with stakeholders**: Inform stakeholders and schedule maintenance windows.
54+
- Notify monitoring teams to avoid unnecessary alerts.
55+
56+
5. **Ensure application awareness**: Confirm application compatibility for workloads like SQL Server, Hyper-V, or file services.
57+
- Inform application owners of planned upgrades.
58+
59+
6. **Conduct pre-upgrade tests**: Review logs for Windows, applications, clusters, and storage to identify any pre-existing issues.
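
Before you upgrade the first node, it can help to record the state that the checklist asks you to verify. The following is a minimal sketch under stated assumptions: `Get-PreUpgradeSnapshot` is a hypothetical function name, and the sketch relies only on the `Get-Cluster` and `Get-ClusterNode` cmdlets already mentioned in this article.

```powershell
# Sketch: capture the functional level and node states before the upgrade starts.
# Get-PreUpgradeSnapshot is a hypothetical helper, not a built-in cmdlet.
function Get-PreUpgradeSnapshot {
    $cluster = Get-Cluster
    [pscustomobject]@{
        Cluster         = $cluster.Name
        FunctionalLevel = $cluster.ClusterFunctionalLevel
        # One "Name=State" entry per node, for example "Node1=Up".
        Nodes           = @(Get-ClusterNode | ForEach-Object { "$($_.Name)=$($_.State)" })
        CapturedAt      = Get-Date
    }
}
```

You might save the result with `Get-PreUpgradeSnapshot | ConvertTo-Json | Set-Content <Path>\pre-upgrade.json`. To limit validation to specific categories, `Test-Cluster` accepts an `-Include` list, for example `Test-Cluster -Include 'Inventory', 'Network', 'System Configuration'`.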
60+
61+
## Common issues and their respective solutions
62+
63+
### 1. Rolling upgrade fails to start or node can't be evicted
64+
65+
**Symptoms**
66+
67+
You're unable to pause, drain, or remove a node from the cluster. Errors like "Node ... cannot be removed from the cluster ..." appear.
68+
69+
**Cause**
70+
71+
The node hosts core cluster resources, dependencies are misconfigured, or the cluster is unstable.
72+
73+
**Solution**
74+
75+
1. Move core resources to another node using Failover Cluster Manager or `Move-ClusterGroup`.
76+
2. Use `Suspend-ClusterNode -Drain` to move roles and resources.
77+
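3. Ensure the node isn't the only node that holds an up-to-date copy of the cluster configuration, and that removing it doesn't cause the cluster to lose quorum.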
78+
4. Check cluster logs for blocking dependencies.
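
The first two steps can be sketched as a single function. Treat this as an illustration under stated assumptions rather than a definitive procedure: `Start-NodeDrain` is a hypothetical name, the node names are supplied by you, and the sketch assumes the core cluster group still has its default name, `Cluster Group`.

```powershell
# Sketch: move the core cluster group off a node, then pause and drain the node.
# Start-NodeDrain is a hypothetical helper, not a built-in cmdlet.
function Start-NodeDrain {
    param(
        [Parameter(Mandatory)][string]$NodeName,    # node that is about to be upgraded
        [Parameter(Mandatory)][string]$TargetNode   # node that should own the core cluster group
    )
    # Assumption: the core cluster group uses its default name, 'Cluster Group'.
    $core = Get-ClusterGroup -Name 'Cluster Group'
    if ($core.OwnerNode.Name -eq $NodeName) {
        Move-ClusterGroup -Name 'Cluster Group' -Node $TargetNode | Out-Null
    }
    # Pause the node and move its roles elsewhere. -Wait blocks until the drain finishes.
    Suspend-ClusterNode -Name $NodeName -Drain -Wait
}
```

If the drain doesn't finish, the cluster log for that node is usually the next place to look for the blocking resource or dependency.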
79+
80+
### 2. Failure adding upgraded node back to cluster
81+
82+
**Symptoms**
83+
84+
Errors like "A node attempted to join a failover cluster but failed due to incompatibility…" or version mismatch messages appear.
85+
86+
**Cause**
87+
88+
The cluster contains an unsupported mix of OS versions, or the node is missing required updates.
89+
90+
**Solution**
91+
92+
1. Verify the supported OS and cluster version matrix.
93+
2. Patch the node to the latest cumulative update (CU).
94+
3. Upgrade the OS versions sequentially (for example, Windows Server 2016 → 2019 → 2022).
95+
4. Use `Get-ClusterLog` to identify versioning errors.
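
After `Get-ClusterLog` writes the log files, you can narrow them down with `Select-String`. The following is a rough sketch: `Get-ClusterLogMatch` is a hypothetical helper, and the default search pattern is only an assumption about how version-related entries might be worded, so adjust it to match the errors you actually see.

```powershell
# Sketch: return the lines of a cluster log that match a search pattern.
# Get-ClusterLogMatch is a hypothetical helper, and the default pattern is a guess.
function Get-ClusterLogMatch {
    param(
        [Parameter(Mandatory)][string]$LogPath,       # file path; wildcards are allowed
        [string]$Pattern = 'version|incompatib'       # regular expression, case-insensitive
    )
    Select-String -Path $LogPath -Pattern $Pattern |
        Select-Object -ExpandProperty Line
}
```

For example, after running `Get-ClusterLog -Node <NodeName> -TimeSpan 60 -Destination <Path>` (the time span is in minutes), you could run `Get-ClusterLogMatch -LogPath <Path>\*.log`.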
96+
97+
### 3. Resource or service fails to come online
98+
99+
**Symptoms**
100+
101+
Resources like VMs or file shares enter a failed or offline state post-upgrade. Common Event IDs include `1069`, `1146`, and `1230`.
102+
103+
**Cause**
104+
105+
Misconfiguration during upgrade, missing registry keys or files, or service account failures.
106+
107+
**Solution**
108+
109+
1. Check cluster events in Failover Cluster Manager.
110+
2. Validate resource owner configurations using `Get-ClusterResource | Get-ClusterOwnerNode`.
111+
3. Repair or recreate missing dependencies.
112+
4. Restart the Cluster service on the affected node by running `Restart-Service ClusSvc`.
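
To look for the event IDs listed under **Symptoms** without opening Failover Cluster Manager, a query like the following might help. It's a minimal sketch: `Get-RecentClusterEvent` is a hypothetical function name, and the sketch assumes the events are logged to the System log by the `Microsoft-Windows-FailoverClustering` provider.

```powershell
# Sketch: query the System log for failover clustering events from the last few hours.
# Get-RecentClusterEvent is a hypothetical helper, not a built-in cmdlet.
function Get-RecentClusterEvent {
    param(
        [int[]]$Id = @(1069, 1146, 1230),   # event IDs named in this section
        [int]$Hours = 24                    # how far back to search
    )
    # -ErrorAction SilentlyContinue suppresses the error that Get-WinEvent
    # raises when no events match the filter.
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'   # assumed provider name
        Id           = $Id
        StartTime    = (Get-Date).AddHours(-$Hours)
    } -ErrorAction SilentlyContinue
}
```

To list the affected resources themselves, you could run `Get-ClusterResource | Where-Object { "$($_.State)" -in 'Failed', 'Offline' }`.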
113+
114+
### 4. Quorum or communication loss
115+
116+
**Symptoms**
117+
118+
Cluster goes offline, nodes enter quarantine, or Event IDs `1135` and `1136` appear.
119+
120+
**Cause**

Network partition, firewall configuration, or quorum misconfiguration.
121+
122+
**Solution**
123+
124+
1. Ensure all required ports are open.
125+
2. Check network, DNS, and routing configurations.
126+
3. Check quorum settings with `Get-ClusterQuorum` and update them if necessary.
127+
4. Run `Test-Cluster` to identify root causes.
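
A quick TCP reachability check between nodes can be sketched as follows. Several parts are assumptions: `Test-ClusterPort` is a hypothetical helper, and port 3343 is used as the default because it's the port commonly associated with the Cluster service. Confirm the full port list for your environment. `Test-NetConnection` tests TCP only, so this sketch doesn't verify UDP traffic.

```powershell
# Sketch: test TCP connectivity to a port on each cluster node.
# Test-ClusterPort is a hypothetical helper. The default port is an assumption.
function Test-ClusterPort {
    param(
        [Parameter(Mandatory)][string[]]$ComputerName,
        [int]$Port = 3343
    )
    foreach ($computer in $ComputerName) {
        $result = Test-NetConnection -ComputerName $computer -Port $Port -WarningAction SilentlyContinue
        [pscustomobject]@{
            Node      = $computer
            Port      = $Port
            Reachable = [bool]$result.TcpTestSucceeded
        }
    }
}
```

For example, `Test-ClusterPort -ComputerName (Get-ClusterNode).Name` checks every node from the node where you run it.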
128+
129+
### 5. Patch or update failure or known bug
130+
131+
**Symptoms**
132+
133+
Cluster services crash post-update or resources fail due to a known problematic update.
134+
135+
**Cause**
136+
137+
A Microsoft update or patch causes cluster instability.
138+
139+
**Solution**
140+
141+
1. Review Microsoft Knowledge Base (KB) articles for known issues.
142+
2. Remove problematic updates if needed.
143+
3. Apply recommended hotfixes or wait for updated patches.
144+
4. If the issue persists, open a support case.
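
To see which updates were installed shortly before the problem started, you can filter the output of `Get-HotFix` by installation date. This is a minimal sketch: `Get-RecentHotFix` is a hypothetical helper, and the 30-day default is an arbitrary assumption.

```powershell
# Sketch: keep only the updates installed within the last N days.
# Get-RecentHotFix is a hypothetical helper, not a built-in cmdlet.
function Get-RecentHotFix {
    param(
        # Update objects, for example the output of Get-HotFix.
        [Parameter(Mandatory, ValueFromPipeline)]
        [object[]]$HotFix,
        [int]$Days = 30
    )
    begin { $cutoff = (Get-Date).AddDays(-$Days) }
    process {
        foreach ($h in $HotFix) {
            # Some entries have no InstalledOn value, so skip those.
            if ($h.InstalledOn -and $h.InstalledOn -ge $cutoff) { $h }
        }
    }
}
```

For example, `Get-HotFix | Get-RecentHotFix -Days 14 | Sort-Object InstalledOn -Descending`. If you decide to remove an update, `wusa.exe /uninstall /kb:<KBNumber> /norestart` is one option, although some updates can't be uninstalled this way.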
145+
146+
### 6. Cluster validation or functional level errors
147+
148+
**Symptoms**
149+
150+
You can't update the cluster functional level, or cluster validation fails.
151+
152+
**Cause**
153+
154+
Mixed OS versions, incomplete upgrades, or outdated drivers.
155+
156+
**Solution**
157+
158+
1. Update all nodes and ensure they're joined to the cluster.
159+
2. Update hardware drivers (like network and storage) and firmware.
160+
3. Use `Update-ClusterFunctionalLevel` to complete the upgrade.
161+
4. Review logs for driver or validation failures.
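
Because updating the cluster functional level can't be undone, it's worth previewing the change first. The following sketch is one way to do that: `Invoke-FunctionalLevelUpdate` is a hypothetical function name, and the check only confirms that every node is in the `Up` state. It doesn't verify that every node runs the new OS version, so confirm that separately.

```powershell
# Sketch: preview (or apply) the functional level update only when all nodes are up.
# Invoke-FunctionalLevelUpdate is a hypothetical helper, not a built-in cmdlet.
function Invoke-FunctionalLevelUpdate {
    param(
        # Without -Apply, the function only previews the change by using -WhatIf.
        [switch]$Apply
    )
    $notUp = @(Get-ClusterNode | Where-Object { "$($_.State)" -ne 'Up' })
    if ($notUp.Count -gt 0) {
        Write-Warning "Nodes not in the Up state: $($notUp.Name -join ', ')"
        return
    }
    if ($Apply) {
        Update-ClusterFunctionalLevel
    } else {
        Update-ClusterFunctionalLevel -WhatIf
    }
}
```

Run `Invoke-FunctionalLevelUpdate` first to preview the change, and add `-Apply` only after you've reviewed the result.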
162+
163+
## Advanced troubleshooting and data collection
164+
165+
For persistent or complex issues, collect the following data:
166+
167+
**Cluster logs**
168+
169+
```powershell
# -TimeSpan is specified in minutes. 1440 minutes covers the last 24 hours.
Get-ClusterLog -TimeSpan 1440 -Destination <Path>
```
174+
175+
**System and application event logs**
176+
177+
```powershell
Get-WinEvent -LogName System -MaxEvents 1000 | Export-Csv <Path>\SystemLogs.csv
Get-WinEvent -LogName Application -MaxEvents 1000 | Export-Csv <Path>\AppLogs.csv
```
183+
184+
**Resource and node status**
185+
186+
```powershell
Get-ClusterNode
Get-ClusterResource
Get-ClusterGroup
Test-Cluster
```
195+
196+
**Network and driver information**
197+
198+
```powershell
Get-NetAdapter -IncludeHidden | Export-Csv <Path>\NetAdapters.csv
```
203+
204+
**Patch or update history**
205+
206+
```powershell
Get-HotFix | Export-Csv <Path>\Hotfix.csv
```
211+
212+
## References
213+
214+
- [Upgrade a Windows Server failover cluster with a cluster OS rolling upgrade](/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade)
215+
- [Update-ClusterFunctionalLevel](/powershell/module/failoverclusters/update-clusterfunctionallevel)
216+
- [Known issues - KB5062557](https://support.microsoft.com/help/5062557)

support/windows-server/toc.yml

Lines changed: 4 additions & 0 deletions
@@ -1190,6 +1190,10 @@ items:
11901190
href: ./high-availability/troubleshoot-issues-accounts-used-failover-clusters.md
11911191
- name: Tuning Failover Cluster Network Thresholds
11921192
href: ./high-availability/iaas-sql-failover-cluster-network-thresholds.md
1193+
- name: Rolling upgrades
1194+
items:
1195+
- name: Troubleshoot rolling upgrade issues
1196+
href: ./high-availability/troubleshoot-rolling-upgrades.md
11931197
- name: Licensing and activation
11941198
items:
11951199
- name: Licensing and activation
