
Commit 64cbef5

Merge pull request #10319 from ryanberg-aquent/CI-7835
AB#7835: [Windows Server: New] Troubleshooting Guide: Rolling Upgrades
2 parents 72e82ac + 997d1f3 commit 64cbef5

2 files changed

Lines changed: 228 additions & 0 deletions


Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
1+
---
2+
title: Troubleshoot Rolling Upgrade Issues
3+
description: Discusses how to troubleshoot rolling upgrade issues.
4+
ms.date: 12/05/2025
5+
manager: dcscontentpm
6+
audience: itpro
7+
ms.topic: troubleshooting
8+
ms.author: jeffhugh
9+
ms.reviewer: kaushika, v-ryanberg, v-gsitser
10+
ms.custom:
11+
- sap:rolling upgrade and high availability\rolling upgrade issues
12+
- pcy:WinComm Storage High Avail
13+
- appliesto:
14+
- <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Supported versions of Windows Server</a>
15+
---
16+
# Troubleshoot rolling upgrade issues
17+
18+
## Summary
19+
20+
This article provides a structured troubleshooting method to resolve common issues that you might encounter during rolling upgrades in Windows Server Failover Clustering (WSFC), Storage Spaces Direct, SQL Server Always On availability groups, and Hyper-V.
21+
22+
Rolling upgrades are essential for maintaining and upgrading systems with minimal downtime. However, challenges such as compatibility and configuration errors can affect availability and potentially cause data loss.
23+
24+
## Prerequisites
25+
26+
Before you start a rolling upgrade:
27+
28+
- Verify that the rolling upgrade feature is supported for your workload and operating system (OS) versions.
29+
- Verify that all cluster nodes are healthy by using the `Get-ClusterNode` PowerShell command.
30+
- Make sure that you have up-to-date backups, including:
31+
- System state
32+
- Cluster configuration
33+
- User data
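
The node health check in this list can be scripted. The following is a minimal sketch and not part of the original guidance: `Select-NodeNotUp` is a hypothetical helper name, and the sketch assumes that each object passed in exposes `Name` and `State` properties, as the objects returned by `Get-ClusterNode` do.

```powershell
# Hypothetical helper: return the nodes whose State is anything other than 'Up'.
function Select-NodeNotUp {
    param(
        [Parameter(Mandatory)]
        [object[]]$Node
    )
    # Cast State to a string so that the comparison works for enum and string values alike.
    $Node | Where-Object { [string]$_.State -ne 'Up' }
}
```

On a cluster node, `Select-NodeNotUp -Node (Get-ClusterNode)` should return nothing when every node is up. Investigate any node that it does return before you start the upgrade.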
34+
35+
## Potential workarounds
36+
37+
### Address rolling upgrade failures
38+
39+
1. Move core resources to another node by using Failover Cluster Manager or the `Move-ClusterGroup` PowerShell command.
40+
2. Migrate roles and resources off the node by using `Suspend-ClusterNode -Drain`.
41+
3. Check cluster logs for dependencies or errors that might block the operation.
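
The first two steps above can be combined into one function. This is a sketch under stated assumptions: `Invoke-NodeDrain` is a hypothetical name, and the core cluster group is assumed to still have its default name, `Cluster Group`. Adjust both for your environment.

```powershell
# Hypothetical helper: move the core cluster group off a node, then drain the node.
function Invoke-NodeDrain {
    param(
        [Parameter(Mandatory)][string]$NodeName,    # node that you plan to upgrade
        [Parameter(Mandatory)][string]$TargetNode   # node that should take over the core resources
    )
    Import-Module FailoverClusters

    # Assumption: the core cluster group uses its default name, 'Cluster Group'.
    $core = Get-ClusterGroup -Name 'Cluster Group'
    if ($core.OwnerNode.Name -eq $NodeName) {
        $core | Move-ClusterGroup -Node $TargetNode
    }

    # Pause the node, move its roles elsewhere, and wait for the drain to finish.
    Suspend-ClusterNode -Name $NodeName -Drain -Wait

    # Return any groups that the node still owns; an empty result means the drain completed.
    Get-ClusterGroup | Where-Object { $_.OwnerNode.Name -eq $NodeName }
}
```

If the function returns any groups, those groups are the place to start when you review the cluster logs for blocking dependencies.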
42+
43+
## Troubleshooting checklist
44+
45+
1. **Review prerequisites**: Make sure that the environment meets all prerequisites that are mentioned in this article.
46+
47+
2. **Validate cluster status**: Run `Test-Cluster`, and resolve any validation warnings or errors that it reports.
48+
- Verify the current cluster functional level by using `Get-Cluster | Select ClusterFunctionalLevel`.
49+
- Validate network connectivity among all nodes.
50+
51+
3. **Plan and sequence upgrades**: Document the sequence of node upgrades (one node at a time).
52+
- Move cluster roles (such as virtual machines (VMs), availability groups, or file shares) off the node that's being upgraded.
53+
- Update all nodes to the latest supported updates or hotfixes for the current OS.
54+
55+
4. **Communicate with stakeholders**: Inform stakeholders and schedule maintenance windows.
56+
- Notify monitoring teams in order to avoid unnecessary alerts.
57+
58+
5. **Ensure application awareness**: Verify application compatibility for workloads such as SQL Server, Hyper-V, or file services.
59+
- Inform application owners about planned upgrades.
60+
61+
6. **Conduct pre-upgrade tests**: Review logs for Windows, applications, clusters, and storage to identify any pre-existing issues.
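
The `ClusterFunctionalLevel` value from step 2 is a number, which can be hard to interpret at a glance. The following sketch translates it into an OS name. `Get-FunctionalLevelName` is a hypothetical helper, and the mapping covers only the levels listed in the code (to the best of my knowledge, 8 through 11); verify the values against the rolling upgrade documentation for your versions.

```powershell
# Hypothetical helper: translate a cluster functional level into a readable OS name.
function Get-FunctionalLevelName {
    param(
        [Parameter(Mandatory)][int]$Level
    )
    # Assumed mapping; confirm it against the official rolling upgrade documentation.
    $map = @{
        8  = 'Windows Server 2012 R2'
        9  = 'Windows Server 2016'
        10 = 'Windows Server 2019'
        11 = 'Windows Server 2022'
    }
    if ($map.ContainsKey($Level)) { $map[$Level] } else { "Unknown level $Level" }
}
```

For example, `Get-FunctionalLevelName -Level (Get-Cluster).ClusterFunctionalLevel` reports the level that the cluster currently runs at. A level lower than the OS of the upgraded nodes indicates that the cluster is still in mixed mode.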
62+
63+
## Common issues and their respective solutions
64+
65+
### 1. Rolling upgrade doesn't start or node can't be evicted
66+
67+
**Symptoms**
68+
69+
You can't pause, drain, or remove a node from the cluster. You receive error messages such as the following example:
70+
71+
> Node... cannot be removed from the cluster.
72+
73+
**Cause**
74+
75+
The node hosts core cluster resources, dependencies are misconfigured, or the cluster is unstable.
76+
77+
**Solution**
78+
79+
1. Move core resources to another node by using Failover Cluster Manager or `Move-ClusterGroup`.
80+
2. Move roles and resources off the node by running `Suspend-ClusterNode -Drain`.
81+
3. Make sure that the node isn't the last up-to-date or quorum node.
82+
4. Check cluster logs for blocking dependencies.
83+
84+
### 2. Can't restore upgraded node to cluster
85+
86+
**Symptoms**
87+
88+
You receive a version mismatch message or error messages such as the following example:
89+
90+
> A node attempted to join a failover cluster but failed due to incompatibility.
91+
92+
**Cause**
93+
94+
Unsupported OS version mix or nonupdated node.
95+
96+
**Solution**
97+
98+
1. Verify the supported OS and cluster version matrix.
99+
2. Update the node to the latest cumulative update (CU).
100+
3. Upgrade the OS versions sequentially (for example, 2016 → 2019 → 2022).
101+
4. Identify versioning errors by using `Get-ClusterLog`.
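
To see which OS builds are mixed in the cluster, you can group the nodes by build number. This is a sketch only: `Get-NodeBuildSummary` is a hypothetical helper, and it assumes that each input object has `Name` and `BuildNumber` properties (the objects returned by `Get-ClusterNode` are expected to expose both).

```powershell
# Hypothetical helper: summarize which nodes run which OS build.
function Get-NodeBuildSummary {
    param(
        [Parameter(Mandatory)][object[]]$Node
    )
    $Node |
        Group-Object -Property BuildNumber |
        Sort-Object -Property Name |
        ForEach-Object {
            [pscustomobject]@{
                BuildNumber = $_.Name                                   # group key, as a string
                Nodes       = ($_.Group.Name | Sort-Object) -join ', '  # node names in that group
            }
        }
}
```

On a cluster node, `Get-NodeBuildSummary -Node (Get-ClusterNode)` lists one row per build. More than two rows, or a build that skips an OS generation, suggests an unsupported version mix.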
102+
103+
### 3. Resource or service doesn't come online
104+
105+
**Symptoms**
106+
107+
Resources such as VMs or file shares enter a failed or offline state post-upgrade. Common Event IDs include `1069`, `1146`, and `1230`.
108+
109+
**Cause**
110+
111+
Misconfiguration during upgrade, missing registry keys or files, or service account failures.
112+
113+
**Solution**
114+
115+
1. Check cluster events in Failover Cluster Manager.
116+
2. Verify resource owner configurations by running `Get-ClusterResource | Get-ClusterOwnerNode`.
117+
3. Repair or re-create missing dependencies.
118+
4. Restart cluster services by running `Restart-Service ClusSvc`.
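
To pull the Event IDs that are mentioned in the symptoms from the System log, you can build a filter for `Get-WinEvent`. The following is a minimal sketch: `New-ClusterEventFilter` is a hypothetical helper, and the 24-hour default window is an arbitrary choice.

```powershell
# Hypothetical helper: build a Get-WinEvent filter for failover clustering events.
function New-ClusterEventFilter {
    param(
        [Parameter(Mandatory)][int[]]$Id,   # Event IDs to search for
        [int]$Hours = 24                    # how far back to look (assumed default)
    )
    @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Id           = $Id
        StartTime    = (Get-Date).AddHours(-$Hours)
    }
}
```

For example, `Get-WinEvent -FilterHashtable (New-ClusterEventFilter -Id 1069, 1146, 1230) -ErrorAction SilentlyContinue` returns the matching events. `Get-WinEvent` reports an error when nothing matches, so the `-ErrorAction` setting keeps an empty result quiet.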
119+
120+
### 4. Quorum or communication loss
121+
122+
**Symptoms**
123+
124+
Cluster goes offline, nodes enter quarantine, or Event IDs `1135` and `1136` appear.
125+
126+
**Cause**
127+
128+
Network partition, firewall configuration, or quorum misconfiguration.
129+
130+
**Solution**
131+
132+
1. Make sure that all required ports are open.
133+
2. Check network, DNS, and routing configurations.
134+
3. Check quorum settings by running `Get-ClusterQuorum`. Update settings as appropriate.
135+
4. To identify root causes, run `Test-Cluster`.
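
A quick TCP reachability test between nodes can help narrow down firewall problems. This is a sketch, not a complete port audit: `Test-ClusterNodePort` is a hypothetical helper, and the default of port 3343 (commonly used by the Cluster service) is my assumption. The full list of required ports depends on your workloads, and this check doesn't cover UDP.

```powershell
# Hypothetical helper: test TCP reachability of one port on each listed node.
function Test-ClusterNodePort {
    param(
        [Parameter(Mandatory)][string[]]$ComputerName,
        [int]$Port = 3343   # assumed default; change it for other required ports
    )
    foreach ($name in $ComputerName) {
        # Test-NetConnection is available on Windows; warnings are suppressed to keep output tidy.
        $result = Test-NetConnection -ComputerName $name -Port $Port -WarningAction SilentlyContinue
        [pscustomobject]@{
            ComputerName = $name
            Port         = $Port
            Reachable    = $result.TcpTestSucceeded
        }
    }
}
```

Run it from each node against the other nodes, for example `Test-ClusterNodePort -ComputerName (Get-ClusterNode).Name`. A node that is unreachable from only some peers points to a network partition or a host firewall rule rather than a cluster-wide setting.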
136+
137+
### 5. Update failure or known bug
138+
139+
**Symptoms**
140+
141+
Cluster services stop responding after an update, or resources fail because of a known problematic update.
142+
143+
**Cause**
144+
145+
Cluster instability occurred after a Microsoft update installation.
146+
147+
**Solution**
148+
149+
1. Review Microsoft Knowledge Base (KB) articles for known issues.
150+
2. Remove problematic updates, if necessary.
151+
3. Apply recommended hotfixes or wait for new updates.
152+
4. Open a support case if the issue remains unresolved.
153+
154+
### 6. Cluster validation or functional level errors
155+
156+
**Symptoms**
157+
158+
Can't update the cluster functional level, or validation fails.
159+
160+
**Cause**
161+
162+
Mixed OS versions, incomplete upgrades, or outdated drivers.
163+
164+
**Solution**
165+
166+
1. Update all nodes, and make sure that they're joined to the cluster.
167+
2. Update hardware drivers (such as network and storage) and firmware.
168+
3. Complete the upgrade by using `Update-ClusterFunctionalLevel`.
169+
4. Review logs for driver or validation failures.
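
Because raising the functional level can't be reversed, it can help to guard the command with a node check and a dry run. The following sketch assumes that you run it on a cluster node. `Update-LevelWhenReady` is a hypothetical wrapper, not a built-in cmdlet.

```powershell
# Hypothetical wrapper: raise the functional level only when every node reports Up.
function Update-LevelWhenReady {
    param(
        [switch]$Commit   # without this switch, the function performs a dry run only
    )
    Import-Module FailoverClusters

    $notUp = Get-ClusterNode | Where-Object { [string]$_.State -ne 'Up' }
    if ($notUp) {
        Write-Warning "Nodes not Up: $($notUp.Name -join ', '). Functional level left unchanged."
        return
    }

    if ($Commit) {
        Update-ClusterFunctionalLevel            # prompts for confirmation by default
    } else {
        Update-ClusterFunctionalLevel -WhatIf    # shows what would happen without changing anything
    }
}
```

Run `Update-LevelWhenReady` first to review the dry-run output, and then run `Update-LevelWhenReady -Commit` after all nodes are upgraded and validated.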
170+
171+
## Advanced troubleshooting and data collection
172+
173+
For persistent or complex issues, collect the following data.
174+
175+
**Cluster logs**
176+
177+
```powershell
# -TimeSpan is specified in minutes; 1440 minutes covers the last 24 hours.
Get-ClusterLog -TimeSpan 1440 -Destination <Path>
```
182+
183+
**System and application event logs**
184+
185+
```powershell
Get-WinEvent -LogName System -MaxEvents 1000 | Export-Csv <Path>\SystemLogs.csv
Get-WinEvent -LogName Application -MaxEvents 1000 | Export-Csv <Path>\AppLogs.csv
```
191+
192+
**Resource and node status**
193+
194+
```powershell
Get-ClusterNode
Get-ClusterResource
Get-ClusterGroup
Test-Cluster
```
203+
204+
**Network and driver information**
205+
206+
```powershell
Get-NetAdapter -IncludeHidden | Export-Csv <Path>\NetAdapters.csv
```
211+
212+
**Update history**
213+
214+
```powershell
Get-HotFix | Export-Csv <Path>\Hotfix.csv
```
219+
220+
## References
221+
222+
- [Upgrade a Windows Server failover cluster with a cluster OS rolling upgrade](/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade)
223+
- [Update-ClusterFunctionalLevel](/powershell/module/failoverclusters/update-clusterfunctionallevel)
224+
- [Known issues - KB5062557](https://support.microsoft.com/help/5062557)

support/windows-server/toc.yml

Lines changed: 4 additions & 0 deletions
@@ -1212,6 +1212,10 @@ items:
12121212
href: ./high-availability/troubleshoot-issues-accounts-used-failover-clusters.md
12131213
- name: Tuning Failover Cluster Network Thresholds
12141214
href: ./high-availability/iaas-sql-failover-cluster-network-thresholds.md
1215+
- name: Rolling upgrades
1216+
items:
1217+
- name: Troubleshoot rolling upgrade issues
1218+
href: ./high-availability/troubleshoot-rolling-upgrades.md
12151219
- name: Site configuration
12161220
items:
12171221
- name: Troubleshoot site configuration issues
