|
| 1 | +--- |
| 2 | +title: Troubleshoot Rolling Upgrade Issues |
| 3 | +description: Describes how to troubleshoot rolling upgrade issues. |
| 4 | +ms.date: 12/05/2025 |
| 5 | +manager: dcscontentpm |
| 6 | +audience: itpro |
| 7 | +ms.topic: troubleshooting |
| 8 | +ms.author: jeffhugh |
| 9 | +ms.reviewer: kaushika, v-ryanberg, v-gsitser |
| 10 | +ms.custom: |
| 11 | +- sap:rolling upgrade and high availability\rolling upgrade issues |
| 12 | +- pcy:WinComm Storage High Avail |
| 13 | +--- |
| 14 | +# Troubleshoot rolling upgrade issues |
| 15 | + |
| 16 | +## Summary |
| 17 | + |
| 18 | +This article provides a structured troubleshooting approach for addressing common issues encountered during rolling upgrades in Windows Server Failover Clustering (WSFC), Storage Spaces Direct, SQL Server Always On availability groups, and Hyper-V. |
| 19 | + |
| 20 | +Rolling upgrades let you upgrade cluster nodes one at a time with minimal downtime. However, compatibility problems and configuration errors can reduce availability and potentially cause data loss. |
| 21 | + |
| 22 | +## Prerequisites |
| 23 | + |
| 24 | +Before starting a rolling upgrade: |
| 25 | + |
| 26 | +- Verify that the rolling upgrade feature is supported for your workload and operating system (OS) versions. |
| 27 | +- Confirm all cluster nodes are healthy using the `Get-ClusterNode` PowerShell command. |
| 28 | +- Ensure you have up-to-date backups, including: |
| 29 | + - System state |
| 30 | + - Cluster configuration |
| 31 | + - User data |
| 32 | + |
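The node health check in the prerequisites can be sketched as follows. This is a minimal example, not a definitive procedure. It assumes the FailoverClusters module is available on the machine where it runs, and it treats any state other than `Up` (including `Paused`) as something to resolve first, which is an assumption rather than a rule stated in this article.

```powershell
# Requires the FailoverClusters module (cluster node or management host).
Import-Module FailoverClusters

# Collect every node that is not in the Up state.
# Assumption: Paused nodes are also treated as "not ready" here.
$unhealthy = Get-ClusterNode | Where-Object { $_.State -ne 'Up' }

if ($unhealthy) {
    $unhealthy | Format-Table Name, State
    Write-Warning 'Resolve the node states listed above before starting the upgrade.'
} else {
    Write-Output 'All cluster nodes report Up.'
}
```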
| 33 | +## Potential workarounds |
| 34 | + |
| 35 | +### Address rolling upgrade failures |
| 36 | + |
| 37 | +1. Move core resources to another node using Failover Cluster Manager or the `Move-ClusterGroup` PowerShell command. |
| 38 | +2. Use `Suspend-ClusterNode -Drain` to migrate roles and resources off the node. |
| 39 | +3. Check cluster logs for dependencies or errors blocking the operation. |
| 40 | + |
| 41 | +## Troubleshooting checklist |
| 42 | + |
| 43 | +1. **Review prerequisites**: Ensure the environment meets all prerequisites listed in the Prerequisites section of this article. |
| 44 | + |
| 45 | +2. **Validate cluster status**: Run `Test-Cluster` and resolve any validation warnings or errors. |
| 46 | + - Verify the current cluster functional level using `Get-Cluster | Select ClusterFunctionalLevel`. |
| 47 | + - Validate network connectivity among all nodes. |
| 48 | + |
| 49 | +3. **Plan and sequence upgrades**: Document the sequence of node upgrades (one node at a time). |
| 50 | + - Move cluster roles (like virtual machines (VMs), availability groups, or file shares) off the node being upgraded. |
| 51 | + - Update all nodes with the latest supported patches or hotfixes for the current OS. |
| 52 | + |
| 53 | +4. **Communicate with stakeholders**: Inform stakeholders and schedule maintenance windows. |
| 54 | + - Notify monitoring teams to avoid unnecessary alerts. |
| 55 | + |
| 56 | +5. **Ensure application awareness**: Confirm application compatibility for workloads like SQL Server, Hyper-V, or file services. |
| 57 | + - Inform application owners of planned upgrades. |
| 58 | + |
| 59 | +6. **Conduct pre-upgrade tests**: Review logs for Windows, applications, clusters, and storage to identify any pre-existing issues. |
| 60 | + |
| 61 | +## Common issues and their respective solutions |
| 62 | + |
| 63 | +### 1. Rolling upgrade fails to start or node can't be evicted |
| 64 | + |
| 65 | +**Symptoms** |
| 66 | + |
| 67 | +You're unable to pause, drain, or remove a node from the cluster. Errors like "Node ... cannot be removed from the cluster ..." appear. |
| 68 | + |
| 69 | +**Cause** |
| 70 | + |
| 71 | +The node hosts core cluster resources, dependencies are misconfigured, or the cluster is unstable. |
| 72 | + |
| 73 | +**Solution** |
| 74 | + |
| 75 | +1. Move core resources to another node using Failover Cluster Manager or `Move-ClusterGroup`. |
| 76 | +2. Use `Suspend-ClusterNode -Drain` to move roles and resources. |
| 77 | +3. Ensure that removing the node doesn't cost the cluster its quorum and that the node isn't the only one with an up-to-date copy of the cluster configuration. |
| 78 | +4. Check cluster logs for blocking dependencies. |
| 79 | + |
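The drain step above might look like the following sketch. The node name `Node1` is hypothetical. The ownership check at the end is one possible way to confirm the drain finished, not a step prescribed by this article.

```powershell
Import-Module FailoverClusters

$node = 'Node1'   # hypothetical node name; replace with the node being upgraded

# Pause the node and drain its roles to other nodes.
# -Wait blocks until the drain operation completes.
Suspend-ClusterNode -Name $node -Drain -Wait

# List any groups still owned by the node; an empty result suggests the drain succeeded.
Get-ClusterGroup |
    Where-Object { $_.OwnerNode.Name -eq $node } |
    Format-Table Name, OwnerNode, State
```

If groups remain on the node, the cluster log is the next place to look for the blocking dependency.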
| 80 | +### 2. Failure adding upgraded node back to cluster |
| 81 | + |
| 82 | +**Symptoms** |
| 83 | + |
| 84 | +Errors like "A node attempted to join a failover cluster but failed due to incompatibility…" or version mismatch messages appear. |
| 85 | + |
| 86 | +**Cause** |
| 87 | + |
| 88 | +Unsupported OS version mix or unpatched node. |
| 89 | + |
| 90 | +**Solution** |
| 91 | + |
| 92 | +1. Verify the supported OS and cluster version matrix. |
| 93 | +2. Patch the node to the latest cumulative update (CU). |
| 94 | +3. Upgrade the OS versions sequentially (for example, 2016 → 2019 → 2022). |
| 95 | +4. Use `Get-ClusterLog` to identify versioning errors. |
| 96 | + |
| 97 | +### 3. Resource or service fails to come online |
| 98 | + |
| 99 | +**Symptoms** |
| 100 | + |
| 101 | +Resources like VMs or file shares enter a failed or offline state post-upgrade. Common Event IDs include `1069`, `1146`, and `1230`. |
| 102 | + |
| 103 | +**Cause** |
| 104 | + |
| 105 | +Misconfiguration during upgrade, missing registry keys or files, or service account failures. |
| 106 | + |
| 107 | +**Solution** |
| 108 | + |
| 109 | +1. Check cluster events in Failover Cluster Manager. |
| 110 | +2. Validate resource owner configurations using `Get-ClusterResource | Get-ClusterOwnerNode`. |
| 111 | +3. Repair or recreate missing dependencies. |
| 112 | +4. Restart cluster services with `Restart-Service ClusSvc`. |
| 113 | + |
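As an alternative to browsing events in Failover Cluster Manager, the event IDs named in the symptoms can be queried directly. The sketch below assumes the events are written to the System log by the `Microsoft-Windows-FailoverClustering` provider. The 24-hour window is an arbitrary choice for illustration.

```powershell
# Filter the System log for the failover clustering event IDs mentioned above.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1069, 1146, 1230
    StartTime    = (Get-Date).AddHours(-24)   # assumption: look back 24 hours
}

# Get-WinEvent raises an error when nothing matches, so suppress it for a clean empty result.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message |
    Format-List
```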
| 114 | +### 4. Quorum or communication loss |
| 115 | + |
| 116 | +**Symptoms** |
| 117 | + |
| 118 | +Cluster goes offline, nodes enter quarantine, or Event IDs `1135` and `1136` appear. |
| 119 | + |
| 120 | +**Cause** Network partition, firewall configuration, or quorum misconfiguration. |
| 121 | + |
| 122 | +**Solution** |
| 123 | + |
| 124 | +1. Ensure all required ports are open. |
| 125 | +2. Check network, DNS, and routing configurations. |
| 126 | +3. Check quorum settings with `Get-ClusterQuorum` and update them if necessary. |
| 127 | +4. Run `Test-Cluster` to identify root causes. |
| 128 | + |
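A rough connectivity and quorum check could be scripted as shown below. This article doesn't list the required ports, so the port list here is an assumption: TCP 3343 is the Cluster Service port and TCP 445 is SMB. `Test-NetConnection` only tests TCP, so UDP traffic on 3343 isn't covered by this sketch. It also assumes the cluster is reachable enough for `Get-ClusterNode` to return the node list.

```powershell
Import-Module FailoverClusters

# Assumption: representative ports only; extend the list for your environment.
$ports = 3343, 445

foreach ($node in Get-ClusterNode) {
    foreach ($port in $ports) {
        $result = Test-NetConnection -ComputerName $node.Name -Port $port -WarningAction SilentlyContinue
        [pscustomobject]@{
            Node      = $node.Name
            Port      = $port
            Reachable = $result.TcpTestSucceeded
        }
    }
}

# Show the current quorum configuration to compare against the intended design.
Get-ClusterQuorum | Format-List *
```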
| 129 | +### 5. Patch or update failure or known bug |
| 130 | + |
| 131 | +**Symptoms** |
| 132 | + |
| 133 | +Cluster services crash post-update or resources fail due to a known problematic update. |
| 134 | + |
| 135 | +**Cause** |
| 136 | + |
| 137 | +Microsoft updates or patches causing cluster instability. |
| 138 | + |
| 139 | +**Solution** |
| 140 | + |
| 141 | +1. Review Microsoft Knowledge Base (KB) articles for known issues. |
| 142 | +2. Remove problematic updates if needed. |
| 143 | +3. Apply recommended hotfixes or wait for updated patches. |
| 144 | +4. Open a support case if still unresolved. |
| 145 | + |
| 146 | +### 6. Cluster validation or functional level errors |
| 147 | + |
| 148 | +**Symptoms** |
| 149 | + |
| 150 | +You can't update the cluster functional level, or cluster validation fails. |
| 151 | + |
| 152 | +**Cause** |
| 153 | + |
| 154 | +Mixed OS versions, incomplete upgrades, or outdated drivers. |
| 155 | + |
| 156 | +**Solution** |
| 157 | + |
| 158 | +1. Update all nodes and ensure they're joined to the cluster. |
| 159 | +2. Update hardware drivers (like network and storage) and firmware. |
| 160 | +3. Use `Update-ClusterFunctionalLevel` to complete the upgrade. |
| 161 | +4. Review logs for driver or validation failures. |
| 162 | + |
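Before committing the functional level update, it can help to confirm node state and preview the change. The following is a minimal sketch. The choice of columns is illustrative, and `-WhatIf` is used so that nothing changes until you remove it.

```powershell
Import-Module FailoverClusters

# Confirm every node is Up and compare OS version and build across nodes.
Get-ClusterNode | Format-Table Name, State, MajorVersion, MinorVersion, BuildNumber

# Show the current functional level.
Get-Cluster | Select-Object Name, ClusterFunctionalLevel

# Preview the update. Remove -WhatIf to commit; the update can't be reversed afterward.
Update-ClusterFunctionalLevel -WhatIf
```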
| 163 | +## Advanced troubleshooting and data collection |
| 164 | + |
| 165 | +For persistent or complex issues, collect the following data: |
| 166 | + |
| 167 | +**Cluster logs** |
| 168 | + |
| 169 | +```powershell |
| 170 | + |
| 171 | + Get-ClusterLog -TimeSpan 1440 -Destination <Path>  # -TimeSpan is in minutes (1440 = last 24 hours) |
| 172 | + |
| 173 | +``` |
| 174 | + |
| 175 | +**System and application event logs** |
| 176 | + |
| 177 | +```powershell |
| 178 | + |
| 179 | + Get-WinEvent -LogName System -MaxEvents 1000 | Export-Csv <Path>\SystemLogs.csv |
| 180 | + Get-WinEvent -LogName Application -MaxEvents 1000 | Export-Csv <Path>\AppLogs.csv |
| 181 | + |
| 182 | +``` |
| 183 | + |
| 184 | +**Resource and node status** |
| 185 | + |
| 186 | +```powershell |
| 187 | + |
| 188 | + |
| 189 | + Get-ClusterNode |
| 190 | + Get-ClusterResource |
| 191 | + Get-ClusterGroup |
| 192 | + Test-Cluster |
| 193 | + |
| 194 | +``` |
| 195 | + |
| 196 | +**Network and driver information** |
| 197 | + |
| 198 | +```powershell |
| 199 | + |
| 200 | + Get-NetAdapter -IncludeHidden | Export-Csv <Path>\NetAdapters.csv |
| 201 | + |
| 202 | +``` |
| 203 | + |
| 204 | +**Patch or update history** |
| 205 | + |
| 206 | +```powershell |
| 207 | + |
| 208 | + Get-HotFix | Export-Csv <Path>\Hotfix.csv |
| 209 | + |
| 210 | +``` |
| 211 | + |
| 212 | +## References |
| 213 | + |
| 214 | +- [Upgrade a Windows Server failover cluster with a cluster OS rolling upgrade](/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade) |
| 215 | +- [Update-ClusterFunctionalLevel](/powershell/module/failoverclusters/update-clusterfunctionallevel) |
| 216 | +- [Known issues - KB5062557](https://support.microsoft.com/help/5062557) |