| 1 | +--- |
| 2 | +title: Troubleshoot Rolling Upgrade Issues |
| 3 | +description: Discusses how to troubleshoot rolling upgrade issues. |
| 4 | +ms.date: 12/05/2025 |
| 5 | +manager: dcscontentpm |
| 6 | +audience: itpro |
| 7 | +ms.topic: troubleshooting |
| 8 | +ms.author: jeffhugh |
| 9 | +ms.reviewer: kaushika, v-ryanberg, v-gsitser |
| 10 | +ms.custom: |
| 11 | +- sap:rolling upgrade and high availability\rolling upgrade issues |
| 12 | +- pcy:WinComm Storage High Avail |
| 13 | +- appliesto: |
| 14 | + - <a href=https://learn.microsoft.com/windows/release-health/windows-server-release-info target=_blank>Supported versions of Windows Server</a> |
| 15 | +--- |
| 16 | +# Troubleshoot rolling upgrade issues |
| 17 | + |
| 18 | +## Summary |
| 19 | + |
| 20 | +This article provides a structured troubleshooting method to resolve common issues that you might encounter during rolling upgrades in Windows Server Failover Clustering (WSFC), Storage Spaces Direct, SQL Server Always On availability groups, and Hyper-V. |
| 21 | + |
| 22 | +Rolling upgrades let you maintain and upgrade systems with minimal downtime. However, compatibility problems and configuration errors can affect availability and potentially cause data loss. |
| 23 | + |
| 24 | +## Prerequisites |
| 25 | + |
| 26 | +Before you start a rolling upgrade: |
| 27 | + |
| 28 | +- Verify that the rolling upgrade feature is supported for your workload and operating system (OS) versions. |
| 29 | +- Verify that all cluster nodes are healthy by using the `Get-ClusterNode` PowerShell command. |
| 30 | +- Make sure that you have up-to-date backups, including: |
| 31 | + - System state |
| 32 | + - Cluster configuration |
| 33 | + - User data |
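
The node health check in the list above can be scripted. The following is a minimal sketch, not a definitive implementation. `Get-UnhealthyClusterNode` is a hypothetical helper name, and the sketch assumes that each node object exposes `Name` and `State` properties, as the objects that `Get-ClusterNode` returns are expected to.

```powershell
# Hypothetical helper: returns every node whose state isn't 'Up'.
# Assumes each input object has Name and State properties.
function Get-UnhealthyClusterNode {
    param(
        [Parameter(Mandatory)]
        [object[]] $Node
    )
    # Compare as strings so that enum values and plain strings both work.
    $Node | Where-Object { "$($_.State)" -ne 'Up' }
}

# Usage on a cluster node (requires the FailoverClusters module):
# Get-UnhealthyClusterNode -Node (Get-ClusterNode) | Format-Table Name, State
```

An empty result suggests that every node reports `Up`. Any node that is listed should be investigated before you start the upgrade.
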
| 34 | + |
| 35 | +## Potential workarounds |
| 36 | + |
| 37 | +### Address rolling upgrade failures |
| 38 | + |
| 39 | +1. Move core resources to another node by using Failover Cluster Manager or the `Move-ClusterGroup` PowerShell command. |
| 40 | +2. Migrate roles and resources off the node by using `Suspend-ClusterNode -Drain`. |
| 41 | +3. Check cluster logs for dependencies or errors that might block the operation. |
| 42 | + |
| 43 | +## Troubleshooting checklist |
| 44 | + |
| 45 | +1. **Review prerequisites**: Make sure that the environment meets all prerequisites that are mentioned in this article. |
| 46 | + |
| 47 | +2. **Validate cluster status**: Run `Test-Cluster`, and resolve any validation warnings or errors that it reports. |
| 48 | + - Verify the current cluster functional level by using `Get-Cluster | Select ClusterFunctionalLevel`. |
| 49 | + - Validate network connectivity among all nodes. |
| 50 | + |
| 51 | +3. **Plan and sequence upgrades**: Document the sequence of node upgrades (one node at a time). |
| 52 | + - Move cluster roles (such as virtual machines (VMs), availability groups, or file shares) off the node that's being upgraded. |
| 53 | + - Update all nodes to the latest supported updates or hotfixes for the current OS. |
| 54 | + |
| 55 | +4. **Communicate with stakeholders**: Inform stakeholders and schedule maintenance windows. |
| 56 | + - Notify monitoring teams in order to avoid unnecessary alerts. |
| 57 | + |
| 58 | +5. **Ensure application awareness**: Verify application compatibility for workloads such as SQL Server, Hyper-V, or file services. |
| 59 | + - Inform application owners about planned upgrades. |
| 60 | + |
| 61 | +6. **Conduct pre-upgrade tests**: Review logs for Windows, applications, clusters, and storage to identify any pre-existing issues. |
| 62 | + |
| 63 | +## Common issues and their respective solutions |
| 64 | + |
| 65 | +### 1. Rolling upgrade doesn't start or node can't be evicted |
| 66 | + |
| 67 | +**Symptoms** |
| 68 | + |
| 69 | +You can't pause, drain, or remove a node from the cluster. You receive error messages such as the following example: |
| 70 | + |
| 71 | +> Node... cannot be removed from the cluster. |
| 72 | + |
| 73 | +**Cause** |
| 74 | + |
| 75 | +The node hosts core cluster resources, dependencies are misconfigured, or the cluster is unstable. |
| 76 | + |
| 77 | +**Solution** |
| 78 | + |
| 79 | +1. Move core resources to another node by using Failover Cluster Manager or `Move-ClusterGroup`. |
| 80 | +2. Move roles and resources off the node by running `Suspend-ClusterNode -Drain`. |
| 81 | +3. Make sure that the node isn't the last up-to-date or quorum node. |
| 82 | +4. Check cluster logs for blocking dependencies. |
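
The drain step above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive procedure: `Invoke-NodeDrain` is a hypothetical wrapper, `Node1` is a placeholder node name, and the sketch assumes that the FailoverClusters module is available and that each cluster group exposes an `OwnerNode` object that has a `Name` property.

```powershell
# Hypothetical wrapper around the drain step.
function Invoke-NodeDrain {
    param(
        [Parameter(Mandatory)]
        [string] $NodeName
    )
    # Pause the node and move its roles elsewhere; -Wait blocks until the drain finishes.
    Suspend-ClusterNode -Name $NodeName -Drain -Wait | Out-Null

    # Return any groups that the node still owns (assumes OwnerNode has a Name property).
    Get-ClusterGroup | Where-Object { $_.OwnerNode.Name -eq $NodeName }
}

# Usage ('Node1' is a placeholder):
# Invoke-NodeDrain -NodeName 'Node1'
```

If the function returns any groups, those roles didn't move. Check them for blocking dependencies before you try to evict the node.
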
| 83 | + |
| 84 | +### 2. Can't restore upgraded node to cluster |
| 85 | + |
| 86 | +**Symptoms** |
| 87 | + |
| 88 | +You receive a version mismatch message or error messages such as the following example: |
| 89 | + |
| 90 | +> A node attempted to join a failover cluster but failed due to incompatibility. |
| 91 | + |
| 92 | +**Cause** |
| 93 | + |
| 94 | +The cluster contains an unsupported mix of OS versions, or the node isn't fully updated. |
| 95 | + |
| 96 | +**Solution** |
| 97 | + |
| 98 | +1. Verify the supported OS and cluster version matrix. |
| 99 | +2. Update the node to the latest cumulative update (CU). |
| 100 | +3. Upgrade the OS versions sequentially (for example, 2016 → 2019 → 2022). |
| 101 | +4. Identify versioning errors by using `Get-ClusterLog`. |
| 102 | + |
| 103 | +### 3. Resource or service doesn't come online |
| 104 | + |
| 105 | +**Symptoms** |
| 106 | + |
| 107 | +Resources such as VMs or file shares enter a failed or offline state post-upgrade. Common Event IDs include `1069`, `1146`, and `1230`. |
| 108 | + |
| 109 | +**Cause** |
| 110 | + |
| 111 | +Misconfiguration during upgrade, missing registry keys or files, or service account failures. |
| 112 | + |
| 113 | +**Solution** |
| 114 | + |
| 115 | +1. Check cluster events in Failover Cluster Manager. |
| 116 | +2. Verify resource owner configurations by running `Get-ClusterResource | Get-ClusterOwnerNode`. |
| 117 | +3. Repair or re-create missing dependencies. |
| 118 | +4. Restart cluster services by running `Restart-Service ClusSvc`. |
| 119 | + |
| 120 | +### 4. Quorum or communication loss |
| 121 | + |
| 122 | +**Symptoms** |
| 123 | + |
| 124 | +The cluster goes offline, nodes enter quarantine, or Event IDs `1135` and `1136` appear. |
| 125 | + |
| 126 | +**Cause** |
| 127 | + |
| 128 | +Network partition, firewall configuration, or quorum misconfiguration. |
| 129 | + |
| 130 | +**Solution** |
| 131 | + |
| 132 | +1. Make sure that all required ports are open. |
| 133 | +2. Check network, DNS, and routing configurations. |
| 134 | +3. Check quorum settings by running `Get-ClusterQuorum`. Update settings as appropriate. |
| 135 | +4. To identify root causes, run `Test-Cluster`. |
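
To confirm when the Event IDs from the symptoms were logged, you can filter the System log. The following is a minimal sketch: `New-ClusterEventFilter` is a hypothetical helper, and the provider name `Microsoft-Windows-FailoverClustering` is an assumption that you should verify in your environment.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for cluster events in the System log.
function New-ClusterEventFilter {
    param(
        [int[]] $Id = @(1135, 1136),
        [int] $Hours = 24
    )
    @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'  # assumed provider name
        Id           = $Id
        StartTime    = (Get-Date).AddHours(-$Hours)            # look back this many hours
    }
}

# Usage on a cluster node:
# Get-WinEvent -FilterHashtable (New-ClusterEventFilter) | Select-Object TimeCreated, Id, Message
```

Comparing the event timestamps across nodes can help you tell a network partition from a quorum problem.
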
| 136 | + |
| 137 | +### 5. Update failure or known bug |
| 138 | + |
| 139 | +**Symptoms** |
| 140 | + |
| 141 | +Cluster services stop responding after an update, or resources fail because of a known problematic update. |
| 142 | + |
| 143 | +**Cause** |
| 144 | + |
| 145 | +Cluster instability occurred after a Microsoft update installation. |
| 146 | + |
| 147 | +**Solution** |
| 148 | + |
| 149 | +1. Review Microsoft Knowledge Base (KB) articles for known issues. |
| 150 | +2. Remove problematic updates, if it's necessary. |
| 151 | +3. Apply recommended hotfixes or wait for new updates. |
| 152 | +4. Open a support case if the issue remains unresolved. |
| 153 | + |
| 154 | +### 6. Cluster validation or functional level errors |
| 155 | + |
| 156 | +**Symptoms** |
| 157 | + |
| 158 | +You can't update the cluster functional level, or cluster validation fails. |
| 159 | + |
| 160 | +**Cause** |
| 161 | + |
| 162 | +Mixed OS versions, incomplete upgrades, or outdated drivers. |
| 163 | + |
| 164 | +**Solution** |
| 165 | + |
| 166 | +1. Update all nodes, and make sure that they're joined to the cluster. |
| 167 | +2. Update hardware drivers (such as network and storage) and firmware. |
| 168 | +3. Complete the upgrade by using `Update-ClusterFunctionalLevel`. |
| 169 | +4. Review logs for driver or validation failures. |
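
Because raising the cluster functional level can't be reversed, it can help to check node state and OS build first. The following is a minimal sketch under stated assumptions: `Test-ClusterReadyForLevelUpdate` is a hypothetical helper, and it assumes that each node object exposes `State` and `BuildNumber` properties.

```powershell
# Hypothetical helper: $true only when every node is 'Up' and reports the same OS build.
# Assumes State and BuildNumber properties on each node object.
function Test-ClusterReadyForLevelUpdate {
    param(
        [Parameter(Mandatory)]
        [object[]] $Node
    )
    $notUp  = @($Node | Where-Object { "$($_.State)" -ne 'Up' })
    $builds = @($Node | Select-Object -ExpandProperty BuildNumber -Unique)
    ($notUp.Count -eq 0) -and ($builds.Count -eq 1)
}

# Usage on a cluster node:
# if (Test-ClusterReadyForLevelUpdate -Node (Get-ClusterNode)) {
#     Update-ClusterFunctionalLevel -WhatIf   # preview only; remove -WhatIf to apply
# }
```

A `$false` result points to a node that's down or paused, or to mixed OS builds, which matches the causes that are listed for this issue.
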
| 170 | + |
| 171 | +## Advanced troubleshooting and data collection |
| 172 | + |
| 173 | +For persistent or complex issues, collect the following data. |
| 174 | + |
| 175 | +**Cluster logs** |
| 176 | + |
| 177 | +```powershell |
| 178 | + |
| 179 | +Get-ClusterLog -TimeSpan 1440 -Destination <Path>  # -TimeSpan is in minutes (1440 = 24 hours) |
| 180 | + |
| 181 | +``` |
| 182 | + |
| 183 | +**System and application event logs** |
| 184 | + |
| 185 | +```powershell |
| 186 | + |
| 187 | + Get-WinEvent -LogName System -MaxEvents 1000 | Export-Csv <Path>\SystemLogs.csv |
| 188 | + Get-WinEvent -LogName Application -MaxEvents 1000 | Export-Csv <Path>\AppLogs.csv |
| 189 | + |
| 190 | +``` |
| 191 | + |
| 192 | +**Resource and node status** |
| 193 | + |
| 194 | +```powershell |
| 195 | + |
| 196 | + |
| 197 | + Get-ClusterNode |
| 198 | + Get-ClusterResource |
| 199 | + Get-ClusterGroup |
| 200 | + Test-Cluster |
| 201 | + |
| 202 | +``` |
| 203 | + |
| 204 | +**Network and driver information** |
| 205 | + |
| 206 | +```powershell |
| 207 | + |
| 208 | + Get-NetAdapter -IncludeHidden | Export-Csv <Path>\NetAdapters.csv |
| 209 | + |
| 210 | +``` |
| 211 | + |
| 212 | +**Update history** |
| 213 | + |
| 214 | +```powershell |
| 215 | + |
| 216 | +Get-HotFix | Export-Csv <Path>\Hotfix.csv |
| 217 | + |
| 218 | +``` |
| 219 | + |
| 220 | +## References |
| 221 | + |
| 222 | +- [Upgrade a Windows Server failover cluster with a cluster OS rolling upgrade](/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade) |
| 223 | +- [Update-ClusterFunctionalLevel](/powershell/module/failoverclusters/update-clusterfunctionallevel) |
| 224 | +- [Known issues - KB5062557](https://support.microsoft.com/help/5062557) |