support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md
This article provides guidance on troubleshooting, analysis, and resolution of unexpected node reboots in SUSE Pacemaker clusters on Azure.
* Additionally, services dependent on network connectivity, such as `waagent`, generate communication-related error messages in the logs, further indicating network-related disruptions.
The following messages can be observed in `/var/log/messages`:
From `node 01`:
```output
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 has no active links
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
```

#### Cause

It's noted that the unexpected node reboot is observed due to Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
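The timestamp correlation can be sketched as follows. This minimal example pulls the token-loss timestamps out of a copy of the sample log lines shown above (on a real node, point the `grep` at `/var/log/messages` instead), so they can be compared against the maintenance window shown in the Azure maintenance notification:

```shell
# Sample log lines from this article, written to a temp file for illustration;
# on a cluster node, run the grep against /var/log/messages instead.
cat <<'EOF' > /tmp/messages_sample
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
EOF

# Print only the timestamps of token loss, for comparison with the
# maintenance window in the Azure Portal notification.
grep "Token has not been received" /tmp/messages_sample | awk '{print $1, $2, $3}'

# On the VM itself, currently scheduled platform events can also be listed
# directly (the metadata endpoint is only reachable from inside Azure):
# curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
```

If a `Freeze` or `Reboot` event in the Scheduled Events response overlaps the extracted timestamps, platform maintenance is the likely cause.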
#### Resolution
If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that the cluster was impacted by either platform or network maintenance. For further assistance or additional queries, you can open a support request by following these [instructions](#next-steps).
### Scenario 2: Cluster Misconfiguration
The cluster nodes experience unexpected failovers or reboots, often due to cluster misconfiguration affecting the stability of Pacemaker clusters.
The cluster configuration can be reviewed by running the following command:
```bash
sudo crm configure show
```
Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconfiguration:
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
- Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures.
2. Missing/Incorrect resource constraints:
Poorly set constraints can cause resources to be redistributed unnecessarily, leading to node overload and reboots. Misaligned resource dependency configurations can cause nodes to go into a fail/reboot loop.
3. Cluster threshold and timeout misconfigurations:
- Incorrect `failure-timeout`, `migration-threshold`, or `monitor-timeout` values may result in nodes being prematurely rebooted.
- Heartbeat timeout settings: Incorrect `corosync` timeout settings for heartbeat intervals can cause each node to assume the other is offline, triggering unnecessary reboots.
4. Lack of proper health checks:
Not setting proper health-check intervals for critical services like SAP HANA (High-performance ANalytic Appliance) can cause resource or node failures.
5. Resource agent misconfiguration:
- Custom resource agents misaligned with the cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node reboots.
- Wrong resource start/stop parameters: Improperly tuned start/stop parameters in the cluster configuration may lead to nodes rebooting during resource recovery.
6. Corosync configuration issues:
- Unoptimized network settings: Incorrect multicast/unicast configuration can lead to heartbeat communication failures. Mismatched `ring0` and `ring1` network configurations cause split-brain scenarios and node fencing.
- Token timeout mismatches: Token timeout values not aligned with the environment’s latency can trigger node isolation and reboots.
The following command can be used to review the Corosync configuration:
```bash
sudo cat /etc/corosync/corosync.conf
```
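As a sketch of what to look for, the token-related values can be extracted for review. The configuration below is an assumed sample `totem` section, not output from a real cluster; on a node, run the same `awk` filter against `/etc/corosync/corosync.conf`:

```shell
# Assumed sample totem section; replace with the real file on a node.
cat <<'EOF' > /tmp/corosync_sample.conf
totem {
    version: 2
    token: 30000
    consensus: 36000
    token_retransmits_before_loss_const: 10
}
EOF

# List the token and consensus timeouts (in ms) so they can be compared with
# the "Token has not been received in 30000 ms" messages seen in the logs.
awk -F': ' '/^[[:space:]]*(token|consensus):/ { gsub(/^[[:space:]]+/, ""); print $1 "=" $2 }' /tmp/corosync_sample.conf
```

A `token` value that is too low for the environment's network latency is one of the mismatches described above.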
#### Resolution
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
- Steps to make necessary changes to the cluster configuration:
1. Stop the application on both nodes.
### Scenario 3: Migration from On-premises to Azure
When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies. Below are common mistakes in this category:
#### Cause
1. Incomplete or incorrect STONITH configuration:
- No STONITH or misconfigured fencing: Not configuring STONITH (Shoot-The-Other-Node-In-The-Head) properly can lead to nodes being marked as unhealthy, triggering unnecessary reboots.
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
- Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM names, must be correctly configured in the fencing agent. Omissions here can result in fencing failures and unexpected reboots.
For more information, see [Troubleshoot Azure Fence Agent startup issues in SUSE](troubleshoot-azure-fence-agent-startup-suse.md) and [Troubleshoot SBD service failure in SUSE Pacemaker clusters](troubleshoot-sbd-issues-sles.md).
2. Network misconfigurations:
Misconfigured VNets, subnets, or security group rules can block essential cluster communication, leading to perceived node failures and reboots.
For more information, see [Virtual networks and virtual machines in Azure](/azure/virtual-machines/linux/network-overview)
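A quick TCP reachability sketch can help confirm NSG rules. The port list here is an assumption based on common SUSE HA components (`pacemaker_remote` on 3121, the Hawk web console on 7630); verify it for your deployment, and note that corosync itself uses UDP ports 5404/5405, which a TCP probe cannot cover and which must be checked via NSG/firewall rule inspection:

```shell
# TARGET is a placeholder; on a cluster node, set it to the peer node's IP.
TARGET=127.0.0.1

for port in 3121 7630; do
  # /dev/tcp probe with a 1-second timeout; prints one status line per port.
  if timeout 1 bash -c "echo > /dev/tcp/$TARGET/$port" 2>/dev/null; then
    echo "port $port: reachable"
  else
    echo "port $port: blocked or closed"
  fi
done | tee /tmp/port_check.txt
```

A "blocked or closed" result for a port the cluster needs points at an NSG or firewall rule to fix.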
3. Metadata Service issues:
Azure's cloud metadata services must be correctly handled; otherwise, resource detection or startup processes can fail.
For more information, see [Azure Instance Metadata Service](/azure/virtual-machines/instance-metadata-service) and [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
4. Performance and latency mismatches:
- Inadequate VM sizing: Migrated workloads may not align with the selected Azure VM (virtual machine) size, causing resource overutilization and triggering reboots.
- Disk I/O mismatches: On-premises workloads with high IOPS (input/output operations per second) demands must be paired with the appropriate Azure disk or storage performance tier.
For more information, see [Collect performance metrics for a Linux VM](collect-performance-metrics-from-a-linux-system.md)
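As a sketch, high-latency devices can be flagged from `iostat` output. The table below is an assumed, abbreviated sample (real `iostat -dx` output has more columns, so the field number must be adjusted); on a VM, pipe `iostat -dx 1 3` from the sysstat package into a similar filter:

```shell
# Assumed sample of abbreviated `iostat -dx` output, for illustration only.
cat <<'EOF' > /tmp/iostat_sample.txt
Device            r/s     w/s     rkB/s     wkB/s   await  %util
sda              10.0    50.0     512.0    2048.0    4.20   35.0
sdb              80.0   300.0    4096.0   16384.0   45.70   98.0
EOF

# Flag devices with average I/O latency (await, column 6 in this sample)
# above 20 ms; the threshold is an assumption — tune it to the latency
# target of the chosen Azure disk tier.
awk 'NR > 1 && $6 > 20 { print $1, "high latency:", $6, "ms" }' /tmp/iostat_sample.txt
```

Sustained high `await` on a migrated workload suggests the disk tier doesn't match the on-premises IOPS profile.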
5. Security and firewall rules:
- Port blockages: On-premises clusters often have open, internal communication, while Azure NSGs (network security groups) or firewalls may block ports required for Pacemaker/Corosync communication.

For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).
#### Resolution
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
### Scenario 4: `HANA_CALL` timeout after 60 seconds
The Azure SUSE Pacemaker cluster is running SAP HANA as the application and experiences unexpected node reboots. The following messages can be observed:

```output
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```
#### Cause
The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.
#### Resolution
The Azure SUSE Pacemaker cluster is running SAP NetWeaver ASCS/ERS as the application and experiences unexpected node reboots. The following messages can be observed:

```output
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
173
178
```
#### Cause
The `ASCS/ERS` resource is considered the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
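The monitor timeout in question can be compared with actual operation runtimes. The excerpt below is an assumed sample of `crm configure show` output (the operation values are illustrative, though the resource name matches the log above); on a node, run `sudo crm configure show` and apply the same filter:

```shell
# Assumed sample excerpt of `crm configure show` for the ASCS resource.
cat <<'EOF' > /tmp/crm_show_sample.txt
primitive RSC_SAP_ASCS00 SAPInstance \
        op monitor interval=11 timeout=60 on-fail=restart
EOF

# Extract the monitor timeout (seconds) so it can be compared with how long
# the SAP monitor operation actually takes under load.
awk 'match($0, /timeout=[0-9]+/) { print substr($0, RSTART + 8, RLENGTH - 8) }' /tmp/crm_show_sample.txt
```

If the monitored operation routinely runs close to this value, the timeout is a candidate for tuning before concluding the application itself is at fault.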