support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md
Lines changed: 13 additions & 12 deletions
@@ -8,9 +8,10 @@ ms.topic: troubleshooting
 ms.date: 1/13/2025
 ms.service: azure-virtual-machines
 ms.collection: linux
+ms.custom: sap:Issue with Pacemaker cluster, and fencing
 ---
 
-# Troubleshooting Unexpected Node Reboots in Azure Linux SUSE Pacemaker Cluster Nodes
+# Troubleshooting Unexpected Node reboots in Azure Linux SUSE Pacemaker Cluster Nodes
 
 **Applies to:**:heavy_check_mark: Linux VMs
 
@@ -26,7 +27,7 @@ This article provides guidance on troubleshooting, analysis, and resolution of t
 -[SBD with an iscsi target server](/azure/sap/workloads/high-availability-guide-suse-pacemaker?tabs=msi#sbd-with-an-iscsi-target-server)
 -[SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-suse-pacemaker?tabs=msi#sbd-with-an-azure-shared-disk)
 
-###Scenario 1: Network Outage
+## Scenario 1: Network Outage
 * The cluster nodes are experiencing `corosync` communication errors, resulting in continuous retransmissions due to an inability to establish communication between nodes. This issue triggers application timeouts, ultimately leading to node fencing and subsequent reboots.
 * Additionally, services dependent on network connectivity, such as `waagent`, generate communication related error messages in the logs, further indicating network related disruptions.
 
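The `corosync` communication errors described in Scenario 1 can be searched for in the system log. The following is a minimal sketch, assuming the messages are written to `/var/log/messages` and carry the `TOTEM` tag, as in the `Token has not been received` log line quoted in this diff. The function name `count_totem_messages` is hypothetical.

```bash
#!/usr/bin/env bash
# Count corosync TOTEM messages in a log file.
# Assumption: the log path defaults to /var/log/messages; pass another path as $1.
count_totem_messages() {
    local log_file="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so '|| true' keeps that case from being treated as an error.
    grep -c 'corosync.*TOTEM' "$log_file" || true
}
```

For example, `sudo bash -c "$(declare -f count_totem_messages); count_totem_messages"` prints the count for the default log path. A high count around the reboot time is consistent with the network symptoms above, but it isn't proof of them.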
@@ -48,20 +49,20 @@ Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received
 ### Cause
 The unexpected node reboot is noted as a result of a Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
 
-####Resolution
+### Resolution
 If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance impacted the cluster.
 
 For further assistance or other queries, you can open a support request by following these [instructions](#next-steps).
 
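To complement the timestamp comparison described in Scenario 1, pending platform maintenance can be listed from inside the VM. The following is a minimal sketch, assuming the VM can reach the Azure Instance Metadata Service at `169.254.169.254` and that `curl` is installed. The function name `get_scheduled_events` and the default `api-version` value are illustrative assumptions, not taken from the article.

```bash
#!/usr/bin/env bash
# Query the Azure Instance Metadata Service (IMDS) for Scheduled Events.
# Assumptions: run from inside the Azure VM; the api-version default is an example.
get_scheduled_events() {
    local api_version="${1:-2020-07-01}"
    # IMDS requires the Metadata header and must not be reached through a proxy.
    curl -s -H "Metadata:true" --noproxy "*" \
        "http://169.254.169.254/metadata/scheduledevents?api-version=${api_version}"
}
```

Calling `get_scheduled_events` returns a JSON document. Scheduled Events reports upcoming events, so it can't confirm a past reboot by itself. The maintenance notifications in the Azure portal remain the reference for matching a reboot that already happened.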
-###Scenario 2: Cluster Misconfiguration
+## Scenario 2: Cluster Misconfiguration
 The cluster nodes experience unexpected failovers or node reboots, often caused by cluster misconfigurations that affect the stability of Pacemaker Clusters.
 
 The cluster configuration can be reviewed by running the following command:
 ```bash
 sudo crm configure show
 ```
 
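When reviewing the configuration output, the fencing property is a common first thing to check. The following is a minimal sketch, assuming the property appears in the output as `stonith-enabled=<value>`. The function name `stonith_enabled_value` is hypothetical.

```bash
#!/usr/bin/env bash
# Read cluster configuration text on stdin and report the stonith-enabled value.
# Assumption: the property appears as "stonith-enabled=<value>" in the input.
stonith_enabled_value() {
    local value
    # Take the first occurrence, keep the part after '=', and drop any quotes.
    value=$(grep -o 'stonith-enabled=[A-Za-z"]*' | head -n 1 | cut -d= -f2 | tr -d '"') || true
    # Print "unset" when the property isn't present in the input.
    printf '%s\n' "${value:-unset}"
}
```

A possible use is `sudo crm configure show | stonith_enabled_value`. The function only reports the value. Whether that value is correct depends on the setup guidance the article links to.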
-###Cause
+## Cause
 Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconfigurations:
 
 1. Incorrect STONITH configuration:
@@ -91,7 +92,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
 sudo cat /etc/corosync/corosync.conf
 ```
 
-#### Resolution
+### Resolution
 - It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
 - Steps to make necessary changes to the cluster configuration:
 1. Stop the application on both the nodes.
@@ -108,7 +109,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
 ```bash
 crm configure property maintenance-mode=false
 ```
-### Scenario 3: Migration from On-premises to Azure
+## Scenario 3: Migration from On-premises to Azure
 When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies.
 
 ### Cause
@@ -143,10 +144,10 @@ The following are common mistakes in this category:
 For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test)
 
 
-#### Resolution
+### Resolution
 - It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
 
-### Scenario 4: `HANA_CALL` timeout after 60 seconds
+## Scenario 4: `HANA_CALL` timeout after 60 seconds
 
 The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experiences unexpected reboot on one of the nodes or both the nodes in the Pacemaker Cluster. Reviewing the `/var/log/messages` or `/var/log/pacemaker.log`, the node reboot is due to `HANA_CALL` time out as shown:
 
@@ -160,12 +161,12 @@ The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experien
 ### Cause
 The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.
 
-#### Resolution
+### Resolution
 - To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
 - Particular attention should be given to memory pressure and storage devices, their configuration, especially if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
 - Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.
 
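As a quick first look at the memory pressure mentioned in the resolution steps, the share of memory still available can be read from `/proc/meminfo`. The following is a minimal sketch, assuming the standard `MemTotal` and `MemAvailable` fields are present. The function name `mem_available_percent` is hypothetical, and the article doesn't define a threshold, so none is applied here.

```bash
#!/usr/bin/env bash
# Print the percentage of memory still available, from meminfo-formatted text.
# Assumption: the input has MemTotal and MemAvailable lines in kB, as /proc/meminfo does.
mem_available_percent() {
    local meminfo_file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of dividing by zero when MemTotal is missing.
            if (total > 0) printf "%d\n", (avail * 100) / total
            else exit 1
        }
    ' "$meminfo_file"
}
```

Running `mem_available_percent` with no argument reads the live `/proc/meminfo`. This is a single snapshot. The linked OS performance article covers collecting metrics over time, which is what a root cause analysis needs.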
-### Scenario 5: `ASCS/ERS` timeout in SAP Netweaver Clusters
+## Scenario 5: `ASCS/ERS` timeout in SAP Netweaver Clusters
 
 The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as application and experiences unexpected reboot on one of the nodes or both the nodes in the Pacemaker Cluster. Following messages can be observed in`/var/log/messages`:
 
@@ -184,7 +185,7 @@ The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as applicatio
 ### Cause
 The `ASCS/ERS` resource is considered the application for SAP Netweaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
 
-#### Resolution
+### Resolution
 - To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
 - Particular attention should be given to memory pressure and storage devices, their configuration especially if SAP Netweaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
 - Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.