Commit 1d62267

Update headings and add custom metadata in troubleshooting guide
1 parent 914b3e9 commit 1d62267

1 file changed

Lines changed: 13 additions & 12 deletions

support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md

Original file line number | Diff line number | Diff line change
@@ -8,9 +8,10 @@ ms.topic: troubleshooting
88
ms.date: 1/13/2025
99
ms.service: azure-virtual-machines
1010
ms.collection: linux
11+
ms.custom: sap:Issue with Pacemaker cluster, and fencing
1112
---
1213

13-
# Troubleshooting Unexpected Node Reboots in Azure Linux SUSE Pacemaker Cluster Nodes
14+
# Troubleshooting Unexpected Node reboots in Azure Linux SUSE Pacemaker Cluster Nodes
1415

1516
**Applies to:** :heavy_check_mark: Linux VMs
1617

@@ -26,7 +27,7 @@ This article provides guidance on troubleshooting, analysis, and resolution of t
2627
- [SBD with an iscsi target server](/azure/sap/workloads/high-availability-guide-suse-pacemaker?tabs=msi#sbd-with-an-iscsi-target-server)
2728
- [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-suse-pacemaker?tabs=msi#sbd-with-an-azure-shared-disk)
2829

29-
### Scenario 1: Network Outage
30+
## Scenario 1: Network Outage
3031
* The cluster nodes are experiencing `corosync` communication errors, resulting in continuous retransmissions due to an inability to establish communication between nodes. This issue triggers application timeouts, ultimately leading to node fencing and subsequent reboots.
3132
* Additionally, services dependent on network connectivity, such as `waagent`, generate communication related error messages in the logs, further indicating network related disruptions.
3233

@@ -48,20 +49,20 @@ Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received
4849
### Cause
4950
The unexpected node reboot is noted as a result of a Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
5051

51-
#### Resolution
52+
### Resolution
5253
If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance impacted the cluster.
5354

5455
For further assistance or other queries, you can open a support request by following these [instructions](#next-steps).
5556

56-
### Scenario 2: Cluster Misconfiguration
57+
## Scenario 2: Cluster Misconfiguration
5758
The cluster nodes experience unexpected failovers or node reboots, often caused by cluster misconfigurations that affect the stability of Pacemaker Clusters.
5859

5960
The cluster configuration can be reviewed by running the following command:
6061
```bash
6162
sudo crm configure show
6263
```
6364

64-
### Cause
65+
## Cause
6566
Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconfigurations:
6667

6768
1. Incorrect STONITH configuration:
@@ -91,7 +92,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
9192
sudo cat /etc/corosync/corosync.conf
9293
```
9394

94-
#### Resolution
95+
### Resolution
9596
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
9697
- Steps to make necessary changes to the cluster configuration:
9798
1. Stop the application on both the nodes.
@@ -108,7 +109,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
108109
```bash
109110
crm configure property maintenance-mode=false
110111
```
111-
### Scenario 3: Migration from On-premises to Azure
112+
## Scenario 3: Migration from On-premises to Azure
112113
When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies.
113114
114115
### Cause
@@ -143,10 +144,10 @@ The following are common mistakes in this category:
143144
For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test)
144145

145146

146-
#### Resolution
147+
### Resolution
147148
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
148149
149-
### Scenario 4: `HANA_CALL` timeout after 60 seconds
150+
## Scenario 4: `HANA_CALL` timeout after 60 seconds
150151
151152
The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experiences unexpected reboot on one of the nodes or both the nodes in the Pacemaker Cluster. Reviewing the `/var/log/messages` or `/var/log/pacemaker.log`, the node reboot is due to `HANA_CALL` time out as shown:
152153
@@ -160,12 +161,12 @@ The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experien
160161
### Cause
161162
The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.
162163
163-
#### Resolution
164+
### Resolution
164165
- To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
165166
- Particular attention should be given to memory pressure and storage devices, their configuration, especially if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
166167
- Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.
167168

168-
### Scenario 5: `ASCS/ERS` timeout in SAP Netweaver Clusters
169+
## Scenario 5: `ASCS/ERS` timeout in SAP Netweaver Clusters
169170

170171
The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as application and experiences unexpected reboot on one of the nodes or both the nodes in the Pacemaker Cluster. Following messages can be observed in `/var/log/messages` :
171172

@@ -184,7 +185,7 @@ The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as applicatio
184185
### Cause
185186
The `ASCS/ERS` resource is considered the application for SAP Netweaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
186187

187-
#### Resolution
188+
### Resolution
188189
- To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
189190
- Particular attention should be given to memory pressure and storage devices, their configuration especially if SAP Netweaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
190191
- Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.
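
The memory-pressure check mentioned in the resolution steps can be started with a quick look at `/proc/meminfo`. The following is a minimal sketch and is not part of this commit or the article: the helper name `mem_available_pct` is hypothetical, and it assumes a kernel that exposes the `MemAvailable` field (Linux 3.14 or later). It prints available memory as a whole-number percentage of total memory.

```bash
# Hypothetical helper (an assumption, not from the article):
# prints MemAvailable as a truncated percentage of MemTotal.
# Reads the file given as the first argument, or /proc/meminfo by default.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }   # value in kB
        /^MemAvailable:/ { avail = $2 }   # value in kB
        END {
            if (total > 0) printf "%d\n", avail * 100 / total
            else exit 1                   # no usable MemTotal line found
        }
    ' "${1:-/proc/meminfo}"
}
```

Running `mem_available_pct` with no argument on a cluster node reads the live values. A persistently low percentage around the time of a resource timeout is only a hint that memory pressure deserves closer review with the performance-collection steps linked above, not a diagnosis.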
