Commit 132c87f

Add scenario-specific headings for causes and resolutions.

1 parent 33ae69a commit 132c87f

1 file changed: 12 additions & 11 deletions

support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md
@@ -46,10 +46,10 @@ Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 has no active
 Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
 ```

-### Cause
+### Cause for scenario 1
 The unexpected node reboot is noted as a result of a Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).

-### Resolution
+### Resolution for scenario 1
 If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance impacted the cluster.

 For further assistance or other queries, you can open a support request by following these [instructions](#next-steps).
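The timestamp correlation described for scenario 1 can be partly scripted. The sketch below extracts the syslog timestamp from the token-loss message shown in the log excerpt; it is illustrative only, and the live-system path `/var/log/messages` is an assumption that varies by syslog configuration.

```shell
# Illustrative sketch: pull the timestamp of the corosync token-loss message
# so it can be compared against the Azure Maintenance Notification window.
# The sample line is the one from the log excerpt above.
log_line='Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms'

# The first three whitespace-separated fields form the syslog timestamp.
token_loss_ts=$(echo "$log_line" | awk '{print $1, $2, $3}')
echo "Token loss observed at: $token_loss_ts"

# On a live node (path is an assumption; check your syslog configuration):
#   grep 'Token has not been received' /var/log/messages
```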
@@ -62,7 +62,7 @@ The cluster configuration can be reviewed by running the following command:
 sudo crm configure show
 ```

-## Cause
+### Cause for scenario 2
 Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconfigurations:

 1. Incorrect STONITH configuration:
@@ -92,7 +92,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
9292
sudo cat /etc/corosync/corosync.conf
9393
```
9494

95-
### Resolution
95+
### Resolution for scenario 2
9696
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
9797
- Steps to make necessary changes to the cluster configuration:
9898
1. Stop the application on both the nodes.
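One corosync setting worth reviewing when checking scenario 2's misconfigurations is the token timeout, which should match the 30000 ms TOTEM timeout seen in the scenario 1 log excerpt. The sketch below parses it from a hypothetical sample configuration; the values shown are illustrative, not a recommendation.

```shell
# Illustrative sketch: read the corosync token timeout. The sample
# configuration below is hypothetical; on a live node the file is
# /etc/corosync/corosync.conf.
conf=$(cat <<'EOF'
totem {
    version: 2
    token: 30000
    consensus: 36000
}
EOF
)

token_ms=$(printf '%s\n' "$conf" | awk '/token:/ {print $2; exit}')
echo "corosync token timeout: ${token_ms} ms"

# On a live node:
#   sudo awk '/token:/ {print $2; exit}' /etc/corosync/corosync.conf
```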
@@ -112,7 +112,7 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
 ## Scenario 3: Migration from On-premises to Azure
 When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies.

-### Cause
+### Cause for scenario 3
 The following are common mistakes in this category:

 1. Incomplete or incorrect STONITH configuration:
@@ -144,8 +144,9 @@ The following are common mistakes in this category:
 For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test)


-### Resolution
-- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
+### Resolution for scenario 3
+
+Follow the guidelines outlined to set up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.

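One class of overlooked dependency after an on-premises migration is name resolution between the cluster nodes (stale DNS or `/etc/hosts` entries) — an assumption here, not a cause stated in the article. The helper below is hypothetical; it is demonstrated against `localhost` so it runs anywhere, and `node01`/`node02` are placeholder names.

```shell
# Hypothetical helper: verify that a cluster node name still resolves after
# migration. Demonstrated with localhost; node01/node02 are placeholders.
check_resolves() {
    if getent hosts "$1" > /dev/null; then
        echo "$1 resolves"
    else
        echo "$1 does NOT resolve"
    fi
}

check_resolves localhost

# On a real cluster:
#   check_resolves node01
#   check_resolves node02
```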
## Scenario 4: `HANA_CALL` timeout after 60 seconds
@@ -158,10 +159,10 @@ The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experien
 2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
 ```

-### Cause
+### Cause for scenario 4
 The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.

-### Resolution
+### Resolution for scenario 4
 - To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
 - Particular attention should be given to memory pressure and storage devices, their configuration, especially if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
 - Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.
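The OS-performance review suggested in the resolution usually starts with memory pressure. Below is a minimal sketch, assuming a Linux kernel that reports `MemAvailable` in `/proc/meminfo`; the 10% threshold is an arbitrary illustration, not an official limit.

```shell
# Minimal memory-pressure check (illustrative; the 10% threshold is an
# arbitrary example, not an official limit).
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
avail_pct=$(( mem_avail_kb * 100 / mem_total_kb ))

echo "MemAvailable: ${avail_pct}% of MemTotal"
if [ "$avail_pct" -lt 10 ]; then
    echo "WARNING: low available memory; investigate memory pressure"
fi
```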
@@ -182,10 +183,10 @@ The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as applicatio
 2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov  9 07:39:42 2024
 ```

-### Cause
+### Cause for scenario 5
 The `ASCS/ERS` resource is considered the application for SAP Netweaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.

-### Resolution
+### Resolution for scenario 5
 - To identify the root cause of the issue, it's essential to review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
 - Particular attention should be given to memory pressure and storage devices, their configuration, especially if SAP Netweaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
 - Once external factors, such as platform or network outages, are ruled out, engaging the application vendor for trace call analysis and log review is recommended.
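When investigating scenario 5, it helps to identify which cluster resource triggered the failover. The sketch below pulls the resource name out of the scheduler warning shown in the excerpt; it is illustrative, and the pacemaker log location on a live system varies by distribution.

```shell
# Illustrative sketch: extract the failing resource name from the
# pacemaker-schedulerd warning shown in the excerpt above.
log_line='2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov  9 07:39:42 2024'

resource=$(printf '%s\n' "$log_line" | sed -n 's/.*for start of \([A-Za-z0-9_]*\) on.*/\1/p')
echo "Failing resource: $resource"
```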
