Commit 2088d12

Fix formatting and update content in troubleshooting guide.
1 parent 3526704 commit 2088d12

1 file changed: support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-rhel.md
Lines changed: 58 additions & 36 deletions
**Applies to:** :heavy_check_mark: Linux VMs

This article provides guidance for troubleshooting, analysis, and resolution of the most common scenarios for unexpected node restarts in Red Hat Enterprise Linux (RHEL) Pacemaker clusters.

## Prerequisites
- Make sure that the Pacemaker Cluster setup is correctly configured by following the guidelines that are provided in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](/azure/sap/workloads/high-availability-guide-rhel-pacemaker).
- For a Microsoft Azure Pacemaker cluster that uses the Azure Fence Agent as the STONITH (Shoot-The-Other-Node-In-The-Head) device, refer to the documentation [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
- For a Microsoft Azure Pacemaker cluster that uses SBD (STONITH Block Device) storage protection as the STONITH device, choose one of the following setup options (see the articles for detailed information):
  - [SBD with an iSCSI target server](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-iscsi-target-server)
  - [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-azure-shared-disk)
## Scenario 1: Network outage
- The cluster nodes experience `corosync` communication errors. These errors cause continuous retransmissions because the nodes can't communicate with each other, which triggers application timeouts and ultimately causes node fencing and subsequent restarts.
- Additionally, services that depend on network connectivity, such as `waagent`, generate communication-related error messages in the logs. These messages further indicate network-related disruptions.

The following messages are logged in the `/var/log/messages` log:

```output
Aug 21 01:47:31 node02 corosync[15241]: [TOTEM ] Token has not been received
```
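
To look for these Totem token timeouts yourself, you can search the logs around the restart window. The following is a minimal sketch; the log path and time window are examples, and it assumes a standard RHEL setup in which corosync logs to the journal and `/var/log/messages`:

```bash
# Search for corosync Totem token loss and retransmit messages
sudo grep -E "TOTEM.*(Token has not been received|Retransmit)" /var/log/messages

# Alternatively, query the journal for the corosync unit around the incident window
# (the timestamps are examples; adjust them to your restart window)
sudo journalctl -u corosync --since "2024-08-21 01:30" --until "2024-08-21 02:00"
```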

### Cause for scenario 1

An unexpected node restart occurs because of a network maintenance activity or an outage. For confirmation, you can match the timestamp of the restart against the [Azure Maintenance Notifications](/azure/virtual-machines/linux/maintenance-notifications) in the Azure portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
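
You can also query Azure Scheduled Events directly from inside a VM through the Instance Metadata Service. A quick check using the documented endpoint (the API version shown is one of the documented versions):

```bash
# Query upcoming platform maintenance events from the Azure Instance Metadata Service
curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
```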

### Resolution for scenario 1

If the unexpected restart timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance affected the cluster.

For further assistance or other queries, you can open a support request by following [these instructions](#next-steps).

## Scenario 2: Cluster misconfiguration

The cluster nodes experience unexpected failovers or node restarts. These events are often caused by cluster misconfigurations that affect the stability of Pacemaker clusters.

To review the cluster configuration, run the following command:

```bash
sudo pcs config
```

### Cause for scenario 2

Unexpected restarts in an Azure RHEL Pacemaker cluster often occur because of misconfigurations:

- Incorrect STONITH configuration:
- Wrong resource start/stop parameters: Incorrectly tuned start and stop parameters in the cluster configuration may cause nodes to restart during resource recovery, as shown in the sketch after this list.

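For example, a start or stop timeout that's too short for the workload can cause recovery loops. The following is a minimal sketch of raising an operation timeout; the resource name and timeout value are placeholders, not recommendations:

```bash
# Example only: raise the start operation timeout of a resource
# (replace rsc_example and the timeout value with values that fit your workload)
sudo pcs resource update rsc_example op start timeout=600
```
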
### Resolution for scenario 2

- Follow the proper guidelines to set up a [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
- To make the necessary changes to the cluster configuration, follow these steps:

  1. Stop the application on both nodes.
  2. Put the cluster into maintenance mode:

      ```bash
      sudo pcs property set maintenance-mode=true
      ```
  3. Edit the cluster configuration:

      ```bash
      sudo pcs cluster edit
      ```
> To mitigate such risks, it's recommended to disable security tools on systems running a Pacemaker cluster or ensure that appropriate exclusions are configured to prevent conflicts with the cluster and its associated applications.
## Scenario 3: Migration from on-premises to Azure

When you migrate a RHEL Pacemaker cluster from on-premises to Azure, unexpected restarts can occur because of specific misconfigurations or overlooked dependencies.

### Cause for scenario 3

The following are common mistakes in this category:

- Incomplete or incorrect STONITH configuration:
For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).
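
To verify that cluster communication is healthy after the migration, you can check the corosync link status and membership on each node. This quick check assumes the standard corosync and pcs tooling that ships with RHEL:

```bash
# Show the status of the local corosync links (rings)
sudo corosync-cfgtool -s

# Show cluster membership as corosync sees it
sudo pcs status corosync
```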
### Resolution for scenario 3

Follow the proper guidelines to set up a [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.

## Scenario 4: Both cluster nodes are terminated after a failover event on RHEL 8

The Pacemaker cluster experiences an outage and triggers a failover event. In a two-node cluster configuration, both nodes are terminated and stay offline until manual intervention occurs.

The logs indicate that the STONITH device `python-user` triggers the shutdown instruction for both nodes.
### Cause for scenario 4

During an outage, such as a platform or network interruption as described in [Scenario 1](#scenario-1-network-outage), both nodes attempt to write to the STONITH device to fence each other because they lose the Totem token. Typically, the STONITH device follows the first available node's instruction to shut down the other node. If both nodes are allowed to write to the STONITH device, they might shut each other down.

### Resolution for scenario 4

We recommend that you use the `priority-fencing-delay` or `pcmk_delay_max` parameter so that the STONITH device acknowledges only one node's fencing request.

1. Put the cluster into maintenance mode:

    ```bash
    sudo pcs property set maintenance-mode=true
    ```
2. Edit the cluster configuration:

    ```bash
    sudo pcs cluster edit
    ```
3. If the Pacemaker version is earlier than `2.0.4-6.el8`, add the `pcmk_delay_max` parameter. Note that `pcmk_delay_max` is an attribute of the STONITH device rather than a cluster property:

    ```bash
    # Replace <stonith-device-name> with the name of your STONITH resource
    sudo pcs stonith update <stonith-device-name> pcmk_delay_max=15s
    ```
    If the Pacemaker version is `2.0.4-6.el8` or later, use the `priority-fencing-delay` cluster property instead:

    ```bash
    sudo pcs property set priority-fencing-delay=15s
    ```
4. Save the changes and take the cluster out of maintenance mode:

    ```bash
    sudo pcs property set maintenance-mode=false
    ```
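
After the cluster leaves maintenance mode, you can confirm which settings are in effect. A quick check, assuming a current pcs version:

```bash
# List cluster properties, including priority-fencing-delay and maintenance-mode if set
sudo pcs property list --all | grep -E "priority-fencing-delay|maintenance-mode"

# Review the STONITH device configuration, including pcmk_delay_max if set
sudo pcs stonith config
```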
For more information, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
## Scenario 5: `HANA_CALL` timeout after 60 seconds

The Azure RHEL Pacemaker cluster is running SAP HANA as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-06-04T09:25:38.724146+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: <>
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```

### Cause for scenario 5

The SAP HANA time-out messages are commonly considered internal application timeouts. Therefore, the SAP vendor should be engaged.

### Resolution for scenario 5

- To identify the root cause of the issue, review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files. A starting point for this check appears in the sketch after this list.
- After you rule out external factors, such as platform or network outages, we recommend that you contact the application vendor for trace call analysis and log review.
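
As a starting point for the OS performance review, the following commands capture a point-in-time view of memory and storage behavior. This is a minimal sketch; `iostat` assumes the `sysstat` package is installed, and `nfsiostat` applies only if the database files sit on NFS shares:

```bash
# Memory and swap pressure
free -h
vmstat 5 3

# Per-device I/O latency and utilization (requires the sysstat package)
iostat -dxm 5 3

# Per-mount NFS latency, if the database runs on NFS shares
nfsiostat 5 3
```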
## Scenario 6: `ASCS/ERS` time-out in SAP NetWeaver clusters

The Azure RHEL Pacemaker cluster is running SAP NetWeaver ASCS/ERS as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-11-09T07:36:42.037589-05:00 node01 SAPInstance(RSC_SAP_ERS10)[8689]: ERROR: SAP instance service enrepserver is not running with status GRAY !
2024-11-09T07:36:42.044583-05:00 node01 pacemaker-controld[2596]: notice: Result of monitor operation for RSC_SAP_ERS10 on node01: not running
```

```output
2024-11-09T07:39:42.789404-05:00 node01 SAPInstance(RSC_SAP_ASCS00)[16393]: ERROR: SAP Instance CP2-ASCS00 start failed: #01109.11.2024 07:39:42#012WaitforStarted#012FAIL: process msg_server MessageServer not running
2024-11-09T07:39:42.796280-05:00 node01 pacemaker-execd[2404]: notice: RSC_SAP_ASCS00 start (call 78, PID 16393) exited with status 7 (execution time 23.488s)
2024-11-09T07:39:42.828845-05:00 node01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
2024-11-09T07:39:42.828955-05:00 node01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
```

### Cause for scenario 6

The `ASCS/ERS` resource is considered to be the application for SAP NetWeaver clusters. When the corresponding cluster monitor operation times out, it triggers a failover process.
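
To see the monitor interval and timeout that the cluster currently applies to these resources, you can inspect the resource definitions. A minimal sketch; the resource names are taken from the log output above and may differ in your cluster:

```bash
# Show the configured operations (including monitor timeouts) for the ASCS and ERS resources
sudo pcs resource config RSC_SAP_ASCS00
sudo pcs resource config RSC_SAP_ERS10

# Review failure counts and recent operation results
sudo pcs status --full
```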
### Resolution for scenario 6

- To identify the root cause of the issue, we recommend that you review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if SAP NetWeaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
- After you rule out external factors, such as platform or network outages, we recommend that you engage the application vendor for trace call analysis and log review.
## Next steps
