**Applies to:** :heavy_check_mark: Linux VMs

This article provides guidance for troubleshooting, analyzing, and resolving the most common scenarios for unexpected node restarts in Red Hat Enterprise Linux (RHEL) Pacemaker clusters.

## Prerequisites
- Make sure that the Pacemaker cluster is correctly configured by following the guidelines in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](/azure/sap/workloads/high-availability-guide-rhel-pacemaker).
- For a Microsoft Azure Pacemaker cluster that uses the Azure Fence Agent as the STONITH (Shoot-The-Other-Node-In-The-Head) device, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
- For a Microsoft Azure Pacemaker cluster that uses SBD (STONITH Block Device) storage protection as the STONITH device, choose one of the following setup options (see the articles for detailed information):
  - [SBD with an iSCSI target server](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-iscsi-target-server)
  - [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-azure-shared-disk)

## Scenario 1: Network outage

- The cluster nodes experience `corosync` communication errors. These errors cause continuous retransmissions because the nodes can't establish communication with each other. This issue triggers application timeouts, ultimately causing node fencing and subsequent restarts.
- Additionally, services that depend on network connectivity, such as `waagent`, generate communication-related error messages in the logs. This further indicates network-related disruptions.
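To check for these symptoms, you can search the system logs around the restart time. The following is a minimal sketch; the log locations are the RHEL defaults, and the search patterns are illustrative rather than exhaustive:

```bash
# Look for corosync membership and token problems around the restart time.
sudo grep -E "TOTEM|Token has not been received" /var/log/messages | tail -n 20

# Check the Azure Linux agent log for connectivity-related errors.
sudo grep -iE "error|timeout" /var/log/waagent.log | tail -n 20
```
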
The following messages are logged in the `/var/log/messages` log:

```output
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received
```

### Cause for scenario 1
An unexpected node restart occurs because of a network maintenance activity or an outage. For confirmation, you can match the timestamp against the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in the Azure portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).

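You can also query Azure Scheduled Events directly from the affected VM. This is a sketch that uses the documented Instance Metadata Service endpoint; an empty `Events` array means that no maintenance is currently scheduled:

```bash
# Query the Azure Instance Metadata Service for scheduled events (run on the VM).
curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | python3 -m json.tool
```
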
### Resolution for scenario 1
If the unexpected restart timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance affected the cluster.

For further assistance or other queries, you can open a support request by following [these instructions](#next-steps).
## Scenario 2: Cluster misconfiguration

The cluster nodes experience unexpected failovers or node restarts. These events are often caused by cluster misconfigurations that affect the stability of Pacemaker clusters.

To review the cluster configuration, run the following command:

```bash
sudo pcs config show
```

### Cause for scenario 2
Unexpected restarts in an Azure RHEL Pacemaker cluster often occur because of misconfigurations:

- Incorrect STONITH configuration:
- Wrong resource start/stop parameters: Incorrectly tuned start and stop parameters in the cluster configuration can cause nodes to restart during resource recovery.

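To spot these misconfigurations, you can compare the running values against the documented recommendations. A minimal sketch, assuming pcs 0.10 (RHEL 8); the grep pattern is illustrative:

```bash
# Show STONITH-related cluster properties (stonith-enabled, stonith-timeout, and so on).
sudo pcs property list --all | grep -i stonith

# Show the configured fencing devices and their parameters.
sudo pcs stonith config
```
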
### Resolution for scenario 2
- Follow the proper guidelines to set up a [RHEL Pacemaker cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
- To make the necessary changes to the cluster configuration, follow these steps:

  1. Stop the application on both nodes.
  2. Put the cluster into maintenance mode:

     ```bash
     sudo pcs property set maintenance-mode=true
     ```

  3. Edit the cluster configuration:

     ```bash
     sudo pcs cluster edit
     ```

> To mitigate such risks, it's recommended to disable security tools on systems running a Pacemaker cluster or ensure that appropriate exclusions are configured to prevent conflicts with the cluster and its associated applications.
## Scenario 3: Migration from on-premises to Azure
When you migrate a RHEL Pacemaker cluster from on-premises to Azure, unexpected restarts can occur because of specific misconfigurations or overlooked dependencies.

### Cause for scenario 3
The following are common mistakes in this category:
- Incomplete or incorrect STONITH configuration:

For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).

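To verify that the cluster communication ports are reachable between the nodes after migration, you can run a quick connectivity probe. A sketch, assuming `nmap-ncat` 7.25 or later is installed; replace `<peer-node>` with the other node's hostname or IP address:

```bash
# Test UDP reachability of the default corosync port (5405) on the peer node.
nc -z -v -u <peer-node> 5405

# Test TCP reachability of the default pcsd port (2224) on the peer node.
nc -z -v <peer-node> 2224
```
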
### Resolution for scenario 3
Follow the proper guidelines to set up a [RHEL Pacemaker cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.

## Scenario 4: Both cluster nodes are terminated after a failover event on RHEL 8

The Pacemaker cluster faces an outage and triggers a failover event. In a two-node cluster configuration, both nodes are terminated and stay offline until manual intervention.

The logs indicate that the STONITH device `python-user` triggers the shutdown instruction for both nodes.

### Cause for scenario 4

During an outage, such as a platform or network interruption as described in [Scenario 1](#scenario-1-network-outage), both nodes try to write to the STONITH device to fence each other because they lose the Totem token. Typically, the STONITH device follows the first available node's instruction to shut down the other node. If both nodes are allowed to write to the STONITH device, they might shut each other down.

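When SBD is the STONITH device, you can inspect the message slots to see which node wrote a fencing instruction. A sketch; replace the device path with your actual SBD disk:

```bash
# List each node's SBD slot and any pending fence message (reset, off, or clear).
sudo sbd -d /dev/disk/by-id/<your-sbd-device> list
```
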
### Resolution for scenario 4

It's recommended to use the `priority-fencing-delay` or `pcmk_delay_max` parameter so that the STONITH device acknowledges the fencing request from only one node.

1. Put the cluster into maintenance mode:

   ```bash
   sudo pcs property set maintenance-mode=true
   ```

2. Edit the cluster configuration:

   ```bash
   sudo pcs cluster edit
   ```

3. If the Pacemaker version is earlier than `2.0.4-6.el8`, add the `pcmk_delay_max` parameter. Because `pcmk_delay_max` is a fencing-device attribute rather than a cluster property, set it on the STONITH resource (replace `<stonith-resource-name>` with the name of your STONITH resource):

   ```bash
   sudo pcs stonith update <stonith-resource-name> pcmk_delay_max=15s
   ```

   If the version is `2.0.4-6.el8` or later, use the cluster property `priority-fencing-delay` instead:

   ```bash
   sudo pcs property set priority-fencing-delay=15s
   ```

4. Save the changes, and take the cluster out of maintenance mode:

   ```bash
   sudo pcs property set maintenance-mode=false
   ```

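To verify which delay mechanism is active after the change, you can query the cluster. A sketch; the property and device parameters match the steps above:

```bash
# Show the cluster-wide fencing delay property, if configured.
sudo pcs property show priority-fencing-delay

# Review the fencing device parameters, including pcmk_delay_max, if set.
sudo pcs stonith config
```
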
For more information, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).

## Scenario 5: `HANA_CALL` timeout after 60 seconds

The Azure RHEL Pacemaker cluster is running SAP HANA as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-06-04T09:25:38.724146+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: <>
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```

### Cause for scenario 5
The SAP HANA time-out messages are commonly considered internal application timeouts. Therefore, the SAP vendor should be engaged.
### Resolution for scenario 5

- To identify the root cause of the issue, review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
- After you rule out external factors, such as platform or network outages, we recommend that you contact the application vendor for trace call analysis and log review.
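
A minimal sketch of commands that are commonly used to check for memory pressure and storage latency; it assumes that the `sysstat` and `nfs-utils` packages are installed:

```bash
# Memory pressure: paging, swap, and free-memory trends over one minute.
vmstat 5 12

# Per-device I/O latency and utilization.
iostat -x 5 3

# NFS mount statistics, if the HANA file systems are hosted on NFS.
nfsiostat 5 3
```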
## Scenario 6: `ASCS/ERS` time-out in SAP NetWeaver clusters

The Azure RHEL Pacemaker cluster is running SAP NetWeaver ASCS/ERS as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-11-09T07:36:42.037589-05:00 node 01 SAPInstance(RSC_SAP_ERS10)[8689]: ERROR: SAP instance service enrepserver is not running with status GRAY !
2024-11-09T07:36:42.044583-05:00 node 01 pacemaker-controld[2596]:notice: Result of monitor operation for RSC_SAP_ERS10 on node01: not running
```
```output
2024-11-09T07:39:42.789404-05:00 node01 SAPInstance(RSC_SAP_ASCS00)[16393]: ERROR: SAP Instance CP2-ASCS00 start failed: #01109.11.2024 07:39:42#012WaitforStarted#012FAIL: process msg_server MessageServer not running
2024-11-09T07:39:42.796280-05:00 node01 pacemaker-execd[2404]:notice: RSC_SAP_ASCS00 start (call 78, PID 16393) exited with status 7 (execution time 23.488s)
2024-11-09T07:39:42.828845-05:00 node 01 pacemaker-schedulerd[2406]:warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]:warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
```
### Cause for scenario 6
The `ASCS/ERS` resource is considered to be the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.

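To review the monitor interval and time-out values that are configured for the resource, you can inspect its definition. A sketch; `RSC_SAP_ASCS00` is the resource name taken from the log excerpt above:

```bash
# Show the configured operations (monitor interval, timeouts) of the ASCS resource.
sudo pcs resource config RSC_SAP_ASCS00
```
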
### Resolution for scenario 6
- To identify the root cause of the issue, we recommend that you review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if SAP NetWeaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
- After you rule out external factors, such as platform or network outages, we recommend that you engage the application vendor for trace call analysis and log review.
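
Before you engage the vendor, you can confirm the instance status from the OS by using `sapcontrol`. A sketch, assuming instance number `00`; run it as the `<sid>adm` user:

```bash
# List the SAP instance processes and their states (GREEN/YELLOW/GRAY/RED).
sapcontrol -nr 00 -function GetProcessList
```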