support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md
This article provides guidance on troubleshooting, analysis, and resolution of unexpected node reboots in SUSE Pacemaker clusters on Azure.
* Additionally, services dependent on network connectivity, such as `waagent`, generate communication-related error messages in the logs, further indicating network-related disruptions.
The following messages can be observed in `/var/log/messages`:
From `node 01`:
```output
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 has no active links
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
```

#### Cause

It's noted that the unexpected node reboot is observed due to Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
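The timestamp correlation can be sketched as follows. This minimal example pulls the token-loss timestamps out of a copy of the sample log lines shown above (on a real node, point the `grep` at `/var/log/messages` instead), so they can be compared against the maintenance window shown in the Azure maintenance notification:

```shell
# Sample log lines from this article, written to a temp file for illustration;
# on a cluster node, run the grep against /var/log/messages instead.
cat <<'EOF' > /tmp/messages_sample
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
EOF

# Print only the timestamps of token loss, for comparison with the
# maintenance window in the Azure Portal notification.
grep "Token has not been received" /tmp/messages_sample | awk '{print $1, $2, $3}'

# On the VM itself, currently scheduled platform events can also be listed
# directly (the metadata endpoint is only reachable from inside Azure):
# curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
```

If a `Freeze` or `Reboot` event in the Scheduled Events response overlaps the extracted timestamps, platform maintenance is the likely cause.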
#### Resolution
If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that the cluster was impacted by either platform or network maintenance. For further assistance or additional queries, you can open a support request by following these [instructions](#next-steps).
### Scenario 2: Cluster Misconfiguration
The cluster nodes experience unexpected failovers or reboots, often due to cluster misconfiguration affecting the stability of Pacemaker clusters.
The cluster configuration can be reviewed by running the following command:
```bash
sudo crm configure show
```
Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconfiguration:
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
- Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures.
2. Missing/Incorrect resource constraints:
Poorly set constraints can cause resources to be redistributed unnecessarily, leading to node overload and reboots. Misaligned resource dependency configurations can cause nodes to go into a fail/reboot loop.
3. Cluster threshold and timeout misconfigurations:
- Incorrect `failure-timeout`, `migration-threshold`, or `monitor-timeout` values may result in nodes being prematurely rebooted.
- Heartbeat timeout settings: Incorrect `corosync` timeout settings for heartbeat intervals can cause each node to assume the other is offline, triggering unnecessary reboots.
4. Lack of proper health checks:
Not setting proper health-check intervals for critical services like SAP HANA (High-performance ANalytic Appliance) can cause resource or node failures.
5. Resource agent misconfiguration:
- Custom resource agents misaligned with the cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node reboots.
- Wrong resource start/stop parameters: Improperly tuned start/stop parameters in the cluster configuration may lead to nodes rebooting during resource recovery.
6. Corosync configuration issues:
- Unoptimized network settings: Incorrect multicast/unicast configuration can lead to heartbeat communication failures. Mismatched `ring0` and `ring1` network configurations cause split-brain scenarios and node fencing.
- Token timeout mismatches: Token timeout values not aligned with the environment’s latency can trigger node isolation and reboots.
The following command can be used to review the Corosync configuration:
```bash
sudo cat /etc/corosync/corosync.conf
```
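As a sketch of what to look for, the token-related values can be extracted for review. The configuration below is an assumed sample `totem` section, not output from a real cluster; on a node, run the same `awk` filter against `/etc/corosync/corosync.conf`:

```shell
# Assumed sample totem section; replace with the real file on a node.
cat <<'EOF' > /tmp/corosync_sample.conf
totem {
    version: 2
    token: 30000
    consensus: 36000
    token_retransmits_before_loss_const: 10
}
EOF

# List the token and consensus timeouts (in ms) so they can be compared with
# the "Token has not been received in 30000 ms" messages seen in the logs.
awk -F': ' '/^[[:space:]]*(token|consensus):/ { gsub(/^[[:space:]]+/, ""); print $1 "=" $2 }' /tmp/corosync_sample.conf
```

A `token` value that is too low for the environment's network latency is one of the mismatches described above.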
#### Resolution
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
- Steps to make necessary changes to the cluster configuration:
1. Stop the application on both nodes.
### Scenario 3: Migration from On-premises to Azure
When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies. Below are common mistakes in this category:
#### Cause
1. Incomplete or incorrect STONITH configuration:
- No STONITH or misconfigured fencing: Not configuring STONITH (Shoot-The-Other-Node-In-The-Head) properly can lead to nodes being marked as unhealthy, triggering unnecessary reboots.
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
- Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM names, must be correctly configured in the fencing agent. Omissions here can result in fencing failures and unexpected reboots.
For more information, see [Troubleshoot Azure Fence Agent startup issues in SUSE](troubleshoot-azure-fence-agent-startup-suse.md) and [Troubleshoot SBD service failure in SUSE Pacemaker clusters](troubleshoot-sbd-issues-sles.md).
2. Network misconfigurations:
Misconfigured VNets, subnets, or security group rules can block essential cluster communication, leading to perceived node failures and reboots.
For more information, see [Virtual networks and virtual machines in Azure](/azure/virtual-machines/linux/network-overview)
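A quick TCP reachability sketch can help confirm NSG rules. The port list here is an assumption based on common SUSE HA components (`pacemaker_remote` on 3121, the Hawk web console on 7630); verify it for your deployment, and note that corosync itself uses UDP ports 5404/5405, which a TCP probe cannot cover and which must be checked via NSG/firewall rule inspection:

```shell
# TARGET is a placeholder; on a cluster node, set it to the peer node's IP.
TARGET=127.0.0.1

for port in 3121 7630; do
  # /dev/tcp probe with a 1-second timeout; prints one status line per port.
  if timeout 1 bash -c "echo > /dev/tcp/$TARGET/$port" 2>/dev/null; then
    echo "port $port: reachable"
  else
    echo "port $port: blocked or closed"
  fi
done | tee /tmp/port_check.txt
```

A "blocked or closed" result for a port the cluster needs points at an NSG or firewall rule to fix.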
3. Metadata Service issues:
Azure's cloud metadata services must be correctly handled; otherwise, resource detection or startup processes can fail.
For more information, see [Azure Instance Metadata Service](/azure/virtual-machines/instance-metadata-service) and [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
4. Performance and latency mismatches:
- Inadequate VM sizing: Migrated workloads may not align with the selected Azure VM (virtual machine) size, causing resource overutilization and triggering reboots.
- Disk I/O mismatches: On-premises workloads with high IOPS (input/output operations per second) demands must be paired with the appropriate Azure disk or storage performance tier.
For more information, see [Collect performance metrics for a Linux VM](collect-performance-metrics-from-a-linux-system.md)
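As a sketch, high-latency devices can be flagged from `iostat` output. The table below is an assumed, abbreviated sample (real `iostat -dx` output has more columns, so the field number must be adjusted); on a VM, pipe `iostat -dx 1 3` from the sysstat package into a similar filter:

```shell
# Assumed sample of abbreviated `iostat -dx` output, for illustration only.
cat <<'EOF' > /tmp/iostat_sample.txt
Device            r/s     w/s     rkB/s     wkB/s   await  %util
sda              10.0    50.0     512.0    2048.0    4.20   35.0
sdb              80.0   300.0    4096.0   16384.0   45.70   98.0
EOF

# Flag devices with average I/O latency (await, column 6 in this sample)
# above 20 ms; the threshold is an assumption — tune it to the latency
# target of the chosen Azure disk tier.
awk 'NR > 1 && $6 > 20 { print $1, "high latency:", $6, "ms" }' /tmp/iostat_sample.txt
```

Sustained high `await` on a migrated workload suggests the disk tier doesn't match the on-premises IOPS profile.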
5. Security and firewall rules:
- Port blockages: On-premises clusters often have open, internal communication, while Azure NSGs (network security groups) or firewalls may block ports required for Pacemaker/Corosync communication.

For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).
#### Resolution
- It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
### Scenario 4: `HANA_CALL` timeout after 60 seconds
The Azure SUSE Pacemaker cluster is running SAP HANA as the application and experiences unexpected node reboots. The following messages can be observed:

```output
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```
#### Cause
The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.
#### Resolution
The Azure SUSE Pacemaker cluster is running SAP NetWeaver ASCS/ERS as the application and experiences unexpected node reboots. The following messages can be observed:

```output
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
173
178
```
#### Cause
The `ASCS/ERS` resource is considered the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
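The monitor timeout in question can be compared with actual operation runtimes. The excerpt below is an assumed sample of `crm configure show` output (the operation values are illustrative, though the resource name matches the log above); on a node, run `sudo crm configure show` and apply the same filter:

```shell
# Assumed sample excerpt of `crm configure show` for the ASCS resource.
cat <<'EOF' > /tmp/crm_show_sample.txt
primitive RSC_SAP_ASCS00 SAPInstance \
        op monitor interval=11 timeout=60 on-fail=restart
EOF

# Extract the monitor timeout (seconds) so it can be compared with how long
# the SAP monitor operation actually takes under load.
awk 'match($0, /timeout=[0-9]+/) { print substr($0, RSTART + 8, RLENGTH - 8) }' /tmp/crm_show_sample.txt
```

If the monitored operation routinely runs close to this value, the timeout is a candidate for tuning before concluding the application itself is at fault.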