Commit a3f9e12

Fix typos and formatting in troubleshooting guide
1 parent 918d009 commit a3f9e12

1 file changed

Lines changed: 25 additions & 20 deletions

support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-suse.md
@@ -31,6 +31,7 @@ This article provides guidance on troubleshooting, analysis, and resolution of t
 * Additionally, services dependent on network connectivity, such as `waagent`, generate communication related error messages in the logs, further indicating network related disruptions.
 
 The following messages can be observed in `/var/log/messages`:
+
 From `node 01`:
 ```output
 Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
@@ -44,15 +45,15 @@ Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 has no active
 Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
 ```
 
-#### Cause
+### Cause
 It's noted that the unexpected node reboot is observed due to Network Maintenance activity or an outage. For confirmation, the timestamp can be matched by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in Azure Portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
 
 #### Resolution
 If the unexpected reboot timestamp aligns with a maintenance activity, the analysis confirms that the cluster was impacted by either platform or network maintenance. For further assistance or additional queries, you can open a support request by following these [instructions](#next-steps).
 
 ### Scenario 2: Cluster Misconfiguration
 The cluster nodes experience unexpected failover or node reboots and it's often observed due to cluster misconfiguration affecting the stability of Pacemaker Clusters.
-The cluster configuration can be reviwed by running the following command:
+The cluster configuration can be reviewed by running the following command:
 ```bash
 sudo crm configure show
 ```
@@ -65,29 +66,29 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
 - Wrong STONITH resource rettings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
 - Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures.
 
-2. Missing/Incorrect Resource Constraints:
+2. Missing/Incorrect resource constraints:
 Poorly set constraints can cause resources to be redistributed unnecessarily, leading to node overload and, reboots. Misaligned resource dependency configurations can cause nodes to go into a fail/reboot loop.
 
-3. Cluster Threshold and Timeout Misconfigurations:
+3. Cluster threshold and timeout misconfigurations:
 - `failure-timeout`, `migration-threshold`, or `monitor-timeout` values may result in nodes being prematurely rebooted.
 - Heartbeat Timeout Settings: Incorrect `corosync` timeout settings for heartbeat intervals can cause nodes to assume each other are offline, triggering unnecessary reboots.
 
-4. Lack of Proper Health Checks:
-Insufficient Monitoring of Critical Resources: Not setting proper health-check intervals for critical services like SAP HANA(High-performance ANalytic Application) can cause resource or node failures.
+4. Lack of proper health checks:
+Not setting proper health-check intervals for critical services like SAP HANA(High-performance ANalytic Application) can cause resource or node failures.
 
-5. Resource Agent Misconfiguration:
+5. Resource agent misconfiguration:
 - Custom resource agents misaligned with cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node reboots.
 - Wrong resource start/stop parameters: Improperly tuned start/stop parameters in cluster configuration may lead to nodes rebooting during resource recovery.
 
-6. Corosync Configuration Issues:
+6. Corosync configuration issues:
 - Unoptimized network settings: Incorrect multicast/unicast configuration can lead to heartbeat communication failures. Mismatched `ring0` and `ring1` network configurations cause split-brain scenarios and node fencing.
 - Token timeout mismatches: Token timeout values not aligned with the environment’s latency can trigger node isolation and reboots.
 - Following command can be used to review Corosync configuration:
 ```bash
 sudo cat /etc/corosync/corosync.conf
 ```
 
-### Resolution
+#### Resolution
 - It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
 - Steps to make necessary changes to the cluster configuration:
 1. Stop the application on both the nodes.
@@ -107,34 +108,38 @@ Unexpected reboots in an Azure SUSE Pacemaker cluster often occur due to misconf
 ### Scenario 3: Migration from On-premises to Azure
 When migrating a SUSE Pacemaker cluster from on-premises to Azure, unexpected reboots can arise from specific misconfigurations or overlooked dependencies. Below are common mistakes in this category:
 
-#### Cause
+### Cause
 
 1. Incomplete or incorrect STONITH configuration:
 - No STONITH or fencing misfconfigured: Not configuring STONITH (Shoot-The-Other-Node-In-The-Head) properly can lead to nodes being marked as unhealthy and triggering unnecessary reboots.
 - Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents like `fence_azure_arm` can cause nodes to reboot unexpectedly during failovers.
 - Insufficient permissions: The Azure resource group or credentials used for fencing may lack required permissions, causing STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM names, must be correctly configured in the fencing agent. Omissions here can result in fencing failures and unexpected reboots.
+
+For more information, see [Troubleshoot Azure Fence Agent startup issues in SUSE](troubleshoot-azure-fence-agent-startup-suse.md) and [Troubleshoot SBD service failure in SUSE Pacemaker clusters](troubleshoot-sbd-issues-sles.md)
 
 2. Network misconfigurations:
 Misconfigured VNets, subnets, or security group rules can block essential cluster communication, leading to perceived node failures and reboots.
-Refer: [Virtual networks and virtual machines in Azure](/azure/virtual-network/network-overview)
+
+For more information, see [Virtual networks and virtual machines in Azure](/azure/virtual-machines/linux/network-overview)
 
 3. Metadata Service issues:
 Azure's cloud metadata services must be correctly handled; otherwise, resource detection or startup processes can fail.
-Refer:
-- [Azure Instance Metadata Service](/azure/virtual-machines/instance-metadata-service)
-- [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events)
+
+For more information, see [Azure Instance Metadata Service](/azure/virtual-machines/instance-metadata-service) and [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events)
 
 4. Performance and latency mismatches:
 - Inadequate VM sizing: Migrated workloads may not align with the selected Azure VM(Virtual Machine) size, causing resource overutilization and triggering reboots.
 - Disk I/O mismatches: On-premises workloads with high IOPS(Input/output operations per second) demands must be paired with the appropriate Azure disk or storage performance tier.
-Refer:[Collect performance metrics for a Linux VM](/azure/virtual-machines/linux/collect-performance-metrics-from-a-linux-system.md)
+
+For more information, see [Collect performance metrics for a Linux VM](collect-performance-metrics-from-a-linux-system.md)
 
 5. Security and Firewall Rules:
-- Port Blockages: On-premises clusters often have open, internal communication, while Azure NSGs (Network Security Groups) or firewalls may block ports required for Pacemaker/Corosync communication.
-Refer: [Network security group test](/azure/virtual-machines/network-security-group-test)
+- Port Block: On-premises clusters often have open, internal communication, while Azure NSGs (Network Security Groups) or firewalls may block ports required for Pacemaker/Corosync communication.
+
+For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test)
 
 
-### Resolution
+#### Resolution
 - It's necessary to follow the proper guidelines outlined for setting up a [SUSE Pacemaker Cluster](#prerequisites). Additionally, ensure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-suse), as specified in our Microsoft documentation.
 
 ### Scenario 4: `HANA_CALL` timeout after 60 seconds
@@ -148,7 +153,7 @@ The Azure SUSE Pacemaker Cluster is running SAP HANA as application and experien
 2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
 ```
 
-#### Cause
+### Cause
 The SAP HANA timeout messages are commonly considered internal application timeouts, and the SAP vendor should be engaged.
 
 #### Resolution
@@ -172,7 +177,7 @@ The Azure SUSE Pacemaker Cluster is running SAP Netweaver ASCS/ERS as applicatio
 2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov  9 07:39:42 2024
 ```
 
-#### Cause
+### Cause
 The `ASCS/ERS` resource is considered the application for SAP Netweaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
 
 #### Resolution
