Commit 724f5ea

Merge pull request #8211 from rnirek/patch-13
AB#4030: Create troubleshoot-unexpected-node-reboots-pacemaker-rhel.md
2 parents 288a02e + abc740d commit 724f5ea

2 files changed

Lines changed: 258 additions & 0 deletions

File tree

support/azure/virtual-machines/linux/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -280,6 +280,8 @@
280280
href: troubleshoot-sbd-issues-rhel.md
281281
- name: Troubleshoot pacemaker cluster services and resources startup issues
282282
href: troubleshoot-rhel-pacemaker-cluster-services-resources-startup-issues.md
283+
- name: Troubleshoot unexpected node reboots issues in RHEL
284+
href: troubleshoot-unexpected-node-reboots-pacemaker-rhel.md
283285
- name: SLES
284286
items:
285287
- name: Troubleshoot Azure fence agent startup issues in SUSE
Lines changed: 256 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,256 @@
1+
---
2+
title: Troubleshoot Unexpected Node Restarts in Azure Linux RHEL Pacemaker Cluster
3+
description: This article provides troubleshooting steps for resolving unexpected node restarts in RHEL Pacemaker clusters.
4+
author: rnirek
5+
ms.author: rnirek
6+
ms.reviewer: divargas, rnirek, lariasjaen
7+
ms.topic: troubleshooting
8+
ms.date: 2/19/2025
9+
ms.service: azure-virtual-machines
10+
ms.collection: linux
11+
ms.custom: sap:Issue with Pacemaker cluster, and fencing
12+
---
13+
14+
# Troubleshooting unexpected node restarts in Azure Linux RHEL Pacemaker Cluster nodes
15+
16+
**Applies to:** :heavy_check_mark: Linux VMs
17+
18+
This article provides guidance for troubleshooting, analyzing, and resolving the most common scenarios for unexpected node restarts in Red Hat Enterprise Linux (RHEL) Pacemaker clusters.
19+
20+
## Prerequisites
21+
22+
- Make sure that the Pacemaker Cluster setup is correctly configured by following the guidelines that are provided in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](/azure/sap/workloads/high-availability-guide-rhel-pacemaker).
23+
- For a Microsoft Azure Pacemaker Cluster that uses the Azure Fence Agent as the STONITH (Shoot-The-Other-Node-In-The-Head) device, refer to the [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration) documentation.
24+
- For a Microsoft Azure Pacemaker cluster that uses SBD (STONITH Block Device) storage protection as the STONITH device, choose one of the following setup options (see the articles for detailed information):
25+
- [SBD with an iSCSI target server](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-iscsi-target-server)
26+
- [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-azure-shared-disk)
27+
28+
## Scenario 1: Network outage
29+
30+
- The cluster nodes are experiencing `corosync` communication errors. This causes continuous retransmissions because of an inability to establish communication between nodes. This issue triggers application time-outs and ultimately causes node fencing and subsequent restarts.
31+
- Services that are dependent on network connectivity, such as `waagent`, generate communication-related error entries in the logs. These entries further indicate network-related disruptions.
32+
33+
The following messages are logged in the `/var/log/messages` log:
34+
35+
From `node 01`:
36+
```output
37+
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] Token has not been received in 30000 ms
38+
Aug 21 01:48:00 node 01 corosync[19389]: [TOTEM ] A processor failed, forming new configuration: token timed out (40000ms), waiting 48000ms for consensus.
39+
```
40+
From `node 02`:
41+
```output
42+
Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] link: host: 2 link: 0 is down
43+
Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
44+
Aug 21 01:47:27 node 02 corosync[15241]: [KNET ] host: host: 2 has no active links
45+
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received in 30000 ms
46+
```
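To locate entries like these in a long log, a small filter can help. The following is a minimal sketch, not an official tool: the function name `corosync_events` is hypothetical, and the pattern assumes that your log lines have the same shape as the samples shown in this scenario.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print corosync token-loss and link-down entries.
# Assumes the log format shown in the sample output for this scenario.
corosync_events() {
    local logfile="${1:-/var/log/messages}"   # default path taken from this article
    grep -E 'corosync\[[0-9]+\]: +\[(TOTEM|KNET) *\] .*(Token has not been received|A processor failed|is down|has no active links)' "$logfile"
}
```

For example, run `corosync_events /var/log/messages` as a user who can read the log, and then compare the timestamps on both nodes.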
47+
48+
### Cause of scenario 1
49+
50+
An unexpected node restart occurs because of a network maintenance activity or an outage. For verification, you can match the timestamp by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in the Azure portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
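As a complement to the portal, the Scheduled Events endpoint of the Azure Instance Metadata Service can be queried from inside the VM. The following is a minimal sketch: the function and variable names are hypothetical, and the API version `2020-07-01` is an assumption that you might have to adjust. The endpoint reports pending events, so it's most useful when it's run before or during a maintenance window.

```bash
#!/usr/bin/env bash
# Scheduled Events endpoint of the Instance Metadata Service.
# The api-version value is an assumption; use the version that your environment supports.
IMDS_EVENTS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Hypothetical helper: print the raw JSON document that lists pending events.
get_scheduled_events() {
    curl -s -H "Metadata:true" --noproxy "*" "$IMDS_EVENTS_URL"
}
```

Run `get_scheduled_events` on each cluster node, and check the returned JSON for an event whose time window matches the restart.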
51+
52+
### Resolution for scenario 1
53+
54+
If the unexpected restart timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance affected the cluster.
55+
56+
For further assistance or other inquiries, you can open a support request by following [these instructions](#next-steps).
57+
58+
## Scenario 2: Cluster misconfiguration
59+
60+
The cluster nodes experience unexpected failovers or node restarts. These issues often occur because of cluster misconfigurations that affect the stability of Pacemaker Clusters.
61+
62+
To review the cluster configuration, run the following command:
63+
```bash
64+
sudo pcs config show
65+
```
66+
67+
### Cause of scenario 2
68+
69+
Unexpected restarts in an Azure RHEL Pacemaker cluster often occur because of misconfigurations:
70+
71+
- Incorrect STONITH configuration:
72+
- No STONITH or fencing misconfigured: Not configuring STONITH correctly could cause nodes to be marked as unhealthy and trigger unnecessary restarts.
73+
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents, such as `fence_azure_arm`, could cause nodes to restart unexpectedly during failovers.
74+
- Insufficient permissions: The Azure resource group or credentials that are used for fencing might lack required permissions and cause STONITH failures.
75+
76+
- Missing or incorrect resource constraints:
77+
Poorly set constraints could cause resources to be redistributed unnecessarily. This situation, in turn, could cause node overload and restarts. Misaligned resource dependency configurations could cause nodes to fail or go into a restart loop.
78+
79+
- Cluster threshold and time-out misconfigurations:
80+
- Threshold and time-out values: Incorrect `failure-timeout`, `migration-threshold`, or monitor operation `timeout` values might cause nodes to be restarted prematurely.
81+
- Heartbeat time-out settings: Incorrect `corosync` time-out settings for heartbeat intervals could cause nodes to assume that the other nodes are offline. This situation can trigger unnecessary restarts.
82+
83+
- Lack of proper health checks:
84+
Not setting correct health check intervals for critical services such as SAP HANA (High-performance ANalytic Appliance) could cause resource or node failures.
85+
86+
- Resource agent misconfiguration:
87+
- Custom resource agents misaligned with cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node restarts.
88+
- Wrong resource start/stop parameters: Incorrectly tuned start/stop parameters in cluster configuration might cause nodes to restart during resource recovery.
89+
90+
### Resolution for scenario 2
91+
92+
- Follow the proper guidelines to set up a [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
93+
- Steps to make necessary changes to the cluster configuration:
94+
1. Stop the application on both nodes.
95+
2. Put the cluster into maintenance-mode:
96+
97+
```bash
98+
sudo pcs property set maintenance-mode=true
99+
```
100+
3. Edit the cluster configuration:
101+
102+
```bash
103+
sudo pcs cluster edit
104+
```
105+
4. Save the changes.
106+
5. Remove the cluster from maintenance mode.
107+
```bash
108+
sudo pcs property set maintenance-mode=false
109+
```
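The maintenance-mode steps can also be wrapped in a small function so that the cluster leaves maintenance mode only if the change succeeds. This is a minimal sketch under stated assumptions: `run_in_maintenance_mode` and the `PCS_CMD` override are hypothetical names. Setting `PCS_CMD=echo` gives a dry run that prints the `pcs` arguments instead of running them.

```bash
#!/usr/bin/env bash
# PCS_CMD is a hypothetical override for dry runs; by default, the real CLI is used.
PCS_CMD="${PCS_CMD:-sudo pcs}"

# Hypothetical helper: run a change command while the cluster is in maintenance mode.
run_in_maintenance_mode() {
    # Intentionally unquoted so that "sudo pcs" splits into command and argument.
    $PCS_CMD property set maintenance-mode=true || return 1
    if "$@"; then
        $PCS_CMD property set maintenance-mode=false
    else
        # Stay in maintenance mode so that the failed change can be reviewed first.
        echo "Change failed; the cluster is still in maintenance mode." >&2
        return 1
    fi
}
```

For example, `run_in_maintenance_mode ./apply_change.sh` would run a script of your own (the script name is a placeholder) between the two `pcs property set` calls. The function doesn't stop the application, so that step is still manual.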
110+
111+
> [!IMPORTANT]
112+
> When you troubleshoot unexpected node restarts or failures, it's crucial to assess the effect of security tools that are installed on the system. These tools might interfere with cluster operations by blocking essential processes or modifying system files. This situation could cause instability, unexpected time-outs, or node restarts.
113+
>
114+
> To mitigate such risks, we recommend that you disable security tools on systems that are running a Pacemaker Cluster. Alternatively, you can configure appropriate exclusions to prevent conflicts with the cluster and its associated applications.
115+
116+
## Scenario 3: Migration from on-premises to Azure
117+
118+
When you migrate a RHEL Pacemaker cluster from on-premises to Azure, unexpected restarts can occur because of specific misconfigurations or overlooked dependencies.
119+
120+
### Cause of scenario 3
121+
122+
The following are common mistakes that are made in this category:
123+
124+
- Incomplete or incorrect STONITH configuration:
125+
- No STONITH or fencing misconfigured: Not configuring STONITH correctly could cause nodes to be marked as unhealthy and trigger unnecessary restarts.
126+
- Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents such as `fence_azure_arm` could cause nodes to restart unexpectedly during failovers.
127+
- Insufficient permissions: The Azure resource group or credentials that are used for fencing might lack required permissions and cause STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM (Virtual Machine) names, must be correctly configured in the fencing agent. Omissions here could cause fencing failures and unexpected restarts.
128+
129+
For more information, see [Troubleshoot Azure Fence Agent startup issues in RHEL](troubleshoot-azure-fence-agent-rhel.md) and [Troubleshoot SBD service failure in RHEL Pacemaker clusters](troubleshoot-sbd-issues-rhel.md).
130+
131+
- Network misconfigurations:
132+
Misconfigured VNets, subnets, or security group rules can block essential cluster communication and cause perceived node failures and restarts.
133+
134+
For more information, see [Virtual networks and virtual machines in Azure](/azure/virtual-machines/linux/network-overview).
135+
136+
- Metadata Service issues:
137+
Azure's cloud metadata services must be correctly handled. Otherwise, resource detection or startup processes can fail.
138+
139+
For more information, see [Azure Instance Metadata Service](/azure/virtual-machines/instance-metadata-service) and [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
140+
141+
- Performance and latency mismatches:
142+
- Inadequate VM sizing: Migrated workloads might not align with the selected Azure VM size. This causes excessive resource use and triggers restarts.
143+
- Disk I/O mismatches: On-premises workloads that have high IOPS (Input/output operations per second) demands must be paired with the appropriate Azure disk or storage performance tier.
144+
145+
For more information, see [Collect performance metrics for a Linux VM](collect-performance-metrics-from-a-linux-system.md).
146+
147+
- Security and firewall rules:
148+
- Blocked ports: On-premises clusters often have open, internal communication. In Azure, however, NSGs (Network Security Groups) or firewalls might block ports that are required for Pacemaker or Corosync communication.
149+
150+
For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).
151+
152+
### Resolution for scenario 3
153+
154+
Follow the proper guidelines to set up a [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
155+
156+
## Scenario 4: Both cluster nodes are terminated after a failover event on RHEL 8
157+
158+
The Pacemaker Cluster anticipates an outage, and it proceeds to trigger a failover event. In a two-node cluster configuration, both nodes are terminated and stay offline until manual intervention can occur.
159+
160+
The logs indicate that the STONITH device, `python-user`, triggers the shutdown instruction for both nodes.
161+
162+
### Cause of scenario 4
163+
164+
During an outage, such as a platform or network interruption (see [Scenario 1](#scenario-1-network-outage)), both nodes try to write to the STONITH device to fence the other because they lose the totem token. Typically, the STONITH device takes instruction from the first available node to write on it in order to shut down the other node. If both nodes are allowed to write to the STONITH device, they might terminate each other.
165+
166+
### Resolution for scenario 4
167+
168+
We recommend that you use the `priority-fencing-delay` or `pcmk_delay_max` parameter so that only one VM is acknowledged by the STONITH device:
169+
170+
1. Put the cluster into maintenance mode:
171+
172+
```bash
173+
sudo pcs property set maintenance-mode=true
174+
```
175+
176+
2. Edit the cluster configuration:
177+
178+
```bash
179+
sudo pcs cluster edit
180+
```
181+
182+
3. If the Pacemaker version is earlier than `2.0.4-6.el8`, add the `pcmk_delay_max` parameter to the STONITH device:
183+
184+
```bash
185+
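# pcmk_delay_max is a STONITH device attribute, not a cluster property.
# Replace <stonith_resource_name> with the ID of your STONITH resource.
sudo pcs stonith update <stonith_resource_name> pcmk_delay_max=15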
186+
```
187+
188+
If the Pacemaker version is `2.0.4-6.el8` or later, use the `priority-fencing-delay` parameter instead:
189+
190+
```bash
191+
sudo pcs property set priority-fencing-delay=15s
192+
```
193+
194+
4. Save the changes, and remove the cluster from maintenance mode:
195+
196+
```bash
197+
sudo pcs property set maintenance-mode=false
198+
```
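To decide which of the two parameters applies, the installed Pacemaker version can be compared with `2.0.4-6.el8`. The following is a minimal sketch: `fencing_delay_param` is a hypothetical name, and `sort -V` only approximates RPM version ordering, so treat the result as a hint and not as a definitive check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the delay parameter that matches a Pacemaker version.
# Uses GNU "sort -V", which approximates (but isn't identical to) RPM version comparison.
fencing_delay_param() {
    local installed="$1"
    local threshold="2.0.4-6.el8"
    local lowest
    lowest="$(printf '%s\n%s\n' "$installed" "$threshold" | sort -V | head -n 1)"
    if [ "$lowest" = "$threshold" ]; then
        echo "priority-fencing-delay"   # installed version is the threshold or later
    else
        echo "pcmk_delay_max"           # installed version is earlier than the threshold
    fi
}
```

On a cluster node, you can pass the installed version directly, for example: `fencing_delay_param "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' pacemaker)"`.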
199+
200+
For more information, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
201+
202+
## Scenario 5: `HANA_CALL` time-out after 60 seconds
203+
204+
The Azure RHEL Pacemaker Cluster is running SAP HANA as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker Cluster. Per the `/var/log/messages` or `/var/log/pacemaker.log` log entries, the node restart is caused by a `HANA_CALL` time-out, as follows:
205+
206+
```output
207+
2024-06-04T09:25:37.772406+00:00 node01 SAPHanaTopology(rsc_SAPHanaTopology_H00_HDB02)[99440]: WARNING: RA: HANA_CALL timed out after 60 seconds running command 'hdbnsutil -sr_stateConfiguration --sapcontrol=1'
208+
2024-06-04T09:25:38.711650+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: WARNING: RA: HANA_CALL timed out after 60 seconds running command 'hdbnsutil -sr_stateConfiguration'
209+
2024-06-04T09:25:38.724146+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: <>
210+
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
211+
```
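To see how often these time-outs occur and which resource agents report them, the log entries can be counted per agent. This is a minimal sketch: `hana_call_timeouts` is a hypothetical name, and the `awk` program assumes that the resource agent is the third whitespace-separated field, as in the sample output for this scenario.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count HANA_CALL time-outs per resource agent instance.
# Assumes the log layout shown in the sample output (agent name in field 3).
hana_call_timeouts() {
    awk '/HANA_CALL timed out after/ {
             agent = $3
             sub(/\[[0-9]+\]:$/, "", agent)   # drop the PID suffix
             count[agent]++
         }
         END { for (a in count) print count[a], a }' "$@"
}
```

For example, `hana_call_timeouts /var/log/messages` prints one line per resource agent together with its number of time-outs. A steadily growing count over several days suggests a recurring performance issue and not a single outage.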
212+
213+
### Cause of scenario 5
214+
215+
The SAP HANA time-out messages are commonly considered to be internal application time-outs. Therefore, the SAP vendor should be engaged.
216+
217+
### Resolution for scenario 5
218+
219+
- To identify the root cause of the issue, review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
220+
- You should pay particular attention to memory pressure and storage devices and their configuration. This is especially true if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
221+
- After you rule out external factors, such as platform or network outages, we recommend that you contact the application vendor for trace call analysis and log review.
222+
223+
## Scenario 6: `ASCS/ERS` time-out in SAP NetWeaver clusters
224+
225+
The Azure RHEL Pacemaker Cluster is running SAP NetWeaver ASCS/ERS as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker Cluster. The following messages are logged in the `/var/log/messages` log:
226+
227+
```output
228+
2024-11-09T07:36:42.037589-05:00 node 01 SAPInstance(RSC_SAP_ERS10)[8689]: ERROR: SAP instance service enrepserver is not running with status GRAY !
229+
2024-11-09T07:36:42.044583-05:00 node 01 pacemaker-controld[2596]: notice: Result of monitor operation for RSC_SAP_ERS10 on node01: not running
230+
```
231+
232+
```output
233+
2024-11-09T07:39:42.789404-05:00 node01 SAPInstance(RSC_SAP_ASCS00)[16393]: ERROR: SAP Instance CP2-ASCS00 start failed: #01109.11.2024 07:39:42#012WaitforStarted#012FAIL: process msg_server MessageServer not running
234+
2024-11-09T07:39:42.796280-05:00 node01 pacemaker-execd[2404]: notice: RSC_SAP_ASCS00 start (call 78, PID 16393) exited with status 7 (execution time 23.488s)
235+
2024-11-09T07:39:42.828845-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
236+
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
237+
```
238+
239+
### Cause of scenario 6
240+
241+
The `ASCS/ERS` resource is considered to be the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.
242+
243+
### Resolution for scenario 6
244+
245+
- To identify the root cause of the issue, we recommend that you review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
246+
- You should pay particular attention to memory pressure and storage devices and their configuration. This is especially true if SAP NetWeaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
247+
- After you rule out external factors, such as platform or network outages, we recommend that you engage the application vendor for trace call analysis and log review.
248+
249+
## Next steps
250+
For more help, open a support request, and submit your request by attaching [sosreport](https://access.redhat.com/solutions/3592) logs for troubleshooting.
251+
252+
[!INCLUDE [Third-party disclaimer](../../../includes/third-party-disclaimer.md)]
253+
254+
[!INCLUDE [Third-party contact disclaimer](../../../includes/third-party-contact-disclaimer.md)]
255+
256+
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
