**Applies to:** :heavy_check_mark: Linux VMs

This article provides guidance for troubleshooting, analyzing, and resolving the most common scenarios for unexpected node restarts in Red Hat Enterprise Linux (RHEL) Pacemaker clusters.

## Prerequisites
- Make sure that the Pacemaker cluster is correctly configured by following the guidelines in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](/azure/sap/workloads/high-availability-guide-rhel-pacemaker).
- For a Microsoft Azure Pacemaker cluster that uses the Azure Fence Agent as the STONITH (Shoot-The-Other-Node-In-The-Head) device, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
- For a Microsoft Azure Pacemaker cluster that uses SBD (STONITH Block Device) storage protection as the STONITH device, choose one of the following setup options (see the articles for detailed information):
  - [SBD with an iSCSI target server](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-iscsi-target-server)
  - [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-azure-shared-disk)

## Scenario 1: Network outage

- The cluster nodes experience `corosync` communication errors. These errors cause continuous retransmissions because the nodes can't establish communication with each other. This issue triggers application timeouts, ultimately causing node fencing and subsequent restarts.
- Additionally, services that depend on network connectivity, such as `waagent`, generate communication-related error messages in the logs. This further indicates network-related disruptions.
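To check for these symptoms, you can search the system logs around the restart time. The following is a minimal sketch; the log locations are the RHEL defaults, and the search patterns are illustrative rather than exhaustive:

```bash
# Look for corosync membership and token problems around the restart time.
sudo grep -E "TOTEM|Token has not been received" /var/log/messages | tail -n 20

# Check the Azure Linux agent log for connectivity-related errors.
sudo grep -iE "error|timeout" /var/log/waagent.log | tail -n 20
```
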
The following messages are logged in the `/var/log/messages` log:

```output
Aug 21 01:47:31 node 02 corosync[15241]: [TOTEM ] Token has not been received
```

### Cause for scenario 1
An unexpected node restart occurs because of a network maintenance activity or an outage. For confirmation, you can match the timestamp against the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in the Azure portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).

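You can also query Azure Scheduled Events directly from the affected VM. This is a sketch that uses the documented Instance Metadata Service endpoint; an empty `Events` array means that no maintenance is currently scheduled:

```bash
# Query the Azure Instance Metadata Service for scheduled events (run on the VM).
curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | python3 -m json.tool
```
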
### Resolution for scenario 1
If the unexpected restart timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance affected the cluster.

For further assistance or other queries, you can open a support request by following [these instructions](#next-steps).
## Scenario 2: Cluster misconfiguration

The cluster nodes experience unexpected failovers or node restarts. These events are often caused by cluster misconfigurations that affect the stability of Pacemaker clusters.

To review the cluster configuration, run the following command:

```bash
sudo pcs config show
```

### Cause for scenario 2
Unexpected restarts in an Azure RHEL Pacemaker cluster often occur because of misconfigurations:

- Incorrect STONITH configuration:
- Wrong resource start/stop parameters: Incorrectly tuned start and stop parameters in the cluster configuration can cause nodes to restart during resource recovery.

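To spot these misconfigurations, you can compare the running values against the documented recommendations. A minimal sketch, assuming pcs 0.10 (RHEL 8); the grep pattern is illustrative:

```bash
# Show STONITH-related cluster properties (stonith-enabled, stonith-timeout, and so on).
sudo pcs property list --all | grep -i stonith

# Show the configured fencing devices and their parameters.
sudo pcs stonith config
```
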
### Resolution for scenario 2
- Follow the proper guidelines to set up a [RHEL Pacemaker cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
- To make the necessary changes to the cluster configuration, follow these steps:

  1. Stop the application on both nodes.
  2. Put the cluster into maintenance mode:

     ```bash
     sudo pcs property set maintenance-mode=true
     ```

  3. Edit the cluster configuration:

     ```bash
     sudo pcs cluster edit
     ```

> To mitigate such risks, it's recommended to disable security tools on systems running a Pacemaker cluster or ensure that appropriate exclusions are configured to prevent conflicts with the cluster and its associated applications.
## Scenario 3: Migration from on-premises to Azure
When you migrate a RHEL Pacemaker cluster from on-premises to Azure, unexpected restarts can occur because of specific misconfigurations or overlooked dependencies.

### Cause for scenario 3
The following are common mistakes in this category:
- Incomplete or incorrect STONITH configuration:

For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).

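To verify that the cluster communication ports are reachable between the nodes after migration, you can run a quick connectivity probe. A sketch, assuming `nmap-ncat` 7.25 or later is installed; replace `<peer-node>` with the other node's hostname or IP address:

```bash
# Test UDP reachability of the default corosync port (5405) on the peer node.
nc -z -v -u <peer-node> 5405

# Test TCP reachability of the default pcsd port (2224) on the peer node.
nc -z -v <peer-node> 2224
```
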
### Resolution for scenario 3
Follow the proper guidelines to set up a [RHEL Pacemaker cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.

## Scenario 4: Both cluster nodes are terminated after a failover event on RHEL 8

The Pacemaker cluster faces an outage and triggers a failover event. In a two-node cluster configuration, both nodes are terminated and stay offline until manual intervention.

The logs indicate that the STONITH device `python-user` triggers the shutdown instruction for both nodes.

### Cause for scenario 4

During an outage, such as a platform or network interruption as described in [Scenario 1](#scenario-1-network-outage), both nodes try to write to the STONITH device to fence each other because they lose the Totem token. Typically, the STONITH device follows the first available node's instruction to shut down the other node. If both nodes are allowed to write to the STONITH device, they might shut each other down.

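When SBD is the STONITH device, you can inspect the message slots to see which node wrote a fencing instruction. A sketch; replace the device path with your actual SBD disk:

```bash
# List each node's SBD slot and any pending fence message (reset, off, or clear).
sudo sbd -d /dev/disk/by-id/<your-sbd-device> list
```
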
### Resolution for scenario 4

It's recommended to use the `priority-fencing-delay` or `pcmk_delay_max` parameter so that the STONITH device acknowledges the fencing request from only one node.

1. Put the cluster into maintenance mode:

   ```bash
   sudo pcs property set maintenance-mode=true
   ```

2. Edit the cluster configuration:

   ```bash
   sudo pcs cluster edit
   ```

3. If the Pacemaker version is earlier than `2.0.4-6.el8`, add the `pcmk_delay_max` parameter. Because `pcmk_delay_max` is a fencing-device attribute rather than a cluster property, set it on the STONITH resource (replace `<stonith-resource-name>` with the name of your STONITH resource):

   ```bash
   sudo pcs stonith update <stonith-resource-name> pcmk_delay_max=15s
   ```

   If the version is `2.0.4-6.el8` or later, use the cluster property `priority-fencing-delay` instead:

   ```bash
   sudo pcs property set priority-fencing-delay=15s
   ```

4. Save the changes, and take the cluster out of maintenance mode:

   ```bash
   sudo pcs property set maintenance-mode=false
   ```

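To verify which delay mechanism is active after the change, you can query the cluster. A sketch; the property and device parameters match the steps above:

```bash
# Show the cluster-wide fencing delay property, if configured.
sudo pcs property show priority-fencing-delay

# Review the fencing device parameters, including pcmk_delay_max, if set.
sudo pcs stonith config
```
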
For more information, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).

## Scenario 5: `HANA_CALL` timeout after 60 seconds

The Azure RHEL Pacemaker cluster is running SAP HANA as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-06-04T09:25:38.724146+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: <>
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```

### Cause for scenario 5
The SAP HANA time-out messages are commonly considered internal application timeouts. Therefore, the SAP vendor should be engaged.
### Resolution for scenario 5

- To identify the root cause of the issue, review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if HANA is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
- After you rule out external factors, such as platform or network outages, we recommend that you contact the application vendor for trace call analysis and log review.
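
A minimal sketch of commands that are commonly used to check for memory pressure and storage latency; it assumes that the `sysstat` and `nfs-utils` packages are installed:

```bash
# Memory pressure: paging, swap, and free-memory trends over one minute.
vmstat 5 12

# Per-device I/O latency and utilization.
iostat -x 5 3

# NFS mount statistics, if the HANA file systems are hosted on NFS.
nfsiostat 5 3
```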
## Scenario 6: `ASCS/ERS` time-out in SAP NetWeaver clusters

The Azure RHEL Pacemaker cluster is running SAP NetWeaver ASCS/ERS as an application, and it experiences unexpected restarts on one of the nodes or both nodes in the Pacemaker cluster. The following messages are logged in the `/var/log/messages` log:

```output
2024-11-09T07:36:42.037589-05:00 node 01 SAPInstance(RSC_SAP_ERS10)[8689]: ERROR: SAP instance service enrepserver is not running with status GRAY !
2024-11-09T07:36:42.044583-05:00 node 01 pacemaker-controld[2596]:notice: Result of monitor operation for RSC_SAP_ERS10 on node01: not running
```
```output
2024-11-09T07:39:42.789404-05:00 node01 SAPInstance(RSC_SAP_ASCS00)[16393]: ERROR: SAP Instance CP2-ASCS00 start failed: #01109.11.2024 07:39:42#012WaitforStarted#012FAIL: process msg_server MessageServer not running
2024-11-09T07:39:42.796280-05:00 node01 pacemaker-execd[2404]:notice: RSC_SAP_ASCS00 start (call 78, PID 16393) exited with status 7 (execution time 23.488s)
2024-11-09T07:39:42.828845-05:00 node 01 pacemaker-schedulerd[2406]:warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]:warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
```
### Cause for scenario 6
The `ASCS/ERS` resource is considered to be the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.

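To review the monitor interval and time-out values that are configured for the resource, you can inspect its definition. A sketch; `RSC_SAP_ASCS00` is the resource name taken from the log excerpt above:

```bash
# Show the configured operations (monitor interval, timeouts) of the ASCS resource.
sudo pcs resource config RSC_SAP_ASCS00
```
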
### Resolution for scenario 6
- To identify the root cause of the issue, we recommend that you review the [OS performance](collect-performance-metrics-from-a-linux-system.md).
- Pay particular attention to memory pressure and to the storage devices and their configuration. This is especially true if SAP NetWeaver is hosted on Network File System (NFS), Azure NetApp Files (ANF), or Azure Files.
- After you rule out external factors, such as platform or network outages, we recommend that you engage the application vendor for trace call analysis and log review.
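
Before you engage the vendor, you can confirm the instance status from the OS by using `sapcontrol`. A sketch, assuming instance number `00`; run it as the `<sid>adm` user:

```bash
# List the SAP instance processes and their states (GREEN/YELLOW/GRAY/RED).
sapcontrol -nr 00 -function GetProcessList
```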