support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-rhel.md
This article provides guidance for troubleshooting, analysis, and resolution of unexpected node restarts in an Azure RHEL Pacemaker Cluster.

## Prerequisites

- Make sure that the Pacemaker Cluster setup is correctly configured by following the guidelines that are provided in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](/azure/sap/workloads/high-availability-guide-rhel-pacemaker).
- For a Microsoft Azure Pacemaker Cluster that uses the Azure Fence Agent as the STONITH (Shoot-The-Other-Node-In-The-Head) device, refer to the [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration) documentation.

- For a Microsoft Azure Pacemaker Cluster that uses SBD (STONITH Block Device) storage protection as the STONITH device, choose one of the following setup options (see the articles for detailed information):

  - [SBD with an iSCSI target server](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-iscsi-target-server)
  - [SBD with an Azure shared disk](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#sbd-with-an-azure-shared-disk)
## Scenario 1: Network outage

The cluster nodes are experiencing `corosync` communication errors. This causes continuous retransmissions because of an inability to establish communication between nodes. This issue triggers application time-outs and ultimately causes node fencing and subsequent restarts.

Services that are dependent on network connectivity, such as `waagent`, generate communication-related error messages in the logs. This further indicates network-related disruptions.

The following messages are logged in the `/var/log/messages` log:
```
Aug 21 01:47:27 node 02 corosync[15241]:  [KNET  ] host: host: 2 has no active links
Aug 21 01:47:31 node 02 corosync[15241]:  [TOTEM ] Token has not been received in 30000 ms
```
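The `Token has not been received in 30000 ms` message refers to the `corosync` totem token time-out. As a sketch of how `corosync` derives the runtime token time-out from the configured `token` and `token_coefficient` values (per the `corosync.conf(5)` man page; the 30000 ms value matches the Azure Pacemaker guidance), assuming this simplified model:

```python
def effective_token_timeout(token_ms: int, token_coefficient_ms: int, nodes: int) -> int:
    """Runtime totem token timeout: token + (nodes - 2) * token_coefficient,
    applied only for clusters with more than two nodes (per corosync.conf(5))."""
    extra_nodes = max(nodes - 2, 0)
    return token_ms + extra_nodes * token_coefficient_ms

# Azure guidance configures token=30000 ms; corosync's default token_coefficient is 650 ms.
print(effective_token_timeout(30000, 650, 2))  # two-node cluster: 30000
print(effective_token_timeout(30000, 650, 3))  # three-node cluster: 30650
```

A node that doesn't see the token within this window is declared lost, which is what leads to the fencing described in this scenario.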
### Cause of scenario 1

An unexpected node restart occurs because of a network maintenance activity or an outage. For verification, you can match the timestamp by reviewing the [Azure Maintenance Notification](/azure/virtual-machines/linux/maintenance-notifications) in the Azure portal. For more information about Azure Scheduled Events, see [Azure Metadata Service: Scheduled Events for Linux VMs](/azure/virtual-machines/linux/scheduled-events).
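In addition to the portal, the Azure Instance Metadata Service exposes upcoming maintenance through the Scheduled Events endpoint (`http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01`, queried with the `Metadata: true` header). The following is a minimal sketch of filtering such a payload for events that affect a given node; the sample payload is hypothetical but follows the documented schema:

```python
import json

# Hypothetical sample response that follows the documented Scheduled Events schema.
sample = json.loads("""
{
  "DocumentIncarnation": 2,
  "Events": [
    {"EventId": "A123", "EventType": "Reboot", "ResourceType": "VirtualMachine",
     "Resources": ["node01"], "EventStatus": "Scheduled",
     "NotBefore": "Mon, 19 Sep 2024 18:29:47 GMT"}
  ]
}
""")

def events_for(payload: dict, vm_name: str) -> list:
    """Return the scheduled events that list the given VM in their Resources."""
    return [e for e in payload.get("Events", []) if vm_name in e.get("Resources", [])]

for event in events_for(sample, "node01"):
    print(event["EventType"], event["NotBefore"])
```

If such an event exists around the restart timestamp, that strongly suggests platform maintenance rather than a cluster problem.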
### Resolution for scenario 1

If the unexpected restart timestamp aligns with a maintenance activity, the analysis confirms that either platform or network maintenance affected the cluster.

For further assistance or other inquiries, you can open a support request by following [these instructions](#next-steps).
## Scenario 2: Cluster misconfiguration

The cluster nodes experience unexpected failovers or node restarts. These are often caused by cluster misconfigurations that affect the stability of Pacemaker Clusters.
To review the cluster configuration, run the following command:

```bash
sudo pcs config show
```
### Cause of scenario 2

Unexpected restarts in an Azure RHEL Pacemaker Cluster often occur because of misconfigurations:
  Poorly set constraints can cause resources to be redistributed unnecessarily. This can cause node overload and restarts. Misaligned resource dependency configurations can cause nodes to fail or go into a restart loop.

- Cluster threshold and time-out misconfigurations:

  `failure-timeout`, `migration-threshold`, or monitor operation `timeout` values might cause nodes to be prematurely restarted.

- Heartbeat time-out settings: Incorrect `corosync` time-out settings for heartbeat intervals can cause nodes to assume that the other nodes are offline. This can trigger unnecessary restarts.

- Lack of proper health checks:

  Not setting correct health check intervals for critical services such as SAP HANA (High-performance ANalytic Appliance) can cause resource or node failures.

- Resource agent misconfiguration:

  - Custom resource agents misaligned with cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node restarts.
  - Wrong resource start/stop parameters: Incorrectly tuned start/stop parameters in cluster configuration might cause nodes to restart during resource recovery.
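As a rough sketch of how the threshold settings interact (simplified semantics, not Pacemaker's actual implementation): a resource is banned from a node once its fail count reaches `migration-threshold`, and `failure-timeout` controls when recorded failures expire:

```python
def is_banned(fail_count: int, migration_threshold: int,
              seconds_since_last_failure: float, failure_timeout: float) -> bool:
    """Simplified model: failures older than failure-timeout expire; the node is
    banned for the resource once the remaining fail count reaches the threshold."""
    if failure_timeout and seconds_since_last_failure >= failure_timeout:
        fail_count = 0  # recorded failures have expired
    return fail_count >= migration_threshold

print(is_banned(3, 3, 10, 600))   # True: threshold reached, failures still fresh
print(is_banned(3, 3, 900, 600))  # False: failures expired before the threshold mattered
```

Setting `failure-timeout` too low can therefore mask real problems, while setting `migration-threshold` too low causes resources to migrate (and nodes to churn) on transient errors.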
### Resolution for scenario 2

- Follow the proper guidelines to set up a [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
- Steps to make necessary changes to the cluster configuration:

1. Stop the application on both nodes.
2. Put the cluster into maintenance mode:
```bash
sudo pcs property set maintenance-mode=true
```

3. Edit the cluster configuration:

```bash
sudo pcs configure edit
```

4. Save the changes.
5. Remove the cluster from maintenance mode:
```bash
sudo pcs property set maintenance-mode=false
```

> [!IMPORTANT]
> When you troubleshoot unexpected node restarts or failures, it's crucial to assess the effect of security tools that are installed on the system. These tools might interfere with cluster operations by blocking essential processes or modifying system files. This could cause instability, unexpected time-outs, or node reboots.
>
> To mitigate such risks, we recommend that you disable security tools on systems that are running a Pacemaker Cluster, or make sure that appropriate exclusions are configured to prevent conflicts with the cluster and its associated applications.
## Scenario 3: Migration from on-premises to Azure

When you migrate an RHEL Pacemaker cluster from on-premises to Azure, unexpected restarts can occur because of specific misconfigurations or overlooked dependencies.

### Cause of scenario 3

The following are common mistakes that are made in this category:
- Incomplete or incorrect STONITH configuration:

  - No STONITH, or fencing misconfigured: Not configuring STONITH correctly can cause nodes to be marked as unhealthy and trigger unnecessary restarts.
  - Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents such as `fence_azure_arm` can cause nodes to restart unexpectedly during failovers.
  - Insufficient permissions: The Azure resource group or credentials that are used for fencing might lack required permissions and cause STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM (virtual machine) names, must be correctly configured in the fencing agent. Omissions here can cause fencing failures and unexpected restarts.

For more information, see [Troubleshoot Azure Fence Agent startup issues in RHEL](troubleshoot-azure-fence-agent-rhel.md) and [Troubleshoot SBD service failure in RHEL Pacemaker clusters](troubleshoot-sbd-issues-rhel.md).
- Performance and latency mismatches:

  - Inadequate VM sizing: Migrated workloads might not align with the selected Azure VM size. This causes excessive resource use and triggers restarts.
  - Disk I/O mismatches: On-premises workloads that have high IOPS (input/output operations per second) demands must be paired with the appropriate Azure disk or storage performance tier.

For more information, see [Collect performance metrics for a Linux VM](collect-performance-metrics-from-a-linux-system.md).

- Security and firewall rules:

  - Port blocking: On-premises clusters often have open internal communication. However, Azure NSGs (network security groups) or firewalls might block ports that are required for Pacemaker or Corosync communication.

For more information, see [Network security group test](/azure/virtual-machines/network-security-group-test).
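As a quick sanity check for the port-blocking issue, the following sketch compares the ports that an RHEL HA cluster commonly needs (a subset of what the firewalld `high-availability` service opens; verify this list against your own configuration) with a hypothetical set of allowed NSG rules:

```python
# Ports commonly required by an RHEL HA cluster (subset of the firewalld
# "high-availability" service); confirm against your environment.
REQUIRED = {
    ("tcp", 2224),   # pcsd
    ("tcp", 3121),   # pacemaker_remote
    ("udp", 5404),   # corosync
    ("udp", 5405),   # corosync
    ("tcp", 21064),  # dlm
}

def blocked_ports(allowed: set) -> set:
    """Return the required (protocol, port) pairs that the allow rules don't cover."""
    return REQUIRED - allowed

# Hypothetical NSG allow list that forgot the corosync UDP ports:
allowed = {("tcp", 2224), ("tcp", 3121), ("tcp", 21064)}
print(sorted(blocked_ports(allowed)))  # [('udp', 5404), ('udp', 5405)]
```

Any pair that this check reports must be opened in the NSG or firewall before the cluster nodes can communicate reliably.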
### Resolution for scenario 3

Follow the proper guidelines to set up an [RHEL Pacemaker Cluster](#prerequisites). Additionally, make sure that appropriate resources are allocated for applications such as [SAP HANA](/azure/sap/workloads/sap-hana-high-availability-rhel) or [SAP NetWeaver](/azure/sap/workloads/high-availability-guide-rhel), as specified in the Microsoft documentation.
## Scenario 4: Both cluster nodes are terminated after a failover event on RHEL 8

The Pacemaker Cluster experiences an outage, and it proceeds to trigger a failover event. In a two-node cluster configuration, both nodes are terminated and stay offline until manual intervention can occur.

The logs indicate that the STONITH device, `python-user`, triggers the shutdown instruction for both nodes.
### Cause of scenario 4

During an outage, such as a platform or network interruption of the kind that's discussed in [Scenario 1](#scenario-1-network-outage), both nodes try to write to the STONITH device to fence the other because they lose the totem token. Typically, the STONITH device takes instruction from the first available node to write on it in order to shut down the other node. If both nodes are allowed to write to the STONITH device, they might terminate each other.
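The effect of a fencing delay can be sketched as follows: without a delay, both nodes issue the fence request at the same time and both are shut down; with a delay on one side (which is what `priority-fencing-delay` or a random `pcmk_delay_max` provides), the undelayed node fences first and survives. This is a simplified model, not the actual fencing code:

```python
def surviving_nodes(fence_delays: dict) -> set:
    """Simplified fencing race: every node requests fencing of its peer after its
    own delay; the strictly earliest request wins and its sender survives.
    Equal delays mean both requests land, so both nodes are terminated."""
    earliest = min(fence_delays.values())
    winners = {node for node, delay in fence_delays.items() if delay == earliest}
    if len(winners) == len(fence_delays):
        return set()  # simultaneous requests: double fencing, no survivor
    return winners

print(surviving_nodes({"node01": 0, "node02": 0}))   # set(): both nodes terminated
print(surviving_nodes({"node01": 0, "node02": 15}))  # {'node01'}: delay breaks the tie
```

This is why the resolution below introduces an asymmetric delay: it guarantees that at most one node's fence request is acted on first.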
### Resolution for scenario 4

We recommend that you use the `priority-fencing-delay` or `pcmk_delay_max` parameter so that only one VM is acknowledged by the STONITH device:

1. Set the cluster to maintenance mode:
```bash
sudo pcs property set maintenance-mode=true
```

2. Edit the cluster configuration:
```bash
sudo pcs configure edit
```

3. If the Pacemaker version is earlier than `2.0.4-6.el8`, add the `pcmk_delay_max` parameter:
```bash
sudo pcs stonith update <stonith-device-name> pcmk_delay_max=15s
```

Note: `pcmk_delay_max` is an attribute of the STONITH resource itself, not a cluster-wide property. Replace `<stonith-device-name>` with the name of your STONITH resource (as shown by `sudo pcs stonith config`).

If the Pacemaker version is later than `2.0.4-6.el8`, use the `priority-fencing-delay` parameter instead:
```bash
sudo pcs property set priority-fencing-delay=15s
```

4. Save the changes, and remove the cluster from maintenance mode:

```bash
sudo pcs property set maintenance-mode=false
```

For more information, see [RHEL - Create Azure Fence agent STONITH device](/azure/sap/workloads/high-availability-guide-rhel-pacemaker#azure-fence-agent-configuration).
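To determine which branch of step 3 applies, compare the installed Pacemaker version (for example, from `rpm -q pacemaker`) against `2.0.4-6.el8`. The following is a minimal sketch that handles simple version-release strings only; real RPM comparison has additional rules:

```python
import re

def version_key(version_release: str):
    """Split a simple 'version-release' string such as '2.0.4-6.el8' into a
    tuple of integers for comparison. Non-numeric fragments are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version_release))

# Hypothetical output of: rpm -q --qf '%{VERSION}-%{RELEASE}' pacemaker
installed = "2.0.5-9.el8"
threshold = "2.0.4-6.el8"
print("use priority-fencing-delay" if version_key(installed) > version_key(threshold)
      else "use pcmk_delay_max")
```

For production checks, prefer `rpmdev-vercmp` or the `rpm` Python bindings, which implement the full RPM comparison algorithm.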
## Scenario 5: `HANA_CALL` time-out after 60 seconds

The Azure RHEL Pacemaker Cluster is running SAP HANA as an application, and it experiences unexpected restarts on one or both nodes in the Pacemaker Cluster. Per the `/var/log/messages` or `/var/log/pacemaker.log` log entries, the node restart is caused by a `HANA_CALL` time-out, as follows:
```
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP <00000000 0a |.|#01200000001>
```
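When you triage this pattern, it helps to extract all `SAPHana` resource agent errors, together with their timestamps, before you engage SAP. The following is a sketch that uses a sample log file; the file path and second log line are illustrative:

```shell
# Create a small sample in the format shown above (illustrative contents).
cat <<'EOF' > /tmp/messages.sample
2024-06-04T09:25:38.736748+00:00 node01 SAPHana(rsc_SAPHana_H00_HDB02)[99475]: ERROR: ACT: check_for_primary: we didn't expect node_status to be: DUMP
2024-06-04T09:26:01.000000+00:00 node01 pacemaker-controld[2401]: notice: State transition
EOF

# Extract only the SAPHana resource agent errors, keeping the timestamp column.
grep -E 'SAPHana\(.*\).*ERROR' /tmp/messages.sample
```

Run the same `grep` against `/var/log/messages` on the affected node to build a timeline that you can share with the SAP vendor.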
### Cause of scenario 5

The SAP HANA time-out messages are commonly considered to be internal application time-outs. Therefore, the SAP vendor should be engaged.
### Resolution for scenario 5

## Scenario 6: SAP NetWeaver `ASCS/ERS` resource failover

The Azure RHEL Pacemaker Cluster is running SAP NetWeaver ASCS/ERS as an application. The following messages are logged:

```
2024-11-09T07:39:42.828955-05:00 node 01 pacemaker-schedulerd[2406]: warning: Unexpected result (not running) was recorded for start of RSC_SAP_ASCS00 on node01 at Nov 9 07:39:42 2024
```
### Cause of scenario 6

The `ASCS/ERS` resource is considered to be the application for SAP NetWeaver clusters. When the corresponding cluster monitoring resource times out, it triggers a failover process.