`support/azure/virtual-machines/linux/troubleshoot-unexpected-node-reboots-pacemaker-rhel.md`
---
title: Troubleshoot Unexpected Node Restarts in Azure Linux RHEL Pacemaker Cluster
description: This article provides troubleshooting steps for resolving unexpected node restarts in RHEL Linux Pacemaker Clusters
author: rnirek
ms.author: rnirek
## Scenario 1: Network outage

- The cluster nodes are experiencing `corosync` communication errors. The resulting inability to establish communication between nodes causes continuous retransmissions, which trigger application time-outs and ultimately cause node fencing and subsequent restarts.
- Services that are dependent on network connectivity, such as `waagent`, generate communication-related error entries in the logs. These entries further indicate network-related disruptions.

The following messages are logged in the `/var/log/messages` log:
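To locate such entries quickly, a filter like the following can help. The sample log lines below are illustrative stand-ins, not output from a real cluster:

```shell
# Write an illustrative sample of /var/log/messages (placeholder content):
cat > /tmp/messages.sample <<'EOF'
May  1 10:00:01 node1 corosync[1234]: [TOTEM ] Retransmit List: 2a 2b
May  1 10:00:02 node1 corosync[1234]: [TOTEM ] Token has not been received in 2000 ms
May  1 10:00:03 node1 systemd[1]: Started user session.
EOF

# Surface corosync retransmission and token-loss messages.
# On a live node, point the same filter at /var/log/messages instead.
grep -E 'corosync.*(Retransmit|Token)' /tmp/messages.sample
```

If this filter returns bursts of retransmit or token-loss messages around the restart time, a network outage is the likely trigger.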
For further assistance or other inquiries, you can open a support request by following the steps in the Next steps section.

## Scenario 2: Cluster misconfiguration

The cluster nodes experience unexpected failovers or node restarts. These issues often occur because of cluster misconfigurations that affect the stability of Pacemaker Clusters.

To review the cluster configuration, run the following command:

```bash
sudo pcs config show
```
Unexpected restarts in an Azure RHEL Pacemaker cluster often occur because of misconfigurations:

- Incorrect STONITH configuration:
  - No STONITH or misconfigured fencing: Not configuring STONITH correctly could cause nodes to be marked as unhealthy and trigger unnecessary restarts.
  - Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents, such as `fence_azure_arm`, could cause nodes to restart unexpectedly during failovers.
  - Insufficient permissions: The Azure resource group or credentials that are used for fencing might lack required permissions and cause STONITH failures.
- Missing or incorrect resource constraints:

  Poorly set constraints could cause resources to be redistributed unnecessarily. This situation, in turn, could cause node overload and restarts. Misaligned resource dependency configurations could cause nodes to fail or go into a restart loop.
- Cluster threshold and time-out misconfigurations:
  - Incorrect `failure-timeout`, `migration-threshold`, or monitor `timeout` values might cause nodes to be prematurely restarted.
  - Heartbeat time-out settings: Incorrect `corosync` time-out settings for heartbeat intervals could cause nodes to assume that the other nodes are offline. This situation can trigger unnecessary restarts.
- Lack of proper health checks:

  Not setting correct health check intervals for critical services such as SAP HANA (High-performance ANalytic Appliance) could cause resource or node failures.
- Resource agent misconfiguration:
  - Custom resource agents misaligned with the cluster: Resource agents that don't adhere to Pacemaker standards can create unpredictable behavior, including node restarts.
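Several of the settings above can be spot-checked from any cluster node. The following sketch assumes the `pcs` CLI that RHEL Pacemaker clusters use and requires root privileges, so it's shown here for orientation rather than as a definitive procedure:

```shell
# Is fencing configured, and are the STONITH resources started?
sudo pcs stonith status

# Cluster-wide properties, including stonith-enabled:
sudo pcs property

# Location, colocation, and ordering constraints:
sudo pcs constraint

# Per-resource options, including operation timeouts and migration-threshold:
sudo pcs resource config
```

Compare the reported values against the thresholds and time-outs that your application vendor recommends before you change anything.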
> [!IMPORTANT]
> When you troubleshoot unexpected node restarts or failures, it's crucial to assess the effect of security tools that are installed on the system. These tools might interfere with cluster operations by blocking essential processes or modifying system files. This situation could cause instability, unexpected time-outs, or node restarts.
>
> To mitigate such risks, we recommend that you disable security tools on systems that are running a Pacemaker Cluster. Alternatively, you can configure appropriate exclusions to prevent conflicts with the cluster and its associated applications.
## Scenario 3: Migration from on-premises to Azure
When you migrate a RHEL Pacemaker cluster from on-premises to Azure, unexpected node restarts can occur. The following are common mistakes that are made in this category:

- Incomplete or incorrect STONITH configuration:
  - No STONITH or misconfigured fencing: Not configuring STONITH correctly could cause nodes to be marked as unhealthy and trigger unnecessary restarts.
  - Wrong STONITH resource settings: Incorrect parameters for Azure fencing agents such as `fence_azure_arm` could cause nodes to restart unexpectedly during failovers.
  - Insufficient permissions: The Azure resource group or credentials that are used for fencing might lack required permissions and cause STONITH failures. Key Azure-specific parameters, such as subscription ID, resource group, or VM (Virtual Machine) names, must be correctly configured in the fencing agent. Omissions here could cause fencing failures and unexpected restarts.

For more information, see [Troubleshoot Azure Fence Agent startup issues in RHEL](troubleshoot-azure-fence-agent-rhel.md) and [Troubleshoot SBD service failure in RHEL Pacemaker clusters](troubleshoot-sbd-issues-rhel.md).
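One way to verify that the fencing agent can reach Azure and resolve the cluster VMs is to ask it to list them. The following sketch assumes managed-identity (MSI) authentication and uses placeholder resource-group and subscription values that you must replace with your own:

```shell
# Hypothetical verification run; substitute your own resource group and
# subscription ID. With a service principal, supply --username, --password,
# and --tenantId instead of --msi.
sudo fence_azure_arm --msi \
    --resourceGroup=myResourceGroup \
    --subscriptionId=00000000-0000-0000-0000-000000000000 \
    --action=list
```

If the configuration and permissions are correct, the agent lists the VMs that it can fence; an authentication or permission error here points to the misconfigurations described above.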
The logs indicate that the STONITH device, `python-user`, triggers the shutdown of the node.

### Cause of scenario 4

During an outage, such as a platform or network interruption of the kind that's discussed in [Scenario 1](#scenario-1-network-outage), both nodes try to write to the STONITH device to fence the other because they lose the totem token. Typically, the STONITH device takes instruction from the first node that writes to it and shuts down the other node. If both nodes are allowed to write to the STONITH device, they might terminate each other.

### Resolution for scenario 4
The `ASCS/ERS` resource is considered to be the application for SAP Netweaver clusters.

- After you rule out external factors, such as platform or network outages, we recommend that you engage the application vendor for trace call analysis and log review.

## Next steps

For more help, open a support request, and submit your request by attaching [sosreport](https://access.redhat.com/solutions/3592) logs for troubleshooting.