Commit ee891ad

Fix typos and improve clarity in documentation
Edit review per CI 4169
1 parent fa2086a commit ee891ad

1 file changed

Lines changed: 18 additions & 18 deletions

support/azure/virtual-machines/linux/troubleshoot-rhel-pacemaker-cluster-services-resources-startup-issues.md

@@ -1,5 +1,5 @@
---
- title: Troubleshoot RHEL pacemaker cluster services and resources startup issues in Azure
+ title: Troubleshoot RHEL Pacemaker Cluster Services and Resources Startup Issues in Azure
description: Provides troubleshooting guidance for issues related to cluster resources or services in RedHat Enterprise Linux (RHEL)) Pacemaker Cluster
ms.reviewer: rnirek,srsakthi
ms.author: rnirek
@@ -71,7 +71,7 @@ quorum {

### Resolution for scenario 1

- 1. Before you make any changes, ensure you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).
+ 1. Before you make any changes, make sure that you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).

2. Check for missing quorum section in `/etc/corosync/corosync.conf`. Compare the existing `corosync.conf` with any backup that's available in `/etc/corosync/`.

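As a quick reference for the comparison in step 2, the following sketch shows one way to print the quorum section and what a typical two-node configuration might look like; the exact values in your `corosync.conf` can differ, and the commented block is only an example, not taken from the article.

```bash
# Print the quorum section (if present) from the active configuration.
sudo grep -A 4 '^quorum' /etc/corosync/corosync.conf

# A typical two-node quorum section looks similar to this (example values):
# quorum {
#     provider: corosync_votequorum
#     two_node: 1
# }
```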
@@ -125,7 +125,7 @@ quorum {
}
```

- 5. Remove the cluster from maintenance-mode.
+ 5. Remove the cluster from maintenance mode.

```bash
sudo pcs property set maintenance-mode=false
@@ -149,7 +149,7 @@ quorum {

A virtual IP resource (`IPaddr2` resource) didn't start or stop in Pacemaker.

- The following error messages are logged in `/var/log/pacemaker.log`:
+ The following error entries are logged in `/var/log/pacemaker.log`:

```output
25167 IPaddr2(VIP)[16985]: 2024/09/07_15:44:19 ERROR: Unable to find nic or netmask.
@@ -208,7 +208,7 @@ vip_HN1_03_start_0 on node-1 'unknown error' (1): call=30, status=complete, exit

If a route that matches the `VIP` isn't in the default routing table, you can specify the `NIC` name in the Pacemaker resource so that it can be configured to bypass the check:

- 1. Before you make any changes, ensure you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).
+ 1. Before you make any changes, make sure that you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).

2. Put the cluster into maintenance mode:

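The resolution in this hunk hinges on whether a route covering the virtual IP exists and on setting the `nic` parameter of the `IPaddr2` resource. The following sketch illustrates both checks; the resource name `vip_HN1_03` comes from the failed operation shown in the hunk header, while the IP address and interface name are only examples.

```bash
# Check whether any route in the main table covers the virtual IP (example address).
ip route get 10.0.0.10

# While the cluster is in maintenance mode, pin the resource to a specific NIC
# so that the agent skips the route-based lookup. Adjust names to your environment.
sudo pcs resource update vip_HN1_03 nic=eth0
```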
@@ -334,7 +334,7 @@ The SAP HANA resource can't be started by Pacemaker if there are `SYN` failures
> [!Important]
> Steps 2, 3, and 4 must be performed by using a SAP administrator account. This is because these steps use a SAP System ID to stop, start, and re-enable replication manually.

- 1. Before you make any changes, ensure you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).
+ 1. Before you make any changes, make sure that you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).

2. Put the cluster into maintenance mode:

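Per the hunk header, this scenario concerns SAP HANA system replication (`SYN` failures) that the SAP administrator has to inspect and re-enable manually. A hedged sketch of such an inspection is shown below; the SID `HN1` and user `hn1adm` are examples inferred from the resource names elsewhere in the article.

```bash
# Run as the <sid>adm operating system user to inspect the current
# system replication state before stopping, starting, or re-registering it.
su - hn1adm -c "hdbnsutil -sr_state"
```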
@@ -512,7 +512,7 @@ This issue frequently occurs if the database is modified (manually stopped or st
> [!Note]
> Steps 1 through 5 should be performed by an SAP administrator.

- 1. Before you make any changes, ensure you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).
+ 1. Before you make any changes, make sure that you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).

2. Put the cluster into maintenance mode:

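This hunk belongs to the scenario in which the database was modified (manually stopped or started) outside cluster control. A common follow-up once the database state is consistent again, sketched here as an assumption rather than a step spelled out in the diff, is to clear the recorded failures so that Pacemaker retries the start; the resource name is an example only.

```bash
# Show recorded failures, then clear them for the affected resource (example name).
sudo pcs resource failcount show
sudo pcs resource cleanup SAPHana_HN1_03
```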
@@ -620,7 +620,7 @@ Because of incorrect `InstanceName` and `START_PROFILE` attributes, the SAP inst
> [!Note]
> This resolution is applicable if `InstanceName` and `START_PROFILE` are separate files.

- 1. Before you make any changes, ensure you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).
+ 1. Before you make any changes, make sure that you have a backup or snapshot. For more information, see [Azure VM backup](/azure/backup/backup-azure-vms-introduction).

2. Put the cluster into maintenance mode:

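Because this scenario is caused by incorrect `InstanceName` and `START_PROFILE` attributes, the fix ultimately comes down to correcting those parameters on the resource (assuming it uses the `SAPInstance` agent) while the cluster is in maintenance mode. A hedged sketch follows; the resource name, SID, instance number, and profile path are placeholders, not values from the article.

```bash
# Correct the instance name and start profile on the SAPInstance resource.
# Replace the resource name and values with the ones from your landscape.
sudo pcs resource update rsc_sap_NW1_ASCS00 \
    InstanceName=NW1_ASCS00_nw1-ascs \
    START_PROFILE=/sapmnt/NW1/profile/NW1_ASCS00_nw1-ascs
```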
@@ -659,15 +659,15 @@ Because of incorrect `InstanceName` and `START_PROFILE` attributes, the SAP inst
sudo pcs property set maintenance-mode=false
```

- ## Scenario 5: Fenced Node Fails to Rejoin Cluster
+ ## Scenario 5: Fenced node doesn't rejoin cluster

### Symptom for scenario 5

- Once the fencing operation is complete, the affected node typically doesn't rejoin the pacemaker cluster, and both the pacemaker and corosync services remain stopped unless they are manually started to resume the cluster back online.
+ After the fencing operation is finished, the affected node typically doesn't rejoin the Pacemaker cluster, and both the Pacemaker and Corosync services remain stopped unless they're manually started to bring the cluster back online.

### Cause for scenario 5

- After the node was fenced, rebooted, and restarted its cluster services, it subsequently received a message stating `We were allegedly just fenced`, which caused it to shut down its pacemaker and corosync services and prevented the cluster from starting. Node1 initiated a STONITH action against node2, and at `03:27:23`, when the network issue was resolved, node2 rejoined the corosync membership. Consequently, a new two-node membership was established, as shown in `/var/log/messages` for node1.
+ After the node is fenced and restarted, and its cluster services have started again, it receives a message that states `We were allegedly just fenced`. This causes it to shut down its Pacemaker and Corosync services and prevents the cluster from starting. Node1 initiates a STONITH action against node2. At `03:27:23`, when the network issue is resolved, node2 rejoins the Corosync membership. Consequently, a new two-node membership is established, as shown in `/var/log/messages` for node1:

```output
Feb 20 03:26:56 node1 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
@@ -682,13 +682,13 @@ Feb 20 03:27:25 node1 corosync[1722]: [QUORUM] Members[2]: 1 2
Feb 20 03:27:25 node1 corosync[1722]: [MAIN ] Completed service synchronization, ready to provide service.
```

- node1 received confirmation that node2 had been successfully rebooted as shown in `/var/log/messages` for node2.
+ Node1 received confirmation that node2 was successfully restarted, as shown in `/var/log/messages` for node2.

```output
Feb 20 03:27:46 node1 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
```

- To fully complete the STONITH action, the system needed to deliver the confirmation message to every node. Since node2 rejoined the group at `03:27:25` and no new membership excluding node2 had yet been formed due to the token and consensus timeouts not having expired, the confirmation message was delayed until node2 restarted its cluster services after boot. Upon receiving the message, node2 recognized that it had been fenced and consequently shut down its services as shown:
+ To fully complete the STONITH action, the system has to deliver the confirmation message to every node. Because node2 rejoined the group at `03:27:25` and the token and consensus timeouts haven't yet expired, no new membership that excludes node2 has formed, so the confirmation message is delayed until node2 restarts its cluster services after startup. Upon receiving the message, node2 recognizes that it has been fenced and, consequently, shuts down its services, as shown:

`/var/log/messages` in node1:
```output
@@ -713,9 +713,9 @@ Feb 20 03:29:09 node2 pacemaker-controld [1323] (tengine_stonith_notify) crit:

### Resolution for scenario 5

- Configure a startup delay for the corosync service. This pause provides sufficient time for a new CPG(Closed Process Group) membership to form and excluding the fenced node, so that the STONITH reboot process can complete by ensuring the completion message reaches all nodes in the membership.
+ Configure a startup delay for the Corosync service. This pause provides sufficient time for a new Closed Process Group (CPG) membership to form and exclude the fenced node so that the STONITH restart process can finish by making sure that the completion message reaches all nodes in the membership.

- To achieve this, execute the following commands:
+ To achieve this effect, run the following commands:

1. Put the cluster into maintenance mode:

@@ -724,7 +724,7 @@ To achieve this, execute the following commands:
```
2. Create a systemd drop-in file on all the nodes in the cluster:

- - Edit the corosync file:
+ - Edit the Corosync file:
```bash
sudo systemctl edit corosync.service
```
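The `systemctl edit` step above opens an editor for an override file. An equivalent non-interactive sketch is shown below; it writes the same override content that appears in the next hunk, and the directory and file name simply follow the standard systemd drop-in convention rather than anything stated in the diff.

```bash
# Create the drop-in directory and override file directly, then reload systemd.
sudo mkdir -p /etc/systemd/system/corosync.service.d
printf '[Service]\nExecStartPre=/bin/sleep 60\n' | \
    sudo tee /etc/systemd/system/corosync.service.d/override.conf
sudo systemctl daemon-reload
```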
@@ -733,11 +733,11 @@ To achieve this, execute the following commands:
[Service]
ExecStartPre=/bin/sleep 60
```
- - After saving and exiting the text editor, reload the systemd manager configuration with:
+ - After you save the file and exit the text editor, reload the systemd manager configuration:
```bash
sudo systemctl daemon-reload
```
- 3. Remove the cluster out of maintenance mode:
+ 3. Remove the cluster from maintenance mode:
```bash
sudo pcs property set maintenance-mode=false
```
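After the drop-in is in place on every node and the cluster is out of maintenance mode, a quick verification (a sketch, not part of the commit) could be:

```bash
# Confirm that the override is active for the corosync unit.
systemctl cat corosync.service

# Confirm that both cluster services come back and the node rejoins after the delay.
sudo pcs status
```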
