File: support/azure/service-fabric/cluster/troubleshoot-service-fabric-repair-jobs.md
Lines changed: 39 additions & 68 deletions
@@ -47,112 +47,83 @@ The resulting entity is the *repair task*, which is used within the Service Fabr
47
47
48
48
In the Created state, the Repair Manager accepts and stores the repair request. The task then waits for a Repair Executor to claim it. The requestor can cancel the task during this stage without any restrictions. The Repair Manager has ownership in this state.
49
49
50
-
* Claimed
50
+
### Claimed
51
51
52
-
Once the task is Claimed, the Repair Executor (RE) takes ownership but doesn't specify the repair's impact. The requestor still retains the ability to cancel the task at this stage. The repair executor has ownership in this state.
52
+
Once the task is Claimed, the Repair Executor takes ownership but doesn't specify the repair's impact. The requestor still retains the ability to cancel the task at this stage. The Repair Executor has ownership in this state.
53
53
54
-
* Preparing
54
+
### Preparing
55
55
56
-
In the Preparing state, the Repair Executor specifies the impact, and the Repair Manager prepares the environment, such as deactivating nodes. If the task is cancelled now, it skips execution and moves directly to restoring. Operator also has the option to force approval, bypassing certain safety checks. Repair Manager has ownership in this state.
56
+
In the Preparing state, the Repair Executor specifies the impact and the Repair Manager prepares the environment (for example, by deactivating nodes). If the task is cancelled now, it skips the Executing state and moves directly to the Restoring state. The Operator also has the option to force approval and bypass certain safety checks. The Repair Manager has ownership in this state.
57
57
58
-
* Approved
58
+
### Approved
59
59
60
-
When the task reaches Approved, the Repair Manager has completed all preparations and approved execution. The Repair Executor should move the task to Executing before starting the repair. Cancellation at this point requires cooperation from the executor, which should support cancellation if possible. Repair Executor has ownership in this state.
60
+
When the task reaches the Approved state, the Repair Manager has completed all preparations and approved execution. The Repair Executor moves the task to the Executing state before starting the repair. Cancellation at this point requires cooperation from the Repair Executor, which has ownership in this state.
61
61
62
-
* Executing
62
+
### Executing
63
63
64
-
During Executing, the Repair Executor is actively performing the repair. The executor must finish all disruptive actions before reporting completion. Cancellation now requires executor cooperation and should only be acknowledged when it is safe to do so. Repair executor has ownership in this state.
64
+
During the Executing state, the Repair Executor performs the repair. The Repair Executor must finish all potentially disruptive actions before reporting completion. Cancellation now requires Repair Executor cooperation and should only be acknowledged when it's safe to do so. The Repair Executor has ownership in this state.
65
65
66
-
* Restoring
66
+
### Restoring
67
67
68
-
Once the repair is complete, the task enters Restoring, where the Repair Manager restores the environment, such as reactivating nodes. At this stage, the task cannot be cancelled. Repair Manager has ownership in this state.
68
+
Once the repair is complete, the task enters the Restoring state, where the Repair Manager restores the environment (like reactivating nodes). At this stage, the task can't be cancelled. The Repair Manager has ownership in this state.
69
69
70
-
* Completed
70
+
### Completed
71
71
72
-
Finally, in the Completed state, the task is finished, and no further state changes occur. The final status may be Succeeded, Cancelled, interrupted (with details), or Failed (with details).
72
+
Finally, in the Completed state, the task is finished and no further state changes occur. The final status is one of the following states: Succeeded, Cancelled, Interrupted (with details), or Failed (with details).
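The lifecycle described above can be summarized as a small lookup table. The following Python sketch is illustrative only: the names `Owner`, `Cancellation`, `LIFECYCLE`, `describe`, and `next_state` are hypothetical and aren't part of any Service Fabric API. It only encodes the owner and cancellation rule that the text gives for each state, and it assumes the normal (non-cancelled) path through the states.

```python
from enum import Enum


class Owner(Enum):
    REPAIR_MANAGER = "Repair Manager"
    REPAIR_EXECUTOR = "Repair Executor"


class Cancellation(Enum):
    REQUESTOR_CAN_CANCEL = "requestor can cancel"
    SKIPS_TO_RESTORING = "cancel skips execution and moves to Restoring"
    NEEDS_EXECUTOR = "cancel requires Repair Executor cooperation"
    NOT_ALLOWED = "can't be cancelled"


# (state, owner, cancellation rule), in the order the text walks through them.
# Completed has no owner here because the text doesn't name one; it's marked
# NOT_ALLOWED because no further state changes occur (an interpretation).
LIFECYCLE = [
    ("Created", Owner.REPAIR_MANAGER, Cancellation.REQUESTOR_CAN_CANCEL),
    ("Claimed", Owner.REPAIR_EXECUTOR, Cancellation.REQUESTOR_CAN_CANCEL),
    ("Preparing", Owner.REPAIR_MANAGER, Cancellation.SKIPS_TO_RESTORING),
    ("Approved", Owner.REPAIR_EXECUTOR, Cancellation.NEEDS_EXECUTOR),
    ("Executing", Owner.REPAIR_EXECUTOR, Cancellation.NEEDS_EXECUTOR),
    ("Restoring", Owner.REPAIR_MANAGER, Cancellation.NOT_ALLOWED),
    ("Completed", None, Cancellation.NOT_ALLOWED),
]


def describe(state):
    """Return a one-line summary of who owns a state and how cancel behaves."""
    for name, owner, rule in LIFECYCLE:
        if name == state:
            who = owner.value if owner else "no owner named"
            return f"{name}: owned by {who}; {rule.value}"
    raise ValueError(f"unknown state: {state}")


def next_state(state):
    """Return the next state on the normal path, or None after Completed."""
    names = [name for name, _, _ in LIFECYCLE]
    index = names.index(state)
    return names[index + 1] if index + 1 < len(names) else None
```

For example, `describe("Preparing")` reports that the Repair Manager owns the task and that a cancellation moves it to Restoring, and `next_state("Restoring")` returns `"Completed"`.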
73
73
74
-
## Analysis from Service Fabric Explorer For stuck repair task
74
+
## Using Service Fabric Explorer for troubleshooting a stuck repair task
75
75
76
76
### Infrastructure Jobs view
77
77
78
-
To view jobs that have been submitted to Service Fabric for approval, navigate to the **Infrastructure Jobs** tab under cluster view. Each entry includes a **Job ID**, which remains consistent across Service Fabric and outside service fabric. The **Acknowledgement Status** indicates whether the job has been approved by Service Fabric:
78
+
To view jobs that Service Fabric receives for approval, go to the **Infrastructure Jobs** tab in the cluster view. Each entry includes a **Job ID**, which stays the same inside and outside of Service Fabric. The **Acknowledgement Status** shows whether Service Fabric has approved the job, using one of the following states:
79
79
80
-
- **WaitingForAcknowledgement** means the job is still pending approval.
81
-
- **Acknowledged** confirms that the job has been approved by Service Fabric.
80
+
- **WaitingForAcknowledgement** - The job is still waiting for approval.
81
+
- **Acknowledged** - Service Fabric has approved the job.
82
82
83
-
This view represents perspective of the job. Jobs will only appear here when they are present in the received document. In addition to the **Job ID** and **Acknowledgement Status**, the **Impact Types** section displays the nature of the job’s impact. The **Current Repair Task** section shows which repair task is actively running for the job approval on the Service Fabric side. By selecting **All Repair Tasks**, you can view the status of every repair task associated with the current job.
83
+
Jobs only appear here when they're present in the received document. In addition to the **Job ID** and **Acknowledgement Status**, the **Impact Types** section displays the nature of the job’s impact. The **Current Repair Task** section shows which repair task is actively running for job approval on the Service Fabric side. By selecting **All Repair Tasks**, you can view the status of every repair task associated with the current job.
To view individual and all repair tasks associated with a cluster, go to the Repair Jobs tab. This section displays both pending repair tasks and those that have been completed or cancelled. For pending tasks, you can see their current state. A repair task is not yet approved by Service Fabric if its state is Created, Claimed, or Preparing. Once a repair task transitions to the Approved state, it is considered approved by Service Fabric, and the approval is then forwarded to the Repair Executor for the corresponding job.
89
+
To view individual and all repair tasks associated with a cluster, go to the **Repair Jobs** tab. This tab displays pending, completed, and cancelled repair tasks. You can also see the current state of any pending task.
92
90
91
+
If a repair task state is Created, Claimed, or Preparing, it's not yet approved by Service Fabric. Once a repair task transitions to the Approved state, it's considered approved and is then forwarded to the Repair Executor for the corresponding job.
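As a minimal sketch of that rule, the following Python snippet filters a list of tasks down to the ones still waiting for Service Fabric approval. The task dictionaries, their `id` and `state` keys, and the function name `awaiting_approval` are hypothetical stand-ins for whatever you copy out of the **Repair Jobs** tab; they don't reflect a real Service Fabric data format.

```python
# States in which, per the text, Service Fabric hasn't approved the task yet.
NOT_YET_APPROVED = {"Created", "Claimed", "Preparing"}


def awaiting_approval(tasks):
    """Return the IDs of tasks whose state is Created, Claimed, or Preparing."""
    return [task["id"] for task in tasks if task["state"] in NOT_YET_APPROVED]


# Hypothetical sample data for illustration only.
sample_tasks = [
    {"id": "task-a", "state": "Preparing"},
    {"id": "task-b", "state": "Executing"},
    {"id": "task-c", "state": "Created"},
]
pending = awaiting_approval(sample_tasks)
```

With the sample data above, `pending` contains `task-a` and `task-c`; `task-b` is already past approval.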
If a repair task is stuck in the Preparing state, there are two possible reasons: It could be stuck in either a Health Check or a Safety Check. Unhealthy entity in the cluster, including customer applications as well as system applications can cause the health check to not be green. To determine if it's stuck in a Health Check, first verify whether Preparing or Restoring Health Check is enabled—based on the state where the task is stuck. In the Repair Task view, expanding the task will show the Health Check status, indicating whether it is enabled.
95
+
If a repair task gets stuck in the Preparing state, it's stuck in either a health check or a safety check. An unhealthy entity in the cluster (including customer applications as well as system applications) can cause the health check to fail. To determine if the task is stuck in a health check, first verify whether **Preparing** or **Restoring Health Check** is enabled, based on the state where the task is stuck. In the **Repair Task** view, expanding the task shows the health check status, indicating if it's enabled.
99
96
97
+
:::image type="content" source="media/troubleshoot-service-fabric-repair-jobs/cluster-health-check.png" alt-text="Cluster Health Check view." lightbox="media/troubleshoot-service-fabric-repair-jobs/cluster-health-check.png":::
100
98
101
-
<center>
102
-
![Health check view][Image3]
103
-
</center>
99
+
If enabled, **Repair Task History** shows that the health check started but didn't complete, confirming that the task is stuck in the Health Check phase.
104
100
105
-
If enabled, the Repair Task History will show that the Health Check started but did not complete, confirming that the task is stuck in the Health Check phase—as illustrated in the screenshot above.
101
+
### Safety Check view
106
102
107
-
### Safety check view
103
+
A repair task can get stuck in the Safety Check phase only if it has an impact on any node. This can be verified by checking the **Impact** section in the **Repair Task** view. If a node impact is present, you can identify which Safety Check is causing the delay by inspecting each impacted node individually. Select the node from the **Node List**. In the **Safety Check** section, you’ll see the specific check where the task is stuck. The **Repair Task ID** is also displayed here, indicating which repair task is responsible for the node deactivation and safety check.
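The triage steps for a task stuck in the Preparing state can be sketched as a short decision function. This is a hedged illustration, not a diagnostic tool: the function name `possible_blockers` and its parameters are hypothetical, and the inputs are values you'd read manually from the **Repair Task** view and **Repair Task History** in Service Fabric Explorer. Node impact is a necessary condition for a safety check block, not proof of one, so the result lists candidates to inspect.

```python
def possible_blockers(health_check_enabled, health_check_started,
                      health_check_completed, impacted_nodes):
    """List the phases that could be holding up a task stuck in Preparing.

    All arguments are hypothetical inputs observed by hand in Service
    Fabric Explorer; impacted_nodes is a list of node names.
    """
    blockers = []
    # Health check: enabled, started, but never completed.
    if health_check_enabled and health_check_started and not health_check_completed:
        blockers.append("health check")
    # Safety check: only possible when the task impacts at least one node.
    if impacted_nodes:
        blockers.append("safety check (inspect: " + ", ".join(impacted_nodes) + ")")
    return blockers
```

For instance, a task with the health check disabled and an impact on a single node yields only the safety check candidate, which points you to that node's **Safety Check** section.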
108
104
109
-
A repair task can get stuck in the Safety Check phase only if it has an impact on any node. This can be confirmed by checking the Impact section in the Repair Task view, as shown in the previous screenshot. If node impact is present, you can identify which Safety Check is causing the delay by inspecting each impacted node individually. Click on the node from the Node List, and in the Safety Check section, you’ll see the specific check where the task is stuck. The Repair Task ID is also displayed here, indicating which repair task is responsible for the node deactivation and safety check—this is illustrated in the screenshot below. In the screenshot below, the repair task is stuck in the EnsureSeedNodeQuorum safety check.
105
+
For example, in the following screenshot, the repair task is stuck in the **EnsureSeedNodeQuorum** safety check.
If there are no errors in **Infrastructure Service** related to a repair task and the task has entered the Executing state, it means the job’s acknowledgment status is Acknowledged for Impact Start. Similarly, if the repair task transitions to the Completed state, it indicates that the job’s acknowledgment status is Acknowledged for Impact End.
115
110
116
-
If there are no errors in Infrastructure Service (IS) related to a repairtask and the task has entered the Executing state, it means the job’s acknowledgment status is 'Acknowledged' for Impact Start. Similarly, if the repairtask transitions to the Completed state, it indicates that the job’s acknowledgment status is 'Acknowledged' for Impact End.
All completed or cancelled repair tasks for the cluster can be viewed by selecting **Completed Repair Tasks**. This provides a comprehensive list of repair tasks that have either successfully finished or were terminated.
121
114
122
-
All completed or cancelled repair tasks for the cluster can be viewed by clicking on the Completed Repair Tasks section. This provides a comprehensive list of repair tasks that have either successfully finished or were terminated.
### Infrastructure Service and Repair Manager Service health check
124
118
125
-
<center>
126
-
![Completed repair task view][Image6]
127
-
</center>
119
+
To check the health of the Infrastructure Service or Repair Manager Service, select the service from the list and select **Health Evaluation**. This view shows whether the service is healthy, in a Warning state, or in an Error state, along with further details.
128
120
129
-
### Infrastructure service and repair Manager Service Health check
121
+
:::image type="content" source="media/troubleshoot-service-fabric-repair-jobs/cluster-infrastructure-service-health.png" alt-text="Infrastructure Service Health view." lightbox="media/troubleshoot-service-fabric-repair-jobs/cluster-infrastructure-service-health.png":::
130
122
131
-
To check the health of the Infrastructure Service or Repair Manager Service, select the service from the list and open the Health Evaluation tab. This tab shows whether the service is healthy, in a warning state, or in an error state, along with details of any warnings or errors.
132
-
133
-
<center>
134
-
![Infrastructure service health view][Image7]
135
-
</center>
136
-
137
-
<center>
138
-
![RepairManager service health view][Image8]
139
-
</center>
123
+
:::image type="content" source="media/troubleshoot-service-fabric-repair-jobs/cluster-repairmanager-service-health.png" alt-text="Repair Manager Service Health view." lightbox="media/troubleshoot-service-fabric-repair-jobs/cluster-repairmanager-service-health.png":::
140
124
141
125
### Job throttling status for Infrastructure Service
142
126
143
-
To check if any job is being throttled for a specific Infrastructure Service, select the service and open the Health Evaluation tab and select All. Look for health events related to job throttling. If a job is throttled, the tab will display the job ID along with the reason for throttling.
144
-
145
-
<center>
146
-
![Job throttling Is view][Image9]
147
-
</center>
148
-
127
+
To check if any job is being throttled for a specific Infrastructure Service, select the service > **Health Evaluation** > **All**. Look for health events related to job throttling. If a job is throttled, the job ID along with the reason for throttling is displayed.