|
| 1 | +--- |
| 2 | +title: Troubleshooting guide for customers to investigate and analyze, using Service Fabric Explorer (SFX), why repair jobs are not being approved |
| 3 | +description: Learn how to analyze stuck repair jobs using Service Fabric Explorer. |
| 4 | +ms.topic: concept-article |
| 5 | +ms.author: ashukumar |
| 6 | +author: ashukumar |
| 7 | +ms.service: azure-service-fabric |
| 8 | +services: service-fabric |
| 9 | +ms.date: 01/20/2026 |
| 10 | +# Customer intent: As a Service Fabric customer, I want to analyze the reason why a repair job is stuck using Service Fabric Explorer. |
| 11 | +--- |
| 12 | + |
| 13 | +# Troubleshooting guide for customers to investigate and analyze, using Service Fabric Explorer (SFX), why repair jobs are not being approved |
| 14 | + |
| 15 | +## Repair task overview in Service Fabric |
| 16 | +Any operation initiated from the Virtual Machine Scale Set (VMSS) that targets VMs is processed by Service Fabric (SF) as a repair task derived from the job it receives. The Infrastructure Service creates a repair task for each job and enriches it with details such as the update type, targeted update domain (UD), and document incarnation number. These jobs begin with UD0 and progress sequentially through UD1, UD2, and so on within the Service Fabric cluster. If an Update Domain walk is required, separate repair tasks are generated for each UD. For example, in a cluster with five UDs, five distinct repair tasks will be created. These tasks execute one after another, UD by UD, and their progress can be tracked in Service Fabric Explorer (SFX). |
| 17 | + |
| 18 | +RepairManager – Repair Manager (RM) defines and implements a safe workflow for performing repairs. It coordinates the Repair Requestor and the Repair Executor so that repair actions are applied safely and consistently. |
| 19 | + |
| 20 | +Infrastructure Service – Infrastructure Service (IS) is responsible for managing and orchestrating infrastructure-level operations, such as updates and repairs, ensuring the health and stability of the Service Fabric cluster. |
| 21 | + |
| 22 | +### Repair Task vs. Repair Job |
| 23 | + |
| 24 | +A repair job refers to an Azure‑initiated maintenance operation that provides essential details such as the job ID, repair type, targeted update domain, document incarnation number, node‑impact information, and additional metadata. Service Fabric then creates a repair task by combining details such as: |
| 25 | + |
| 26 | +* Repair type |
| 27 | +* Target update domain |
| 28 | +* Document incarnation number |
| 29 | + |
| 30 | +These elements are combined in the following format: |
| 31 | + |
| 32 | +`Azure/<repair type>/<repair job ID>/<update domain>/<document incarnation number>` |
| 33 | + |
| 34 | +Example: `Azure/TenantUpdate/addfb79e-1e8c-42c8-a967-b0e2e0afd6b4/0/110` |
| 35 | + |
| 36 | +* Repair type: Indicates the repair category, such as Tenant or Platform, and whether the operation is a maintenance action or an update. |
| 37 | +* Target update domain: The update domain that the repair job is targeting at that time. |
| 38 | +* Document incarnation number: A monotonically increasing version identifier for the update document that Service Fabric receives from Azure. |
| 39 | + |
| 40 | +The resulting entity is the repair task, which is used within the Service Fabric context. In contrast, the repair job is recognized outside Service Fabric components. |
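
The ID format described above can be split back into its parts when you need to match a repair task to its repair job. The following Python snippet is a minimal sketch, assuming the ID always has exactly five `/`-separated segments as in the example. The `RepairTaskId` class and `parse_repair_task_id` function are hypothetical helper names, not part of any Service Fabric SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairTaskId:
    """Parts of a repair task ID (hypothetical helper type)."""
    repair_type: str
    job_id: str
    update_domain: int
    document_incarnation: int


def parse_repair_task_id(task_id: str) -> RepairTaskId:
    """Split 'Azure/<type>/<job ID>/<UD>/<incarnation>' into its fields.

    Assumes the five-segment layout shown in this article.
    """
    parts = task_id.split("/")
    if len(parts) != 5 or parts[0] != "Azure":
        raise ValueError(f"unexpected repair task ID format: {task_id!r}")
    _, repair_type, job_id, update_domain, incarnation = parts
    return RepairTaskId(repair_type, job_id, int(update_domain), int(incarnation))


example = parse_repair_task_id(
    "Azure/TenantUpdate/addfb79e-1e8c-42c8-a967-b0e2e0afd6b4/0/110"
)
print(example.job_id, example.update_domain, example.document_incarnation)
```

The extracted job ID is the value to look for in the Infrastructure Jobs view, because the job ID stays the same inside and outside Service Fabric.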
| 41 | + |
| 42 | +## Repair task states and their ownership |
| 43 | + |
| 44 | +* Created |
| 45 | + |
| 46 | +In the Created state, the Repair Manager (RM) accepts and stores the repair request. At this point, the task is waiting for a Repair Executor (RE) to claim it. The requestor can cancel the task during this stage without any restrictions. The Repair Manager has ownership in this state. |
| 47 | + |
| 48 | +* Claimed |
| 49 | + |
| 50 | +Once the task is Claimed, the Repair Executor (RE) has taken ownership but has not yet specified the repair's impact. The requestor can still cancel the task at this stage. The Repair Executor has ownership in this state. |
| 51 | + |
| 52 | +* Preparing |
| 53 | + |
| 54 | +In the Preparing state, the Repair Executor specifies the impact, and the Repair Manager prepares the environment, for example by deactivating nodes. If the task is cancelled in this state, it skips execution and moves directly to Restoring. Operators also have the option to force approval, which bypasses certain safety checks. The Repair Manager has ownership in this state. |
| 55 | + |
| 56 | +* Approved |
| 57 | + |
| 58 | +When the task reaches Approved, the Repair Manager has completed all preparations and approved execution. The Repair Executor should move the task to Executing before starting the repair. Cancellation at this point requires cooperation from the executor, which should support cancellation if possible. Repair Executor has ownership in this state. |
| 59 | + |
| 60 | +* Executing |
| 61 | + |
| 62 | +During Executing, the Repair Executor is actively performing the repair. The executor must finish all disruptive actions before reporting completion. Cancellation in this state requires executor cooperation, and the executor should acknowledge it only when it is safe to do so. The Repair Executor has ownership in this state. |
| 63 | + |
| 64 | +* Restoring |
| 65 | + |
| 66 | +Once the repair is complete, the task enters Restoring, where the Repair Manager restores the environment, such as reactivating nodes. At this stage, the task cannot be cancelled. Repair Manager has ownership in this state. |
| 67 | + |
| 68 | +* Completed |
| 69 | + |
| 70 | +Finally, in the Completed state, the task is finished, and no further state changes occur. The final status may be Succeeded, Cancelled, Interrupted (with details), or Failed (with details). |
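
The ownership and cancellation rules above can be summarized as a small lookup table, which is handy when deciding which component to investigate for a task in a given state. This Python snippet is an illustrative sketch only. The names `STATE_INFO` and `owner_of` are hypothetical, and the cancellation notes paraphrase the descriptions in this section.

```python
# Order of the states as described in this section.
STATE_ORDER = [
    "Created", "Claimed", "Preparing", "Approved",
    "Executing", "Restoring", "Completed",
]

# state -> (owner, cancellation behavior), taken from the descriptions above.
STATE_INFO = {
    "Created":   ("Repair Manager",  "requestor can cancel without restrictions"),
    "Claimed":   ("Repair Executor", "requestor can cancel"),
    "Preparing": ("Repair Manager",  "cancel skips execution and moves to Restoring"),
    "Approved":  ("Repair Executor", "cancel requires executor cooperation"),
    "Executing": ("Repair Executor", "cancel requires executor cooperation"),
    "Restoring": ("Repair Manager",  "cannot be cancelled"),
    "Completed": (None,              "no further state changes"),
}


def owner_of(state: str):
    """Return the component that owns a task in the given state, or None."""
    return STATE_INFO[state][0]


for state in STATE_ORDER:
    owner, cancellation = STATE_INFO[state]
    print(f"{state:<10} owner={owner or '-':<16} {cancellation}")
```

For example, a task that stays in Preparing points to the Repair Manager side (health checks and safety checks), while a task that stays in Approved or Executing points to the Repair Executor.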
| 71 | + |
| 72 | +## Analyze stuck repair tasks in Service Fabric Explorer |
| 73 | + |
| 74 | +### Infrastructure Jobs view |
| 75 | + |
| 76 | +To view jobs that have been submitted to Service Fabric for approval, navigate to the Infrastructure Jobs tab under the cluster view. This view shows the job's perspective, and a job appears here only while it is present in the received document. Each entry includes a Job ID, which is the same inside and outside Service Fabric. The Acknowledgement Status indicates whether Service Fabric has approved the job: WaitingForAcknowledgement means the job is still pending approval, and Acknowledged confirms that Service Fabric has approved it. In addition to the Job ID and Acknowledgement Status, the Impact Types section displays the nature of the job’s impact. The Current Repair Task section shows which repair task is actively running for the job approval on the Service Fabric side. By selecting All Repair Tasks, you can view the status of every repair task associated with the current job. |
| 77 | + |
| 78 | + |
| 79 | +<center> |
| 80 | +![Infrastructure Job view][Image1] |
| 81 | +</center> |
| 82 | + |
| 83 | +### Repair Jobs and health check view |
| 84 | + |
| 85 | +To view individual and all repair tasks associated with a cluster, go to the Repair Jobs tab. This section displays both pending repair tasks and those that have been completed or cancelled. For pending tasks, you can see their current state. A repair task is not yet approved by Service Fabric if its state is Created, Claimed, or Preparing. Once a repair task transitions to the Approved state, it is considered approved by Service Fabric, and the approval is then forwarded to the Repair Executor for the corresponding job. |
| 86 | + |
| 87 | + |
| 88 | +<center> |
| 89 | +![Repair task view][Image2] |
| 90 | +</center> |
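
The approval rule described above is easy to express as a filter when you have a list of repair tasks exported or copied from the Repair Jobs tab. This is a minimal sketch in Python. The dictionary keys `TaskId` and `State` and the sample task IDs are assumptions made for illustration and might not match the exact field names in your data.

```python
# States in which a repair task is not yet approved by Service Fabric.
NOT_YET_APPROVED = frozenset({"Created", "Claimed", "Preparing"})


def pending_approval(tasks):
    """Return the tasks that Service Fabric has not approved yet."""
    return [task for task in tasks if task["State"] in NOT_YET_APPROVED]


# Sample data with hypothetical task IDs.
tasks = [
    {"TaskId": "Azure/TenantUpdate/job-a/0/110", "State": "Preparing"},
    {"TaskId": "Azure/TenantUpdate/job-a/1/110", "State": "Created"},
    {"TaskId": "Azure/TenantUpdate/job-b/0/108", "State": "Executing"},
]

for task in pending_approval(tasks):
    print(task["TaskId"], task["State"])
```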
| 91 | + |
| 92 | +If a repair task is stuck in the Preparing state, there are two possible causes: a health check or a safety check. Any unhealthy entity in the cluster, including customer applications and system applications, can prevent the health check from turning green. To determine whether the task is stuck in a health check, first verify whether the Preparing or Restoring health check is enabled, depending on the state where the task is stuck. In the Repair Task view, expanding the task shows the Health Check status, which indicates whether the check is enabled. |
| 93 | + |
| 94 | + |
| 95 | +<center> |
| 96 | +![Health check view][Image3] |
| 97 | +</center> |
| 98 | + |
| 99 | +If the health check is enabled, the Repair Task History shows that the health check started but did not complete, which confirms that the task is stuck in the health check phase, as illustrated in the preceding screenshot. |
| 100 | + |
| 101 | +### Safety check view |
| 102 | + |
| 103 | +A repair task can get stuck in the safety check phase only if it impacts a node. To confirm this, check the Impact section in the Repair Task view, as shown in the previous screenshot. If node impact is present, you can identify which safety check is causing the delay by inspecting each impacted node individually. Select the node from the Node List, and the Safety Check section shows the specific check where the task is stuck. The Repair Task ID is also displayed here, which indicates the repair task that is responsible for the node deactivation and safety check. In the following screenshot, the repair task is stuck in the EnsureSeedNodeQuorum safety check. |
| 104 | + |
| 105 | + |
| 106 | +<center> |
| 107 | +![Safety check view][Image4] |
| 108 | +</center> |
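
The checks described in the health check and safety check sections can be read as a short decision procedure for a task stuck in Preparing. The following Python sketch encodes that procedure under stated assumptions. The `RepairTaskSnapshot` type and its fields are hypothetical and stand for values you read manually from the Repair Task view in SFX. The function only suggests where to look next and does not query the cluster.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepairTaskSnapshot:
    """Values read manually from the Repair Task view (hypothetical type)."""
    state: str
    health_check_enabled: bool
    health_check_started: bool
    health_check_completed: bool
    impacted_nodes: List[str] = field(default_factory=list)


def where_to_look(task: RepairTaskSnapshot) -> List[str]:
    """Suggest which checks to inspect for a task stuck in Preparing."""
    if task.state != "Preparing":
        return []
    hints = []
    # A health check that started but never completed points to an
    # unhealthy entity in the cluster.
    if (task.health_check_enabled and task.health_check_started
            and not task.health_check_completed):
        hints.append("health_check")
    # Safety checks apply only when the task impacts nodes; inspect each one.
    for node in task.impacted_nodes:
        hints.append(f"safety_check:{node}")
    return hints


snapshot = RepairTaskSnapshot(
    state="Preparing",
    health_check_enabled=False,
    health_check_started=False,
    health_check_completed=False,
    impacted_nodes=["_Node_0"],  # hypothetical node name
)
print(where_to_look(snapshot))
```

A `safety_check:<node>` hint means that you open that node in the Node List and read its Safety Check section, as described earlier.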
| 109 | + |
| 110 | +If there are no errors in Infrastructure Service (IS) related to a repair task and the task has entered the Executing state, it means the job’s acknowledgment status is 'Acknowledged' for Impact Start. Similarly, if the repair task transitions to the Completed state, it indicates that the job’s acknowledgment status is 'Acknowledged' for Impact End. |
| 111 | + |
| 112 | +<center> |
| 113 | +![Repair task executing view][Image5] |
| 114 | +</center> |
| 115 | + |
| 116 | +All completed or cancelled repair tasks for the cluster can be viewed by clicking on the Completed Repair Tasks section. This provides a comprehensive list of repair tasks that have either successfully finished or were terminated. |
| 117 | + |
| 118 | + |
| 119 | +<center> |
| 120 | +![Completed repair task view][Image6] |
| 121 | +</center> |
| 122 | + |
| 123 | +### Infrastructure Service and Repair Manager service health check |
| 124 | + |
| 125 | +To check the health of the Infrastructure Service or Repair Manager Service, select the service from the list and open the Health Evaluation tab. This tab shows whether the service is healthy, in a warning state, or in an error state, along with details of any warnings or errors. |
| 126 | + |
| 127 | +<center> |
| 128 | +![Infrastructure service health view][Image7] |
| 129 | +</center> |
| 130 | + |
| 131 | +<center> |
| 132 | +![RepairManager service health view][Image8] |
| 133 | +</center> |
| 134 | + |
| 135 | +### Job throttling status for Infrastructure Service |
| 136 | + |
| 137 | +To check whether any job is being throttled for a specific Infrastructure Service, select the service, open the Health Evaluation tab, and then select All. Look for health events related to job throttling. If a job is throttled, the tab displays the job ID along with the reason for throttling. |
| 138 | + |
| 139 | +<center> |
| 140 | +![Job throttling status for Infrastructure Service view][Image9] |
| 141 | +</center> |
| 142 | + |
| 143 | + |
| 144 | +[Image1]:./media/Infrastructure-job-view.png |
| 145 | +[Image2]:./media/repair-task-view.png |
| 146 | +[Image3]:./media/Health-check.png |
| 147 | +[Image4]:./media/safety-check-view.png |
| 148 | +[Image5]:./media/Repair-task-executing.png |
| 149 | +[Image6]:./media/completed-repair-task-view.png |
| 150 | +[Image7]:./media/Infrastructure-service-health.png |
| 151 | +[Image8]:./media/RepairManager-Service-Health.png |
| 152 | +[Image9]:./media/Job-Throttling-status-for-IS.png |