Commit f33bbda

amsliu committed
Merge branch 'main' of https://github.com/amsliu/SupportArticles-docs-pr into v-liuamson-parentbranch-CI6619
2 parents 22d1775 + a9965e4 commit f33bbda

12 files changed

Lines changed: 525 additions & 312 deletions
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
title: Troubleshoot Hybrid Runbook Worker Job Failures in Azure Automation
3+
description: Discusses some common issues that might occur when you run a runbook on a Hybrid Runbook Worker.
4+
ms.date: 06/13/2025
5+
ms.reviewer: adoyle
6+
ms.service: azure-automation
7+
ms.custom: sap:Runbook not working as expected
8+
---
9+
10+
# Hybrid Runbook Worker job isn't working as expected
11+
12+
This article provides guidance for troubleshooting and resolving issues that affect Hybrid Runbook Worker in Azure Automation.
13+
14+
> [!NOTE]
15+
> Azure Automation enables the recovery of runbooks that are deleted in the past 29 days. For more information, see [Restore deleted runbook](/azure/automation/manage-runbooks#restore-deleted-runbook).
16+
17+
## Troubleshoot connectivity issues
18+
19+
Connectivity problems are a common cause of issues that affect runbooks. Use the [Test Cloud Connectivity tool](/azure/azure-monitor/agents/agent-windows-troubleshoot?tabs=UpdateMMA#connectivity-issues) to verify that your environment is correctly configured.
20+
21+
## General troubleshooting
22+
23+
| **Issue** | **Resolution** |
24+
|----------|----------------|
25+
| Runbooks behave differently on a hybrid worker than in Azure Automation. | See [Runbook permissions](/azure/automation/automation-hrw-run-runbooks#runbook-permissions) for information about authentication differences. |
26+
| Error: No certificate was found. | Follow the "[No Certificate Found](/azure/automation/troubleshoot/hybrid-runbook-worker#no-cert-found)" section in the troubleshooting guide. |
27+
| You need to troubleshoot a custom runbook. | See [Troubleshoot runbook issues](/azure/automation/troubleshoot/runbooks). |
28+
| You need to check job status. | Review [job details and statuses](/azure/automation/automation-runbook-execution#job-statuses). |
29+
| Hybrid worker doesn't run jobs or is unresponsive. | Troubleshoot by using [Hybrid Runbook Worker diagnostics](/azure/automation/troubleshoot/hybrid-runbook-worker). |
30+
| Runbooks suddenly stop working. | Make sure that you [migrated to managed identity](/azure/automation/migrate-run-as-accounts-managed-identity?tabs=sa-managed-identity#cert-renewal) and that webhooks aren't expired. |
31+
| You need help passing parameters into a webhook. | See [Start a runbook from a webhook](/azure/automation/automation-webhooks#parameters-used-when-the-webhook-starts-a-runbook). |
32+
| You want to use both `Az` and `AzureRM` modules in a runbook. | This scenario isn't supported. Use only [Az modules in runbooks](/azure/automation/automation-update-azure-modules). |
33+
| Can't start or schedule a runbook. | Verify that the runbook is in a [published state](/azure/automation/manage-runbooks#publish-a-runbook). |
34+
| Runbook is suspended or failed unexpectedly. | Review [job statuses](/azure/automation/automation-runbook-execution#job-statuses). Add logging to the runbook by using [output streams](/azure/automation/automation-runbook-output-and-messages#working-with-message-streams). If the job fails three times, check [the automation limits](/azure/azure-resource-manager/management/azure-subscription-service-limits#automation-limits) and consider using a [Hybrid Worker](/azure/automation/automation-hybrid-runbook-worker). |
35+
36+
37+
### Windows Hybrid Runbook Worker issues
38+
39+
| **Issue** | **Resolution** |
40+
|-----------|----------------|
41+
| Event 4502 appears in the Operations Manager log. | See [Event 4502](/azure/automation/troubleshoot/hybrid-runbook-worker#event-4502). |
42+
| Script using `Connect-MsolService` can't connect to Microsoft 365. | See [Sandbox can't connect to Microsoft 365](/azure/automation/troubleshoot/hybrid-runbook-worker#scenario-orchestratorsandboxexe-cant-connect-to-microsoft-365-through-proxy). |
43+
| Machine is running but not sending heartbeat data. | See [Hybrid worker not reporting](/azure/automation/troubleshoot/hybrid-runbook-worker#corrupt-cache). |
44+
45+
46+
### Linux Hybrid Runbook Worker issues
47+
48+
| **Issue** | **Resolution** |
49+
|-----------|----------------|
50+
| Unexpected password prompt appears when using `sudo`. | See [Linux runbook worker prompts for password](/azure/automation/troubleshoot/hybrid-runbook-worker#prompt-for-password). |
51+
| Log file shows "The specified class does not exist." | See [Class does not exist error](/azure/automation/troubleshoot/hybrid-runbook-worker#class-does-not-exist). |
52+
| Linux job is stuck in the **Running** state. | 1. Switch to `sudo` permissions: `sudo su`<br>2. Make sure that the `hwd` service is running: `systemctl status hwd.service`<br>3. Open the following file on the Hybrid Worker: `vi /lib/systemd/system/hwd.service`<br>4. Update the setting from `CPUQuota=25%` to `CPUQuota=` to make CPU usage unrestricted, as shown in the following example: <br><br>`[Unit]`<br>`Description=HW Service`<br>`After=network.target`<br>`[Service]`<br>`Type=simple`<br>`ExecStart=/usr/bin/python3 .../automationWorkerStarterScript.py`<br>`TimeoutStartSec=5`<br>`Restart=always`<br>`RestartSec=10s`<br>`TimeoutStopSec=600`<br>`CPUQuota=`<br>`KillMode=process`<br>`[Install]`<br>`WantedBy=multi-user.target`<br><br> 5. Reload the systemd configuration and restart the `hwd` service: <br>`systemctl daemon-reload` <br> `systemctl restart hwd.service`<br>|
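
The `CPUQuota` change in step 4 of the last row can also be made non-interactively. The following is a minimal sketch, not part of the original article, that assumes GNU `sed` and runs against a scratch copy of the unit file. The scratch file contents and the `unit_file` variable are illustrative. On a real worker, the target would be `/lib/systemd/system/hwd.service`, edited as root.

```shell
# Work on a scratch copy so that the sketch is safe to run anywhere.
# On a real Hybrid Worker the file is /lib/systemd/system/hwd.service (root required).
unit_file="$(mktemp)"
printf '%s\n' '[Service]' 'Restart=always' 'CPUQuota=25%' 'KillMode=process' > "$unit_file"

# Replace whatever value CPUQuota has with an empty assignment (unrestricted).
sed -i 's/^CPUQuota=.*/CPUQuota=/' "$unit_file"

# Show the resulting setting.
grep '^CPUQuota=' "$unit_file"
```

After editing the real unit file, the `systemctl daemon-reload` and `systemctl restart hwd.service` commands from step 5 still apply.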
53+
54+
## Other error messages
55+
56+
| **Error** | **Resolution** |
57+
|-----------|----------------|
58+
| "The subscription cannot be found" | This error usually means that the runbook isn't using a managed identity. Follow the steps in [Unable to find subscription](/azure/automation/troubleshoot/runbooks#unable-to-find-subscription). |
59+
| "Strong authentication enrollment is required." | See [Authentication to Azure failed due to MFA](/azure/automation/troubleshoot/runbooks#auth-failed-mfa). |
60+
| "No permission" or similar error | Make sure that the [managed identity has appropriate permissions](/azure/role-based-access-control/role-assignments-portal). |
61+
62+
## Reference
63+
64+
- [Automation Hybrid Runbook Worker overview](/azure/automation/automation-hybrid-runbook-worker)
65+
- [Deploy a Windows Hybrid Runbook Worker](/azure/automation/automation-windows-hrw-install)
66+
- [Deploy a Linux Hybrid Runbook Worker](/azure/automation/automation-linux-hrw-install)
67+
- [Run runbooks on a Hybrid Runbook Worker](/azure/automation/automation-hrw-run-runbooks)

support/azure/automation/toc.yml

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,8 @@
22
href: welcome-automation.yml
33
- name: Runbook not working as expected
44
items:
5+
- name: Hybrid Runbook Worker job isn't working
6+
href: runbooks/runbook-fails-on-hybrid-worker.md
57
- name: Runbook jobs get suspended
68
href: runbooks/runbook-job-suspended.md
79
- name: Troubleshoot error codes during runbook execution
@@ -14,3 +16,5 @@
1416
href: runbooks/troubleshoot-runbook-execution-issues.md
1517
- name: Troubleshoot runbook execution issues when using PowerShell
1618
href: runbooks/powershell-job-script-cmdlets-not-working.md
19+
- name: Troubleshoot Hybrid Runbook Worker Job Failures
20+
href: runbooks/troubleshoot-runbook-execution-issues.md

support/azure/azure-kubernetes/availability-performance/cluster-service-health-probe-mode-issues.md

Lines changed: 156 additions & 9 deletions
@@ -4,8 +4,9 @@ description: Diagnoses and fixes common issues with the health probe mode featur
44
ms.date: 06/03/2024
55
ms.reviewer: niqi, cssakscic, v-weizhu
66
ms.service: azure-kubernetes-service
7-
ms.custom: sap:Node/node pool availability and performance, devx-track-azurecli
7+
ms.custom: sap:Node/node pool availability and performance, devx-track-azurecli, innovation-engine
88
---
9+
910
# Troubleshoot issues when enabling the AKS cluster service health probe mode
1011

1112
The health probe mode feature allows you to configure how Azure Load Balancer probes the health of the nodes in your Azure Kubernetes Service (AKS) cluster. You can choose between two modes: Shared and ServiceNodePort. The Shared mode uses a single health probe for all external traffic policy cluster services that use the same load balancer. In contrast, the ServiceNodePort mode uses a separate health probe for each service. The Shared mode can reduce the number of health probes and improve the performance of the load balancer, but it requires some additional components to work properly. To enable this feature, see [How to enable the health probe mode feature using the Azure CLI](#how-to-enable-the-health-probe-mode-feature-using-the-azure-cli).
@@ -36,11 +37,92 @@ The following operations also happen:
3637

3738
To troubleshoot these issues, follow these steps:
3839

39-
1. Check the RP frontend log to see if the health probe mode in the LoadBalancerProfile is properly configured. You can use the `az aks show` command to view the LoadBalancerProfile property of your cluster.
40-
41-
2. Check the *overlaymgr* log to see if the cloud provider secret is updated. The keyword to look for is `cloudConfigSecretResolver`. Or check the contents of the cloud-provider-config secret in the `ccp` namespace. You can use the `kubectl get secret` command to view the secret.
42-
43-
3. Check the chart or overlay daemonset cloud-node-manager to see if the health-probe-proxy sidecar container is enabled. You can use the `kubectl get ds` command to view the daemonset.
40+
1. First, connect to your AKS cluster using the Azure CLI:
41+
42+
```azurecli
43+
export RESOURCE_GROUP="aks-rg"
44+
export AKS_CLUSTER_NAME="aks-cluster"
45+
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --overwrite-existing
46+
```
47+
48+
2. Next, check the RP frontend log to see if the health probe mode in the LoadBalancerProfile is properly configured. You can use the `az aks show` command to view the LoadBalancerProfile property of your cluster.
49+
50+
```azurecli
51+
export RESOURCE_GROUP="aks-rg"
52+
export AKS_CLUSTER_NAME="aks-cluster"
53+
az aks show --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --query "networkProfile.loadBalancerProfile"
54+
```
55+
Results:
56+
57+
<!-- expected_similarity=0.3 -->
58+
59+
```output
60+
{
61+
"clusterServiceLoadBalancerHealthProbeMode": "Shared",
62+
"managedOutboundIPs": null,
63+
"outboundIPs": null,
64+
"outboundIPPrefixes": null,
65+
"allocatedOutboundPorts": null,
66+
"effectiveOutboundIPs": [
67+
{
68+
"id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_aks-rg_aks-cluster_eastus2/providers/Microsoft.Network/publicIPAddresses/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
69+
}
70+
],
71+
"idleTimeoutInMinutes": 30,
72+
"loadBalancerSku": "standard",
73+
"managedOutboundIPv6": null
74+
}
75+
```
76+
77+
3. Check the cloud provider configuration. In modern AKS clusters, the cloud provider configuration is managed internally and the `ccp` namespace doesn't exist. Instead, check for cloud provider related resources and verify the cloud-node-manager pods are running properly:
78+
79+
80+
```bash
81+
# Check for cloud provider related ConfigMaps in kube-system
82+
kubectl get configmap -n kube-system | grep -i azure
83+
84+
# Check if cloud-node-manager pods are running (indicates cloud provider integration is working)
85+
kubectl get pods -n kube-system | grep cloud-node-manager
86+
87+
# Check the azure-ip-masq-agent-config if it exists
88+
kubectl get configmap azure-ip-masq-agent-config-reconciled -n kube-system -o yaml 2>/dev/null || echo "ConfigMap not found"
89+
```
90+
Results:
91+
92+
<!-- expected_similarity=0.3 -->
93+
94+
```output
95+
configmap/azure-ip-masq-agent-config-reconciled 1 11h
96+
97+
cloud-node-manager-rfb2w 2/2 Running 0 16m
98+
```
99+
100+
4. Check the chart or overlay daemonset cloud-node-manager to see if the health-probe-proxy sidecar container is enabled. You can use the `kubectl get ds` command to view the daemonset.
101+
102+
```shell
103+
kubectl get ds -n kube-system cloud-node-manager -o yaml
104+
```
105+
Results:
106+
107+
<!-- expected_similarity=0.3 -->
108+
109+
```output
110+
apiVersion: apps/v1
111+
kind: DaemonSet
112+
metadata:
113+
name: cloud-node-manager
114+
namespace: kube-system
115+
...
116+
spec:
117+
template:
118+
spec:
119+
containers:
120+
- name: cloud-node-manager
121+
image: mcr.microsoft.com/oss/kubernetes/azure-cloud-node-manager:xxxxxxxx
122+
- name: health-probe-proxy
123+
image: mcr.microsoft.com/oss/kubernetes/azure-health-probe-proxy:xxxxxxxx
124+
...
125+
```
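
Instead of reading the full DaemonSet YAML, you can check only the container names. The following sketch is an illustration rather than part of the original article. The `has_container` helper and the `container_names` variable are hypothetical names, and the sample value is copied from the output above so that the snippet runs without a cluster.

```shell
# Return success when the space-separated list in $1 contains the name in $2.
has_container() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample value taken from the output above. On a live cluster, populate it with:
#   kubectl get ds cloud-node-manager -n kube-system \
#     -o jsonpath='{.spec.template.spec.containers[*].name}'
container_names="cloud-node-manager health-probe-proxy"

if has_container "$container_names" "health-probe-proxy"; then
  echo "health-probe-proxy sidecar is present"
else
  echo "health-probe-proxy sidecar is missing"
fi
```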
44126
45127
## Cause 1: The health probe mode isn't Shared or ServiceNodePort
46128
@@ -74,6 +156,26 @@ The health probe mode feature requires you to register the feature on your subsc
74156
75157
Make sure you register the feature for your subscription before creating or updating your cluster. You can use the `az feature register` command to register the feature.
76158
159+
```azurecli
160+
export FEATURE_NAME="EnableSLBSharedHealthProbePreview"
161+
export PROVIDER_NAMESPACE="Microsoft.ContainerService"
162+
az feature register --name $FEATURE_NAME --namespace $PROVIDER_NAMESPACE
163+
```
164+
Results:
165+
166+
<!-- expected_similarity=0.3 -->
167+
168+
```output
169+
{
170+
"id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/EnableSLBSharedHealthProbePreview",
171+
"name": "Microsoft.ContainerService/EnableSLBSharedHealthProbePreview",
172+
"properties": {
173+
"state": "Registering"
174+
},
175+
"type": "Microsoft.Features/providers/features"
176+
}
177+
```
178+
77179
## Cause 5: The Kubernetes version is earlier than v1.28.0
78180

79181
The health probe mode feature requires a minimum Kubernetes version of v1.28.0. If you use an older version, the feature won't work.
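
The minimum-version requirement can be checked in a script. The following is a sketch under stated assumptions: it relies on GNU `sort -V`, the `version_at_least` helper is a hypothetical name, and the `cluster_version` value is a placeholder. On a real cluster, you could read the version by using `az aks show --query kubernetesVersion -o tsv`.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort) placing the lower version first.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder value for illustration only.
cluster_version="1.27.9"

if version_at_least "$cluster_version" "1.28.0"; then
  echo "Kubernetes version supports the health probe mode feature"
else
  echo "Upgrade the cluster to v1.28.0 or later"
fi
```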
@@ -90,8 +192,53 @@ For Windows, the kube-proxy component doesn't start until you create the first n
90192

91193
To enable the health probe mode feature, run one of the following commands:
92194

93-
- `az aks create/update --cluster-service-load-balancer-health-probe-mode Shared`
94-
95-
- `az aks create/update --cluster-service-load-balancer-health-probe-mode ServiceNodePort (default)`
195+
Enable `ServiceNodePort` health probe mode (default) for a cluster:
196+
197+
```shell
198+
export RESOURCE_GROUP="aks-rg"
199+
export AKS_CLUSTER_NAME="aks-cluster"
200+
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode ServiceNodePort
201+
```
202+
Results:
203+
204+
```output
205+
{
206+
"name": "aks-cluster",
207+
"location": "eastus2",
208+
"resourceGroup": "aks-rg",
209+
"kubernetesVersion": "1.28.x",
210+
"provisioningState": "Succeeded",
211+
"loadBalancerProfile": {
212+
"clusterServiceLoadBalancerHealthProbeMode": "ServiceNodePort",
213+
...
214+
},
215+
...
216+
}
217+
```
218+
219+
Enable `Shared` health probe mode for a cluster:
220+
221+
```shell
222+
export RESOURCE_GROUP="MyAksResourceGroup"
223+
export AKS_CLUSTER_NAME="MyAksCluster"
224+
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode Shared
225+
```
226+
227+
Results:
228+
229+
```output
230+
{
231+
"name": "MyAksCluster",
232+
"location": "eastus2",
233+
"resourceGroup": "MyAksResourceGroup",
234+
"kubernetesVersion": "1.28.x",
235+
"provisioningState": "Succeeded",
236+
"loadBalancerProfile": {
237+
"clusterServiceLoadBalancerHealthProbeMode": "Shared",
238+
...
239+
},
240+
...
241+
}
242+
```
96243

97244
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
