Commit 45d69b4

Author: Simonx Xu
Merge pull request #9343 from MicrosoftDocs/main
Auto push to live 2025-07-16 02:39:08
2 parents 5ce53f1 + 6a0f38a commit 45d69b4

2 files changed

Lines changed: 183 additions & 34 deletions
File tree

support/azure/azure-kubernetes/availability-performance/cluster-service-health-probe-mode-issues.md

Lines changed: 156 additions & 9 deletions
@@ -4,8 +4,9 @@ description: Diagnoses and fixes common issues with the health probe mode featur
ms.date: 06/03/2024
ms.reviewer: niqi, cssakscic, v-weizhu
ms.service: azure-kubernetes-service
- ms.custom: sap:Node/node pool availability and performance, devx-track-azurecli
+ ms.custom: sap:Node/node pool availability and performance, devx-track-azurecli, innovation-engine
---

# Troubleshoot issues when enabling the AKS cluster service health probe mode

The health probe mode feature allows you to configure how Azure Load Balancer probes the health of the nodes in your Azure Kubernetes Service (AKS) cluster. You can choose between two modes: Shared and ServiceNodePort. The Shared mode uses a single health probe for all external traffic policy cluster services that use the same load balancer. In contrast, the ServiceNodePort mode uses a separate health probe for each service. The Shared mode can reduce the number of health probes and improve the performance of the load balancer, but it requires some additional components to work properly. To enable this feature, see [How to enable the health probe mode feature using the Azure CLI](#how-to-enable-the-health-probe-mode-feature-using-the-azure-cli).
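Shared mode applies to `LoadBalancer` services whose external traffic policy is `Cluster`, as described above. The following is a minimal sketch of such a service, using hypothetical names (`demo-svc`, `app: demo`); the apply step runs only if `kubectl` is available and ignores failures when no cluster is reachable:

```shell
# Hypothetical Service manifest: type LoadBalancer with
# externalTrafficPolicy: Cluster, so Shared mode would cover it
# with the single shared health probe.
MANIFEST='apiVersion: v1
kind: Service
metadata:
  name: demo-svc
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080'
if command -v kubectl >/dev/null 2>&1; then
  printf '%s\n' "$MANIFEST" | kubectl apply -f - 2>/dev/null || true
fi
```

In ServiceNodePort mode, each such service instead gets its own probe against its node port.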
@@ -36,11 +37,92 @@ The following operations also happen:

To troubleshoot these issues, follow these steps:

- 1. Check the RP frontend log to see if the health probe mode in the LoadBalancerProfile is properly configured. You can use the `az aks show` command to view the LoadBalancerProfile property of your cluster.
- 2. Check the *overlaymgr* log to see if the cloud provider secret is updated. The keyword to look for is `cloudConfigSecretResolver`. Or check the contents of the cloud-provider-config secret in the `ccp` namespace. You can use the `kubectl get secret` command to view the secret.
- 3. Check the chart or overlay daemonset cloud-node-manager to see if the health-probe-proxy sidecar container is enabled. You can use the `kubectl get ds` command to view the daemonset.
1. First, connect to your AKS cluster using the Azure CLI:

    ```azurecli
    export RESOURCE_GROUP="aks-rg"
    export AKS_CLUSTER_NAME="aks-cluster"
    az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --overwrite-existing
    ```

2. Next, check the RP frontend log to see if the health probe mode in the LoadBalancerProfile is properly configured. You can use the `az aks show` command to view the LoadBalancerProfile property of your cluster.

    ```azurecli
    export RESOURCE_GROUP="aks-rg"
    export AKS_CLUSTER_NAME="aks-cluster"
    az aks show --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --query "networkProfile.loadBalancerProfile"
    ```

    Results:

    <!-- expected_similarity=0.3 -->

    ```output
    {
      "clusterServiceLoadBalancerHealthProbeMode": "Shared",
      "managedOutboundIPs": null,
      "outboundIPs": null,
      "outboundIPPrefixes": null,
      "allocatedOutboundPorts": null,
      "effectiveOutboundIPs": [
        {
          "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_aks-rg_aks-cluster_eastus2/providers/Microsoft.Network/publicIPAddresses/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        }
      ],
      "idleTimeoutInMinutes": 30,
      "loadBalancerSku": "standard",
      "managedOutboundIPv6": null
    }
    ```
3. Check the cloud provider configuration. In modern AKS clusters, the cloud provider configuration is managed internally and the `ccp` namespace doesn't exist. Instead, check for cloud provider related resources and verify the cloud-node-manager pods are running properly:

    ```bash
    # Check for cloud provider related ConfigMaps in kube-system
    kubectl get configmap -n kube-system | grep -i azure

    # Check if cloud-node-manager pods are running (indicates cloud provider integration is working)
    kubectl get pods -n kube-system | grep cloud-node-manager

    # Check the azure-ip-masq-agent-config if it exists
    kubectl get configmap azure-ip-masq-agent-config-reconciled -n kube-system -o yaml 2>/dev/null || echo "ConfigMap not found"
    ```

    Results:

    <!-- expected_similarity=0.3 -->

    ```output
    configmap/azure-ip-masq-agent-config-reconciled 1 11h

    cloud-node-manager-rfb2w 2/2 Running 0 16m
    ```
4. Check the chart or overlay daemonset cloud-node-manager to see if the health-probe-proxy sidecar container is enabled. You can use the `kubectl get ds` command to view the daemonset.

    ```shell
    kubectl get ds -n kube-system cloud-node-manager -o yaml
    ```

    Results:

    <!-- expected_similarity=0.3 -->

    ```output
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: cloud-node-manager
      namespace: kube-system
    ...
    spec:
      template:
        spec:
          containers:
          - name: cloud-node-manager
            image: mcr.microsoft.com/oss/kubernetes/azure-cloud-node-manager:xxxxxxxx
          - name: health-probe-proxy
            image: mcr.microsoft.com/oss/kubernetes/azure-health-probe-proxy:xxxxxxxx
    ...
    ```
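Rather than reading the full daemonset YAML, a jsonpath query can list just the container names; this is a sketch that skips the call if `kubectl` isn't available:

```shell
# In Shared mode, both "cloud-node-manager" and "health-probe-proxy"
# should appear in the container list.
JSONPATH='{.spec.template.spec.containers[*].name}'
if command -v kubectl >/dev/null 2>&1; then
  kubectl get ds cloud-node-manager -n kube-system -o jsonpath="$JSONPATH" 2>/dev/null || true
  echo
fi
```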

## Cause 1: The health probe mode isn't Shared or ServiceNodePort

@@ -74,6 +156,26 @@ The health probe mode feature requires you to register the feature on your subsc

Make sure you register the feature for your subscription before creating or updating your cluster. You can use the `az feature register` command to register the feature.

```azurecli
export FEATURE_NAME="EnableSLBSharedHealthProbePreview"
export PROVIDER_NAMESPACE="Microsoft.ContainerService"
az feature register --name $FEATURE_NAME --namespace $PROVIDER_NAMESPACE
```

Results:

<!-- expected_similarity=0.3 -->

```output
{
  "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/EnableSLBSharedHealthProbePreview",
  "name": "Microsoft.ContainerService/EnableSLBSharedHealthProbePreview",
  "properties": {
    "state": "Registering"
  },
  "type": "Microsoft.Features/providers/features"
}
```

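Feature registration can take several minutes to move from `Registering` to `Registered`. The following is a minimal sketch for polling the state and then refreshing the provider, assuming the same flag name as above; the calls are skipped if the Azure CLI isn't installed and failures (for example, when not logged in) are ignored:

```shell
FEATURE_NAME="EnableSLBSharedHealthProbePreview"
PROVIDER_NAMESPACE="Microsoft.ContainerService"
if command -v az >/dev/null 2>&1; then
  # Show the current registration state (Registering/Registered).
  az feature show --name "$FEATURE_NAME" --namespace "$PROVIDER_NAMESPACE" \
    --query properties.state -o tsv 2>/dev/null || true
  # After the state reports Registered, refresh the provider registration.
  az provider register --namespace "$PROVIDER_NAMESPACE" 2>/dev/null || true
fi
```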
## Cause 5: The Kubernetes version is earlier than v1.28.0

The health probe mode feature requires a minimum Kubernetes version of v1.28.0. If you use an older version, the feature won't work.
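To confirm the cluster version before enabling the feature, a sketch like the following can be used, reusing the hypothetical names from the earlier steps (`aks-rg`, `aks-cluster`); the call is skipped if the Azure CLI isn't available:

```shell
RESOURCE_GROUP="aks-rg"
AKS_CLUSTER_NAME="aks-cluster"
MIN_VERSION="1.28.0"  # minimum Kubernetes version the feature requires
if command -v az >/dev/null 2>&1; then
  # Print the cluster's Kubernetes version; compare it against MIN_VERSION.
  az aks show --resource-group "$RESOURCE_GROUP" --name "$AKS_CLUSTER_NAME" \
    --query kubernetesVersion -o tsv 2>/dev/null || true
fi
```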
@@ -90,8 +192,53 @@ For Windows, the kube-proxy component doesn't start until you create the first n

To enable the health probe mode feature, run one of the following commands:

- - `az aks create/update --cluster-service-load-balancer-health-probe-mode Shared`
- - `az aks create/update --cluster-service-load-balancer-health-probe-mode ServiceNodePort` (default)

Enable `ServiceNodePort` health probe mode (the default) for a cluster:

```shell
export RESOURCE_GROUP="aks-rg"
export AKS_CLUSTER_NAME="aks-cluster"
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode ServiceNodePort
```

Results:

```output
{
  "name": "aks-cluster",
  "location": "eastus2",
  "resourceGroup": "aks-rg",
  "kubernetesVersion": "1.28.x",
  "provisioningState": "Succeeded",
  "loadBalancerProfile": {
    "clusterServiceLoadBalancerHealthProbeMode": "ServiceNodePort",
    ...
  },
  ...
}
```

Enable `Shared` health probe mode for a cluster:

```shell
export RESOURCE_GROUP="MyAksResourceGroup"
export AKS_CLUSTER_NAME="MyAksCluster"
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode Shared
```

Results:

```output
{
  "name": "MyAksCluster",
  "location": "eastus2",
  "resourceGroup": "MyAksResourceGroup",
  "kubernetesVersion": "1.28.x",
  "provisioningState": "Succeeded",
  "loadBalancerProfile": {
    "clusterServiceLoadBalancerHealthProbeMode": "Shared",
    ...
  },
  ...
}
```
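After either update completes, the configured mode can be confirmed with a query; this is a sketch using the same hypothetical names as above, and the call is skipped if the Azure CLI isn't installed:

```shell
RESOURCE_GROUP="aks-rg"
AKS_CLUSTER_NAME="aks-cluster"
QUERY="networkProfile.loadBalancerProfile.clusterServiceLoadBalancerHealthProbeMode"
if command -v az >/dev/null 2>&1; then
  # Should print "Shared" or "ServiceNodePort".
  az aks show --resource-group "$RESOURCE_GROUP" --name "$AKS_CLUSTER_NAME" \
    --query "$QUERY" -o tsv 2>/dev/null || true
fi
```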

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]

support/azure/azure-kubernetes/create-upgrade-delete/cannot-scale-cluster-autoscaler-enabled-node-pool.md

Lines changed: 27 additions & 25 deletions
@@ -3,14 +3,15 @@ title: Cluster autoscaler fails to scale with cannot scale cluster autoscaler en
description: Learn how to troubleshoot the cannot scale cluster autoscaler enabled node pool error when your autoscaler isn't scaling up or down.
author: sgeannina
ms.author: ninasegares
- ms.date: 04/17/2025
- ms.reviewer: aritraghosh, chiragpa.momajed
+ ms.date: 06/09/2024
+ ms.reviewer: aritraghosh, chiragpa
ms.service: azure-kubernetes-service
- ms.custom: sap:Create, Upgrade, Scale and Delete operations (cluster or nodepool)
+ ms.custom: sap:Create, Upgrade, Scale and Delete operations (cluster or nodepool), innovation-engine
---

# Cluster autoscaler fails to scale with "cannot scale cluster autoscaler enabled node pool" error

- This article discusses how to resolve the "cannot scale cluster autoscaler enabled node pool" error that occurs when you scale a cluster that has an autoscaler-enabled node pool.
+ This article discusses how to resolve the "cannot scale cluster autoscaler enabled node pool" error that appears when scaling a cluster with an autoscaler enabled node pool.

## Symptoms

@@ -22,33 +23,33 @@ You receive an error message that resembles the following message:

## Troubleshooting checklist

- Azure Kubernetes Service (AKS) uses Azure Virtual Machine Scale Sets-based agent pools. These pools contain cluster nodes and [cluster autoscaling capabilities](/azure/aks/cluster-autoscaler), if they're enabled.
+ Azure Kubernetes Service (AKS) uses virtual machine scale sets-based agent pools, which contain cluster nodes and [cluster autoscaling capabilities](/azure/aks/cluster-autoscaler) if enabled.

### Check that the cluster virtual machine scale set exists

- 1. Sign in to the [Azure portal](https://portal.azure.com).
- 1. Find the node resource group by searching for the following names:
+ 1. Sign in to [Azure portal](https://portal.azure.com).
+ 1. Find the node resource group by searching the following names:
+    - The default name `MC_{AksResourceGroupName}_{YourAksClusterName}_{AksResourceLocation}`.
+    - The custom name (if it was provided at creation).
-    - The default name `MC_{AksResourceGroupName}_{YourAksClusterName}_{AksResourceLocation}`
-    - The custom name (if it was provided at creation)

> [!NOTE]
- > When you create a cluster, AKS automatically creates a second resource group to store the AKS resources. For more information, see [Why are two resource groups created with AKS?](/azure/aks/faq#why-are-two-resource-groups-created-with-aks)
+ > When you create a new cluster, AKS automatically creates a second resource group to store the AKS resources. For more information, see [Why are two resource groups created with AKS?](/azure/aks/faq#why-are-two-resource-groups-created-with-aks)

- 1. Check the list of resources to make sure that a virtual machine scale set exists.
+ 1. Check the list of resources and make sure that there's a virtual machine scale set.
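The portal check above can also be done from the CLI; this is a sketch with hypothetical resource names, and it runs only if the Azure CLI is installed (failures when not logged in are ignored):

```shell
RESOURCE_GROUP="aks-rg"        # hypothetical AKS resource group
AKS_CLUSTER_NAME="aks-cluster" # hypothetical cluster name
if command -v az >/dev/null 2>&1; then
  # Resolve the node resource group (MC_... by default), then list its scale sets.
  NODE_RG=$(az aks show --resource-group "$RESOURCE_GROUP" --name "$AKS_CLUSTER_NAME" \
    --query nodeResourceGroup -o tsv 2>/dev/null || true)
  if [ -n "$NODE_RG" ]; then
    az vmss list --resource-group "$NODE_RG" --output table
  fi
fi
```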

## Cause 1: The cluster virtual machine scale set was deleted

- If you delete the virtual machine scale set that's attached to the cluster, this action causes the cluster autoscaler to fail. It also causes issues when you provision resources such as nodes and pods.
+ Deleting the virtual machine scale set attached to the cluster causes the cluster autoscaler to fail. It also causes issues when provisioning resources such as nodes and pods.

> [!NOTE]
- > Modifying any resource under the node resource group in the AKS cluster is an unsupported action and causes cluster operation failures. You can prevent changes from being made to the node resource group by [blocking users from modifying resources](/azure/aks/cluster-configuration#fully-managed-resource-group-preview) that are managed by the AKS cluster.
+ > Modifying any resource under the node resource group in the AKS cluster is an unsupported action and will cause cluster operation failures. You can prevent changes from being made to the node resource group by [blocking users from modifying resources](/azure/aks/cluster-configuration#fully-managed-resource-group-preview) managed by the AKS cluster.
### Reconcile node pool

If the cluster virtual machine scale set is accidentally deleted, you can reconcile the node pool by using `az aks nodepool update`:

```shell
# Update node pool configuration (reapplying the current values triggers reconciliation)
az aks nodepool update --resource-group <resource-group-name> --cluster-name <cluster-name> --name <nodepool-name> --tags <tags> --node-taints <taints> --labels <labels>
```

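After the update, node health can be monitored until the pool converges; this is a sketch that assumes cluster credentials are already merged and skips the call if `kubectl` is missing:

```shell
EXPECTED_STATUS="Ready"  # nodes should converge to this status after reconciliation
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o wide 2>/dev/null || true
fi
```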
@@ -59,13 +60,13 @@ Monitor the node pool to make sure that it's functioning as expected and that al

## Cause 2: Tags or any other properties were modified from the node resource group

- You may experience scaling errors if you modify or delete Azure-created tags and other resource properties in the node resource group. For more information, see [Can I modify tags and other properties of the AKS resources in the node resource group?](/azure/aks/faq#can-i-modify-tags-and-other-properties-of-the-aks-resources-in-the-node-resource-group)
+ You may receive scaling errors if you modify or delete Azure-created tags and other resource properties in the node resource group. For more information, see [Can I modify tags and other properties of the AKS resources in the node resource group?](/azure/aks/faq#can-i-modify-tags-and-other-properties-of-the-aks-resources-in-the-node-resource-group)

### Reconcile node resource group tags

Use the Azure CLI to make sure that the node resource group has the correct tags for the AKS name and the AKS group name:

```shell
# Add or update tags for AKS name and AKS group name
az group update --name <node-resource-group-name> --set tags.AKS-Managed-Cluster-Name=<aks-managed-cluster-name> tags.AKS-Managed-Cluster-RG=<aks-managed-cluster-rg>
```

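The applied tags can then be verified with a query; this is a sketch with a hypothetical node resource group name, skipped if the Azure CLI isn't installed:

```shell
NODE_RESOURCE_GROUP="MC_aks-rg_aks-cluster_eastus2"  # hypothetical; substitute your node resource group
if command -v az >/dev/null 2>&1; then
  # Expect AKS-Managed-Cluster-Name and AKS-Managed-Cluster-RG among the tags.
  az group show --name "$NODE_RESOURCE_GROUP" --query tags 2>/dev/null || true
fi
```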
@@ -76,21 +77,22 @@ Monitor the resource group to make sure that the tags are correctly applied and

## Cause 3: The cluster node resource group was deleted

- Deleting the cluster node resource group causes issues when you provision the infrastructure resources that are required by the cluster. This action causes the cluster autoscaler to fail.
+ Deleting the cluster node resource group causes issues when provisioning the infrastructure resources required by the cluster, which causes the cluster autoscaler to fail.

## Solution: Update the cluster to the goal state without changing the configuration

- To resolve this issue, run the following command to recover the deleted virtual machine scale set or any tags (missing or modified).
+ To resolve this issue, you can run the following command to recover the deleted virtual machine scale set or any tags (missing or modified):

> [!NOTE]
- > It might take a few minutes until the operation finishes.
+ > It might take a few minutes until the operation completes.
+
+ Set your environment variables for the AKS cluster resource group and cluster name before running the command. A random suffix is included to prevent name collisions during repeatable executions, but you must ensure the resource group and cluster exist.

```azurecli
- az aks update --resource-group <resource-group-name> --name <aks-cluster-name>
+ export RANDOM_SUFFIX=$(head -c 3 /dev/urandom | xxd -p)
+ export AKS_RG_NAME="MyAksResourceGroup$RANDOM_SUFFIX"
+ export AKS_CLUSTER_NAME="MyAksCluster$RANDOM_SUFFIX"
+ az aks update --resource-group $AKS_RG_NAME --name $AKS_CLUSTER_NAME --no-wait
```

- ### Additional troubleshooting tips
-
- - Check the Azure Activity Log for any recent changes or deletions.

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
