Commit f1c4dee

Update troubleshoot-node-auto-provision.md
Edit review per CI 8179
1 parent 20cb8d0 commit f1c4dee

1 file changed

Lines changed: 77 additions & 74 deletions

support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md

@@ -9,35 +9,35 @@ ms.topic: troubleshooting
 ms.date: 09/05/2025
 editor: bsoghigian
 ms.reviewer: phwilson, v-ryanberg, v-gsitser
-#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve Node Auto Provisioining managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
+#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve node auto-provisioning managed add-ons so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
 ms.custom: sap:Extensions, Policies and Add-Ons
 ---

 # Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

-This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and manages scaling events at the virtual machine or node level.
+This article discusses how to troubleshoot node auto-provisioning (NAP). NAP is a managed add-on that's based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine (VM) or node level.

-When you enable NAP, you can experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].
+When you enable NAP, you might encounter issues that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common issues that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].

 ## Prerequisites

-Ensure the following tools are installed and configured. They're used in the following sections.
+Make sure that the following tools are installed and configured:

 - [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the `[az aks install-cli](/cli/azure/aks#az-aks-install-cli)` command.
-- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client available with Azure CLI.
-- Confirm you have NAP enabled on your cluster. For more information, see [node auto provisioning documentation][nap-main-docs].
+- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client that's available together with Azure CLI.
+- NAP, enabled on your cluster. For more information, see [node auto provisioning documentation][nap-main-docs].

 ## Common issues

-### Nodes not being removed
+### Nodes aren't removed

 **Symptoms**

-Underutilized or empty nodes remain in the cluster longer than expected.
+Underused or empty nodes remain in the cluster longer than you expect.

 **Debugging steps**

-1. **Check node utilization**
+1. **Check node usage**

 Run the following command:

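The command for the **Check node usage** step falls outside this hunk. As an illustrative sketch only (not necessarily the command the article itself uses), a usage check along these lines surfaces underused nodes and the pods that keep them alive:

```bash
# Node CPU/memory usage as reported by metrics-server (installed by default on AKS).
kubectl top nodes

# For a suspect node, review allocated resources and the pods still scheduled on it.
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```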
@@ -64,34 +64,34 @@ Run the following command:
 kubectl get events | grep -i "disruption\|consolidation"
 ```

-**Common causes**
+**Cause**

 Common causes include:

-- Pods without proper tolerations.
-- DaemonSets preventing drain.
-- Pod disruption budgets (PDBs) aren't properly set.
-- Nodes are marked with `do-not-disrupt` annotation.
-- Locks blocking changes.
+- Pods that have no proper tolerations
+- DaemonSets that prevent drain
+- Pod disruption budgets (PDBs) that aren't correctly set
+- Nodes that are marked by a `do-not-disrupt` annotation
+- Locks that block changes

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

 - Add proper tolerations to pods.
 - Review `DaemonSet` configurations.
-- Adjust PDBs to allow disruption
-- Remove `do-not-disrupt` annotations if appropriate.
+- Adjust PDBs to allow disruption.
+- Remove the `do-not-disrupt` annotations, as appropriate.
 - Review lock configurations.

 ## Networking issues

-For most networking-related issues, there are two levels available for networking observability:
+For most networking-related issues, use either of the available levels of networking observability:

-- [Container network metrics][aks-container-metrics] (default): Allows for node level metrics.
-- [Advanced container network metrics][advanced-container-network-metrics]: In addition to node level metrics, you can also observe pod-level metrics including fully qualified domain name (FQDN) metrics for troubleshooting.
+- [Container network metrics][aks-container-metrics] (default): Enables observation of node-level metrics.
+- [Advanced container network metrics][advanced-container-network-metrics]: Enables observation of pod-level metrics, including fully qualified domain name (FQDN) metrics for troubleshooting.

-### Pod connectivity problems
+### Pod connectivity issues

 **Symptoms**

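The causes and solutions in the hunk above mention `do-not-disrupt` annotations and PDBs. The following sketch assumes the upstream Karpenter annotation key (`karpenter.sh/do-not-disrupt`) and requires `jq`; adjust it to your cluster before relying on it:

```bash
# Find pods that opt out of disruption (blocks consolidation of the nodes that host them).
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true") | "\(.metadata.namespace)/\(.metadata.name)"'

# Review PDBs; a PDB that allows zero disruptions prevents node drain.
kubectl get pdb --all-namespaces

# If disruption is acceptable, remove the annotation from a node (the trailing "-" deletes it).
kubectl annotate node <node-name> karpenter.sh/do-not-disrupt-
```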
@@ -109,7 +109,7 @@ kubectl exec -it <pod-name> -- ping <target-ip>
 kubectl exec -it <pod-name> -- nslookup kubernetes.default
 ```

-Another option to test node-to-node or pod-to-pod connectivity is with the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.
+Another option to test node-to-node or pod-to-pod connectivity is to use the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.

 2. **Check network plugin status**

@@ -119,7 +119,7 @@ Run the following command:
 kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
 ```

-If you're using Azure Container Networking Interface (CNI) with overlay, verify your nodes have these labels:
+If you're using Azure Container Networking Interface (CNI) in overlay mode, verify that your nodes have these labels:

 ```azurecli-interactive
 kubernetes.azure.com/azure-cni-overlay: "true"
@@ -128,7 +128,7 @@ If you're using Azure Container Networking Interface (CNI) with overlay, verify
 kubernetes.azure.com/network-subscription: <redacted>
 ```

-4. **Validate the CNI configuration files**
+4. **Verify the CNI configuration files**

 The CNI conflist files define network plugin configurations. Check which files are present:

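The conflist files referenced in this hunk live on the node's file system. One way to reach them without SSH, sketched here under the assumption that node debug pods are permitted on the cluster:

```bash
# Start an interactive debug pod on the node; the node's root file system is mounted at /host.
kubectl debug node/<node-name> -it --image=busybox -- sh

# Inside the debug shell, list and read the CNI configuration files.
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/*.conflist
```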
@@ -142,10 +142,10 @@ ls -la /etc/cni/net.d/

 **Understanding conflist files**

-For this scenario, there are two types of conflist files:
+This scenario includes two kinds of conflist files:

-- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with all CNIs not using overlay.
-- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode).
+- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking of all CNIs that don't use overlay mode.
+- `15-azure-swift-overlay.conflist`: Azure CNI Overlay networking (used by Cilium or in overlay mode).

 **Inspect the configuration content**

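To inspect the configuration content quickly, a small `jq` filter can print the plugin chain and IPAM block from a conflist. This is illustrative only and assumes `jq` is available wherever you read the file:

```bash
# Show the plugin order and the IPAM configuration of the first plugin in the chain.
jq '{name: .name, plugins: [.plugins[].type], ipam: .plugins[0].ipam}' /etc/cni/net.d/10-azure.conflist
```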
@@ -165,10 +165,10 @@ cat /etc/cni/net.d/*.conflist

 Common conflist issues include:

-- Missing or corrupted configuration files.
-- Incorrect network mode for your cluster setup.
-- Mismatched IP Address Management (IPAM) configuration.
-- Wrong plugin order in the configuration chain.
+- Missing or corrupted configuration files
+- Incorrect network mode for your cluster setup
+- Mismatched IP Address Management (IPAM) configuration
+- Wrong plugin order in the configuration chain

 5. **Check CNI-to-Advanced Container Networking Services (ACNS) communication**

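Before grepping the `azure-cns` logs shown in the next hunk, it can help to confirm that the CNS pods are healthy and that the per-node network configuration (NNC) objects exist. The resource name below is the one used by Azure CNS and is an assumption to verify against your cluster's CRDs:

```bash
# Confirm the Azure CNS pods are running on the affected nodes.
kubectl get pods -n kube-system -l k8s-app=azure-cns -o wide

# List the node network configuration (NNC) custom resources that CNS watches for IP assignments.
kubectl get nodenetworkconfigs -n kube-system
```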
@@ -181,35 +181,35 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100

 **CNI-to-ACNS troubleshooting**

-- **If ACNS logs show "no IPs available"**: Indicates an ACNS or AKS watch on the Neural Network Coding (NNC).
-- **If CNI calls don't appear in ACNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.
+- **If ACNS logs show "no IPs available"**: Indicates an issue with the ACNS or AKS watch on the node network configuration (NNC).
+- **If CNI calls don't appear in ACNS logs**: Usually indicates that the wrong CNI is installed. Verify that the correct CNI plugin is deployed.

-**Common causes**
+**Causes**

 Common causes include:

-- Network security group (NSG) rules.
-- Incorrect subnet configuration.
-- CNI plugin issues.
-- DNS resolution problems.
+- Network security group (NSG) rules
+- Incorrect subnet configuration
+- CNI plugin issues
+- DNS resolution problems

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Review [Network Security Group][network-security-group-docs] rules for required traffic.
-- Verify subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
-- Restart CNI plugin pods.
-- Check `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].
+- Review the [Network Security Group][network-security-group-docs] rules for required traffic.
+- Verify the subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
+- Restart the CNI plugin pods.
+- Check the `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].

 ### DNS service IP issues

->[!NOTE]
->The `--dns-service-ip` parameter is only supported for NAP clusters and isn't available for self-hosted Karpenter installations.
+> [!NOTE]
+> The `--dns-service-ip` parameter is supported only for NAP clusters and isn't available for self-hosted Karpenter installations.

 **Symptoms**

-Pods can't resolve DNS names or kubelet fails to register with API server due to DNS resolution failures.
+Pods can't resolve DNS names, or kubelet doesn't register with the API server because of DNS resolution failures.

 **Debugging steps**

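The debugging steps for the DNS service IP issue start by comparing the configured value with the in-cluster service. A hedged sketch of that comparison, using the usual AKS property path (verify it for your CLI version):

```bash
# The DNS service IP that the cluster was created with.
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query "networkProfile.dnsServiceIp" --output tsv

# The ClusterIP actually assigned to kube-dns; the two values should match.
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```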
@@ -266,7 +266,7 @@ kubectl get pods -n kube-system -l k8s-app=kube-dns
 kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
 ```

-5. **Validate network connectivity to DNS service**
+5. **Verify network connectivity to DNS service**

 Run the following command:

@@ -277,34 +277,34 @@ telnet 10.0.0.10 53 # Replace with your actual DNS service IP
 nc -zv 10.0.0.10 53
 ```

-**Common causes**
+**Cause**

 Common causes include:

-- Incorrect `--dns-service-ip` parameter in `AKSNodeClass`.
-- DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
-- Network connectivity issues between node and DNS service.
-- `CoreDNS` pods not running or misconfigured.
+- The `--dns-service-ip` parameter in `AKSNodeClass` is incorrect.
+- The DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
+- Network connectivity issues exist between the node and DNS service.
+- `CoreDNS` pods aren't running or are misconfigured.
 - Firewall rules block DNS traffic.

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Verify `--dns-service-ip` matches the actual DNS service. Do this with the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
-- Ensure DNS service IP is within the service CIDR range specified during cluster creation.
-- Check Karpenter nodes can reach the service subnets
-- Restart `CoreDNS pods` if they're in error state. Do this with the following command: `kubectl rollout restart deployment/coredns -n kube-system`
-- Verify NSG rules allow traffic on port 53 (TCP/User Datagram Protocol (UDP)).
-- Run a connectivity analysis with [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to validate outbound connectivity.
+- Verify that the `--dns-service-ip` value matches the actual DNS service. To verify, run the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`.
+- Make sure that the DNS service IP is within the service CIDR range specified during cluster creation.
+- Check whether Karpenter nodes can reach the service subnets.
+- Restart `CoreDNS` pods if they're in an error state. To restart, run the following command: `kubectl rollout restart deployment/coredns -n kube-system`.
+- Verify that NSG rules allow traffic on port 53 (TCP/User Datagram Protocol (UDP)).
+- Run a connectivity analysis by using the [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to verify outbound connectivity.

 ## Azure-specific issues

-### Spot virtual machine (VM) issues
+### Spot VM issues

 **Symptoms**

-Unexpected node terminations occur when using spot instances.
+Unexpected node terminations occur when you use spot instances.

 **Debugging steps**

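For the spot VM symptoms above, it can help to see which nodes run on spot capacity and whether recent events point to preemption. The `karpenter.sh/capacity-type` label is the upstream Karpenter convention and is an assumption to confirm on your nodes:

```bash
# Show which NAP-provisioned nodes run on spot versus on-demand capacity.
kubectl get nodes -L karpenter.sh/capacity-type

# Look for recent preemption, eviction, or disruption events.
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -iE "preempt|evict|disrupt" | tail -20
```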
@@ -324,20 +324,20 @@ Run the following command:
 az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
 ```

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

 - Use diverse instance types for better availability.
 - Implement proper pod disruption budgets.
 - Consider mixed spot and on-demand strategies.
-- Use workloads tolerant of node preemption.
+- Use workloads that are tolerant of node preemption.

 ### Quota exceeded

 **Symptoms**

-VM creation fails with quota exceeded errors.
+VM creation fails and generates "quota exceeded" errors.

 **Debugging steps**

@@ -349,12 +349,15 @@ Run the following command:
 az vm list-usage --location <region> --query "[?currentValue >= limit]"
 ```

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Request quota increases through Azure portal.
-- Expand nodepool custom resource definitions (CRDs) to more VM sizes. For more information, see [NodePool configuration documentation][nap-nodepool-docs]. For example, a nodepool specification that allows for D-family VM is less likely to hit quota errors that stop VM creation compared to a nodepool specification specific to only one exact VM size.
+- Request quota increases through the Azure portal.
+- Expand nodepool custom resource definitions (CRDs) to include more VM sizes. For more information, see [NodePool configuration documentation][nap-nodepool-docs]. For example, nodepool specification A is less likely than nodepool specification B to trigger quota errors that stop VM creation if A includes D-family VMs and B is specific to only one VM size.
+
+[!INCLUDE [Third-party disclaimer](~/includes/third-party-disclaimer.md)]
+
+[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]

 [!INCLUDE [Azure Help Support](~/includes/azure-help-support.md)]
-[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]
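The nodepool expansion suggested in the hunk above can be expressed as a broader requirement set on the NodePool resource. The following is a minimal sketch, not the article's own example; the API version, label keys, and `AKSNodeClass` reference follow upstream Karpenter and Azure provider conventions and should be checked against the CRDs installed on your cluster:

```bash
# Illustrative only: allow several SKU families instead of pinning one exact VM size.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E", "F"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF
```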
