Commit 2be9194

Update troubleshoot-node-auto-provision.md

1 parent 908e290 commit 2be9194

1 file changed: support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md (114 additions & 64 deletions)
---
title: Troubleshoot Node Auto-Provisioning Managed Add-on
description: Learn how to troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
author: JarrettRenshaw
ms.author: jarrettr
manager: dcscontentpm
ms.topic: troubleshooting
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer: phwilson, v-ryanberg, v-gsitser
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto Provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---
# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and manages scaling events at the virtual machine or node level.

When you enable NAP, you can experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Ensure that the following tools are installed and configured. They're used in the following sections.

- [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli). To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. It's available with the Azure CLI.
- Confirm that NAP is enabled on your cluster. For more information, see the [node auto-provisioning documentation][nap-main-docs].

## Common issues

### Nodes not being removed

**Symptoms**

Underutilized or empty nodes remain in the cluster longer than expected.

**Debugging steps**

1. **Check node utilization**

   Run the following commands:

   ```azurecli-interactive
   kubectl top nodes
   kubectl describe node <node-name>
   ```

   You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage.

2. **Look for blocking pods**

   Run the following command:

   ```azurecli-interactive
   kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
   ```

3. **Check for disruption blocks**

   Run the following command:

   ```azurecli-interactive
   kubectl get events | grep -i "disruption\|consolidation"
   ```

**Common causes**

Common causes include:

- Pods without proper tolerations.
- DaemonSets preventing drain.
- Pod disruption budgets (PDBs) that aren't properly set.
- Nodes marked with the `do-not-disrupt` annotation.
- Locks blocking changes.

**Solutions**

Solutions include:

- Adding proper tolerations to pods.
- Reviewing DaemonSet configurations.
- Adjusting PDBs to allow disruption.
- Removing `do-not-disrupt` annotations if appropriate.
- Reviewing lock configurations.
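To make the PDB adjustment above concrete, here's a minimal sketch of a PodDisruptionBudget that still permits voluntary evictions; the names and labels are hypothetical:

```yaml
# Hypothetical example: allow NAP/Karpenter to evict at most one replica at a time.
# A PDB with maxUnavailable: 0 (or minAvailable equal to the replica count) blocks
# node drain and prevents consolidation.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app         # hypothetical label
```

For the annotation case, upstream Karpenter honors the `karpenter.sh/do-not-disrupt: "true"` annotation; if it's no longer needed, it can be removed with `kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt-` (the trailing hyphen removes the annotation).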
## Networking issues

For most networking-related issues, there are two levels of networking observability available:

- [Container network metrics][aks-container-metrics] (default): Allows for node-level metrics.
- [Advanced container network metrics][advanced-container-network-metrics]: In addition to node-level metrics, you can also observe pod-level metrics, including fully qualified domain name (FQDN) metrics, for troubleshooting.

### Pod connectivity problems

**Symptoms**

Pods can't communicate with other pods or external services.

**Debugging steps**

1. **Test basic connectivity**

   Run the following commands:

   ```azurecli-interactive
   # From within a pod
   kubectl exec -it <pod-name> -- ping <target-ip>
   kubectl exec -it <pod-name> -- nslookup kubernetes.default
   ```

   Another option for testing node-to-node or pod-to-pod connectivity is the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.
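If the workload's image doesn't include `ping` or `nslookup`, you can run the same checks from a short-lived debug pod instead. A minimal sketch; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: netdebug              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: netdebug
      image: busybox:1.36     # any image that ships ping and nslookup works
      command: ["sleep", "3600"]
```

Once it's running, `kubectl exec -it netdebug -- nslookup kubernetes.default` performs the same DNS check from inside the cluster; delete the pod when you're finished.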
2. **Check network plugin status**

   Run the following command:

   ```azurecli-interactive
   kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
   ```

   If you're using Azure Container Networking Interface (CNI) with overlay, verify that your nodes have these labels:

   ```azurecli-interactive
   kubernetes.azure.com/azure-cni-overlay: "true"
   ```

   List the configuration files in the CNI configuration directory:

   ```azurecli-interactive
   ls -la /etc/cni/net.d/
   # 10-azure.conflist 15-azure-swift-overlay.conflist
   ```

**Understanding conflist files**

For this scenario, there are two types of conflist files:

- `10-azure.conflist`: The standard Azure CNI configuration for traditional networking with all CNIs that don't use overlay.
- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode).
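For orientation, a conflist file is a JSON document that chains one or more CNI plugins together. The sketch below shows only the general shape; the plugin types, modes, and IPAM settings on your nodes will differ, so treat every value here as illustrative rather than a working Azure configuration:

```json
{
  "cniVersion": "0.3.0",
  "name": "azure",
  "plugins": [
    {
      "type": "azure-vnet",
      "mode": "transparent",
      "ipam": {
        "type": "azure-vnet-ipam"
      }
    }
  ]
}
```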
**Inspect the configuration content**

Run the following command:

```azurecli-interactive
# Check the actual CNI configuration
cat /etc/cni/net.d/*.conflist
# - "ipam": IP address management configuration
```

**Common conflist issues**

Common conflist issues include:

- Missing or corrupted configuration files.
- An incorrect network mode for your cluster setup.
- A mismatched IP Address Management (IPAM) configuration.
- The wrong plugin order in the configuration chain.

5. **Check CNI-to-Azure Container Networking Service (CNS) communication**

   Run the following command:

   ```azurecli-interactive
   # Check CNS logs for IP allocation requests from CNI
   kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
   ```

**CNI-to-CNS troubleshooting**

- **If CNS logs show "no IPs available"**: This indicates a problem with the CNS or AKS watch on the NodeNetworkConfig (NNC) custom resources.
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify that the correct CNI plugin is deployed.
**Common causes**

Common causes include:

- Network security group (NSG) rules.
- Incorrect subnet configuration.
- CNI plugin issues.
- DNS resolution problems.

**Solutions**

Solutions include:

- Reviewing [network security group][network-security-group-docs] rules for required traffic.
- Verifying the subnet configuration in `AKSNodeClass`. For more information, see the [AKSNodeClass documentation][aksnodeclass-subnet-config].
- Restarting the CNI plugin pods.
- Checking the CoreDNS configuration. For more information, see the [CoreDNS documentation][coredns-troubleshoot].
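As a sketch of the subnet verification above: in the Azure Karpenter provider, the `AKSNodeClass` resource is where a custom subnet is specified. The API version and the `vnetSubnetID` field below are assumptions based on the upstream provider's schema and may differ in your cluster, so verify them against the linked AKSNodeClass documentation:

```yaml
apiVersion: karpenter.azure.com/v1alpha2   # assumed API version; check your cluster's CRDs
kind: AKSNodeClass
metadata:
  name: default
spec:
  # Placeholder subnet ID; must reference a subnet the cluster identity can join.
  vnetSubnetID: /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
```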

### DNS service IP issues

>[!NOTE]
>The `--dns-service-ip` parameter is only supported for NAP clusters and isn't available for self-hosted Karpenter installations.
