Commit 2be9194

Update troubleshoot-node-auto-provision.md

1 parent 908e290 commit 2be9194

1 file changed: support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md (114 additions & 64 deletions)
---
title: Troubleshoot Node Auto-Provisioning Managed Add-on
description: Learn how to troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
author: JarrettRenshaw
ms.author: jarrettr
manager: dcscontentpm
ms.topic: troubleshooting
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer: phwilson, v-ryanberg, v-gsitser
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto Provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---
# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and manages scaling events at the virtual machine or node level.

When you enable NAP, you can experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Ensure that the following tools are installed and configured. They're used in the following sections.

- [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli). To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. It's available with the Azure CLI.
- Confirm that NAP is enabled on your cluster. For more information, see the [node auto-provisioning documentation][nap-main-docs].

## Common issues

### Nodes not being removed

**Symptoms**

Underutilized or empty nodes remain in the cluster longer than expected.

**Debugging steps**

1. **Check node utilization**

   Run the following commands:

   ```azurecli-interactive
   kubectl top nodes
   kubectl describe node <node-name>
   ```

   You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage.

2. **Look for blocking pods**

   Run the following command:

   ```azurecli-interactive
   kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
   ```

3. **Check for disruption blocks**

   Run the following command:

   ```azurecli-interactive
   kubectl get events | grep -i "disruption\|consolidation"
   ```

**Common causes**

Common causes include:

- Pods without proper tolerations.
- DaemonSets preventing drain.
- Pod disruption budgets (PDBs) that aren't properly set.
- Nodes marked with the `do-not-disrupt` annotation.
- Locks blocking changes.

**Solutions**

Solutions include:

- Adding proper tolerations to pods.
- Reviewing DaemonSet configurations.
- Adjusting PDBs to allow disruption.
- Removing `do-not-disrupt` annotations if appropriate.
- Reviewing lock configurations.
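To make the PDB adjustment above concrete, here's a minimal sketch of a PodDisruptionBudget that still permits voluntary evictions; the names and labels are hypothetical:

```yaml
# Hypothetical example: allow NAP/Karpenter to evict at most one replica at a time.
# A PDB with maxUnavailable: 0 (or minAvailable equal to the replica count) blocks
# node drain and prevents consolidation.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app         # hypothetical label
```

For the annotation case, upstream Karpenter honors the `karpenter.sh/do-not-disrupt: "true"` annotation; if it's no longer needed, it can be removed with `kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt-` (the trailing hyphen removes the annotation).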
## Networking issues

For most networking-related issues, there are two levels of networking observability available:

- [Container network metrics][aks-container-metrics] (default): Allows for node-level metrics.
- [Advanced container network metrics][advanced-container-network-metrics]: In addition to node-level metrics, you can also observe pod-level metrics, including fully qualified domain name (FQDN) metrics, for troubleshooting.

### Pod connectivity problems

**Symptoms**

Pods can't communicate with other pods or external services.

**Debugging steps**

1. **Test basic connectivity**

   Run the following commands:

   ```azurecli-interactive
   # From within a pod
   kubectl exec -it <pod-name> -- ping <target-ip>
   kubectl exec -it <pod-name> -- nslookup kubernetes.default
   ```

   Another option for testing node-to-node or pod-to-pod connectivity is the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.
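If the workload's image doesn't include `ping` or `nslookup`, you can run the same checks from a short-lived debug pod instead. A minimal sketch; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: netdebug              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: netdebug
      image: busybox:1.36     # any image that ships ping and nslookup works
      command: ["sleep", "3600"]
```

Once it's running, `kubectl exec -it netdebug -- nslookup kubernetes.default` performs the same DNS check from inside the cluster; delete the pod when you're finished.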
2. **Check network plugin status**

   Run the following command:

   ```azurecli-interactive
   kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
   ```

   If you're using Azure Container Networking Interface (CNI) with overlay, verify that your nodes have these labels:

   ```azurecli-interactive
   kubernetes.azure.com/azure-cni-overlay: "true"
   ```

   List the configuration files in the CNI configuration directory:

   ```azurecli-interactive
   ls -la /etc/cni/net.d/
   # 10-azure.conflist 15-azure-swift-overlay.conflist
   ```

**Understanding conflist files**

For this scenario, there are two types of conflist files:

- `10-azure.conflist`: The standard Azure CNI configuration for traditional networking with all CNIs that don't use overlay.
- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode).
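For orientation, a conflist file is a JSON document that chains one or more CNI plugins together. The sketch below shows only the general shape; the plugin types, modes, and IPAM settings on your nodes will differ, so treat every value here as illustrative rather than a working Azure configuration:

```json
{
  "cniVersion": "0.3.0",
  "name": "azure",
  "plugins": [
    {
      "type": "azure-vnet",
      "mode": "transparent",
      "ipam": {
        "type": "azure-vnet-ipam"
      }
    }
  ]
}
```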
**Inspect the configuration content**

Run the following command:

```azurecli-interactive
# Check the actual CNI configuration
cat /etc/cni/net.d/*.conflist
# - "ipam": IP address management configuration
```

**Common conflist issues**

Common conflist issues include:

- Missing or corrupted configuration files.
- An incorrect network mode for your cluster setup.
- A mismatched IP Address Management (IPAM) configuration.
- The wrong plugin order in the configuration chain.

5. **Check CNI-to-Azure Container Networking Service (CNS) communication**

   Run the following command:

   ```azurecli-interactive
   # Check CNS logs for IP allocation requests from CNI
   kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
   ```

**CNI-to-CNS troubleshooting**

- **If CNS logs show "no IPs available"**: This indicates a problem with the CNS or AKS watch on the NodeNetworkConfig (NNC) custom resources.
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify that the correct CNI plugin is deployed.
**Common causes**

Common causes include:

- Network security group (NSG) rules.
- Incorrect subnet configuration.
- CNI plugin issues.
- DNS resolution problems.

**Solutions**

Solutions include:

- Reviewing [network security group][network-security-group-docs] rules for required traffic.
- Verifying the subnet configuration in `AKSNodeClass`. For more information, see the [AKSNodeClass documentation][aksnodeclass-subnet-config].
- Restarting the CNI plugin pods.
- Checking the CoreDNS configuration. For more information, see the [CoreDNS documentation][coredns-troubleshoot].
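As a sketch of the subnet verification above: in the Azure Karpenter provider, the `AKSNodeClass` resource is where a custom subnet is specified. The API version and the `vnetSubnetID` field below are assumptions based on the upstream provider's schema and may differ in your cluster, so verify them against the linked AKSNodeClass documentation:

```yaml
apiVersion: karpenter.azure.com/v1alpha2   # assumed API version; check your cluster's CRDs
kind: AKSNodeClass
metadata:
  name: default
spec:
  # Placeholder subnet ID; must reference a subnet the cluster identity can join.
  vnetSubnetID: /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
```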

### DNS service IP issues

>[!NOTE]
>The `--dns-service-ip` parameter is only supported for NAP clusters and isn't available for self-hosted Karpenter installations.
