description: Learn how to troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS).
4
4
ms.service: azure-kubernetes-service
5
+
author: JarrettRenshaw
6
+
ms.author: jarrettr
7
+
manager: dcscontentpm
8
+
ms.topic: troubleshooting
5
9
ms.date: 09/05/2025
6
10
editor: bsoghigian
7
-
ms.reviewer:
11
+
ms.reviewer: phwilson, v-ryanberg, v-gsitser
8
12
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the node auto-provisioning (NAP) managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
9
13
ms.custom: sap:Extensions, Policies and Add-Ons
10
14
---
11
15
12
-
# Troubleshoot node autoprovisioning (NAP) in Azure Kubernetes Service (AKS)
16
+
# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
13
17
14
-
This article discusses how to troubleshoot Node auto provisioning(NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine, or node level.
15
-
When you enable Node Auto Provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
18
+
This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and manages scaling events at the virtual machine (node) level.
19
+
20
+
When you enable NAP, you might experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].
16
21
17
22
## Prerequisites
18
23
19
24
Ensure the following tools are installed and configured. They're used in the following sections.
20
25
21
-
-[Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
22
-
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with the Azure CLI.
23
-
- Confirm you have Node Auto Provisioning enabled on your cluster. For steps on enabling node auto provisioning in your cluster, visit our[node auto provisioning documentation][nap-main-docs].
26
+
- [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
27
+
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with Azure CLI.
28
+
- Confirm you have NAP enabled on your cluster. For more information, see the [node auto-provisioning documentation][nap-main-docs].
24
29
25
-
## Common Issues
30
+
## Common issues
26
31
27
-
### Nodes Not Being Removed
32
+
### Nodes not being removed
28
33
29
-
**Symptoms**: Underutilized or empty nodes remain in the cluster longer than expected.
34
+
**Symptoms**
30
35
31
-
**Debugging Steps**:
36
+
Underutilized or empty nodes remain in the cluster longer than expected.
37
+
38
+
**Debugging steps**
39
+
40
+
1. **Check node utilization**
41
+
42
+
Run the following command:
32
43
33
-
1.**Check node utilization**:
34
44
```azurecli-interactive
35
45
kubectl top nodes
36
46
kubectl describe node <node-name>
37
47
```
48
+
38
49
You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage.
39
50
40
-
2.**Look for blocking pods**:
51
+
2. **Look for blocking pods**
52
+
53
+
Run the following command:
54
+
41
55
```azurecli-interactive
42
56
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
43
57
```
44
58
45
-
3.**Check for disruption blocks**:
59
+
3. **Check for disruption blocks**
60
+
61
+
Run the following command:
62
+
46
63
```azurecli-interactive
47
64
kubectl get events | grep -i "disruption\|consolidation"
48
65
```
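The event filter above relies on the `\|` alternation, which is a GNU `grep` extension, and `kubectl get events` only reads the current namespace unless told otherwise. As a hedged sketch, the same check can be wrapped in a small helper that uses extended regular expressions; the function name `filter_disruption_events` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: keep only event lines that mention disruption or
# consolidation (case-insensitive). Reads event text from stdin.
filter_disruption_events() {
  grep -iE 'disruption|consolidation'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get events --all-namespaces | filter_disruption_events
```

Reading from stdin keeps the filter independent of how the events were collected, so it also works on saved output.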
49
66
50
-
**Common Causes**:
51
-
- Pods without proper tolerations
52
-
- DaemonSets preventing drain
53
-
- Pod disruption budgets(PDBs) are not properly set
54
-
- Nodes are marked with `do-not-disrupt` annotation
55
-
- Locks blocking changes
67
+
**Common causes**
56
68
57
-
**Solutions**:
58
-
- Add proper tolerations to pods
59
-
- Review DaemonSet configurations
60
-
- Adjust pod disruption budgets to allow disruption
61
-
- Remove `do-not-disrupt` annotations if appropriate
62
-
- Review lock configurations
69
+
Common causes include:
63
70
71
+
- Pods without proper tolerations.
72
+
- DaemonSets preventing drain.
73
+
- Pod disruption budgets (PDBs) aren't properly set.
74
+
- Nodes are marked with `do-not-disrupt` annotation.
75
+
- Locks blocking changes.
64
76
65
-
## Networking Issues
77
+
**Solutions**
66
78
67
-
For most Networking related issues, there are two levels available for networking observability
68
-
-[Container Network Metrics][aks-container-metrics] (default): Allows for node level metrics
69
-
-[Advanced Container Network Metrics][advanced-container-network-metrics]: In addition to node level metrics, you can also observe pod-level metrics including FQDN metrics for troubleshooting.
79
+
Solutions include:
70
80
71
-
### Pod Connectivity Problems
81
+
- Adding proper tolerations to pods.
82
+
- Reviewing DaemonSet configurations.
83
+
- Adjusting PDBs to allow disruption.
84
+
- Removing `do-not-disrupt` annotations if appropriate.
85
+
- Reviewing lock configurations.
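To narrow down which of these causes applies, it can help to list the nodes that carry the annotation and the PDBs that currently allow no disruptions. The following is a minimal sketch, not an official procedure. The helper names are hypothetical, the full annotation key `karpenter.sh/do-not-disrupt` is an assumption based on upstream Karpenter, and the column positions assume the default table output of `kubectl`.

```shell
# Sketch only: helper names are hypothetical, and the column positions
# assume the default table output of the kubectl commands shown below.

# Print nodes whose do-not-disrupt annotation is "true".
# Expects a two-column table (NAME, DND) on stdin.
nodes_marked_do_not_disrupt() {
  awk 'NR > 1 && $2 == "true" { print $1 }'
}

# Print namespace/name for PDBs that currently allow zero disruptions.
# Expects `kubectl get pdb --all-namespaces` output on stdin, where
# ALLOWED DISRUPTIONS is the fifth whitespace-separated value on each row.
pdbs_allowing_no_disruption() {
  awk 'NR > 1 && $5 == "0" { print $1 "/" $2 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,DND:.metadata.annotations.karpenter\.sh/do-not-disrupt' \
#     | nodes_marked_do_not_disrupt
#   kubectl get pdb --all-namespaces | pdbs_allowing_no_disruption
```

A PDB that reports zero allowed disruptions blocks voluntary eviction of the pods it covers, which in turn can keep a node from being drained.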
72
86
73
-
**Symptoms**: Pods can't communicate with other pods or external services.
87
+
## Networking issues
74
88
75
-
**Debugging Steps**:
89
+
For most networking-related issues, there are two levels available for networking observability:
90
+
91
+
- [Container network metrics][aks-container-metrics] (default): Allows for node-level metrics.
92
+
- [Advanced container network metrics][advanced-container-network-metrics]: In addition to node-level metrics, you can also observe pod-level metrics, including fully qualified domain name (FQDN) metrics, for troubleshooting.
93
+
94
+
### Pod connectivity problems
95
+
96
+
**Symptoms**
97
+
98
+
Pods can't communicate with other pods or external services.
-**If CNS logs show "no IPs available"**: This indicates a CNS or AKS' watch on the NNCs.
141
-
-**If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.
182
+
**CNI-to-CNS troubleshooting**
142
183
143
-
**Common Causes**:
144
-
- Network security group(NSG) rules
145
-
- Incorrect subnet configuration
146
-
- CNI plugin issues
147
-
- DNS resolution problems
184
+
- **If CNS logs show "no IPs available"**: This indicates a problem with the CNS or AKS watch on the NodeNetworkConfig (NNC) resources.
185
+
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify that the correct CNI plugin is deployed.
148
186
149
-
**Solutions**:
150
-
- Review [Network Sescurity Group][network-security-group-docs] rules for required traffic
151
-
- Verify subnet configuration in AKSNodeClass. See [AKSNodeClass documentation][aksnodeclass-subnet-config] on subnet configuration
152
-
- Restart CNI plugin pods
153
-
- Check CoreDNS configuration. See [CoreDNS documentation][coredns-troubleshoot]
187
+
**Common causes**
188
+
189
+
Common causes include:
190
+
191
+
- Network security group rules.
192
+
- Incorrect subnet configuration.
193
+
- CNI plugin issues.
194
+
- DNS resolution problems.
195
+
196
+
**Solutions**
197
+
198
+
Solutions include:
199
+
200
+
- Reviewing [network security group][network-security-group-docs] rules for required traffic.
201
+
- Verifying subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
202
+
- Restarting CNI plugin pods.
203
+
- Checking `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].
154
204
155
-
### DNS Service IP Issues
205
+
### DNS service IP issues
156
206
157
207
>[!NOTE]
158
208
>The `--dns-service-ip` parameter is supported only for NAP clusters and isn't available for self-hosted Karpenter installations.