ms.topic: troubleshooting
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer: phwilson, v-ryanberg, v-gsitser
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve node auto-provisioning managed add-ons so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto-provisioning (NAP). NAP is a managed add-on that's based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine (VM) or node level.

When you enable NAP, you might encounter issues that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common issues that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].
## Prerequisites

Make sure that the following tools are installed and configured:

- [Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the Azure CLI, run the [`az aks install-cli`](/cli/azure/aks#az-aks-install-cli) command.

- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client that's available together with the Azure CLI.

- NAP, enabled on your cluster. For more information, see the [node auto-provisioning documentation][nap-main-docs].
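As a quick sanity check, you can confirm that the tools are available and that the NAP custom resources exist. This is a sketch, not part of the official setup steps; it assumes that you're signed in to the Azure CLI and that your kubeconfig targets the cluster:

```shell
# Install kubectl through the Azure CLI, then confirm that it's on PATH.
az aks install-cli
kubectl version --client

# NAP registers NodePool and AKSNodeClass custom resources. If these commands
# report that the resource type doesn't exist, NAP isn't enabled on the cluster.
kubectl get nodepools
kubectl get aksnodeclasses
```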
## Common issues

### Nodes aren't removed

**Symptoms**

Underused or empty nodes remain in the cluster longer than you expect.

**Debugging steps**
1. **Check node usage**

   Run the following command:
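   One common way to inspect node usage, assuming that the Metrics Server add-on is running in the cluster (`<node-name>` is a placeholder):

   ```shell
   # Live CPU and memory consumption per node, as reported by the Metrics Server
   kubectl top nodes

   # Resource requests and limits already committed to a specific node
   kubectl describe node <node-name>
   ```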
2. **Check disruption events**

   Run the following command:

   ```
   kubectl get events | grep -i "disruption\|consolidation"
   ```

**Cause**

Common causes include:

- Pods that have no proper tolerations
- DaemonSets that prevent drain
- Pod disruption budgets (PDBs) that aren't correctly set
- Nodes that are marked by a `do-not-disrupt` annotation
- Locks that block changes

**Solution**

Possible solutions include:

- Add proper tolerations to pods.
- Review `DaemonSet` configurations.
- Adjust PDBs to allow disruption.
- Remove the `do-not-disrupt` annotations, as appropriate.
- Review lock configurations.
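For the PDB cause, a misconfigured budget is typically one that computes to zero allowed disruptions, which blocks node drain entirely. A minimal sketch of a PDB that still lets NAP drain pods one at a time (the name and label are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  maxUnavailable: 1        # a budget that allows 0 disruptions blocks consolidation
  selector:
    matchLabels:
      app: web             # hypothetical label
```

Check the `ALLOWED DISRUPTIONS` column in the output of `kubectl get pdb -A`; a value of `0` indicates that consolidation can't proceed for the selected pods.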

## Networking issues

For most networking-related issues, use either of the available levels of networking observability:

- [Container network metrics][aks-container-metrics] (default): Enables observation of node-level metrics.

- [Advanced container network metrics][advanced-container-network-metrics]: Enables observation of pod-level metrics, including fully qualified domain name (FQDN) metrics for troubleshooting.
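To move from the default metrics to the advanced tier, Advanced Container Networking Services has to be enabled on the cluster. As a sketch (the `--enable-acns` flag and the resource names are assumptions here; confirm the current syntax against the linked documentation):

```shell
# Hypothetical resource group and cluster names.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-acns
```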

5. **Verify network connectivity to DNS service**

   Run the following commands:

   ```
   telnet 10.0.0.10 53  # Replace with your actual DNS service IP
   nc -zv 10.0.0.10 53
   ```

**Cause**

Common causes include:

- The `--dns-service-ip` parameter in `AKSNodeClass` is incorrect.
- The DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
- Network connectivity issues exist between the node and the DNS service.
- `CoreDNS` pods aren't running or are misconfigured.
- Firewall rules block DNS traffic.

**Solution**

Possible solutions include:

- Verify that the `--dns-service-ip` value matches the actual DNS service. To verify, run the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
- Make sure that the DNS service IP is within the service CIDR range that was specified during cluster creation.
- Check whether Karpenter nodes can reach the service subnets.
- Restart the `CoreDNS` pods if they're in an error state. To restart, run the following command: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify that network security group (NSG) rules allow traffic on port 53 (TCP and User Datagram Protocol (UDP)).
- Run a connectivity analysis by using [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to verify outbound connectivity.
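The first two checks can be combined into a single sketch that compares the configured value against the live service. This assumes kubectl access to the cluster; the expected IP is a hypothetical example:

```shell
# The --dns-service-ip value that you configured (hypothetical example value).
EXPECTED_DNS_IP="10.0.0.10"

# The DNS service IP that the cluster actually uses.
ACTUAL_DNS_IP=$(kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')

if [ "$EXPECTED_DNS_IP" != "$ACTUAL_DNS_IP" ]; then
  echo "Mismatch: configured $EXPECTED_DNS_IP, but the cluster reports $ACTUAL_DNS_IP"
fi
```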
## Azure-specific issues

### Spot VM issues

**Symptoms**

Unexpected node terminations occur when you use spot instances.

**Debugging steps**

Run the following command:

```
az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
```

**Solution**

Possible solutions include:

- Use diverse instance types for better availability.
- Implement proper pod disruption budgets.
- Consider mixed spot and on-demand strategies.
- Use workloads that are tolerant of node preemption.
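A mixed spot and on-demand strategy can be expressed in the NodePool requirements. This sketch uses the upstream Karpenter `karpenter.sh/capacity-type` label; verify the exact schema against the NAP NodePool documentation (the NodePool name is hypothetical):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: mixed-capacity         # hypothetical name
spec:
  template:
    spec:
      requirements:
        # Allow both spot and on-demand so that capacity can fall back to
        # on-demand VMs when spot instances are evicted.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```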
### Quota exceeded
**Symptoms**

VM creation fails and generates "quota exceeded" errors.

**Debugging steps**

Run the following command:

```
az vm list-usage --location <region> --query "[?currentValue >= limit]"
```

**Solution**

Possible solutions include:

- Request quota increases through the Azure portal.
- Expand nodepool custom resource definitions (CRDs) to include more VM sizes. For more information, see the [NodePool configuration documentation][nap-nodepool-docs]. For example, a nodepool specification that allows any D-family VM size is less likely to trigger quota errors that stop VM creation than a specification that's restricted to a single VM size.
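The family-wide specification described above can be sketched as a NodePool requirement that spans a whole VM family instead of pinning one size. The `karpenter.azure.com/sku-family` key is assumed to be the AKS provider label; confirm it against the NodePool configuration documentation (the NodePool name is hypothetical):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: d-family               # hypothetical name
spec:
  template:
    spec:
      requirements:
        # Any D-family size can satisfy pending pods, so hitting the quota
        # limit for one size doesn't block provisioning outright.
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D"]
```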
0 commit comments