Commit 908e290

migrating content from public to private for nap-tsg-update Merge branch 'nap-tsg-update' of https://github.com/wdarko1/supportarticles-docs into nap-tsg-update
2 files changed: 285 additions and 0 deletions

---
title: Troubleshoot the Node Auto Provisioning managed add-on
description: Learn how to troubleshoot Node Auto Provisioning in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer:
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto Provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

# Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and handles scaling events at the virtual machine (node) level.

When you enable node auto provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Ensure that the following tools are installed and configured. They're used in the sections that follow.

- [Azure CLI](/cli/azure/install-azure-cli)
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line client. To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- Node auto provisioning enabled on your cluster. For steps to enable it, see the [node auto provisioning documentation][nap-main-docs].
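
As a quick check, you can query the cluster's node provisioning profile. The following is a minimal sketch, not an official verification procedure: `<rg>` and `<cluster-name>` are placeholders, and the `nodeProvisioningProfile.mode` field assumes a recent AKS API version.

```shell
# Sketch: check whether node auto provisioning is enabled ("Auto") on a cluster.
# Guarded so the snippet is a no-op when the Azure CLI isn't available;
# <rg> and <cluster-name> are placeholders.
if command -v az >/dev/null 2>&1; then
  mode=$(az aks show --resource-group <rg> --name <cluster-name> \
    --query "nodeProvisioningProfile.mode" -o tsv)
  [ "$mode" = "Auto" ] && echo "NAP is enabled" || echo "NAP mode: $mode"
fi
```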

## Common Issues

### Nodes Not Being Removed

**Symptoms**: Underutilized or empty nodes remain in the cluster longer than expected.

**Debugging Steps**:

1. **Check node utilization**:

   ```azurecli-interactive
   kubectl top nodes
   kubectl describe node <node-name>
   ```

   You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage.

2. **Look for blocking pods**:

   ```azurecli-interactive
   kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
   ```

3. **Check for disruption blocks**:

   ```azurecli-interactive
   kubectl get events | grep -i "disruption\|consolidation"
   ```

**Common Causes**:

- Pods without proper tolerations
- DaemonSets preventing drain
- Pod disruption budgets (PDBs) that aren't properly set
- Nodes marked with the `do-not-disrupt` annotation
- Locks blocking changes

**Solutions**:

- Add proper tolerations to pods
- Review DaemonSet configurations
- Adjust pod disruption budgets to allow disruption
- Remove `do-not-disrupt` annotations if appropriate
- Review lock configurations
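
The annotation and PDB checks above can be scripted. The following sketch (guarded so it does nothing without cluster access; pod and namespace names are placeholders) lists pods that carry Karpenter's `karpenter.sh/do-not-disrupt` annotation and shows how to clear it:

```shell
# Sketch: find pods that block node removal via Karpenter's do-not-disrupt
# annotation. Guarded so the snippet is a no-op without cluster access.
if command -v kubectl >/dev/null 2>&1; then
  # Emit namespace<TAB>name<TAB>annotation-value, then keep rows where it's "true".
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.karpenter\.sh/do-not-disrupt}{"\n"}{end}' \
    | awk -F'\t' '$3 == "true" {print $1 "\t" $2}'

  # PDBs with zero allowed disruptions also prevent drain:
  kubectl get pdb --all-namespaces || true

  # Remove the annotation when disruption is acceptable (the trailing "-" deletes it):
  # kubectl annotate pod <pod-name> -n <namespace> karpenter.sh/do-not-disrupt-
fi
```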

## Networking Issues

For most networking-related issues, two levels of network observability are available:

- [Container Network Metrics][aks-container-metrics] (default): Provides node-level metrics
- [Advanced Container Network Metrics][advanced-container-network-metrics]: In addition to node-level metrics, provides pod-level metrics, including FQDN metrics, for troubleshooting

### Pod Connectivity Problems

**Symptoms**: Pods can't communicate with other pods or external services.

**Debugging Steps**:

1. **Test basic connectivity**:

   ```azurecli-interactive
   # From within a pod
   kubectl exec -it <pod-name> -- ping <target-ip>
   kubectl exec -it <pod-name> -- nslookup kubernetes.default
   ```

   Another option to test node-to-node or pod-to-pod connectivity is the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.

2. **Check network plugin status**:

   ```azurecli-interactive
   kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
   ```

3. **If you use Azure CNI with overlay, validate that your nodes have the following labels**:

   ```azurecli-interactive
   kubernetes.azure.com/azure-cni-overlay: "true"
   kubernetes.azure.com/network-name: aks-vnet-<redacted>
   kubernetes.azure.com/network-resourcegroup: <redacted>
   kubernetes.azure.com/network-subscription: <redacted>
   ```
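
   These labels can be checked without opening each node's YAML. A minimal sketch, assuming kubectl access (the `awk` filter itself is plain text processing):

   ```shell
   # Sketch: flag nodes that are missing the azure-cni-overlay label.
   # Guarded so the snippet is a no-op without cluster access.
   if command -v kubectl >/dev/null 2>&1; then
     kubectl get nodes \
       -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com/azure-cni-overlay}{"\n"}{end}' \
       | awk -F'\t' '$2 != "true" {print $1 "\tmissing azure-cni-overlay label"}'
   fi
   ```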

4. **Validate the CNI configuration files**

   The CNI conflist files define network plugin configurations. Check which files are present:

   ```azurecli-interactive
   # List CNI configuration files
   ls -la /etc/cni/net.d/

   # Example output:
   # 10-azure.conflist 15-azure-swift-overlay.conflist
   ```

   **Understanding conflist files**:

   - `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with CNIs that don't use overlay
   - `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode)

   **Inspect the configuration content**:

   ```azurecli-interactive
   # Check the actual CNI configuration
   cat /etc/cni/net.d/*.conflist

   # Look for key fields:
   # - "type": should be "azure-vnet" for Azure CNI
   # - "mode": "bridge" for standard, "transparent" for overlay
   # - "ipam": IP address management configuration
   ```

   **Common conflist issues**:

   - Missing or corrupted configuration files
   - Incorrect network mode for your cluster setup
   - Mismatched IPAM configuration
   - Wrong plugin order in the configuration chain
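
   To pull just those key fields out of a conflist, a `jq` one-liner works. The following sketch runs against an illustrative sample file, not a real node's configuration; on a node, point `jq` at `/etc/cni/net.d/*.conflist` instead:

   ```shell
   # Sketch: extract each plugin's "type" and "mode" from a conflist with jq.
   # The sample file below is illustrative only.
   printf '%s\n' \
     '{ "cniVersion": "0.3.0", "name": "azure",' \
     '  "plugins": [ { "type": "azure-vnet", "mode": "transparent",' \
     '                 "ipam": { "type": "azure-cns" } } ] }' \
     > /tmp/sample.conflist

   # "mode" can be absent in some configurations, so default it to "n/a".
   jq -r '.plugins[] | "\(.type)\t\(.mode // "n/a")"' /tmp/sample.conflist
   ```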

5. **Check CNI to CNS communication**:

   ```azurecli-interactive
   # Check CNS logs for IP allocation requests from CNI
   kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
   ```

   **CNI to CNS Troubleshooting**:

   - **If CNS logs show "no IPs available"**: This indicates a problem with CNS or with the AKS watch on the NodeNetworkConfig (NNC) resources.
   - **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify that the correct CNI plugin is deployed.

**Common Causes**:

- Network security group (NSG) rules
- Incorrect subnet configuration
- CNI plugin issues
- DNS resolution problems

**Solutions**:

- Review [network security group][network-security-group-docs] rules for required traffic
- Verify the subnet configuration in AKSNodeClass. See the [AKSNodeClass documentation][aksnodeclass-subnet-config] on subnet configuration
- Restart CNI plugin pods
- Check the CoreDNS configuration. See the [CoreDNS documentation][coredns-troubleshoot]

### DNS Service IP Issues

> [!NOTE]
> The `--dns-service-ip` parameter is supported only for NAP (node auto provisioning) clusters and isn't available for self-hosted Karpenter installations.

**Symptoms**: Pods can't resolve DNS names, or kubelet fails to register with the API server because of DNS resolution failures.

**Debugging Steps**:

1. **Check kubelet DNS configuration**:

   ```azurecli-interactive
   # SSH to the Karpenter node and check kubelet config
   sudo cat /var/lib/kubelet/config.yaml | grep -A 5 clusterDNS

   # Expected output should show the correct DNS service IP
   # clusterDNS:
   # - "10.0.0.10" # This should match your cluster's DNS service IP
   ```

2. **Verify DNS service IP matches cluster configuration**:

   ```azurecli-interactive
   # Get the actual DNS service IP from your cluster
   kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'

   # Compare with what AKS reports
   az aks show --resource-group <rg> --name <cluster-name> --query "networkProfile.dnsServiceIp" -o tsv
   ```

3. **Test DNS resolution from the node**:

   ```azurecli-interactive
   # SSH to the Karpenter node and test DNS resolution
   # Test using the DNS service IP directly
   dig @10.0.0.10 kubernetes.default.svc.cluster.local

   # Test using system resolver
   nslookup kubernetes.default.svc.cluster.local

   # Test external DNS resolution
   dig azure.com
   ```

4. **Check DNS pods status**:

   ```azurecli-interactive
   # Verify CoreDNS pods are running
   kubectl get pods -n kube-system -l k8s-app=kube-dns

   # Check CoreDNS logs for errors
   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
   ```

5. **Validate network connectivity to DNS service**:

   ```azurecli-interactive
   # From the Karpenter node, test connectivity to DNS service
   telnet 10.0.0.10 53 # Replace with your actual DNS service IP
   # Or use nc if telnet is not available
   nc -zv 10.0.0.10 53
   ```
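
The comparison in step 2 can be automated. The following is a sketch under the same assumptions as the steps above: `<rg>` and `<cluster-name>` are placeholders, and the check is guarded so it does nothing without both kubectl and the Azure CLI available.

```shell
# Sketch: compare the DNS service IP the cluster actually uses with what AKS
# reports; a mismatch explains kubelet registration and pod DNS failures.
if command -v kubectl >/dev/null 2>&1 && command -v az >/dev/null 2>&1; then
  actual=$(kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')
  expected=$(az aks show --resource-group <rg> --name <cluster-name> \
    --query "networkProfile.dnsServiceIp" -o tsv)
  if [ "$actual" = "$expected" ]; then
    echo "DNS service IP matches: $actual"
  else
    echo "Mismatch: cluster uses $actual but AKS reports $expected"
  fi
fi
```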

**Common Causes**:

- Incorrect `--dns-service-ip` parameter in AKSNodeClass
- DNS service IP not in the service CIDR range
- Network connectivity issues between the node and the DNS service
- CoreDNS pods not running or misconfigured
- Firewall rules blocking DNS traffic

**Solutions**:

- Verify that `--dns-service-ip` matches the actual DNS service: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
- Ensure that the DNS service IP is within the service CIDR range specified during cluster creation
- Check that Karpenter nodes can reach the service subnet
- Restart CoreDNS pods if they're in an error state: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify that NSG rules allow traffic on port 53 (TCP/UDP)
- Run a connectivity analysis with the [Azure Virtual Network Verifier][connectivity-tool] tool to validate outbound connectivity

## Azure-Specific Issues

### Spot VM Issues

**Symptoms**: Unexpected node terminations when using spot instances.

**Debugging Steps**:

1. **Check node events**:

   ```azurecli-interactive
   kubectl get events | grep -i "spot\|evict"
   ```

2. **Check available VM sizes in the region** (note that `az vm list-sizes` lists sizes, not spot prices):

   ```azurecli-interactive
   az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
   ```

**Solutions**:

- Use diverse instance types for better availability
- Implement proper pod disruption budgets
- Consider mixed spot/on-demand strategies
- Use workloads that tolerate node preemption
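
To verify your spot/on-demand mix, you can list each node's capacity type. This is a sketch that assumes Karpenter's standard `karpenter.sh/capacity-type` node label; it's guarded so it does nothing without cluster access.

```shell
# Sketch: print node name and Karpenter capacity type (spot vs on-demand),
# sorted so you can eyeball the mix. No-op without cluster access.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.karpenter\.sh/capacity-type}{"\n"}{end}' \
    | sort
fi
```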

### Quota Exceeded

**Symptoms**: VM creation fails with quota exceeded errors.

**Debugging Steps**:

1. **Check current quota usage**:

   ```azurecli-interactive
   az vm list-usage --location <region> --query "[?currentValue >= limit]"
   ```

**Solutions**:

- Request quota increases through the Azure portal
- Expand the NodePool CRD to more VM sizes. See the [NodePool configuration documentation][nap-nodepool-docs] for details. For example, a NodePool specification that allows any D-family virtual machine is less likely to hit quota errors that stop VM creation than a NodePool specification that's pinned to one exact VM size.
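
To catch quota pressure before creation fails, you can flag families that are close to their limit. The following is a sketch: `<region>` is a placeholder, the 80% threshold is arbitrary, and the output shape assumed for `az vm list-usage` is `name.value`, `currentValue`, `limit` per item.

```shell
# Sketch: flag VM families whose quota usage is at or above 80% in a region.
# Guarded so the snippet is a no-op without Azure CLI access.
if command -v az >/dev/null 2>&1; then
  az vm list-usage --location <region> -o tsv \
    --query "[].[name.value, currentValue, limit]" \
    | awk -F'\t' '$3 > 0 && $2 / $3 >= 0.8 {printf "%s: %d/%d\n", $1, $2, $3}'
fi
```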

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]

[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/
[karpenter-faq]: https://karpenter.sh/docs/faq/
[network-security-group-docs]: /azure/virtual-network/network-security-groups-overview
[aksnodeclass-subnet-config]: /azure/aks/node-autoprovision-aksnodeclass#virtual-network-subnet-configuration
[nap-nodepool-docs]: /azure/aks/node-autoprovision-node-pools
[nap-main-docs]: /azure/aks/node-autoprovision
[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting
[aks-container-metrics]: /azure/aks/container-network-observability-metrics
[advanced-container-network-metrics]: /azure/aks/advanced-container-networking-services-overview
[connectivity-tool]: /azure/azure-kubernetes/connectivity/basic-troubleshooting-outbound-connections#check-if-azure-network-resources-are-blocking-traffic-to-the-endpoint

support/azure/azure-kubernetes/toc.yml (2 additions, 0 deletions)

```diff
@@ -257,6 +257,8 @@ items:
   href: extensions/troubleshoot-managed-namespaces.md
 - name: Troubleshoot network isolated clusters
   href: extensions/troubleshoot-network-isolated-cluster.md
+- name: Troubleshoot node auto provisioning
+  href: extensions/troubleshoot-node-auto-provision.md
 - name: KEDA add-on
   items:
 - name: Breaking changes in KEDA add-on 2.15 and 2.14
```