Commit 41a3cf2

Add NAP troubleshooting doc
Add NAP troubleshooting + FAQ doc
1 parent 977d0ed commit 41a3cf2

1 file changed

Lines changed: 263 additions & 0 deletions

@@ -0,0 +1,263 @@
1+
---
2+
title: Troubleshoot the Node Auto-provisioning managed add-on
3+
description: Learn how to troubleshoot Node Auto-provisioning in Azure Kubernetes Service (AKS).
4+
ms.service: azure-kubernetes-service
5+
ms.date: 09/05/2025
6+
editor: wdarko1
7+
ms.reviewer:
8+
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto-provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
9+
ms.custom: sap:Extensions, Policies and Add-Ons
10+
---
11+
12+
# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)
13+
14+
This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine (node) level.
15+
When you enable node auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
16+
17+
## Prerequisites
18+
19+
Ensure the following tools are installed and configured. They're used in the following sections.
20+
21+
- [Azure CLI](/cli/azure/install-azure-cli).
22+
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
23+
- A cluster that has Node Auto-provisioning enabled.
24+
25+
## Common Issues
26+
27+
### Nodes Not Being Removed
28+
29+
**Symptoms**: Underutilized nodes remain in the cluster longer than expected.
30+
31+
**Debugging Steps**:
32+
33+
1. **Check node utilization**:
34+
```azurecli-interactive
35+
kubectl top nodes
36+
kubectl describe node <node-name>
37+
```
38+
39+
2. **Look for blocking pods**:
40+
```azurecli-interactive
41+
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
42+
```
43+
44+
3. **Check for disruption blocks**:
45+
```azurecli-interactive
46+
kubectl get events | grep -i "disruption\|consolidation"
47+
```
48+
49+
**Common Causes**:
50+
- Pods without proper tolerations
51+
- DaemonSets preventing drain
52+
- Pod disruption budgets (PDBs) that are too restrictive to allow eviction
53+
- Nodes marked with `do-not-disrupt` annotation
54+
55+
**Solutions**:
56+
- Add proper tolerations to pods
57+
- Review DaemonSet configurations
58+
- Adjust pod disruption budgets to allow disruption
59+
- Remove `do-not-disrupt` annotations if appropriate
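
To find what carries the `do-not-disrupt` annotation, you can list the pods and filter on it. The following is a minimal sketch, not an official procedure. It assumes the annotation key is the upstream Karpenter key `karpenter.sh/do-not-disrupt`; the helper name `filter_do_not_disrupt` is hypothetical.

```bash
# Print "namespace/name" for every row whose third column is "true".
# Expects a header row followed by whitespace-separated columns: NS NAME DND.
filter_do_not_disrupt() {
  awk 'NR > 1 && $3 == "true" { print $1 "/" $2 }'
}

# Example usage against a live cluster (assumes the upstream Karpenter annotation key):
# kubectl get pods --all-namespaces \
#   -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DND:.metadata.annotations.karpenter\.sh/do-not-disrupt' \
#   | filter_do_not_disrupt
```

Pods that don't carry the annotation show `<none>` in the third column and are skipped.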
60+
61+
62+
## Networking Issues
63+
64+
### Pod Connectivity Problems
65+
66+
**Symptoms**: Pods can't communicate with other pods or external services.
67+
68+
**Debugging Steps**:
69+
70+
1. **Test basic connectivity**:
71+
```azurecli-interactive
72+
# From within a pod
73+
kubectl exec -it <pod-name> -- ping <target-ip>
74+
kubectl exec -it <pod-name> -- nslookup kubernetes.default
75+
```
76+
77+
2. **Check network plugin status**:
78+
```azurecli-interactive
79+
kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
80+
```
81+
3. **If you use Azure CNI Overlay or Cilium, check the node labels**:
82+
Validate that your nodes have these labels:
83+
84+
```
85+
kubernetes.azure.com/azure-cni-overlay: "true"
86+
kubernetes.azure.com/network-name: aks-vnet-<redacted>
87+
kubernetes.azure.com/network-resourcegroup: <redacted>
88+
kubernetes.azure.com/network-subscription: <redacted>
89+
```
90+
91+
4. **Validate the CNI configuration files**
92+
93+
The CNI conflist files define network plugin configurations. Check which files are present:
94+
95+
```azurecli-interactive
96+
# List CNI configuration files (run this on the node, not on your workstation)
97+
ls -la /etc/cni/net.d/
98+
99+
# Example output:
100+
# 10-azure.conflist 15-azure-swift-overlay.conflist
101+
```
102+
103+
**Understanding conflist files**:
104+
- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with node subnet
105+
- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode)
106+
107+
**Inspect the configuration content**:
108+
```azurecli-interactive
109+
# Check the actual CNI configuration
110+
cat /etc/cni/net.d/*.conflist
111+
112+
# Look for key fields:
113+
# - "type": should be "azure-vnet" for Azure CNI
114+
# - "mode": "bridge" for standard, "transparent" for overlay
115+
# - "ipam": IP address management configuration
116+
```
117+
118+
**Common conflist issues**:
119+
- Missing or corrupted configuration files
120+
- Incorrect network mode for your cluster setup
121+
- Mismatched IPAM configuration
122+
- Wrong plugin order in the configuration chain
123+
124+
5. **Check CNI to CNS communication**:
125+
```azurecli-interactive
126+
# Check CNS logs for IP allocation requests from CNI
127+
kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
128+
```
129+
130+
**CNI to CNS Troubleshooting**:
131+
- **If CNS logs show "no IPs available"**: This indicates a problem with CNS or with the AKS watch on the NodeNetworkConfig (NNC) resources.
132+
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.
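
The label check in step 3 can be scripted. The following sketch reports which of the four labels listed there are missing from a node's label string. The helper name `missing_overlay_labels` is hypothetical, and the input is assumed to be the comma-separated `key=value` list that `kubectl get node --show-labels` prints in its last column.

```bash
# Print each expected Azure CNI overlay label key that is absent from
# the comma-separated "key=value" label string passed as $1.
missing_overlay_labels() {
  labels=$1
  for key in \
    kubernetes.azure.com/azure-cni-overlay \
    kubernetes.azure.com/network-name \
    kubernetes.azure.com/network-resourcegroup \
    kubernetes.azure.com/network-subscription
  do
    case ",$labels," in
      *",$key="*) ;;          # label present, nothing to report
      *) echo "$key" ;;       # label missing
    esac
  done
}

# Example usage against a live cluster:
# missing_overlay_labels "$(kubectl get node <node-name> --no-headers --show-labels | awk '{print $NF}')"
```

Empty output means that all four labels are present on the node.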
133+
134+
**Common Causes**:
135+
- Network security group rules
136+
- Incorrect subnet configuration
137+
- CNI plugin issues
138+
- DNS resolution problems
139+
140+
**Solutions**:
141+
- Review NSG rules for required traffic
142+
- Verify subnet configuration in AKSNodeClass
143+
- Restart CNI plugin pods
144+
- Check CoreDNS configuration
145+
146+
### DNS Service IP Issues
147+
148+
**Note**: The `--dns-service-ip` parameter is supported only for node auto-provisioning (NAP) clusters. It isn't available for self-hosted Karpenter installations.
149+
150+
**Symptoms**: Pods can't resolve DNS names, or the kubelet fails to register with the API server because of DNS resolution failures.
151+
152+
**Debugging Steps**:
153+
154+
1. **Check kubelet DNS configuration**:
155+
```azurecli-interactive
156+
# SSH to the Karpenter node and check kubelet config
157+
sudo grep -A 5 clusterDNS /var/lib/kubelet/config.yaml
158+
159+
# Expected output should show the correct DNS service IP
160+
# clusterDNS:
161+
# - "10.0.0.10" # This should match your cluster's DNS service IP
162+
```
163+
164+
2. **Verify DNS service IP matches cluster configuration**:
165+
```azurecli-interactive
166+
# Get the actual DNS service IP from your cluster
167+
kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
168+
169+
# Compare with what AKS reports
170+
az aks show --resource-group <rg> --name <cluster-name> --query "networkProfile.dnsServiceIp" -o tsv
171+
```
172+
173+
3. **Test DNS resolution from the node**:
174+
```azurecli-interactive
175+
# SSH to the Karpenter node and test DNS resolution
176+
# Test using the DNS service IP directly
177+
dig @10.0.0.10 kubernetes.default.svc.cluster.local
178+
179+
# Test using system resolver
180+
nslookup kubernetes.default.svc.cluster.local
181+
182+
# Test external DNS resolution
183+
dig google.com
184+
```
185+
186+
4. **Check DNS pods status**:
187+
```azurecli-interactive
188+
# Verify CoreDNS pods are running
189+
kubectl get pods -n kube-system -l k8s-app=kube-dns
190+
191+
# Check CoreDNS logs for errors
192+
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
193+
```
194+
195+
5. **Validate network connectivity to DNS service**:
196+
```azurecli-interactive
197+
# From the Karpenter node, test connectivity to DNS service
198+
telnet 10.0.0.10 53 # Replace with your actual DNS service IP
199+
# Or using nc if telnet is not available
200+
nc -zv 10.0.0.10 53
201+
```
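
Steps 1 and 2 compare the kubelet's `clusterDNS` value with the `kube-dns` service IP. The following sketch extracts the first `clusterDNS` entry from a kubelet config so that the comparison can be made in one line. The helper name `kubelet_cluster_dns` is hypothetical, and the parsing assumes the simple list layout shown in step 1.

```bash
# Read a kubelet config (YAML) on stdin and print the first clusterDNS entry.
kubelet_cluster_dns() {
  awk '
    /^clusterDNS:/ { in_list = 1; next }
    # First list item under clusterDNS: strip dash, quotes, and whitespace.
    in_list && /^[ \t]*-/ { gsub(/[-" \t]/, ""); print; exit }
    # Any other line ends the list without a match.
    in_list { exit }
  '
}

# Example usage on the node:
# node_dns=$(sudo cat /var/lib/kubelet/config.yaml | kubelet_cluster_dns)
# Then compare "$node_dns" with the output of:
# kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
```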
202+
203+
**Common Causes**:
204+
- Incorrect `--dns-service-ip` parameter in AKSNodeClass
205+
- DNS service IP not in the service CIDR range
206+
- Network connectivity issues between node and DNS service
207+
- CoreDNS pods not running or misconfigured
208+
- Firewall rules blocking DNS traffic
209+
210+
**Solutions**:
211+
- Verify `--dns-service-ip` matches the actual DNS service: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
212+
- Ensure DNS service IP is within the service CIDR range specified during cluster creation
213+
- Check that Karpenter nodes can reach the service subnet
214+
- Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
215+
- Verify NSG rules allow traffic on port 53 (TCP/UDP)
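
To check that the DNS service IP falls inside the service CIDR range, you can compare the two values numerically. The following is a minimal IPv4-only sketch. The helper names `ip_to_int` and `ip_in_cidr` are hypothetical, and the usage comment assumes that the service CIDR is exposed as `networkProfile.serviceCidr` in the `az aks show` output.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address into four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) if IPv4 address $1 is inside CIDR block $2.
ip_in_cidr() {
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  fi
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Example usage (property name is an assumption):
# dns_ip=$(kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')
# cidr=$(az aks show --resource-group <rg> --name <cluster-name> --query "networkProfile.serviceCidr" -o tsv)
# ip_in_cidr "$dns_ip" "$cidr" && echo "in range" || echo "outside service CIDR"
```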
216+
217+
## Azure-Specific Issues
218+
219+
### Spot VM Issues
220+
221+
**Symptoms**: Unexpected node terminations when using spot instances.
222+
223+
**Debugging Steps**:
224+
225+
1. **Check node events**:
226+
227+
```azurecli-interactive
228+
kubectl get events | grep -i "spot\|evict"
229+
```
230+
231+
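2. **Check that the VM size is offered in the region** (this command lists VM sizes; it doesn't return spot pricing or eviction rates):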
232+
233+
```azurecli-interactive
234+
az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
235+
```
236+
237+
**Solutions**:
238+
- Use diverse instance types for better availability
239+
- Implement proper pod disruption budgets
240+
- Consider mixed spot/on-demand strategies
241+
- Use workloads tolerant of node preemption
242+
243+
### Quota Exceeded
244+
245+
**Symptoms**: VM creation fails with quota exceeded errors.
246+
247+
**Debugging Steps**:
248+
249+
1. **Check current quota usage**:
250+
```azurecli-interactive
251+
az vm list-usage --location <region> --query "[?currentValue >= limit]"
252+
```
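
The query above returns only quotas that are already exhausted. To spot quotas that are close to their limit, you can filter the usage list by percentage. The following is a sketch: the helper name `quota_near_limit` and the default threshold of 80 percent are illustrative assumptions, and the input is expected as tab-separated `name`, `currentValue`, `limit` rows.

```bash
# Read tab-separated "name<TAB>current<TAB>limit" rows on stdin and print
# the quotas whose usage is at or above the given percentage (default 80).
quota_near_limit() {
  threshold=${1:-80}
  awk -F '\t' -v t="$threshold" \
    '$3 > 0 && ($2 * 100) >= ($3 * t) { printf "%s %d/%d\n", $1, $2, $3 }'
}

# Example usage:
# az vm list-usage --location <region> \
#   --query "[].[name.value, currentValue, limit]" -o tsv | quota_near_limit 80
```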
253+
254+
**Solutions**:
255+
- Request quota increases through Azure portal
256+
- Use different VM sizes with available quota
257+
258+
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
259+
260+
261+
[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
262+
[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/
263+
[karpenter-faq]: https://karpenter.sh/docs/faq/
