Commit f1c4dee

Update troubleshoot-node-auto-provision.md
Edit review per CI 8179
1 parent 20cb8d0 commit f1c4dee

1 file changed

Lines changed: 77 additions & 74 deletions

support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md

@@ -9,35 +9,35 @@ ms.topic: troubleshooting
 ms.date: 09/05/2025
 editor: bsoghigian
 ms.reviewer: phwilson, v-ryanberg, v-gsitser
-#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve Node Auto Provisioining managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
+#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve node auto-provisioning managed add-ons so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
 ms.custom: sap:Extensions, Policies and Add-Ons
 ---

 # Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

-This article discusses how to troubleshoot node auto-provisioning (NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and manages scaling events at the virtual machine or node level.
+This article discusses how to troubleshoot node auto-provisioning (NAP). NAP is a managed add-on that's based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine (VM) or node level.

-When you enable NAP, you can experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].
+When you enable NAP, you might encounter issues that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common issues that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].

 ## Prerequisites

-Ensure the following tools are installed and configured. They're used in the following sections.
+Make sure that the following tools are installed and configured:

 - [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the `[az aks install-cli](/cli/azure/aks#az-aks-install-cli)` command.
-- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client available with Azure CLI.
-- Confirm you have NAP enabled on your cluster. For more information, see [node auto provisioning documentation][nap-main-docs].
+- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client that's available together with Azure CLI.
+- NAP, enabled on your cluster. For more information, see [node auto provisioning documentation][nap-main-docs].

 ## Common issues

-### Nodes not being removed
+### Nodes aren't removed

 **Symptoms**

-Underutilized or empty nodes remain in the cluster longer than expected.
+Underused or empty nodes remain in the cluster longer than you expect.

 **Debugging steps**

-1. **Check node utilization**
+1. **Check node usage**

 Run the following command:

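The command for the **Check node usage** step falls outside this hunk. As an illustrative sketch only (not necessarily the command the article itself uses), a usage check along these lines surfaces underused nodes and the pods that keep them alive:

```bash
# Node CPU/memory usage as reported by metrics-server (installed by default on AKS).
kubectl top nodes

# For a suspect node, review allocated resources and the pods still scheduled on it.
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```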
@@ -64,34 +64,34 @@ Run the following command:
 kubectl get events | grep -i "disruption\|consolidation"
 ```

-**Common causes**
+**Cause**

 Common causes include:

-- Pods without proper tolerations.
-- DaemonSets preventing drain.
-- Pod disruption budgets (PDBs) aren't properly set.
-- Nodes are marked with `do-not-disrupt` annotation.
-- Locks blocking changes.
+- Pods that have no proper tolerations
+- DaemonSets that prevent drain
+- Pod disruption budgets (PDBs) that aren't correctly set
+- Nodes that are marked by a `do-not-disrupt` annotation
+- Locks that block changes

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

 - Add proper tolerations to pods.
 - Review `DaemonSet` configurations.
-- Adjust PDBs to allow disruption
-- Remove `do-not-disrupt` annotations if appropriate.
+- Adjust PDBs to allow disruption.
+- Remove the `do-not-disrupt` annotations, as appropriate.
 - Review lock configurations.

 ## Networking issues

-For most networking-related issues, there are two levels available for networking observability:
+For most networking-related issues, use either of the available levels of networking observability:

-- [Container network metrics][aks-container-metrics] (default): Allows for node level metrics.
-- [Advanced container network metrics][advanced-container-network-metrics]: In addition to node level metrics, you can also observe pod-level metrics including fully qualified domain name (FQDN) metrics for troubleshooting.
+- [Container network metrics][aks-container-metrics] (default): Enables observation of node-level metrics.
+- [Advanced container network metrics][advanced-container-network-metrics]: Enables observation of pod-level metrics, including fully qualified domain name (FQDN) metrics for troubleshooting.

-### Pod connectivity problems
+### Pod connectivity issues

 **Symptoms**

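The causes and solutions in the hunk above mention `do-not-disrupt` annotations and PDBs. The following sketch assumes the upstream Karpenter annotation key (`karpenter.sh/do-not-disrupt`) and requires `jq`; adjust it to your cluster before relying on it:

```bash
# Find pods that opt out of disruption (blocks consolidation of the nodes that host them).
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true") | "\(.metadata.namespace)/\(.metadata.name)"'

# Review PDBs; a PDB that allows zero disruptions prevents node drain.
kubectl get pdb --all-namespaces

# If disruption is acceptable, remove the annotation from a node (the trailing "-" deletes it).
kubectl annotate node <node-name> karpenter.sh/do-not-disrupt-
```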
@@ -109,7 +109,7 @@ kubectl exec -it <pod-name> -- ping <target-ip>
 kubectl exec -it <pod-name> -- nslookup kubernetes.default
 ```

-Another option to test node-to-node or pod-to-pod connectivity is with the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.
+Another option to test node-to-node or pod-to-pod connectivity is to use the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.

 2. **Check network plugin status**

@@ -119,7 +119,7 @@ Run the following command:
 kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
 ```

-If you're using Azure Container Networking Interface (CNI) with overlay, verify your nodes have these labels:
+If you're using Azure Container Networking Interface (CNI) in overlay mode, verify that your nodes have these labels:

 ```azurecli-interactive
 kubernetes.azure.com/azure-cni-overlay: "true"
@@ -128,7 +128,7 @@ If you're using Azure Container Networking Interface (CNI) with overlay, verify
 kubernetes.azure.com/network-subscription: <redacted>
 ```

-4. **Validate the CNI configuration files**
+4. **Verify the CNI configuration files**

 The CNI conflist files define network plugin configurations. Check which files are present:

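The conflist files referenced in this hunk live on the node's file system. One way to reach them without SSH, sketched here under the assumption that node debug pods are permitted on the cluster:

```bash
# Start an interactive debug pod on the node; the node's root file system is mounted at /host.
kubectl debug node/<node-name> -it --image=busybox -- sh

# Inside the debug shell, list and read the CNI configuration files.
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/*.conflist
```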
@@ -142,10 +142,10 @@ ls -la /etc/cni/net.d/

 **Understanding conflist files**

-For this scenario, there are two types of conflist files:
+This scenario includes two kinds of conflist files:

-- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with all CNIs not using overlay.
-- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode).
+- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking of all CNIs that don't use overlay mode.
+- `15-azure-swift-overlay.conflist`: Azure CNI Overlay networking (used by Cilium or in overlay mode).

 **Inspect the configuration content**

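To inspect the configuration content quickly, a small `jq` filter can print the plugin chain and IPAM block from a conflist. This is illustrative only and assumes `jq` is available wherever you read the file:

```bash
# Show the plugin order and the IPAM configuration of the first plugin in the chain.
jq '{name: .name, plugins: [.plugins[].type], ipam: .plugins[0].ipam}' /etc/cni/net.d/10-azure.conflist
```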
@@ -165,10 +165,10 @@ cat /etc/cni/net.d/*.conflist

 Common conflist issues include:

-- Missing or corrupted configuration files.
-- Incorrect network mode for your cluster setup.
-- Mismatched IP Address Management (IPAM) configuration.
-- Wrong plugin order in the configuration chain.
+- Missing or corrupted configuration files
+- Incorrect network mode for your cluster setup
+- Mismatched IP Address Management (IPAM) configuration
+- Wrong plugin order in the configuration chain

 5. **Check CNI-to-Advanced Container Networking Services (ACNS) communication**

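Before grepping the `azure-cns` logs shown in the next hunk, it can help to confirm that the CNS pods are healthy and that the per-node network configuration (NNC) objects exist. The resource name below is the one used by Azure CNS and is an assumption to verify against your cluster's CRDs:

```bash
# Confirm the Azure CNS pods are running on the affected nodes.
kubectl get pods -n kube-system -l k8s-app=azure-cns -o wide

# List the node network configuration (NNC) custom resources that CNS watches for IP assignments.
kubectl get nodenetworkconfigs -n kube-system
```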
@@ -181,35 +181,35 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100

 **CNI-to-ACNS troubleshooting**

-- **If ACNS logs show "no IPs available"**: Indicates an ACNS or AKS watch on the Neural Network Coding (NNC).
-- **If CNI calls don't appear in ACNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.
+- **If ACNS logs show "no IPs available"**: Indicates an issue with the ACNS or AKS watch on the node network configuration (NNC).
+- **If CNI calls don't appear in ACNS logs**: Usually indicates that the wrong CNI is installed. Verify that the correct CNI plugin is deployed.

-**Common causes**
+**Causes**

 Common causes include:

-- Network security group (NSG) rules.
-- Incorrect subnet configuration.
-- CNI plugin issues.
-- DNS resolution problems.
+- Network security group (NSG) rules
+- Incorrect subnet configuration
+- CNI plugin issues
+- DNS resolution problems

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Review [Network Security Group][network-security-group-docs] rules for required traffic.
-- Verify subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
-- Restart CNI plugin pods.
-- Check `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].
+- Review the [Network Security Group][network-security-group-docs] rules for required traffic.
+- Verify the subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
+- Restart the CNI plugin pods.
+- Check the `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].

 ### DNS service IP issues

->[!NOTE]
->The `--dns-service-ip` parameter is only supported for NAP clusters and isn't available for self-hosted Karpenter installations.
+> [!NOTE]
+> The `--dns-service-ip` parameter is supported only for NAP clusters and isn't available for self-hosted Karpenter installations.

 **Symptoms**

-Pods can't resolve DNS names or kubelet fails to register with API server due to DNS resolution failures.
+Pods can't resolve DNS names, or kubelet doesn't register with the API server because of DNS resolution failures.

 **Debugging steps**

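The debugging steps for the DNS service IP issue start by comparing the configured value with the in-cluster service. A hedged sketch of that comparison, using the usual AKS property path (verify it for your CLI version):

```bash
# The DNS service IP that the cluster was created with.
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query "networkProfile.dnsServiceIp" --output tsv

# The ClusterIP actually assigned to kube-dns; the two values should match.
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```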
@@ -266,7 +266,7 @@ kubectl get pods -n kube-system -l k8s-app=kube-dns
 kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
 ```

-5. **Validate network connectivity to DNS service**
+5. **Verify network connectivity to DNS service**

 Run the following command:

@@ -277,34 +277,34 @@ telnet 10.0.0.10 53 # Replace with your actual DNS service IP
 nc -zv 10.0.0.10 53
 ```

-**Common causes**
+**Cause**

 Common causes include:

-- Incorrect `--dns-service-ip` parameter in `AKSNodeClass`.
-- DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
-- Network connectivity issues between node and DNS service.
-- `CoreDNS` pods not running or misconfigured.
+- The `--dns-service-ip` parameter in `AKSNodeClass` is incorrect.
+- The DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
+- Network connectivity issues exist between the node and DNS service.
+- `CoreDNS` pods aren't running or are misconfigured.
 - Firewall rules block DNS traffic.

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Verify `--dns-service-ip` matches the actual DNS service. Do this with the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
-- Ensure DNS service IP is within the service CIDR range specified during cluster creation.
-- Check Karpenter nodes can reach the service subnets
-- Restart `CoreDNS pods` if they're in error state. Do this with the following command: `kubectl rollout restart deployment/coredns -n kube-system`
-- Verify NSG rules allow traffic on port 53 (TCP/User Datagram Protocol (UDP)).
-- Run a connectivity analysis with [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to validate outbound connectivity.
+- Verify that the `--dns-service-ip` value matches the actual DNS service. To verify, run the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`.
+- Make sure that the DNS service IP is within the service CIDR range specified during cluster creation.
+- Check whether Karpenter nodes can reach the service subnets.
+- Restart `CoreDNS` pods if they're in an error state. To restart, run the following command: `kubectl rollout restart deployment/coredns -n kube-system`.
+- Verify that NSG rules allow traffic on port 53 (TCP/User Datagram Protocol (UDP)).
+- Run a connectivity analysis by using the [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to verify outbound connectivity.

 ## Azure-specific issues

-### Spot virtual machine (VM) issues
+### Spot VM issues

 **Symptoms**

-Unexpected node terminations occur when using spot instances.
+Unexpected node terminations occur when you use spot instances.

 **Debugging steps**

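For the spot VM symptoms above, it can help to see which nodes run on spot capacity and whether recent events point to preemption. The `karpenter.sh/capacity-type` label is the upstream Karpenter convention and is an assumption to confirm on your nodes:

```bash
# Show which NAP-provisioned nodes run on spot versus on-demand capacity.
kubectl get nodes -L karpenter.sh/capacity-type

# Look for recent preemption, eviction, or disruption events.
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -iE "preempt|evict|disrupt" | tail -20
```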
@@ -324,20 +324,20 @@ Run the following command:
 az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
 ```

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

 - Use diverse instance types for better availability.
 - Implement proper pod disruption budgets.
 - Consider mixed spot and on-demand strategies.
-- Use workloads tolerant of node preemption.
+- Use workloads that are tolerant of node preemption.

 ### Quota exceeded

 **Symptoms**

-VM creation fails with quota exceeded errors.
+VM creation fails and generates "quota exceeded" errors.

 **Debugging steps**

@@ -349,12 +349,15 @@ Run the following command:
 az vm list-usage --location <region> --query "[?currentValue >= limit]"
 ```

-**Solutions**
+**Solution**

-Solutions include:
+Possible solutions include:

-- Request quota increases through Azure portal.
-- Expand nodepool custom resource definitions (CRDs) to more VM sizes. For more information, see [NodePool configuration documentation][nap-nodepool-docs]. For example, a nodepool specification that allows for D-family VM is less likely to hit quota errors that stop VM creation compared to a nodepool specification specific to only one exact VM size.
+- Request quota increases through the Azure portal.
+- Expand nodepool custom resource definitions (CRDs) to include more VM sizes. For more information, see [NodePool configuration documentation][nap-nodepool-docs]. For example, nodepool specification A is less likely than nodepool specification B to trigger quota errors that stop VM creation if A includes D-family VMs and B is specific to only one VM size.
+
+[!INCLUDE [Third-party disclaimer](~/includes/third-party-disclaimer.md)]
+
+[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]

 [!INCLUDE [Azure Help Support](~/includes/azure-help-support.md)]
-[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]
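The nodepool expansion suggested in the hunk above can be expressed as a broader requirement set on the NodePool resource. The following is a minimal sketch, not the article's own example; the API version, label keys, and `AKSNodeClass` reference follow upstream Karpenter and Azure provider conventions and should be checked against the CRDs installed on your cluster:

```bash
# Illustrative only: allow several SKU families instead of pinning one exact VM size.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E", "F"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF
```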
