Commit fabc725

Update troubleshoot-node-auto-provision.md

1 parent 2be9194 commit fabc725

1 file changed: support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md

Lines changed: 89 additions & 62 deletions
@@ -78,11 +78,11 @@ Common causes include:

Solutions include:

- Add proper tolerations to pods.
- Review `DaemonSet` configurations.
- Adjust pod disruption budgets (PDBs) to allow disruption.
- Remove `do-not-disrupt` annotations if appropriate.
- Review lock configurations.
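The first bullet can be sketched with `kubectl patch`; the deployment name (`my-app`) and the taint key and value are hypothetical placeholders.

```azurecli-interactive
# Hypothetical example: tolerate a "workload-type=batch" NoSchedule taint
# so the deployment's pods can schedule onto nodes carrying that taint.
kubectl patch deployment my-app --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
    {"key": "workload-type", "operator": "Equal", "value": "batch", "effect": "NoSchedule"}
  ]}
]'

# Confirm the toleration was added.
kubectl get deployment my-app -o jsonpath='{.spec.template.spec.tolerations}'
```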

## Networking issues

@@ -188,7 +188,7 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100

Common causes include:

- Network security group (NSG) rules.
- Incorrect subnet configuration.
- CNI plugin issues.
- DNS resolution problems.
@@ -197,21 +197,26 @@ Common causes include:

Solutions include:

- Review [network security group][network-security-group-docs] rules for required traffic.
- Verify subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
- Restart CNI plugin pods.
- Check `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].
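To start the NSG review from the CLI, something like the following works; the resource group and NSG names are placeholders, and the pod restart assumes the `k8s-app=azure-cns` label used earlier in this article.

```azurecli-interactive
# List rules on the NSG attached to the node subnet (placeholder names).
az network nsg rule list --resource-group <node-rg> --nsg-name <nsg-name> --output table

# Restart CNI plugin pods by deleting them; the DaemonSet recreates them.
kubectl delete pods -n kube-system -l k8s-app=azure-cns
```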


### DNS service IP issues

>[!NOTE]
>The `--dns-service-ip` parameter is only supported for NAP (Node Auto Provisioning) clusters and isn't available for self-hosted Karpenter installations.

**Symptoms**

Pods can't resolve DNS names, or kubelet fails to register with the API server because of DNS resolution failures.

**Debugging steps**

1. **Check kubelet DNS configuration**

Run the following command:

```azurecli-interactive
# SSH to the Karpenter node and check kubelet config
sudo cat /var/lib/kubelet/config.yaml | grep -A 5 clusterDNS
@@ -221,7 +226,10 @@ sudo cat /var/lib/kubelet/config.yaml | grep -A 5 clusterDNS
# - "10.0.0.10" # This should match your cluster's DNS service IP
```

2. **Verify DNS service IP matches cluster configuration**

Run the following command:

```azurecli-interactive
# Get the actual DNS service IP from your cluster
kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
@@ -230,7 +238,10 @@ kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
az aks show --resource-group <rg> --name <cluster-name> --query "networkProfile.dnsServiceIp" -o tsv
```

3. **Test DNS resolution from the node**

Run the following command:

```azurecli-interactive
# SSH to the Karpenter node and test DNS resolution
# Test using the DNS service IP directly
@@ -243,7 +254,10 @@ nslookup kubernetes.default.svc.cluster.local
dig azure.com
```

4. **Check DNS pods status**

Run the following command:

```azurecli-interactive
# Verify CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
@@ -252,82 +266,95 @@ kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

5. **Validate network connectivity to DNS service**

Run the following command:

```azurecli-interactive
# From the Karpenter node, test connectivity to DNS service
telnet 10.0.0.10 53 # Replace with your actual DNS service IP
# Or using nc if telnet is not available
nc -zv 10.0.0.10 53
```

**Common causes**

Common causes include:

- Incorrect `--dns-service-ip` parameter in `AKSNodeClass`.
- DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
- Network connectivity issues between node and DNS service.
- `CoreDNS` pods not running or misconfigured.
- Firewall rules block DNS traffic.

**Solutions**

Solutions include:

- Verify that `--dns-service-ip` matches the actual DNS service by running the following command: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
- Ensure the DNS service IP is within the service CIDR range specified during cluster creation.
- Check that Karpenter nodes can reach the service subnet.
- Restart `CoreDNS` pods if they're in an error state by running the following command: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify NSG rules allow traffic on port 53 (TCP and UDP).
- Run a connectivity analysis with [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to validate outbound connectivity.
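To separate pod-level DNS failures from node-level ones, a disposable test pod is often quicker than SSH. This sketch uses the `jessie-dnsutils` test image from the upstream Kubernetes DNS debugging guide; substitute any image that includes `nslookup`.

```azurecli-interactive
# Run a throwaway pod and resolve the API server's in-cluster DNS name.
kubectl run dnsutils --rm -it --restart=Never \
    --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
    -- nslookup kubernetes.default.svc.cluster.local
```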

## Azure-specific issues

### Spot virtual machine (VM) issues

**Symptoms**

Unexpected node terminations occur when using spot instances.

**Debugging steps**

1. **Check node events**

Run the following command:

```azurecli-interactive
kubectl get events | grep -i "spot\|evict"
```

2. **Monitor spot VM pricing**

Run the following command:

```azurecli-interactive
az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
```

**Solutions**

Solutions include:

- Use diverse instance types for better availability.
- Implement proper pod disruption budgets.
- Consider mixed spot and on-demand strategies.
- Use workloads tolerant of node preemption.
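A minimal pod disruption budget for the second bullet can be created directly from the CLI; the budget name and the `app=my-app` selector are hypothetical.

```azurecli-interactive
# Keep at least one replica available during voluntary disruptions
# such as spot consolidation or node drain.
kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=1

# Inspect the resulting budget.
kubectl get pdb my-app-pdb
```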

### Quota exceeded

**Symptoms**

VM creation fails with quota exceeded errors.

**Debugging steps**

1. **Check current quota usage**

Run the following command:

```azurecli-interactive
az vm list-usage --location <region> --query "[?currentValue >= limit]"
```

**Solutions**

Solutions include:

- Request quota increases through the Azure portal.
- Expand the `NodePool` custom resource definition (CRD) to more VM sizes. For more information, see [NodePool configuration documentation][nap-nodepool-docs]. For example, a `NodePool` specification that allows D-family VMs is less likely to hit quota errors that stop VM creation than one limited to a single exact VM size.

[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/
[karpenter-faq]: https://karpenter.sh/docs/faq/
[network-security-group-docs]: /azure/virtual-network/network-security-groups-overview
[aksnodeclass-subnet-config]: /azure/aks/node-autoprovision-aksnodeclass#virtual-network-subnet-configuration
[nap-nodepool-docs]: /azure/aks/node-autoprovision-node-pools
[nap-main-docs]: /azure/aks/node-autoprovision
[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting
[aks-container-metrics]: /azure/aks/container-network-observability-metrics
[advanced-container-network-metrics]: /azure/aks/advanced-container-networking-services-overview
[connectivity-tool]: /azure/azure-kubernetes/connectivity/basic-troubleshooting-outbound-connections#check-if-azure-network-resources-are-blocking-traffic-to-the-endpoint
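As a sketch of widening a `NodePool` to a whole SKU family, the following assumes the `karpenter.azure.com/sku-family` requirement key and an `AKSNodeClass` named `default`; treat it as illustrative, not a drop-in manifest.

```azurecli-interactive
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        # Allow any D-family size instead of pinning one exact VM size.
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D"]
EOF
```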

[!INCLUDE [Azure Help Support](~/includes/azure-help-support.md)]

[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]
