Commit 92c21f2

Add guide to troubleshoot disk latency issues
Signed-off-by: Burak Ok <[email protected]>
1 parent 8386904 commit 92c21f2

2 files changed
Lines changed: 205 additions & 0 deletions

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
---
title: Identify containers causing high disk I/O latency in AKS clusters
description: Learn how to use Inspektor Gadget to identify which containers and pods are causing high disk I/O latency in Azure Kubernetes Service clusters.
ms.date: 07/16/2025
ms.author: burakok
ms.reviewer: burakok, mayasingh
ms.service: azure-kubernetes-service
ms.custom: sap:Node/node pool availability and performance
---
# Troubleshoot high disk I/O latency in AKS clusters

Disk I/O latency can severely impact the performance and reliability of workloads running in AKS clusters. This article shows how to use the open source project [Inspektor Gadget](https://inspektor-gadget.io/) to identify which containers and pods are causing high disk I/O latency in Azure Kubernetes Service (AKS).

Inspektor Gadget provides eBPF-based gadgets that help you observe and troubleshoot disk I/O issues in Kubernetes environments.
## Symptoms

You might suspect disk I/O latency issues when you observe the following behaviors in your AKS cluster:

- Applications become unresponsive during file operations
- [Azure portal metrics](/azure/aks/monitor-aks-reference#supported-metrics-for-microsoftcomputevirtualmachines) (`Data Disk Bandwidth Consumed Percentage` and `Data Disk IOPS Consumed Percentage`) or other system monitoring shows high disk utilization with low throughput
- Database operations take significantly longer than expected
- Pod logs show file system operation errors or timeouts
## Prerequisites

- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line tool. To install kubectl by using [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- Access to your AKS cluster with sufficient permissions to run privileged pods.
- The open source project [Inspektor Gadget](../logs/capture-system-insights-from-aks.md#what-is-inspektor-gadget) for eBPF-based observability. For more information, see [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster).

> [!NOTE]
> The `top_blockio` gadget requires kernel version 6.5 or later. You can verify the kernel version of your AKS nodes by running `kubectl get nodes -o wide` and checking the KERNEL-VERSION column.
## Troubleshooting checklist

### Step 1: Profile disk I/O latency with `profile_blockio`

The `profile_blockio` gadget gathers information about block device I/O usage and generates a histogram distribution of I/O latency when the gadget is stopped. This helps you visualize disk I/O performance and identify latency patterns.

```console
kubectl gadget run profile_blockio --node <node-name>
```

> [!NOTE]
> The `profile_blockio` gadget requires that you specify a node with the `--node` parameter. You can get node names by running `kubectl get nodes`.

**Baseline example** (empty cluster with minimal activity):

```
latency
        µs               : count    distribution
         0 -> 1          : 0        |                                        |
         1 -> 2          : 0        |                                        |
         2 -> 4          : 0        |                                        |
         4 -> 8          : 0        |                                        |
         8 -> 16         : 0        |                                        |
        16 -> 32         : 0        |                                        |
        32 -> 64         : 70       |                                        |
        64 -> 128        : 22       |                                        |
       128 -> 256        : 6        |                                        |
       256 -> 512        : 16       |                                        |
       512 -> 1024       : 1017     |*********                               |
      1024 -> 2048       : 2205     |********************                    |
      2048 -> 4096       : 2740     |**************************              |
      4096 -> 8192       : 1128     |**********                              |
      8192 -> 16384      : 708      |******                                  |
     16384 -> 32768      : 4211     |****************************************|
     32768 -> 65536      : 129      |*                                       |
     65536 -> 131072     : 185      |*                                       |
    131072 -> 262144     : 402      |***                                     |
    262144 -> 524288     : 112      |*                                       |
    524288 -> 1048576    : 0        |                                        |
   1048576 -> 2097152    : 0        |                                        |
   2097152 -> 4194304    : 0        |                                        |
   4194304 -> 8388608    : 0        |                                        |
   8388608 -> 16777216   : 0        |                                        |
  16777216 -> 33554432   : 0        |                                        |
  33554432 -> 67108864   : 0        |                                        |
```

**High disk I/O stress example** (with `stress-ng --hdd 10 --io 10` running):
```
latency
        µs               : count    distribution
         0 -> 1          : 0        |                                        |
         1 -> 2          : 0        |                                        |
         2 -> 4          : 0        |                                        |
         4 -> 8          : 0        |                                        |
         8 -> 16         : 42       |                                        |
        16 -> 32         : 236      |                                        |
        32 -> 64         : 558      |*                                       |
        64 -> 128        : 201      |                                        |
       128 -> 256        : 147      |                                        |
       256 -> 512        : 62       |                                        |
       512 -> 1024       : 2660     |******                                  |
      1024 -> 2048       : 6376     |***************                         |
      2048 -> 4096       : 8374     |********************                    |
      4096 -> 8192       : 3912     |*********                               |
      8192 -> 16384      : 2099     |*****                                   |
     16384 -> 32768      : 16703    |****************************************|
     32768 -> 65536      : 1718     |****                                    |
     65536 -> 131072     : 5758     |*************                           |
    131072 -> 262144     : 9552     |**********************                  |
    262144 -> 524288     : 6778     |****************                        |
    524288 -> 1048576    : 347      |                                        |
   1048576 -> 2097152    : 16       |                                        |
   2097152 -> 4194304    : 0        |                                        |
   4194304 -> 8388608    : 0        |                                        |
   8388608 -> 16777216   : 0        |                                        |
  16777216 -> 33554432   : 0        |                                        |
  33554432 -> 67108864   : 0        |                                        |
```

**Interpreting the results**: Compare the baseline and stress scenarios:

- **Baseline**: Most operations (4,211) fall in the 16-32 ms range, typical for normal system activity
- **Under stress**: Significantly more operations land in higher latency ranges (9,552 operations in 131-262 ms and 6,778 in 262-524 ms)
- **Performance degradation**: The stress test shows operations extending into the 500 ms-2 s range, indicating disk saturation
- **Concerning signs**: Look for high counts above 100 ms (100,000 µs), which can indicate disk performance issues
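To quantify the shift toward the slow tail, you can total the histogram buckets yourself. The following Python sketch is not part of Inspektor Gadget; the bucket counts are transcribed from the two histograms above (zero rows omitted), and it computes the share of operations at or above a given latency:

```python
# Nonzero (lower_us, upper_us, count) buckets transcribed from the two
# profile_blockio histograms above.
baseline = [
    (32, 64, 70), (64, 128, 22), (128, 256, 6), (256, 512, 16),
    (512, 1024, 1017), (1024, 2048, 2205), (2048, 4096, 2740),
    (4096, 8192, 1128), (8192, 16384, 708), (16384, 32768, 4211),
    (32768, 65536, 129), (65536, 131072, 185), (131072, 262144, 402),
    (262144, 524288, 112),
]
stress = [
    (8, 16, 42), (16, 32, 236), (32, 64, 558), (64, 128, 201),
    (128, 256, 147), (256, 512, 62), (512, 1024, 2660),
    (1024, 2048, 6376), (2048, 4096, 8374), (4096, 8192, 3912),
    (8192, 16384, 2099), (16384, 32768, 16703), (32768, 65536, 1718),
    (65536, 131072, 5758), (131072, 262144, 9552), (262144, 524288, 6778),
    (524288, 1048576, 347), (1048576, 2097152, 16),
]

def share_at_or_above(buckets, threshold_us):
    """Fraction of I/O operations whose bucket starts at or above threshold_us."""
    total = sum(count for _, _, count in buckets)
    return sum(count for lo, _, count in buckets if lo >= threshold_us) / total

# Share of operations that took roughly 131 ms or longer:
print(f"baseline: {share_at_or_above(baseline, 131_072):.0%}, "
      f"stress: {share_at_or_above(stress, 131_072):.0%}")
```

Under stress, roughly a quarter of all operations landed in the 131 ms-and-above tail, versus about 4% at baseline.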
### Step 2: Find top disk I/O consumers with `top_blockio`

The `top_blockio` gadget provides a periodic list of pods and containers with the highest disk I/O operations. This gadget requires kernel version 6.5 or later (available on Azure Linux 3).

```console
kubectl gadget run top_blockio --namespace <namespace>
```

Sample output:

```
K8S.NODE                     K8S.NAMESPACE K8S.PODNAME K8S.CONTAINERNAME COMM PID TID MAJOR MINOR BYTES     US        IO  RW
aks-nodepool1-…99-vmss000000                                                  0   0   8     0     173707264 153873788 357 write
aks-nodepool1-…99-vmss000000                                                  0   0   8     0     24576     1549      6   read
```

Identify containers with unusually high BYTES, US (total time spent, in microseconds), or IO counts, which can indicate high disk activity. In this example, there's significant write activity (about 174 MB) with considerable time spent (about 154 seconds in total).

> [!NOTE]
> Empty K8S.NAMESPACE, K8S.PODNAME, and K8S.CONTAINERNAME fields can occur during kernel-space-initiated operations or high-volume I/O. You can still use the `top_file` gadget for detailed process information when these fields are empty.
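The raw BYTES and US columns are easier to reason about in human units. A minimal Python sketch (the numbers are copied from the write row above; US is treated here as accumulated per-operation latency, which overlapping I/O can make larger than wall-clock time):

```python
# Raw columns from the top_blockio write row above: BYTES, US (total time in
# microseconds), and IO (operation count).
bytes_written, total_us, io_count = 173_707_264, 153_873_788, 357

mb_written = bytes_written / 1_000_000        # total volume in MB
avg_latency_ms = total_us / io_count / 1_000  # accumulated latency per I/O
avg_io_size_kb = bytes_written / io_count / 1_000

print(f"{mb_written:.1f} MB written, "
      f"avg {avg_latency_ms:.0f} ms/op, avg {avg_io_size_kb:.0f} KB/op")
```

An average of several hundred milliseconds per write operation is a strong sign the disk is saturated.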

### Step 3: Identify files causing high disk activity with `top_file`

The `top_file` gadget periodically reports read/write activity by file, helping you identify specific files that are causing high disk activity.

```console
kubectl gadget run top_file --namespace <namespace>
```

Sample output:

```
K8S.NODE                     K8S.NAMESPACE K8S.PODNAME K8S.CONTAINERNAME COMM   PID   TID   READS WRITES FILE           T RBYTES WBYTES
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49258 49258 0     17     /stress.ADneNJ R 0 B    18 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49254 49254 0     20     /stress.LEbDOb R 0 B    21 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49252 49252 0     18     /stress.eMOjmP R 0 B    19 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49264 49264 0     22     /stress.fLHpBC R 0 B    23 MB
...
```

This output shows which files are being accessed most frequently, helping you pinpoint the specific files contributing to disk latency. In this example, the stress-hdd pod is creating multiple temporary files with significant write activity (18-23 MB each).
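To gauge the combined impact of a pod's files, you can total the per-file write volumes. A minimal sketch (values transcribed from the rows shown above; the real output contains more rows, as the ellipsis indicates):

```python
# (file, wbytes_mb) pairs transcribed from the top_file rows above.
rows = [
    ("/stress.ADneNJ", 18), ("/stress.LEbDOb", 21),
    ("/stress.eMOjmP", 19), ("/stress.fLHpBC", 23),
]
total_mb = sum(mb for _, mb in rows)
print(f"stress-hdd wrote {total_mb} MB across {len(rows)} temp files "
      f"in this interval")
```

Even the four visible rows account for over 80 MB of writes in a single reporting interval, which matches the heavy write pattern seen in the earlier steps.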

### Root cause analysis workflow

By combining all three gadgets, you can trace disk latency issues from symptoms to root cause:

1. **`profile_blockio`** identifies that disk latency exists (high counts in the 100 ms+ ranges)
2. **`top_blockio`** shows which processes are consuming the most disk I/O (174 MB of writes with 154 seconds of total time spent)
3. **`top_file`** reveals the specific files and commands causing the issue (the stress command creating /stress.* files)

This complete visibility allows you to:

- **Identify the problematic pod**: The `stress-hdd` pod in the `default` namespace
- **Find the specific process**: The `stress` command with PIDs 49258, 49254, and so on
- **Locate the problematic files**: Multiple `/stress.*` temporary files of 18-23 MB each
- **Understand the I/O pattern**: Heavy write operations creating temporary files

With this information, you can take targeted action rather than making broad system changes.

## Next steps

Based on the results from these gadgets, you can take the following actions:

- **High latency in `profile_blockio`**: Investigate the underlying disk performance, and consider using Premium SSD or Ultra Disk storage
- **High I/O operations in `top_blockio`**: Review application logic to optimize disk access patterns, or implement caching
- **Specific files in `top_file`**: Analyze whether files can be moved to faster storage or cached, or whether application logic can be optimized

For further troubleshooting:

- Move disk-intensive workloads to dedicated node pools with faster storage
- Implement application-level caching to reduce disk I/O
- Consider using Azure managed services (such as Azure Database) for data-intensive operations

## Related content

- [Inspektor Gadget documentation](https://inspektor-gadget.io/docs/latest/gadgets/)
- [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster)
- [Troubleshoot high memory consumption in disk-intensive applications](high-memory-consumption-disk-intensive-applications.md)

[!INCLUDE [Third-party information disclaimer](../../../includes/third-party-disclaimer.md)]
[!INCLUDE [Third-party contact information disclaimer](../../../includes/third-party-contact-disclaimer.md)]
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
support/azure/azure-kubernetes/toc.yml

Lines changed: 2 additions & 0 deletions

@@ -168,6 +168,8 @@
  href: availability-performance/identify-high-cpu-consuming-containers-aks.md
- name: Identify containers facing high CPU pressure and throttling
  href: availability-performance/troubleshoot-node-cpu-pressure-psi.md
- name: Identify nodes and containers creating high disk latency
  href: availability-performance/identify-high-disk-io-latency-containers-aks.md
- name: Identify memory saturation in AKS clusters
  href: availability-performance/identify-memory-saturation-aks.md
- name: Troubleshoot high memory consumption in disk-intensive applications
