
Commit d7558af

Learn Build Service GitHub App authored and committed
Merging changes synced from https://github.com/MicrosoftDocs/SupportArticles-docs-pr (branch live)
2 parents e466702 + e0377f2

5 files changed

Lines changed: 218 additions & 2 deletions


support/azure/app-service/temporary-storage-for-azure-app-service.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ If an application is reporting high resource consumption, the source of the prob
This video demonstrates how to narrow the scope of your troubleshooting to temporary storage, and the next steps for investigating and troubleshooting performance issues.

-> [!VIDEO <https://www.youtube.com/embed/bk8h-VYaIXs>]
+> [!VIDEO https://www.youtube.com/embed/bk8h-VYaIXs]

## Related content

support/azure/azure-kubernetes/availability-performance/identify-high-disk-io-latency-containers-aks.md

Lines changed: 201 additions & 0 deletions

@@ -0,0 +1,201 @@
---
title: Identify containers causing high disk I/O latency in AKS clusters
description: Learn how to identify which containers and pods are causing high disk I/O latency in your Azure Kubernetes Service clusters to easily troubleshoot issues using the open source project Inspektor Gadget.
ms.date: 07/16/2025
ms.author: burakok
ms.reviewer: burakok, mayasingh, blanquicet
ms.service: azure-kubernetes-service
ms.custom: sap:Node/node pool availability and performance
---
# Troubleshoot high disk I/O latency in AKS clusters

Disk I/O latency can severely impact the performance and reliability of workloads running in Azure Kubernetes Service (AKS) clusters. This article shows how to use the open source project [Inspektor Gadget](https://aka.ms/ig-website) to identify which containers and pods are causing high disk I/O latency in AKS.

Inspektor Gadget provides eBPF-based gadgets that help you observe and troubleshoot disk I/O issues in Kubernetes environments.

## Symptoms

You may suspect disk I/O latency issues when you observe the following behaviors in your AKS cluster:

- Applications become unresponsive during file operations
- [Azure portal metrics](/azure/aks/monitor-aks-reference#supported-metrics-for-microsoftcomputevirtualmachines) (`Data Disk Bandwidth Consumed Percentage` and `Data Disk IOPS Consumed Percentage`) or other system monitoring shows high disk utilization with low throughput (see the sketch after this list)
- Database operations take significantly longer than expected
- Pod logs show file system operation errors or timeouts

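As a quick way to check the data disk metrics mentioned above from the command line, the following is a minimal sketch, assuming the node pool's virtual machine scale set in the node resource group is the target resource (the resource ID and time grain are placeholders):

```console
# Hypothetical example: query the node pool VMSS for data disk saturation metrics
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<node-resource-group>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss-name>" \
  --metric "Data Disk IOPS Consumed Percentage" "Data Disk Bandwidth Consumed Percentage" \
  --aggregation Average \
  --interval PT5M
```
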
## Prerequisites

- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line tool. To install kubectl by using [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- Access to your AKS cluster with sufficient permissions to run privileged pods.
- The open source project [Inspektor Gadget](../logs/capture-system-insights-from-aks.md#what-is-inspektor-gadget) for eBPF-based observability. For more information, see [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster).

> [!NOTE]
> The `top_blockio` gadget requires kernel version 6.5 or later. You can verify your AKS node kernel version by running `kubectl get nodes -o wide` and checking the **KERNEL-VERSION** column.

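If Inspektor Gadget isn't deployed yet, the following is a minimal sketch of one common installation path, assuming the kubectl `krew` plugin manager is available; see the installation article linked above for the full steps:

```console
# Install the kubectl-gadget plugin and deploy Inspektor Gadget to the current cluster
kubectl krew install gadget
kubectl gadget deploy

# Confirm the node kernel version (KERNEL-VERSION column) meets the 6.5 requirement for top_blockio
kubectl get nodes -o wide
```
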
## Troubleshooting checklist

### Step 1: Profile disk I/O latency with `profile_blockio`

The [`profile_blockio`](https://aka.ms/ig-profile-blockio) gadget gathers information about block device I/O usage and periodically generates a histogram distribution of I/O latency. This histogram helps you visualize disk I/O performance and identify latency patterns, and it provides evidence to support or rule out the hypothesis that the symptoms you're seeing are caused by disk I/O issues.

```console
kubectl gadget run profile_blockio --node <node-name>
```

> [!NOTE]
> The `profile_blockio` gadget requires that you specify a node by using the `--node` parameter. You can get node names by running `kubectl get nodes`.

**Baseline example** (empty cluster with minimal activity):

```
latency
        µs               : count    distribution
         0 -> 1          : 0        |                                        |
         1 -> 2          : 0        |                                        |
         2 -> 4          : 0        |                                        |
         4 -> 8          : 0        |                                        |
         8 -> 16         : 0        |                                        |
        16 -> 32         : 0        |                                        |
        32 -> 64         : 70       |                                        |
        64 -> 128        : 22       |                                        |
       128 -> 256        : 6        |                                        |
       256 -> 512        : 16       |                                        |
       512 -> 1024       : 1017     |*********                               |
      1024 -> 2048       : 2205     |********************                    |
      2048 -> 4096       : 2740     |**************************              |
      4096 -> 8192       : 1128     |**********                              |
      8192 -> 16384      : 708      |******                                  |
     16384 -> 32768      : 4211     |****************************************|
     32768 -> 65536      : 129      |*                                       |
     65536 -> 131072     : 185      |*                                       |
    131072 -> 262144     : 402      |***                                     |
    262144 -> 524288     : 112      |*                                       |
    524288 -> 1048576    : 0        |                                        |
   1048576 -> 2097152    : 0        |                                        |
   2097152 -> 4194304    : 0        |                                        |
   4194304 -> 8388608    : 0        |                                        |
   8388608 -> 16777216   : 0        |                                        |
  16777216 -> 33554432   : 0        |                                        |
  33554432 -> 67108864   : 0        |                                        |
```

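If you want to generate a comparable load in a test cluster, a minimal sketch is shown below; the container image and the `stress-hdd` pod name are assumptions for illustration, and you shouldn't run this against production nodes:

```console
# Hypothetical test-only workload: generate heavy disk I/O on whichever node schedules the pod
kubectl run stress-hdd --image=docker.io/colinianking/stress-ng --restart=Never -- \
  --hdd 10 --io 10 --timeout 300s
```
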
**High disk I/O stress example** (with `stress-ng --hdd 10 --io 10` running to simulate I/O load):

```
latency
        µs               : count    distribution
         0 -> 1          : 0        |                                        |
         1 -> 2          : 0        |                                        |
         2 -> 4          : 0        |                                        |
         4 -> 8          : 0        |                                        |
         8 -> 16         : 42       |                                        |
        16 -> 32         : 236      |                                        |
        32 -> 64         : 558      |*                                       |
        64 -> 128        : 201      |                                        |
       128 -> 256        : 147      |                                        |
       256 -> 512        : 62       |                                        |
       512 -> 1024       : 2660     |******                                  |
      1024 -> 2048       : 6376     |***************                         |
      2048 -> 4096       : 8374     |********************                    |
      4096 -> 8192       : 3912     |*********                               |
      8192 -> 16384      : 2099     |*****                                   |
     16384 -> 32768      : 16703    |****************************************|
     32768 -> 65536      : 1718     |****                                    |
     65536 -> 131072     : 5758     |*************                           |
    131072 -> 262144     : 9552     |**********************                  |
    262144 -> 524288     : 6778     |****************                        |
    524288 -> 1048576    : 347      |                                        |
   1048576 -> 2097152    : 16       |                                        |
   2097152 -> 4194304    : 0        |                                        |
   4194304 -> 8388608    : 0        |                                        |
   8388608 -> 16777216   : 0        |                                        |
  16777216 -> 33554432   : 0        |                                        |
  33554432 -> 67108864   : 0        |                                        |
```

**Interpreting the results**: To identify which nodes are under I/O pressure, compare the baseline and stress scenarios:

- **Baseline**: Most operations (4,211) fall in the 16-32 ms range, which is typical for normal system activity
- **Under stress**: Significantly more operations fall in higher latency ranges (9,552 operations in 131-262 ms, 6,778 in 262-524 ms)
- **Performance degradation**: The stress test shows operations extending into the 500 ms-2 s range, indicating disk saturation
- **Concerning signs**: Look for high counts above 100 ms (100,000 µs), which might indicate disk performance issues

### Step 2: Find top disk I/O consumers with `top_blockio`

The [`top_blockio`](https://aka.ms/ig-top-blockio) gadget provides a periodic list of containers with the highest disk I/O operations. Optionally, you can limit tracing to the node that you identified in Step 1. This gadget requires kernel version 6.5 or later (available on [Azure Linux Container Host clusters](/azure/aks/use-azure-linux)).

```console
kubectl gadget run top_blockio --namespace <namespace> --sort -bytes [--node <node-name>]
```

Sample output:

```
K8S.NODE                     K8S.NAMESPACE K8S.PODNAME K8S.CONTAINERNAME COMM   PID  TID  MAJOR MINOR BYTES     US        IO    RW
aks-nodepool1-…99-vmss000000                                                    0    0    8     0     173707264 153873788 11954 write
aks-nodepool1-…99-vmss000000                                                    0    0    8     0     352256    85222     36    read
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 324… 324… 8     0     131072    4450      1     write
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 324… 324… 8     0     131072    3651      1     write
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 324… 324… 8     0     4096      4096      1     write
```

From the output, you can identify containers with an unusually high number of bytes read from or written to disk (`BYTES` column), time spent on read/write operations (`US` column), or number of I/O operations (`IO` column), which might indicate high disk activity. In this example, there's significant write activity (173 MB) with considerable time spent (about 154 seconds in total).

> [!NOTE]
> Empty K8S.NAMESPACE, K8S.PODNAME, and K8S.CONTAINERNAME fields can occur for kernel-space-initiated operations or during high-volume I/O. You can still use the `top_file` gadget for detailed process information when these fields are empty.

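If you want to surface the containers that spend the most time waiting on disk rather than the ones moving the most bytes, you can change the sort order; the `us` field name here is an assumption based on the `US` column shown above:

```console
# Hypothetical variation: sort by time spent on block I/O instead of bytes
kubectl gadget run top_blockio --namespace <namespace> --sort -us [--node <node-name>]
```
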
### Step 3: Identify files causing high disk activity with `top_file`

The [`top_file`](https://aka.ms/ig-top-file) gadget periodically reports read/write activity by file, helping you identify the specific files and processes in your containers that are causing high disk activity.

```console
kubectl gadget run top_file --namespace <namespace> --max-entries 20 --sort -wbytes_raw,-rbytes_raw
```

Sample output:

```
K8S.NODE                     K8S.NAMESPACE K8S.PODNAME K8S.CONTAINERNAME COMM   PID   TID   READS WRITES FILE           T RBYTES WBYTES
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49258 49258 0     17     /stress.ADneNJ R 0 B    18 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49254 49254 0     20     /stress.LEbDOb R 0 B    21 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49252 49252 0     18     /stress.eMOjmP R 0 B    19 MB
aks-nodepool1-…99-vmss000000 default       stress-hdd  stress-hdd        stress 49264 49264 0     22     /stress.fLHpBC R 0 B    23 MB
...
```

This output shows which files are being accessed most frequently, helping you pinpoint which specific files a given process is reading or writing the most. In this example, the stress-hdd pod is creating multiple temporary files with significant write activity (18-23 MB each).

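To confirm the finding, you can inspect the files directly in the offending container; this minimal sketch uses the pod and file names from the example above, which will differ in your cluster:

```console
# List the temporary files that the stress process is writing
kubectl exec -n default stress-hdd -- sh -c 'ls -lh /stress.*'
```
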
### Root cause analysis workflow

By combining all three gadgets, you can trace disk latency issues from symptoms to root cause:

1. **`profile_blockio`** identifies that disk latency exists on a given node (high counts in the 100 ms+ ranges)
2. **`top_blockio`** shows which processes are generating the most disk I/O (173 MB of writes with about 154 seconds of total time spent)
3. **`top_file`** reveals the specific files and commands causing the issue (the `stress` command creating `/stress.*` files)

This complete visibility allows you to:

- **Identify the problematic pod**: `stress-hdd` pod in the `default` namespace
- **Find the specific process**: `stress` command with PIDs 49258, 49254, and so on
- **Locate the problematic files**: Multiple `/stress.*` temporary files of 18-23 MB each
- **Understand the I/O pattern**: Heavy write operations creating temporary files

With this information, you can take targeted action rather than making broad system changes.

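For example, in this scenario the targeted action could be as simple as removing the test workload that was identified, instead of resizing or rebooting the node:

```console
# Remove the offending pod identified in the example above
kubectl delete pod stress-hdd -n default
```
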
## Next steps

Based on the results from these gadgets, you can take the following actions:

- **High latency in `profile_blockio`**: Investigate the underlying disk performance. If the workload needs better disk performance, consider using [storage optimized nodes](/azure/virtual-machines/sizes/overview#storage-optimized) (see the sketch after this list).
- **High I/O operations in `top_blockio`**: Review application logic to optimize disk access patterns or implement caching.
- **Specific files in `top_file`**: Analyze whether the files can be moved to faster storage or cached, or whether the application logic can be optimized.

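The following is a minimal sketch of adding a storage optimized node pool with the Azure CLI; the resource names and the `Standard_L8s_v3` VM size are placeholders, so choose a size and count that fit your workload, and use a node selector or taints to steer the disk-heavy pods to the new pool:

```console
# Hypothetical example: add a storage optimized node pool for disk-heavy workloads
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name storagepool \
  --node-vm-size Standard_L8s_v3 \
  --node-count 1
```
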
## Related content

- [Inspektor Gadget documentation](https://inspektor-gadget.io/docs/latest/gadgets/)
- [How to install Inspektor Gadget in an AKS cluster](../logs/capture-system-insights-from-aks.md#how-to-install-inspektor-gadget-in-an-aks-cluster)
- [Troubleshoot high memory consumption in disk-intensive applications](high-memory-consumption-disk-intensive-applications.md)

[!INCLUDE [Third-party information disclaimer](../../../includes/third-party-disclaimer.md)]
[!INCLUDE [Third-party contact information disclaimer](../../../includes/third-party-contact-disclaimer.md)]
[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]

support/azure/azure-kubernetes/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ items:
    href: availability-performance/cluster-node-virtual-machine-failed-state.md
  - name: Identify containers facing high CPU pressure and throttling
    href: availability-performance/troubleshoot-node-cpu-pressure-psi.md
+ - name: Identify nodes and containers creating high disk latency
+   href: availability-performance/identify-high-disk-io-latency-containers-aks.md
  - name: Identify memory saturation in AKS clusters
    href: availability-performance/identify-memory-saturation-aks.md
  - name: Identify nodes and containers utilizing high CPU

support/azure/azure-storage/files/file-sync/file-sync-troubleshoot-cloud-tiering.md

Lines changed: 13 additions & 0 deletions
@@ -328,6 +328,19 @@ This option doesn't require removing the server endpoint but requires sufficient

3. Use the *OrphanTieredFiles.txt* output file to identify orphaned tiered files on the server.
4. Overwrite the orphaned tiered files by copying the full file from the Azure file share to the Windows Server.

## How to identify files that are excluded from File Sync

1. Open a PowerShell window as administrator.
2. Navigate to the sync share folder. Replace `<volume letter>` and `<syncShare>` with the volume letter and sync share name.

   ```powershell
   cd <volume letter>\<syncShare>\
   ```

3. Run this command:

   ```powershell
   dir desktop.ini,thumbs.db,ehthumbs.db,~$*.*,*.laccdb,*.tmp -Recurse -Force -File -ErrorAction Ignore
   ```

Alternatively, you can use the TreeSize tool. Put the same list of exclusions into the TreeSize 'filters' configuration to count the excluded content. The advantage of this approach is that TreeSize can access content that the administrator can't access with the PowerShell command, because TreeSize uses backup/restore permissions when scanning content.

## How to troubleshoot files unexpectedly recalled on a server

Antivirus, backup, and other applications that read large numbers of files cause unintended recalls unless they respect the skip offline attribute and skip reading the content of those files. Skipping offline files for products that support this option helps avoid unintended recalls during operations like antivirus scans or backup jobs.

support/mem/intune/device-configuration/factory-reset-protection-emails-not-enforced.md

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ If **Factory reset protection emails** is set to **Not configured** (default), I

> [!NOTE]
> **Android 15** introduced FRP hardening. Some OEMs previously skipped FRP in certain paths. As of Android 15, FRP enforcement now aligns with Google’s intended design.

-We recommend that you set the **Factory reset** value to **Block** to prevent users from using the factory reset option in the device settings.
+We recommend that you set the **Factory reset** value to **Block** to prevent users from using the factory reset option in the device settings. This setting is only available for fully managed and dedicated devices.

:::image type="content" source="media/factory-reset-protection-emails-not-enforced/factory-reset.png" alt-text="Screenshot of Factory reset options.":::
