
Commit 4ffe66e

Merge branch 'MicrosoftDocs:main' into Branch-CI5828
2 parents 0fe49f9 + 29a6248 commit 4ffe66e

13 files changed

Lines changed: 254 additions & 7 deletions

File tree

support/azure/azure-kubernetes/availability-performance/identify-high-cpu-consuming-containers-aks.md

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Identify CPU saturation in AKS clusters
+title: Identify high CPU utilization in AKS clusters
 description: Troubleshoot high CPU that the node and containers consume in an AKS cluster.
 ms.date: 08/30/2024
 ms.reviewer: chiragpa, v-weizhu
@@ -8,6 +8,9 @@ ms.custom: sap:Node/node pool availability and performance
 ---
 # Troubleshoot high CPU usage in AKS clusters

+> [!NOTE]
+> This article discusses high CPU utilization. In many situations, CPU Pressure Stall Information (PSI) metrics provide a more accurate indication of CPU pressure than utilization alone. For more information, see [Troubleshoot CPU pressure in AKS clusters using PSI metrics](troubleshoot-node-cpu-pressure-psi.md).
+
 High CPU usage is a symptom of one or more applications or processes that require so much CPU time that the performance or usability of the machine is impacted. High CPU usage can occur in many ways, but it's mostly caused by user configuration.

 When a node in an [Azure Kubernetes Service (AKS)](/azure/aks/intro-kubernetes) cluster experiences high CPU usage, the applications running on it can experience degradation in performance and reliability. Applications or processes also become unstable, which may lead to issues beyond slow responses.
[Image file changed: 96.5 KB (preview not available)]
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
---
title: Troubleshoot CPU Pressure in AKS Clusters Using PSI Metrics
description: Provides troubleshooting guidance for CPU pressure using PSI metrics in an AKS cluster.
ms.date: 05/21/2025
ms.reviewer: aritraghosh, dafell, alvinli, v-weizhu
ms.service: azure-kubernetes-service
ms.custom: sap:Node/node pool availability and performance
---

# Troubleshoot CPU pressure in AKS clusters using PSI metrics

CPU pressure is a more accurate indicator of resource contention than traditional CPU utilization metrics. Although high CPU usage shows resource consumption, it doesn't necessarily indicate performance problems. In an Azure Kubernetes Service (AKS) cluster, understanding CPU pressure through Pressure Stall Information (PSI) metrics helps identify true resource contention issues.

When a node in an AKS cluster experiences CPU pressure, applications might suffer from poor performance even when CPU utilization appears moderate. PSI metrics provide insight into actual resource contention by measuring task delays rather than just resource consumption.

This article helps you monitor CPU pressure by using PSI metrics and provides best practices to resolve resource contention issues.

## Symptoms

The following table outlines the common symptoms of CPU pressure:

|Symptom | Description |
|---|---|
|Increased application latency|Services respond more slowly even when CPU utilization appears moderate.|
|Throttled containers|Containers experience delays in processing despite having CPU resources available on the node.|
|Degraded performance|Applications experience unpredictable performance variations that don't correlate with CPU usage percentages.|

## Troubleshooting checklist

To identify and resolve CPU pressure issues, follow these steps:

### Step 1: Enable and monitor PSI metrics

Use one of the following methods to access PSI metrics:

- In a web browser, use Azure Monitor managed Prometheus or another monitoring solution to query PSI metrics.
- In a console, use the Kubernetes command-line tool (`kubectl`).

### [Browser](#tab/browser)

Azure Monitor managed Prometheus provides a way to monitor PSI metrics:

1. Enable Azure Monitor managed Prometheus for your AKS cluster by following the instructions in [Enable Prometheus and Grafana](/azure/azure-monitor/containers/kubernetes-monitoring-enable#enable-prometheus-and-grafana).

   To enable customized scrape metrics for Prometheus, see [Scrape configs](/azure/azure-monitor/containers/prometheus-metrics-scrape-configuration#scrape-configs). We recommend setting the minimal ingestion profile to `false` and node exporter scraping to `true`.

2. In the [Azure portal](https://portal.azure.com), navigate to the Azure Monitor workspace that's associated with the AKS cluster.

   :::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png" alt-text="Screenshot that shows how to navigate to the Azure Monitor workspace." lightbox="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png":::

3. Under **Monitoring**, select **Metrics**.

4. Select **Prometheus metrics** as the data source.

   > [!NOTE]
   > To use the metrics, you need to enable them in Azure Monitor managed Prometheus. These metrics are exposed by Node Exporter or cAdvisor.

5. Query specific PSI metrics in the Prometheus explorer:

   - For node-level CPU pressure, query the `node_pressure_cpu_waiting_seconds_total` metric by using Prometheus Query Language (PromQL).

     :::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png" alt-text="Screenshot that shows how to query node-level CPU pressure." lightbox="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png":::

   - For pod-level CPU pressure, query the `container_cpu_cfs_throttled_seconds_total` metric.

6. Calculate the PSI-some percentage (the percentage of time that at least one task is stalled on CPU):

   `rate(node_pressure_cpu_waiting_seconds_total[5m]) * 100`

> [!NOTE]
> Some container-level metrics, such as `container_pressure_cpu_waiting_seconds_total` and `container_pressure_cpu_stalled_seconds_total`, aren't available in AKS because they're part of the Kubelet PSI feature gate, which is in the alpha state. AKS will begin supporting the feature when it reaches the beta stage.
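If you prefer to script the PSI-some check from step 6, you can send the same PromQL expression to any Prometheus-compatible query API. The following is a minimal sketch that assumes a locally reachable Prometheus endpoint (for example, through port-forwarding); the URL is a placeholder, and querying an Azure Monitor workspace endpoint additionally requires authentication.

```bash
# Sketch: evaluate the PSI-some percentage over the last 5 minutes against a
# Prometheus-compatible HTTP API. Replace the placeholder URL with your own
# reachable Prometheus server.
curl -s "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=rate(node_pressure_cpu_waiting_seconds_total[5m]) * 100'
```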
### [Command Line](#tab/command-line)

Access PSI metrics safely by using kubectl, without requiring Secure Shell (SSH) access:

1. Use the Kubernetes proxy and the node metrics API:

   ```bash
   # Start the Kubernetes API proxy in a separate terminal
   kubectl proxy

   # Access the node metrics API
   kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
   ```

2. For more detailed PSI metrics, use the `kubectl debug` feature to create a temporary debug pod:

   ```bash
   # Create a debug pod that mounts the host filesystem
   kubectl debug node/<node_name> -it --image=busybox

   # Once inside the debug pod, check PSI metrics
   cat /host/proc/pressure/cpu
   ```

   Here's an example of the command output:

   ```output
   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
   ```

   - The `some` line indicates the percentage of time that at least one task is stalled on CPU.
   - The `full` line indicates the percentage of time that all tasks are stalled on CPU.
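To catch brief pressure spikes that the 10-second, 60-second, and 300-second averages can smooth over, you can sample the PSI file repeatedly from inside the debug pod. This is a minimal sketch; adjust the interval to suit your investigation.

```bash
# Sketch: print the node's CPU PSI values every 5 seconds from inside the
# debug pod so that short-lived pressure spikes become visible.
while true; do
  date
  cat /host/proc/pressure/cpu
  sleep 5
done
```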
---

### Step 2: Review best practices to prevent CPU pressure

Review the following table to learn how to implement best practices for avoiding CPU pressure:

| Best practice | Description |
|---|---|
|Focus on PSI metrics instead of utilization|Use PSI metrics as your primary indicator of resource contention rather than CPU utilization percentages. For more information, see [PSI - Pressure Stall Information](https://docs.kernel.org/accounting/psi.html).|
|Identify pods utilizing the most CPU|Isolate the pods that are utilizing the most CPU, and identify solutions to reduce pressure. For more information, see [Troubleshoot high CPU usage in AKS clusters](./identify-high-cpu-consuming-containers-aks.md).|
|Minimize CPU limits|Consider removing CPU limits and relying on [Linux's Completely Fair Scheduler](https://docs.kernel.org/scheduler/sched-design-CFS.html) with CPU shares based on requests, as shown in the sketch after this table. For more information, see [Resource Management for Pods and Containers](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).|
|Use appropriate Quality of Service (QoS) classes|Set the right QoS class for each pod based on its importance and contention sensitivity. For more information, see [Configure Quality of Service for Pods](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/).|
|Optimize pod placement|Use pod anti-affinity rules to avoid placing CPU-intensive workloads on the same nodes. For more information, see [Assigning Pods to Nodes](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/).|
|Monitor for brief pressure spikes|Short pressure spikes can indicate issues even when average utilization appears acceptable. For more information, see [Resource metrics pipeline](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/).|
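The following manifest is a hypothetical illustration of the "minimize CPU limits" practice: the container declares a CPU request, which determines its CFS share, but sets no CPU limit, so it can consume idle CPU without being throttled. The pod name, image, and resource values are placeholders to adapt to your workload.

```bash
# Hypothetical example: a pod that sets a CPU request but no CPU limit.
# The request drives the container's CFS share; omitting the limit avoids
# CFS throttling when idle CPU is available on the node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-request-only-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "500m"
        memory: "128Mi"
EOF
```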
## Key PSI metrics to monitor

> [!NOTE]
> If a node's CPU usage is moderate but the containers on the node experience CFS throttling, increase the containers' resource limits, or remove the limits and rely on the [Linux Completely Fair Scheduler (CFS)](https://docs.kernel.org/scheduler/sched-design-CFS.html) algorithm.

### Node-level PSI metrics

- `node_pressure_cpu_waiting_seconds_total`: The cumulative time that tasks wait for CPU.
- `node_cpu_seconds_total`: Traditional CPU utilization, for comparison.

### Container-level PSI indicators

- `container_cpu_cfs_throttled_periods_total`: The number of periods in which a container is throttled.
- `container_cpu_cfs_throttled_seconds_total`: The total time that a container is throttled.
- Throttling percentage: `rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) * 100`
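You can evaluate the throttling-percentage expression the same way as the earlier node-level query. This sketch again assumes a reachable Prometheus-compatible endpoint; the URL is a placeholder.

```bash
# Sketch: per-container CFS throttling percentage over the last 5 minutes.
# Replace the placeholder URL with your own Prometheus server endpoint.
curl -s "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) * 100'
```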
## Why use PSI metrics?

AKS uses PSI metrics as an indicator of CPU pressure instead of load average for several reasons:

- On oversized, multi-core nodes, load average often underreports CPU saturation.
- On chattier, containerized nodes, load average can over-signal, leading to alert fatigue.
- Because load average doesn't have per-cgroup visibility, noisy pods can hide behind a low system-wide average.

## References

- [Linux PSI documentation](https://docs.kernel.org/accounting/psi.html)
- [Kubernetes resource management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
- [AKS performance best practices](/azure/aks/concepts-clusters-workloads)
- [Enable Prometheus and Grafana](/azure/azure-monitor/containers/kubernetes-monitoring-enable#enable-prometheus-and-grafana)
- [Quality of Service in Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/)
- [Linux Completely Fair Scheduler](https://docs.kernel.org/scheduler/sched-design-CFS.html)

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]

support/azure/azure-kubernetes/toc.yml

Lines changed: 3 additions & 1 deletion
@@ -162,8 +162,10 @@
       href: availability-performance/container-image-pull-performance.md
     - name: AKS cluster/node is in failed state
       href: availability-performance/cluster-node-virtual-machine-failed-state.md
-    - name: Identify nodes and containers consuming high CPU
+    - name: Identify nodes and containers utilizing high CPU
       href: availability-performance/identify-high-cpu-consuming-containers-aks.md
+    - name: Identify containers facing high CPU pressure and throttling
+      href: availability-performance/troubleshoot-node-cpu-pressure-psi.md
     - name: Identify memory saturation in AKS clusters
       href: availability-performance/identify-memory-saturation-aks.md
     - name: Troubleshoot high memory consumption due to Linux kernel behaviors

support/azure/azure-storage/files/file-sync/file-sync-troubleshoot-installation.md

Lines changed: 1 addition & 1 deletion
@@ -216,7 +216,7 @@ Reset-StorageSyncServer
 ```

 > [!Note]
-> If the server is part of a cluster, use the `Reset-StorageSyncServer` `-CleanClusterRegistration` parameter to remove the server from the Azure File Sync cluster registration detail.
+> If the server is part of a cluster, the `Reset-StorageSyncServer` `-CleanClusterRegistration` parameter will unregister all servers in the cluster.

 <a id="web-site-not-trusted"></a>**When I register a server, I see numerous "web site not trusted" responses. Why?**

support/azure/azure-storage/files/performance/files-troubleshoot-performance.md

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@ title: Azure Files performance troubleshooting guide
 description: Troubleshoot performance issues with Azure file shares and discover potential causes and associated workarounds for these problems.
 ms.service: azure-file-storage
 ms.custom: sap:Performance, linux-related-content
-ms.date: 01/23/2025
+ms.date: 05/21/2025
 ms.reviewer: kendownie, v-weizhu
 #Customer intent: As a system admin, I want to troubleshoot performance issues with Azure file shares to improve performance for applications and users.
 ---
@@ -120,7 +120,7 @@ If you're using a premium file share, increase the provisioned file share size t

 If the majority of your requests are metadata-centric (such as `createfile`, `openfile`, `closefile`, `queryinfo`, or `querydirectory`), the latency will be worse than that of read/write operations.

-To determine whether most of your requests are metadata-centric, start by following steps 1-4 as previously outlined in Cause 1. For step 5, instead of adding a filter for **Response type**, add a property filter for **API name**.
+To determine whether most of your requests are metadata-centric, start by following steps 1-4 as previously outlined in Cause 1. For step 5, instead of adding a filter for **Response type**, add a property filter for **API name**. For more information, see [Monitor utilization by metadata IOPS](/azure/storage/files/analyze-files-metrics?tabs=azure-portal#monitor-utilization-by-metadata-iops).

 :::image type="content" source="media/files-troubleshoot-performance/metadata-metrics.png" alt-text="Screenshot that shows the 'API name' property filter.":::
[Image file changed: 20.1 KB (preview not available)]
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
---
title: Windows VM Startup Gets Stuck on "Please wait for the Group Policy Client" in Azure
description: Provides troubleshooting steps for an Azure virtual machine (VM) that gets stuck in startup on the "Please wait for the Group Policy Client" screen.
ms.date: 05/14/2025
author: cwhitley-MSFT
ms.author: cwhitley
ms.service: azure-virtual-machines
ms.collection: windows
ms.custom: sap:My VM is not booting
---

# VM startup gets stuck at "Please wait for the Group Policy Client"

**Applies to:** :heavy_check_mark: Windows VMs

This article discusses an issue that causes a Microsoft Azure virtual machine (VM) to get stuck during startup on the **Please wait for the Group Policy Client** screen.

## Symptoms

A Windows VM doesn't start. When you use [Boot diagnostics](./boot-diagnostics.md) to view the screenshot of the VM, you see that the Windows operating system displays the message "Please wait for the Group Policy Client."

:::image type="content" source="media/please-wait-for-the-group-policy-client/please-wait-for-the-group-policy-client.png" alt-text="Screenshot of the Windows operating system displaying the message 'Please wait for the Group Policy Client'.":::

## Cause

When a Windows VM starts, it might take some time to apply Group Policy system settings. If the VM is applying many policies, or if the policies are complex, this process can take longer than usual.

We recommend that you allow up to one hour for the VM to finish applying these settings. If the VM remains stuck on the same screen after that time, more troubleshooting might be necessary to identify the specific cause of the issue.

## Collect a memory dump file for troubleshooting

For this scenario, Azure Support requires a memory dump file to troubleshoot and diagnose the issue.

To collect a memory dump file, follow the steps in [this article](./collect-os-memory-dump-file.md). Then, [create a support request](https://ms.portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/overview?DMC=troubleshoot).

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]

support/azure/virtual-machines/windows/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -348,6 +348,8 @@
     href: azure-vm-cannot-rdp-driver-irql-not-less-equal.md
   - name: Azure VM cannot RDP - working on features
     href: azure-vm-cannot-rdp-working-features.md
+  - name: Azure VM startup hangs at "Please wait for the Group Policy Client"
+    href: please-wait-for-the-group-policy-client.md

   - name: Cannot start or stop my VM
     items:
