
Commit 764f99c

committed
first edits
1 parent d18beda commit 764f99c

1 file changed

Lines changed: 75 additions & 81 deletions

File tree

support/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd.md

@@ -49,17 +49,74 @@ The following table outlines the common symptoms of API server failures.
4949
| Timeouts from the API server | Frequent timeouts that are beyond the guarantees in [the AKS API server SLA](/azure/aks/free-standard-pricing-tiers#uptime-sla-terms-and-conditions). For example, `kubectl` commands timeout. |
5050
| High latencies | High latencies that make the Kubernetes SLOs fail. For example, the `kubectl` command takes more than 30 seconds to list pods. |
5151
| API server pod in `CrashLoopBackOff` status or facing webhook call failures | Verify that you don't have any custom admission webhook (such as the [Kyverno](https://kyverno.io/docs/introduction/) policy engine) that's blocking the calls to the API server. |
52-
| Elevated HTTP 429 responses from the API server | API server is throttling calls. Refer to the troubleshooting checklist|
52+
| Elevated HTTP 429 responses from the API server | The API server is throttling calls. Refer to the potential causes in the following sections. |
5353

54-
## Troubleshooting checklist
5554

56-
If you experience high latency times, follow these steps to pinpoint the offending client and the types of API calls that fail.
55+
## Causes and resolutions
56+
57+
### Cause 1: A network rule blocks the traffic from agent nodes to the API server
58+
59+
A network rule can block traffic between the agent nodes and the API server.
60+
61+
To check whether a misconfigured network policy is blocking communication between the API server and agent nodes, run the following [kubectl-aks](https://go.microsoft.com/fwlink/p/?linkid=2259767) commands:
62+
63+
```bash
64+
kubectl aks config import \
65+
--subscription <mySubscriptionID> \
66+
--resource-group <myResourceGroup> \
67+
--cluster-name <myAKSCluster>
68+
69+
kubectl aks check-apiserver-connectivity --node <myNode>
70+
```
5771

58-
### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests
72+
The [config import](https://go.microsoft.com/fwlink/p/?linkid=2259867#importing-configuration) command retrieves the Virtual Machine Scale Set information for all the nodes in the cluster. Then, the [check-apiserver-connectivity](https://go.microsoft.com/fwlink/p/?linkid=2259674) command uses this information to verify the network connectivity between the API server and a specified node, specifically for its underlying scale set instance.
73+
74+
> [!NOTE]
75+
> If the output of the `check-apiserver-connectivity` command contains the `Connectivity check: succeeded` message, then the network connectivity is unimpeded.
76+
77+
### Solution 1: Fix the network policy to remove the traffic blockage
78+
79+
If the command output indicates that a connection failure occurred, reconfigure the network policy so that it doesn't unnecessarily block traffic between the agent nodes and the API server.
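If the blockage comes from a Kubernetes network policy (rather than, for example, a network security group or firewall rule), you can start by reviewing the policies that are deployed in the cluster. The following commands are only a sketch; `<myNetworkPolicy>` and `<myNamespace>` are placeholders for your own values.

```bash
# List every network policy in the cluster, then inspect a suspect policy.
# <myNetworkPolicy> and <myNamespace> are placeholders for your own values.
kubectl get networkpolicy --all-namespaces
kubectl describe networkpolicy <myNetworkPolicy> --namespace <myNamespace>
```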
80+
81+
### Cause 2: An offending client leaks etcd objects and causes a slowdown of etcd
82+
83+
A common situation is that objects are continuously created while existing, unused objects in the etcd database aren't removed. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes to such objects can also cause the etcd database to exceed its default size limit of 4 gigabytes.
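To get a rough idea of whether one resource type is accumulating, you can count objects per type. The following loop is only a sketch; adjust the list of resource types to match your workloads.

```bash
# Count objects of a few commonly accumulating resource types across all namespaces.
# The types listed here are examples only.
for kind in events configmaps secrets jobs pods; do
  printf '%s: ' "$kind"
  kubectl get "$kind" --all-namespaces --no-headers 2>/dev/null | wc -l
done
```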
84+
85+
To check the etcd database usage, navigate to **Diagnose and Solve problems** in the Azure portal. Run the **Etcd Availability Issues** diagnosis tool by searching for "_etcd_" in the Search box. The diagnosis tool shows the usage breakdown and the total database size.
86+
87+
:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Availability Diagnosis for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
88+
89+
To get a quick view of the current size of your etcd database in bytes, run the following command:
90+
91+
```bash
92+
kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_size_bytes|apiserver_storage_db_total_size_in_bytes"
93+
```
94+
95+
> [!NOTE]
96+
> The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`.
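If you aren't sure which Kubernetes version your cluster runs, check it first. For example (assuming that the Azure CLI is installed and that the placeholders are replaced with your own values):

```bash
# Check the control plane version to decide which metric name applies.
kubectl version

# Or query the AKS resource directly. <myResourceGroup> and <myAKSCluster> are placeholders.
az aks show --resource-group <myResourceGroup> --name <myAKSCluster> \
    --query currentKubernetesVersion --output tsv
```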
97+
98+
### Solution 2: Define quotas for object creation, delete objects, or limit object lifetime in etcd
99+
100+
To prevent etcd from reaching capacity and causing cluster downtime, you can limit the maximum number of resources that are created. You can also slow the rate at which revisions are generated for resource instances. To limit the number of objects that can be created, [define object quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota).
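For example, the following object count quota caps how many Job and ConfigMap objects can exist in a single namespace. This is only a sketch; `<myNamespace>` and the limits are placeholders to adapt to your workloads.

```bash
# Sketch: cap object counts in a namespace by using an object count quota.
# <myNamespace> and the limits shown are placeholders.
kubectl apply --namespace <myNamespace> -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-count-quota
spec:
  hard:
    count/jobs.batch: "500"
    count/configmaps: "200"
EOF
```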
101+
102+
If you identified objects that are no longer in use but consume resources, consider deleting them. For example, delete completed jobs to free up space:
103+
104+
```bash
105+
kubectl delete jobs --field-selector status.successful=1
106+
```
107+
108+
For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set Time to Live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.
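For example, a finished Job can clean itself up if you set `ttlSecondsAfterFinished`, and labeled objects can be removed in bulk by using a label selector. The following commands are only a sketch; `<myJob>`, `<myNamespace>`, and the `cleanup=true` label are placeholders.

```bash
# Sketch: let the TTL controller delete a Job one hour after it finishes,
# and bulk delete labeled objects. The names and label are placeholders.
kubectl patch job <myJob> --namespace <myNamespace> \
    --type merge --patch '{"spec":{"ttlSecondsAfterFinished":3600}}'
kubectl delete configmaps --namespace <myNamespace> --selector cleanup=true
```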
109+
110+
### Cause 3: An offending client makes excessive LIST or PUT calls
111+
112+
If you determine that etcd isn't overloaded with too many objects, an offending client might be making too many `LIST` or `PUT` calls to the API server.
113+
If you experience high latency or frequent timeouts, follow these steps to pinpoint the offending client and the types of API calls that fail.
114+
115+
#### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests
59116

60117
To identify which clients generate the most requests (and potentially the most API server load), run a query that resembles the following code. This query lists the top 10 user agents by the number of API server requests sent.
61118

62-
#### [Resource-specific](#tab/resource-specific)
119+
##### [Resource-specific](#tab/resource-specific)
63120

64121
```kusto
65122
AKSAudit
@@ -69,7 +126,7 @@ AKSAudit
69126
| project UserAgent, count_
70127
```
71128

72-
#### [Azure diagnostics](#tab/azure-diagnostics)
129+
##### [Azure diagnostics](#tab/azure-diagnostics)
73130

74131
```kusto
75132
AzureDiagnostics
@@ -87,8 +144,8 @@ AzureDiagnostics
87144
> If your query returns no results, you might have selected the wrong table to query diagnostics logs. In resource-specific mode, data is written to individual tables, depending on the category of the resource. Diagnostics logs are written to the `AKSAudit` table. In Azure diagnostics mode, all data is written to the `AzureDiagnostics` table. For more information, see [Azure resource logs](/azure/azure-monitor/essentials/resource-logs).
88145
89146
Although it's helpful to know which clients generate the highest request volume, high request volume alone might not be a cause for concern. The response latency that clients experience is a better indicator of the actual load that each one generates on the API server.
90-
### Step 2 Identify and chart latency for user agentd
91-
#### [Diagnose and Solve](#/tab/Diagnose-and-solve)
147+
#### Step 2: Identify and analyze latency for user agents
148+
##### Using Diagnose and Solve problems in the Azure portal
92149

93150
AKS now provides a built-in analyzer, the API Server Resource Intensive Listing Detector, to help you identify agents that make resource-intensive LIST calls. These calls are a leading cause of API server and etcd performance issues.
94151

@@ -105,7 +162,7 @@ The detector analyzes recent API server activity and highlights agents or worklo
105162

106163
:::image type="content" source="media/troubleshoot-apiserver-etcd/resource-intensive-listing-analyzer-2.png" alt-text="Screenshot that shows the apiserver perf detector detailed view." lightbox="media/troubleshoot-apiserver-etcd/resource-intensive-listing-analyzer-2.png":::
107164

108-
##### How to interpret the detector output
165+
###### How to interpret the detector output
109166

110167
- **Summary:**
111168
Indicates if resource-intensive LIST calls were detected and describes possible impacts on your cluster.
@@ -116,20 +173,16 @@ The detector analyzes recent API server activity and highlights agents or worklo
116173
- **Charts and tables:**
117174
Identify which agents, namespaces, or workloads are generating the most resource-intensive LIST calls.
118175

119-
> Only successful LIST calls are counted. Failed or throttled calls are excluded.
120-
121-
The analyzer also provides recommendations directly in the Azure portal. These recommendations are tailored to the detected patterns to help you remediate and optimize your cluster.
122-
123176
> [!NOTE]
124-
> The API server resource intensive listing detector is available to all users who have access to the AKS resource in the Azure portal. No special permissions or prerequisites are required.
125-
>
126-
> After you identify the offending agents and apply the recommendations, you can use [the API Priority and Fairness feature](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) to throttle or isolate problematic clients. Alternatively, refer to the "Cause 3" section of [Troubleshoot API server and etcd problems in Azure Kubernetes Services](/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd?branch=pr-en-us-9260&tabs=resource-specific#cause-3-an-offending-client-makes-excessive-list-or-put-calls).
177+
> * The API server resource intensive listing detector is available to all users who have access to the AKS resource in the Azure portal. No special permissions or prerequisites are required.
178+
> * Only successful LIST calls are counted. Failed or throttled calls are excluded.
179+
> * After you identify the offending agents and apply the recommendations, you can use [the API Priority and Fairness feature](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) to throttle or isolate problematic clients (a minimal sketch follows this note). Alternatively, see the [Cause 3: An offending client makes excessive LIST or PUT calls](#cause-3-an-offending-client-makes-excessive-list-or-put-calls) section earlier in this article.
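If you decide to use API Priority and Fairness, the following minimal sketch routes requests from a noisy in-cluster client (identified here by a placeholder service account) to the low-concurrency `catch-all` priority level. The `flowcontrol.apiserver.k8s.io` API version differs across Kubernetes releases, so adjust it for your cluster.

```bash
# Sketch: throttle a noisy in-cluster client by matching its service account
# and assigning it to a low-concurrency priority level.
# my-noisy-client and <myNamespace> are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: throttle-noisy-client
spec:
  priorityLevelConfiguration:
    name: catch-all
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: my-noisy-client
        namespace: "<myNamespace>"
    resourceRules:
    - verbs: ["list"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
EOF
```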
127180
128-
#### [Logs](#/tab/logs)
181+
##### Using Logs
129182

130183
To identify the average latency of API server requests per user agent, as plotted on a time chart, run the following query.
131184

132-
##### [Resource-specific](#tab/resource-specific)
185+
###### [Resource-specific](#tab/resource-specific)
133186

134187
```kusto
135188
AKSAudit
@@ -141,7 +194,7 @@ AKSAudit
141194
| render timechart
142195
```
143196

144-
##### [Azure diagnostics](#tab/azure-diagnostics)
197+
###### [Azure diagnostics](#tab/azure-diagnostics)
145198

146199
```kusto
147200
AzureDiagnostics
@@ -162,11 +215,11 @@ This query is a follow-up to the query in the ["Identify top user agents by the
162215
> [!TIP]
163216
> By analyzing this data, you can identify patterns and anomalies that can indicate problems on your AKS cluster or applications. For example, you might notice that a particular user is experiencing high latency. This scenario can indicate the type of API calls that are causing excessive load on the API server or etcd.
164217
165-
### Step 3: Identify bad API calls for a given user agent
218+
#### Step 3: Identify bad API calls for a given user agent
166219

167220
Run the following query to tabulate the 99th percentile (P99) latency of API calls across different resource types for a given client.
168221

169-
#### [Resource-specific](#tab/resource-specific)
222+
##### [Resource-specific](#tab/resource-specific)
170223

171224
```kusto
172225
AKSAudit
@@ -182,7 +235,7 @@ AKSAudit
182235
| render table
183236
```
184237

185-
#### [Azure diagnostics](#tab/azure-diagnostics)
238+
##### [Azure diagnostics](#tab/azure-diagnostics)
186239

187240
```kusto
188241
AzureDiagnostics
@@ -204,65 +257,6 @@ AzureDiagnostics
204257

205258
The results from this query can be useful to identify the kinds of API calls that fail the upstream Kubernetes SLOs. In most cases, an offending client might be making too many `LIST` calls on a large set of objects or objects that are too large. Unfortunately, no hard scalability limits are available to guide users about API server scalability. API server or etcd scalability limits depend on various factors that are explained in [Kubernetes Scalability thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md).
206259

207-
## Cause and Resolution
208-
209-
### Cause 1: A network rule blocks the traffic from agent nodes to the API server
210-
211-
A network rule can block traffic between the agent nodes and the API server.
212-
213-
To check whether a misconfigured network policy is blocking communication between the API server and agent nodes, run the following [kubectl-aks](https://go.microsoft.com/fwlink/p/?linkid=2259767) commands:
214-
215-
```bash
216-
kubectl aks config import \
217-
--subscription <mySubscriptionID> \
218-
--resource-group <myResourceGroup> \
219-
--cluster-name <myAKSCluster>
220-
221-
kubectl aks check-apiserver-connectivity --node <myNode>
222-
```
223-
224-
The [config import](https://go.microsoft.com/fwlink/p/?linkid=2259867#importing-configuration) command retrieves the Virtual Machine Scale Set information for all the nodes in the cluster. Then, the [check-apiserver-connectivity](https://go.microsoft.com/fwlink/p/?linkid=2259674) command uses this information to verify the network connectivity between the API server and a specified node, specifically for its underlying scale set instance.
225-
226-
> [!NOTE]
227-
> If the output of the `check-apiserver-connectivity` command contains the `Connectivity check: succeeded` message, then the network connectivity is unimpeded.
228-
229-
### Solution 1: Fix the network policy to remove the traffic blockage
230-
231-
If the command output indicates that a connection failure occurred, reconfigure the network policy so that it doesn't unnecessarily block traffic between the agent nodes and the API server.
232-
233-
### Cause 2: An offending client leaks etcd objects and causes a slowdown of etcd
234-
235-
A common situation is that objects are continuously created even though existing unused objects in the etcd database aren't removed. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes on such objects could also cause the default size of the etcd database (by default, 4 gigabytes) to be exceeded.
236-
237-
To check the etcd database usage, navigate to **Diagnose and Solve problems** in the Azure portal. Run the **Etcd Availability Issues** diagnosis tool by searching for "_etcd_" in the Search box. The diagnosis tool shows the usage breakdown and the total database size.
238-
239-
:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Availability Diagnosis for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
240-
241-
To get a quick view of the current size of your etcd database in bytes, run the following command:
242-
243-
```bash
244-
kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_size_bytes|apiserver_storage_db_total_size_in_bytes"
245-
```
246-
247-
> [!NOTE]
248-
> The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`.
249-
250-
### Solution 2: Define quotas for object creation, delete objects, or limit object lifetime in etcd
251-
252-
To prevent etcd from reaching capacity and causing cluster downtime, you can limit the maximum number of resources that are created. You can also slow the number of revisions that are generated for resource instances. To limit the number of objects that can be created, [define object quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota).
253-
254-
If you identified objects that are no longer in use but consume resources, consider deleting them. For example, delete completed jobs to free up space:
255-
256-
```bash
257-
kubectl delete jobs --field-selector status.successful=1
258-
```
259-
260-
For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set Time to Live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.
261-
262-
### Cause 3: An offending client makes excessive LIST or PUT calls
263-
264-
If you determine that etcd isn't overloaded with too many objects, an offending client might be making too many `LIST` or `PUT` calls to the API server.
265-
266260
### Solution 3a: Tune your API call pattern
267261

268262
To reduce the pressure on the control plane, consider tuning your client's API call pattern.
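For example, one common adjustment for ad hoc clients is to scope `LIST` calls with label and field selectors instead of listing every object in the cluster. The following command is only a sketch; `<myApp>` is a placeholder label value.

```bash
# Sketch: scope LIST calls with selectors instead of listing everything.
# <myApp> is a placeholder label value.
kubectl get pods --all-namespaces --selector app=<myApp> --field-selector status.phase=Running
```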
