
Commit 8031682

resolving comments
1 parent b606c0b commit 8031682

4 files changed

Lines changed: 37 additions & 13 deletions

support/azure/azure-kubernetes/create-upgrade-delete/image-4.png renamed to support/azure/azure-kubernetes/create-upgrade-delete/media/troubleshoot-apiserver-etcd/flow-schema.png

File renamed without changes.

support/azure/azure-kubernetes/create-upgrade-delete/image-5.png renamed to support/azure/azure-kubernetes/create-upgrade-delete/media/troubleshoot-apiserver-etcd/priority-level-configuration.png

File renamed without changes.

support/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd.md

Lines changed: 37 additions & 13 deletions
@@ -86,6 +86,27 @@ Check the events that are related to your API server. You might see event messag

> Internal error occurred: failed calling webhook "mutate.kyverno.svc-fail": failed to call webhook: Post "https\://kyverno-system-kyverno-system-svc.kyverno-system.svc:443/mutate/fail?timeout=10s": write unix @->/tunnel-uds/proxysocket: write: broken pipe

+##### [**Resource-specific**](#tab/resource-specific)
+
+```kusto
+AKSControlPlane
+| where Category=="kube-apiserver"
+| where Message contains "Failed calling webhook, failing closed"
+| limit 100
+| project TimeGenerated, Level, Message
+```
+
+##### [**Azure diagnostics**](#tab/azure-diagnostics)
+
+```kusto
+AzureDiagnostics
+| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
+| where Category == "kube-apiserver"
+| where log_s contains "Failed calling webhook, failing closed"
+| extend event = parse_json(log_s)
+| limit 100
+| project TimeGenerated, event, Category
+```
+
In this example, the validating webhook is blocking the creation of some API server objects. Because this scenario might occur during bootstrap time, the API server and Konnectivity pods can't be created. Therefore, the webhook can't connect to those pods. This sequence of events causes the deadlock and the error message.

### Solution 2: Delete webhook configurations
@@ -96,9 +117,9 @@ To fix this problem, delete the validating and mutating webhook configurations.

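A minimal illustrative sketch of that cleanup follows; it isn't part of this commit, and `<webhook-configuration-name>` is a placeholder for the configuration that the error message identifies as blocking the control plane:

```bash
# List the admission webhook configurations, then delete the one that's blocking bootstrap.
# <webhook-configuration-name> is a placeholder; substitute the name reported in the error.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl delete validatingwebhookconfiguration <webhook-configuration-name>
kubectl delete mutatingwebhookconfiguration <webhook-configuration-name>
```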

A common situation is that objects are continuously created even though existing unused objects in the etcd database aren't removed. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes on such objects could also cause the default size of the etcd database (by default, 8 gigabytes) to be exceeded.

-To check the etcd database usage, navigate to **Diagnose and Solve problems** in the Azure portal. Run the **Etcd Availability Issues** diagnosis tool by searching for "_etcd_" in the Search box. The diagnosis tool shows the usage breakdown and the total database size.
+To check the etcd database usage, navigate to **Diagnose and Solve problems** > **Cluster and Control Plane Availability and Performance** in the Azure portal. Run the **Etcd Capacity Issues** and **Etcd Performance Issues** diagnosis tools. These tools show the usage breakdown and the total database size.

-:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Availability Diagnosis for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
+:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Capacity Issues Diagnosis Tool for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::

To get a quick view of the current size of your etcd database in bytes, run the following command:

@@ -107,7 +128,8 @@ kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_stor
```

> [!NOTE]
-> The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`. An Etcd database with size > 2 gigabytes is considered a large etcd db.
+> - If your control plane is unavailable, kubectl commands won't work. Use **Diagnose and Solve problems** in the Azure portal instead, as described earlier in this section.
+> - The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`. An etcd database that's larger than 2 gigabytes is considered large.

### Solution 3: Define quotas for object creation, delete objects, or limit object lifetime in etcd

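The commit doesn't add an example for this solution, but as a rough sketch of the approach (metric availability varies by Kubernetes version, and the namespace, quota name, resource types, and limits below are placeholders), you can first check which resource types store the most objects and then cap a noisy type with an object count quota:

```bash
# Show stored object counts per resource type, largest first (illustrative; metric names vary by version).
kubectl get --raw /metrics | grep apiserver_storage_objects | grep -v '^#' | sort -k2 -nr | head -20

# Cap noisy object types in a namespace with an object count quota (placeholder values).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-count-quota
  namespace: my-namespace
spec:
  hard:
    count/configmaps: "1000"
    count/secrets: "1000"
EOF
```
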
@@ -136,11 +158,13 @@ If you're experiencing a high rate of HTTP 429 errors, one possible cause is tha
kubectl get flowschemas
kubectl get prioritylevelconfigurations
```
-<img src="image-4.png" alt="FlowSchema" width="600">
+<img src="media/troubleshoot-apiserver-etcd/flow-schema.png" alt="FlowSchema" width="600">

<br>

-<img src="image-5.png" alt="PriorityLevelConfiguration" width="600">
+<img src="media/troubleshoot-apiserver-etcd/priority-level-configuration.png" alt="PriorityLevelConfiguration" width="600">
+
+<br>

- Check Kubernetes Events

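No event query is added in this commit, but as a minimal illustrative sketch, recent cluster events can be listed and scanned for throttling-related messages (the exact event text varies by cluster):

```bash
# List recent events across all namespaces, newest last (illustrative).
kubectl get events --all-namespaces --sort-by=.lastTimestamp
```
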
@@ -171,8 +195,8 @@ kubectl delete prioritylevelconfiguration aks-managed-apiserver-guard
- You can also [modify the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#good-practice-apf-settings) by applying the label **aks-managed-skip-update-operation: true**. This label preserves the modified configurations and prevents AKS from reconciling them back to default values. This is relevant if you are applying a custom FlowSchema and PriorityLevelConfiguration tailored to your cluster’s requirements as specified in [solution 5b](#solution-5b-throttle-a-client-thats-overwhelming-the-control-plane) and do not want AKS to automatically manage client throttling.

```bash
-kubectl label prioritylevelconfiguration aks-managed-apiserver-guard
-kubectl label flowschema aks-managed-apiserver-guard
+kubectl label prioritylevelconfiguration aks-managed-apiserver-guard aks-managed-skip-update-operation=true
+kubectl label flowschema aks-managed-apiserver-guard aks-managed-skip-update-operation=true
```

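As a quick follow-up check (illustrative, not part of this commit), you can confirm that the label is present so that AKS reconciliation skips these objects:

```bash
# Verify that the skip-update label was applied (illustrative check).
kubectl get flowschema aks-managed-apiserver-guard --show-labels
kubectl get prioritylevelconfiguration aks-managed-apiserver-guard --show-labels
```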

@@ -185,7 +209,7 @@ If you experience high latency or frequent timeouts, follow these steps to pinpo

To identify which clients generate the most requests (and potentially the most API server load), run a query that resembles the following code. This query lists the top 10 user agents by the number of API server requests sent.

-[**Resource-specific**](#tab/resource-specific)
+##### [**Resource-specific**](#tab/resource-specific)

```kusto
AKSAudit
@@ -195,7 +219,7 @@ AKSAudit
| project UserAgent, count_
```

-[**Azure diagnostics**](#tab/azure-diagnostics)
+##### [**Azure diagnostics**](#tab/azure-diagnostics)

```kusto
AzureDiagnostics
@@ -252,7 +276,7 @@ The detector analyzes recent API server activity and highlights agents or worklo

To identify the average latency of API server requests per user agent, as plotted on a time chart, run the following query.

-[**Resource-specific**](#tab/resource-specific)
+##### [**Resource-specific**](#tab/resource-specific)

```kusto
AKSAudit
@@ -264,7 +288,7 @@ AKSAudit
| render timechart
```

-[**Azure diagnostics**](#tab/azure-diagnostics)
+##### [**Azure diagnostics**](#tab/azure-diagnostics)

```kusto
AzureDiagnostics
@@ -289,7 +313,7 @@ This query is a follow-up to the query in the ["Identify top user agents by the

Run the following query to tabulate the 99th percentile (P99) latency of API calls across different resource types for a given client.

-[**Resource-specific**](#tab/resource-specific)
+##### [**Resource-specific**](#tab/resource-specific)

```kusto
AKSAudit
@@ -305,7 +329,7 @@ AKSAudit
| render table
```

-[**Azure diagnostics**](#tab/azure-diagnostics)
+##### [**Azure diagnostics**](#tab/azure-diagnostics)

```kusto
AzureDiagnostics
