```kusto
| where TimeGenerated between(now(-1h)..now()) // When you experienced the problem
| where Category == "kube-apiserver"
| where log_s contains "Failed calling webhook, failing closed"
| extend event = parse_json(log_s)
| limit 100
| project TimeGenerated, event, Category
```
In this example, the validating webhook is blocking the creation of some API server objects. Because this scenario can occur while the cluster is bootstrapping, the API server and Konnectivity pods can't be created, and the webhook therefore can't connect to those pods. This sequence of events causes the deadlock and the error message.
### Solution 2: Delete webhook configurations
To fix this problem, delete the validating and mutating webhook configurations.
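A minimal sketch of that cleanup, assuming `kubectl` access to the cluster; `example-webhook` is a placeholder for the configuration name that the first two commands reveal:

```bash
# List the admission webhook configurations to find the one that's failing closed.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Delete the offending configurations. "example-webhook" is a placeholder name.
kubectl delete validatingwebhookconfiguration example-webhook
kubectl delete mutatingwebhookconfiguration example-webhook
```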
A common situation is that objects are continuously created while existing, unused objects aren't removed from the etcd database. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes to such objects can also cause the etcd database to exceed its default size limit of 8 gigabytes.
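To gauge whether a particular object type is accumulating, you can count the objects of each type. A minimal sketch; `events` is only an illustrative resource that commonly piles up:

```bash
# Count objects of one type across all namespaces ("events" is illustrative).
kubectl get events --all-namespaces --no-headers | wc -l
```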
To check the etcd database usage, navigate to **Diagnose and Solve problems** > **Cluster and Control Plane Availability and Performance** in the Azure portal. Run the **Etcd Capacity Issues** and **Etcd Performance Issues** diagnosis tools. The diagnosis tools show the usage breakdown and the total database size.
:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Capacity Issues Diagnosis Tool for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
To get a quick view of the current size of your etcd database in bytes, run the following command:
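A minimal sketch of such a command, assuming `kubectl` access to the API server's `/metrics` endpoint (the metric name varies by Kubernetes version, as the following note explains):

```bash
# Scrape the API server metrics endpoint and filter for the etcd database size metric.
kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_db_total_size_in_bytes"
```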
> - If your control plane is unavailable, `kubectl` commands won't work. Instead, use **Diagnose and Solve problems** in the Azure portal, as shown earlier in this section.
> - The metric name in the previous command varies by Kubernetes version. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`. An etcd database larger than 2 gigabytes is considered large.
### Solution 3: Define quotas for object creation, delete objects, or limit object lifetime in etcd
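For example, an object-count quota can cap how many objects of a given type a namespace may create. A minimal sketch; the namespace name and the limits are illustrative values:

```bash
# Cap the number of Secret and ConfigMap objects in a namespace.
# "app-team" and the limits are illustrative values.
kubectl create quota object-count-quota \
  --hard=count/secrets=500,count/configmaps=500 \
  --namespace=app-team
```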
If you're experiencing a high rate of HTTP 429 errors, one possible cause is that requests are being throttled by the API server's Priority and Fairness settings.
- You can also [modify the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#good-practice-apf-settings) by applying the label **aks-managed-skip-update-operation: true**. This label preserves your modified configurations and prevents AKS from reconciling them back to default values. It's relevant if you're applying a custom FlowSchema and PriorityLevelConfiguration tailored to your cluster's requirements, as specified in [solution 5b](#solution-5b-throttle-a-client-thats-overwhelming-the-control-plane), and don't want AKS to automatically manage client throttling. (A sketch of applying the label follows this list.)
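A minimal sketch of applying that label; it assumes the PriorityLevelConfiguration shares the `aks-managed-apiserver-guard` name with the FlowSchema:

```bash
# Tell AKS not to reconcile these objects back to their default values.
kubectl label flowschema aks-managed-apiserver-guard aks-managed-skip-update-operation=true
kubectl label prioritylevelconfiguration aks-managed-apiserver-guard aks-managed-skip-update-operation=true
```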
If you experience high latency or frequent timeouts, follow these steps to pinpoint the cause.
To identify which clients generate the most requests (and potentially the most API server load), run a query that resembles the following code. This query lists the top 10 user agents by the number of API server requests sent.
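The query itself isn't shown in this excerpt; here's a minimal sketch in the style of the earlier query, assuming kube-audit logs whose `log_s` payload carries a `userAgent` field (table and column names may differ in your workspace):

```kusto
AzureDiagnostics
| where TimeGenerated between(now(-1h)..now())
| where Category == "kube-audit"
| extend event = parse_json(log_s)
| extend userAgent = tostring(event.userAgent)
| summarize requestCount = count() by userAgent
| top 10 by requestCount
```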