> The metric name in the previous command differs across Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`. An etcd database that's larger than 2 gigabytes (GB) is considered a large etcd database.
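
As a rough check against that threshold, you can read the metric from the API server metrics endpoint and compare it to 2 GB. This is a minimal sketch that assumes a Kubernetes 1.26-1.28 cluster and permission to read the `/metrics` endpoint:

```bash
# Read the database size metric from the API server and flag values above 2 GB.
# The grep pattern assumes Kubernetes 1.26-1.28; grep for
# etcd_db_total_size_in_bytes on 1.25 and earlier instead.
kubectl get --raw /metrics \
  | grep '^apiserver_storage_db_total_size_in_bytes' \
  | awk '$2 > 2147483648 { print "large etcd database:", $2, "bytes" }'
```
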
### Solution 3: Define quotas for object creation, delete objects, or limit object lifetime in etcd
For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set Time to Live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.
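
For example, the following hedged sketch shows both techniques; the `app=batch-worker` label and the Job name `batch-job` are placeholders, not names from this article:

```bash
# Bulk delete every ConfigMap that carries a specific label (placeholder label).
kubectl delete configmaps -l app=batch-worker

# Set a TTL on an existing Job so that the TTL controller cleans it up
# automatically after it finishes (ttlSecondsAfterFinished is mutable).
kubectl patch job batch-job --type=merge -p '{"spec":{"ttlSecondsAfterFinished":3600}}'
```
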
Also refer to [Solution 6](#solution-6-use-existing-diagnostic-tools-to-identify-and-resolve-the-underlying-cause) for object size reduction techniques.
### Cause 4: AKS managed API server guard was applied
If you're experiencing a high rate of HTTP 429 errors, one possible cause is that AKS has applied a managed API server guard. The guard is applied as a FlowSchema and PriorityLevelConfiguration named **"aks-managed-apiserver-guard"**. This safeguard is triggered when the API server encounters frequent out-of-memory (OOM) events and scaling efforts on the API server have failed to stabilize it. The guard is designed as a last-resort measure that protects the API server by throttling non-system client requests, preventing the server from becoming completely unresponsive.
- Check the cluster for the presence of the **"aks-managed-apiserver-guard"** FlowSchema and PriorityLevelConfiguration, or check Kubernetes events:

> [!NOTE]
> `kubectl` commands might take longer than expected or time out when the API server is overloaded. Retry them if they fail.

```bash
kubectl get flowschemas
kubectl get prioritylevelconfigurations
```

To check Kubernetes events for the throttling notification, run the following command:

```bash
kubectl get events -n kube-system aks-managed-apiserver-throttling-enabled
```
### Solution 4: Identify unoptimized clients and mitigate
#### Step 1: Identify unoptimized clients
- See [Cause 5](#cause-5-an-offending-client-makes-excessive-list-or-put-calls) to identify problematic clients and refine their LIST call patterns, especially clients that generate high-frequency or high-latency requests, because these are the primary contributors to API server degradation. Refer to [best practices](/azure-aks-docs-pr/articles/aks/best-practices-performance-scale-large.md#kubernetes-clients) for further guidance on client optimization.
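
As a starting point, you can inspect the API Priority and Fairness metrics that the API server exposes to see which flow schemas are queuing or rejecting requests. This is a sketch that assumes you have permission to read the `/metrics` endpoint:

```bash
# Requests rejected by API Priority and Fairness, broken down by flow schema.
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total

# Requests currently waiting in queues, per priority level; sustained nonzero
# values point to clients that are overwhelming the API server.
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
```
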
#### Step 2: Mitigation
> [!WARNING]
> Do not perform any mitigation steps until the client's call pattern is optimized, as this could lead to the API server becoming fully unresponsive.

- Scale down the cluster to reduce the load on the API server.
- Use [Control Plane Metrics](/azure-aks-docs-pr/articles/aks/control-plane-metrics-monitor.md) to monitor the load on the API server. Refer to the [blog post](https://techcommunity.microsoft.com/blog/appsonazureblog/azure-platform-metrics-for-aks-control-plane-monitoring/4385770) for more details.
- After the preceding steps are complete, delete the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration, as shown in the sketch after this list.
> [!WARNING]
> Avoid scaling the cluster back to the originally intended scale point until client call patterns have been optimized. Refer to **[best practices](/azure-aks-docs-pr/articles/aks/best-practices-performance-scale-large.md#kubernetes-clients)**. Premature scaling may cause the API server to crash again.

- You can also [modify the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#good-practice-apf-settings) by applying the label **aks-managed-skip-update-operation: true**. This label preserves the modified configurations and prevents AKS from reconciling them back to default values. This approach is relevant if you're applying a custom FlowSchema and PriorityLevelConfiguration tailored to your cluster's requirements, as specified in [solution 5b](#solution-5b-throttle-a-client-thats-overwhelming-the-control-plane), and don't want AKS to automatically manage client throttling.

> It's advisable to delete aks-managed-apiserver-guard after you optimize the client's LIST pattern and apply a custom FlowSchema and PriorityLevelConfiguration that suits your cluster's requirements, as specified in [solution 5b](#solution-5b-throttle-a-client-thats-overwhelming-the-control-plane), rather than modifying the default aks-managed-apiserver-guard. If you modify it, AKS can't reapply aks-managed-apiserver-guard with default values if the API server continues to experience out-of-memory (OOM) events in the future.
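
A minimal sketch of the two options above, assuming the guard objects exist under their default names:

```bash
# Option 1 (recommended): delete the guard after client call patterns are optimized.
kubectl delete flowschema aks-managed-apiserver-guard
kubectl delete prioritylevelconfiguration aks-managed-apiserver-guard

# Option 2: keep a modified guard and prevent AKS from reconciling it back to defaults.
kubectl label flowschema aks-managed-apiserver-guard aks-managed-skip-update-operation=true
kubectl label prioritylevelconfiguration aks-managed-apiserver-guard aks-managed-skip-update-operation=true
```
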
### Cause 5: An offending client makes excessive LIST or PUT calls
If etcd isn't overloaded with too many objects as defined in [Cause 3](#cause-3-an-offending-client-leaks-etcd-objects-and-causes-a-slowdown-of-etcd), an offending client might be making too many `LIST` or `PUT` calls to the API server.

If you experience high latency or frequent timeouts, follow these steps to pinpoint the offending client and the types of API calls that fail.
#### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests