
Commit 3eaedad

resolving comments

1 parent cca5f2b

1 file changed: 16 additions & 9 deletions

support/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd.md
@@ -107,7 +107,7 @@ kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_db_total_size_in_bytes"
```

> [!NOTE]
> The metric name in the previous command differs by Kubernetes version. For Kubernetes 1.25 and earlier, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`. An etcd database that's larger than 2 gigabytes (GB) is considered large.
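
As a quick check, the following is a minimal sketch that converts the reported metric value to gigabytes. It assumes Kubernetes 1.26 or later; on 1.25 and earlier, match `etcd_db_total_size_in_bytes` instead.

```bash
# Read the etcd database size from the API server metrics endpoint and
# convert bytes to gigabytes for comparison against the 2-GB guidance.
kubectl get --raw /metrics \
  | awk '/^apiserver_storage_db_total_size_in_bytes/ {printf "etcd db size: %.2f GB\n", $2 / (1024 * 1024 * 1024)}'
```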
### Solution 3: Define quotas for object creation, delete objects, or limit object lifetime in etcd

@@ -121,12 +121,17 @@ kubectl delete jobs --field-selector status.successful=1

For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set time-to-live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.

Also see [Solution 6](#solution-6-use-existing-diagnostic-tools-to-identify-and-resolve-the-underlying-cause) for techniques to reduce object size.
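
For example, here's a minimal sketch of the TTL and label-selector techniques; the Job name, label, and container image are hypothetical.

```bash
# A hypothetical Job that Kubernetes deletes automatically (together with its
# dependent Pods) 100 seconds after it finishes, so completed Jobs don't
# accumulate in etcd.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-demo
  labels:
    purpose: demo-cleanup
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
EOF

# Bulk delete all Jobs that carry a specific label.
kubectl delete jobs -l purpose=demo-cleanup
```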
### Cause 4: AKS managed API server guard was applied

If you're experiencing a high rate of HTTP 429 errors, one possible cause is that AKS has applied a managed API server guard. The guard consists of a FlowSchema and a PriorityLevelConfiguration, both named **aks-managed-apiserver-guard**, that AKS applies when the API server encounters frequent out-of-memory (OOM) events and scaling efforts have failed to stabilize it. This guard is a last-resort measure that throttles non-system client requests to keep the API server from becoming completely unresponsive.

- Check the cluster for the presence of the **aks-managed-apiserver-guard** FlowSchema and PriorityLevelConfiguration, or check the Kubernetes events.

> [!NOTE]
> kubectl commands might take longer than expected or time out when the API server is overloaded. Retry if a command fails.

```bash
kubectl get flowschemas
kubectl get prioritylevelconfigurations
```

@@ -143,34 +148,36 @@ kubectl get prioritylevelconfigurations

```bash
kubectl get events -n kube-system aks-managed-apiserver-throttling-enabled
```

### Solution 4: Identify unoptimized clients and mitigate

#### Step 1: Identify unoptimized clients

- See [Cause 5](#cause-5-an-offending-client-makes-excessive-list-or-put-calls) to identify problematic clients and refine their LIST call patterns, especially clients that generate high-frequency or high-latency requests, because they're the primary contributors to API server degradation. Refer to [best practices](/azure-aks-docs-pr/articles/aks/best-practices-performance-scale-large.md#kubernetes-clients) for further guidance on client optimization.

#### Step 2: Mitigation

- Scale down the cluster to reduce the load on the API server (a sketch follows at the end of this solution).
- Use [control plane metrics](/azure-aks-docs-pr/articles/aks/control-plane-metrics-monitor.md) to monitor the load on the API server. See this [blog post](https://techcommunity.microsoft.com/blog/appsonazureblog/azure-platform-metrics-for-aks-control-plane-monitoring/4385770) for more details.
- After the previous steps are complete, delete the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration:
```bash
kubectl delete flowschema aks-managed-apiserver-guard
kubectl delete prioritylevelconfiguration aks-managed-apiserver-guard
```

> [!WARNING]
> Avoid scaling the cluster back to its originally intended size until client call patterns have been optimized; see **[best practices](/azure-aks-docs-pr/articles/aks/best-practices-performance-scale-large.md#kubernetes-clients)**. Premature scaling might cause the API server to crash again.

- You can also [modify the aks-managed-apiserver-guard FlowSchema and PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#good-practice-apf-settings) by applying the label **aks-managed-skip-update-operation: true**. This label preserves the modified configurations and prevents AKS from reconciling them back to default values. This approach is relevant if you're applying a custom FlowSchema and PriorityLevelConfiguration tailored to your cluster's requirements, as specified in [Solution 5b](#solution-5b-throttle-a-client-thats-overwhelming-the-control-plane), and you don't want AKS to automatically manage client throttling.

```bash
kubectl label prioritylevelconfiguration aks-managed-apiserver-guard aks-managed-skip-update-operation=true
kubectl label flowschema aks-managed-apiserver-guard aks-managed-skip-update-operation=true
```
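
For the scale-down step, here's a minimal sketch using the Azure CLI; the resource group, cluster, and node pool names and the target node count are placeholders. Fewer nodes means fewer node agents holding watch connections and sending requests, which eases pressure on the API server.

```bash
# Reduce the node count of a node pool to lower the object and request
# volume that the API server and etcd must handle.
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-count 3
```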
170177

171178
### Cause 5: An offending client makes excessive LIST or PUT calls

If etcd isn't overloaded with too many objects, as described in [Cause 3](#cause-3-an-offending-client-leaks-etcd-objects-and-causes-a-slowdown-of-etcd), an offending client might be making too many `LIST` or `PUT` calls to the API server. If you experience high latency or frequent timeouts, follow these steps to pinpoint the offending client and the types of API calls that fail.

#### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests
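
For example, if your cluster's audit logs flow to a Log Analytics workspace through diagnostic settings, a query along these lines can rank user agents by request count. This is a sketch only: it assumes the resource-specific `AKSAudit` table is enabled, and the workspace ID is a placeholder.

```bash
# Rank user agents by request volume over the last hour; assumes kube-audit
# diagnostic logs are streamed to the resource-specific AKSAudit table.
az monitor log-analytics query \
  --workspace "00000000-0000-0000-0000-000000000000" \
  --analytics-query 'AKSAudit | where TimeGenerated > ago(1h) | summarize RequestCount = count() by UserAgent | top 10 by RequestCount'
```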
