
Commit 764f99c

committed
first edits
1 parent d18beda commit 764f99c

1 file changed

Lines changed: 75 additions & 81 deletions

File tree

support/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd.md

@@ -49,17 +49,74 @@ The following table outlines the common symptoms of API server failures.
4949
| Timeouts from the API server | Frequent timeouts that are beyond the guarantees in [the AKS API server SLA](/azure/aks/free-standard-pricing-tiers#uptime-sla-terms-and-conditions). For example, `kubectl` commands timeout. |
5050
| High latencies | High latencies that make the Kubernetes SLOs fail. For example, the `kubectl` command takes more than 30 seconds to list pods. |
5151
| API server pod in `CrashLoopBackOff` status or facing webhook call failures | Verify that you don't have any custom admission webhook (such as the [Kyverno](https://kyverno.io/docs/introduction/) policy engine) that's blocking the calls to the API server. |
52-
| Elevated HTTP 429 responses from the API server | API server is throttling calls. Refer to the troubleshooting checklist|
52+
| Elevated HTTP 429 responses from the API server | The API server is throttling calls. Refer to the potential causes in the following sections. |
5353

54-
## Troubleshooting checklist
5554

56-
If you experience high latency times, follow these steps to pinpoint the offending client and the types of API calls that fail.
55+
## Causes and resolutions
56+
57+
### Cause 1: A network rule blocks the traffic from agent nodes to the API server
58+
59+
A network rule can block traffic between the agent nodes and the API server.
60+
61+
To check whether a misconfigured network policy is blocking communication between the API server and agent nodes, run the following [kubectl-aks](https://go.microsoft.com/fwlink/p/?linkid=2259767) commands:
62+
63+
```bash
64+
kubectl aks config import \
65+
--subscription <mySubscriptionID> \
66+
--resource-group <myResourceGroup> \
67+
--cluster-name <myAKSCluster>
68+
69+
kubectl aks check-apiserver-connectivity --node <myNode>
70+
```
5771

58-
### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests
72+
The [config import](https://go.microsoft.com/fwlink/p/?linkid=2259867#importing-configuration) command retrieves the Virtual Machine Scale Set information for all the nodes in the cluster. Then, the [check-apiserver-connectivity](https://go.microsoft.com/fwlink/p/?linkid=2259674) command uses this information to verify the network connectivity between the API server and a specified node, specifically for its underlying scale set instance.
73+
74+
> [!NOTE]
75+
> If the output of the `check-apiserver-connectivity` command contains the `Connectivity check: succeeded` message, then the network connectivity is unimpeded.
76+
77+
### Solution 1: Fix the network policy to remove the traffic blockage
78+
79+
If the command output indicates that a connection failure occurred, reconfigure the network policy so that it doesn't unnecessarily block traffic between the agent nodes and the API server.
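If the blockage comes from a Kubernetes network policy (rather than, for example, a network security group or firewall rule), you can start by reviewing the policies that are deployed in the cluster. The following commands are only a sketch; `<myNetworkPolicy>` and `<myNamespace>` are placeholders for your own values.

```bash
# List every network policy in the cluster, then inspect a suspect policy.
# <myNetworkPolicy> and <myNamespace> are placeholders for your own values.
kubectl get networkpolicy --all-namespaces
kubectl describe networkpolicy <myNetworkPolicy> --namespace <myNamespace>
```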
80+
81+
### Cause 2: An offending client leaks etcd objects and causes a slowdown of etcd
82+
83+
A common situation is that objects are continuously created while existing, unused objects in the etcd database aren't removed. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes to such objects can also cause the etcd database to exceed its default size limit of 4 gigabytes.
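To get a rough idea of whether one resource type is accumulating, you can count objects per type. The following loop is only a sketch; adjust the list of resource types to match your workloads.

```bash
# Count objects of a few commonly accumulating resource types across all namespaces.
# The types listed here are examples only.
for kind in events configmaps secrets jobs pods; do
  printf '%s: ' "$kind"
  kubectl get "$kind" --all-namespaces --no-headers 2>/dev/null | wc -l
done
```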
84+
85+
To check the etcd database usage, navigate to **Diagnose and Solve problems** in the Azure portal. Run the **Etcd Availability Issues** diagnosis tool by searching for "_etcd_" in the Search box. The diagnosis tool shows the usage breakdown and the total database size.
86+
87+
:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Availability Diagnosis for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
88+
89+
To get a quick view of the current size of your etcd database in bytes, run the following command:
90+
91+
```bash
92+
kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_size_bytes|apiserver_storage_db_total_size_in_bytes"
93+
```
94+
95+
> [!NOTE]
96+
> The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`.
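If you aren't sure which Kubernetes version your cluster runs, check it first. For example (assuming that the Azure CLI is installed and that the placeholders are replaced with your own values):

```bash
# Check the control plane version to decide which metric name applies.
kubectl version

# Or query the AKS resource directly. <myResourceGroup> and <myAKSCluster> are placeholders.
az aks show --resource-group <myResourceGroup> --name <myAKSCluster> \
    --query currentKubernetesVersion --output tsv
```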
97+
98+
### Solution 2: Define quotas for object creation, delete objects, or limit object lifetime in etcd
99+
100+
To prevent etcd from reaching capacity and causing cluster downtime, you can limit the maximum number of resources that are created. You can also slow the rate at which revisions are generated for resource instances. To limit the number of objects that can be created, [define object quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota).
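For example, the following object count quota caps how many Job and ConfigMap objects can exist in a single namespace. This is only a sketch; `<myNamespace>` and the limits are placeholders to adapt to your workloads.

```bash
# Sketch: cap object counts in a namespace by using an object count quota.
# <myNamespace> and the limits shown are placeholders.
kubectl apply --namespace <myNamespace> -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-count-quota
spec:
  hard:
    count/jobs.batch: "500"
    count/configmaps: "200"
EOF
```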
101+
102+
If you identified objects that are no longer in use but consume resources, consider deleting them. For example, delete completed jobs to free up space:
103+
104+
```bash
105+
kubectl delete jobs --field-selector status.successful=1
106+
```
107+
108+
For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set Time to Live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.
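For example, a finished Job can clean itself up if you set `ttlSecondsAfterFinished`, and labeled objects can be removed in bulk by using a label selector. The following commands are only a sketch; `<myJob>`, `<myNamespace>`, and the `cleanup=true` label are placeholders.

```bash
# Sketch: let the TTL controller delete a Job one hour after it finishes,
# and bulk delete labeled objects. The names and label are placeholders.
kubectl patch job <myJob> --namespace <myNamespace> \
    --type merge --patch '{"spec":{"ttlSecondsAfterFinished":3600}}'
kubectl delete configmaps --namespace <myNamespace> --selector cleanup=true
```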
109+
110+
### Cause 3: An offending client makes excessive LIST or PUT calls
111+
112+
If you determine that etcd isn't overloaded with too many objects, an offending client might be making too many `LIST` or `PUT` calls to the API server.
113+
If you experience high latency or frequent timeouts, follow these steps to pinpoint the offending client and the types of API calls that fail.
114+
115+
#### <a id="identifytopuseragents"></a> Step 1: Identify top user agents by the number of requests
59116

60117
To identify which clients generate the most requests (and potentially the most API server load), run a query that resembles the following code. This query lists the top 10 user agents by the number of API server requests sent.
61118

62-
#### [Resource-specific](#tab/resource-specific)
119+
##### [Resource-specific](#tab/resource-specific)
63120

64121
```kusto
65122
AKSAudit
@@ -69,7 +126,7 @@ AKSAudit
69126
| project UserAgent, count_
70127
```
71128

72-
#### [Azure diagnostics](#tab/azure-diagnostics)
129+
##### [Azure diagnostics](#tab/azure-diagnostics)
73130

74131
```kusto
75132
AzureDiagnostics
@@ -87,8 +144,8 @@ AzureDiagnostics
87144
> If your query returns no results, you might have selected the wrong table to query diagnostics logs. In resource-specific mode, data is written to individual tables, depending on the category of the resource. Diagnostics logs are written to the `AKSAudit` table. In Azure diagnostics mode, all data is written to the `AzureDiagnostics` table. For more information, see [Azure resource logs](/azure/azure-monitor/essentials/resource-logs).
88145
89146
Although it's helpful to know which clients generate the highest request volume, high request volume alone might not be a cause for concern. The response latency that clients experience is a better indicator of the actual load that each one generates on the API server.
90-
### Step 2 Identify and chart latency for user agentd
91-
#### [Diagnose and Solve](#/tab/Diagnose-and-solve)
147+
#### Step 2: Identify and analyze latency for user agents
148+
##### Using Diagnose and Solve problems in the Azure portal
92149

93150
AKS now provides a built-in analyzer, the API Server Resource Intensive Listing Detector, to help you identify agents that make resource-intensive LIST calls. These calls are a leading cause of API server and etcd performance issues.
94151

@@ -105,7 +162,7 @@ The detector analyzes recent API server activity and highlights agents or worklo
105162

106163
:::image type="content" source="media/troubleshoot-apiserver-etcd/resource-intensive-listing-analyzer-2.png" alt-text="Screenshot that shows the apiserver perf detector detailed view." lightbox="media/troubleshoot-apiserver-etcd/resource-intensive-listing-analyzer-2.png":::
107164

108-
##### How to interpret the detector output
165+
###### How to interpret the detector output
109166

110167
- **Summary:**
111168
Indicates if resource-intensive LIST calls were detected and describes possible impacts on your cluster.
@@ -116,20 +173,16 @@ The detector analyzes recent API server activity and highlights agents or worklo
116173
- **Charts and tables:**
117174
Identify which agents, namespaces, or workloads are generating the most resource-intensive LIST calls.
118175

119-
> Only successful LIST calls are counted. Failed or throttled calls are excluded.
120-
121-
The analyzer also provides recommendations directly in the Azure portal. These recommendations are tailored to the detected patterns to help you remediate and optimize your cluster.
122-
123176
> [!NOTE]
124-
> The API server resource intensive listing detector is available to all users who have access to the AKS resource in the Azure portal. No special permissions or prerequisites are required.
125-
>
126-
> After you identify the offending agents and apply the recommendations, you can use [the API Priority and Fairness feature](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) to throttle or isolate problematic clients. Alternatively, refer to the "Cause 3" section of [Troubleshoot API server and etcd problems in Azure Kubernetes Services](/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/troubleshoot-apiserver-etcd?branch=pr-en-us-9260&tabs=resource-specific#cause-3-an-offending-client-makes-excessive-list-or-put-calls).
177+
> * The API server resource intensive listing detector is available to all users who have access to the AKS resource in the Azure portal. No special permissions or prerequisites are required.
178+
> * Only successful LIST calls are counted. Failed or throttled calls are excluded.
179+
> * After you identify the offending agents and apply the recommendations, you can use [the API Priority and Fairness feature](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) to throttle or isolate problematic clients (a minimal sketch follows this note). Alternatively, see the [Cause 3: An offending client makes excessive LIST or PUT calls](#cause-3-an-offending-client-makes-excessive-list-or-put-calls) section earlier in this article.
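If you decide to use API Priority and Fairness, the following minimal sketch routes requests from a noisy in-cluster client (identified here by a placeholder service account) to the low-concurrency `catch-all` priority level. The `flowcontrol.apiserver.k8s.io` API version differs across Kubernetes releases, so adjust it for your cluster.

```bash
# Sketch: throttle a noisy in-cluster client by matching its service account
# and assigning it to a low-concurrency priority level.
# my-noisy-client and <myNamespace> are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: throttle-noisy-client
spec:
  priorityLevelConfiguration:
    name: catch-all
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: my-noisy-client
        namespace: "<myNamespace>"
    resourceRules:
    - verbs: ["list"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
EOF
```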
127180
128-
#### [Logs](#/tab/logs)
181+
##### Using Logs
129182

130183
To identify the average latency of API server requests per user agent, as plotted on a time chart, run the following query.
131184

132-
##### [Resource-specific](#tab/resource-specific)
185+
###### [Resource-specific](#tab/resource-specific)
133186

134187
```kusto
135188
AKSAudit
@@ -141,7 +194,7 @@ AKSAudit
141194
| render timechart
142195
```
143196

144-
##### [Azure diagnostics](#tab/azure-diagnostics)
197+
###### [Azure diagnostics](#tab/azure-diagnostics)
145198

146199
```kusto
147200
AzureDiagnostics
@@ -162,11 +215,11 @@ This query is a follow-up to the query in the ["Identify top user agents by the
162215
> [!TIP]
163216
> By analyzing this data, you can identify patterns and anomalies that can indicate problems on your AKS cluster or applications. For example, you might notice that a particular user is experiencing high latency. This scenario can indicate the type of API calls that are causing excessive load on the API server or etcd.
164217
165-
### Step 3: Identify bad API calls for a given user agent
218+
#### Step 3: Identify bad API calls for a given user agent
166219

167220
Run the following query to tabulate the 99th percentile (P99) latency of API calls across different resource types for a given client.
168221

169-
#### [Resource-specific](#tab/resource-specific)
222+
##### [Resource-specific](#tab/resource-specific)
170223

171224
```kusto
172225
AKSAudit
@@ -182,7 +235,7 @@ AKSAudit
182235
| render table
183236
```
184237

185-
#### [Azure diagnostics](#tab/azure-diagnostics)
238+
##### [Azure diagnostics](#tab/azure-diagnostics)
186239

187240
```kusto
188241
AzureDiagnostics
@@ -204,65 +257,6 @@ AzureDiagnostics
204257

205258
The results from this query can be useful to identify the kinds of API calls that fail the upstream Kubernetes SLOs. In most cases, an offending client might be making too many `LIST` calls on a large set of objects or objects that are too large. Unfortunately, no hard scalability limits are available to guide users about API server scalability. API server or etcd scalability limits depend on various factors that are explained in [Kubernetes Scalability thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md).
206259

207-
## Cause and Resolution
208-
209-
### Cause 1: A network rule blocks the traffic from agent nodes to the API server
210-
211-
A network rule can block traffic between the agent nodes and the API server.
212-
213-
To check whether a misconfigured network policy is blocking communication between the API server and agent nodes, run the following [kubectl-aks](https://go.microsoft.com/fwlink/p/?linkid=2259767) commands:
214-
215-
```bash
216-
kubectl aks config import \
217-
--subscription <mySubscriptionID> \
218-
--resource-group <myResourceGroup> \
219-
--cluster-name <myAKSCluster>
220-
221-
kubectl aks check-apiserver-connectivity --node <myNode>
222-
```
223-
224-
The [config import](https://go.microsoft.com/fwlink/p/?linkid=2259867#importing-configuration) command retrieves the Virtual Machine Scale Set information for all the nodes in the cluster. Then, the [check-apiserver-connectivity](https://go.microsoft.com/fwlink/p/?linkid=2259674) command uses this information to verify the network connectivity between the API server and a specified node, specifically for its underlying scale set instance.
225-
226-
> [!NOTE]
227-
> If the output of the `check-apiserver-connectivity` command contains the `Connectivity check: succeeded` message, then the network connectivity is unimpeded.
228-
229-
### Solution 1: Fix the network policy to remove the traffic blockage
230-
231-
If the command output indicates that a connection failure occurred, reconfigure the network policy so that it doesn't unnecessarily block traffic between the agent nodes and the API server.
232-
233-
### Cause 2: An offending client leaks etcd objects and causes a slowdown of etcd
234-
235-
A common situation is that objects are continuously created even though existing unused objects in the etcd database aren't removed. This situation can cause performance problems if etcd handles too many objects (more than 10,000) of any type. A rapid increase of changes on such objects could also cause the default size of the etcd database (by default, 4 gigabytes) to be exceeded.
236-
237-
To check the etcd database usage, navigate to **Diagnose and Solve problems** in the Azure portal. Run the **Etcd Availability Issues** diagnosis tool by searching for "_etcd_" in the Search box. The diagnosis tool shows the usage breakdown and the total database size.
238-
239-
:::image type="content" source="media/troubleshoot-apiserver-etcd/etcd-detector.png" alt-text="Azure portal screenshot that shows the Etcd Availability Diagnosis for Azure Kubernetes Service (AKS)." lightbox="media/troubleshoot-apiserver-etcd/etcd-detector.png":::
240-
241-
To get a quick view of the current size of your etcd database in bytes, run the following command:
242-
243-
```bash
244-
kubectl get --raw /metrics | grep -E "etcd_db_total_size_in_bytes|apiserver_storage_size_bytes|apiserver_storage_db_total_size_in_bytes"
245-
```
246-
247-
> [!NOTE]
248-
> The metric name in the previous command is different for different Kubernetes versions. For Kubernetes 1.25 and earlier versions, use `etcd_db_total_size_in_bytes`. For Kubernetes 1.26 to 1.28, use `apiserver_storage_db_total_size_in_bytes`.
249-
250-
### Solution 2: Define quotas for object creation, delete objects, or limit object lifetime in etcd
251-
252-
To prevent etcd from reaching capacity and causing cluster downtime, you can limit the maximum number of resources that are created. You can also slow the number of revisions that are generated for resource instances. To limit the number of objects that can be created, [define object quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota).
253-
254-
If you identified objects that are no longer in use but consume resources, consider deleting them. For example, delete completed jobs to free up space:
255-
256-
```bash
257-
kubectl delete jobs --field-selector status.successful=1
258-
```
259-
260-
For objects that support [automatic cleanup](https://kubernetes.io/docs/concepts/architecture/garbage-collection/), set Time to Live (TTL) values to limit the lifetime of these objects. You can also label your objects so that you can bulk delete all the objects of a specific type by using label selectors. If you establish [owner references](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/) among objects, any dependent objects are automatically deleted after the parent object is deleted.
261-
262-
### Cause 3: An offending client makes excessive LIST or PUT calls
263-
264-
If you determine that etcd isn't overloaded with too many objects, an offending client might be making too many `LIST` or `PUT` calls to the API server.
265-
266260
### Solution 3a: Tune your API call pattern
267261

268262
To reduce the pressure on the control plane, consider tuning your client's API call pattern.
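For example, one common adjustment for ad hoc clients is to scope `LIST` calls with label and field selectors instead of listing every object in the cluster. The following command is only a sketch; `<myApp>` is a placeholder label value.

```bash
# Sketch: scope LIST calls with selectors instead of listing everything.
# <myApp> is a placeholder label value.
kubectl get pods --all-namespaces --selector app=<myApp> --field-selector status.phase=Running
```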
