Commit 65b3610

Merge pull request #53034 from MicrosoftDocs/NEW-monitor-apps-azure-kubernetes-service
New monitor apps azure kubernetes service module from release branch
2 parents abdc183 + 21fb0f5 commit 65b3610

15 files changed

Lines changed: 609 additions & 0 deletions
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.introduction
title: Introduction
metadata:
  title: Introduction
  description: Introduction
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 3
content: |
  [!include[](includes/1-introduction.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.monitor-logs-metrics
title: Monitor application logs and metrics
metadata:
  title: Monitor Application Logs and Metrics
  description: Monitor application logs and metrics
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 8
content: |
  [!include[](includes/2-monitor-logs-metrics.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.troubleshoot-pods-services
title: Troubleshoot pods and services
metadata:
  title: Troubleshoot Pods and Services
  description: Troubleshoot pods and services
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 8
content: |
  [!include[](includes/3-troubleshoot-pods-services.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.verify-connectivity
title: Verify service connectivity and endpoints
metadata:
  title: Verify Service Connectivity and Endpoints
  description: Verify service connectivity and endpoints
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 8
content: |
  [!include[](includes/4-verify-connectivity.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.exercise-troubleshoot-apps
title: Exercise - Troubleshoot apps on Azure Kubernetes Service
metadata:
  title: Exercise - Troubleshoot Apps on Azure Kubernetes Service
  description: Exercise - Troubleshoot apps on Azure Kubernetes Service
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 30
content: |
  [!include[](includes/5-exercise-troubleshoot-apps.md)]
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.module-assessment
title: Module assessment
metadata:
  title: Module Assessment
  description: Module assessment
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 5
content: |
quiz:
  questions:
  - content: "You receive reports that an AI inference API on Azure Kubernetes Service occasionally returns HTTP 500 errors and higher latency. You want to inspect recent error messages for a specific pod while reproducing the issue. Which approach is most appropriate?"
    choices:
    - content: "Use `kubectl logs -f <pod-name> -n <namespace>` while you send test requests to the API."
      isCorrect: true
      explanation: "Streaming logs from the specific pod while you reproduce the issue lets you see real-time error messages and stack traces that explain the HTTP 500 errors and latency."
    - content: "Open Azure Monitor and review only node CPU metrics for the last week."
      isCorrect: false
      explanation: "Node CPU metrics alone don't show application-level errors for a specific pod, and a week-long window is too broad for targeted debugging."
    - content: "Run `kubectl describe node` on all nodes to look for scheduling events."
      isCorrect: false
      explanation: "Node describe output focuses on node health and scheduling, not on recent application log messages for a single pod."
  - content: "A pod that runs a model-serving container is stuck in CrashLoopBackOff. You want to understand why the container exits. What should you do first?"
    choices:
    - content: "Run `kubectl describe pod <pod-name> -n <namespace>` and inspect events and container status."
      isCorrect: true
      explanation: "`kubectl describe pod` shows recent events and container status, which usually reveal the exit reason, failed probes, or configuration problems causing CrashLoopBackOff."
    - content: "Immediately delete the pod so Kubernetes recreates it."
      isCorrect: false
      explanation: "Deleting the pod without inspecting it first can hide the root cause and usually leads to another CrashLoopBackOff cycle."
    - content: "Scale the Deployment to zero replicas and then scale it back up."
      isCorrect: false
      explanation: "Scaling down and up restarts pods but doesn't explain why the container fails, so it doesn't help you diagnose the problem."
  - content: "A Service that fronts your AI API shows no endpoints, even though the pods appear healthy and ready. Which command helps you confirm whether the Service selectors match pod labels?"
    choices:
    - content: "`kubectl describe service <service-name> -n <namespace>`"
      isCorrect: true
      explanation: "`kubectl describe service` shows the selector labels and current endpoints, so you can verify whether the Service matches the pods."
    - content: "`kubectl top nodes`"
      isCorrect: false
      explanation: "`kubectl top nodes` only shows node resource usage and doesn't expose Service selector or endpoint details."
    - content: "`kubectl logs <pod-name> -n <namespace>`"
      isCorrect: false
      explanation: "Pod logs help debug application behavior but don't show how the Service selects pods or why it has no endpoints."
  - content: "You need to debug a new AI endpoint inside the cluster before exposing it externally. You want to send HTTP requests from your development machine directly to the Service. Which command should you use?"
    choices:
    - content: "`kubectl port-forward service/<service-name> 8080:80 -n <namespace>`"
      isCorrect: true
      explanation: "`kubectl port-forward` from your workstation to the Service lets you send HTTP requests directly to the endpoint inside the cluster."
    - content: "`kubectl get endpoints <service-name> -n <namespace>`"
      isCorrect: false
      explanation: "`kubectl get endpoints` shows which pods back the Service but doesn't forward traffic from your development machine."
    - content: "`kubectl describe node` on the node that hosts the pod"
      isCorrect: false
      explanation: "`kubectl describe node` focuses on node state and scheduling, not on creating a local tunnel to the Service for HTTP requests."
  - content: "Metrics for a model-serving pod show sustained CPU usage at its configured limit, and users report increased latency. What is the most appropriate next step?"
    choices:
    - content: "Adjust CPU requests and limits or scale out replicas so the pod has enough capacity."
      isCorrect: true
      explanation: "When a pod consistently hits its CPU limit and latency increases, you typically need to raise CPU resources or add replicas so the workload has more capacity."
    - content: "Ignore the metrics because the pod is still running."
      isCorrect: false
      explanation: "Ignoring sustained high CPU and latency risks breaching performance objectives and causing user-facing issues."
    - content: "Delete the Service and recreate it with the same configuration."
      isCorrect: false
      explanation: "Recreating the Service doesn't address CPU saturation inside the pod and won't resolve the latency problem."
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.monitor-apps-azure-kubernetes-service.summary
title: Summary
metadata:
  title: Summary
  description: Summary
  ms.date: 12/31/2025
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 3
content: |
  [!include[](includes/7-summary.md)]
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
AI applications on Azure need reliable monitoring and fast troubleshooting to meet user expectations. These workloads often run as services and background workers on Azure Kubernetes Service (AKS). Monitoring and debugging applications on AKS help you spot issues such as latency spikes, failed inference calls, and resource saturation before they affect users.

AKS provides multiple ways to monitor and troubleshoot applications. The Azure portal offers visual tools like the Workloads blade, Live Logs, and the Diagnose and solve problems feature for quick inspection and guided troubleshooting. For detailed investigation, `kubectl` commands give you direct access to cluster resources, logs, and events. Many developers use both approaches together: the portal for visual assessment and `kubectl` for in-depth analysis.

Imagine you deploy a model inference API and a background worker that enriches data for a recommendation system. Both components run on AKS and must stay responsive as traffic changes. When errors or slowdowns occur, you need to understand whether the problem comes from the application code, Kubernetes configuration, or the underlying cluster. By learning how to monitor logs and metrics, inspect pods and Services, and verify connectivity using both the Azure portal and command-line tools, you can keep AI workloads healthy on AKS.

## After completing this module, you'll be able to:

- Explain which monitoring signals matter for AI applications on AKS
- Use the Azure portal and `kubectl` commands to inspect application logs and metrics
- Troubleshoot pod and Service issues using visual tools and command-line investigation
- Verify Service and ingress connectivity so clients can reach AI endpoints
- Apply a structured workflow to deploy, monitor, and debug applications on AKS

> [!NOTE]
> All commands and patterns in this module use current AKS and Kubernetes concepts. You should validate resource definitions and flags against official Kubernetes and Azure Kubernetes Service documentation when you adapt examples to your environment.
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
In this unit, you learn how to observe application behavior inside an Azure Kubernetes Service (AKS) cluster. You focus on logs and metrics that reveal how AI workloads behave in production. You see how to use both the Azure portal and `kubectl` for inspections, and how these signals relate to Azure monitoring tools.

AI applications such as model inference APIs and background processors depend on predictable performance. Latency spikes, increased error rates, or CPU saturation can indicate model issues or configuration problems. Monitoring logs and metrics helps you distinguish between normal variation and emerging incidents. You can then decide when to scale, investigate, or roll back changes.

AKS provides multiple ways to access logs and metrics. The Azure portal offers visual dashboards, Live Logs streaming, and the Monitoring tab on your cluster resource. For command-line access, `kubectl` provides direct queries against pods and nodes. Many developers start with the portal for a quick overview and then use `kubectl` for targeted investigation.

## Identify key monitoring signals for AI workloads

You start by identifying which signals matter most for AI services that run on AKS. For example, an inference API might expose HTTP status codes and latency. Background workers might track queue depth or batch processing times.

Important signals include:

- Response latency and throughput for AI endpoints
- Error rates such as HTTP 5xx responses or timeouts
- Pod restart counts and container exit codes
- CPU and memory utilization compared to configured requests and limits

Together, these signals help you determine whether your AI workload stays within its performance and reliability targets.

## View logs using the Azure portal

The Azure portal provides visual tools to view container logs without command-line access. You can stream logs in real time or view recent output directly from the AKS resource.

To view logs in the Azure portal:

1. Navigate to your AKS cluster in the Azure portal.
1. Under **Kubernetes resources**, select **Workloads**.
1. Select a deployment, pod, or other workload type.
1. Select **Live Logs** to stream container output in real time.

The Live Logs feature shows container stdout and stderr as the application produces output. You can pause the stream, search for specific text, and switch between containers in multi-container pods. This approach is useful when you need quick access to logs without configuring `kubectl`, or when you want to share a browser session with a colleague.

You can also view logs from the **Insights** section under **Monitoring**. Container insights provides a unified view of logs across your cluster, with filtering by namespace, pod, and container. This view is helpful when you need to correlate logs across multiple pods or search for patterns across your AI workloads.

## Use `kubectl logs` to inspect application behavior

You can use `kubectl logs` to read container logs directly from pods. This method is useful when you need to inspect an application quickly or reproduce an issue.

A typical flow is:

1. List pods for your AI application in the target namespace.
1. Stream logs from a specific pod while you send test traffic.
1. Filter for errors, timeouts, or unexpected responses.

You might run the following commands:

```bash
kubectl get pods -n ai-workloads
kubectl logs <pod-name> -n ai-workloads
kubectl logs -f <pod-name> -n ai-workloads
```

These log commands target a specific Kubernetes namespace instead of the default namespace. A namespace is a logical boundary inside the cluster that groups related workloads, such as all AI services for a particular environment.

You can use the `-n ai-workloads` flag to tell `kubectl` to look in the `ai-workloads` namespace when you list pods or fetch logs. If your AI workloads run in the `default` namespace, or your current context already sets a namespace, you can omit the flag or replace `ai-workloads` with the namespace that matches your deployment.

Namespaces complement Kubernetes labels. Namespaces help you separate environments and control access and quotas, while labels help you select specific pods inside a namespace when you scale, roll out updates, or filter logs for a single AI service.

If a pod has multiple containers, such as a sidecar for logging or metrics, you specify the container name:

```bash
kubectl logs <pod-name> -c inference-api -n ai-workloads
```

You can use these logs to find patterns such as repeated timeouts when calling an upstream model endpoint or failures when retrieving features from a cache.
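One common pattern is to filter log output for error signatures with standard shell tools. As a sketch, the log lines below are hypothetical stand-ins for real output; in practice you would pipe `kubectl logs <pod-name> -n ai-workloads` into the same `grep`:

```shell
# Hypothetical inference-API log lines standing in for real `kubectl logs` output.
printf '%s\n' \
  'INFO  request_id=a1 status=200 latency_ms=45' \
  'ERROR request_id=a2 upstream model endpoint timeout after 2000ms' \
  'INFO  request_id=a3 status=200 latency_ms=52' \
  'ERROR request_id=a4 status=500 feature cache lookup failed' |
  grep -E 'ERROR|timeout'   # keep only error and timeout lines
```

The same filter works on a live stream from `kubectl logs -f`, which lets you watch only failures while you send test traffic.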
> [!NOTE]
> Code fragments and commands in this unit are patterns you can adapt. Replace namespace, pod, and container names with values from your own AKS environment.

## View resource metrics using the Azure portal

The Azure portal provides visual dashboards for CPU, memory, and other resource metrics without requiring command-line access.

To view metrics in the Azure portal:

1. Navigate to your AKS cluster in the Azure portal.
1. Select the **Monitoring** tab on the Overview page to see metric graphs for your cluster.
1. Under **Monitoring**, select **Insights** to access Container insights dashboards.
1. Use the **Nodes**, **Controllers**, or **Containers** tabs to view resource utilization at different levels.

The Monitoring tab shows CPU and memory usage graphs separated by node pool. You can select any graph to open it in metrics explorer for deeper analysis. Container insights provides heat maps and performance charts that highlight resource-constrained pods and nodes.

You can also view live metrics for individual pods:

1. Under **Monitoring**, select **Insights**.
1. Select the **Nodes** or **Controllers** tab, then select a pod.
1. Select **Live Metrics** to see real-time CPU, memory, and network data.

These visual tools help you quickly identify which AI workloads are under resource pressure without running commands.

## View resource metrics using `kubectl`

Resource metrics help you understand whether your AI workload uses CPU and memory as expected. High CPU utilization can indicate a model that needs more resources or a need to scale out replicas. High memory usage can cause container restarts or degraded performance.

If your cluster has the metrics server installed, you can run:

```bash
kubectl top nodes
kubectl top pods -n ai-workloads
```

This output shows per-node and per-pod CPU and memory usage. You can compare these values to the requests and limits defined in your pod specifications. If a container constantly runs at its CPU limit, the container runtime throttles it, which can increase inference latency.
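The comparison itself is simple arithmetic: divide observed usage by the configured limit. The values below are hypothetical; in a real cluster you would take usage from `kubectl top pods` and read the configured limit from the pod spec, for example with `kubectl get pod <pod-name> -n ai-workloads -o jsonpath='{.spec.containers[0].resources.limits.cpu}'`:

```shell
# Hypothetical numbers: usage as reported by `kubectl top pods` (e.g. "480m"),
# limit as configured in resources.limits.cpu (e.g. "500m").
usage_millicores=480
limit_millicores=500
pct=$(( usage_millicores * 100 / limit_millicores ))
echo "CPU usage is ${pct}% of the configured limit"
if [ "$pct" -ge 90 ]; then
  echo "Sustained usage near the limit: consider raising the limit or scaling out"
fi
```

A workload that sits near 100% of its limit for long stretches is a candidate for higher CPU limits or more replicas, as the module assessment also emphasizes.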
## Combine the Azure portal and `kubectl` for effective monitoring

AKS integrates with Azure Monitor so you can collect and visualize telemetry over time. The monitoring tools covered in this unit work together to provide comprehensive visibility into your AI workloads.

A typical monitoring workflow combines both approaches:

- Use the Azure portal Monitoring tab for a quick health overview of your cluster
- Use Live Logs in the portal to stream container output when investigating an issue
- Use `kubectl logs` when you need to filter logs or pipe output to other tools
- Use `kubectl top` for a quick snapshot of current resource usage
- Use Container insights dashboards for historical trends and cross-pod analysis

For example, you might notice high CPU usage on the portal's Monitoring tab, drill into Container insights to identify the affected pods, then use `kubectl logs` to inspect application behavior during the spike. This combination helps you respond quickly to incidents and learn from historical data.

## Best practices for monitoring AI workloads on AKS

- **Start with the portal for a visual overview:** Use the Monitoring tab and Container insights for a quick assessment of cluster and workload health before diving into command-line investigation.
- **Use Live Logs for real-time observation:** Stream container output in the Azure portal when you need to watch application behavior during testing or incident response.
- **Combine the portal and `kubectl`:** Use the portal for visual dashboards and historical trends, and use `kubectl` for targeted queries and scripted automation.
- **Include correlation data in logs:** Add request identifiers, model names, and version information to log entries so you can trace problematic inference calls.
- **Prefer structured logging:** Emit logs in structured formats that are easier to query and filter in both the portal and command-line tools.
- **Measure against service objectives:** Define latency and error budget targets for your AI services and select metrics that indicate when you approach those thresholds.
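To illustrate the structured-logging practice, a JSON log line can be filtered by field rather than by message wording. The log lines and field names below are hypothetical examples:

```shell
# Two hypothetical structured (JSON) log lines from an inference API.
printf '%s\n' \
  '{"level":"ERROR","request_id":"a2","model":"reco-v3","latency_ms":2100}' \
  '{"level":"INFO","request_id":"a3","model":"reco-v3","latency_ms":48}' |
  grep '"level":"ERROR"'   # field-based filtering stays stable as message text changes
```

With free-text logs, the same filter would depend on fragile message phrasing; structured fields keep queries reliable in shell pipelines and in Container insights log queries alike.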
