
Commit 408e1f7

Merge pull request #53353 from wwlpublish/156462-2
Fixed Feedback bugs
2 parents ff948b9 + 19548fe

27 files changed

Lines changed: 469 additions & 0 deletions
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.introduction
title: "Introduction"
metadata:
  title: "Introduction"
  description: "Introduction"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 5
content: |
  [!include[](includes/1-introduction.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.understand-azure-monitor-metrics-visualization
title: "Understand Azure Monitor metrics and visualization"
metadata:
  title: "Understand Azure Monitor metrics and visualization"
  description: "Understand Azure Monitor metrics and visualization"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 12
content: |
  [!include[](includes/2-understand-azure-monitor-metrics-visualization.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.configure-alerts-alert-processing-rules
title: "Configure alerts and alert processing rules"
metadata:
  title: "Configure alerts and alert processing rules"
  description: "Configure alerts and alert processing rules"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 13
content: |
  [!include[](includes/3-configure-alerts-alert-processing-rules.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.query-log-data-log-analytics-workspace
title: "Query log data in Log Analytics Workspace"
metadata:
  title: "Query log data in Log Analytics Workspace"
  description: "Query log data in Log Analytics Workspace"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 11
content: |
  [!include[](includes/4-query-log-data-log-analytics-workspace.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.exercise-configure-monitoring-azure-infrastructure
title: "Configure monitoring for Azure infrastructure"
metadata:
  title: "Configure monitoring for Azure infrastructure"
  description: "Configure monitoring for Azure infrastructure"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 60
content: |
  [!include[](includes/5-exercise-configure-monitoring-azure-infrastructure.md)]
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.knowledge-check
title: "Module assessment"
metadata:
  title: "Knowledge check"
  description: "Test your understanding of Azure Monitor implementation by answering these scenario-based questions. Consider how you would apply monitoring concepts to real-world infrastructure management challenges."
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
  module_assessment: false
durationInMinutes: 5
content: "Choose the best response for each of the following questions."
quiz:
  questions:
  - content: "Your operations team manages 50 virtual machines running AI training workloads. You need to track memory usage inside the VMs to detect when training processes consume excessive resources. Which metric collection method provides the visibility you require?"
    choices:
    - content: "Platform metrics collected automatically by Azure Monitor, which include memory available bytes for all virtual machines"
      isCorrect: false
      explanation: "Incorrect. Platform metrics don't include memory usage inside the virtual machine's operating system—they only track resource consumption at the Azure infrastructure level like CPU percentage and disk IOPS."
    - content: "Guest OS metrics collected by the Azure Diagnostics extension installed on each virtual machine"
      isCorrect: true
      explanation: "Correct. Guest OS metrics provide visibility into memory usage and process-level performance counters. The Azure Diagnostics extension must be installed on each VM to collect these metrics."
    - content: "Custom metrics published from your training application using the Application Insights SDK"
      isCorrect: false
      explanation: "Incorrect. Custom metrics would work but require modifying your training application code to publish metrics, adding unnecessary complexity when guest OS metrics provide the needed data automatically once the diagnostics extension is installed."
  - content: "You create an alert rule that fires when storage account transaction latency exceeds 500 milliseconds. During testing, you notice the alert fires briefly every hour during backup operations, generating notifications your team ignores. How should you reduce alert fatigue while maintaining visibility into genuine latency issues?"
    choices:
    - content: "Increase the latency threshold to 1000 milliseconds and reduce the evaluation frequency to 15 minutes"
      isCorrect: false
      explanation: "Incorrect. Increasing the threshold to 1000 milliseconds would hide real latency issues that occur between 500-1000ms, potentially missing performance degradation that affects user experience."
    - content: "Create an alert processing rule that suppresses notifications during the 10-minute backup window each hour"
      isCorrect: true
      explanation: "Correct. An alert processing rule with time-based suppression eliminates notifications for expected latency spikes during backups while preserving the alert rule for genuine performance problems outside the backup window."
    - content: "Disable the alert rule entirely and rely on user reports to detect storage performance problems"
      isCorrect: false
      explanation: "Incorrect. Disabling the alert entirely removes proactive monitoring, forcing your team into reactive mode where storage problems surface only after users complain, increasing mean time to detection and business impact."
  - content: "After receiving an alert about high CPU usage on a virtual machine, you need to identify which process consumed the most resources during the spike. Which Kusto Query Language (KQL) query pattern provides this information?"
    choices:
    - content: "Query the AzureDiagnostics table filtering for Level equals 'Error' to find failures that caused the CPU spike"
      isCorrect: false
      explanation: "Incorrect. The AzureDiagnostics table contains error logs but doesn't provide process-level performance data—errors might correlate with high CPU but won't tell you which process was responsible."
    - content: "Query the Perf table filtering for CounterName equals '% Processor Time' and summarize by process name"
      isCorrect: true
      explanation: "Correct. Querying the Perf table for processor time counters by process name provides the granular resource consumption data needed to identify which specific process caused the CPU spike."
    - content: "Query the SecurityEvent table to detect unauthorized access attempts that might have triggered resource-intensive operations"
      isCorrect: false
      explanation: "Incorrect. The SecurityEvent table tracks authentication and access events, useful for security investigations but irrelevant for diagnosing resource consumption patterns that cause CPU saturation."
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.manage-monitoring-ai-ready-infrastructure.summary
title: "Summary"
metadata:
  title: "Summary"
  description: "Summary"
  ms.date: 02/02/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 2
content: |
  [!include[](includes/7-summary.md)]
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
Your company runs machine learning workloads on Azure that analyze customer data around the clock. Last week, a virtual machine crashed during a training job, but your team discovered the failure only after users reported missing results. By then, two hours of compute time were wasted, and the training pipeline needed a manual restart. This scenario highlights a critical gap: without proactive monitoring, infrastructure failures disrupt business operations before anyone notices.

Azure Monitor closes this gap by collecting metrics, logs, and alerts from your infrastructure in real time. With Azure Monitor, you detect performance degradation before it causes downtime, receive notifications when resources exceed capacity thresholds, and query log data to diagnose the root cause of failures. For AI workloads that demand high availability, this visibility translates to measurable outcomes: reduced mean time to resolution (MTTR), improved service level agreement (SLA) compliance, and fewer manual interventions during production incidents.

In this module, you configure monitoring for Azure infrastructure supporting AI workloads. You set up metric collection to track CPU, memory, and disk performance. You create alert rules that notify your operations team when thresholds are breached. You implement alert processing rules to suppress notifications during planned maintenance windows. Finally, you query log data in Log Analytics Workspace to investigate infrastructure events and validate your monitoring configuration.

## Learning objectives

By the end of this module, you're able to:

- Explain how Azure Monitor and Log Analytics Workspace support infrastructure management
- Configure metrics collection and visualization for Azure resources
- Implement alert rules and processing rules to respond to infrastructure events
- Query log data to diagnose infrastructure issues

## Prerequisites

- Familiarity with basic Azure concepts and resource types such as virtual machines, storage accounts, and networking components
- Access to an Azure subscription with Contributor permissions to create and configure resources
- Understanding of fundamental networking and compute concepts including IP addressing, load balancing, and CPU utilization

## More resources

- [Azure Monitor overview](/azure/azure-monitor/overview) - Official documentation covering Azure Monitor architecture and capabilities
- [Log Analytics Workspace documentation](/azure/azure-monitor/logs/log-analytics-workspace-overview) - Detailed guide to Log Analytics Workspace setup and query capabilities
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
You need visibility into infrastructure performance before problems escalate into outages. Consider a scenario where your virtual machine's CPU usage climbs steadily over several days. Without metric tracking, you discover the capacity issue only after the VM becomes unresponsive and training jobs fail. Azure Monitor metrics solve this problem by collecting performance data automatically from every Azure resource you deploy.

## How Azure Monitor collects metrics

Azure Monitor captures three types of metrics that provide different layers of visibility into your infrastructure. Platform metrics are collected automatically the moment you create a resource—no configuration required. When you deploy a virtual machine, Azure Monitor immediately begins tracking CPU percentage, network throughput, and disk operations per second. These metrics flow into Azure Monitor's time-series database every 60 seconds, giving you near real-time visibility into resource behavior.
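
Outside the portal, you can also read platform metrics programmatically. The following is a minimal sketch using the `azure-monitor-query` Python package (installed separately, for example with `pip install azure-monitor-query azure-identity`); the resource ID is a hypothetical placeholder, so substitute your own subscription, resource group, and VM name.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Hypothetical resource ID -- replace each segment with your own values.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Pull the last 24 hours of the platform 'Percentage CPU' metric at
# one-hour granularity, requesting average and maximum aggregations.
response = client.query_resource(
    resource_id,
    metric_names=["Percentage CPU"],
    timespan=timedelta(days=1),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.AVERAGE, MetricAggregationType.MAXIMUM],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(f"{point.time_stamp}: avg={point.average}, max={point.maximum}")
```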

Platform metrics cover the fundamentals, but they don't reveal what's happening inside your virtual machine's operating system. For deeper visibility, you enable guest OS metrics by installing the Azure Diagnostics extension on your VM. This agent collects memory usage, process-level performance counters, and application-specific metrics that platform monitoring can't access. With this approach, you track not just whether your VM is running, but whether it has sufficient memory to handle current workloads and which processes consume the most resources.
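
Once those guest performance counters also flow into a Log Analytics workspace (for example, via the Azure Monitor Agent and a data collection rule), a short Kusto query can rank processes by CPU consumption. The sketch below runs such a query with the same Python SDK's `LogsQueryClient`; the workspace ID is a hypothetical placeholder, and the counter names assume Windows-style `Perf` records.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

workspace_id = "<log-analytics-workspace-id>"  # hypothetical placeholder

# KQL: average per-process '% Processor Time' over the query window,
# excluding the synthetic _Total and Idle instances.
query = """
Perf
| where ObjectName == 'Process' and CounterName == '% Processor Time'
| where InstanceName !in ('_Total', 'Idle')
| summarize AvgCpu = avg(CounterValue) by Computer, InstanceName
| top 5 by AvgCpu desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(workspace_id, query, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        print(row)
```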

Custom metrics extend monitoring beyond infrastructure to capture business-specific indicators. Using the Application Insights SDK or Azure Monitor REST API, you send metrics that matter to your organization—such as the number of AI model predictions completed per minute, queue processing latency, or user session duration. This becomes especially important when your operations team needs to correlate infrastructure performance with business outcomes and demonstrate how resource optimization improves application responsiveness.
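
As one illustration of the SDK route, the sketch below publishes a counter-style custom metric through the `azure-monitor-opentelemetry` distro, which sends data to an Application Insights resource. The connection string is a hypothetical placeholder, and the metric and attribute names are invented for the example.

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

# Hypothetical connection string -- copy the real one from your
# Application Insights resource in the Azure portal.
configure_azure_monitor(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>"
)

meter = metrics.get_meter("training-pipeline")

# A counter tracking completed predictions; the exporter aggregates
# and ships it to Azure Monitor on a fixed interval.
predictions_completed = meter.create_counter(
    "predictions_completed",
    unit="1",
    description="AI model predictions completed",
)

# Call this each time the model serves a prediction.
predictions_completed.add(1, attributes={"model": "demand-forecast"})
```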

:::image type="content" source="../media/custom-metrics-extend-monitoring-infrastructure.png" alt-text="Diagram showing how custom metrics extend monitoring beyond infrastructure to capture business-specific indicators.":::

## Visualizing metrics for operations teams

Collecting metrics delivers value only when your team can interpret trends and act on anomalies. Azure Monitor provides two primary visualization tools that serve different operational needs. Metrics Explorer offers ad-hoc analysis when you investigate a specific performance question or troubleshoot an active incident. You select a resource, choose one or more metrics, apply time range filters, and view trend charts that reveal patterns like CPU spikes during batch processing or gradual memory leaks over multiple days.

With Metrics Explorer, you answer immediate questions: Did CPU usage exceed 80% during last night's training run? How does network throughput compare between this week and last week? However, ad-hoc analysis doesn't provide continuous monitoring. Your operations team needs persistent visibility into critical metrics without repeatedly building the same charts. Azure dashboards solve this by pinning Metrics Explorer visualizations to a shared view that displays real-time data from multiple resources simultaneously.

:::image type="content" source="../media/temporary-analysis-continuous-monitor.png" alt-text="Diagram showing how Azure dashboards pin Metrics Explorer visualizations to a shared view.":::

A well-designed dashboard shows your team the health of compute, storage, and networking resources at a glance. You create separate panels for CPU utilization across all virtual machines, storage account transaction rates, and network gateway bandwidth consumption. This consolidated view enables your operations team to detect cross-resource patterns—such as high CPU correlating with increased storage I/O—and prioritize investigation efforts based on severity and business impact. For AI workloads that span multiple services, this holistic visibility reduces the time spent switching between resource pages and accelerates root cause analysis during incidents.

## Business impact of continuous metric monitoring

Proactive metric tracking transforms infrastructure management from reactive firefighting to preventive maintenance. When you visualize CPU trends over weeks instead of responding to individual spikes, you identify capacity planning opportunities before resources become bottlenecks. Your finance team benefits from this visibility through more accurate cost forecasting, because metric data reveals when to scale resources up or down based on actual usage patterns rather than guesswork.

For teams managing AI infrastructure, continuous monitoring delivers measurable operational improvements. Organizations that implement metric dashboards report 40-60% reductions in mean time to detection (MTTD) for performance issues, because anomalies become visible immediately rather than surfacing only after user complaints. This early detection prevents cascading failures—such as a memory leak in one VM causing downstream service timeouts—and reduces the business impact of infrastructure incidents by enabling faster, more targeted remediation efforts.

:::image type="content" source="../media/azure-monitor-collect-platform.png" alt-text="Diagram showing three metric sources, with Azure resources emitting platform metrics automatically.":::

*Azure Monitor collects platform, guest OS, and custom metrics, then delivers them to visualization and alerting tools*

## More resources

- [Azure Monitor Metrics overview](/azure/azure-monitor/essentials/data-platform-metrics) - Comprehensive guide to metric types, collection methods, and retention policies
- [Metrics Explorer documentation](/azure/azure-monitor/essentials/metrics-getting-started) - Step-by-step instructions for creating charts and analyzing metric data
- [Azure dashboards best practices](/azure/azure-portal/azure-portal-dashboards) - Design patterns for effective operational dashboards
