MicrosoftDocs
diff --git a/learn-pr/wwl-azure/manage-monitoring-ai-ready-infrastructure/includes/2-understand-azure-monitor-metrics-visualization.md b/learn-pr/wwl-azure/manage-monitoring-ai-ready-infrastructure/includes/2-understand-azure-monitor-metrics-visualization.md
Lines changed: 39 additions & 0 deletions
diff --git a/learn-pr/wwl-azure/manage-monitoring-ai-ready-infrastructure/includes/3-configure-alerts-alert-processing-rules.md b/learn-pr/wwl-azure/manage-monitoring-ai-ready-infrastructure/includes/3-configure-alerts-alert-processing-rules.md
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+You need visibility into infrastructure performance before problems escalate into outages. Consider a scenario where your virtual machine's CPU usage climbs steadily over several days. Without metric tracking, you discover the capacity issue only after the VM becomes unresponsive and training jobs fail. Azure Monitor metrics solve this problem by collecting performance data automatically from every Azure resource you deploy.
+
+## How Azure Monitor collects metrics
+
+Azure Monitor captures three types of metrics that provide different layers of visibility into your infrastructure. Platform metrics are collected automatically the moment you create a resource—no configuration required. When you deploy a virtual machine, Azure Monitor immediately begins tracking CPU percentage, network throughput, and disk operations per second. These metrics flow into Azure Monitor's time-series database every 60 seconds, giving you near real-time visibility into resource behavior.
+
+Platform metrics cover the fundamentals, but they don't reveal what's happening inside your virtual machine's operating system. For deeper visibility, you enable guest OS metrics by installing an agent on your VM: the Azure Monitor Agent with a data collection rule, or the older Azure Diagnostics extension, which is deprecated. The agent collects memory usage, process-level performance counters, and application-specific metrics that platform monitoring can't access. With this approach, you track not just whether your VM is running, but whether it has sufficient memory to handle current workloads and which processes consume the most resources.
+
+Custom metrics extend monitoring beyond infrastructure to capture business-specific indicators. Using the Application Insights SDK or Azure Monitor REST API, you send metrics that matter to your organization—such as the number of AI model predictions completed per minute, queue processing latency, or user session duration. This becomes especially important when your operations team needs to correlate infrastructure performance with business outcomes and demonstrate how resource optimization improves application responsiveness.
+
+:::image type="content" source="../media/custom-metrics-extend-monitoring-infrastructure.png" alt-text="Diagram showing how custom metrics extend monitoring beyond infrastructure to capture business-specific indicators.":::
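As a rough illustration of the REST API path, the sketch below builds the JSON body for one pre-aggregated custom metric data point. It assumes the custom metrics ingestion format (`time`, `baseData`, `dimNames`, `series`); the metric name `PredictionsCompleted`, the namespace `AIWorkload`, the `ModelName` dimension, and the helper `build_custom_metric` are hypothetical, and the snippet only constructs the payload rather than sending it.

```python
import json
from datetime import datetime, timezone


def build_custom_metric(metric, namespace, dim_name, dim_value, values, when):
    """Build the JSON-serializable body for one pre-aggregated data point."""
    if not values:
        raise ValueError("values must contain at least one sample")
    return {
        # Timestamp of the data point, expressed in UTC.
        "time": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": {
            "baseData": {
                "metric": metric,
                "namespace": namespace,
                "dimNames": [dim_name],
                "series": [
                    {
                        "dimValues": [dim_value],
                        # Aggregates computed locally over the raw samples.
                        "min": min(values),
                        "max": max(values),
                        "sum": sum(values),
                        "count": len(values),
                    }
                ],
            }
        },
    }


# Hypothetical example: predictions completed in three one-minute samples.
payload = build_custom_metric(
    metric="PredictionsCompleted",
    namespace="AIWorkload",
    dim_name="ModelName",
    dim_value="demo-model",
    values=[120, 135, 128],
    when=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Sending the payload would additionally require an access token and a POST to the Azure Monitor metrics ingestion endpoint for the resource, which this sketch leaves out.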
+
+## Visualizing metrics for operations teams
+
+Collecting metrics delivers value only when your team can interpret trends and act on anomalies. Azure Monitor provides two primary visualization tools that serve different operational needs. Metrics Explorer offers ad-hoc analysis when you investigate a specific performance question or troubleshoot an active incident. You select a resource, choose one or more metrics, apply time range filters, and view trend charts that reveal patterns like CPU spikes during batch processing or gradual memory leaks over multiple days.
+
+With Metrics Explorer, you answer immediate questions: Did CPU usage exceed 80% during last night's training run? How does network throughput compare between this week and last week? However, ad-hoc analysis doesn't provide continuous monitoring. Your operations team needs persistent visibility into critical metrics without repeatedly building the same charts. Azure dashboards solve this by pinning Metrics Explorer visualizations to a shared view that displays real-time data from multiple resources simultaneously.
+
+:::image type="content" source="../media/temporary-analysis-continuous-monitor.png" alt-text="Diagram showing how Azure dashboards pin Metrics Explorer visualizations to a shared view.":::
+
+A well-designed dashboard shows your team the health of compute, storage, and networking resources at a glance. You create separate panels for CPU utilization across all virtual machines, storage account transaction rates, and network gateway bandwidth consumption. This consolidated view enables your operations team to detect cross-resource patterns—such as high CPU correlating with increased storage I/O—and prioritize investigation efforts based on severity and business impact. For AI workloads that span multiple services, this holistic visibility reduces the time spent switching between resource pages and accelerates root cause analysis during incidents.
+
+## Business impact of continuous metric monitoring
+
+Proactive metric tracking transforms infrastructure management from reactive firefighting to preventive maintenance. When you visualize CPU trends over weeks instead of responding to individual spikes, you identify capacity planning opportunities before resources become bottlenecks. Your finance team benefits from this visibility through more accurate cost forecasting, because metric data reveals when to scale resources up or down based on actual usage patterns rather than guesswork.
+
+For teams managing AI infrastructure, continuous monitoring delivers measurable operational improvements. Organizations that implement metric dashboards report 40-60% reductions in mean time to detection (MTTD) for performance issues, because anomalies become visible immediately rather than surfacing only after user complaints. This early detection prevents cascading failures—such as a memory leak in one VM causing downstream service timeouts—and reduces the business impact of infrastructure incidents by enabling faster, more targeted remediation efforts.
+
+:::image type="content" source="../media/azure-monitor-collect-platform.png" alt-text="Diagram showing three metric sources, with Azure resources emitting platform metrics automatically.":::
+
+*Azure Monitor collects platform, guest OS, and custom metrics, then delivers them to visualization and alerting tools*
+
+
+## More resources
+
+- [Azure Monitor Metrics overview](/azure/azure-monitor/essentials/data-platform-metrics) - Comprehensive guide to metric types, collection methods, and retention policies
+- [Metrics Explorer documentation](/azure/azure-monitor/essentials/metrics-getting-started) - Step-by-step instructions for creating charts and analyzing metric data
+- [Azure dashboards best practices](/azure/azure-portal/azure-portal-dashboards) - Design patterns for effective operational dashboards
+
@@ -0,0 +1,39 @@
+Visualizing metrics helps your operations team spot trends, but manual monitoring doesn't scale when you manage dozens of resources across multiple regions. You need automated notifications that alert your team the moment performance thresholds are breached—before users notice degraded service. At the same time, you must avoid alert fatigue from notifications triggered during planned maintenance or outside business hours when no one is available to respond. Azure Monitor alert rules and alert processing rules work together to deliver timely notifications while suppressing irrelevant alerts.
+
+## Creating effective alert rules
+
+An alert rule defines the condition that triggers a notification and the action taken when that condition is met. You start by selecting the resource to monitor—such as a specific virtual machine or all VMs in a resource group. Next, you choose the metric to evaluate, such as CPU percentage or available memory. The critical decision comes when you set the threshold and evaluation window: should the alert fire when CPU exceeds 80% for 5 minutes, or wait for 15 minutes to avoid false positives from transient spikes?
+
+Your threshold choices directly impact operational effectiveness. Set thresholds too low, and your team receives alerts for normal traffic variations that require no action. This creates alert fatigue, where administrators begin ignoring notifications because most turn out to be false alarms. Set thresholds too high, and you miss early warning signs of capacity problems, discovering issues only after performance has already degraded enough to affect users. For AI workloads that process large datasets, you typically set CPU alerts at 80-85% sustained for 10-15 minutes, allowing brief spikes during data loading while catching genuine capacity constraints before they cause job failures.
+
+:::image type="content" source="../media/threshold-choice-direct-impact-operation.png" alt-text="Diagram showing how threshold choices directly impact operational effectiveness.":::
+
+Once you've defined the condition, you specify the action group that receives notifications when the alert fires. Action groups contain one or more notification methods: email addresses for your operations team, SMS phone numbers for on-call engineers, or webhooks that trigger automated remediation scripts. With action groups, you separate the condition logic from the notification routing, so you can reuse the same action group across multiple alert rules and update contact information in one place when team members change roles.
+
+## Managing alert delivery with processing rules
+
+Alert rules determine when notifications are generated, but alert processing rules control whether those notifications actually reach your team. Consider a common scenario: you schedule weekly maintenance every Sunday from 2:00 AM to 4:00 AM. During this window, you intentionally restart virtual machines and adjust configurations, triggering dozens of alerts for expected state changes. Without alert processing rules, your on-call engineer receives notifications for every maintenance action, creating noise that obscures genuine emergencies.
+
+Alert processing rules evaluate each fired alert and apply suppression or routing logic based on conditions you define. You create a processing rule that suppresses all alerts from your production resource group during the Sunday maintenance window. When an alert fires at 2:30 AM, Azure Monitor checks whether any processing rules match the alert's properties. The maintenance window rule matches, so Azure Monitor suppresses the notification instead of sending it to the action group. Your team's notification channels remain quiet during planned work, but any alerts fired outside the maintenance window still reach the on-call engineer immediately.
+
+:::image type="content" source="../media/alert-process-rules-evaluate-fired-alert.png" alt-text="Diagram showing how alert processing rules evaluate each fired alert and apply suppression.":::
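The matching step can be pictured as a simple scope and time-window check. The sketch below is an illustrative model only, not the Azure Monitor implementation or API: the function `is_suppressed`, the resource group name `prod-rg`, and the dates are hypothetical, and it assumes the alert timestamp is already expressed in the time zone used for the schedule.

```python
from datetime import datetime, time

MAINTENANCE_DAY = 6  # datetime.weekday(): Monday is 0, so Sunday is 6
MAINTENANCE_START = time(2, 0)
MAINTENANCE_END = time(4, 0)
SUPPRESSED_RESOURCE_GROUP = "prod-rg"  # hypothetical name


def is_suppressed(fired_at, resource_group):
    """Return True when the alert is in scope and falls inside the
    Sunday 2:00 AM to 4:00 AM maintenance window."""
    in_scope = resource_group == SUPPRESSED_RESOURCE_GROUP
    in_window = (
        fired_at.weekday() == MAINTENANCE_DAY
        and MAINTENANCE_START <= fired_at.time() < MAINTENANCE_END
    )
    return in_scope and in_window


print(is_suppressed(datetime(2024, 1, 7, 2, 30), "prod-rg"))  # Sunday 2:30 AM
print(is_suppressed(datetime(2024, 1, 7, 4, 30), "prod-rg"))  # Sunday 4:30 AM
print(is_suppressed(datetime(2024, 1, 8, 2, 30), "prod-rg"))  # Monday 2:30 AM
```

Only the first call falls inside both the scope and the window, so only that notification would be held back in this model.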
+
+Beyond suppression, alert processing rules enable dynamic routing based on time or resource properties. You create processing rules, each with its own schedule, that add a different action group depending on whether alerts fire during business hours or after hours. Weekday alerts go to the general operations email distribution list, while weekend and evening alerts route directly to the on-call engineer's mobile phone. This ensures the right person receives notifications at the right time, reducing response delays and preventing high-priority issues from sitting in a shared inbox until the next business day.
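A minimal sketch of that routing decision follows. It is a conceptual model rather than Azure Monitor configuration: the action group names `ops-email` and `oncall-sms` are hypothetical, and business hours are assumed to be 9:00 to 17:00, Monday through Friday, in the alert's local time.

```python
from datetime import datetime


def pick_action_group(fired_at, start_hour=9, end_hour=17):
    """Return the (hypothetical) action group for an alert's local timestamp."""
    is_weekday = fired_at.weekday() < 5  # Monday=0 .. Friday=4
    in_hours = start_hour <= fired_at.hour < end_hour
    return "ops-email" if is_weekday and in_hours else "oncall-sms"


print(pick_action_group(datetime(2024, 1, 9, 10, 15)))   # Tuesday morning
print(pick_action_group(datetime(2024, 1, 9, 22, 0)))    # Tuesday evening
print(pick_action_group(datetime(2024, 1, 13, 10, 15)))  # Saturday morning
```

Keeping the hours as parameters makes it easy to model a team whose on-call handoff happens at a different time.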
+
+## Balancing responsiveness with notification fatigue
+
+Effective alerting requires constant refinement based on operational experience. After you deploy initial alert rules, you monitor how often they fire and whether each notification leads to meaningful action. If your team receives 50 alerts per week but takes action on only 10, you've created alert fatigue that reduces overall responsiveness. You adjust thresholds upward for metrics that generate frequent false positives, lengthen evaluation windows to filter out transient spikes, or add alert processing rules to suppress notifications during known high-activity periods.
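One way to make this review concrete is to compute, per alert rule, the share of notifications that led to action. The rule names, the per-rule counts, and the 50% cut-off below are made up for illustration; only the overall 10-out-of-50 figure comes from the example above.

```python
# (alert rule name, alerts fired this week, alerts that led to action)
weekly_stats = [
    ("cpu-high", 30, 3),
    ("disk-latency", 15, 2),
    ("vm-unavailable", 5, 5),
]

NOISY_BELOW = 0.5  # hypothetical cut-off for "mostly noise"


def actionable_ratio(fired, acted_on):
    """Fraction of fired alerts that resulted in action."""
    return acted_on / fired if fired else 0.0


total_fired = sum(fired for _, fired, _ in weekly_stats)
total_acted = sum(acted for _, _, acted in weekly_stats)
overall = actionable_ratio(total_fired, total_acted)
print(f"overall: {total_acted}/{total_fired} = {overall:.0%}")

# Rules whose alerts rarely lead to action are candidates for tuning.
noisy = [
    name
    for name, fired, acted in weekly_stats
    if actionable_ratio(fired, acted) < NOISY_BELOW
]
print("rules to tune:", noisy)
```

Rules that land on the list are the ones to revisit first, by raising thresholds, lengthening evaluation windows, or adding a processing rule.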
+
+This iterative approach delivers measurable improvements in operational efficiency. Organizations that implement alert processing rules report 60-70% reductions in notification volume without missing critical incidents, because they eliminate alerts for expected state changes and route remaining notifications to the appropriate responders. For AI infrastructure teams, this means your on-call engineers spend less time triaging false alarms and more time addressing genuine capacity constraints, security events, or performance anomalies that require human intervention.
+
+:::image type="content" source="../media/alert-process-rules-evaluate-time-window.png" alt-text="Diagram showing how decisions flow starting with a metric exceeding its threshold.":::
+
+*Alert processing rules evaluate time windows and resource properties to determine whether to suppress or route notifications*
+
+
+## More resources
+
+- [Azure Monitor alerts overview](/azure/azure-monitor/alerts/alerts-overview) - Comprehensive guide to alert rule types, action groups, and notification methods
+- [Alert processing rules documentation](/azure/azure-monitor/alerts/alerts-processing-rules) - Detailed instructions for creating suppression and routing rules with time-based and property-based conditions
+- [Action groups configuration](/azure/azure-monitor/alerts/action-groups) - Best practices for configuring email, SMS, webhook, and ITSM integration notification methods
+