
Commit fe28f08

Merge pull request #53567 from MicrosoftDocs/NEW-analyze-telemetry-logs-metrics
New analyze telemetry logs metrics module - from release branch
2 parents 460e2b5 + f4a8ea0 commit fe28f08

19 files changed

Lines changed: 660 additions & 0 deletions
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.introduction
title: Introduction
metadata:
  title: Introduction
  description: Introduction
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 3
content: |
  [!include[](includes/1-introduction.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.write-basic-kql-queries
title: Write basic KQL queries
metadata:
  title: Write Basic KQL Queries
  description: Write basic KQL queries
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 12
content: |
  [!include[](includes/2-write-basic-kql-queries.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.explore-logs-errors-performance
title: Explore logs for errors and performance
metadata:
  title: Explore Logs for Errors and Performance
  description: Explore logs for errors and performance
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 12
content: |
  [!include[](includes/3-explore-logs-errors-performance.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.build-dashboards-app-telemetry
title: Build dashboards for app telemetry
metadata:
  title: Build Dashboards for App Telemetry
  description: Build dashboards for app telemetry
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 9
content: |
  [!include[](includes/4-build-dashboards-app-telemetry.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.create-workbooks-interactive-analysis
title: Create workbooks for interactive analysis
metadata:
  title: Create Workbooks for Interactive Analysis
  description: Create workbooks for interactive analysis
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 8
content: |
  [!include[](includes/5-create-workbooks-interactive-analysis.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.set-alerts-app-failures-anomalies
title: Set alerts for app failures and anomalies
metadata:
  title: Set Alerts for App Failures and Anomalies
  description: Set alerts for app failures and anomalies
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 11
content: |
  [!include[](includes/6-set-alerts-app-failures-anomalies.md)]
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.exercise-query-logs-kql
title: Exercise - Query logs with KQL
metadata:
  title: Exercise - Query Logs with KQL
  description: Exercise - Query logs with KQL
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 20
content: |
  [!include[](includes/7-exercise-query-logs-kql.md)]
Lines changed: 69 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.knowledge-check
title: Module assessment
metadata:
  title: Module Assessment
  description: Module assessment
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 5
content: "Choose the best response for each of the following questions."
quiz:
  questions:
  - content: "Your AI application uses Application Insights with sampling enabled. You need to query the exceptions table to get an accurate count of total exceptions over the last 24 hours. Which KQL aggregation approach produces accurate results when sampling is active?"
    choices:
    - content: "`sum(itemCount)`"
      isCorrect: true
      explanation: "When sampling is active, each sampled record's `itemCount` reflects how many actual events that record represents. Summing `itemCount` produces totals that reflect actual application behavior rather than just the count of sampled records."
    - content: "`count()`"
      isCorrect: false
      explanation: "With sampling active, `count()` returns only the count of sampled records, which underrepresents the true exception volume. The `count()` function doesn't account for the events that each sampled record represents."
    - content: "`dcount(type)`"
      isCorrect: false
      explanation: "The `dcount()` function counts distinct values of a column. Using `dcount(type)` returns the number of unique exception types, not the total number of exceptions that occurred."
  - content: "A developer needs to trace the full chain of events from a single user request as it flows through four services in a document processing pipeline. Which Application Insights column links all telemetry items from that single distributed request?"
    choices:
    - content: "`cloud_RoleName`"
      isCorrect: false
      explanation: "The `cloud_RoleName` column identifies which service generated the telemetry item. It doesn't link items from the same request across services."
    - content: "`operation_Id`"
      isCorrect: true
      explanation: "The `operation_Id` column is shared across all telemetry items that belong to a single distributed trace, making it possible to correlate requests, dependencies, exceptions, and traces from the same user request as it flows through multiple services."
    - content: "`customDimensions`"
      isCorrect: false
      explanation: "The `customDimensions` column contains a dynamic property bag of custom key-value pairs that developers add to enrich telemetry. It doesn't provide built-in correlation across services for distributed tracing."
  - content: "During an incident, a team member confirms elevated error rates on a dashboard and needs to investigate the root cause by dynamically filtering telemetry by service, time window, and error type. Which Azure Monitor tool is designed for this interactive investigation?"
    choices:
    - content: "Azure dashboards"
      isCorrect: false
      explanation: "Dashboards provide at-a-glance operational awareness with static tiles that refresh on a schedule. They aren't designed for interactive filtering and drill-down investigation during incidents."
    - content: "Azure Monitor metric alerts"
      isCorrect: false
      explanation: "Metric alerts detect threshold violations and trigger notifications. They're a detection mechanism, not an investigation tool for exploring telemetry interactively."
    - content: "Azure Monitor Workbooks"
      isCorrect: true
      explanation: "Workbooks provide an interactive canvas with parameters, conditional visibility, and linked grids that let users adjust filters and drill into specific data points. This interactivity makes workbooks the right choice for troubleshooting and root cause analysis."
  - content: "A developer creates a log search alert with a KQL query that counts failed requests per service and returns rows only for services exceeding ten failures. The threshold is set to greater than zero table rows. What behavior does this configuration produce?"
    choices:
    - content: "The alert fires whenever any service has more than 10 failures in the evaluation window."
      isCorrect: true
      explanation: "The query returns rows only for services that exceed 10 failures. Setting the threshold to greater than zero table rows means the alert fires if the query returns any rows, indicating at least one service crossed the failure threshold in the evaluation window."
    - content: "The alert fires once for each individual failed request detected in the evaluation window."
      isCorrect: false
      explanation: "The query aggregates failures by service using `summarize`, so individual failed requests aren't counted as separate alert triggers. The alert evaluates whether the query returns any rows, not the number of individual failures."
    - content: "The alert fires only when all monitored services have more than 10 failures simultaneously."
      isCorrect: false
      explanation: "The alert fires when the query returns any rows. Even a single service exceeding 10 failures produces a returned row, which triggers the alert. There's no requirement for all services to exceed the threshold."
  - content: "A developer monitors an AI pipeline where average response times remain within acceptable limits, but users report occasional slow responses. Which KQL function reveals the response time experienced by the slowest five percent of requests?"
    choices:
    - content: "`avg()`"
      isCorrect: false
      explanation: "The `avg()` function calculates the mean response time across all requests. A few slow requests have minimal impact on the average, so `avg()` masks the outlier behavior that users experience."
    - content: "`percentile(duration, 95)`"
      isCorrect: true
      explanation: "The `percentile()` function with a 95th percentile argument returns the response time below which 95 percent of requests fall. This reveals the experience of the slowest five percent of requests, capturing the outlier latency that average values hide."
    - content: "`max(duration)`"
      isCorrect: false
      explanation: "The `max()` function returns only the single slowest response time. While it shows the absolute worst case, it doesn't characterize the experience of a percentile of requests and can be heavily influenced by a single extreme outlier."
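The aggregation and correlation patterns this assessment covers can be sketched in KQL. The snippets below are illustrative, not part of the module's exercises; they assume the standard Application Insights log table schema (`requests`, `dependencies`, `exceptions`, `traces`), and the time windows and operation ID are placeholders.

```kusto
// Accurate exception count under sampling: sum itemCount rather than counting rows.
exceptions
| where timestamp > ago(24h)
| summarize TotalExceptions = sum(itemCount)

// Correlate all telemetry from one distributed request via operation_Id.
union requests, dependencies, exceptions, traces
| where operation_Id == "<operation-id>"   // placeholder value
| project timestamp, itemType, cloud_RoleName, operation_Id
| order by timestamp asc

// Tail latency: compare the average with the 95th percentile.
requests
| where timestamp > ago(1h)
| summarize AvgDuration = avg(duration), P95Duration = percentile(duration, 95)
```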
Lines changed: 13 additions & 0 deletions

### YamlMime:ModuleUnit
uid: learn.wwl.analyze-telemetry-logs-metrics.summary
title: Summary
metadata:
  title: Summary
  description: Summary
  ms.date: 02/19/2026
  author: jeffkoms
  ms.author: jeffko
  ms.topic: unit
durationInMinutes: 2
content: |
  [!include[](includes/9-summary.md)]
Lines changed: 14 additions & 0 deletions

AI applications in production generate large volumes of telemetry data across distributed services, but raw data alone doesn't provide actionable insight. This module guides you through analyzing application telemetry with Azure Monitor logs and metrics to detect failures, identify performance trends, and maintain operational visibility for AI solutions on Azure.

Imagine you're a developer building a document processing pipeline for an enterprise content moderation AI service. The system consists of four services: an ingestion API that receives document uploads, a classification service that categorizes content using a trained model, an extraction service that identifies key entities, and a moderation service that flags policy violations. After deploying to production, the team notices that some documents take over 30 seconds to process, but there's no way to determine which service causes the delay. Occasionally, the moderation service returns errors for specific document types, and the team only discovers these failures when users report them. Your client expects a dashboard that shows real-time pipeline health, with alerts that notify the on-call team within five minutes of a failure spike. The client also needs the ability to investigate incidents interactively, drilling into specific time windows and filtering by document type or service. Azure Monitor provides the query language, visualization tools, and alerting capabilities to meet all of these requirements.

After completing this module, you'll be able to:

- Write KQL queries to retrieve and analyze application telemetry from Application Insights.
- Explore log data to identify error patterns, performance bottlenecks, and trends in application behavior.
- Build Azure dashboards that display key telemetry metrics and log query results for operational monitoring.
- Create Azure Monitor Workbooks for interactive, parameter-driven telemetry analysis.
- Configure alert rules that detect application failures, performance degradation, and anomalies.

> [!NOTE]
> All code examples in this module use KQL queries against Application Insights log tables. The Azure Monitor query experience is updated regularly; see the [Azure Monitor logs documentation](/azure/azure-monitor/logs/log-query-overview) for the most up-to-date information.
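The alerting objective in this introduction might look like the following sketch of a log search alert query, matching the scenario's per-service failure threshold. The five-minute window and the threshold of 10 failures are illustrative assumptions; `cloud_RoleName`, `success`, and `itemCount` are standard columns in the Application Insights `requests` table.

```kusto
// Sketch of a log search alert query: return a row only for services with
// more than 10 failed requests in the last 5 minutes. With the alert rule's
// threshold set to "greater than 0 table rows", any returned row fires the alert.
requests
| where timestamp > ago(5m)
| where success == false
| summarize FailedCount = sum(itemCount) by cloud_RoleName
| where FailedCount > 10
```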
