Commit 473bac3 (commit message: update)

1 parent dccab74 commit 473bac3

5 files changed: 12 additions & 17 deletions

learn-pr/wwl-data-ai/evaluate-optimize-agents/4-evaluate-agent-responses.yml

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.evaluate-optimize-agents.evaluate-agent-responses
 metadata:
-  title: Apply evaluation rubrics for consistent scoring
+  title: Use evaluation rubrics for consistent scoring
   description: Learn how to create detailed evaluation rubrics, train human evaluators through calibration exercises, and maintain inter-rater reliability for consistent agent quality assessment.
   ms.date: 02/17/2026
   author: madiepev

learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/2-design-evaluation-experiments.md

Lines changed: 1 addition & 1 deletion

@@ -83,6 +83,6 @@ Success criteria establish what constitutes acceptable performance before you ru
 
 Business requirements influence these thresholds: customer-facing agents handling trip planning need higher quality standards and faster response times than internal tools.
 
-Your comparison methodology structures how you execute experiments and analyze results. You run each variant against the same test prompts, recording quality scores, token usage, and response times for every request. Organizing results in a comparison table reveals patterns across variants—perhaps GPT-4 mini performs well on straightforward gear queries but struggles with complex multi-day trip planning requiring detailed equipment recommendations, or streaming significantly improves perceived responsiveness without increasing costs. Documenting your experiment design ensures reproducibility: another team member can repeat your experiment and verify your findings. This documentation captures test prompts, scoring criteria, variant configurations, and the rationale behind each design decision.
+Your comparison methodology runs each variant against the same test prompts, recording quality scores, token usage, and response times. Organizing results reveals patterns. For example, GPT-4 mini might excel at straightforward queries but struggle with complex planning. Document your experiment design to ensure reproducibility: test prompts, scoring criteria, variant configurations, and rationale.
 
 With comprehensive experiment design complete, you're ready to implement these experiments using version control workflows that enable safe testing and team collaboration.
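The comparison methodology this hunk condenses (running every variant against the same test prompts and recording quality, tokens, and latency) can be sketched in a few lines of Python. The variant names, prompts, and numbers below are made-up placeholders for illustration, not results from the module:

```python
# Sketch of the comparison methodology: tabulate per-variant averages for
# quality, token usage, and response time. All data here is hypothetical.
from statistics import mean

# (variant, test prompt) -> (quality score 1-5, tokens used, response seconds)
results = {
    ("gpt-4", "March Scotland gear"): (5, 812, 2.4),
    ("gpt-4", "Multi-day trip plan"): (4, 1430, 3.9),
    ("gpt-4-mini", "March Scotland gear"): (4, 510, 1.1),
    ("gpt-4-mini", "Multi-day trip plan"): (2, 690, 1.3),
}

def summarize(results):
    """Average quality, tokens, and latency for each variant."""
    per_variant = {}
    for (variant, _prompt), metrics in results.items():
        per_variant.setdefault(variant, []).append(metrics)
    return {
        variant: {
            "avg_quality": round(mean(m[0] for m in rows), 2),
            "avg_tokens": round(mean(m[1] for m in rows)),
            "avg_seconds": round(mean(m[2] for m in rows), 2),
        }
        for variant, rows in per_variant.items()
    }

summary = summarize(results)
```

With the placeholder data, the summary would show the pattern the paragraph describes: the smaller model is cheaper and faster on average but drops quality on the complex planning prompt.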

learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/3-git-based-experimentation-workflow.md

Lines changed: 6 additions & 9 deletions

@@ -1,14 +1,11 @@
-# Apply git-based workflows to optimization experiments
 
 Optimization experiments require systematic organization to track which changes were tested and what results they produced. Git-based workflows enable you to test agent variants safely, document evaluation results, and compare experiments to identify which configuration performs best.
 
-| Step | Action |
-| ------ | -------- |
-| 1. **Create branch** | Create experiment branch for each variant |
-| 2. **Add test prompts** | Store test prompts in experiment folder |
-| 3. **Run evaluation script** | Deploy agent version, run test prompts, capture responses |
-| 4. **Score responses** | Manually evaluate responses for quality metrics |
-| 5. **Compare and decide** | Review results across branches, merge successful experiments |
+1. **Create branch**: Create experiment branch for each variant
+2. **Add test prompts**: Store test prompts in experiment folder
+3. **Run evaluation script**: Deploy agent version, run test prompts, capture responses
+4. **Score responses**: Manually evaluate responses for quality metrics
+5. **Compare and decide**: Review results across branches, merge successful experiments
 
 ## Create experiment branches
 

@@ -88,7 +85,7 @@ After completing evaluations across multiple experiment branches, use your CSV d
 
 For the Adventure Works experiments, you might document your comparison:
 
-| Experiment Branch | Key Observations | Meets Criteria? |
+| Experiment branch | Key observations | Meets criteria? |
 | ------------------- | ------------------ | ------------------ |
 | main (baseline) | Solid responses, some verbosity | Yes (4.2 avg) |
 | prompt-v2-concise | Maintains quality, more focused | Yes (4.4 avg) |
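The "Compare and decide" step of the workflow in this file can be sketched against its example table. The 4.2 and 4.4 averages come from the table above; the 4.0 quality threshold is an assumed success criterion for illustration:

```python
# Sketch of comparing experiment branches against a success criterion.
# The 4.0 threshold is an assumption; the averages match the example table.
QUALITY_THRESHOLD = 4.0

branch_scores = {
    "main": 4.2,               # baseline
    "prompt-v2-concise": 4.4,  # candidate variant
}

def meets_criteria(avg_score, threshold=QUALITY_THRESHOLD):
    """A branch passes if its average quality score reaches the threshold."""
    return avg_score >= threshold

decisions = {branch: meets_criteria(score) for branch, score in branch_scores.items()}
best = max(branch_scores, key=branch_scores.get)  # highest-scoring branch
```

Under these assumptions both branches pass, and `prompt-v2-concise` would be the one to merge.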

learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/4-evaluate-agent-responses.md

Lines changed: 3 additions & 4 deletions

@@ -1,4 +1,3 @@
-# Apply evaluation rubrics for consistent scoring
 
 Manual evaluation provides essential quality insights that automated metrics can't capture, but multiple human evaluators often score the same response differently without clear guidance. When three Adventure Works team members evaluate the same Trail Guide Agent response, one rates it 5 for Intent Resolution while another rates it 3—not because the response quality changed, but because they interpret the scoring criteria differently. Inconsistent evaluation undermines optimization decisions, making it impossible to determine whether quality improved or human evaluators judged responses more leniently. Here, you learn how to create evaluation consistency through rubrics, rater training with calibration examples, and inter-rater reliability testing.
 

@@ -19,7 +18,7 @@ For the Adventure Works Trail Guide Agent, create a rubric for each evaluation c
 
 **Intent Resolution Rubric (1-5 scale):**
 
-| Score | Definition | Example Response |
+| Score | Definition | Example response |
 | ------- | ------------ | -------------------- |
 | 5 | Fully addresses user's need with complete information | User asks about March Scotland hiking gear; agent recommends waterproof layers, specifies materials, suggests Adventure Works products |
 | 4 | Addresses core need with minor gaps | User asks about Scotland gear; agent recommends waterproof items but doesn't specify materials or products |

@@ -35,15 +34,15 @@ Human evaluator training ensures all team members interpret rubrics consistently
 
 Select five to eight agent responses that span your score range—include clear examples of scores 5, 3, and 1, plus ambiguous responses that fall between levels. For Adventure Works, you might include responses to the "Scottish Highlands March gear" test prompt that demonstrate different quality levels. Present each response to your evaluation team without revealing the intended score. Format the calibration set as simple text blocks:
 
-### Low-performing response
+**Low-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
 
 Agent Response: The Scottish Highlands feature beautiful terrain with mountains, lochs, and glens. Popular trails include the West Highland Way and routes around Ben Nevis. March is considered shoulder season with fewer tourists than summer months. The landscape offers stunning views and diverse wildlife including red deer and golden eagles.
 ```
 
-### High-performing response
+**High-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
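The inter-rater reliability testing this file introduces can be illustrated with a minimal exact-agreement check. The evaluator names and rubric scores below are invented for the sketch:

```python
# Sketch of a simple inter-rater reliability check for 1-5 rubric scores.
# Exact pairwise agreement is the most basic measure; all data is made up.
from itertools import combinations

# Each evaluator's Intent Resolution scores for the same five responses.
ratings = {
    "evaluator_a": [5, 3, 4, 2, 5],
    "evaluator_b": [5, 3, 3, 2, 5],
    "evaluator_c": [4, 3, 4, 2, 5],
}

def pairwise_agreement(ratings):
    """Fraction of responses on which each pair of evaluators agrees exactly."""
    rates = {}
    for (a, scores_a), (b, scores_b) in combinations(ratings.items(), 2):
        matches = sum(x == y for x, y in zip(scores_a, scores_b))
        rates[(a, b)] = matches / len(scores_a)
    return rates

agreement = pairwise_agreement(ratings)
```

Exact agreement is the simplest measure; chance-corrected statistics such as Cohen's kappa are a common next step once a team has calibration data.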

learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/5-exercise.md

Lines changed: 1 addition & 2 deletions

@@ -1,4 +1,3 @@
-# Exercise - Evaluate and compare AI agent versions
 
 In this exercise, you evaluate two prompt versions of the Trail Guide Agent and create a Version Comparison Decision Document that justifies which version to promote to production based on quality scores and cost analysis.
 

@@ -13,4 +12,4 @@ Throughout this exercise, you:
 
 Launch the exercise and follow the instructions.
 
-[![Button to launch exercise.](../media/launch-exercise.png)](PLACEHOLDER - Create link at: https://akalinkmanager.trafficmanager.net/am/redirection/home?options=host:go.microsoft.com)
+[![Button to launch exercise.](../media/launch-exercise.png)](https://go.microsoft.com/fwlink/?linkid=2352696)
