```diff
-title: Apply evaluation rubrics for consistent scoring
+title: Use evaluation rubrics for consistent scoring
 description: Learn how to create detailed evaluation rubrics, train human evaluators through calibration exercises, and maintain inter-rater reliability for consistent agent quality assessment.
```
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/2-design-evaluation-experiments.md`** (+1, -1)
```diff
@@ -83,6 +83,6 @@ Success criteria establish what constitutes acceptable performance before you ru
 
 Business requirements influence these thresholds: customer-facing agents handling trip planning need higher quality standards and faster response times than internal tools.
 
-Your comparison methodology structures how you execute experiments and analyze results. You run each variant against the same test prompts, recording quality scores, token usage, and response times for every request. Organizing results in a comparison table reveals patterns across variants—perhaps GPT-4 mini performs well on straightforward gear queries but struggles with complex multi-day trip planning requiring detailed equipment recommendations, or streaming significantly improves perceived responsiveness without increasing costs. Documenting your experiment design ensures reproducibility: another team member can repeat your experiment and verify your findings. This documentation captures test prompts, scoring criteria, variant configurations, and the rationale behind each design decision.
+Your comparison methodology runs each variant against the same test prompts, recording quality scores, token usage, and response times. Organizing results reveals patterns. For example, GPT-4 mini might excel at straightforward queries but struggle with complex planning. Document your experiment design to ensure reproducibility: test prompts, scoring criteria, variant configurations, and rationale.
 
 With comprehensive experiment design complete, you're ready to implement these experiments using version control workflows that enable safe testing and team collaboration.
```
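The revised paragraph above describes recording quality scores, token usage, and response times for each variant and organizing them into a comparison table. As a minimal sketch of that aggregation step in Python, with made-up variant names and numbers standing in for real captured results:

```python
# Aggregate per-variant evaluation results into a comparison table.
# Variant names, scores, token counts, and latencies are illustrative.
results = {
    "gpt-4": [
        {"quality": 4.6, "tokens": 812, "latency_s": 2.4},
        {"quality": 4.8, "tokens": 954, "latency_s": 2.9},
    ],
    "gpt-4-mini": [
        {"quality": 4.1, "tokens": 790, "latency_s": 1.1},
        {"quality": 3.2, "tokens": 910, "latency_s": 1.3},
    ],
}

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

print(f"{'Variant':<12}{'Quality':>9}{'Tokens':>8}{'Latency':>9}")
for variant, runs in results.items():
    print(
        f"{variant:<12}"
        f"{mean([r['quality'] for r in runs]):>9.2f}"
        f"{mean([r['tokens'] for r in runs]):>8.0f}"
        f"{mean([r['latency_s'] for r in runs]):>8.1f}s"
    )
```

A table like this makes the quality-versus-cost trade-off between variants visible at a glance, which is what the promotion decision rests on.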
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/3-git-based-experimentation-workflow.md`** (+6, -9)
```diff
@@ -1,14 +1,11 @@
-# Apply git-based workflows to optimization experiments
 
 Optimization experiments require systematic organization to track which changes were tested and what results they produced. Git-based workflows enable you to test agent variants safely, document evaluation results, and compare experiments to identify which configuration performs best.
 
-| Step | Action |
-| ------ | -------- |
-| 1. **Create branch**| Create experiment branch for each variant |
-| 2. **Add test prompts**| Store test prompts in experiment folder |
-| 3. **Run evaluation script**| Deploy agent version, run test prompts, capture responses |
```
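Step 3 of the table is terse, so a concrete shape may help. A minimal sketch of such an evaluation script, where `call_agent`, the folder layout, and the file names are hypothetical stand-ins rather than anything defined by the module:

```python
# Sketch of the "run evaluation script" step: read test prompts from the
# experiment folder, invoke the deployed agent variant, and write captured
# responses back into the folder so they can be committed to the branch.
import json
from pathlib import Path

def call_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with your deployed agent's client call.
    return f"(placeholder response for: {prompt})"

experiment_dir = Path("experiments/prompt-v2")  # hypothetical layout
prompts = [
    line.strip()
    for line in (experiment_dir / "test-prompts.txt").read_text().splitlines()
    if line.strip()
]

records = [{"prompt": p, "response": call_agent(p)} for p in prompts]
(experiment_dir / "results.json").write_text(json.dumps(records, indent=2))
print(f"Captured {len(records)} responses in {experiment_dir / 'results.json'}")
```

Keeping prompts, script, and captured responses in the same experiment folder means a single commit on the experiment branch records exactly what was tested and what came back.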
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/4-evaluate-agent-responses.md`** (+3, -4)
```diff
@@ -1,4 +1,3 @@
-# Apply evaluation rubrics for consistent scoring
 
 Manual evaluation provides essential quality insights that automated metrics can't capture, but multiple human evaluators often score the same response differently without clear guidance. When three Adventure Works team members evaluate the same Trail Guide Agent response, one rates it 5 for Intent Resolution while another rates it 3—not because the response quality changed, but because they interpret the scoring criteria differently. Inconsistent evaluation undermines optimization decisions, making it impossible to determine whether quality improved or human evaluators judged responses more leniently. Here, you learn how to create evaluation consistency through rubrics, rater training with calibration examples, and inter-rater reliability testing.
```
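The paragraph above names inter-rater reliability testing as one of the three consistency mechanisms. A simple, common measure is pairwise exact agreement between evaluators; a minimal sketch, assuming each evaluator scored the same responses in the same order (evaluator names and scores below are made up):

```python
# Compute pairwise exact-agreement rates between evaluators.
# Scores are illustrative 1-5 rubric ratings for the same five responses.
from itertools import combinations

scores = {
    "evaluator_a": [5, 3, 4, 1, 3],
    "evaluator_b": [5, 3, 3, 1, 3],
    "evaluator_c": [4, 3, 4, 2, 3],
}

for (name_1, s1), (name_2, s2) in combinations(scores.items(), 2):
    agreement = sum(a == b for a, b in zip(s1, s2)) / len(s1)
    print(f"{name_1} vs {name_2}: {agreement:.0%} exact agreement")
```

Low agreement on specific responses points to the rubric levels that need clearer definitions or more calibration examples. Chance-corrected statistics such as Cohen's kappa are a stronger choice when scores cluster around one level.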
```diff
@@ -19,7 +18,7 @@ For the Adventure Works Trail Guide Agent, create a rubric for each evaluation c
 
 **Intent Resolution Rubric (1-5 scale):**
 
-| Score | Definition | Example Response|
+| Score | Definition | Example response|
 | ------- | ------------ | -------------------- |
 | 5 | Fully addresses user's need with complete information | User asks about March Scotland hiking gear; agent recommends waterproof layers, specifies materials, suggests Adventure Works products |
 | 4 | Addresses core need with minor gaps | User asks about Scotland gear; agent recommends waterproof items but doesn't specify materials or products |
```
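If evaluators score through a form or script rather than on paper, the rubric can travel with the tooling as data. A minimal sketch using only the two levels visible in the table above; the remaining definitions are elided here, not invented:

```python
# Encode the visible Intent Resolution rubric levels as data so scoring
# tooling can show the definition next to each score choice.
INTENT_RESOLUTION_RUBRIC = {
    5: "Fully addresses user's need with complete information",
    4: "Addresses core need with minor gaps",
    # Levels 3-1 elided; copy them from the full rubric.
}

def describe_score(score: int) -> str:
    if score not in range(1, 6):
        raise ValueError("Intent Resolution uses a 1-5 scale")
    return INTENT_RESOLUTION_RUBRIC.get(score, "(definition not loaded)")

print(f"Score 4: {describe_score(4)}")
```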
````diff
@@ -35,15 +34,15 @@ Human evaluator training ensures all team members interpret rubrics consistentl
 
 Select five to eight agent responses that span your score range—include clear examples of scores 5, 3, and 1, plus ambiguous responses that fall between levels. For Adventure Works, you might include responses to the "Scottish Highlands March gear" test prompt that demonstrate different quality levels. Present each response to your evaluation team without revealing the intended score. Format the calibration set as simple text blocks:
 
-### Low-performing response
+**Low-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
 
 Agent Response: The Scottish Highlands feature beautiful terrain with mountains, lochs, and glens. Popular trails include the West Highland Way and routes around Ben Nevis. March is considered shoulder season with fewer tourists than summer months. The landscape offers stunning views and diverse wildlife including red deer and golden eagles.
 ```
 
-### High-performing response
+**High-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
````
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/5-exercise.md`** (+1, -2)
```diff
@@ -1,4 +1,3 @@
-# Exercise - Evaluate and compare AI agent versions
 
 In this exercise, you evaluate two prompt versions of the Trail Guide Agent and create a Version Comparison Decision Document that justifies which version to promote to production based on quality scores and cost analysis.
 
@@ -13,4 +12,4 @@ Throughout this exercise, you:
 
 Launch the exercise and follow the instructions.
 
-[](PLACEHOLDER - Create link at: https://akalinkmanager.trafficmanager.net/am/redirection/home?options=host:go.microsoft.com)
+[](https://go.microsoft.com/fwlink/?linkid=2352696)
```