```diff
-title: Apply evaluation rubrics for consistent scoring
+title: Use evaluation rubrics for consistent scoring
 description: Learn how to create detailed evaluation rubrics, train human evaluators through calibration exercises, and maintain inter-rater reliability for consistent agent quality assessment.
```
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/2-design-evaluation-experiments.md`** (+1, -1)
```diff
@@ -83,6 +83,6 @@ Success criteria establish what constitutes acceptable performance before you ru
 
 Business requirements influence these thresholds: customer-facing agents handling trip planning need higher quality standards and faster response times than internal tools.
 
-Your comparison methodology structures how you execute experiments and analyze results. You run each variant against the same test prompts, recording quality scores, token usage, and response times for every request. Organizing results in a comparison table reveals patterns across variants—perhaps GPT-4 mini performs well on straightforward gear queries but struggles with complex multi-day trip planning requiring detailed equipment recommendations, or streaming significantly improves perceived responsiveness without increasing costs. Documenting your experiment design ensures reproducibility: another team member can repeat your experiment and verify your findings. This documentation captures test prompts, scoring criteria, variant configurations, and the rationale behind each design decision.
+Your comparison methodology runs each variant against the same test prompts, recording quality scores, token usage, and response times. Organizing results reveals patterns. For example, GPT-4 mini might excel at straightforward queries but struggle with complex planning. Document your experiment design to ensure reproducibility: test prompts, scoring criteria, variant configurations, and rationale.
 
 With comprehensive experiment design complete, you're ready to implement these experiments using version control workflows that enable safe testing and team collaboration.
```
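The revised paragraph above describes recording quality scores, token usage, and response times for each variant and organizing them into a comparison table. As a minimal sketch of that aggregation step in Python, with made-up variant names and numbers standing in for real captured results:

```python
# Aggregate per-variant evaluation results into a comparison table.
# Variant names, scores, token counts, and latencies are illustrative.
results = {
    "gpt-4": [
        {"quality": 4.6, "tokens": 812, "latency_s": 2.4},
        {"quality": 4.8, "tokens": 954, "latency_s": 2.9},
    ],
    "gpt-4-mini": [
        {"quality": 4.1, "tokens": 790, "latency_s": 1.1},
        {"quality": 3.2, "tokens": 910, "latency_s": 1.3},
    ],
}

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

print(f"{'Variant':<12}{'Quality':>9}{'Tokens':>8}{'Latency':>9}")
for variant, runs in results.items():
    print(
        f"{variant:<12}"
        f"{mean([r['quality'] for r in runs]):>9.2f}"
        f"{mean([r['tokens'] for r in runs]):>8.0f}"
        f"{mean([r['latency_s'] for r in runs]):>8.1f}s"
    )
```

A table like this makes the quality-versus-cost trade-off between variants visible at a glance, which is what the promotion decision rests on.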
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/3-git-based-experimentation-workflow.md`** (+6, -9)
```diff
@@ -1,14 +1,11 @@
-# Apply git-based workflows to optimization experiments
 
 Optimization experiments require systematic organization to track which changes were tested and what results they produced. Git-based workflows enable you to test agent variants safely, document evaluation results, and compare experiments to identify which configuration performs best.
 
-| Step | Action |
-| ------ | -------- |
-| 1. **Create branch**| Create experiment branch for each variant |
-| 2. **Add test prompts**| Store test prompts in experiment folder |
-| 3. **Run evaluation script**| Deploy agent version, run test prompts, capture responses |
```
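Step 3 of the table is terse, so a concrete shape may help. A minimal sketch of such an evaluation script, where `call_agent`, the folder layout, and the file names are hypothetical stand-ins rather than anything defined by the module:

```python
# Sketch of the "run evaluation script" step: read test prompts from the
# experiment folder, invoke the deployed agent variant, and write captured
# responses back into the folder so they can be committed to the branch.
import json
from pathlib import Path

def call_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with your deployed agent's client call.
    return f"(placeholder response for: {prompt})"

experiment_dir = Path("experiments/prompt-v2")  # hypothetical layout
prompts = [
    line.strip()
    for line in (experiment_dir / "test-prompts.txt").read_text().splitlines()
    if line.strip()
]

records = [{"prompt": p, "response": call_agent(p)} for p in prompts]
(experiment_dir / "results.json").write_text(json.dumps(records, indent=2))
print(f"Captured {len(records)} responses in {experiment_dir / 'results.json'}")
```

Keeping prompts, script, and captured responses in the same experiment folder means a single commit on the experiment branch records exactly what was tested and what came back.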
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/4-evaluate-agent-responses.md`** (+3, -4)
```diff
@@ -1,4 +1,3 @@
-# Apply evaluation rubrics for consistent scoring
 
 Manual evaluation provides essential quality insights that automated metrics can't capture, but multiple human evaluators often score the same response differently without clear guidance. When three Adventure Works team members evaluate the same Trail Guide Agent response, one rates it 5 for Intent Resolution while another rates it 3—not because the response quality changed, but because they interpret the scoring criteria differently. Inconsistent evaluation undermines optimization decisions, making it impossible to determine whether quality improved or human evaluators judged responses more leniently. Here, you learn how to create evaluation consistency through rubrics, rater training with calibration examples, and inter-rater reliability testing.
```
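The paragraph above names inter-rater reliability testing as one of the three consistency mechanisms. A simple, common measure is pairwise exact agreement between evaluators; a minimal sketch, assuming each evaluator scored the same responses in the same order (evaluator names and scores below are made up):

```python
# Compute pairwise exact-agreement rates between evaluators.
# Scores are illustrative 1-5 rubric ratings for the same five responses.
from itertools import combinations

scores = {
    "evaluator_a": [5, 3, 4, 1, 3],
    "evaluator_b": [5, 3, 3, 1, 3],
    "evaluator_c": [4, 3, 4, 2, 3],
}

for (name_1, s1), (name_2, s2) in combinations(scores.items(), 2):
    agreement = sum(a == b for a, b in zip(s1, s2)) / len(s1)
    print(f"{name_1} vs {name_2}: {agreement:.0%} exact agreement")
```

Low agreement on specific responses points to the rubric levels that need clearer definitions or more calibration examples. Chance-corrected statistics such as Cohen's kappa are a stronger choice when scores cluster around one level.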
```diff
@@ -19,7 +18,7 @@ For the Adventure Works Trail Guide Agent, create a rubric for each evaluation c
 
 **Intent Resolution Rubric (1-5 scale):**
 
-| Score | Definition | Example Response|
+| Score | Definition | Example response|
 | ------- | ------------ | -------------------- |
 | 5 | Fully addresses user's need with complete information | User asks about March Scotland hiking gear; agent recommends waterproof layers, specifies materials, suggests Adventure Works products |
 | 4 | Addresses core need with minor gaps | User asks about Scotland gear; agent recommends waterproof items but doesn't specify materials or products |
```
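If evaluators score through a form or script rather than on paper, the rubric can travel with the tooling as data. A minimal sketch using only the two levels visible in the table above; the remaining definitions are elided here, not invented:

```python
# Encode the visible Intent Resolution rubric levels as data so scoring
# tooling can show the definition next to each score choice.
INTENT_RESOLUTION_RUBRIC = {
    5: "Fully addresses user's need with complete information",
    4: "Addresses core need with minor gaps",
    # Levels 3-1 elided; copy them from the full rubric.
}

def describe_score(score: int) -> str:
    if score not in range(1, 6):
        raise ValueError("Intent Resolution uses a 1-5 scale")
    return INTENT_RESOLUTION_RUBRIC.get(score, "(definition not loaded)")

print(f"Score 4: {describe_score(4)}")
```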
````diff
@@ -35,15 +34,15 @@ Human evaluator training ensures all team members interpret rubrics consistentl
 
 Select five to eight agent responses that span your score range—include clear examples of scores 5, 3, and 1, plus ambiguous responses that fall between levels. For Adventure Works, you might include responses to the "Scottish Highlands March gear" test prompt that demonstrate different quality levels. Present each response to your evaluation team without revealing the intended score. Format the calibration set as simple text blocks:
 
-### Low-performing response
+**Low-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
 
 Agent Response: The Scottish Highlands feature beautiful terrain with mountains, lochs, and glens. Popular trails include the West Highland Way and routes around Ben Nevis. March is considered shoulder season with fewer tourists than summer months. The landscape offers stunning views and diverse wildlife including red deer and golden eagles.
 ```
 
-### High-performing response
+**High-performing response**
 
 ```text
 Test Prompt: What gear do I need for hiking in the Scottish Highlands in March?
````
**`learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/5-exercise.md`** (+1, -2)
```diff
@@ -1,4 +1,3 @@
-# Exercise - Evaluate and compare AI agent versions
 
 In this exercise, you evaluate two prompt versions of the Trail Guide Agent and create a Version Comparison Decision Document that justifies which version to promote to production based on quality scores and cost analysis.
 
@@ -13,4 +12,4 @@ Throughout this exercise, you:
 
 Launch the exercise and follow the instructions.
 
-[](PLACEHOLDER - Create link at: https://akalinkmanager.trafficmanager.net/am/redirection/home?options=host:go.microsoft.com)
+[](https://go.microsoft.com/fwlink/?linkid=2352696)
```