-title: Apply git-based workflows to optimization experiments
-description: Learn how to organize agent optimization experiments using git branches, systematic testing scripts, and documented evaluation results for reproducible comparisons.
+title: Apply Git-based workflows to optimization experiments
+description: Learn how to organize agent optimization experiments using Git branches, systematic testing scripts, and documented evaluation results for reproducible comparisons.
 ms.date: 02/17/2026
 author: madiepev
 ms.author: madiepev
 ms.topic: unit
 ai-usage: ai-generated

-title: Apply git-based workflows to optimization experiments
+title: Apply Git-based workflows to optimization experiments
-description: "Test your understanding of agent evaluation experiments, git-based workflows, and evaluation rubrics."
+description: "Test your understanding of agent evaluation experiments, Git-based workflows, and evaluation rubrics."
 ms.date: 02/17/2026
 author: madiepev
 ms.author: madiepev
@@ -17,11 +17,11 @@ content: |
 quiz:
   title: "Check your knowledge"
   questions:
-  - content: "What is the primary reason for organizing agent optimization experiments into separate git branches?"
+  - content: "What is the primary reason for organizing agent optimization experiments into separate Git branches?"
     choices:
     - content: "To enable parallel development by multiple team members simultaneously"
       isCorrect: false
-      explanation: "Incorrect: While git supports parallel development, the primary reason for separate experiment branches is controlled comparison, not collaboration."
+      explanation: "Incorrect: While Git supports parallel development, the primary reason for separate experiment branches is controlled comparison, not collaboration."
    - content: "To isolate specific changes and attribute performance differences to individual modifications"
      isCorrect: true
      explanation: "Correct: Separate branches enable controlled comparison by testing one change at a time, making it clear which modification caused observed performance differences."
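
The correct answer hinges on isolating exactly one change per branch. A minimal sketch of that workflow, reusing the experiment names that appear later in this PR (`prompt-v2-concise`, `gpt4o-mini-model`); the branch layout itself is illustrative, not part of the changed files:

```bash
# One branch per experiment variant, each cut from the same main commit,
# so any metric difference is attributable to the single change on that branch.
git checkout main
git checkout -b prompt-v2-concise   # changes the prompt wording only
git checkout main
git checkout -b gpt4o-mini-model    # swaps the model only
```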
learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/2-design-evaluation-experiments.md (5 additions, 5 deletions)
@@ -65,21 +65,21 @@ Representative test prompts cover the spectrum of real-world usage. For the Adve
 
 -**Digital nomads planning weekend hikes**: "I'm hiking in the Scottish Highlands in March, what waterproof gear do I need from Adventure Works?"
 -**Families preparing for their first outdoor adventure**: "We're taking our teenagers on easy trails near London next month, what basic equipment should we buy or rent?"
--**Experienced hikers planning extended trips**: "I need a complete gear list for five-day backpacking trip in moderate terrain with variable weather"
+-**Experienced hikers planning extended trips**: "I need a complete gear list for a five-day backpacking trip in moderate terrain with variable weather."
 
 Edge cases test how the agent handles challenging situations:
 
 -**Ambiguous requests**: "What should I pack for hiking?"
--**Incomplete trip details**: "I need gear for Scotland"
+-**Incomplete trip details**: "I need gear for Scotland."
 -**Last-minute gear changes**: "Can I swap my camping equipment rental for different sizes?"
 
 Including five to 10 diverse test prompts provides sufficient coverage for manual testing and smoke tests while remaining practical for human evaluation. Each test prompt captures the user query, expected information needs, and ideal response characteristics.
 
 Success criteria establish what constitutes acceptable performance before you run experiments. Setting thresholds in advance prevents rationalizing disappointing results. Adventure Works defines success thresholds across all three optimization dimensions:
 
--**Quality**: Average 4.2+ (five-point scale), minimum 3.5 per response to align with customer satisfaction targets and prevent trust erosion
--**Cost**: 60% expense reduction to achieve operational efficiency goals while maintaining 85% resolution rate
--**Performance**: Average response time <30 seconds, time-to-first-token <2 seconds (streaming) to ensure acceptable user experience for real-time interactions
+-**Quality**: Average 4.2+ (five-point scale), minimum 3.5 per response to align with customer satisfaction targets and prevent trust erosion.
+-**Cost**: 60% expense reduction to achieve operational efficiency goals while maintaining 85% resolution rate.
+-**Performance**: Average response time <30 seconds, time-to-first-token <2 seconds (streaming) to ensure acceptable user experience for real-time interactions.
 
 Business requirements influence these thresholds: customer-facing agents handling trip planning need higher quality standards and faster response times than internal tools.
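
Because the thresholds are fixed before any experiment runs, results can be checked mechanically rather than argued over. A hypothetical smoke check against the quality thresholds above, assuming a `results.csv` that holds one numeric score per line (the file name and format are illustrative):

```bash
# Hypothetical check of the Adventure Works quality thresholds
# (average >= 4.2, minimum >= 3.5); results.csv holds one score per line.
avg=$(awk '{ sum += $1; n++ } END { printf "%.2f", sum / n }' results.csv)
min=$(sort -n results.csv | head -n 1)
if awk -v a="$avg" -v m="$min" 'BEGIN { exit !(a >= 4.2 && m >= 3.5) }'; then
  echo "PASS: avg=$avg min=$min"
else
  echo "FAIL: avg=$avg min=$min"
fi
```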
 | gpt4o-mini-model | Lower quality on complex prompts | No (4.1 avg, below 4.2 threshold) |
 
-If `prompt-v2-concise` meets your quality threshold and improves conciseness, use git to merge the winning experiment:
+If `prompt-v2-concise` meets your quality threshold and improves conciseness, use Git to merge the winning experiment:
 
 ```bash
 git checkout main
@@ -102,4 +102,4 @@ git push origin main --tags
 
 For experiments that don't meet criteria, document why before deciding whether to keep or delete the branch: "gpt4o-mini-model: Quality dropped below 4.2 threshold on complex trip planning prompts. Not recommended for production."
 
-With git workflows established for organizing experiments, you're ready to execute the actual evaluations by running agents against test prompts and systematically scoring the results.
+With Git workflows established for organizing experiments, you're ready to execute the actual evaluations by running agents against test prompts and systematically scoring the results.
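
The merge snippet above is truncated by the diff view: only `git checkout main` and the `git push origin main --tags` hunk-context line are visible. A plausible completion of the sequence, with the merge and tag steps assumed rather than taken from the file:

```bash
# Bring the winning experiment into main, then tag the point at which
# it passed evaluation so the comparison stays reproducible.
git checkout main
git merge prompt-v2-concise                       # assumed merge step
git tag -a prompt-v2-approved -m "Meets 4.2+ quality threshold"  # assumed tag name
git push origin main --tags
```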
learn-pr/wwl-data-ai/evaluate-optimize-agents/includes/7-summary.md (1 addition, 1 deletion)
@@ -5,7 +5,7 @@ You've learned how to optimize AI agents through structured evaluation that tran
 
 Effective optimization depends on clear metrics that measure quality, cost, and performance. Quality metrics like Intent Resolution, Relevance, and Groundedness reveal whether agents serve user needs effectively. Cost metrics quantify token usage and operational expenses, enabling you to calculate the financial impact of model changes. Performance metrics measure response times that directly affect user experience. Together, these metrics provide objective criteria for comparing agent variants.
 
-## Organize experiments with git-based workflows
+## Organize experiments with Git-based workflows
 
 Git-based workflows bring engineering discipline to agent optimization. You create one branch per experiment variant, isolating specific changes like prompt modifications or model switches. Each branch maintains test prompts, evaluation scripts, and documented results. This structured approach lets you test changes safely, compare experiments systematically, and merge successful optimizations to production with confidence.
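
The branch-per-variant comparison the summary describes can be scripted so every experiment is evaluated identically. A hedged sketch; `evaluate.py` and the prompt/result file names are placeholders, not files from this module:

```bash
# Run the same evaluation harness on each experiment branch so that
# scores are directly comparable across variants.
for branch in prompt-v2-concise gpt4o-mini-model; do
  git checkout "$branch"
  python evaluate.py --prompts test-prompts.json --output "results-${branch}.json"
done
git checkout main   # return to the baseline when done
```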
learn-pr/wwl-data-ai/evaluate-optimize-agents/index.yml (4 additions, 4 deletions)
@@ -2,7 +2,7 @@
 uid: learn.wwl.evaluate-optimize-agents
 metadata:
   title: Evaluate and optimize AI agents through structured experiments
-  description: Learn how to evaluate and optimize AI agents systematically through structured experiments that measure quality, cost, and performance. Design evaluation metrics, apply git-based workflows, create consistent scoring rubrics, and make evidence-based optimization decisions.
+  description: Learn how to evaluate and optimize AI agents systematically through structured experiments that measure quality, cost, and performance. Design evaluation metrics, apply Git-based workflows, create consistent scoring rubrics, and make evidence-based optimization decisions.
   ms.date: 02/17/2026
   author: madiepev
   ms.author: madiepev
@@ -11,17 +11,17 @@ metadata:
   ms.service: azure-ai-foundry
 title: Evaluate and optimize AI agents through structured experiments
 summary: |
-  Learn how to optimize AI agents through structured evaluation that transforms guesswork into evidence-based engineering decisions. You'll explore how to design evaluation experiments with clear metrics for quality, cost, and performance; organize experiments using git-based workflows; create evaluation rubrics for consistent scoring; and compare results to make informed optimization decisions.
+  Learn how to optimize AI agents through structured evaluation that transforms guesswork into evidence-based engineering decisions. You'll explore how to design evaluation experiments with clear metrics for quality, cost, and performance; organize experiments using Git-based workflows; create evaluation rubrics for consistent scoring; and compare results to make informed optimization decisions.
 abstract: |
   In this module, you:
   - Design evaluation experiments with clear metrics for quality, cost, and performance
-  - Apply git-based workflows to organize and compare agent variants systematically
+  - Apply Git-based workflows to organize and compare agent variants systematically
   - Create evaluation rubrics that ensure consistent scoring across human evaluators
   - Compare experiment results to make evidence-based optimization decisions
 prerequisites: |
   Before starting this module, you should have:
   - Basic understanding of AI agents and large language models
-  - Familiarity with git version control workflows
+  - Familiarity with Git version control workflows
   - Experience with Microsoft Azure AI Foundry or similar AI development platforms