
Commit 0152458

committed
update
1 parent 064bf90 commit 0152458

10 files changed

Lines changed: 342 additions & 994 deletions

learn-pr/wwl-data-ai/automated-evaluation-genaiops/4-create-synthetic-test-data.yml

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.automated-evaluation-genaiops.create-synthetic-test-data
-title: Create synthetic test datasets
+title: Create evaluation datasets
 metadata:
-  title: Create synthetic test datasets
-  description: "Design and generate synthetic test datasets using Microsoft Foundry with proper composition across common scenarios, variations, edge cases, and adversarial examples"
+  title: Create evaluation datasets
+  description: "Create comprehensive evaluation datasets from production data and synthetic generation with proper composition across common scenarios, variations, edge cases, and adversarial examples"
 ms.date: 02/22/2026
 author: madiepev
 ms.author: madiepev

learn-pr/wwl-data-ai/automated-evaluation-genaiops/6-github-actions-workflow.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ uid: learn.wwl.automated-evaluation-genaiops.github-actions-workflow
 title: Integrate evaluations into GitHub Actions
 metadata:
   title: Integrate evaluations into GitHub Actions
-  description: "Learn how to integrate automated evaluations into GitHub Actions workflows for continuous quality assurance"
+  description: "Learn how to automate Python evaluation scripts in GitHub Actions workflows triggered by pull requests"
 ms.date: 02/22/2026
 author: madiepev
 ms.author: madiepev

learn-pr/wwl-data-ai/automated-evaluation-genaiops/8-knowledge-check.yml

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ metadata:
   - N/A
 durationInMinutes: 3
 content: |
-  [!include[](includes/8-knowledge-check.md)]
 quiz:
   title: "Check your knowledge"
   questions:

learn-pr/wwl-data-ai/automated-evaluation-genaiops/includes/3-align-evaluators-human-criteria.md

Lines changed: 16 additions & 15 deletions
@@ -18,19 +18,20 @@ Start by choosing Microsoft Foundry evaluators that align with your human evalua

 | Evaluator | Measures | Best for |
 |-----------|----------|----------|
-| Groundedness | Factual accuracy based on sources | Checking if responses stick to provided information |
+| Intent Resolution | How fully the response addresses the user's need | Ensuring the agent completes the user's task |
 | Relevance | How well response addresses the question | Ensuring answers are on-topic |
-| Coherence | Logical flow and structure | Evaluating response organization |
-| Fluency | Language quality and naturalness | Assessing readability |
-| Similarity | Match to expected ground truth | Comparing against reference answers |
+| Groundedness | Factual accuracy based on sources | Checking if responses stick to provided information |
+
+> [!TIP]
+> This table shows a subset of commonly used evaluators. For a detailed specification of all available evaluators, including required inputs, scoring ranges, and implementation guidance, see the [evaluators reference](/azure/ai-foundry/concepts/built-in-evaluators).

 **Map your criteria to evaluators:**

 For Adventure Works, human evaluators assess:

-- Trail accuracy → **Groundedness** + **Relevance**
-- Response completeness → **Relevance** + **Coherence**
-- Clarity → **Coherence** + **Fluency**
+- Intent Resolution → **Intent Resolution**
+- Relevance → **Relevance**
+- Groundedness → **Groundedness**

 **Add essential evaluators beyond human criteria:**

@@ -39,9 +40,9 @@ You can include evaluators that humans don't shadow-rate but are critical for sa
 evaluators = {
     # Shadow-rated against human judgment
-    'groundedness': GroundednessEvaluator(),
+    'intent_resolution': IntentResolutionEvaluator(),
     'relevance': RelevanceEvaluator(),
-    'coherence': CoherenceEvaluator(),
+    'groundedness': GroundednessEvaluator(),

     # Essential safety checks (not shadow-rated)
     'content_safety': ContentSafetyEvaluator(),
@@ -50,7 +51,7 @@ evaluators = {
 ```

 > [!NOTE]
-> Safety and compliance evaluators serve as gates regardless of human evaluation. A response can score well on human-validated dimensions but still fail on content safety, blocking deployment.
+> Safety and compliance evaluators can serve as gates regardless of human evaluation. A response can score well on human-validated dimensions but still fail on content safety, blocking deployment.

 ## Run shadow rating

@@ -70,14 +71,14 @@ from scipy.stats import pearsonr
 # Compare scores
 df = pd.DataFrame({
     'response_id': response_ids,
-    'human_accuracy_score': human_scores,
-    'automated_groundedness': groundedness_scores
+    'human_intent_resolution_score': human_scores,
+    'automated_intent_resolution': intent_resolution_scores
 })

 # Calculate correlation
 correlation, p_value = pearsonr(
-    df['human_accuracy_score'],
-    df['automated_groundedness']
+    df['human_intent_resolution_score'],
+    df['automated_intent_resolution']
 )

 print(f"Correlation: {correlation:.2f}")
@@ -91,7 +92,7 @@ print(f"Correlation: {correlation:.2f}")
 | 0.5-0.7 | Moderate alignment | Investigate and refine |
 | < 0.5 | Weak alignment | Major adjustments needed |

-Adventure Works targets 0.75 correlation between their accuracy criteria and automated groundedness scores before trusting automation for deployment decisions.
+Adventure Works targets a 0.75 correlation between their human evaluation scores and automated evaluator scores before trusting automation for deployment decisions.

 ## Monitor alignment over time
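The shadow-rating change in the last file can be exercised end to end. Below is a minimal, self-contained sketch: the score lists are invented placeholder data (not from the module), and the alignment bands mirror the interpretation table in the changed file. It uses the same `pandas` and `scipy.stats.pearsonr` calls as the module's snippet.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical shadow-rating data: 1-5 ratings from human reviewers and
# the automated intent-resolution evaluator for the same eight responses.
human_scores = [4, 5, 3, 2, 5, 4, 3, 5]
intent_resolution_scores = [4, 5, 3, 3, 4, 4, 2, 5]

df = pd.DataFrame({
    'human_intent_resolution_score': human_scores,
    'automated_intent_resolution': intent_resolution_scores,
})

# Pearson correlation between human and automated scores
correlation, p_value = pearsonr(
    df['human_intent_resolution_score'],
    df['automated_intent_resolution'],
)

# Map the correlation onto the alignment bands from the interpretation table
if correlation >= 0.7:
    verdict = "Strong alignment"
elif correlation >= 0.5:
    verdict = "Moderate alignment"
else:
    verdict = "Weak alignment"

print(f"Correlation: {correlation:.2f} ({verdict})")
```

In a real shadow-rating run the two lists would come from exported human review scores and evaluator output rather than literals, and the verdict would feed the deployment-gating decision described in the module.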