title: Create evaluation datasets
description: "Create comprehensive evaluation datasets from production data and synthetic generation with proper composition across common scenarios, variations, edge cases, and adversarial examples"
| Fluency | Language quality and naturalness | Assessing readability |
| Groundedness | Factual accuracy based on sources | Checking if responses stick to provided information |

> [!TIP]
> This table shows a subset of commonly used evaluators. For detailed specifications of all available evaluators, including required inputs, scoring ranges, and implementation guidance, see the [evaluators reference](/azure/ai-foundry/concepts/built-in-evaluators).
**Add essential evaluators beyond human criteria:**
You can include evaluators that humans don't shadow-rate but are critical for safety:

```python
evaluators = {
    # Shadow-rated against human judgment
    'intent_resolution': IntentResolutionEvaluator(),
    'relevance': RelevanceEvaluator(),
    'groundedness': GroundednessEvaluator(),

    # Essential safety checks (not shadow-rated)
    'content_safety': ContentSafetyEvaluator(),
}
```
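To make the gating behavior concrete, here is a minimal sketch of how safety evaluators can block deployment independently of quality scores. The aggregated scores, the 1-5 scale, the threshold, and the `passes_gates` helper are all hypothetical illustrations, not part of the evaluation SDK:

```python
# Hypothetical aggregated scores; in practice these come from running
# the evaluators above over the full evaluation dataset.
scores = {
    'intent_resolution': 4.2,  # shadow-rated quality dimensions (assumed 1-5 scale)
    'relevance': 4.5,
    'groundedness': 4.1,
    'content_safety': 2.0,     # safety dimension, below the gate
}

SAFETY_GATE_THRESHOLD = 4.0  # assumed threshold, for illustration only

def passes_gates(scores, gated=('content_safety',), threshold=SAFETY_GATE_THRESHOLD):
    """Deployment is blocked if any gated evaluator falls below the threshold,
    no matter how well the shadow-rated dimensions score."""
    return all(scores[name] >= threshold for name in gated)

print(passes_gates(scores))  # False: strong quality scores can't rescue a failed safety gate
```

The point of separating `gated` evaluators from the rest is that quality dimensions are averaged and traded off, while safety dimensions are pass/fail.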

> [!NOTE]
> Safety and compliance evaluators can serve as gates regardless of human evaluation. A response can score well on human-validated dimensions but still fail on content safety, blocking deployment.
## Run shadow rating
Adventure Works targets a 0.75 correlation between their human evaluation scores and automated evaluator scores before trusting automation for deployment decisions.
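The correlation check can be sketched with `scipy.stats.pearsonr`, which the section's code uses. The paired score lists below are illustrative placeholders for per-response human and automated ratings (a 1-5 scale is assumed):

```python
from scipy.stats import pearsonr

# Illustrative paired ratings for the same set of responses (assumed 1-5 scale).
human_scores = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
automated_scores = [5, 4, 3, 3, 4, 2, 4, 5, 3, 4]

correlation, p_value = pearsonr(human_scores, automated_scores)

CORRELATION_TARGET = 0.75  # the bar before trusting automated scores
print(f"r = {correlation:.2f}, trusted: {correlation >= CORRELATION_TARGET}")
```

If the correlation falls below the target, the automated evaluator isn't yet a reliable proxy for human judgment, and shadow rating should continue.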