run-tests: fix evals (query-filter regex, sibling skills, command observability) by Evangelink · Pull Request #800 · dotnet/skills

Evangelink · 2026-06-22T12:52:40Z

Why

Several run-tests eval scenarios were failing. I ran the skill-validator locally (evaluate --no-judge) to get ground truth instead of guessing, then fixed the genuine issues. (Ignoring the local-only expect_tools: ["bash"] mismatch — CI is Linux→bash, my local box is Windows→powershell.)

What was wrong & the fixes

Broken assertion regex (YAML quoting bug). The xUnit v3 query-filter assertion is a single-quoted YAML scalar using \\s/\\[. Single-quoted YAML doesn't process backslash escapes, so the regex searched for a literal \s and could never match. Corrected to single backslashes (\s, \[). Verified the fixed pattern matches the agent's dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]".
Real skill-content gap. run-tests/SKILL.md Step 3 listed only --filter-class/method/trait for xUnit v3 (no --filter-query), so the agent concluded complex xUnit v3 filters "cannot be combined" — wrong in both arms. Added --filter-query guidance (the filter-syntax skill already documents it). Verified: the isolated agent now produces the correct combined query.
Isolated-arm knowledge gaps. run-tests explicitly defers to the filter-syntax / platform-detection sibling reference skills, but the eval never loaded them in the isolated arm. Added additional_required_skills to the 8 filter/detection scenarios.
Command observability. output_matches only sees the agent's final message. "Run my tests" prompts make the agent execute and summarize ("✅ tests passed"), so the recommended command never appears and the assertion fails even when the agent did the right thing. Added a neutral "Show me the exact command" clause to the 7 execute-style prompts. This still lets assertions catch wrong commands.
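To illustrate the quoting bug in the first item: a single-quoted YAML scalar passes its text through verbatim, so the regex engine receives the doubled backslashes exactly as written. The sketch below uses Python's re module and two illustrative patterns (assumptions, not the actual eval.yaml assertion) against the command quoted above. It also assumes the validator's regex engine treats these escapes the same way Python does.

```python
import re

# The command the PR description says the agent emits.
command = 'dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"'

# Illustrative patterns only (hypothetical, not copied from eval.yaml).
# Raw strings stand in for what a single-quoted YAML scalar hands to the
# regex engine: the characters are taken verbatim, with no unescaping.
broken = r'--filter-query\\s+.*\\[Category=Smoke\\]'  # "\\" = one literal backslash
fixed = r'--filter-query\s+.*\[Category=Smoke\]'      # "\s" = whitespace, "\[" = literal [

# The broken pattern demands a literal backslash right after --filter-query,
# which the command never contains, so it cannot match.
broken_hit = re.search(broken, command) is not None
fixed_hit = re.search(fixed, command) is not None

print("broken matches:", broken_hit)
print("fixed matches:", fixed_hit)
```

In a double-quoted YAML scalar the doubled backslashes would have been unescaped to single ones, which is why the same text behaves differently depending on the quoting style.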

Not changed (real signal, not eval bugs)

Genuine plugin-arm quality misses (e.g., the model occasionally using VSTest --filter for xUnit v3, or dropping -- for blame-crash) are left in place — the eval should keep catching those. The SKILL.md improvement is the legitimate lever.

Verification

Re-ran the two representative scenarios after the fixes:

Query-filter: isolated arm now emits dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]" and the corrected regex matches.
VSTest-run: now emits dotnet test in both arms.

…ervability)

Investigated run-tests eval failures by running the validator locally.

- SKILL.md Step 3: document xUnit v3 --filter-query so the agent stops answering that complex xUnit v3 filters 'cannot be combined'.
- eval.yaml: fix a broken assertion regex. The query-filter pattern is a single-quoted YAML scalar using '\\s'/'\\[', which (unlike a double-quoted scalar) is NOT unescaped, so the regex searched for a literal '\s' and could never match. Corrected to single backslashes.
- eval.yaml: add additional_required_skills (filter-syntax / platform-detection) to the filter and detection scenarios, so the isolated arm loads the sibling reference skills that run-tests explicitly defers to.
- eval.yaml: ask the agent to show the exact command in execute-style prompts. output_matches only sees the final assistant message; 'run my tests' prompts make the agent execute and summarize ('tests passed'), so the recommended command never appears. The assertions still catch wrong commands.

Co-authored-by: Copilot <[email protected]>

Copilot

Pull request overview

This PR fixes failing run-tests evaluation scenarios in the dotnet-test plugin by correcting a broken output_matches regex, ensuring isolated eval runs load required sibling skills, and improving prompt wording so the exact recommended dotnet test command appears in the agent’s final response.

Changes:

Fixes a YAML-quoting/regex escaping issue in the xUnit v3 --filter-query assertion so it can actually match output.
Adds additional_required_skills to relevant scenarios so isolated-arm runs include the filter-syntax / platform-detection reference skills.
Updates several “run my tests” prompts to explicitly request the exact command, improving command observability for assertions.
Updates run-tests/SKILL.md to document xUnit v3’s --filter-query option for combined filters (consistent with filter-syntax).
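The command-observability point can be sketched as follows. This is a minimal model, assuming the assertion inspects only the last assistant message; the function name mirrors output_matches but the implementation, transcripts, and pattern are hypothetical.

```python
import re

def output_matches(transcript, pattern):
    """Hypothetical stand-in for the assertion: only the final message is checked."""
    final_message = transcript[-1]
    return re.search(pattern, final_message) is not None

pattern = r"dotnet test"

# "Run my tests": the agent executes the command, then only summarizes.
summarized = ["(tool call) dotnet test", "✅ tests passed"]

# Same run, but the prompt also asked to show the exact command.
shown = ["(tool call) dotnet test", "I ran `dotnet test`. ✅ tests passed"]

summarized_ok = output_matches(summarized, pattern)
shown_ok = output_matches(shown, pattern)
print(summarized_ok, shown_ok)
```

Under this model the first transcript fails the assertion even though the right command ran, which is the false negative the prompt change is meant to remove. A wrong command in the final message would still fail the pattern.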

Show a summary per file

File	Description
tests/dotnet-test/run-tests/eval.yaml	Fixes a regex that could never match under YAML single-quoting, loads required sibling skills for isolated runs, and adjusts prompts to surface the exact command in final output.
plugins/dotnet-test/skills/run-tests/SKILL.md	Documents `--filter-query` for xUnit v3 on MTP to close a skill-content gap that caused incorrect guidance in evals.

Copilot's findings


Files reviewed: 2/2 changed files
Comments generated: 0

Evangelink · 2026-06-22T13:10:22Z

/evaluate

github-actions · 2026-06-22T13:24:00Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [1]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	1.7/5 → 4.3/5 🟢	✅ run-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [2]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.0/5 → 4.7/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [3]
run-tests	Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9)	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [4]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [5]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	🟡 0.23	✅ [6]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [7]
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	🟡 0.23	✅ [8]
run-tests	Filter xUnit v3 tests by class pattern and trait using query filter language	1.0/5 → 4.7/5 🟢	✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [9]
run-tests	Filter TUnit tests by class using treenode-filter	1.7/5 → 4.7/5 🟢	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	🟡 0.23	✅ [10]
run-tests	Combine multiple filter criteria on VSTest MSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [11]
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	✅ [12]
run-tests	MTP project on SDK 10 passes args directly	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [13]
run-tests	Detect test platform from Directory.Build.props	1.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.7/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [14]

[1] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (25307 → 35677), time (12.7s → 16.9s)
[2] ⚠️ High run-to-run variance (CV=1084%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -14.6% due to: judgment, quality, time (10.3s → 13.7s)
[3] ⚠️ High run-to-run variance (CV=311%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -30.2% due to: quality, judgment, tokens (25301 → 35877), tool calls (2 → 3), time (14.9s → 19.8s)
[4] ⚠️ High run-to-run variance (CV=170%) — consider re-running with --runs 5
[5] (Isolated) Quality unchanged but weighted score is -8.8% due to: tokens (25414 → 50594), tool calls (2 → 5), time (14.9s → 23.0s)
[6] ⚠️ High run-to-run variance (CV=132%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=95%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.2% due to: tool calls (1 → 2), tokens (25000 → 35789), time (8.5s → 10.4s)
[8] ⚠️ High run-to-run variance (CV=571%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=71%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -13.9% due to: judgment, tokens (12660 → 17877)
[10] ⚠️ High run-to-run variance (CV=383%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=120%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -9.4% due to: tokens (21269 → 50787), tool calls (1 → 5), time (10.0s → 17.5s)
[12] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=181%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (25222 → 35677)
[14] ⚠️ High run-to-run variance (CV=132%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (25648 → 36031)
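The CV figures in these footnotes are presumably coefficients of variation (standard deviation divided by the mean, as a percentage). The validator's exact formula is not shown here, so the following is only a sketch under that assumption, using made-up per-run scores. It shows why the value can reach hundreds or thousands of percent when the mean improvement is close to zero.

```python
import statistics

def coefficient_of_variation(scores):
    """Sample standard deviation relative to the absolute mean, in percent."""
    mean = statistics.mean(scores)
    return statistics.stdev(scores) / abs(mean) * 100

# Made-up improvement scores for three runs of one scenario.
noisy = [0.30, -0.25, 0.01]   # runs disagree; mean is close to zero
stable = [0.30, 0.28, 0.32]   # runs agree; mean is well away from zero

noisy_cv = coefficient_of_variation(noisy)
stable_cv = coefficient_of_variation(stable)
print(f"noisy CV:  {noisy_cv:.0f}%")
print(f"stable CV: {stable_cv:.0f}%")
```

With only a few runs and a near-zero mean, a very large CV says more about the small denominator than about the skill, which is consistent with the suggestion to re-run with --runs 5.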

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27955388384 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/67b339333b6817d071a9935a764f6056a6d52968/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

The run-tests skill activated reliably in the isolated eval arm but unreliably in the plugin arm (all dotnet-test sibling descriptions loaded), so every scenario's pluginImprovementScore went negative and dragged min(isolated, plugin) below zero.

Lead the description with natural-language intent triggers that mirror how the eval prompts phrase requests (run all tests, run a subset via filters, produce TRX reports, collect crash/hang dumps, run a single TFM) instead of opening with platform-detection mechanism, and add explicit DO NOT USE redirects to code-testing-agent / mtp-hot-reload. Stays under the 1024-char description cap.

Co-authored-by: Copilot <[email protected]>

Evangelink · 2026-06-22T14:15:16Z

/evaluate

github-actions · 2026-06-22T14:25:04Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [1]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	1.7/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [2]
run-tests	Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9)	5.0/5 → 2.0/5 🔴	⚠️ NOT ACTIVATED	✅ 0.14	❌
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.14	❌ [3]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.14	✅ [4]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [5]
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [6]
run-tests	Filter xUnit v3 tests by class pattern and trait using query filter language	1.0/5 → 3.7/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅ [7]
run-tests	Filter TUnit tests by class using treenode-filter	2.3/5 → 4.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅ [8]
run-tests	Combine multiple filter criteria on VSTest MSTest	4.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [9]
run-tests	MTP project on SDK 9 must use -- separator for args	1.3/5 → 5.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	✅ [10]
run-tests	MTP project on SDK 10 passes args directly	1.0/5 → 4.0/5 🟢	⚠️ NOT ACTIVATED	✅ 0.14	❌ [11]
run-tests	Detect test platform from Directory.Build.props	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.14	❌ [12]

[1] ⚠️ High run-to-run variance (CV=143%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -3.0% due to: tokens (25279 → 35687), time (11.1s → 15.2s)
[2] ⚠️ High run-to-run variance (CV=70%) — consider re-running with --runs 5
[3] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (25467 → 35902)
[4] ⚠️ High run-to-run variance (CV=109%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=301%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.2% due to: tool calls (1 → 2), tokens (25004 → 35660), time (7.8s → 9.5s)
[6] ⚠️ High run-to-run variance (CV=231%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (33914 → 39100)
[7] ⚠️ High run-to-run variance (CV=1611%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=216%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=947%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -10.9% due to: judgment, tokens (25647 → 64031), quality, tool calls (2 → 5), time (14.4s → 28.4s)
[10] ⚠️ High run-to-run variance (CV=117%) — consider re-running with --runs 5
[11] (Plugin) Quality unchanged but weighted score is -2.2% due to: tokens (25187 → 35669)
[12] ⚠️ High run-to-run variance (CV=390%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.8% due to: tokens (25646 → 36034)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27959304151 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/3b726d597e32d8cd7602a13190578779b73b142a/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-22T15:10:04Z

✅ Evaluation passed for 3b726d5. cc @dotnet/dotnet-testing — please review.

Root cause (verified against the Copilot CLI SDK skill renderer): the model-facing skill menu has a 15000-char budget. Skills are listed alphabetically and emitted with full descriptions only until the budget is exhausted; the rest collapse to bare names with no description and effectively cannot be model-activated. With 27 dotnet-test skills, run-tests (alphabetical position ~20) fell into the name-only overflow, so it never activated in the plugin eval arm even though it activated reliably in isolation. This is a real user-facing discoverability bug, not just an eval artifact.

Fix: hide reference/primitive skills that are never meant to be model-invoked from the menu via 'disable-model-invocation: true', which the SDK filters out of the budget entirely:

- filter-syntax, platform-detection, dotnet-test-frameworks, code-testing-extensions, test-analysis-extensions: already user-invocable:false reference data ('DO NOT USE directly').
- find-untested-sources, find-untested-sources-polyglot: researcher primitives invoked by-name from the code-testing-researcher agent (which has a manual fallback); no standalone evals.

These remain invocable by explicit name (agents/users); only auto-suggestion is suppressed. This frees enough budget that run-tests (plus migrate-xunit-to-xunit-v3 and mtp-hot-reload) now receive full descriptions; no previously-visible skill regresses.

Also trimmed the run-tests description so its menu block fits with margin while keeping all activation triggers.

Co-authored-by: Copilot <[email protected]>
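The budget behaviour described in this commit message can be modelled roughly as below. This is a simplified sketch, not the SDK's renderer: the Skill type, the cost formula (name length plus description length), the tiny budget, and the description lengths are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str
    disable_model_invocation: bool = False

def skills_with_full_description(skills, budget):
    """Return names that still get a full description under the budget (simplified model)."""
    remaining = budget
    described = []
    for skill in sorted(skills, key=lambda s: s.name):  # alphabetical order
        if skill.disable_model_invocation:
            continue  # hidden from the menu, so it costs nothing
        cost = len(skill.name) + len(skill.description)  # assumed cost formula
        if cost <= remaining:
            remaining -= cost
            described.append(skill.name)
        else:
            remaining = 0  # budget exhausted: later skills are name-only
    return described

text = "x" * 60  # placeholder description text
before = skills_with_full_description(
    [Skill("filter-syntax", text), Skill("platform-detection", text), Skill("run-tests", text)],
    budget=150,
)
after = skills_with_full_description(
    [
        Skill("filter-syntax", text, disable_model_invocation=True),
        Skill("platform-detection", text, disable_model_invocation=True),
        Skill("run-tests", text),
    ],
    budget=150,
)
print(before, after)
```

In this toy setup run-tests loses its description only because alphabetically earlier skills consume the budget first; hiding the reference skills restores it without touching run-tests itself.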

Evangelink · 2026-06-22T15:21:08Z

/evaluate

Copilot

Copilot's findings

Files reviewed: 9/9 changed files
Comments generated: 7

github-actions · 2026-06-22T15:26:57Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.16	✅ [1]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	1.3/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	✅ 0.16	✅
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.0/5 → 3.7/5 🟢	⚠️ NOT ACTIVATED / ✅ run-tests; tools: glob, skill	✅ 0.16	✅ [2]
run-tests	Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9)	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.16	❌ [3]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.16	❌ [4]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ✅ run-tests; tools: skill, view	✅ 0.16	✅ [5]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.16	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.16	❌ [6]
run-tests	Filter xUnit v3 tests by class pattern and trait using query filter language	1.0/5 → 2.3/5 ⏰ 🟢	✅ run-tests; tools: report_intent, view, skill, bash / ⚠️ NOT ACTIVATED	✅ 0.16	❌ [7]
run-tests	Filter TUnit tests by class using treenode-filter	2.3/5 → 3.7/5 🟢	⚠️ NOT ACTIVATED	✅ 0.16	❌ [8]
run-tests	Combine multiple filter criteria on VSTest MSTest	4.0/5 → 4.7/5 🟢	✅ run-tests; tools: report_intent, skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.16	✅ [9]
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 2.7/5 🟢	⚠️ NOT ACTIVATED	✅ 0.16	✅
run-tests	MTP project on SDK 10 passes args directly	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.16	✅
run-tests	Detect test platform from Directory.Build.props	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.16	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.16	✅ [10]
dotnet-test-frameworks	Cross-framework assertion equivalence mapping	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [11]
dotnet-test-frameworks	Identify TUnit framework and its unique attributes	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [12]
dotnet-test-frameworks	Replace try-catch with framework-native exception assertions	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [13]
dotnet-test-frameworks	Skip annotations across all four frameworks	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [14]
dotnet-test-frameworks	Convert NUnit lifecycle methods to xUnit equivalents	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [15]
dotnet-test-frameworks	Identify integration tests by markers and code patterns	5.0/5 → 5.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [16]
dotnet-test-frameworks	Convert cross-framework assertions to TUnit syntax	1.7/5 → 2.0/5 🟢	ℹ️ not activated (expected)	✅ 0.08	❌ [17]
dotnet-test-frameworks	Diagnose silently-passing TUnit test with missing await	4.3/5 → 4.7/5 🟢	ℹ️ not activated (expected)	✅ 0.08	❌ [18]
dotnet-test-frameworks	Refactor TUnit try/catch to native exception assertion	2.3/5 → 3.0/5 🟢	ℹ️ not activated (expected)	✅ 0.08	✅ [19]
dotnet-test-frameworks	TUnit lifecycle hooks at test, class, assembly, and session scope	4.0/5 → 4.0/5	ℹ️ not activated (expected)	✅ 0.08	❌ [20]
dotnet-test-frameworks	TUnit skip mechanisms — attribute, assembly-wide, and dynamic	3.3/5 → 3.3/5	ℹ️ not activated (expected)	✅ 0.08	❌ [21]

[1] ⚠️ High run-to-run variance (CV=203%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[3] (Plugin) Quality unchanged but weighted score is -2.7% due to: tokens (25492 → 35929), time (14.3s → 18.0s)
[4] (Plugin) Quality unchanged but weighted score is -4.0% due to: tokens (25463 → 36293), tool calls (2 → 3), time (14.3s → 17.6s)
[5] ⚠️ High run-to-run variance (CV=144%) — consider re-running with --runs 5
[6] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (37092 → 51787)
[7] ⚠️ High run-to-run variance (CV=1261%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -25.2% due to: judgment, tokens (12671 → 161334), quality, tool calls (0 → 9), time (9.0s → 55.2s)
[8] ⚠️ High run-to-run variance (CV=268%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.9% due to: judgment, quality, tool calls (2 → 3)
[9] ⚠️ High run-to-run variance (CV=387%) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=120%) — consider re-running with --runs 5
[11] (Isolated) Quality unchanged but weighted score is -19.8% due to: judgment, quality
[12] ⚠️ High run-to-run variance (CV=121%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (13029 → 18226), quality
[13] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (13109 → 18314)
[14] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (12693 → 17893)
[15] ⚠️ High run-to-run variance (CV=1160%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.4% due to: tokens (13116 → 18365)
[16] ⚠️ High run-to-run variance (CV=91%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.1% due to: judgment, quality
[17] ⚠️ High run-to-run variance (CV=250%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.5% due to: tokens (12782 → 17995)
[18] ⚠️ High run-to-run variance (CV=3834%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -4.5% due to: quality, tokens (12806 → 17988)
[19] ⚠️ High run-to-run variance (CV=265%) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=65%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.4% due to: quality, tokens (13235 → 18532)
[21] (Isolated) Quality unchanged but weighted score is -22.0% due to: judgment, quality

⏰ timeout — run(s) hit the (120s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27963627901 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/7f8b17d85fc998bf24939b5643b1be426d9125b7/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

github-actions · 2026-06-22T17:00:13Z

👋 @Evangelink — this PR has 7 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Copilot AI review requested due to automatic review settings June 22, 2026 12:52

Copilot started reviewing on behalf of Evangelink June 22, 2026 12:53

Copilot AI reviewed Jun 22, 2026


github-actions Bot added the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) Jun 22, 2026

Evangelink enabled auto-merge (squash) June 22, 2026 15:07

Evangelink disabled auto-merge June 22, 2026 15:08

github-actions Bot added the waiting-on-review label (PR state label) and removed the pr-state/ready-for-eval label (PR is mergeable and awaiting evaluation) Jun 22, 2026

Copilot AI review requested due to automatic review settings June 22, 2026 15:15

Copilot started reviewing on behalf of Evangelink June 22, 2026 15:15

Copilot AI reviewed Jun 22, 2026


github-actions Bot added a commit that referenced this pull request Jun 22, 2026

Update PR token usage data (PR #800)

a9ad713

Evangelink mentioned this pull request Jun 22, 2026

skill-validator: restore 15K aggregate cap as the real Copilot CLI skill-menu budget #803

Open

github-actions Bot added the waiting-on-author label (PR state label) and removed the waiting-on-review label (PR state label) Jun 22, 2026

Evangelink enabled auto-merge (squash) June 23, 2026 07:50

YuliiaKovalova approved these changes Jun 23, 2026


Evangelink merged commit 29bfae5 into dotnet:main Jun 23, 2026
37 checks passed

Evangelink deleted the run-tests-eval-fixes branch June 23, 2026 09:04
