run-tests: fix evals (query-filter regex, sibling skills, command observability) #800
Investigated run-tests eval failures by running the validator locally.
- SKILL.md Step 3: document xUnit v3 --filter-query so the agent stops
answering that complex xUnit v3 filters 'cannot be combined'.
- eval.yaml: fix a broken assertion regex. The query-filter pattern is a
single-quoted YAML scalar using '\\s'/'\\[', which (unlike a double-quoted
scalar) is NOT unescaped, so the regex searched for a literal '\s' and
could never match. Corrected to single backslashes.
- eval.yaml: add additional_required_skills (filter-syntax / platform-detection)
to the filter and detection scenarios, so the isolated arm loads the sibling
reference skills that run-tests explicitly defers to.
- eval.yaml: ask the agent to show the exact command in execute-style prompts.
output_matches only sees the final assistant message; 'run my tests' prompts
make the agent execute and summarize ('tests passed'), so the recommended
command never appears. The assertions still catch wrong commands.
Co-authored-by: Copilot <[email protected]>
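The quoting bug described in the commit message can be reproduced without YAML at all. The following is a minimal sketch in Python: the two pattern strings are hypothetical stand-ins (the actual eval.yaml pattern is not shown here), and raw strings are used to mimic how a single-quoted YAML scalar hands its backslashes to the regex engine verbatim.

```python
import re

# Command the agent is reported to produce for the query-filter scenario.
command = 'dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"'

# Hypothetical assertion patterns (assumption: the real eval.yaml regex differs).
# In a single-quoted YAML scalar, '\\s' stays as two backslashes plus "s",
# so the regex engine sees an escaped backslash followed by a plain "s".
broken_pattern = r'--filter-query\\s+.*\\[Category=Smoke\\]'
# With single backslashes, \s is whitespace and \[ is a literal bracket.
fixed_pattern = r'--filter-query\s+.*\[Category=Smoke\]'


def matches(pattern: str, text: str) -> bool:
    """Return True if the regex is found anywhere in the text."""
    return re.search(pattern, text) is not None


broken_result = matches(broken_pattern, command)   # no backslash in the command
fixed_result = matches(fixed_pattern, command)     # whitespace + bracket match
# The doubled form only matches text that contains a literal backslash-s.
literal_result = matches(r'\\s', r'a literal \s in the text')

print(broken_result, fixed_result, literal_result)
```

Under these assumptions the doubled-backslash pattern can only match output containing a literal backslash, which a dotnet test command line never does, so the assertion fails regardless of what the agent answers.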
Pull request overview
This PR fixes failing run-tests evaluation scenarios in the dotnet-test plugin by correcting a broken output_matches regex, ensuring isolated eval runs load required sibling skills, and improving prompt wording so the exact recommended dotnet test command appears in the agent’s final response.
Changes:
- Fixes a YAML-quoting/regex escaping issue in the xUnit v3 `--filter-query` assertion so it can actually match output.
- Adds `additional_required_skills` to relevant scenarios so isolated-arm runs include the `filter-syntax` / `platform-detection` reference skills.
- Updates several “run my tests” prompts to explicitly request the exact command, improving command observability for assertions.
- Updates `run-tests/SKILL.md` to document xUnit v3’s `--filter-query` option for combined filters (consistent with `filter-syntax`).
Show a summary per file
| File | Description |
|---|---|
| tests/dotnet-test/run-tests/eval.yaml | Fixes a regex that could never match under YAML single-quoting, loads required sibling skills for isolated runs, and adjusts prompts to surface the exact command in final output. |
| plugins/dotnet-test/skills/run-tests/SKILL.md | Documents --filter-query for xUnit v3 on MTP to close a skill-content gap that caused incorrect guidance in evals. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 0
/evaluate
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (25307 → 35677), time (12.7s → 16.9s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
The run-tests skill activated reliably in the isolated eval arm but unreliably in the plugin arm (all dotnet-test sibling descriptions loaded), so every scenario's pluginImprovementScore went negative and dragged min(isolated, plugin) below zero.
Lead the description with natural-language intent triggers that mirror how the eval prompts phrase requests (run all tests, run a subset via filters, produce TRX reports, collect crash/hang dumps, run a single TFM) instead of opening with platform-detection mechanism, and add explicit DO NOT USE redirects to code-testing-agent / mtp-hot-reload. Stays under the 1024-char description cap.
Co-authored-by: Copilot <[email protected]>
/evaluate
Skill Validation Results
[1] Model: claude-opus-4.6 | Judge: claude-opus-4.6
✅ Evaluation passed for
Root cause (verified against the Copilot CLI SDK skill renderer): the
model-facing skill menu has a 15000-char budget. Skills are listed
alphabetically and emitted with full descriptions only until the budget
is exhausted; the rest collapse to bare names with no description and
effectively cannot be model-activated. With 27 dotnet-test skills,
run-tests (alphabetical position ~20) fell into the name-only overflow,
so it never activated in the plugin eval arm even though it activated
reliably in isolation. This is a real user-facing discoverability bug,
not just an eval artifact.
Fix: hide reference/primitive skills that are never meant to be
model-invoked from the menu via 'disable-model-invocation: true', which
the SDK filters out of the budget entirely:
- filter-syntax, platform-detection, dotnet-test-frameworks,
code-testing-extensions, test-analysis-extensions — already
user-invocable:false reference data ('DO NOT USE directly').
- find-untested-sources, find-untested-sources-polyglot — researcher
primitives invoked by-name from the code-testing-researcher agent
(which has a manual fallback); no standalone evals.
These remain invocable by explicit name (agents/users), only auto-
suggestion is suppressed.
This frees enough budget that run-tests (plus migrate-xunit-to-xunit-v3
and mtp-hot-reload) now receive full descriptions; no previously-visible
skill regresses. Also trimmed the run-tests description so its menu block
fits with margin while keeping all activation triggers.
Co-authored-by: Copilot <[email protected]>
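The overflow and the fix described in this commit message can be sketched with a toy renderer. This is an illustration only, not the Copilot CLI SDK implementation: the `Skill` class, the `render_menu` function, the cost model (name length plus description length) and the tiny budget of 120 are all assumptions chosen to make the effect visible with three skills.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    disable_model_invocation: bool = False  # mirrors 'disable-model-invocation: true'


def render_menu(skills, budget):
    """Return {name: has_full_description} for a budgeted, alphabetical menu.

    Hidden skills are dropped before budgeting. Once one skill no longer
    fits, it and every later skill collapse to a bare name.
    """
    visible = sorted(
        (s for s in skills if not s.disable_model_invocation),
        key=lambda s: s.name,
    )
    remaining = budget
    exhausted = False
    menu = {}
    for skill in visible:
        cost = len(skill.name) + len(skill.description)  # assumed cost model
        if not exhausted and cost <= remaining:
            remaining -= cost
            menu[skill.name] = True
        else:
            exhausted = True
            menu[skill.name] = False
    return menu


skills = [
    Skill("run-tests", "x" * 40),
    Skill("filter-syntax", "x" * 40),
    Skill("platform-detection", "x" * 40),
]

# Before: the two reference skills consume the budget first (alphabetical order).
before = render_menu(skills, budget=120)

# After: reference skills are hidden from the menu, freeing budget for run-tests.
hidden = {"filter-syntax", "platform-detection"}
after = render_menu(
    [replace(s, disable_model_invocation=s.name in hidden) for s in skills],
    budget=120,
)

print(before)
print(after)
```

In this toy setup run-tests is name-only before the change and gets its full description afterwards, which is the same shape as the behaviour reported for the 27-skill plugin.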
/evaluate
Skill Validation Results
[1]
Model: claude-opus-4.6 | Judge: claude-opus-4.6
👋 @Evangelink — this PR has 7 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the |
Why

Several `run-tests` eval scenarios were failing. I ran the skill-validator locally (`evaluate --no-judge`) to get ground truth instead of guessing, then fixed the genuine issues. (Ignoring the local-only `expect_tools: ["bash"]` mismatch: CI is Linux→bash, my local box is Windows→powershell.)

What was wrong & the fixes

Broken assertion regex (YAML quoting bug). The xUnit v3 query-filter assertion is a single-quoted YAML scalar using `\\s` / `\\[`. Single-quoted YAML doesn't process backslash escapes, so the regex searched for a literal `\s` and could never match. Corrected to single backslashes (`\s`, `\[`). Verified the fixed pattern matches the agent's `dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"`.

Real skill-content gap. `run-tests/SKILL.md` Step 3 listed only `--filter-class/method/trait` for xUnit v3 (no `--filter-query`), so the agent concluded complex xUnit v3 filters "cannot be combined", which is wrong in both arms. Added `--filter-query` guidance (the `filter-syntax` skill already documents it). Verified: the isolated agent now produces the correct combined query.

Isolated-arm knowledge gaps. `run-tests` explicitly defers to the `filter-syntax` / `platform-detection` sibling reference skills, but the eval never loaded them in the isolated arm. Added `additional_required_skills` to the 8 filter/detection scenarios.

Command observability. `output_matches` only sees the agent's final message. "Run my tests" prompts make the agent execute and summarize ("✅ tests passed"), so the recommended command never appears and the assertion fails even when the agent did the right thing. Added a neutral "Show me the exact command" clause to the 7 execute-style prompts. This still lets assertions catch wrong commands.

Not changed (real signal, not eval bugs)

Genuine plugin-arm quality misses (e.g., the model occasionally using VSTest `--filter` for xUnit v3, or dropping `--` for blame-crash) are left in place; the eval should keep catching those. The `SKILL.md` improvement is the legitimate lever.

Verification

Re-ran the two representative scenarios after the fixes:
- `dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"` and the corrected regex matches.
- `dotnet test` in both arms.
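The command-observability point can be illustrated the same way. The sketch below is a hypothetical stand-in for `output_matches`, assuming only what the description states (the assertion sees the final assistant message and nothing else); the function body, the pattern and the sample transcripts are invented for illustration.

```python
import re


def output_matches(messages, pattern):
    """Hypothetical stand-in: apply the regex to the final message only."""
    return re.search(pattern, messages[-1]) is not None


# Illustrative pattern, not the one used in eval.yaml.
pattern = r'dotnet test -- --filter-query\s'

# "Run my tests" style: the command is executed mid-conversation, then summarized.
summary_only = [
    'Running: dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"',
    '✅ tests passed',
]

# Same work, but the prompt also asked for the exact command.
shows_command = [
    'Running: dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"',
    'I ran: dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]" and all tests passed',
]

# A wrong command (VSTest-style --filter) is still caught when it is shown.
wrong_command = [
    'Running: dotnet test --filter "Category=Smoke"',
    'I ran: dotnet test --filter "Category=Smoke" and all tests passed',
]

results = {
    "summary_only": output_matches(summary_only, pattern),
    "shows_command": output_matches(shows_command, pattern),
    "wrong_command": output_matches(wrong_command, pattern),
}
print(results)
```

Only the transcript whose final message repeats the correct command passes, which is why the added prompt clause fixes the false negatives without hiding genuinely wrong commands.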