run-tests: fix evals (query-filter regex, sibling skills, command observability) #800

Merged
Evangelink merged 3 commits into dotnet:main from Evangelink:run-tests-eval-fixes
Jun 23, 2026

@Evangelink

Member

Why

Several run-tests eval scenarios were failing. I ran the skill-validator locally (evaluate --no-judge) to get ground truth instead of guessing, then fixed the genuine issues. (Ignoring the local-only expect_tools: ["bash"] mismatch — CI is Linux→bash, my local box is Windows→powershell.)

What was wrong & the fixes

  1. Broken assertion regex (YAML quoting bug). The xUnit v3 query-filter assertion is a single-quoted YAML scalar using \\s/\\[. Single-quoted YAML doesn't process backslash escapes, so the regex searched for a literal \s and could never match. Corrected to single backslashes (\s, \[). Verified the fixed pattern matches the agent's dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]".

  2. Real skill-content gap. run-tests/SKILL.md Step 3 listed only --filter-class/method/trait for xUnit v3 (no --filter-query), so the agent concluded complex xUnit v3 filters "cannot be combined" — wrong in both arms. Added --filter-query guidance (the filter-syntax skill already documents it). Verified: the isolated agent now produces the correct combined query.

  3. Isolated-arm knowledge gaps. run-tests explicitly defers to the filter-syntax / platform-detection sibling reference skills, but the eval never loaded them in the isolated arm. Added additional_required_skills to the 8 filter/detection scenarios.

  4. Command observability. output_matches only sees the agent's final message. "Run my tests" prompts make the agent execute and summarize ("✅ tests passed"), so the recommended command never appears and the assertion fails even when the agent did the right thing. Added a neutral "Show me the exact command" clause to the 7 execute-style prompts. This still lets assertions catch wrong commands.
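
The quoting bug in item 1 can be reproduced outside YAML. The sketch below is a minimal illustration in Python, not the validator's own code: the two patterns are hypothetical stand-ins for the real assertion (which is not shown in this description), and only the agent command is taken from the text above. It assumes the regex engine receives the single-quoted scalar verbatim, so `\\s` arrives as an escaped literal backslash followed by the letter `s`.

```python
import re

# The command the PR description says the agent produced.
agent_output = 'dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]"'

# Hypothetical patterns (the real assertion text is not shown in the PR).
# A single-quoted YAML scalar is passed through verbatim, so '\\s' reaches
# the regex engine as "literal backslash, then the letter s".
broken_pattern = r'--filter-query\\s+.*\\[Category=Smoke\\]'
fixed_pattern = r'--filter-query\s+.*\[Category=Smoke\]'

broken_matches = re.search(broken_pattern, agent_output) is not None
fixed_matches = re.search(fixed_pattern, agent_output) is not None

print("broken pattern matches:", broken_matches)
print("fixed pattern matches:", fixed_matches)
```

The broken pattern requires a literal backslash right after `--filter-query`, which the command never contains, so only the single-backslash form can match.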

Not changed (real signal, not eval bugs)

Genuine plugin-arm quality misses (e.g., the model occasionally using VSTest --filter for xUnit v3, or dropping -- for blame-crash) are left in place — the eval should keep catching those. The SKILL.md improvement is the legitimate lever.

Verification

Re-ran the two representative scenarios after the fixes:

  • Query-filter: isolated arm now emits dotnet test -- --filter-query "/*/*/*Integration*/*[Category=Smoke]" and the corrected regex matches.
  • VSTest-run: now emits dotnet test in both arms.

…ervability)

Investigated run-tests eval failures by running the validator locally.

- SKILL.md Step 3: document xUnit v3 --filter-query so the agent stops
  answering that complex xUnit v3 filters 'cannot be combined'.
- eval.yaml: fix a broken assertion regex. The query-filter pattern is a
  single-quoted YAML scalar using '\\s'/'\\[', which (unlike a double-quoted
  scalar) is NOT unescaped, so the regex searched for a literal '\s' and
  could never match. Corrected to single backslashes.
- eval.yaml: add additional_required_skills (filter-syntax / platform-detection)
  to the filter and detection scenarios, so the isolated arm loads the sibling
  reference skills that run-tests explicitly defers to.
- eval.yaml: ask the agent to show the exact command in execute-style prompts.
  output_matches only sees the final assistant message; 'run my tests' prompts
  make the agent execute and summarize ('tests passed'), so the recommended
  command never appears. The assertions still catch wrong commands.
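
The observability problem in the last bullet can be sketched as follows. This is an illustrative Python stand-in, not the validator's implementation: the function name, the transcript contents, and the pattern are all hypothetical, and the only behavior taken from the commit message is that the assertion inspects the final assistant message alone.

```python
import re


def final_message_matches(pattern: str, messages: list[str]) -> bool:
    """Hypothetical stand-in for output_matches: check only the last message."""
    return bool(messages) and re.search(pattern, messages[-1]) is not None


pattern = r"dotnet test\b"  # hypothetical assertion pattern

# 'Run my tests': the command is executed in a tool call, then summarized.
summary_only = ["(tool call) dotnet test", "tests passed"]

# Same prompt plus 'Show me the exact command': the command is in the answer.
with_command = ["(tool call) dotnet test", "I ran `dotnet test`; tests passed"]

hidden = final_message_matches(pattern, summary_only)
visible = final_message_matches(pattern, with_command)
print(hidden, visible)
```

Because the pattern is unchanged, a wrong command in the final answer would still fail the assertion; the prompt change only makes the command visible.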

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings June 22, 2026 12:52

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes failing run-tests evaluation scenarios in the dotnet-test plugin by correcting a broken output_matches regex, ensuring isolated eval runs load required sibling skills, and improving prompt wording so the exact recommended dotnet test command appears in the agent’s final response.

Changes:

  • Fixes a YAML-quoting/regex escaping issue in the xUnit v3 --filter-query assertion so it can actually match output.
  • Adds additional_required_skills to relevant scenarios so isolated-arm runs include the filter-syntax / platform-detection reference skills.
  • Updates several “run my tests” prompts to explicitly request the exact command, improving command observability for assertions.
  • Updates run-tests/SKILL.md to document xUnit v3’s --filter-query option for combined filters (consistent with filter-syntax).
Show a summary per file
| File | Description |
| --- | --- |
| tests/dotnet-test/run-tests/eval.yaml | Fixes a regex that could never match under YAML single-quoting, loads required sibling skills for isolated runs, and adjusts prompts to surface the exact command in final output. |
| plugins/dotnet-test/skills/run-tests/SKILL.md | Documents --filter-query for xUnit v3 on MTP to close a skill-content gap that caused incorrect guidance in evals. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@Evangelink

Member Author

/evaluate

@github-actions github-actions Bot added the pr-state/ready-for-eval PR is mergeable and awaiting evaluation label Jun 22, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | [1] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 1.7/5 → 4.3/5 🟢 | ✅ run-tests; tools: skill, report_intent / ⚠️ NOT ACTIVATED | 🟡 0.23 | [2] |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 3.0/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | [3] |
| run-tests | Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9) | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [4] |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [5] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED | 🟡 0.23 | [6] |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [7] |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [8] |
| run-tests | Filter xUnit v3 tests by class pattern and trait using query filter language | 1.0/5 → 4.7/5 🟢 | ✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [9] |
| run-tests | Filter TUnit tests by class using treenode-filter | 1.7/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED | 🟡 0.23 | [10] |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, report_intent, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [11] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | [12] |
| run-tests | MTP project on SDK 10 passes args directly | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | [13] |
| run-tests | Detect test platform from Directory.Build.props | 1.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.7/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | 🟡 0.23 | [14] |

[1] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (25307 → 35677), time (12.7s → 16.9s)
[2] ⚠️ High run-to-run variance (CV=1084%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -14.6% due to: judgment, quality, time (10.3s → 13.7s)
[3] ⚠️ High run-to-run variance (CV=311%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -30.2% due to: quality, judgment, tokens (25301 → 35877), tool calls (2 → 3), time (14.9s → 19.8s)
[4] ⚠️ High run-to-run variance (CV=170%) — consider re-running with --runs 5
[5] (Isolated) Quality unchanged but weighted score is -8.8% due to: tokens (25414 → 50594), tool calls (2 → 5), time (14.9s → 23.0s)
[6] ⚠️ High run-to-run variance (CV=132%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=95%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.2% due to: tool calls (1 → 2), tokens (25000 → 35789), time (8.5s → 10.4s)
[8] ⚠️ High run-to-run variance (CV=571%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=71%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -13.9% due to: judgment, tokens (12660 → 17877)
[10] ⚠️ High run-to-run variance (CV=383%) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=120%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -9.4% due to: tokens (21269 → 50787), tool calls (1 → 5), time (10.0s → 17.5s)
[12] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=181%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (25222 → 35677)
[14] ⚠️ High run-to-run variance (CV=132%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (25648 → 36031)
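
The CV figures in these notes are coefficients of variation. The report does not show how the validator computes them, so the sketch below is only an assumption-laden illustration in Python: it uses the sample standard deviation divided by the absolute mean, and the per-run scores are invented. It shows why a mean close to zero can push CV into the hundreds or thousands of percent even when the individual runs differ modestly.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation divided by |mean| (assumed definition)."""
    m = mean(scores)
    if m == 0:
        return float("inf")  # CV is undefined when the mean is exactly zero
    return stdev(scores) / abs(m)


# Invented per-run improvement scores, for illustration only.
near_zero_mean = [0.30, -0.25, 0.01]  # runs disagree in sign, mean ~ 0.02
stable_runs = [0.20, 0.22, 0.21]      # runs agree closely, mean ~ 0.21

noisy_cv = coefficient_of_variation(near_zero_mean)
stable_cv = coefficient_of_variation(stable_runs)
print(f"noisy:  CV = {noisy_cv:.0%}")
print(f"stable: CV = {stable_cv:.0%}")
```

Under this reading, a very large CV says more about the mean sitting near zero than about any single run being wildly off, which is consistent with the suggestion to re-run with more samples.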

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27955388384 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/67b339333b6817d071a9935a764f6056a6d52968/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

The run-tests skill activated reliably in the isolated eval arm but
unreliably in the plugin arm (all dotnet-test sibling descriptions
loaded), so every scenario's pluginImprovementScore went negative and
dragged min(isolated, plugin) below zero.
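
As a rough illustration of that aggregation (a Python sketch with invented numbers; only the `min(isolated, plugin)` rule comes from the message above, and the function name is hypothetical):

```python
def combined_improvement(isolated: float, plugin: float) -> float:
    """Take the worse of the two eval arms, per min(isolated, plugin)."""
    return min(isolated, plugin)


# Invented scores: the skill helps in isolation but never activates
# in the plugin arm, so the plugin arm only pays the extra token cost.
isolated_score = 0.35
plugin_score = -0.03

combined = combined_improvement(isolated_score, plugin_score)
print(combined)
```

A strong isolated result cannot compensate for a negative plugin result under this rule, which is why activation in the plugin arm matters.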

Lead the description with natural-language intent triggers that mirror
how the eval prompts phrase requests (run all tests, run a subset via
filters, produce TRX reports, collect crash/hang dumps, run a single
TFM) instead of opening with platform-detection mechanism, and add
explicit DO NOT USE redirects to code-testing-agent / mtp-hot-reload.
Stays under the 1024-char description cap.

Co-authored-by: Copilot <[email protected]>
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | [1] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 1.7/5 → 5.0/5 🟢 | ✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 3.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | [2] |
| run-tests | Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9) | 5.0/5 → 2.0/5 🔴 | ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [3] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | [4] |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | [5] |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | [6] |
| run-tests | Filter xUnit v3 tests by class pattern and trait using query filter language | 1.0/5 → 3.7/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [7] |
| run-tests | Filter TUnit tests by class using treenode-filter | 2.3/5 → 4.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [8] |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | [9] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.3/5 → 5.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [10] |
| run-tests | MTP project on SDK 10 passes args directly | 1.0/5 → 4.0/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [11] |
| run-tests | Detect test platform from Directory.Build.props | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.14 | [12] |

[1] ⚠️ High run-to-run variance (CV=143%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -3.0% due to: tokens (25279 → 35687), time (11.1s → 15.2s)
[2] ⚠️ High run-to-run variance (CV=70%) — consider re-running with --runs 5
[3] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (25467 → 35902)
[4] ⚠️ High run-to-run variance (CV=109%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=301%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.2% due to: tool calls (1 → 2), tokens (25004 → 35660), time (7.8s → 9.5s)
[6] ⚠️ High run-to-run variance (CV=231%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (33914 → 39100)
[7] ⚠️ High run-to-run variance (CV=1611%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=216%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=947%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -10.9% due to: judgment, tokens (25647 → 64031), quality, tool calls (2 → 5), time (14.4s → 28.4s)
[10] ⚠️ High run-to-run variance (CV=117%) — consider re-running with --runs 5
[11] (Plugin) Quality unchanged but weighted score is -2.2% due to: tokens (25187 → 35669)
[12] ⚠️ High run-to-run variance (CV=390%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.8% due to: tokens (25646 → 36034)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27959304151 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/3b726d597e32d8cd7602a13190578779b73b142a/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@Evangelink Evangelink enabled auto-merge (squash) June 22, 2026 15:07
@Evangelink Evangelink disabled auto-merge June 22, 2026 15:08
@github-actions github-actions Bot added waiting-on-review PR state label and removed pr-state/ready-for-eval PR is mergeable and awaiting evaluation labels Jun 22, 2026
@github-actions

Contributor

✅ Evaluation passed for 3b726d5. cc @dotnet/dotnet-testing — please review.

Root cause (verified against the Copilot CLI SDK skill renderer): the
model-facing skill menu has a 15000-char budget. Skills are listed
alphabetically and emitted with full descriptions only until the budget
is exhausted; the rest collapse to bare names with no description and
effectively cannot be model-activated. With 27 dotnet-test skills,
run-tests (alphabetical position ~20) fell into the name-only overflow,
so it never activated in the plugin eval arm even though it activated
reliably in isolation. This is a real user-facing discoverability bug,
not just an eval artifact.

Fix: hide reference/primitive skills that are never meant to be
model-invoked from the menu via 'disable-model-invocation: true', which
the SDK filters out of the budget entirely:
  - filter-syntax, platform-detection, dotnet-test-frameworks,
    code-testing-extensions, test-analysis-extensions — already
    user-invocable:false reference data ('DO NOT USE directly').
  - find-untested-sources, find-untested-sources-polyglot — researcher
    primitives invoked by-name from the code-testing-researcher agent
    (which has a manual fallback); no standalone evals.
These remain invocable by explicit name (by agents or users); only
auto-suggestion is suppressed.

This frees enough budget that run-tests (plus migrate-xunit-to-xunit-v3
and mtp-hot-reload) now receive full descriptions; no previously-visible
skill regresses. Also trimmed the run-tests description so its menu block
fits with margin while keeping all activation triggers.
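
The overflow behavior described above can be modeled with a small simulation. This Python sketch is not the Copilot CLI SDK renderer: the entry format, the skill names, and the 800-character descriptions are invented, and it assumes that once one entry no longer fits, every later skill is listed by name only. Only the 15000-char budget, the alphabetical ordering, the count of 27 skills, and the effect of `disable-model-invocation` are taken from the commit message.

```python
from dataclasses import dataclass

MENU_BUDGET = 15000  # characters, per the commit message


@dataclass
class Skill:
    name: str
    description: str
    disable_model_invocation: bool = False


def render_menu(skills: list[Skill], budget: int = MENU_BUDGET) -> dict[str, bool]:
    """Map each listed skill name to True if it keeps its full description."""
    listed: dict[str, bool] = {}
    used = 0
    exhausted = False
    for skill in sorted(skills, key=lambda s: s.name):
        if skill.disable_model_invocation:
            continue  # hidden skills are filtered out and consume no budget
        entry_len = len(f"{skill.name}: {skill.description}")  # assumed format
        if not exhausted and used + entry_len <= budget:
            used += entry_len
            listed[skill.name] = True
        else:
            exhausted = True  # assumption: everything after this is name-only
            listed[skill.name] = False
    return listed


def make_skills(hidden: int) -> list[Skill]:
    """27 hypothetical skills; the first `hidden` are excluded from the menu."""
    early = [Skill(f"a{i:02d}", "x" * 800, i < hidden) for i in range(19)]
    late = [Skill(f"z{i:02d}", "x" * 800) for i in range(7)]
    return early + [Skill("run-tests", "x" * 800)] + late


before = render_menu(make_skills(hidden=0))
after = render_menu(make_skills(hidden=7))
print(before["run-tests"], after["run-tests"])
```

In this model, hiding seven reference skills moves run-tests far enough up the budget that it keeps its description, while skills that already had full descriptions are unaffected.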

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings June 22, 2026 15:15
@Evangelink

Member Author

/evaluate

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 9/9 changed files
  • Comments generated: 7

Comment thread plugins/dotnet-test/skills/platform-detection/SKILL.md
Comment thread plugins/dotnet-test/skills/filter-syntax/SKILL.md
Comment thread plugins/dotnet-test/skills/code-testing-extensions/SKILL.md
Comment thread plugins/dotnet-test/skills/test-analysis-extensions/SKILL.md
Comment thread plugins/dotnet-test/skills/find-untested-sources/SKILL.md
Comment thread plugins/dotnet-test/skills/find-untested-sources-polyglot/SKILL.md
Comment thread plugins/dotnet-test/skills/dotnet-test-frameworks/SKILL.md
github-actions Bot added a commit that referenced this pull request Jun 22, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED | ✅ 0.16 | [1] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 1.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED | ✅ 0.16 | |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 2.0/5 → 3.7/5 🟢 | ⚠️ NOT ACTIVATED / ✅ run-tests; tools: glob, skill | ✅ 0.16 | [2] |
| run-tests | Run tests on a specific TFM with TRX in a multi-TFM MTP project (SDK 9) | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.16 | [3] |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.16 | [4] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, bash / ✅ run-tests; tools: skill, view | ✅ 0.16 | [5] |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view | ✅ 0.16 | |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.16 | [6] |
| run-tests | Filter xUnit v3 tests by class pattern and trait using query filter language | 1.0/5 → 2.3/5 ⏰ 🟢 | ✅ run-tests; tools: report_intent, view, skill, bash / ⚠️ NOT ACTIVATED | ✅ 0.16 | [7] |
| run-tests | Filter TUnit tests by class using treenode-filter | 2.3/5 → 3.7/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.16 | [8] |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.0/5 → 4.7/5 🟢 | ✅ run-tests; tools: report_intent, skill, view, bash / ⚠️ NOT ACTIVATED | ✅ 0.16 | [9] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.0/5 → 2.7/5 🟢 | ⚠️ NOT ACTIVATED | ✅ 0.16 | |
| run-tests | MTP project on SDK 10 passes args directly | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.16 | |
| run-tests | Detect test platform from Directory.Build.props | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.16 | |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED | ✅ 0.16 | [10] |
| dotnet-test-frameworks | Cross-framework assertion equivalence mapping | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [11] |
| dotnet-test-frameworks | Identify TUnit framework and its unique attributes | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [12] |
| dotnet-test-frameworks | Replace try-catch with framework-native exception assertions | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [13] |
| dotnet-test-frameworks | Skip annotations across all four frameworks | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [14] |
| dotnet-test-frameworks | Convert NUnit lifecycle methods to xUnit equivalents | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [15] |
| dotnet-test-frameworks | Identify integration tests by markers and code patterns | 5.0/5 → 5.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [16] |
| dotnet-test-frameworks | Convert cross-framework assertions to TUnit syntax | 1.7/5 → 2.0/5 🟢 | ℹ️ not activated (expected) | ✅ 0.08 | [17] |
| dotnet-test-frameworks | Diagnose silently-passing TUnit test with missing await | 4.3/5 → 4.7/5 🟢 | ℹ️ not activated (expected) | ✅ 0.08 | [18] |
| dotnet-test-frameworks | Refactor TUnit try/catch to native exception assertion | 2.3/5 → 3.0/5 🟢 | ℹ️ not activated (expected) | ✅ 0.08 | [19] |
| dotnet-test-frameworks | TUnit lifecycle hooks at test, class, assembly, and session scope | 4.0/5 → 4.0/5 | ℹ️ not activated (expected) | ✅ 0.08 | [20] |
| dotnet-test-frameworks | TUnit skip mechanisms — attribute, assembly-wide, and dynamic | 3.3/5 → 3.3/5 | ℹ️ not activated (expected) | ✅ 0.08 | [21] |

[1] ⚠️ High run-to-run variance (CV=203%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=68%) — consider re-running with --runs 5
[3] (Plugin) Quality unchanged but weighted score is -2.7% due to: tokens (25492 → 35929), time (14.3s → 18.0s)
[4] (Plugin) Quality unchanged but weighted score is -4.0% due to: tokens (25463 → 36293), tool calls (2 → 3), time (14.3s → 17.6s)
[5] ⚠️ High run-to-run variance (CV=144%) — consider re-running with --runs 5
[6] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (37092 → 51787)
[7] ⚠️ High run-to-run variance (CV=1261%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -25.2% due to: judgment, tokens (12671 → 161334), quality, tool calls (0 → 9), time (9.0s → 55.2s)
[8] ⚠️ High run-to-run variance (CV=268%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.9% due to: judgment, quality, tool calls (2 → 3)
[9] ⚠️ High run-to-run variance (CV=387%) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=120%) — consider re-running with --runs 5
[11] (Isolated) Quality unchanged but weighted score is -19.8% due to: judgment, quality
[12] ⚠️ High run-to-run variance (CV=121%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (13029 → 18226), quality
[13] (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (13109 → 18314)
[14] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.4% due to: tokens (12693 → 17893)
[15] ⚠️ High run-to-run variance (CV=1160%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.4% due to: tokens (13116 → 18365)
[16] ⚠️ High run-to-run variance (CV=91%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.1% due to: judgment, quality
[17] ⚠️ High run-to-run variance (CV=250%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.5% due to: tokens (12782 → 17995)
[18] ⚠️ High run-to-run variance (CV=3834%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -4.5% due to: quality, tokens (12806 → 17988)
[19] ⚠️ High run-to-run variance (CV=265%) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=65%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -6.4% due to: quality, tokens (13235 → 18532)
[21] (Isolated) Quality unchanged but weighted score is -22.0% due to: judgment, quality

⏰ timeout: run(s) hit the 120s scenario timeout limit; scoring may be affected because model execution was aborted before it could produce its full output (increase the limit via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 800 in dotnet/skills, download eval artifacts with gh run download 27963627901 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/7f8b17d85fc998bf24939b5643b1be426d9125b7/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

@github-actions

Contributor

👋 @Evangelink — this PR has 7 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

@Evangelink Evangelink enabled auto-merge (squash) June 23, 2026 07:50
@Evangelink Evangelink merged commit 29bfae5 into dotnet:main Jun 23, 2026
37 checks passed
@Evangelink Evangelink deleted the run-tests-eval-fixes branch June 23, 2026 09:04
Labels

waiting-on-author PR state label
