Fix code-testing-agent activation for the Flask pytest scenario by Evangelink · Pull Request #802 · dotnet/skills

Evangelink · 2026-06-22T15:25:32Z

What

Fixes plugin- and isolated-mode skill activation for the code-testing-agent scenario "Generate pytest tests for the Flask tasks API (Python polyglot)".

Root cause

The scenario prompt enumerated, layer by layer and class by class, exactly what to mock, inject, and test (TaskService with the repository mocked and the clock injected, queries.apply_query over fixed Task lists, both repositories, SqliteTaskRepository against an in-memory connection, the blueprint via test_client…). That level of detail acts as an answer key: the base agent just executed it directly with edit tools and never routed to the code-testing-agent research → plan → implement pipeline.
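To make the "answer key" concrete, here is a minimal sketch of the kind of test the old prompt effectively dictated: a service with its repository mocked and its clock injected. The `TaskService` class below is a hypothetical stand-in written for this illustration; the real class in the scenario's sample app is not shown in this PR and may look different.

```python
from datetime import datetime, timezone
from unittest.mock import Mock


class TaskService:
    """Hypothetical stand-in for the scenario's service layer (assumption)."""

    def __init__(self, repository, clock):
        self._repository = repository
        self._clock = clock

    def create(self, title):
        # Validation path: reject blank titles before touching the repository.
        if not title.strip():
            raise ValueError("title must not be empty")
        task = {"title": title.strip(), "created_at": self._clock()}
        return self._repository.add(task)


def test_create_stamps_task_with_injected_clock():
    fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)

    # Repository is mocked: it echoes the task back with an id attached.
    repository = Mock()
    repository.add.side_effect = lambda task: {**task, "id": 1}

    # Clock is injected, so the timestamp is deterministic.
    service = TaskService(repository, clock=lambda: fixed_now)
    created = service.create("  write tests  ")

    repository.add.assert_called_once()
    assert created == {"id": 1, "title": "write tests", "created_at": fixed_now}
```

When the prompt spells out this shape for every layer, the base agent has little left to research or plan, which is consistent with it skipping the skill.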

Reproduced locally with the skill-validator: the original prompt yielded detected: [] in both isolated and plugin runs. The sibling ContosoUniversity scenario (same skill, concise high-level prompt) activates fine.

Fix

Rewrote the prompt to a realistic, high-level ask mirroring the ContosoUniversity scenario:

- describes the app at a layer level (service / repository / query / blueprint) instead of a per-module test checklist,
- keeps the "no tests yet", project-wide multi-file framing,
- keeps the 80% line+branch coverage floor and the pip install -e ".[test]" + pytest + coverage.xml expectations.
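As a rough illustration of the coverage floor, the sketch below reads the `line-rate` and `branch-rate` attributes that coverage.py writes on the root element of its XML report and compares each to 0.80. How the scenario's grader actually evaluates coverage.xml is not shown in this PR, so the helper name `coverage_meets_floor`, the sample report, and the choice to check both rates separately are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down report; a real coverage.xml carries more attributes
# and nested <packages>/<classes> elements.
SAMPLE_REPORT = '<coverage line-rate="0.86" branch-rate="0.81"></coverage>'


def coverage_meets_floor(xml_text, floor=0.80):
    """Return True when both line and branch rates reach the floor."""
    root = ET.fromstring(xml_text)
    line_rate = float(root.get("line-rate", 0))
    branch_rate = float(root.get("branch-rate", 0))
    return line_rate >= floor and branch_rate >= floor


if __name__ == "__main__":
    print(coverage_meets_floor(SAMPLE_REPORT))
```

A grader could equally combine lines and branches into a single percentage; this sketch only shows where the numbers live in the report.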

Assertions, rubric, and timeout are unchanged — the agent must still discover mocking, test_client usage, validation paths, etc. via the skill. Applied to both eval.yaml and eval.vally.yaml.

Verification

Local skill-validator run after the change: 🔌 Skill activated (isolated): skills=code-testing-agent and 🔌 Skill activated (plugin): skills=code-testing-agent.

Co-authored-by: Copilot [email protected]

The 'Generate pytest tests for the Flask tasks API' scenario failed to activate code-testing-agent in BOTH isolated and plugin mode: its prompt enumerated, file by file, exactly what to mock/inject/test (TaskService with repo mocked + clock injected, queries.apply_query over fixed lists, both repositories, the blueprint via test_client), acting as an answer key that let the base agent generate tests directly with edit tools instead of routing to the skill's research-plan-implement pipeline.

Rewrite the prompt to a realistic, high-level ask (mirroring the ContosoUniversity scenario that does activate): describe the app at a layer level, keep the 'no tests yet', project-wide multi-file framing and the 80% coverage floor, and drop the per-module test checklist. Assertions, rubric and timeout are unchanged.

Verified locally that the skill now activates in both isolated and plugin mode.

Co-authored-by: Copilot <[email protected]>

Evangelink · 2026-06-22T15:25:41Z

/evaluate

Copilot

Pull request overview

This PR updates the Flask/pytest “Python polyglot” evaluation stimulus to restore code-testing-agent activation in both isolated and plugin runs by rewriting the scenario prompt to be higher-level and less “answer-key”-like.

Changes:

Rewrites the Flask tasks API scenario prompt in eval.yaml to remove overly prescriptive testing instructions while keeping the coverage/install/run expectations.
Applies the same prompt rewrite to the corresponding Vally evaluation in eval.vally.yaml.

Show a summary per file

File	Description
tests/dotnet-test/code-testing-agent/eval.yaml	Updates the Flask pytest scenario prompt to improve code-testing-agent activation while retaining coverage/run requirements.
tests/dotnet-test/code-testing-agent/eval.vally.yaml	Mirrors the same Flask prompt adjustments for the Vally evaluation stimulus.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 2

github-actions · 2026-06-22T15:42:44Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.3/5 → 3.0/5 🔴	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent, grep	🟡 0.25	❌ [1]
code-testing-agent	Generate pytest tests for the Flask tasks API (Python polyglot)	4.3/5 → 4.3/5	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, edit	🟡 0.25	✅ [2]
code-testing-agent	Generate Vitest tests for the shopping-cart library (TypeScript polyglot)	4.7/5 → 4.7/5	✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: skill	🟡 0.25	❌ [3]
code-testing-agent	Does not revert a gutted-looking workspace (workspace integrity)	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.25	✅ [4]

[1] ⚠️ High run-to-run variance (CV=105%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (1341684 → 2290878), time (369.2s → 502.1s), tool calls (62 → 79)
[2] ⚠️ High run-to-run variance (CV=105%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.1% due to: tokens (225259 → 304869)
[4] ⚠️ High run-to-run variance (CV=517%) — consider re-running with --runs 5
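For readers unfamiliar with the CV figures in these footnotes, the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The sketch below uses the sample standard deviation; the report does not say which metric the skill-validator measures or whether it uses the sample or population form, so treat both as assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return the sample standard deviation as a percentage of the mean."""
    avg = mean(samples)
    if avg == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(avg) * 100


if __name__ == "__main__":
    # Hypothetical per-run score deltas; a mean near zero inflates the CV.
    print(round(coefficient_of_variation([0.30, -0.25, 0.05]), 1))
```

Because the mean sits in the denominator, a metric whose runs average out close to zero can produce a very large CV even when the individual runs differ only slightly, which is one reason the footnotes suggest re-running with more runs.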

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 802 in dotnet/skills, download eval artifacts with gh run download 27963926402 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/96deb8bd5d5c03e787d099fc1b7a191390baae8a/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-22T17:00:05Z

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Review feedback: the prompt told the agent to run `python -m ...` but the grader (and the vally command) invoke `python3`. On Linux runners that may lack a `python` shim the agent could hit command-not-found. Align the prompt to `python3` in both eval.yaml and eval.vally.yaml.

Co-authored-by: Copilot <[email protected]>
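The missing-shim problem can be illustrated with a small lookup that prefers python3 and only then falls back to python. This is a sketch of the general technique, not something the PR adds; the helper name `resolve_python` is hypothetical, and the actual fix here was simply to change the prompt text.

```python
import shutil
import sys


def resolve_python(candidates=("python3", "python")):
    """Return the path of the first interpreter name found on PATH."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    # Nothing on PATH matched: fall back to the interpreter running this script.
    return sys.executable


if __name__ == "__main__":
    print(resolve_python())
```

On a runner that ships only python3, a hard-coded `python` invocation fails, whereas a lookup like this (or naming python3 explicitly, as the prompt now does) keeps the agent and the grader on the same interpreter.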

Evangelink · 2026-06-22T18:15:05Z

/evaluate

github-actions · 2026-06-22T18:33:41Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.3/5 → 3.0/5 🔴	✅ code-testing-agent; tools: skill, grep / ✅ code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, task, read_agent, grep, glob, read_bash	🟡 0.23	❌
code-testing-agent	Generate pytest tests for the Flask tasks API (Python polyglot)	4.0/5 → 4.3/5 🟢	✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: glob, skill	🟡 0.23	❌ [1]
code-testing-agent	Generate Vitest tests for the shopping-cart library (TypeScript polyglot)	4.7/5 → 5.0/5 🟢	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, task, edit, read_agent	🟡 0.23	❌ [2]
code-testing-agent	Does not revert a gutted-looking workspace (workspace integrity)	5.0/5 → 5.0/5	✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.23	❌ [3]

[1] ⚠️ High run-to-run variance (CV=110%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -7.5% due to: quality, tokens (220521 → 288452)
[2] ⚠️ High run-to-run variance (CV=52305%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -21.4% due to: judgment, quality, tokens (101598 → 121028)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 802 in dotnet/skills, download eval artifacts with gh run download 27974177013 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/038fc9253af951f0c24095a60d19622c4f92b491/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-23T07:34:32Z

✅ Evaluation passed for 038fc92. cc @dotnet/dotnet-testing — please review.

Copilot AI review requested due to automatic review settings June 22, 2026 15:25

Copilot started reviewing on behalf of Evangelink June 22, 2026 15:26

Copilot AI reviewed Jun 22, 2026

Comment thread tests/dotnet-test/code-testing-agent/eval.yaml Outdated

Comment thread tests/dotnet-test/code-testing-agent/eval.vally.yaml Outdated

github-actions Bot added the waiting-on-author (PR state label) label Jun 22, 2026

Evangelink enabled auto-merge (squash) June 23, 2026 07:25

github-actions Bot added the waiting-on-review (PR state label) label and removed the waiting-on-author (PR state label) label Jun 23, 2026

Evangelink mentioned this pull request Jun 23, 2026

code-testing-agent: fix workspace-integrity activation + stabilize Contoso rubric Evangelink/skills#2

Open

YuliiaKovalova approved these changes Jun 23, 2026

Evangelink merged commit 102663d into dotnet:main Jun 23, 2026
34 of 36 checks passed

Evangelink merged 2 commits into dotnet:main from Evangelink:fix-code-testing-agent-flask-activation

3 participants
