Fix code-testing-agent activation for the Flask pytest scenario #802

Merged
Evangelink merged 2 commits into dotnet:main from Evangelink:fix-code-testing-agent-flask-activation
Jun 23, 2026
@Evangelink

Member

What

Fixes plugin- and isolated-mode skill activation for the code-testing-agent scenario "Generate pytest tests for the Flask tasks API (Python polyglot)".

Root cause

The scenario prompt enumerated, layer by layer and class by class, exactly what to mock, inject, and test (TaskService with the repository mocked and the clock injected, queries.apply_query over fixed Task lists, both repositories, SqliteTaskRepository against an in-memory connection, the blueprint via test_client…). That level of detail acts as an answer key: the base agent just executed it directly with edit tools and never routed to the code-testing-agent research → plan → implement pipeline.
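To make the "answer key" concrete, here is a minimal sketch of the kind of test the old prompt effectively dictated: a service exercised with its repository mocked and its clock injected. The `TaskService` class below is a hypothetical stand-in written for illustration, not the scenario's real code, and its method names and task shape are assumptions.

```python
from datetime import datetime, timezone
from unittest.mock import Mock


class TaskService:
    """Hypothetical stand-in for the app's service layer (not the real code)."""

    def __init__(self, repository, clock):
        self._repository = repository
        self._clock = clock  # injected so tests can control "now"

    def create(self, title):
        cleaned = title.strip()
        if not cleaned:
            raise ValueError("title must not be empty")
        task = {"title": cleaned, "created_at": self._clock()}
        self._repository.add(task)
        return task


def test_create_stamps_task_with_injected_clock():
    # A fixed clock makes the timestamp deterministic.
    fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    repository = Mock()  # the repository is mocked, as the old prompt spelled out
    service = TaskService(repository, clock=lambda: fixed_now)

    task = service.create("  write tests  ")

    assert task == {"title": "write tests", "created_at": fixed_now}
    repository.add.assert_called_once_with(task)
```

When a prompt already names the seam to mock and the dependency to inject, little is left for a research or planning step to discover, which is consistent with the base agent writing such tests directly.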

Reproduced locally with the skill-validator: the original prompt yielded detected: [] in both isolated and plugin runs. The sibling ContosoUniversity scenario (same skill, concise high-level prompt) activates fine.

Fix

Rewrote the prompt to a realistic, high-level ask mirroring the ContosoUniversity scenario:

  • describes the app at a layer level (service / repository / query / blueprint) instead of a per-module test checklist,
  • keeps the no tests yet, project-wide multi-file framing,
  • keeps the 80% line+branch coverage floor and the pip install -e ".[test]" + pytest + coverage.xml expectations.
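As an illustration of the coverage expectation kept in the prompt, the sketch below reads the overall line and branch rates from a Cobertura-style coverage.xml, the format coverage.py writes. How the scenario's grader actually checks the floor is not shown in this PR; applying the 80% floor to line and branch rates separately is an assumption, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

COVERAGE_FLOOR = 0.80  # the 80% floor named in the prompt


def read_coverage_rates(xml_text):
    """Return (line_rate, branch_rate) from the report's root element."""
    root = ET.fromstring(xml_text)
    line_rate = float(root.get("line-rate", 0.0))
    branch_rate = float(root.get("branch-rate", 0.0))
    return line_rate, branch_rate


def meets_floor(xml_text, floor=COVERAGE_FLOOR):
    """Assumption: both rates must independently reach the floor."""
    line_rate, branch_rate = read_coverage_rates(xml_text)
    return line_rate >= floor and branch_rate >= floor


# Hypothetical report header, trimmed to the attributes used here.
SAMPLE_REPORT = '<coverage line-rate="0.91" branch-rate="0.84"></coverage>'
```

With the sample above, `meets_floor(SAMPLE_REPORT)` is true; lowering either rate below 0.80 makes it false.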

Assertions, rubric, and timeout are unchanged — the agent must still discover mocking, test_client usage, validation paths, etc. via the skill. Applied to both eval.yaml and eval.vally.yaml.

Verification

Local skill-validator run after the change: 🔌 Skill activated (isolated): skills=code-testing-agent and 🔌 Skill activated (plugin): skills=code-testing-agent.

Co-authored-by: Copilot [email protected]

The 'Generate pytest tests for the Flask tasks API' scenario failed to activate code-testing-agent in BOTH isolated and plugin mode: its prompt enumerated, file by file, exactly what to mock/inject/test (TaskService with repo mocked + clock injected, queries.apply_query over fixed lists, both repositories, the blueprint via test_client), acting as an answer key that let the base agent generate tests directly with edit tools instead of routing to the skill's research-plan-implement pipeline. Rewrite the prompt to a realistic, high-level ask (mirroring the ContosoUniversity scenario that does activate): describe the app at a layer level, keep the 'no tests yet', project-wide multi-file framing and the 80% coverage floor, and drop the per-module test checklist. Assertions, rubric and timeout are unchanged. Verified locally that the skill now activates in both isolated and plugin mode.

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings June 22, 2026 15:25
@Evangelink

Member Author

/evaluate

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Flask/pytest “Python polyglot” evaluation stimulus to restore code-testing-agent activation in both isolated and plugin runs by rewriting the scenario prompt to be higher-level and less “answer-key”-like.

Changes:

  • Rewrites the Flask tasks API scenario prompt in eval.yaml to remove overly prescriptive testing instructions while keeping the coverage/install/run expectations.
  • Applies the same prompt rewrite to the corresponding Vally evaluation in eval.vally.yaml.
Show a summary per file
| File | Description |
| --- | --- |
| tests/dotnet-test/code-testing-agent/eval.yaml | Updates the Flask pytest scenario prompt to improve code-testing-agent activation while retaining coverage/run requirements. |
| tests/dotnet-test/code-testing-agent/eval.vally.yaml | Mirrors the same Flask prompt adjustments for the Vally evaluation stimulus. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Comment thread tests/dotnet-test/code-testing-agent/eval.yaml Outdated
Comment thread tests/dotnet-test/code-testing-agent/eval.vally.yaml Outdated
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| code-testing-agent | Generate tests for ContosoUniversity ASP.NET Core MVC app | 3.3/5 → 3.0/5 🔴 | ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent, grep | 🟡 0.25 | [1] |
| code-testing-agent | Generate pytest tests for the Flask tasks API (Python polyglot) | 4.3/5 → 4.3/5 | ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, edit | 🟡 0.25 | [2] |
| code-testing-agent | Generate Vitest tests for the shopping-cart library (TypeScript polyglot) | 4.7/5 → 4.7/5 | ✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: skill | 🟡 0.25 | [3] |
| code-testing-agent | Does not revert a gutted-looking workspace (workspace integrity) | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.25 | [4] |

[1] ⚠️ High run-to-run variance (CV=105%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.5% due to: tokens (1341684 → 2290878), time (369.2s → 502.1s), tool calls (62 → 79)
[2] ⚠️ High run-to-run variance (CV=105%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.1% due to: tokens (225259 → 304869)
[4] ⚠️ High run-to-run variance (CV=517%) — consider re-running with --runs 5
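For readers unfamiliar with the CV figures in these footnotes, the sketch below computes a coefficient of variation under its usual definition: standard deviation divided by the mean, expressed as a percentage. Which per-run metric the skill-validator feeds into this, and whether it uses the sample or population standard deviation, is not stated here; the sample form and the input values are assumptions for illustration only.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return the CV of the samples as a percentage (sample std dev / |mean|)."""
    average = mean(samples)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(average) * 100.0


# Hypothetical per-run values; not taken from the results above.
steady_runs = [4.0, 4.0, 4.0]
noisy_runs = [1.0, 3.0]
```

Identical runs give a CV of 0%, while widely scattered runs, or a mean close to zero, push the CV past 100%. That is why the bot suggests re-running with `--runs 5` before reading much into a single comparison.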

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 802 in dotnet/skills, download eval artifacts with gh run download 27963926402 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/96deb8bd5d5c03e787d099fc1b7a191390baae8a/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions github-actions Bot added the waiting-on-author PR state label label Jun 22, 2026
@github-actions

Contributor

👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

Review feedback: the prompt told the agent to run `python -m ...` but the grader (and the vally command) invoke `python3`. On Linux runners that may lack a `python` shim the agent could hit command-not-found. Align the prompt to `python3` in both eval.yaml and eval.vally.yaml.

Co-authored-by: Copilot <[email protected]>
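The interpreter mismatch described in this commit can be illustrated with a small lookup that prefers `python3` and only then falls back to `python`. This is a sketch of the general idea, not code from the grader or the eval files; the function name and the fallback to the running interpreter are assumptions.

```python
import shutil
import sys


def resolve_python(candidates=("python3", "python")):
    """Return the path of the first interpreter name found on PATH.

    Falls back to the interpreter running this script when none of the
    candidate names resolve (for example, a runner without a `python` shim).
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return sys.executable
```

A prompt that hard-codes `python` assumes the shim exists, whereas naming `python3` matches what the grader invokes and avoids the lookup entirely.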
@Evangelink

Member Author

/evaluate

@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| code-testing-agent | Generate tests for ContosoUniversity ASP.NET Core MVC app | 3.3/5 → 3.0/5 🔴 | ✅ code-testing-agent; tools: skill, grep / ✅ code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, task, read_agent, grep, glob, read_bash | 🟡 0.23 | |
| code-testing-agent | Generate pytest tests for the Flask tasks API (Python polyglot) | 4.0/5 → 4.3/5 🟢 | ✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: glob, skill | 🟡 0.23 | [1] |
| code-testing-agent | Generate Vitest tests for the shopping-cart library (TypeScript polyglot) | 4.7/5 → 5.0/5 🟢 | ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; test-gap-analysis; assertion-quality; tools: skill, task, edit, read_agent | 🟡 0.23 | [2] |
| code-testing-agent | Does not revert a gutted-looking workspace (workspace integrity) | 5.0/5 → 5.0/5 | ✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.23 | [3] |

[1] ⚠️ High run-to-run variance (CV=110%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -7.5% due to: quality, tokens (220521 → 288452)
[2] ⚠️ High run-to-run variance (CV=52305%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=74%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -21.4% due to: judgment, quality, tokens (101598 → 121028)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 802 in dotnet/skills, download eval artifacts with gh run download 27974177013 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/038fc9253af951f0c24095a60d19622c4f92b491/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@Evangelink Evangelink enabled auto-merge (squash) June 23, 2026 07:25
@github-actions github-actions Bot added waiting-on-review PR state label and removed waiting-on-author PR state label labels Jun 23, 2026
@github-actions

Contributor

✅ Evaluation passed for 038fc92. cc @dotnet/dotnet-testing — please review.

@Evangelink Evangelink merged commit 102663d into dotnet:main Jun 23, 2026
34 of 36 checks passed
Labels

waiting-on-review PR state label

3 participants