Fix code-testing-agent activation for the Flask pytest scenario #802
The 'Generate pytest tests for the Flask tasks API' scenario failed to activate code-testing-agent in BOTH isolated and plugin mode: its prompt enumerated, file by file, exactly what to mock/inject/test (TaskService with repo mocked + clock injected, queries.apply_query over fixed lists, both repositories, the blueprint via test_client), acting as an answer key that let the base agent generate tests directly with edit tools instead of routing to the skill's research-plan-implement pipeline.
Rewrite the prompt to a realistic, high-level ask (mirroring the ContosoUniversity scenario that does activate): describe the app at a layer level, keep the 'no tests yet', project-wide multi-file framing and the 80% coverage floor, and drop the per-module test checklist. Assertions, rubric and timeout are unchanged.
Verified locally that the skill now activates in both isolated and plugin mode.
Co-authored-by: Copilot <[email protected]>
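For illustration, the kind of test the original prompt spelled out (a service with its repository mocked and its clock injected) might look like the sketch below. The `Task` and `TaskService` definitions are hypothetical stand-ins, not the sample app's actual code; only the idea of mocking the repository and injecting the clock comes from the commit message.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from unittest.mock import Mock


@dataclass
class Task:
    # Hypothetical model; the real app's Task may have different fields.
    id: int
    title: str
    created_at: datetime


class TaskService:
    # Hypothetical service: repository and clock are passed in so tests can replace them.
    def __init__(self, repo, clock):
        self._repo = repo
        self._clock = clock

    def create(self, title):
        # Reject blank titles before touching the repository.
        if not title.strip():
            raise ValueError("title must not be empty")
        task = Task(id=self._repo.next_id(), title=title.strip(), created_at=self._clock())
        self._repo.add(task)
        return task


def test_create_stamps_task_with_injected_clock():
    fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    repo = Mock()  # stands in for either repository implementation
    repo.next_id.return_value = 7
    service = TaskService(repo, clock=lambda: fixed_now)

    task = service.create("  write tests  ")

    # The timestamp is deterministic because the clock was injected.
    assert task == Task(id=7, title="write tests", created_at=fixed_now)
    repo.add.assert_called_once_with(task)
```

A prompt that already dictates this structure leaves little for a research and planning step to discover, which is consistent with the base agent writing the tests directly.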
/evaluate
Pull request overview
This PR updates the Flask/pytest “Python polyglot” evaluation stimulus to restore code-testing-agent activation in both isolated and plugin runs by rewriting the scenario prompt to be higher-level and less “answer-key”-like.
Changes:
- Rewrites the Flask tasks API scenario prompt in `eval.yaml` to remove overly prescriptive testing instructions while keeping the coverage/install/run expectations.
- Applies the same prompt rewrite to the corresponding Vally evaluation in `eval.vally.yaml`.
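As a rough sketch of how the retained 80% coverage floor could be checked against a `coverage.xml` report, the snippet below reads the overall `line-rate` attribute from a Cobertura-style document. The function names are hypothetical, and this is not the grader used by these evaluations.

```python
import xml.etree.ElementTree as ET


def line_coverage_percent(xml_text):
    """Read the overall line coverage from a Cobertura-style coverage.xml document."""
    root = ET.fromstring(xml_text)
    # The root <coverage> element carries line-rate as a fraction between 0 and 1.
    return float(root.attrib["line-rate"]) * 100.0


def meets_floor(xml_text, floor=80.0):
    """Return True when overall line coverage is at or above the floor (in percent)."""
    return line_coverage_percent(xml_text) >= floor
```

In practice the text would come from reading the `coverage.xml` file produced by the test run.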
Show a summary per file
| File | Description |
|---|---|
| tests/dotnet-test/code-testing-agent/eval.yaml | Updates the Flask pytest scenario prompt to improve code-testing-agent activation while retaining coverage/run requirements. |
| tests/dotnet-test/code-testing-agent/eval.vally.yaml | Mirrors the same Flask prompt adjustments for the Vally evaluation stimulus. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 2
Skill Validation Results
[1] Model: claude-opus-4.6 | Judge: claude-opus-4.6 🔍 Full Results - additional metrics and failure investigation steps
▶ Sessions Visualisation -- interactive replay of all evaluation sessions
👋 @Evangelink — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the |
Review feedback: the prompt told the agent to run `python -m ...` but the grader (and the vally command) invoke `python3`. On Linux runners that may lack a `python` shim the agent could hit command-not-found. Align the prompt to `python3` in both eval.yaml and eval.vally.yaml.
Co-authored-by: Copilot <[email protected]>
/evaluate
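The PR resolves the mismatch by changing the prompt text. For a script that has to cope with either command name, a minimal sketch (with hypothetical helper names, not part of this PR) could pick the first interpreter found on `PATH`:

```python
import shutil
import subprocess
import sys


def resolve_python(candidates=("python3", "python")):
    """Return the path of the first candidate found on PATH, else the running interpreter."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return sys.executable


def run_module(module, *args):
    """Run `<interpreter> -m <module> <args>` and return the completed process."""
    cmd = [resolve_python(), "-m", module, *args]
    return subprocess.run(cmd, capture_output=True, text=True)
```

Trying `python3` first matches what the grader and the vally command invoke, so the agent and the grader are more likely to end up on the same interpreter.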
Skill Validation Results
[1] Model: claude-opus-4.6 | Judge: claude-opus-4.6 🔍 Full Results - additional metrics and failure investigation steps
▶ Sessions Visualisation -- interactive replay of all evaluation sessions
✅ Evaluation passed for |
What
Fixes plugin- and isolated-mode skill activation for the code-testing-agent scenario "Generate pytest tests for the Flask tasks API (Python polyglot)".
Root cause
The scenario prompt enumerated, layer by layer and class by class, exactly what to mock, inject, and test (`TaskService` with the repository mocked and the clock injected, `queries.apply_query` over fixed `Task` lists, both repositories, `SqliteTaskRepository` against an in-memory connection, the blueprint via `test_client`…). That level of detail acts as an answer key: the base agent just executed it directly with `edit` tools and never routed to the `code-testing-agent` research → plan → implement pipeline.
Reproduced locally with the skill-validator: the original prompt yielded `detected: []` in both isolated and plugin runs. The sibling ContosoUniversity scenario (same skill, concise high-level prompt) activates fine.
Fix
Rewrote the prompt to a realistic, high-level ask mirroring the ContosoUniversity scenario:
- `pip install -e ".[test]"` + `pytest` + `coverage.xml` expectations.
Assertions, rubric, and timeout are unchanged; the agent must still discover mocking, `test_client` usage, validation paths, etc. via the skill. Applied to both `eval.yaml` and `eval.vally.yaml`.
Verification
Local skill-validator run after the change: `🔌 Skill activated (isolated): skills=code-testing-agent` and `🔌 Skill activated (plugin): skills=code-testing-agent`.
Co-authored-by: Copilot [email protected]
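To confirm activation in both modes without reading the log by eye, a small sketch like the following could scan validator output for the activation lines quoted above. The line format is taken from this description; the comma-separated handling of several skills and the function name are assumptions, not documented skill-validator behaviour.

```python
import re

# Matches lines such as:
#   "🔌 Skill activated (isolated): skills=code-testing-agent"
# Assumption: several skills, if present, are comma-separated with no spaces.
ACTIVATION_RE = re.compile(r"Skill activated \((?P<mode>\w+)\): skills=(?P<skills>[\w,-]+)")


def activated_modes(log_text, skill):
    """Return the set of modes in which `skill` appears in an activation line."""
    modes = set()
    for match in ACTIVATION_RE.finditer(log_text):
        skills = {s.strip() for s in match.group("skills").split(",")}
        if skill in skills:
            modes.add(match.group("mode"))
    return modes
```

Under this sketch, the fix counts as verified when the result for `code-testing-agent` contains both `isolated` and `plugin`. A log with no activation lines, such as the original `detected: []` output, yields an empty set.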