feat(skills): Add AXIS skill evals by dcramer · Pull Request #159 · getsentry/skills

dcramer · 2026-06-30T20:47:19Z

This standardizes skill evaluation on AXIS, using its built-in Codex adapter so evals run through the real codex exec --json harness instead of custom runner scripts.

The new maintainer path for skill-writer is:

npx @netlify/axis run --config skills/skill-writer/evals/axis.config.json
npx @netlify/axis run --config skills/skill-writer/evals/axis.config.json --scenario small-inline-workflow
npx @netlify/axis reports latest --config skills/skill-writer/evals/axis.config.json --html

Each AXIS scenario copies the repo into an isolated temp workspace, excludes local agent/report state, installs the local skill under test with skills: ["./.."], runs Codex, and captures transcript/judge/artifact evidence. Generated .axis/ report directories are now ignored so local eval runs do not pollute future diffs.
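To make the isolation step concrete, here is a minimal Python sketch of copying a repo into a throwaway workspace while skipping local state. It is not AXIS's implementation: `make_isolated_workspace` and the `EXCLUDED` names are hypothetical, and only `.axis` is taken from this PR.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed exclusion list. The PR only says local agent/report state is
# excluded; ".axis" comes from the PR, ".codex" is a guess for illustration.
EXCLUDED = (".axis", ".codex")


def make_isolated_workspace(repo_root: Path) -> Path:
    """Copy repo_root into a fresh temp directory, skipping EXCLUDED names."""
    parent = Path(tempfile.mkdtemp(prefix="skill-eval-"))
    workspace = parent / repo_root.name
    # ignore_patterns drops any file or directory whose name matches.
    shutil.copytree(repo_root, workspace, ignore=shutil.ignore_patterns(*EXCLUDED))
    return workspace
```

Because the copy lives under a new temp directory, anything the agent writes during a run stays out of the working tree.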

The initial skill-writer evals cover three behavior classes:

small-inline-workflow: catches simple skills being overbuilt with references, scripts, or broad source research.
reference-backed-integration: checks that complex/integration skills keep SKILL.md as a router and split optional depth into focused references.
iteration-from-bad-output: checks that skill-writer can improve an existing bloated skill by narrowing triggers and removing unnecessary files.

The runtime references/skill-evals.md teaches future skills the same pattern: add EVAL.md, evals/axis.config.json, and focused AXIS scenarios with observable judge checks. It explicitly keeps AXIS as the prescribed skill-eval framework and treats Promptfoo/custom Codex scripts as out of scope for this repo's skill evals.
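As a rough illustration of that layout, the sketch below reports which of the expected eval files a skill directory lacks. The helper `missing_eval_files` is hypothetical, and placing `EVAL.md` at the skill root and scenarios under `evals/scenarios/` is an assumption drawn from the paths mentioned in this PR, not a documented AXIS requirement.

```python
from pathlib import Path


def missing_eval_files(skill_dir: Path) -> list[str]:
    """Return the expected eval files that are absent from skill_dir."""
    # Assumed layout: EVAL.md at the skill root, config under evals/.
    expected = ["EVAL.md", "evals/axis.config.json"]
    missing = [rel for rel in expected if not (skill_dir / rel).is_file()]
    # Assume at least one scenario file lives in evals/scenarios/.
    if not list((skill_dir / "evals" / "scenarios").glob("*.json")):
        missing.append("evals/scenarios/*.json")
    return missing
```

An empty list means the skill has the three pieces in place; it says nothing about whether the scenarios themselves are useful.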

pr-writer gets a proof eval focused on actual behavior rather than eval plumbing:

npx @netlify/axis run --config skills/pr-writer/evals/axis.config.json --scenario concise-docs-pr

That scenario creates a docs-only skill change and verifies the generated PR title/body stay concise, use an appropriate conventional title, avoid template headings, and omit validation/tool boilerplate. The renamed concise-docs-pr case is meant to be a pattern for behavior-focused eval cases, not framework smoke tests.
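For a sense of what "observable" checks on a PR title and body could look like, here is a deterministic Python sketch. It is only an illustration: AXIS scores this scenario with its own judge, and the type list, heading list, and 150-word budget below are assumptions, not values from the scenario.

```python
import re

# Assumed conventional-title types; the PR does not enumerate them.
TITLE_RE = re.compile(
    r"^(feat|fix|docs|ref|chore|test|ci|build|perf)(\([\w-]+\))?!?: \S.*$"
)
# Assumed examples of template headings to reject.
TEMPLATE_HEADING_RE = re.compile(
    r"^#{1,6}\s*(summary|test plan|testing|changes|checklist)\b",
    re.IGNORECASE | re.MULTILINE,
)


def check_pr(title: str, body: str, max_words: int = 150) -> list[str]:
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if not TITLE_RE.match(title):
        problems.append("title is not a conventional title")
    if TEMPLATE_HEADING_RE.search(body):
        problems.append("body uses template headings")
    if len(body.split()) > max_words:
        problems.append("body exceeds word budget")
    return problems
```

Each check maps to something a reviewer can see in the output, which is the property the behavior-focused case is meant to demonstrate.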

I exercised the new paths locally: small-inline-workflow completed through AXIS/Codex with score 94, and concise-docs-pr completed with score 99. Structural validation also passed for both skill-writer and pr-writer.

Add AXIS as the prescribed skill-eval harness for skill-writer and pr-writer so maintainers can run Codex-backed scenarios with reports, artifacts, judge checks, and baselines.

Include skill-writer scenarios for inline skill creation, reference-backed skill creation, and iteration from bad output. Add a pr-writer behavior case for concise docs-only PR bodies and ignore generated AXIS report directories.

Co-Authored-By: GPT-5 Codex <[email protected]>

sentry-warden Bot reviewed Jun 30, 2026

Comment thread skills/skill-writer/evals/scenarios/iteration-from-bad-output.json

dcramer merged commit 5a64b36 into main Jun 30, 2026
14 checks passed

dcramer deleted the feat/axis-skill-evals branch June 30, 2026 20:52

dcramer merged 1 commit into main from feat/axis-skill-evals
