feat(skills): Add AXIS skill evals #159

Merged
dcramer merged 1 commit into main from feat/axis-skill-evals on Jun 30, 2026
dcramer commented on Jun 30, 2026

This standardizes skill evaluation on AXIS, using its built-in Codex adapter so evals run through the real codex exec --json harness instead of custom runner scripts.

The new maintainer path for skill-writer is:

npx @netlify/axis run --config skills/skill-writer/evals/axis.config.json
npx @netlify/axis run --config skills/skill-writer/evals/axis.config.json --scenario small-inline-workflow
npx @netlify/axis reports latest --config skills/skill-writer/evals/axis.config.json --html

Each AXIS scenario copies the repo into an isolated temp workspace, excludes local agent/report state, installs the local skill under test with skills: ["./.."], runs Codex, and captures transcript/judge/artifact evidence. Generated .axis/ report directories are now ignored so local eval runs do not pollute future diffs.

The initial skill-writer evals cover three behavior classes:

  • small-inline-workflow: catches simple skills being overbuilt with references, scripts, or broad source research.
  • reference-backed-integration: checks that complex/integration skills keep SKILL.md as a router and split optional depth into focused references.
  • iteration-from-bad-output: checks that skill-writer can improve an existing bloated skill by narrowing triggers and removing unnecessary files.
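The three scenarios above can be run one at a time and then summarized in a single report. A minimal sketch, assuming each scenario name listed above is accepted by `--scenario` the same way `small-inline-workflow` is; the `run` helper and the `DRY_RUN` switch are hypothetical conveniences, not part of AXIS:

```shell
#!/bin/sh
set -eu

CONFIG="skills/skill-writer/evals/axis.config.json"

# Hypothetical helper: prints the command unless DRY_RUN=0 is set,
# so the script can be inspected without invoking Codex.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run_skill_writer_evals() {
  # Scenario names are taken from the list above.
  for scenario in small-inline-workflow reference-backed-integration iteration-from-bad-output; do
    run npx @netlify/axis run --config "$CONFIG" --scenario "$scenario"
  done
  # Same report command as in the maintainer path.
  run npx @netlify/axis reports latest --config "$CONFIG" --html
}

run_skill_writer_evals
```

With the default `DRY_RUN=1` the script only prints the four commands; setting `DRY_RUN=0` executes them from the repo root.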

The runtime references/skill-evals.md teaches future skills the same pattern: add EVAL.md, evals/axis.config.json, and focused AXIS scenarios with observable judge checks. It explicitly keeps AXIS as the prescribed skill-eval framework and treats Promptfoo/custom Codex scripts as out of scope for this repo's skill evals.
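The pattern described above implies a small file layout per skill. As a rough sketch, a hypothetical `check_skill_evals` helper (not part of AXIS or this repo) could confirm that a skill directory has the pieces in place. The `evals/scenarios/*.json` location is an assumption based on the skill-writer scenario path in this PR:

```shell
#!/bin/sh
# Hypothetical helper: checks that a skill directory contains the eval
# files named in the pattern above. Returns 0 if all are present.
check_skill_evals() {
  skill_dir="$1"
  missing=0
  for path in EVAL.md evals/axis.config.json; do
    if [ ! -f "$skill_dir/$path" ]; then
      echo "missing: $skill_dir/$path"
      missing=1
    fi
  done
  # Assumption: scenarios live under evals/scenarios/ as JSON files.
  if ! ls "$skill_dir"/evals/scenarios/*.json >/dev/null 2>&1; then
    echo "missing: $skill_dir/evals/scenarios/*.json"
    missing=1
  fi
  return "$missing"
}
```

For example, `check_skill_evals skills/skill-writer` would print any missing paths and return a non-zero status if something is absent. It checks file presence only and says nothing about whether the scenarios have observable judge checks.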

pr-writer gets a proof eval focused on actual behavior rather than eval plumbing:

npx @netlify/axis run --config skills/pr-writer/evals/axis.config.json --scenario concise-docs-pr

That scenario creates a docs-only skill change and verifies that the generated PR title and body stay concise, use an appropriate conventional title, avoid template headings, and omit validation/tool boilerplate. The renamed concise-docs-pr case is meant to be a pattern for behavior-focused eval cases, not framework smoke tests.

I exercised the new paths locally: small-inline-workflow completed through AXIS/Codex with score 94, and concise-docs-pr completed with score 99. Structural validation also passed for both skill-writer and pr-writer.

Add AXIS as the prescribed skill-eval harness for skill-writer and pr-writer so maintainers can run Codex-backed scenarios with reports, artifacts, judge checks, and baselines.

Include skill-writer scenarios for inline skill creation, reference-backed skill creation, and iteration from bad output. Add a pr-writer behavior case for concise docs-only PR bodies and ignore generated AXIS report directories.

Co-Authored-By: GPT-5 Codex <[email protected]>
Comment thread skills/skill-writer/evals/scenarios/iteration-from-bad-output.json
dcramer merged commit 5a64b36 into main on Jun 30, 2026
14 checks passed
dcramer deleted the feat/axis-skill-evals branch on June 30, 2026 20:52