feat(skills): Add AXIS skill evals #159
Merged
Add AXIS as the prescribed skill-eval harness for skill-writer and pr-writer so maintainers can run Codex-backed scenarios with reports, artifacts, judge checks, and baselines.

Include skill-writer scenarios for inline skill creation, reference-backed skill creation, and iteration from bad output. Add a pr-writer behavior case for concise docs-only PR bodies and ignore generated AXIS report directories.

Co-Authored-By: GPT-5 Codex <[email protected]>
This standardizes skill evaluation on AXIS, using its built-in Codex adapter so evals run through the real `codex exec --json` harness instead of custom runner scripts.

The new maintainer path for `skill-writer` is:

Each AXIS scenario copies the repo into an isolated temp workspace, excludes local agent/report state, installs the local skill under test with `skills: ["./.."]`, runs Codex, and captures transcript/judge/artifact evidence. Generated `.axis/` report directories are now ignored so local eval runs do not pollute future diffs.

The initial `skill-writer` evals cover three behavior classes:

- `small-inline-workflow`: catches simple skills being overbuilt with references, scripts, or broad source research.
- `reference-backed-integration`: checks that complex/integration skills keep `SKILL.md` as a router and split optional depth into focused references.
- `iteration-from-bad-output`: checks that `skill-writer` can improve an existing bloated skill by narrowing triggers and removing unnecessary files.

The runtime `references/skill-evals.md` teaches future skills the same pattern: add `EVAL.md`, `evals/axis.config.json`, and focused AXIS scenarios with observable judge checks. It explicitly keeps AXIS as the prescribed skill-eval framework and treats Promptfoo/custom Codex scripts as out of scope for this repo's skill evals.

`pr-writer` gets a proof eval focused on actual behavior rather than eval plumbing:

That scenario creates a docs-only skill change and verifies the generated PR title/body stay concise, use an appropriate conventional title, avoid template headings, and omit validation/tool boilerplate. The renamed `concise-docs-pr` case is meant to be a pattern for behavior-focused eval cases, not framework smoke tests.

I exercised the new paths locally: `small-inline-workflow` completed through AXIS/Codex with score 94, and `concise-docs-pr` completed with score 99. Structural validation also passed for both `skill-writer` and `pr-writer`.
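
For reviewers who want a mental model of the isolated-workspace step described above, here is a minimal Python sketch of the general technique: copy the repo into a fresh temp directory while skipping local state. This is not AXIS's implementation. The function name `make_isolated_workspace` and the `EXCLUDED` patterns are hypothetical; `.axis` comes from the report directories mentioned in this PR, while `.codex` is only an assumed stand-in for local agent state.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical exclusion list. The patterns AXIS actually excludes are not
# stated in this PR; ".codex" in particular is an assumption for illustration.
EXCLUDED = (".axis", ".codex")


def make_isolated_workspace(repo_root: Path) -> Path:
    """Copy repo_root into a new temp directory, skipping excluded names."""
    # mkdtemp creates a unique directory, so parallel runs do not collide.
    workspace = Path(tempfile.mkdtemp(prefix="eval-workspace-"))
    target = workspace / repo_root.name
    # ignore_patterns matches directory and file names at every depth.
    shutil.copytree(repo_root, target, ignore=shutil.ignore_patterns(*EXCLUDED))
    return target
```

Calling `make_isolated_workspace(Path.cwd())` returns the path of the copy, so an eval run can modify files freely without touching the working tree or picking up earlier reports.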