evals: Go-native eval suites that gate CI#361
Open
erain wants to merge 1 commit into
Add an `evals/` package and a `glue eval` command: the feedback loop the v1.13.0 harness work lacked.
Define a Suite of Cases, run them against a *glue.Agent with a Runner, score each with Scorers, and gate CI on the aggregate pass rate via Report.Gate(minPassRate). Built-in scorers: Contains, NotContains, Regex, Equals, Command (runs a build/test and passes on exit 0; this is the coding gate), Judge (LLM-as-judge against a rubric), and a ScorerFunc adapter. A case passes when it ran cleanly and every scorer meets the suite threshold (default 0.5); scorer errors fail the case loudly rather than scoring zero.
Suites can be declarative JSON (SuiteSpec; unknown fields and scorer types are rejected so a typo is a loud error). `glue eval <suite.json>` builds the agent under test (optionally --coding, auto-approving permission) plus a separate tool-less judge agent so an LLM judge never grades its own work, runs the suite, prints a text or --json report, and exits non-zero when --min-pass-rate isn't met.
The evals package depends only on glue (no new external deps); tests use a scripted, no-network provider. Borrowed from Vercel's Eve framework. ADR-0019 + docs/evals.md + examples/evals/smoke.json.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
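The "typo is a loud error" behavior for declarative suites can be illustrated with a minimal sketch. It assumes the strictness is built on `encoding/json`'s `Decoder.DisallowUnknownFields` plus an explicit allow-list of scorer types; the struct shapes, JSON field names, lowercase scorer type spellings, and the `ParseSuite` helper are all hypothetical and may not match the real `evals` package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ScorerSpec, CaseSpec and SuiteSpec are hypothetical shapes for illustration;
// the real field names in the evals package may differ.
type ScorerSpec struct {
	Type  string `json:"type"`
	Value string `json:"value,omitempty"`
}

type CaseSpec struct {
	Name    string       `json:"name"`
	Prompt  string       `json:"prompt"`
	Scorers []ScorerSpec `json:"scorers"`
}

type SuiteSpec struct {
	Name      string     `json:"name"`
	Threshold float64    `json:"threshold,omitempty"`
	Cases     []CaseSpec `json:"cases"`
}

// knownScorers is the allow-list used by this sketch. The spellings are
// assumptions derived from the built-in scorer names in the description.
var knownScorers = map[string]bool{
	"contains": true, "not_contains": true, "regex": true,
	"equals": true, "command": true, "judge": true,
}

// ParseSuite (hypothetical) decodes a suite strictly: unknown JSON fields and
// unknown scorer types both return an error instead of being ignored.
func ParseSuite(data string) (*SuiteSpec, error) {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields() // applies to nested structs as well
	var spec SuiteSpec
	if err := dec.Decode(&spec); err != nil {
		return nil, fmt.Errorf("parse suite: %w", err)
	}
	for _, c := range spec.Cases {
		for _, s := range c.Scorers {
			if !knownScorers[s.Type] {
				return nil, fmt.Errorf("case %q: unknown scorer type %q", c.Name, s.Type)
			}
		}
	}
	return &spec, nil
}

func main() {
	// "promt" is a deliberate typo; a lenient decoder would silently drop it.
	_, err := ParseSuite(`{"name":"smoke","cases":[{"name":"c1","promt":"hi","scorers":[]}]}`)
	fmt.Println("field typo:", err)

	// "contain" is not in the allow-list, so the check is not silently skipped.
	_, err = ParseSuite(`{"name":"smoke","cases":[{"name":"c1","prompt":"hi","scorers":[{"type":"contain","value":"x"}]}]}`)
	fmt.Println("scorer typo:", err)
}
```

Rejecting at parse time means a misspelled field or scorer stops the run before any case executes, rather than producing a report with a check quietly missing.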
The setup review is complete, requirements met. Please consult for further assistance. 🤖 Posted by glue-review.
Closes #360. Second borrow from Vercel's Eve framework (sibling to tracing #358/#359).
Why
v1.13.0 shipped real loop-behavior changes (edit ladder, retry/overflow, compaction, guardrails) with no automated way to prove they helped or to catch a regression. This adds that feedback loop.
What
A new `evals/` package + a `glue eval` command.
- `Case` (prompt + per-case options + scorers) → `Suite` (cases + pass threshold) → `Runner` (one session per case against a `*glue.Agent`, optional parallelism) → `Report`. A case passes when it ran cleanly and every scorer meets the threshold (default 0.5). `Report.Gate(minPassRate)` returns an error → non-zero exit for CI.
- `Contains`, `NotContains`, `Regex`, `Equals`, `Command` (runs a build/test and passes on exit 0; this is the coding gate), `Judge` (LLM-as-judge against a rubric), `ScorerFunc`. Scorer errors fail the case loudly rather than silently scoring zero.
- `SuiteSpec` mirrors a suite as data; unknown fields and scorer types are rejected so a typo is a loud error, never a skipped check.
- `glue eval <suite.json> [--coding] [--model] [--judge-model] [--work] [--min-pass-rate] [--parallel] [--json] [--env]` builds the agent under test (auto-approving permission for unattended runs) plus a separate tool-less judge agent so an LLM judge never grades its own work, then runs + gates.
Design
`go test`; the JSON `SuiteSpec` covers data-file suites and powers the CLI. `evals` depends only on `glue` and adds no new external deps. The CLI is a thin front end.
Tests
- `evals/`: runner scoring + gating, threshold override, parallelism, validation, every scorer, judge (scripted no-network provider), spec parsing/rejection, and a guard test that keeps `examples/evals/smoke.json` valid.
- `cmd/glue/`: `glue eval` end-to-end: gate fails below threshold (exit 1), passes when relaxed, `--json` output, missing-arg usage.
- `go vet` + gofmt clean.
Safety note
Eval runs are unattended and auto-approve side-effecting tools (like `glue goal --yolo`); `--coding` suites and `command` scorers are meant for a feature branch / worktree / sandbox. Documented in docs/evals.md.
See ADR-0019 and examples/evals/smoke.json.
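For reviewers who want the pass rule and gate from the What section in concrete form, here is a minimal, self-contained sketch. It does not use the real `evals` API: `ScoreResult`, `CaseResult`, `Passed`, `PassRate`, the `Report` fields, and `ErrJudgeDown` are hypothetical names chosen for illustration, and only the rule itself (clean run, every scorer at or above the threshold, scorer errors failing the case, `Gate(minPassRate)` returning an error) comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrJudgeDown is a hypothetical scorer failure used in the demo below.
var ErrJudgeDown = errors.New("judge unavailable")

// ScoreResult is a hypothetical record of one scorer's outcome for a case.
type ScoreResult struct {
	Name  string
	Score float64 // expected to lie in [0, 1]
	Err   error   // non-nil when the scorer itself failed to run
}

// CaseResult is a hypothetical record of one case's run.
type CaseResult struct {
	Name   string
	RunErr error // non-nil when the agent session did not run cleanly
	Scores []ScoreResult
}

// Passed applies the described rule: the case ran cleanly and every scorer
// meets the threshold. A scorer error fails the case outright rather than
// being folded into a score of zero.
func (c CaseResult) Passed(threshold float64) bool {
	if c.RunErr != nil {
		return false
	}
	for _, s := range c.Scores {
		if s.Err != nil || s.Score < threshold {
			return false
		}
	}
	return true
}

// Report is a hypothetical aggregate over all case results.
type Report struct {
	Threshold float64
	Cases     []CaseResult
}

// PassRate returns the fraction of cases that passed; an empty report is 0.
func (r Report) PassRate() float64 {
	if len(r.Cases) == 0 {
		return 0
	}
	passed := 0
	for _, c := range r.Cases {
		if c.Passed(r.Threshold) {
			passed++
		}
	}
	return float64(passed) / float64(len(r.Cases))
}

// Gate returns an error when the pass rate is below minPassRate, so a CLI
// can map it to a non-zero exit code.
func (r Report) Gate(minPassRate float64) error {
	if rate := r.PassRate(); rate < minPassRate {
		return fmt.Errorf("pass rate %.2f below required %.2f", rate, minPassRate)
	}
	return nil
}

func main() {
	r := Report{Threshold: 0.5, Cases: []CaseResult{
		{Name: "ok", Scores: []ScoreResult{{Name: "contains", Score: 1}}},
		{Name: "scorer-broke", Scores: []ScoreResult{{Name: "judge", Err: ErrJudgeDown}}},
	}}
	fmt.Println("pass rate:", r.PassRate())
	fmt.Println("gate at 0.9:", r.Gate(0.9))
	fmt.Println("gate at 0.5:", r.Gate(0.5))
}
```

In this sketch a scorer error and a low score both fail the case, but keeping `Err` separate from `Score` lets a report show why a case failed, which is one way to read "loudly" in the description.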
🤖 Generated with Claude Code