evals: Go-native eval suites that gate CI #361

Open
erain wants to merge 1 commit into main from issue/360-evals

erain (Owner) commented Jun 18, 2026

Closes #360. Second borrow from Vercel's Eve framework (sibling to tracing #358/#359).

Why

v1.13.0 shipped real loop-behavior changes (edit ladder, retry/overflow, compaction, guardrails) with no automated way to prove they helped or to catch a regression. This adds that feedback loop.

What

A new evals/ package + a glue eval command.

  • Model. Case (prompt + per-case options + scorers) → Suite (cases + pass threshold) → Runner (one session per case against a *glue.Agent, optional parallelism) → Report. A case passes when it ran cleanly and every scorer meets the threshold (default 0.5). Report.Gate(minPassRate) returns an error → non-zero exit for CI.
  • Scorers. Contains, NotContains, Regex, Equals, Command (runs a build/test, passes on exit 0 — the coding gate), Judge (LLM-as-judge against a rubric), ScorerFunc. Scorer errors fail the case loudly rather than silently scoring zero.
  • Declarative JSON. SuiteSpec mirrors a suite as data; unknown fields and scorer types are rejected so a typo is a loud error, never a skipped check.
  • CLI. glue eval <suite.json> [--coding] [--model] [--judge-model] [--work] [--min-pass-rate] [--parallel] [--json] [--env] builds the agent under test (auto-approving permissions for unattended runs) and a separate tool-less judge agent, so an LLM judge never grades its own work. It then runs the suite and gates on the pass rate.
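
For reviewers who want the pass rule and the gate in one place, here is a minimal, self-contained sketch. It is not the code in evals/: the Scorer interface shape, casePasses, and gate are hypothetical names for illustration, and treating an empty run as a gate failure is an assumption of this sketch. Only the rules themselves (clean run, every scorer at or above the threshold, scorer errors fail the case, gate on the aggregate pass rate) come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Scorer grades one output in [0, 1]. Hypothetical shape, not the evals API.
type Scorer interface {
	Score(output string) (float64, error)
}

// ScorerFunc adapts a plain function to the Scorer interface.
type ScorerFunc func(output string) (float64, error)

func (f ScorerFunc) Score(output string) (float64, error) { return f(output) }

// Contains scores 1 when the output includes want, otherwise 0.
func Contains(want string) Scorer {
	return ScorerFunc(func(output string) (float64, error) {
		if strings.Contains(output, want) {
			return 1, nil
		}
		return 0, nil
	})
}

// casePasses applies the rule from the description: the case ran cleanly and
// every scorer meets the threshold. A scorer error fails the case and is
// returned to the caller instead of being treated as a score of zero.
func casePasses(output string, runErr error, scorers []Scorer, threshold float64) (bool, error) {
	if runErr != nil {
		return false, runErr
	}
	for _, s := range scorers {
		score, err := s.Score(output)
		if err != nil {
			return false, fmt.Errorf("scorer error: %w", err)
		}
		if score < threshold {
			return false, nil
		}
	}
	return true, nil
}

// gate returns an error when the pass rate is below minPassRate.
// Assumption of this sketch: a run with zero cases also fails the gate.
func gate(passed, total int, minPassRate float64) error {
	if total == 0 {
		return errors.New("no cases ran")
	}
	rate := float64(passed) / float64(total)
	if rate < minPassRate {
		return fmt.Errorf("pass rate %.2f below required %.2f (%d/%d)", rate, minPassRate, passed, total)
	}
	return nil
}

func main() {
	outputs := []string{"hello world", "goodbye"} // stand-ins for agent replies
	scorers := []Scorer{Contains("hello")}

	passed := 0
	for _, out := range outputs {
		ok, err := casePasses(out, nil, scorers, 0.5)
		if err != nil {
			fmt.Fprintln(os.Stderr, "case error:", err)
		}
		if ok {
			passed++
		}
	}
	if err := gate(passed, len(outputs), 0.5); err != nil {
		fmt.Fprintln(os.Stderr, "eval gate failed:", err)
		os.Exit(1) // a non-zero exit is what fails the CI job
	}
	fmt.Println("eval gate passed")
}
```

Returning the scorer error alongside the failed result is one way to keep a broken scorer distinguishable from a legitimately low score.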

Design

  • Go-native, not a DSL runtime — agents are compiled Go packages (ADR-0012), so suites are too and run under go test; the JSON SuiteSpec covers data-file suites and powers the CLI.
  • Library is binary-independent: evals depends only on glue and adds no new external deps. The CLI is a thin front end.
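
The "typo is a loud error" behavior of the JSON SuiteSpec can be sketched with the standard library alone. This is an illustration, not the actual SuiteSpec: the struct fields, JSON key names, and scorer type strings below are hypothetical. The sketch assumes the rejection is built on json.Decoder.DisallowUnknownFields plus an explicit check of the scorer type, which is one common way to get this behavior in Go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Hypothetical spec shapes; the real SuiteSpec fields may differ.
type scorerSpec struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

type caseSpec struct {
	Prompt  string       `json:"prompt"`
	Scorers []scorerSpec `json:"scorers"`
}

type suiteSpec struct {
	Name  string     `json:"name"`
	Cases []caseSpec `json:"cases"`
}

// knownScorers lists the type strings this sketch accepts (assumed names).
var knownScorers = map[string]bool{
	"contains":     true,
	"not_contains": true,
	"regex":        true,
	"equals":       true,
}

// parseSuite rejects unknown JSON fields at any nesting level and unknown
// scorer types, so a misspelled key or type cannot silently skip a check.
func parseSuite(data string) (*suiteSpec, error) {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()

	var s suiteSpec
	if err := dec.Decode(&s); err != nil {
		return nil, fmt.Errorf("parse suite: %w", err)
	}
	for i, c := range s.Cases {
		for _, sc := range c.Scorers {
			if !knownScorers[sc.Type] {
				return nil, fmt.Errorf("case %d: unknown scorer type %q", i, sc.Type)
			}
		}
	}
	return &s, nil
}

func main() {
	inputs := []string{
		// Valid suite.
		`{"name":"smoke","cases":[{"prompt":"say hi","scorers":[{"type":"contains","value":"hi"}]}]}`,
		// Misspelled field name ("promt").
		`{"name":"smoke","cases":[{"promt":"say hi"}]}`,
		// Misspelled scorer type ("contians").
		`{"name":"smoke","cases":[{"prompt":"say hi","scorers":[{"type":"contians","value":"hi"}]}]}`,
	}
	for _, in := range inputs {
		if _, err := parseSuite(in); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Println("accepted")
		}
	}
}
```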

Tests

  • evals/: runner scoring + gating, threshold override, parallelism, validation, every scorer, judge (scripted no-network provider), spec parsing/rejection, and a guard test that keeps examples/evals/smoke.json valid.
  • cmd/glue/: glue eval end-to-end — gate fails below threshold (exit 1), passes when relaxed, --json output, missing-arg usage.
  • Full suite + go vet + gofmt clean.

Safety note

Eval runs are unattended and auto-approve side-effecting tools (like glue goal --yolo); --coding suites and command scorers are meant for a feature branch / worktree / sandbox. Documented in docs/evals.md.
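
To make the sandbox advice concrete, here is a rough sketch of what a command-style scorer does: it executes an arbitrary command inside a working directory and maps exit status to a score. The function name commandScore, its signature, and the timeout are hypothetical and not taken from the evals package. The mapping (exit 0 passes, non-zero exit scores zero, failure to run is an error rather than a zero) follows the scorer rules in this PR description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// commandScore runs name with args inside dir and returns 1 on exit 0 and
// 0 on a non-zero exit. If the command cannot be started or times out, it
// returns an error so the case fails loudly instead of scoring zero.
func commandScore(dir string, timeout time.Duration, name string, args ...string) (float64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Dir = dir // an empty dir means the current working directory
	err := cmd.Run()
	if err == nil {
		return 1, nil
	}
	if ctx.Err() != nil {
		return 0, fmt.Errorf("run %s: %w", name, ctx.Err())
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return 0, nil // the command ran and reported failure
	}
	return 0, fmt.Errorf("run %s: %w", name, err)
}

func main() {
	// Run inside a throwaway directory, never the main checkout.
	dir, err := os.MkdirTemp("", "eval-work-")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create work dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	score, err := commandScore(dir, 30*time.Second, "go", "version")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scorer error:", err)
		return
	}
	fmt.Println("score:", score)
}
```

Whatever the command touches inside the working directory is really modified, which is why a worktree or sandbox is the safe place to point it.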

See ADR-0019 and examples/evals/smoke.json.

🤖 Generated with Claude Code

Add an `evals/` package and a `glue eval` command — the feedback loop the
v1.13.0 harness work lacked. Define a Suite of Cases, run them against a
*glue.Agent with a Runner, score each with Scorers, and gate CI on the
aggregate pass rate via Report.Gate(minPassRate).

Built-in scorers: Contains, NotContains, Regex, Equals, Command (runs a
build/test, passes on exit 0 — the coding gate), Judge (LLM-as-judge
against a rubric), and a ScorerFunc adapter. A case passes when it ran
cleanly and every scorer meets the suite threshold (default 0.5); scorer
errors fail the case loudly rather than scoring zero.

Suites can be declarative JSON (SuiteSpec — unknown fields and scorer
types rejected so a typo is a loud error). `glue eval <suite.json>` builds
the agent under test (optionally --coding, auto-approving permission) plus
a separate tool-less judge agent so an LLM judge never grades its own
work, runs the suite, prints a text or --json report, and exits non-zero
when --min-pass-rate isn't met.

The evals package depends only on glue (no new external deps); tests use a
scripted, no-network provider. Borrowed from Vercel's Eve framework.
ADR-0019 + docs/evals.md + examples/evals/smoke.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@github-actions

The setup review is complete, requirements met. Please consult for further assistance.


🤖 Posted by glue-review.

Development

Successfully merging this pull request may close these issues.

evals: Go-native eval suites that gate CI (Eve borrow)