evals: Go-native eval suites that gate CI by erain · Pull Request #361 · erain/glue

erain · 2026-06-18T15:09:50Z

Closes #360. Second borrow from Vercel's Eve framework (sibling to tracing #358/#359).

Why

v1.13.0 shipped real loop-behavior changes (edit ladder, retry/overflow, compaction, guardrails) with no automated way to prove they helped or to catch a regression. This adds that feedback loop.

What

A new evals/ package + a glue eval command.

Model. Case (prompt + per-case options + scorers) → Suite (cases + pass threshold) → Runner (one session per case against a *glue.Agent, optional parallelism) → Report. A case passes when it ran cleanly and every scorer meets the threshold (default 0.5). Report.Gate(minPassRate) returns an error when the aggregate pass rate falls below minPassRate, which the CLI turns into a non-zero exit for CI.
Scorers. Contains, NotContains, Regex, Equals, Command (runs a build/test, passes on exit 0 — the coding gate), Judge (LLM-as-judge against a rubric), ScorerFunc. Scorer errors fail the case loudly rather than silently scoring zero.
Declarative JSON. SuiteSpec mirrors a suite as data; unknown fields and scorer types are rejected so a typo is a loud error, never a skipped check.
CLI. glue eval <suite.json> [--coding] [--model] [--judge-model] [--work] [--min-pass-rate] [--parallel] [--json] [--env] builds the agent under test (auto-approving permission for unattended runs) plus a separate tool-less judge agent so an LLM judge never grades its own work, then runs + gates.

Design

Go-native, not a DSL runtime — agents are compiled Go packages (ADR-0012), so suites are too and run under go test; the JSON SuiteSpec covers data-file suites and powers the CLI.
Library is binary-independent — evals depends only on glue and adds no new external deps. The CLI is a thin front end.
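
One way to get the "typo is a loud error" behavior for a JSON suite file with only the standard library is json.Decoder.DisallowUnknownFields plus an explicit check of scorer types. The sketch below assumes that approach; the struct shapes, field names, and the knownScorers set are hypothetical and are not taken from the real SuiteSpec.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// These types are hypothetical shapes for illustration only.
type scorerSpec struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

type caseSpec struct {
	Prompt  string       `json:"prompt"`
	Scorers []scorerSpec `json:"scorers"`
}

type suiteSpec struct {
	Name  string     `json:"name"`
	Cases []caseSpec `json:"cases"`
}

// knownScorers is an illustrative subset of scorer type names.
var knownScorers = map[string]bool{"contains": true, "regex": true, "equals": true}

// parseSuite decodes a suite and rejects unknown fields and unknown
// scorer types instead of silently ignoring them.
func parseSuite(data string) (*suiteSpec, error) {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields() // a misspelled key becomes a decode error

	var s suiteSpec
	if err := dec.Decode(&s); err != nil {
		return nil, fmt.Errorf("parse suite: %w", err)
	}
	for i, c := range s.Cases {
		for _, sc := range c.Scorers {
			if !knownScorers[sc.Type] {
				return nil, fmt.Errorf("case %d: unknown scorer type %q", i, sc.Type)
			}
		}
	}
	return &s, nil
}

func main() {
	// "promt" is a deliberate typo; strict decoding reports it as an error.
	_, err := parseSuite(`{"name":"smoke","cases":[{"promt":"say hi"}]}`)
	fmt.Println("error:", err)
}
```

Without DisallowUnknownFields, encoding/json would drop the misspelled key and the case would run with an empty prompt, which is exactly the skipped-check failure mode the description wants to rule out.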

Tests

evals/: runner scoring + gating, threshold override, parallelism, validation, every scorer, judge (scripted no-network provider), spec parsing/rejection, and a guard test that keeps examples/evals/smoke.json valid.
cmd/glue/: glue eval end-to-end — gate fails below threshold (exit 1), passes when relaxed, --json output, missing-arg usage.
Full suite + go vet + gofmt clean.

Safety note

Eval runs are unattended and auto-approve side-effecting tools (like glue goal --yolo); --coding suites and command scorers are meant for a feature branch / worktree / sandbox. Documented in docs/evals.md.

See ADR-0019 and examples/evals/smoke.json.

🤖 Generated with Claude Code

Add an `evals/` package and a `glue eval` command: the feedback loop the v1.13.0 harness work lacked.

Define a Suite of Cases, run them against a *glue.Agent with a Runner, score each with Scorers, and gate CI on the aggregate pass rate via Report.Gate(minPassRate). Built-in scorers: Contains, NotContains, Regex, Equals, Command (runs a build/test, passes on exit 0; this is the coding gate), Judge (LLM-as-judge against a rubric), and a ScorerFunc adapter. A case passes when it ran cleanly and every scorer meets the suite threshold (default 0.5); scorer errors fail the case loudly rather than scoring zero.

Suites can be declarative JSON (SuiteSpec, with unknown fields and scorer types rejected so a typo is a loud error). `glue eval <suite.json>` builds the agent under test (optionally --coding, auto-approving permission) plus a separate tool-less judge agent so an LLM judge never grades its own work, runs the suite, prints a text or --json report, and exits non-zero when --min-pass-rate isn't met.

The evals package depends only on glue (no new external deps); tests use a scripted, no-network provider. Borrowed from Vercel's Eve framework. ADR-0019 + docs/evals.md + examples/evals/smoke.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

github-actions · 2026-06-18T15:12:53Z

The setup review is complete, requirements met. Please consult for further assistance.

🤖 Posted by glue-review.

erain wants to merge 1 commit into main from issue/360-evals
