Motivation
Second borrow from Vercel's Eve framework (sibling to the tracing item #358). Eve ships evals: scored test suites that run an agent against cases and gate deploys in CI. glue has no eval harness today, so there is no automated feedback loop proving a harness/prompt/model change actually improved behavior — exactly the loop the v1.13.0 harness work needs.
Scope
- New evals/ package, Go-native (agents are compiled, per ADR-0012): Case (prompt + per-case PromptOptions + scorers), Suite (cases + pass threshold), Runner (runs each case in its own session against a *glue.Agent, optional parallelism), Report (per-case + aggregate, pass rate), and Report.Gate(minPassRate) returning an error so CI can exit non-zero.
- Built-in deterministic scorers: Contains, NotContains, Regex, Equals, Command (runs a shell command in a dir and passes on exit 0; the coding-agent gate), and a ScorerFunc adapter.
- Judge scorer: LLM-as-judge grading a response against a rubric via a provider/agent, returning a 0..1 score.
- glue eval <suite.json> CLI: load a declarative suite, run it against a --provider/--model, print a report, exit non-zero when the gate fails.
- Tests with a scripted (no-network) provider, ADR, docs, CHANGELOG.
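
To make the scope above concrete, here is a rough sketch of how the pieces could fit together. None of this is the final API: every field name and signature is an assumption for illustration, the `ask` function is a hypothetical stand-in for a session against a *glue.Agent, per-case PromptOptions and parallelism are left out, and the pass threshold is read here as the minimum mean scorer score for a case to pass (the issue does not pin that down).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Scorer grades one response: 1.0 is a full pass, 0.0 a full fail.
// (Assumed shape; the real interface may also take context or the case.)
type Scorer interface {
	Score(response string) (float64, error)
}

// ScorerFunc adapts a plain function to the Scorer interface.
type ScorerFunc func(response string) (float64, error)

func (f ScorerFunc) Score(response string) (float64, error) { return f(response) }

func boolScore(ok bool) float64 {
	if ok {
		return 1
	}
	return 0
}

// Contains passes when the response includes substr.
func Contains(substr string) Scorer {
	return ScorerFunc(func(resp string) (float64, error) {
		return boolScore(strings.Contains(resp, substr)), nil
	})
}

// Regex passes when the response matches pattern; a bad pattern
// surfaces as a scorer error when the case runs.
func Regex(pattern string) Scorer {
	re, err := regexp.Compile(pattern)
	return ScorerFunc(func(resp string) (float64, error) {
		if err != nil {
			return 0, err
		}
		return boolScore(re.MatchString(resp)), nil
	})
}

type Case struct {
	Name    string
	Prompt  string
	Scorers []Scorer
}

type Suite struct {
	Cases     []Case
	Threshold float64 // assumed: minimum mean scorer score for a case to pass
}

type CaseResult struct {
	Name   string
	Score  float64
	Passed bool
	Err    error
}

type Report struct{ Results []CaseResult }

// PassRate is the fraction of cases that passed (0 for an empty report).
func (r Report) PassRate() float64 {
	if len(r.Results) == 0 {
		return 0
	}
	passed := 0
	for _, res := range r.Results {
		if res.Passed {
			passed++
		}
	}
	return float64(passed) / float64(len(r.Results))
}

// Gate returns an error when the pass rate is below minPassRate,
// so a CLI wrapper can turn it into a non-zero exit code.
func (r Report) Gate(minPassRate float64) error {
	if rate := r.PassRate(); rate < minPassRate {
		return fmt.Errorf("pass rate %.2f below gate %.2f", rate, minPassRate)
	}
	return nil
}

// Run executes each case sequentially and averages its scorer results.
func Run(s Suite, ask func(prompt string) (string, error)) Report {
	var rep Report
	for _, c := range s.Cases {
		res := CaseResult{Name: c.Name}
		resp, err := ask(c.Prompt)
		if err != nil {
			res.Err = err
			rep.Results = append(rep.Results, res)
			continue
		}
		var total float64
		for _, sc := range c.Scorers {
			v, err := sc.Score(resp)
			if err != nil {
				res.Err = err
				break
			}
			total += v
		}
		// A case with no scorers or a scorer error counts as failed.
		if res.Err == nil && len(c.Scorers) > 0 {
			res.Score = total / float64(len(c.Scorers))
			res.Passed = res.Score >= s.Threshold
		}
		rep.Results = append(rep.Results, res)
	}
	return rep
}

func main() {
	// Scripted, no-network stand-in for an agent: echoes the prompt.
	scripted := func(prompt string) (string, error) { return "echo: " + prompt, nil }
	suite := Suite{Threshold: 1.0, Cases: []Case{
		{Name: "greets", Prompt: "hello", Scorers: []Scorer{Contains("hello")}},
		{Name: "numeric", Prompt: "hello", Scorers: []Scorer{Regex(`\d+`)}},
	}}
	report := Run(suite, scripted)
	fmt.Printf("pass rate: %.2f\n", report.PassRate())
	if err := report.Gate(0.9); err != nil {
		fmt.Println("gate failed:", err)
	}
}
```

With the scripted echo agent, one of the two cases passes, so Gate(0.9) returns an error. Returning an error from Gate rather than calling os.Exit inside the package keeps the exit-code decision in the CLI and keeps the package easy to test.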
Non-goals: remote/deployed eval runs, dataset management, sandboxed isolation.
Tracker #110.