evals: Go-native eval suites that gate CI (Eve borrow) #360

## Motivation

This is the second borrow from Vercel's **Eve** framework (sibling to the tracing item #358). Eve ships **evals**: scored test suites that run an agent against cases and gate deploys in CI. glue has no eval harness today, so no automated feedback loop proves that a harness, prompt, or model change actually improved behavior. That is exactly the loop the v1.13.0 harness work needs.

## Scope

- New `evals/` package, Go-native (agents are compiled, per ADR-0012):
  - `Case` (prompt + per-case PromptOptions + scorers), `Suite` (cases + pass threshold), `Runner` (runs each case in its own session against a `*glue.Agent`, optional parallelism), `Report` (per-case + aggregate, pass rate), and `Report.Gate(minPassRate)` returning an error so CI can exit non-zero.
  - Built-in deterministic scorers: `Contains`, `NotContains`, `Regex`, `Equals`, `Command` (runs a shell command in a dir — pass on exit 0; the coding-agent gate), and a `ScorerFunc` adapter.
  - `Judge` scorer: LLM-as-judge grading a response against a rubric via a provider/agent, returning a 0..1 score.
- `glue eval <suite.json>` CLI: load a declarative suite, run it against a `--provider`/`--model`, print a report, exit non-zero when the gate fails.
- Tests with a scripted (no-network) provider, ADR, docs, CHANGELOG.

Non-goals: remote/deployed eval runs, dataset management, sandboxed isolation.

Tracker #110.
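For the declarative suite that `glue eval <suite.json>` would load, one possible shape is sketched below. The issue does not define a schema, so every field name here (`name`, `min_pass_rate`, `cases`, `prompt`, `scorers`, `type`, `value`) is hypothetical, as are `SuiteSpec`, `CaseSpec`, `ScorerSpec`, and `ParseSuite`. The sketch only shows strict decoding plus basic validation, using the standard library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// ScorerSpec names a built-in scorer and its argument (hypothetical schema).
type ScorerSpec struct {
	Type  string `json:"type"` // e.g. "contains", "regex", "equals", "command"
	Value string `json:"value"`
}

// CaseSpec is one prompt plus the scorers that grade its response.
type CaseSpec struct {
	Name    string       `json:"name"`
	Prompt  string       `json:"prompt"`
	Scorers []ScorerSpec `json:"scorers"`
}

// SuiteSpec is the top-level document: cases plus a pass threshold.
type SuiteSpec struct {
	Name        string     `json:"name"`
	MinPassRate float64    `json:"min_pass_rate"`
	Cases       []CaseSpec `json:"cases"`
}

// ParseSuite decodes and validates a suite, rejecting unknown fields so
// that a typo in a key fails loudly instead of being silently ignored.
func ParseSuite(data []byte) (SuiteSpec, error) {
	var s SuiteSpec
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&s); err != nil {
		return SuiteSpec{}, fmt.Errorf("parse suite: %w", err)
	}
	if s.MinPassRate < 0 || s.MinPassRate > 1 {
		return SuiteSpec{}, errors.New("min_pass_rate must be within 0..1")
	}
	if len(s.Cases) == 0 {
		return SuiteSpec{}, errors.New("suite has no cases")
	}
	for i, c := range s.Cases {
		if c.Prompt == "" {
			return SuiteSpec{}, fmt.Errorf("case %d has an empty prompt", i)
		}
	}
	return s, nil
}

func main() {
	raw := []byte(`{
	  "name": "smoke",
	  "min_pass_rate": 0.8,
	  "cases": [
	    {"name": "greets", "prompt": "say hi",
	     "scorers": [{"type": "contains", "value": "hi"}]}
	  ]
	}`)
	suite, err := ParseSuite(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d case(s), gate at %.2f\n", suite.Name, len(suite.Cases), suite.MinPassRate)
}
```

Mapping each `ScorerSpec` to a concrete scorer is left out on purpose, since it depends on the final `evals` package API.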
