evals: Go-native eval suites that gate CI (Eve borrow) #360

Description

@erain

Motivation

Second borrow from Vercel's Eve framework (sibling to the tracing item #358). Eve ships evals: scored test suites that run an agent against cases and gate deploys in CI. glue has no eval harness today, so no automated feedback loop shows whether a harness, prompt, or model change actually improved behavior. That is exactly the loop the v1.13.0 harness work needs.

Scope

  • New evals/ package, Go-native (agents are compiled, per ADR-0012):
    • Core types:
      • Case: prompt, per-case PromptOptions, and scorers.
      • Suite: cases plus a pass threshold.
      • Runner: runs each case in its own session against a *glue.Agent, with optional parallelism.
      • Report: per-case and aggregate results, including pass rate.
      • Report.Gate(minPassRate): returns an error so CI can exit non-zero.
    • Built-in deterministic scorers: Contains, NotContains, Regex, Equals, Command (runs a shell command in a directory and passes on exit code 0; this is the gate for coding agents), and a ScorerFunc adapter.
    • Judge scorer: LLM-as-judge grading a response against a rubric via a provider/agent, returning a 0..1 score.
  • glue eval <suite.json> CLI: load a declarative suite, run it against a --provider/--model, print a report, exit non-zero when the gate fails.
  • Tests with a scripted (no-network) provider, ADR, docs, CHANGELOG.
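
A rough sketch of how the pieces in this scope could fit together, to make the proposal concrete. None of these signatures are settled: the field names, the `Run` function (standing in for Runner), and the `func(string) string` agent stand-in (in place of `*glue.Agent`) are all assumptions for illustration, and the suite threshold is read here as a minimum mean score per case, which the final design may define differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Scorer grades one response: 1.0 is a full pass, 0.0 a fail.
// Hypothetical shape; the real interface is still to be designed.
type Scorer interface {
	Score(response string) float64
}

// ScorerFunc adapts a plain function to the Scorer interface.
type ScorerFunc func(response string) float64

func (f ScorerFunc) Score(response string) float64 { return f(response) }

func boolScore(ok bool) float64 {
	if ok {
		return 1
	}
	return 0
}

// Contains passes when the response includes sub.
func Contains(sub string) Scorer {
	return ScorerFunc(func(r string) float64 { return boolScore(strings.Contains(r, sub)) })
}

// Regex passes when the response matches pattern.
func Regex(pattern string) Scorer {
	re := regexp.MustCompile(pattern)
	return ScorerFunc(func(r string) float64 { return boolScore(re.MatchString(r)) })
}

// Case is one prompt plus the scorers applied to its response.
type Case struct {
	Name    string
	Prompt  string
	Scorers []Scorer
}

// Suite groups cases. Threshold is assumed to be the minimum mean
// scorer value a case needs in order to count as passed.
type Suite struct {
	Cases     []Case
	Threshold float64
}

// Report holds aggregate results only; per-case detail is omitted here.
type Report struct {
	Passed, Total int
}

func (r Report) PassRate() float64 {
	if r.Total == 0 {
		return 0
	}
	return float64(r.Passed) / float64(r.Total)
}

// Gate returns an error when the pass rate is below minPassRate.
func (r Report) Gate(minPassRate float64) error {
	if rate := r.PassRate(); rate < minPassRate {
		return fmt.Errorf("eval gate failed: pass rate %.2f < %.2f", rate, minPassRate)
	}
	return nil
}

// Run executes every case sequentially against the agent stand-in.
func Run(s Suite, agent func(prompt string) string) Report {
	rep := Report{Total: len(s.Cases)}
	for _, c := range s.Cases {
		resp := agent(c.Prompt)
		sum := 0.0
		for _, sc := range c.Scorers {
			sum += sc.Score(resp)
		}
		if len(c.Scorers) > 0 && sum/float64(len(c.Scorers)) >= s.Threshold {
			rep.Passed++
		}
	}
	return rep
}

func main() {
	// Scripted, no-network stand-in for an agent.
	scripted := func(prompt string) string { return "echo: " + prompt }
	suite := Suite{Threshold: 1.0, Cases: []Case{
		{Name: "greets", Prompt: "hello", Scorers: []Scorer{Contains("hello")}},
		{Name: "digits", Prompt: "no numbers", Scorers: []Scorer{Regex(`\d+`)}},
	}}
	rep := Run(suite, scripted)
	fmt.Printf("passed %d/%d\n", rep.Passed, rep.Total)
	if err := rep.Gate(0.75); err != nil {
		fmt.Println(err) // a real CLI would exit non-zero here
	}
}
```

Returning an `error` from `Gate`, rather than a bool, lets the CLI pass the failure message straight through to CI output before exiting non-zero.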

Non-goals: remote/deployed eval runs, dataset management, sandboxed isolation.

Tracker #110.
