evalgate

Decide whether an eval delta is a real regression or just sampling noise, and fail CI only when it is real.

A model eval that drops from 90.0% to 89.4% on a 1,000-example set looks like a regression, but on that sample size it is noise. Gating CI on the raw number makes the build flap; ignoring it lets real regressions through. evalgate runs the appropriate significance test and fails only when the candidate is significantly worse.

$ evalgate proportions \
    --baseline-score 0.900 --baseline-n 1000 \
    --candidate-score 0.894 --candidate-n 1000
verdict     worse, but within noise
difference  -0.0060
p-value     0.6232
alpha       0.05
# exit code 0 -> build passes

Install

$ pip install evalgate-cli                 # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/evalgate   # latest, available now

Depends only on the Python standard library plus typer and rich. No heavy dependencies.

Usage

From two aggregate accuracies

$ evalgate proportions \
    --baseline-score 0.90 --baseline-n 2000 \
    --candidate-score 0.87 --candidate-n 2000 \
    --alpha 0.05

Uses a two-proportion z-test on the accuracies and their sample sizes.
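
For intuition, the test can be sketched in a few lines of standard-library Python. This is an illustration, not evalgate's source: the function name `two_proportion_z` is hypothetical, and the pooled standard error and two-sided p-value are assumptions, since this README does not say which variant the tool uses.

```python
import math


def two_proportion_z(p_base: float, n_base: int, p_cand: float, n_cand: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for candidate minus baseline accuracy."""
    # Pooled accuracy under the null hypothesis that both models are equally accurate.
    # (Assumption: pooled variance; an unpooled variant is also common.)
    pooled = (p_base * n_base + p_cand * n_cand) / (n_base + n_cand)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_base + 1.0 / n_cand))
    if se == 0.0:
        # Both scores are exactly 0 or exactly 1: there is no variation to test.
        return 0.0, 1.0
    z = (p_cand - p_base) / se
    # Two-sided tail of a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z, p = two_proportion_z(0.90, 2000, 0.87, 2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A negative `z` means the candidate scored lower; the drop counts as significant when the p-value falls below `alpha`.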

From paired per-example results

When both models were evaluated on the same examples, a paired test is more powerful. Pass a CSV with per-example correctness columns (or prediction columns plus a truth column):

$ evalgate paired results.csv --baseline base_correct --candidate cand_correct
$ evalgate paired results.csv --baseline pred_a --candidate pred_b --truth label

Uses McNemar's test: an exact binomial test on the discordant pairs, or the continuity-corrected chi-squared approximation for large samples.
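
The two branches described above can be sketched as follows. This is a minimal illustration rather than evalgate's implementation: `mcnemar_p_value` is a hypothetical name, and the cutoff of 25 discordant pairs for switching to the approximation is an assumption, since the README only says "large samples".

```python
import math


def mcnemar_p_value(baseline_only: int, candidate_only: int, exact_below: int = 25) -> float:
    """Two-sided McNemar p-value from the two discordant-pair counts."""
    n = baseline_only + candidate_only
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    if n < exact_below:  # assumed cutoff, not taken from evalgate
        # Exact test: under the null, each discordant pair favours either model
        # with probability 1/2, so the smaller count follows Binomial(n, 0.5).
        k = min(baseline_only, candidate_only)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2.0 * tail)
    # Continuity-corrected chi-squared statistic with one degree of freedom.
    chi2 = max(abs(baseline_only - candidate_only) - 1, 0) ** 2 / n
    # The survival function of chi-squared(1) is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# 10 examples only the baseline got right, 2 only the candidate got right.
print(mcnemar_p_value(10, 2))
```

Only the discordant pairs enter the test; examples that both models get right or both get wrong carry no information about which model is better.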

In CI

- run: evalgate proportions --baseline-score 0.90 --baseline-n 2000
        --candidate-score "$SCORE" --candidate-n 2000

Verdicts and exit codes

| Verdict | Meaning | Exit |
| --- | --- | --- |
| `improvement` | Candidate is better | 0 |
| `unchanged` | No measurable difference | 0 |
| `noise` | Worse, but not significant at `alpha` | 0 |
| `regression` | Significantly worse | 1 |

A bad invocation (scores out of range, missing column, unreadable file) exits 2.
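
The verdict table can be read as a small decision rule. The sketch below is one plausible reading, not evalgate's code: `classify` is a hypothetical name, and it assumes that any positive difference counts as `improvement`, that `unchanged` means a difference of exactly zero, and that significance means `p_value < alpha` (strictly).

```python
def classify(difference: float, p_value: float, alpha: float = 0.05) -> tuple[str, int]:
    """Map a signed difference and p-value to (verdict, exit code)."""
    if difference > 0:
        return "improvement", 0  # assumption: no significance check for gains
    if difference == 0:
        return "unchanged", 0
    # Candidate scored lower: fail the build only when the drop is significant.
    if p_value < alpha:
        return "regression", 1
    return "noise", 0


print(classify(-0.0060, 0.6232))
```

Only `regression` maps to a non-zero exit code, so the build passes on every other verdict.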

JSON output

With the --json flag, evalgate outputs a JSON object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `verdict` | `string` | One of `improvement`, `unchanged`, `noise`, or `regression`. |
| `p_value` | `float` | The p-value from the statistical test. |
| `difference` | `float` | Signed difference between candidate and baseline. A negative value means the candidate performed worse. |
| `alpha` | `float` | Significance threshold used for the test. |
| `is_regression` | `boolean` | `true` if the result is a statistically significant regression; otherwise `false`. |
| `test` | `string` | Statistical test used (`two_proportion_z` or `mcnemar`). |
| `baseline_only_correct` | `integer` | (Paired mode only) Number of examples only the baseline answered correctly. |
| `candidate_only_correct` | `integer` | (Paired mode only) Number of examples only the candidate answered correctly. |
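
A downstream script might consume these fields along the following lines. The payload here is written by hand from the field list above, reusing the numbers from the first example in this README; it is not captured tool output, and `summarize` is a hypothetical helper.

```python
import json

# Hand-written payload built from the documented fields; not captured tool output.
raw = """
{"verdict": "noise", "p_value": 0.6232, "difference": -0.006,
 "alpha": 0.05, "is_regression": false, "test": "two_proportion_z"}
"""


def summarize(payload: str) -> str:
    """Turn an evalgate-style JSON object into a one-line summary."""
    result = json.loads(payload)
    line = (
        f"{result['verdict']}: diff={result['difference']:+.4f}, "
        f"p={result['p_value']:.4f} ({result['test']})"
    )
    # The paired-mode fields are documented as paired-only, so read them defensively.
    if "baseline_only_correct" in result:
        line += (
            f", baseline-only={result['baseline_only_correct']}"
            f", candidate-only={result['candidate_only_correct']}"
        )
    return line


print(summarize(raw))
```

Branching on `is_regression` or `verdict` rather than re-deriving the decision from `p_value` and `alpha` keeps the script consistent with the exit code.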

What it does and does not do

It answers one question: is this difference larger than sampling variation? It does not correct for multiple comparisons across many evals, and a noise verdict means "not proven", not "proven equal". For small evals, collect more examples rather than trusting a borderline p-value.
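
To get a feel for how many examples are "enough", the sketch below estimates the smallest accuracy drop that a two-sided z-test would just flag at a given sample size. It is a rough approximation under stated assumptions (equal sample sizes, the baseline variance used for both models), and `smallest_significant_drop` is a hypothetical helper, not part of evalgate.

```python
import math
from statistics import NormalDist


def smallest_significant_drop(baseline_score: float, n: int, alpha: float = 0.05) -> float:
    """Approximate accuracy drop at which a two-sided z-test just reaches alpha."""
    # Critical value of the standard normal for a two-sided test.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Approximation: use the baseline variance for both groups of n examples.
    se = math.sqrt(2.0 * baseline_score * (1.0 - baseline_score) / n)
    return z_crit * se


for n in (1000, 2000, 10000):
    print(n, round(smallest_significant_drop(0.90, n), 4))
```

At a 90% baseline with 1,000 examples the threshold is on the order of two to three accuracy points, which is why a 0.6-point drop on that set cannot be told apart from noise.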

License

MIT. See LICENSE.
