Decide whether an eval delta is a real regression or just sampling noise, and fail CI only when it is real.
A model eval that drops from 90.0% to 89.4% on a 1,000-example set looks like a
regression, but on that sample size it is noise. Gating CI on the raw number
makes the build flap; ignoring it lets real regressions through. evalgate
runs the appropriate significance test and fails only when the candidate is
significantly worse.
$ evalgate proportions \
--baseline-score 0.900 --baseline-n 1000 \
--candidate-score 0.894 --candidate-n 1000
verdict worse, but within noise
difference -0.0060
p-value 0.6232
alpha 0.05
# exit code 0 -> build passes

$ pip install evalgate-cli # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/evalgate # latest, available now

Pure standard library plus typer and rich. No heavy dependencies.
$ evalgate proportions \
--baseline-score 0.90 --baseline-n 2000 \
--candidate-score 0.87 --candidate-n 2000 \
--alpha 0.05

Uses a two-proportion z-test on the accuracies and their sample sizes.
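As a rough illustration of what a two-proportion z-test computes, here is a minimal sketch using only the standard library. It assumes a pooled standard error and a two-sided p-value; the source does not say whether evalgate pools, corrects for continuity, or tests one-sided, so the numbers it prints may differ. The function name `two_proportion_z` is made up for this example and is not evalgate's API.

```python
import math


def two_proportion_z(p_base, n_base, p_cand, n_cand):
    """Return (difference, z, two-sided p-value) for two accuracies.

    Assumption: pooled standard error, two-sided test, no continuity correction.
    """
    diff = p_cand - p_base
    # Pooled accuracy under the null hypothesis that both models are equal.
    pooled = (p_base * n_base + p_cand * n_cand) / (n_base + n_cand)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_cand))
    if se == 0:
        # Both scores are exactly 0 or exactly 1: no variation to test.
        return diff, 0.0, 1.0
    z = diff / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


diff, z, p = two_proportion_z(0.90, 2000, 0.87, 2000)
print(f"difference {diff:+.4f}  z {z:.3f}  p-value {p:.4f}")
```

With the inputs from the command above (0.90 vs 0.87 on 2,000 examples each), this sketch gives a p-value well below 0.05, so a three-point drop at that sample size would count as significant under these assumptions.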
When both models were evaluated on the same examples, a paired test is more powerful. Give a CSV with per-example correctness (or predictions plus a truth column):
$ evalgate paired results.csv --baseline base_correct --candidate cand_correct
$ evalgate paired results.csv --baseline pred_a --candidate pred_b --truth label

Uses McNemar's test: an exact binomial test on the discordant pairs, or the continuity-corrected chi-squared approximation for large samples.
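To make the two variants of McNemar's test concrete, here is a minimal standard-library sketch. It assumes two-sided p-values, and it does not try to reproduce the threshold at which evalgate switches from the exact test to the approximation, since the source does not state it. All function names are hypothetical and not part of evalgate.

```python
import math


def discordant_counts(base_correct, cand_correct):
    """Count examples that exactly one of the two models got right."""
    b = sum(1 for x, y in zip(base_correct, cand_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, cand_correct) if y and not x)
    return b, c  # (baseline-only correct, candidate-only correct)


def mcnemar_exact(b, c):
    """Two-sided exact binomial p-value on the discordant pairs."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_chi2(b, c):
    """Continuity-corrected chi-squared approximation (1 degree of freedom)."""
    n = b + c
    if n == 0:
        return 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / n
    # Survival function of chi-squared with 1 df: erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


b, c = discordant_counts(
    [True, True, False, True, False],
    [True, False, False, True, True],
)
print(b, c, mcnemar_exact(b, c), mcnemar_chi2(b, c))
```

Only the discordant pairs enter the test. Examples that both models get right or both get wrong carry no information about which model is better, which is why the paired test can detect smaller differences than the unpaired one on the same data.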
- run: evalgate proportions --baseline-score 0.90 --baseline-n 2000
    --candidate-score "$SCORE" --candidate-n 2000

| Verdict | Meaning | Exit |
|---|---|---|
| improvement | Candidate is better | 0 |
| unchanged | No measurable difference | 0 |
| noise | Worse, but not significant at alpha | 0 |
| regression | Significantly worse | 1 |

A bad invocation (scores out of range, missing column, unreadable file) exits 2.
When using the --json flag, evalgate returns a JSON object with the following fields:
| Field | Type | Description |
|---|---|---|
| verdict | string | One of improvement, unchanged, noise, or regression. |
| p_value | float | The p-value from the statistical test. |
| difference | float | Signed difference between candidate and baseline. A negative value means the candidate performed worse. |
| alpha | float | Significance threshold used for the test. |
| is_regression | boolean | true if the result is a statistically significant regression; otherwise false. |
| test | string | Statistical test used (two_proportion_z or mcnemar). |
| baseline_only_correct | integer | (Paired mode only) Number of examples only the baseline answered correctly. |
| candidate_only_correct | integer | (Paired mode only) Number of examples only the candidate answered correctly. |

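A downstream script can consume these fields directly. The sketch below parses a report and builds a one-line summary; the `sample` payload is hand-written for illustration from the field list above, not captured evalgate output, and `summarize_report` is a hypothetical helper.

```python
import json


def summarize_report(report_text):
    """Return (fail_build, summary) from a JSON report with the fields above."""
    report = json.loads(report_text)
    fail_build = bool(report["is_regression"])
    summary = (
        f"{report['verdict']}: difference {report['difference']:+.4f}, "
        f"p={report['p_value']:.4f} "
        f"(alpha {report['alpha']}, test {report['test']})"
    )
    # The discordant-pair counts are documented as paired-mode only.
    if "baseline_only_correct" in report and "candidate_only_correct" in report:
        summary += (
            f", discordant pairs {report['baseline_only_correct']}"
            f"/{report['candidate_only_correct']}"
        )
    return fail_build, summary


# Illustrative payload only; values are made up.
sample = json.dumps({
    "verdict": "regression",
    "p_value": 0.0029,
    "difference": -0.03,
    "alpha": 0.05,
    "is_regression": True,
    "test": "two_proportion_z",
})

fail_build, summary = summarize_report(sample)
print(fail_build, summary)
```

Branching on `is_regression` rather than on the `verdict` string keeps the check aligned with the documented meaning of a failing result.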
evalgate answers one question: is this difference larger than sampling variation?
It does not correct for multiple comparisons across many evals, and a
noise verdict means "not proven", not "proven equal". For small evals,
collect more examples rather than trusting a borderline p-value.
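If several evals gate the same build, one common workaround is to tighten the threshold yourself and pass it through `--alpha`. The sketch below uses a Bonferroni correction, which is a conservative choice made for this example and not something evalgate applies; `bonferroni_alpha` is a hypothetical helper.

```python
def bonferroni_alpha(alpha, num_evals):
    """Per-eval threshold that keeps the overall false-alarm rate at or below alpha."""
    if num_evals < 1:
        raise ValueError("num_evals must be at least 1")
    return alpha / num_evals


# Five evals gated at an overall 5% false-alarm rate.
per_eval = bonferroni_alpha(0.05, 5)
print(f"--alpha {per_eval:.4f}")
```

Bonferroni trades power for safety: each individual eval needs a larger drop before it fails the build, so it pairs naturally with the advice to collect more examples.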
MIT. See LICENSE.