Pull request overview
Adds a v1 benchmarking “P50 evaluation harness” that can score uplift methods’ P50 estimates against injected synthetic ground truth, including a short-campaign (campaign-length) sweep and side-by-side comparison outputs.
Changes:
- Introduces a new `benchmarking.harness` package: replicate ensemble generation, campaign windows, scoring orchestrator, metrics, leaderboard summarization, and campaign-length plotting.
- Extends synthetic toggling with an optional `ToggleSchedule.start` baseline and exposes a public `treated_mask()` helper to unify treated-row selection across generator/methods/scoring.
- Adds comprehensive unit tests for harness components plus an end-to-end slow test on real Hill of Towie data.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/benchmarking/synthetic/test_public_api.py | Verifies synthetic package exports now include treated_mask. |
| tests/benchmarking/synthetic/test_generator.py | Adds tests for toggle baseline (start) and treated_mask inference. |
| tests/benchmarking/harness/test_scoring.py | Validates scoring output shape, correctness, and fairness guarantees. |
| tests/benchmarking/harness/test_replicates.py | Tests replicate ensemble construction and determinism. |
| tests/benchmarking/harness/test_public_api.py | Ensures harness package root exports intended entry points. |
| tests/benchmarking/harness/test_plots.py | Tests campaign-curve plotting and file saving. |
| tests/benchmarking/harness/test_metrics.py | Tests bias/spread/RMSE metric calculations. |
| tests/benchmarking/harness/test_method.py | Tests the method protocol seam (input/output). |
| tests/benchmarking/harness/test_leaderboard.py | Tests leaderboard aggregation from tidy scoring results. |
| tests/benchmarking/harness/test_hot_end_to_end.py | Slow end-to-end harness test on real HoT data. |
| tests/benchmarking/harness/test_campaign.py | Tests campaign window construction and masking logic. |
| tests/benchmarking/harness/stubs.py | Stub methods (oracle/biased/noisy/recording) for harness tests. |
| tests/benchmarking/harness/\_\_init\_\_.py | Adds harness test package init. |
| benchmarking/synthetic/generator.py | Adds ToggleSchedule.start semantics + public treated_mask. |
| benchmarking/synthetic/\_\_init\_\_.py | Re-exports treated_mask from synthetic package root. |
| benchmarking/harness/scoring.py | New scoring orchestrator (replicates × campaign sweep × methods). |
| benchmarking/harness/replicates.py | New replicate ensemble builder and StudyConfig. |
| benchmarking/harness/plots.py | New campaign-length curve plotting utility. |
| benchmarking/harness/metrics.py | New bias/spread/RMSE summary metrics. |
| benchmarking/harness/method.py | New method protocol + input/output dataclasses. |
| benchmarking/harness/leaderboard.py | New leaderboard aggregation over scoring results. |
| benchmarking/harness/campaign.py | New campaign windows and derived masks. |
| benchmarking/harness/\_\_init\_\_.py | New harness public API exports and module docstring. |
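
To make the bias/spread/RMSE metrics listed for `benchmarking/harness/metrics.py` concrete, here is a minimal sketch of what such a summary might compute. The function name `summarize_errors` and its signature are hypothetical and not taken from the PR. It also assumes that "spread" means the population standard deviation of per-replicate errors, which may differ from the PR's definition.

```python
import math
from statistics import fmean, pstdev


def summarize_errors(estimates, truth):
    """Hypothetical helper: summarize P50 estimates against one injected truth.

    estimates: per-replicate P50 uplift estimates from a single method.
    truth: the known injected uplift shared by all replicates.
    """
    errors = [estimate - truth for estimate in estimates]
    bias = fmean(errors)  # systematic offset (accuracy)
    spread = pstdev(errors)  # replicate-to-replicate scatter (precision)
    rmse = math.sqrt(fmean(err**2 for err in errors))  # combined error
    return {"bias": bias, "spread": spread, "rmse": rmse}
```

With the population standard deviation, `rmse**2` equals `bias**2 + spread**2`, so the three numbers separate a systematic offset from run-to-run noise.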
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Issue 2 — P50 evaluation harness & scoring (WS1)
Goal: score any uplift method's P50 estimate against the known injected truth,
including a short-campaign robustness sweep.
Scope
overall and (where applicable) per condition.
quantify how accuracy and precision degrade with less data.
Done when: given a method and a synthetic dataset, the harness emits comparable accuracy/precision numbers and a
campaign-length curve.
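
A campaign-length curve of the kind described in the "Done when" criterion could be produced along the following lines. This is only a sketch under stated assumptions: `toy_method`, `campaign_length_curve`, the Gaussian synthetic data, and the multiplicative uplift injection are all hypothetical stand-ins, not the harness's actual API or the real synthetic generator.

```python
import random
from statistics import fmean, median, pstdev


def toy_method(treated, control):
    """Hypothetical uplift method: relative difference of medians."""
    return median(treated) / median(control) - 1.0


def campaign_length_curve(lengths, true_uplift=0.02, n_replicates=200, seed=0):
    """Score the toy method at several campaign lengths (samples per arm).

    For each length, draw n_replicates synthetic campaigns with a known
    injected uplift and record the bias and spread of the estimate errors.
    """
    rng = random.Random(seed)  # fixed seed keeps the sweep deterministic
    rows = []
    for n in lengths:
        errors = []
        for _ in range(n_replicates):
            control = [rng.gauss(100.0, 10.0) for _ in range(n)]
            # Inject the known uplift multiplicatively into the treated arm.
            treated = [rng.gauss(100.0, 10.0) * (1.0 + true_uplift) for _ in range(n)]
            errors.append(toy_method(treated, control) - true_uplift)
        rows.append({"n_samples": n, "bias": fmean(errors), "spread": pstdev(errors)})
    return rows


if __name__ == "__main__":
    for row in campaign_length_curve([20, 200, 2000]):
        print(row)
```

Plotting `spread` against `n_samples` from these rows gives one way to read off how precision degrades as the campaign gets shorter.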