V1 issue2 by aclerc · Pull Request #104 · resgroup/wind-up

aclerc · 2026-06-22T14:11:18Z

Issue 2 — P50 evaluation harness & scoring (WS1)

Goal: score any uplift method's P50 estimate against the known injected truth,
including a short-campaign robustness sweep.

Scope

- Accuracy (bias) and precision (spread) metrics: recovered uplift vs. injected uplift, overall and (where applicable) per condition.
- Short-campaign sweep: re-score as a function of campaign length / data volume to quantify how accuracy and precision degrade with less data.
- A simple results format / leaderboard so methods can be compared side by side.
- P50 only: no uncertainty/P95 scoring in this phase.

Done when: given a method and a synthetic dataset, the harness emits comparable accuracy/precision numbers and a
campaign-length curve.
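
The accuracy and precision metrics in the scope can be illustrated with a minimal sketch. It assumes each replicate yields one recovered P50 uplift and that the injected uplift is a single known value. `summarise_errors` and its return keys are hypothetical names for illustration, not the harness API.

```python
import numpy as np


def summarise_errors(recovered, injected):
    """Summarise recovered-minus-injected uplift errors across replicates.

    Hypothetical helper for illustration; not part of the harness.
    """
    errors = np.asarray(recovered, dtype=float) - float(injected)
    bias = errors.mean()                # accuracy: systematic offset from truth
    spread = errors.std(ddof=0)         # precision: scatter around that offset
    rmse = np.sqrt(np.mean(errors**2))  # combined error
    return {"bias": bias, "spread": spread, "rmse": rmse}


summary = summarise_errors([0.02, 0.03, 0.04], injected=0.03)
```

With the population standard deviation (`ddof=0`), `rmse**2` equals `bias**2 + spread**2`, so the three numbers stay mutually consistent.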

Copilot

Pull request overview

Adds a v1 benchmarking “P50 evaluation harness” that can score uplift methods’ P50 estimates against injected synthetic ground truth, including a short-campaign (campaign-length) sweep and side-by-side comparison outputs.
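
The campaign-length sweep mentioned in the overview can be pictured with a toy example. This is only a sketch under stated assumptions: the "method" is a plain difference in means on simulated Gaussian rows, the number of rows stands in for campaign length, and `campaign_length_curve` is a hypothetical name rather than anything from the harness.

```python
import numpy as np


def campaign_length_curve(lengths, injected=0.03, n_replicates=200, seed=0):
    """Return (length, bias, spread) rows for a toy difference-in-means estimator."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in lengths:
        errors = np.empty(n_replicates)
        for r in range(n_replicates):
            # Toy replicate: n untreated and n treated rows with unit noise.
            untreated = rng.normal(0.0, 1.0, size=n)
            treated = rng.normal(injected, 1.0, size=n)
            errors[r] = (treated.mean() - untreated.mean()) - injected
        # Bias and spread of the recovered uplift error at this length.
        rows.append((n, errors.mean(), errors.std(ddof=0)))
    return rows


curve = campaign_length_curve([50, 500, 5000])
```

In this toy setting the spread shrinks as the campaign gets longer, which is the kind of degradation curve the sweep is meant to expose for real methods.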

Changes:

- Introduces a new `benchmarking.harness` package: replicate ensemble generation, campaign windows, scoring orchestrator, metrics, leaderboard summarization, and campaign-length plotting.
- Extends synthetic toggling with an optional `ToggleSchedule.start` baseline and exposes a public `treated_mask()` helper to unify treated-row selection across generator, methods, and scoring.
- Adds comprehensive unit tests for harness components plus an end-to-end slow test on real Hill of Towie data.
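
The idea of a single treated-row mask shared by generator, methods, and scoring can be sketched as follows. The real semantics of `ToggleSchedule.start` and `treated_mask()` are not shown on this page, so this sketch assumes a simple alternating off/on toggle anchored at a start time; `alternating_treated_mask` and its parameters are hypothetical.

```python
import numpy as np


def alternating_treated_mask(timestamps, start, period):
    """True where a timestamp falls in an 'on' block of an alternating toggle.

    Assumption for illustration: blocks of length `period` alternate
    off, on, off, ... beginning at `start`; rows before `start` are untreated.
    """
    elapsed = timestamps - start   # timedelta64 array
    block = elapsed // period      # integer block index (negative before start)
    after_start = elapsed >= np.timedelta64(0, "m")
    return after_start & (block % 2 == 1)


timestamps = np.arange(
    np.datetime64("2026-01-01T00:00"),
    np.datetime64("2026-01-01T04:00"),
    np.timedelta64(60, "m"),
)
mask = alternating_treated_mask(
    timestamps,
    start=np.datetime64("2026-01-01T01:00"),
    period=np.timedelta64(60, "m"),
)
```

Computing the mask in one place means every consumer selects exactly the same treated rows, which is the property the PR's helper is described as providing.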

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.


File	Description
tests/benchmarking/synthetic/test_public_api.py	Verifies synthetic package exports now include `treated_mask`.
tests/benchmarking/synthetic/test_generator.py	Adds tests for toggle baseline (`start`) and `treated_mask` inference.
tests/benchmarking/harness/test_scoring.py	Validates scoring output shape, correctness, and fairness guarantees.
tests/benchmarking/harness/test_replicates.py	Tests replicate ensemble construction and determinism.
tests/benchmarking/harness/test_public_api.py	Ensures harness package root exports intended entry points.
tests/benchmarking/harness/test_plots.py	Tests campaign-curve plotting and file saving.
tests/benchmarking/harness/test_metrics.py	Tests bias/spread/RMSE metric calculations.
tests/benchmarking/harness/test_method.py	Tests the method protocol seam (input/output).
tests/benchmarking/harness/test_leaderboard.py	Tests leaderboard aggregation from tidy scoring results.
tests/benchmarking/harness/test_hot_end_to_end.py	Slow end-to-end harness test on real HoT data.
tests/benchmarking/harness/test_campaign.py	Tests campaign window construction and masking logic.
tests/benchmarking/harness/stubs.py	Stub methods (oracle/biased/noisy/recording) for harness tests.
tests/benchmarking/harness/__init__.py	Adds harness test package init.
benchmarking/synthetic/generator.py	Adds `ToggleSchedule.start` semantics + public `treated_mask`.
benchmarking/synthetic/__init__.py	Re-exports `treated_mask` from synthetic package root.
benchmarking/harness/scoring.py	New scoring orchestrator (replicates × campaign sweep × methods).
benchmarking/harness/replicates.py	New replicate ensemble builder and `StudyConfig`.
benchmarking/harness/plots.py	New campaign-length curve plotting utility.
benchmarking/harness/metrics.py	New bias/spread/RMSE summary metrics.
benchmarking/harness/method.py	New method protocol + input/output dataclasses.
benchmarking/harness/leaderboard.py	New leaderboard aggregation over scoring results.
benchmarking/harness/campaign.py	New campaign windows and derived masks.
benchmarking/harness/__init__.py	New harness public API exports and module docstring.
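
As a rough illustration of leaderboard aggregation over tidy scoring results, the sketch below groups per-replicate errors by method and ranks methods by RMSE. The row schema, the `build_leaderboard` name, and the ranking criterion are assumptions made for this example; the method labels merely echo the stub names listed in the table.

```python
from collections import defaultdict

import numpy as np


def build_leaderboard(rows):
    """Aggregate tidy (method, error) rows into per-method metrics, best RMSE first."""
    errors_by_method = defaultdict(list)
    for row in rows:
        errors_by_method[row["method"]].append(row["error"])
    board = []
    for method, errs in errors_by_method.items():
        e = np.asarray(errs, dtype=float)
        board.append({
            "method": method,
            "bias": float(e.mean()),
            "spread": float(e.std(ddof=0)),
            "rmse": float(np.sqrt(np.mean(e**2))),
        })
    return sorted(board, key=lambda entry: entry["rmse"])


# Hypothetical tidy results: one row per (method, replicate).
rows = [
    {"method": "oracle", "error": 0.0},
    {"method": "oracle", "error": 0.0},
    {"method": "biased", "error": 0.01},
    {"method": "biased", "error": 0.01},
    {"method": "noisy", "error": -0.02},
    {"method": "noisy", "error": 0.02},
]
leaderboard = build_leaderboard(rows)
```

Here the biased method shows nonzero bias with zero spread, while the noisy method shows zero bias with nonzero spread, so the two failure modes stay distinguishable side by side.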



aclerc closed this Jun 22, 2026

aclerc reopened this Jun 22, 2026

issue 2 initial effort

3852475

aclerc changed the base branch from main to v1 June 22, 2026 16:13

aclerc marked this pull request as ready for review June 22, 2026 16:14

aclerc requested a review from Copilot June 22, 2026 16:14

Copilot started reviewing on behalf of aclerc June 22, 2026 16:14

Copilot AI reviewed Jun 22, 2026


Comment thread benchmarking/synthetic/generator.py

Comment thread benchmarking/harness/__init__.py

Comment thread benchmarking/harness/scoring.py

Comment thread benchmarking/harness/scoring.py

aclerc and others added 4 commits June 23, 2026 08:40

Simplify docstring

9addd01

Co-authored-by: Copilot Autofix powered by AI <[email protected]>

add example_hot_study.py

3d2c336

address PR comments

884fa84

improve kwargs enforcement

83850d0

aclerc merged commit 86763fd into v1 Jun 23, 2026

aclerc deleted the v1-issue2 branch June 23, 2026 10:09

aclerc merged 5 commits into v1 from v1-issue2
