
V1 issue2 #104

Merged
aclerc merged 5 commits into v1 from v1-issue2 on Jun 23, 2026

@aclerc aclerc commented Jun 22, 2026

Contributor

Issue 2 — P50 evaluation harness & scoring (WS1)

Goal: score any uplift method's P50 estimate against the known injected truth,
including a short-campaign robustness sweep.

Scope

  • Accuracy (bias) and precision (spread) metrics: recovered uplift vs injected,
    overall and (where applicable) per condition.
  • Short-campaign sweep: re-score as a function of campaign length / data volume to
    quantify how accuracy and precision degrade with less data.
  • A simple results format / leaderboard so methods can be compared side by side.
  • P50 only — no uncertainty/P95 scoring in this phase.
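The accuracy and precision metrics in the scope list can be sketched as below. This is a minimal illustration, not the harness code: `summarize_errors` is a hypothetical name, and using the sample standard deviation for spread is an assumption. RMSE is included because the review summary lists bias/spread/RMSE as the metrics module's outputs.

```python
import math
from statistics import mean, stdev


def summarize_errors(recovered, injected):
    """Summarize recovered-minus-injected uplift errors across replicates.

    Hypothetical helper for illustration; the real metrics module may differ.
    """
    if len(recovered) != len(injected):
        raise ValueError("recovered and injected must have the same length")
    errors = [r - t for r, t in zip(recovered, injected)]
    return {
        # Accuracy: mean signed error (positive means over-estimation).
        "bias": mean(errors),
        # Precision: sample standard deviation of the errors (assumed choice).
        "spread": stdev(errors) if len(errors) > 1 else 0.0,
        # Combined error magnitude.
        "rmse": math.sqrt(mean(e * e for e in errors)),
    }


# Two replicates with a 2% injected uplift, recovered as 3% and 1%.
summary = summarize_errors([0.03, 0.01], [0.02, 0.02])
```

Here the two errors cancel, so bias is close to zero while spread and RMSE are not, which is why the scope asks for both accuracy and precision.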

Done when: given a method and a synthetic dataset, the harness emits comparable accuracy/precision numbers and a
campaign-length curve.
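A campaign-length curve of the kind described in "Done when" could be produced along these lines. Everything here is a toy stand-in: the synthetic rows, the alternating 10-row toggle blocks, `naive_ratio_method`, and `sweep_campaign_lengths` are all hypothetical and only show the shape of the sweep (truncate each replicate to a campaign length, re-score, summarize per length).

```python
import random
from statistics import mean, stdev


def make_replicate(n_rows, injected_uplift, seed):
    """Toy replicate: (treated, value) rows, toggled in blocks of 10 rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        treated = (i // 10) % 2 == 1
        base = rng.gauss(100.0, 10.0)
        rows.append((treated, base * (1.0 + injected_uplift) if treated else base))
    return rows


def naive_ratio_method(rows):
    """Toy uplift estimate: ratio of treated mean to untreated mean, minus 1."""
    on = [v for t, v in rows if t]
    off = [v for t, v in rows if not t]
    return mean(on) / mean(off) - 1.0


def sweep_campaign_lengths(method, lengths, injected_uplift, n_replicates=30):
    """Score `method` on the first `length` rows of each replicate."""
    errors = {length: [] for length in lengths}
    for seed in range(n_replicates):
        rows = make_replicate(max(lengths), injected_uplift, seed)
        for length in lengths:
            errors[length].append(method(rows[:length]) - injected_uplift)
    return [
        {"length": n, "bias": mean(errs), "spread": stdev(errs)}
        for n, errs in errors.items()
    ]


curve = sweep_campaign_lengths(naive_ratio_method, [40, 2000], injected_uplift=0.02)
```

Plotting `spread` against `length` gives the degradation curve; with this toy data the spread shrinks as the campaign gets longer.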

@aclerc aclerc closed this Jun 22, 2026
@aclerc aclerc reopened this Jun 22, 2026
@aclerc aclerc changed the base branch from main to v1 June 22, 2026 16:13
@aclerc aclerc marked this pull request as ready for review June 22, 2026 16:14
@aclerc aclerc requested a review from Copilot June 22, 2026 16:14

Copilot AI left a comment

Contributor

Pull request overview

Adds a v1 benchmarking “P50 evaluation harness” that can score uplift methods’ P50 estimates against injected synthetic ground truth, including a short-campaign (campaign-length) sweep and side-by-side comparison outputs.

Changes:

  • Introduces a new benchmarking.harness package: replicate ensemble generation, campaign windows, scoring orchestrator, metrics, leaderboard summarization, and campaign-length plotting.
  • Extends synthetic toggling with an optional ToggleSchedule.start baseline and exposes a public treated_mask() helper to unify treated-row selection across generator/methods/scoring.
  • Adds comprehensive unit tests for harness components plus an end-to-end slow test on real Hill of Towie data.
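For readers unfamiliar with the "method protocol" seam mentioned above, one possible shape is sketched here. The names `MethodInput`, `MethodOutput`, `UpliftMethod`, `OracleStub`, and `BiasedStub`, and all of their fields, are assumptions made for illustration; the actual dataclasses and stubs in this PR may look quite different.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class MethodInput:
    """Hypothetical input: per-row values plus a treated/untreated mask."""
    values: Sequence[float]
    treated: Sequence[bool]


@dataclass(frozen=True)
class MethodOutput:
    """Hypothetical output: a single P50 uplift estimate (as a fraction)."""
    p50_uplift: float


class UpliftMethod(Protocol):
    """Anything with a name and an estimate() can be scored."""
    name: str

    def estimate(self, data: MethodInput) -> MethodOutput: ...


@dataclass
class OracleStub:
    """Test stub that always returns the injected truth."""
    truth: float
    name: str = "oracle"

    def estimate(self, data: MethodInput) -> MethodOutput:
        return MethodOutput(p50_uplift=self.truth)


@dataclass
class BiasedStub:
    """Test stub that returns the truth plus a fixed offset."""
    truth: float
    offset: float
    name: str = "biased"

    def estimate(self, data: MethodInput) -> MethodOutput:
        return MethodOutput(p50_uplift=self.truth + self.offset)


data = MethodInput(values=[100.0, 102.0], treated=[False, True])
methods: list[UpliftMethod] = [OracleStub(0.02), BiasedStub(0.02, 0.01)]
estimates = {m.name: m.estimate(data).p50_uplift for m in methods}
```

Stubs with a known error are useful because the expected bias of each is known in advance, so the scoring code can be checked independently of any real uplift method.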

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/benchmarking/synthetic/test_public_api.py | Verifies synthetic package exports now include treated_mask. |
| tests/benchmarking/synthetic/test_generator.py | Adds tests for toggle baseline (start) and treated_mask inference. |
| tests/benchmarking/harness/test_scoring.py | Validates scoring output shape, correctness, and fairness guarantees. |
| tests/benchmarking/harness/test_replicates.py | Tests replicate ensemble construction and determinism. |
| tests/benchmarking/harness/test_public_api.py | Ensures harness package root exports intended entry points. |
| tests/benchmarking/harness/test_plots.py | Tests campaign-curve plotting and file saving. |
| tests/benchmarking/harness/test_metrics.py | Tests bias/spread/RMSE metric calculations. |
| tests/benchmarking/harness/test_method.py | Tests the method protocol seam (input/output). |
| tests/benchmarking/harness/test_leaderboard.py | Tests leaderboard aggregation from tidy scoring results. |
| tests/benchmarking/harness/test_hot_end_to_end.py | Slow end-to-end harness test on real HoT data. |
| tests/benchmarking/harness/test_campaign.py | Tests campaign window construction and masking logic. |
| tests/benchmarking/harness/stubs.py | Stub methods (oracle/biased/noisy/recording) for harness tests. |
| tests/benchmarking/harness/\_\_init\_\_.py | Adds harness test package init. |
| benchmarking/synthetic/generator.py | Adds ToggleSchedule.start semantics + public treated_mask. |
| benchmarking/synthetic/\_\_init\_\_.py | Re-exports treated_mask from synthetic package root. |
| benchmarking/harness/scoring.py | New scoring orchestrator (replicates × campaign sweep × methods). |
| benchmarking/harness/replicates.py | New replicate ensemble builder and StudyConfig. |
| benchmarking/harness/plots.py | New campaign-length curve plotting utility. |
| benchmarking/harness/metrics.py | New bias/spread/RMSE summary metrics. |
| benchmarking/harness/method.py | New method protocol + input/output dataclasses. |
| benchmarking/harness/leaderboard.py | New leaderboard aggregation over scoring results. |
| benchmarking/harness/campaign.py | New campaign windows and derived masks. |
| benchmarking/harness/\_\_init\_\_.py | New harness public API exports and module docstring. |
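As an illustration of "leaderboard aggregation from tidy scoring results", the sketch below groups one-row-per-(method, replicate) records by method and ranks them. The row keys (`method`, `recovered`, `injected`), the function name `build_leaderboard`, and ranking by RMSE are assumptions, not the behaviour of `benchmarking/harness/leaderboard.py`.

```python
import math
from collections import defaultdict
from statistics import mean


def build_leaderboard(rows):
    """Aggregate tidy scoring rows into one summary row per method.

    Hypothetical sketch: each input row is a dict with 'method',
    'recovered', and 'injected' keys. Methods are ranked by RMSE.
    """
    errors_by_method = defaultdict(list)
    for row in rows:
        errors_by_method[row["method"]].append(row["recovered"] - row["injected"])

    board = [
        {
            "method": method,
            "n": len(errors),
            "bias": mean(errors),
            "rmse": math.sqrt(mean(e * e for e in errors)),
        }
        for method, errors in errors_by_method.items()
    ]
    return sorted(board, key=lambda entry: entry["rmse"])


tidy_rows = [
    {"method": "biased", "recovered": 0.03, "injected": 0.02},
    {"method": "oracle", "recovered": 0.02, "injected": 0.02},
    {"method": "biased", "recovered": 0.03, "injected": 0.02},
    {"method": "oracle", "recovered": 0.02, "injected": 0.02},
]
leaderboard = build_leaderboard(tidy_rows)
```

Keeping the scoring output tidy (one row per method and replicate) means the leaderboard stays a pure aggregation step that can be recomputed with a different ranking metric without re-running any method.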


Comment thread benchmarking/synthetic/generator.py
Comment thread benchmarking/harness/__init__.py
Comment thread benchmarking/harness/scoring.py
Comment thread benchmarking/harness/scoring.py
@aclerc aclerc merged commit 86763fd into v1 Jun 23, 2026
@aclerc aclerc deleted the v1-issue2 branch June 23, 2026 10:09