Phase B1 infrastructure: until-death runs + pinned baseline sidecars #29

Merged: jkbennitt merged 1 commit into master from feat/baseline-calibration on Jun 10, 2026

Conversation

@jkbennitt

Member

Phase B groundwork per the approved plan — the pinned-baseline pattern (calibrate once at N=4, compare against the reference forever after).

What

  • --until-death on run_scenario.py: clears the scenario tick cap (max_ticks=None so the evaluator's timeout check never fires); the run ends on the evaluator's terminal conditions — all_colonists_dead defeat or victory — with a 5000-tick runaway guard.
  • BaselineReference / BaselinePoint (scenarios/schema.py): the .baseline.json sidecar schema — seeds, per-run outcomes, time_to_end_days_mean ± bootstrap CI95, loop-tick-indexed composite trajectory with CIs + per-metric means, and provenance (save_sha256, DLL sha, commit, scoring_version).
  • load_baseline() (scenarios/loader.py): returns the sidecar or None; fails fast (BaselineMismatchError) when the baseline was calibrated against a different save_sha256 or SCORING_VERSION.
  • rle.scoring.baseline: testable aggregation. read_run reads a run from its CSV + summary artifacts; aggregate_baseline handles runs of unequal length, with each point recording n_runs, the number of runs still alive at that tick.
  • scripts/calibrate_baseline.py: thin driver — N seeded --no-agent --until-death --no-pause subprocess runs, aborts on any failure rather than writing a partial baseline, writes the sidecar, round-trips it through the strict loader.
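
To illustrate the fail-fast contract described for load_baseline(), here is a minimal sketch. It is not the code in scenarios/loader.py: the function name `load_baseline_sketch`, its signature, the plain-dict sidecar, and the `SCORING_VERSION` value are all assumptions made for the example. Only the behavior (return None when there is no sidecar, raise BaselineMismatchError on a save_sha256 or scoring-version mismatch) comes from the description above.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

# Hypothetical value; the real SCORING_VERSION constant lives in the project.
SCORING_VERSION = "1"


class BaselineMismatchError(Exception):
    """Raised when a sidecar was calibrated against different inputs."""


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_baseline_sketch(save_path: Path, sidecar_path: Path) -> Optional[dict]:
    """Return the parsed sidecar, None if it is absent, or raise on mismatch."""
    if not sidecar_path.exists():
        return None  # no baseline calibrated yet

    data = json.loads(sidecar_path.read_text(encoding="utf-8"))

    # A baseline is only meaningful for the exact save it was calibrated on.
    actual_sha = sha256_of(save_path)
    if data["save_sha256"] != actual_sha:
        raise BaselineMismatchError(
            f"save_sha256 mismatch: sidecar {data['save_sha256']} != {actual_sha}"
        )

    # Scores from a different scoring version are not comparable either.
    if data["scoring_version"] != SCORING_VERSION:
        raise BaselineMismatchError(
            f"scoring_version mismatch: sidecar {data['scoring_version']} "
            f"!= {SCORING_VERSION}"
        )
    return data
```

Raising instead of returning None on a mismatch keeps a stale reference from being mistaken for "no baseline yet", which matches the intent that stale sidecars are recalibrated rather than silently compared against.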

Verification

450 tests pass (10 new: aggregation math incl. unequal run lengths, CSV/summary reading, loader mismatch fail-fast both ways), ruff clean, mypy strict clean. The actual N=4 Crashlanded calibration runs follow this merge as a separate PR carrying the generated sidecar.
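
For readers unfamiliar with the unequal-run-length case the new tests cover, the sketch below shows one way such an aggregation can work. It is an illustration under stated assumptions, not the code in rle.scoring.baseline: the names `bootstrap_ci95` and `aggregate_trajectories`, the list-of-floats input, and the choice of a percentile bootstrap are all hypothetical. The idea taken from the PR is only that each trajectory point is computed over the runs still alive at that loop tick and records how many there were.

```python
import random
from statistics import mean


def bootstrap_ci95(values, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% CI for the mean (assumed method)."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    # Resample with replacement, take each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


def aggregate_trajectories(runs):
    """Aggregate per-run score lists indexed by loop tick.

    Runs may end at different ticks; each point uses only the runs that
    still have a value at that tick and records that count as n_runs.
    Expects at least one non-empty run.
    """
    longest = max(len(run) for run in runs)
    points = []
    for tick in range(longest):
        alive = [run[tick] for run in runs if len(run) > tick]
        lo, hi = bootstrap_ci95(alive)
        points.append(
            {"tick": tick, "n_runs": len(alive), "mean": mean(alive), "ci95": (lo, hi)}
        )
    return points


# Two runs: one survives three ticks, the other dies after two.
points = aggregate_trajectories([[1.0, 2.0, 3.0], [1.0, 4.0]])
for point in points:
    print(point)
```

Recording n_runs next to each mean lets a consumer see when a late-tick point rests on a single surviving run, where the bootstrap interval collapses to that run's value.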

🤖 Generated with Claude Code

- run_scenario --until-death lifts the scenario tick cap (evaluator
  terminal conditions are the stop; 5000-tick runaway guard) so
  no-agent colonies run to natural death and agent runs to victory.
- BaselineReference/BaselinePoint schema: per-scenario .baseline.json
  sidecar with seeds, outcomes, time-to-end mean + bootstrap CI95,
  loop-tick-indexed composite/metric trajectories, and full provenance
  (save_sha256, DLL sha, commit, scoring_version).
- load_baseline() fails fast when a sidecar was calibrated against a
  different save_sha256 or SCORING_VERSION — stale references must be
  recharacterized, never silently compared against.
- scripts/calibrate_baseline.py drives N seeded runs against the live
  game and aggregates them (rle.scoring.baseline); aborts rather than
  writing a partial baseline if any run fails; --aggregate-only reuses
  existing run dirs.

Co-Authored-By: Claude Fable 5 <[email protected]>
jkbennitt merged commit 1961e91 into master on Jun 10, 2026
3 checks passed
jkbennitt deleted the feat/baseline-calibration branch on June 10, 2026 04:22