Phase B1 infrastructure: until-death runs + pinned baseline sidecars by jkbennitt · Pull Request #29 · AppSprout-dev/RLE

jkbennitt · 2026-06-10T04:20:29Z

Phase B groundwork per the approved plan: this PR sets up the pinned-baseline pattern (calibrate once at N=4, then compare every later run against that pinned reference).

What

- --until-death on run_scenario.py: clears the scenario tick cap (max_ticks=None, so the evaluator's timeout check never fires). The run then ends on the evaluator's terminal conditions (all_colonists_dead defeat, or victory), with a 5000-tick runaway guard.
- BaselineReference / BaselinePoint (scenarios/schema.py): the .baseline.json sidecar schema. It holds seeds, per-run outcomes, time_to_end_days_mean ± bootstrap CI95, a loop-tick-indexed composite trajectory with CIs and per-metric means, and provenance (save_sha256, DLL sha, commit, scoring_version).
- load_baseline() (scenarios/loader.py): returns the sidecar or None; fails fast (BaselineMismatchError) when the baseline was calibrated against a different save_sha256 or SCORING_VERSION.
- rle.scoring.baseline: testable aggregation (read_run from CSV + summary artifacts; aggregate_baseline with unequal-length-run handling, where points record the n_runs still alive).
- scripts/calibrate_baseline.py: thin driver. It launches N seeded --no-agent --until-death --no-pause subprocess runs, aborts on any failure rather than writing a partial baseline, writes the sidecar, and round-trips it through the strict loader.
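
For readers unfamiliar with the stop rule, the interaction between the tick cap, the evaluator's terminal conditions, and the runaway guard can be sketched as below. This is an illustrative sketch only, not the code in run_scenario.py: the function name run_until_terminal, the step callback, and the "timeout" / "runaway_guard" outcome strings are all hypothetical; only max_ticks=None, all_colonists_dead, and the 5000-tick figure come from the description above.

```python
from typing import Callable, Optional

RUNAWAY_GUARD_TICKS = 5000  # figure quoted in the PR description


def run_until_terminal(
    step: Callable[[int], Optional[str]],
    max_ticks: Optional[int],
    runaway_guard: int = RUNAWAY_GUARD_TICKS,
) -> tuple[str, int]:
    """Drive a run until a terminal outcome, a tick cap, or the guard.

    `step(tick)` is a hypothetical callback that returns a terminal outcome
    such as "all_colonists_dead" or "victory", or None to keep going.
    Returns (outcome, ticks_executed).
    """
    tick = 0
    while True:
        outcome = step(tick)
        tick += 1
        if outcome is not None:
            return outcome, tick
        # With max_ticks=None (the --until-death case) this check never fires.
        if max_ticks is not None and tick >= max_ticks:
            return "timeout", tick
        # Safety net so a colony that never dies cannot run forever.
        if tick >= runaway_guard:
            return "runaway_guard", tick


def dies_at_120(tick: int) -> Optional[str]:
    """Toy evaluator: the colony dies on the 120th tick."""
    return "all_colonists_dead" if tick == 119 else None


capped = run_until_terminal(dies_at_120, max_ticks=50)
until_death = run_until_terminal(dies_at_120, max_ticks=None)
never_ends = run_until_terminal(lambda tick: None, max_ticks=None)
```

With the cap in place the toy run stops at tick 50 before the colony dies; with max_ticks=None it runs to the terminal condition, and a run that never terminates is cut off by the guard.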

Verification

450 tests pass (10 new: aggregation math incl. unequal run lengths, CSV/summary reading, loader mismatch fail-fast both ways), ruff clean, mypy strict clean. The actual N=4 Crashlanded calibration runs follow this merge as a separate PR carrying the generated sidecar.
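
The unequal-run-length case covered by the new aggregation tests can be illustrated with a small sketch. It is not the code in rle.scoring.baseline: the Point class, aggregate, and bootstrap_ci95 are hypothetical names, the per-tick list layout is an assumption, and the percentile bootstrap is one common choice, assumed here because the PR does not say which bootstrap variant is used.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical stand-in for BaselinePoint; field names are assumptions."""
    tick: int
    n_runs: int            # runs still alive, i.e. that reached this tick
    composite_mean: float


def aggregate(runs: list[list[float]]) -> list[Point]:
    """Average composite scores tick by tick over runs of unequal length.

    A run that has already ended contributes nothing to later ticks, so each
    point is averaged only over the runs that reached it.
    """
    longest = max((len(r) for r in runs), default=0)
    points = []
    for tick in range(longest):
        alive = [r[tick] for r in runs if len(r) > tick]
        points.append(Point(tick, len(alive), statistics.fmean(alive)))
    return points


def bootstrap_ci95(values: list[float], n_resamples: int = 2000,
                   seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap 95% interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


# Two toy runs: the second colony dies one tick earlier than the first.
points = aggregate([[0.5, 0.4, 0.3], [0.6, 0.2]])
lo, hi = bootstrap_ci95([10.0, 12.0, 14.0, 16.0])  # e.g. days survived per run
```

Recording n_runs on every point makes the thinning visible: the tail of a trajectory may rest on a single surviving run, and a consumer can discount it accordingly.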

🤖 Generated with Claude Code

- run_scenario --until-death lifts the scenario tick cap (evaluator terminal conditions are the stop; 5000-tick runaway guard) so no-agent colonies run to natural death and agent runs to victory.
- BaselineReference/BaselinePoint schema: per-scenario .baseline.json sidecar with seeds, outcomes, time-to-end mean + bootstrap CI95, loop-tick-indexed composite/metric trajectories, and full provenance (save_sha256, DLL sha, commit, scoring_version).
- load_baseline() fails fast when a sidecar was calibrated against a different save_sha256 or SCORING_VERSION; stale references must be recharacterized, never silently compared against.
- scripts/calibrate_baseline.py drives N seeded runs against the live game and aggregates them (rle.scoring.baseline); aborts rather than writing a partial baseline if any run fails; --aggregate-only reuses existing run dirs.

Co-Authored-By: Claude Fable 5 <[email protected]>
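
The fail-fast rule in the commit message could look roughly like the sketch below. It is a sketch under stated assumptions rather than the loader in scenarios/loader.py: the function name load_baseline_sketch, the "provenance" JSON layout, the placeholder SCORING_VERSION value, and returning None when no sidecar file exists are all assumptions made for illustration. The exception class here is a local stand-in that only shares its name with the real BaselineMismatchError.

```python
import json
from pathlib import Path
from typing import Optional

SCORING_VERSION = "1.1"  # placeholder value, for illustration only


class BaselineMismatchError(Exception):
    """Local stand-in: the sidecar was calibrated against different inputs."""


def load_baseline_sketch(
    sidecar: Path,
    save_sha256: str,
    scoring_version: str = SCORING_VERSION,
) -> Optional[dict]:
    """Return the parsed sidecar, None if absent, or raise on stale provenance."""
    if not sidecar.exists():
        return None  # assumption: a missing sidecar means "no baseline yet"
    data = json.loads(sidecar.read_text(encoding="utf-8"))
    provenance = data["provenance"]  # assumed JSON layout
    current = {"save_sha256": save_sha256, "scoring_version": scoring_version}
    for key, expected in current.items():
        if provenance[key] != expected:
            # Refuse to compare against a reference built from other inputs.
            raise BaselineMismatchError(
                f"{sidecar.name}: {key} is {provenance[key]!r}, "
                f"but the current value is {expected!r}; recalibrate"
            )
    return data
```

Raising instead of returning None on a mismatch keeps "no baseline yet" and "baseline is stale" distinguishable for the caller.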

jkbennitt merged commit 1961e91 into master Jun 10, 2026
3 checks passed

jkbennitt deleted the feat/baseline-calibration branch June 10, 2026 04:22

jkbennitt mentioned this pull request Jun 10, 2026

Pin Crashlanded no-agent baseline (N=4, scoring 1.1) #30

Merged

jkbennitt merged 1 commit into master from feat/baseline-calibration
