Phase B1 infrastructure: until-death runs + pinned baseline sidecars #29
Merged
- run_scenario --until-death lifts the scenario tick cap (evaluator terminal conditions are the stop; 5000-tick runaway guard) so no-agent colonies run to natural death and agent runs to victory.
- BaselineReference/BaselinePoint schema: per-scenario .baseline.json sidecar with seeds, outcomes, time-to-end mean + bootstrap CI95, loop-tick-indexed composite/metric trajectories, and full provenance (save_sha256, DLL sha, commit, scoring_version).
- load_baseline() fails fast when a sidecar was calibrated against a different save_sha256 or SCORING_VERSION: stale references must be recharacterized, never silently compared against.
- scripts/calibrate_baseline.py drives N seeded runs against the live game and aggregates them (rle.scoring.baseline); aborts rather than writing a partial baseline if any run fails; --aggregate-only reuses existing run dirs.

Co-Authored-By: Claude Fable 5 <[email protected]>
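The fail-fast rule in the commit message can be illustrated with a minimal sketch. It is not the code in scenarios/loader.py: the function name load_baseline_sketch, the placeholder SCORING_VERSION value, and the assumption that save_sha256 and scoring_version are top-level keys of the sidecar JSON are all hypothetical, chosen only to show the shape of the check.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Assumption: placeholder value; the real constant lives in the repo.
SCORING_VERSION = "hypothetical-v1"


class BaselineMismatchError(Exception):
    """Sidecar was calibrated against a different save or scoring version."""


def load_baseline_sketch(
    sidecar: Path,
    save_sha256: str,
    scoring_version: str = SCORING_VERSION,
) -> Optional[dict[str, Any]]:
    """Return the parsed sidecar, None if it is absent, or raise on stale provenance."""
    if not sidecar.exists():
        return None  # no baseline calibrated yet for this scenario
    data = json.loads(sidecar.read_text(encoding="utf-8"))
    # Assumption: provenance fields sit at the top level of the JSON document.
    expected = {"save_sha256": save_sha256, "scoring_version": scoring_version}
    for field, want in expected.items():
        got = data.get(field)
        if got != want:
            # Refuse to compare against a stale reference instead of warning.
            raise BaselineMismatchError(
                f"{sidecar.name}: {field} is {got!r}, expected {want!r}; recalibrate"
            )
    return data
```

Raising instead of returning None on a mismatch keeps "no baseline yet" and "baseline is stale" distinguishable for callers.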
Phase B groundwork per the approved plan — the pinned-baseline pattern (calibrate once at N=4, compare against the reference forever after).
What
- --until-death on run_scenario.py: clears the scenario tick cap (max_ticks=None so the evaluator's timeout check never fires); the run ends on the evaluator's terminal conditions (all_colonists_dead defeat, or victory), with a 5000-tick runaway guard.
- BaselineReference/BaselinePoint (scenarios/schema.py): the .baseline.json sidecar schema: seeds, per-run outcomes, time_to_end_days_mean ± bootstrap CI95, loop-tick-indexed composite trajectory with CIs + per-metric means, and provenance (save_sha256, DLL sha, commit, scoring_version).
- load_baseline() (scenarios/loader.py): returns the sidecar or None; fails fast (BaselineMismatchError) when the baseline was calibrated against a different save_sha256 or SCORING_VERSION.
- rle.scoring.baseline: testable aggregation (read_run from CSV+summary artifacts, aggregate_baseline with unequal-length-run handling; points record n_runs still alive).
- scripts/calibrate_baseline.py: thin driver. Launches N seeded --no-agent --until-death --no-pause subprocess runs, aborts on any failure rather than writing a partial baseline, writes the sidecar, and round-trips it through the strict loader.

Verification
450 tests pass (10 new: aggregation math incl. unequal run lengths, CSV/summary reading, loader mismatch fail-fast both ways), ruff clean, mypy strict clean. The actual N=4 Crashlanded calibration runs follow this merge as a separate PR carrying the generated sidecar.
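For readers unfamiliar with the unequal-run-length handling the new tests cover, here is a rough sketch of the idea. It is not the rle.scoring.baseline implementation: aggregate_trajectories, bootstrap_ci95, and the composite_mean and ci95 keys are hypothetical names, and the percentile bootstrap is only one plausible reading of "bootstrap CI95". Only n_runs is taken from the description above.

```python
import random
from statistics import mean


def bootstrap_ci95(values, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% CI for the mean (assumed method, fixed seed)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


def aggregate_trajectories(runs):
    """Aggregate per-run composite scores indexed by loop tick.

    Runs may end at different ticks; each point averages only the runs
    still alive at that tick and records how many there were.
    """
    if not runs:
        return []
    points = []
    for tick in range(max(len(r) for r in runs)):
        alive = [r[tick] for r in runs if tick < len(r)]
        points.append({
            "tick": tick,
            "n_runs": len(alive),           # runs still alive at this tick
            "composite_mean": mean(alive),  # hypothetical field name
            "ci95": bootstrap_ci95(alive),  # hypothetical field name
        })
    return points
```

Recording n_runs on each point lets a consumer see when a late-tick mean rests on a single surviving run, where the interval collapses to a point.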
🤖 Generated with Claude Code