Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source by jkbennitt · Pull Request #20 · AppSprout-dev/RLE

jkbennitt · 2026-05-23T23:38:58Z

Summary

Phase A of the observability-first restructuring plan. Nine commits, +969 lines of source, +56 lines of cost-tracker docs/tests. Goal: the next live benchmark run produces a fully reviewable, replayable artifact.

Originally stacked on #19; rebased against master after #19 merged.

#	Commit	What
A1	Restore per-scenario `*_deliberations.jsonl` export — was in docker-benchmark (Apr 12), lost in pr17-live. New `RLEGameLoop.deliberation_log` public property; both `run_benchmark.py` and `run_scenario.py` now write the JSONL.
A2	Expand `EventType.DELIBERATION` payload — events.jsonl now carries the full actions array + summary + reason text (was only latency/confidence/num_actions). Truncation constants `_ACTION_REASON_CHARS=200`, `_PLAN_SUMMARY_CHARS=300`, `_PARSE_FAILURE_RAW_CHARS=500` consolidated at module top.
A3	`_last_raw_output` exported via `PROVIDER_CALL` events (4 KB head-truncated). Means a weird deliberation is grep-able without keeping the process alive.
A6	Replay-grade metadata: `SCORING_VERSION="1.0"`, `file_sha256()`, RIMAPI DLL hash via Workshop path lookup, RIMAPI fork commit lookup, `--seed` CLI flag. Plus `save_sha256` on every scenario YAML (all 6 now pinned) + loader validation. `scripts/hash_saves.py` helper.
A7	Per-task LLM timeout via `asyncio.wait_for`. Hung deliberation emits `deliberation_timeout` ERROR event and the tick proceeds. Default 60s. All three deliberation paths covered (parallel, sequential, MapAnalyst).
A8	`run_scenario.py` now appends to `results/benchmark_history.jsonl` with `run_type: "scenario"` so single-scenario runs are distinguishable from full batteries. Skipped when no LLM was actually called (smoke tests).
A9	Cost tracker surfaces `pricing_source` (`openrouter_api` / `override` / `unknown`) on every snapshot, plus the actual prices used. New `--prompt-price-per-mtok` / `--completion-price-per-mtok` CLI overrides for when OpenRouter's /models pricing diverges from billed cost.
CI fix	`.gitattributes` marks `*.rws binary` so save_sha256 pins are cross-platform. Initial commit had Git's autocrlf=true silently converting CRLF↔LF on commit/checkout, so Windows-pinned SHAs didn't match Linux CI's on-disk bytes. Re-staged the 6 saves so the repo stores CRLF directly (matching what RimWorld writes on Windows); the pinned SHAs in the YAMLs were already the CRLF hashes, no YAML changes needed.

Test results

pytest — 398 passed (+13 new tests covering deliberation log, raw-output event, save SHA validation, metadata, timeout, pricing source, overrides). ruff + mypy strict clean. CI green (lint + test + smoke-test).

Why this matters for the benchmark

The original review found that pr17-live (the post-PR-#17 live run) had:

No persisted deliberation JSONLs (regressed between PR HeadlessRim Docker + real benchmarks + leaderboard infrastructure (#13) #15 and Full RIMAPI integration: fix data gaps, alerts, incident control, scenario pipeline #18)
No raw LLM output anywhere on disk
No way to know which RIMAPI DLL the run used
Silent timeout-blocking (one hung LLM call stalled the entire tick)
1-byte benchmark_history.jsonl despite dozens of runs persisted to subdirs

After this PR, a fresh single-scenario run produces:

<scenario>.csv (per-tick time series)
<scenario>_deliberations.jsonl (full agent rationale per tick)
<scenario>_summary.json (metadata + outcome + cost + event summary)
events.jsonl (every tick event, with actions/reasons/raw LLM output inline)
An entry in results/benchmark_history.jsonl tagged run_type: "scenario"

That's the artifact format Phase B (baseline calibration + first real long run) will produce.

Empirical validation (A10 partial run, killed at tick 5)

Ran the new observability stack against a live Nemotron 120B / Crashlanded smoke test before opening this PR. Five ticks of data confirmed:

A7 timeout protection works: 26 deliberation_timeout events fired; tick advanced cleanly each time (would have hung indefinitely pre-A7).
Active intervention hurts: when agents did get through at tick 3, they executed 6 actions and the composite dropped −0.103 in that single tick. Other ticks (where agents timed out) drifted ~−0.013/tick from RimWorld's natural decay. Clean replication of the GitHub Agents must beat unmanaged baseline #6 "agents lose to baseline" finding, now with per-action observability.
PR Surface RIMAPI action errors to agents + fix load-settle race #19's ActionOutcome capture works: 2 RIMAPI 500s in tick 3 (Bionics already-researched, work-priority on disabled skill) were caught with structured reasons.
60s timeout is too aggressive for Nemotron 120B paid tier: observed deliberation latencies 36s / 89s / 130s on tick 0. Bumping the timeout makes the architectural mismatch worse, not better — the next architectural decision (pause vs no-pause + model choice) is what unblocks Phase B.

Known artifact bug (followup)

A7's wait_for cancels the orchestrator's await but the worker thread keeps running (Python threads can't be force-killed). When it eventually completes, it still emits a DELIBERATION event — so a single (tick, agent) can have both an ERROR (timeout) and a DELIBERATION event. The orchestrator's plan list is correct (it returns (agent, None) at the timeout); the artifact just needs a post_timeout: true flag on late-arriving DELIBERATION events so downstream consumers can deduplicate. Tracked as a followup, not a correctness bug.

What's still open in Phase A

A10: live smoke comparison run — partially executed (5/10 ticks before kill); the run config itself needs a structural fix before the run is meaningful (current config has agents acting on stale game state because game advances during deliberation).
A11: upstream RIMAPI PR to tighten ResponceBuilder.DetermineStatusCode. Low urgency since the RLE-side ActionOutcome workaround from Surface RIMAPI action errors to agents + fix load-settle race #19 is in place.
A12: wire pawn-social endpoints (opinions / relations / interactions) into SocialOverseer, filtered to mood<0.4. Should be its own PR.

Test plan

pytest (398 pass)
ruff check src/ tests/ scripts/ (clean)
mypy src/ (clean, 42 source files)
python scripts/hash_saves.py --print (round-trips all 6 scenario saves)
python scripts/run_scenario.py --help shows --seed, --prompt-price-per-mtok, --completion-price-per-mtok
CI lint + test + smoke-test green
Live partial run validates A1-A8 produce the expected artifacts

🤖 Generated with Claude Code

Plan item A1. docker-benchmark (Apr 12) wrote *_deliberations.jsonl per scenario via run_benchmark.py; pr17-live (run via run_scenario.py) silently dropped the export, so post-PR-#17 runs lost the per-tick agent rationale (actions, reasons, summary). Re-add it via a public deliberation_log property on RLEGameLoop so both scripts use the same contract. - RLEGameLoop.deliberation_log: public property returning a shallow copy of the per-tick records (tick, agent, status, plus actions/summary/ confidence for success and raw/reason for failures). - run_scenario.py: writes <scenario>_deliberations.jsonl alongside the CSV when --output is given, off-thread via asyncio.to_thread. - run_benchmark.py: switched to the public property (was reading the private _deliberation_log attribute). - Integration test asserts record count and shallow-copy semantics. Tests: 383 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A2. The DELIBERATION events emitted to events.jsonl previously carried only latency_ms / confidence / num_actions; the rich payload (actions array with type/target/priority/reason, plus the plan summary) lived only in the in-memory _deliberation_log mirror. Now both writers emit the same data so events.jsonl is self-contained for downstream analysis without needing to cross-reference *_deliberations.jsonl. Also surfaces the raw LLM content (first 500 chars) on parse_failure ERROR events, matching what _deliberation_log already records. Extracted three truncation constants (_ACTION_REASON_CHARS=200, _PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500) so the two writers share the same shape and the limits are documented in one place. Tests: 384 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A3. _last_raw_output (the full LLM completion text) was stored in-memory on each agent but never exported, so post-hoc analysis of a weird deliberation required keeping the process alive. Now every PROVIDER_CALL event carries the agent's raw completion text (head- truncated to 4 KB) plus a raw_output_truncated bool so consumers know whether the JSONL has the full response or just a prefix. PROVIDER_CALL is the right semantic home — it's the provider's response to the call we already log token counts for. The truncation length is configurable via _RAW_OUTPUT_CHARS (matches the other event-log truncation constants from A2). 4 KB covers typical action-plan JSON plus a reasoning preamble. Tests: 385 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A6 (commit 1 of 2). Surfaces the metadata a benchmark replay kit needs: a versioned scoring identifier, the deployed RIMAPI DLL hash, the RIMAPI fork commit, and an explicit seed for RLE-side stochasticity. Scenario save_sha256 + loader validation lands in a follow-up. metadata.py: * SCORING_VERSION constant ("1.0") — bumped when DEFAULT_WEIGHTS, metric implementations, or composite math change in a way that makes older runs not directly comparable. Surfaced in every run summary. * file_sha256(path): stdlib-only chunked SHA-256, returns None on missing/unreadable so the metadata dict stays serializable. * _rimapi_dll_path(): env override ($RIMAPI_DLL_PATH) → Workshop default (Steam path). Hashed via file_sha256. * _rimapi_fork_commit(): env override ($RIMAPI_FORK_PATH) → sibling ../RIMAPI checkout. Empty string when not reachable. * collect_metadata(random_seed=None): threads the seed through so the summary records what was actually set. Docstring notes that the seed only affects RLE-side randomness, not RimWorld's internal RNG. CLI: * Both run_benchmark.py and run_scenario.py gain --seed INT. When set, random.seed() is called before any work and the value is threaded through to collect_metadata() at every summary write. run_scenario.py: * Writes <scenario>_summary.json alongside the CSV/deliberations, including the metadata block, model/provider config, outcome, final composite, ticks_run, cost_snapshot, and event_summary. The new helper _build_run_summary keeps main() from getting longer. Tests: 391 pass (+6 new metadata tests). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A6 (commit 2 of 2). Live game runs can silently drift from the canonical docker/saves/ mirror (RimWorld saves into AppData on Windows; nothing forces it to match the repo). When the agent + scoring code are working off a different save than the one the leaderboard tracks, every result is uncomparable. Pin and verify. schema.py: * ScenarioConfig.save_sha256: str | None — optional pinned hash of the docker/saves/<save_name>.rws file. loader.py: * ScenarioSaveMismatchError — distinct exception so callers can handle this specifically (e.g., a CLI flag --allow-unpinned). * canonical_save_path(save_name) — resolves to <repo>/docker/saves/. * load_scenario(path, allow_unpinned=False) — if save_sha256 is set AND the canonical save exists on disk AND the actual SHA doesn't match, raises. allow_unpinned=True is the intentional escape hatch. Short-circuits silently when the canonical file is missing (CI without docker volumes) — that's a separate failure mode caught at game-load time. scripts/hash_saves.py: * New helper that hashes every canonical save and updates the matching scenario YAML's save_sha256 field. --print for dry-run. definitions/*.yaml: * All 6 scenarios now carry pinned save_sha256 fields (the actual hashes of the current docker/saves/*.rws files). Re-run scripts/hash_saves.py after rebuilding saves to re-pin. Tests: 394 pass (+3 new SHA validation tests). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A7. A single hung LLM call previously blocked the entire tick because asyncio.gather has no per-task timeout. Now each deliberation runs under a wait_for(role_timeout_s) wrapper; on timeout the orchestrator emits a deliberation_timeout ERROR event, records the agent's empty contribution in the deliberation log, and proceeds. The hung thread keeps running in the background until the provider's own timeout cleans it up (Python threads can't be force-killed) but the tick advances regardless. Why it matters for the benchmark: cheap and premium models can be in the same matrix only if tail latency is bounded. Without this, a stuck 30B/120B request stalls every other agent's runtime measurement. config.py: * RLEConfig.role_timeout_s: float = 60.0 (~8x the 7s avg deliberation observed in docker-benchmark; tunable via .env or CLI override). game_loop.py: * _deliberate_agent_with_timeout(): new async wrapper, runs _deliberate_agent in a worker thread under asyncio.wait_for. Emits EventType.ERROR with error_type=deliberation_timeout + timeout_s on TimeoutError; returns (agent, None). * _deliberate_parallel(): collapsed the inline _run closure into the new wrapper. * _deliberate_sequential(): is now async and awaits the wrapper. Sequential-mode dispatch in run_tick() now awaited. * MapAnalyst dispatch also goes through the timeout wrapper. Test: TestMultiAgent.test_role_timeout_emits_event_and_other_agents_continue injects a 5s-sleep provider with role_timeout_s=0.05; asserts 7 deliberation_timeout events fire, merged plan has 0 actions, tick completes successfully. Tests: 395 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A8. run_benchmark.py already calls append_history (line 711); run_scenario.py never did. Combined with the fact that the user has been doing live smoke testing via run_scenario.py (not run_benchmark.py), results/benchmark_history.jsonl stayed at 1 byte even though dozens of runs were persisted in per-run subdirectories. Now run_scenario.py appends one history entry per --output run, tagged run_type: "scenario" so consumers can distinguish single-scenario runs from full benchmark batteries (which would tag run_type: "benchmark" when run_benchmark.py is updated; that's a separate trivial follow-up). Skips the history append when cost_tracker.num_calls == 0 (smoke tests where the LLM was never reached, e.g. today's broken nano-4b run that generated parse failures across the board because LM Studio was down). Those rows would skew the leaderboard's avg-score-per-model otherwise. Tests: 395 pass. ruff + mypy src/ clean. (Script-level mypy errors at run_scenario.py:171, :175 are pre-existing in untouched code.) Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Plan item A9. The live smoke test reported \$0.044 internal vs \$0.168 actual OpenRouter charge. Internal estimates depend on the public /models pricing, which can diverge from billed cost due to provider routing surcharges, BYOK markups, or stale prices. The tracker would silently fall back to \$0/token if the model wasn't found or the API was unreachable. Now those failure modes are visible, and operators can plug in the actual price they're seeing on their bill. cost_tracker.py: * CostSnapshot gains prompt_price_per_token, completion_price_per_token, and pricing_source (one of: openrouter_api / override / unknown). "unknown" flags that estimated_cost_usd is \$0 because pricing resolution failed — consumers should not aggregate it. * create_cost_tracker() accepts prompt_price_override / completion_price_override (per-token USD); when both are passed, the live fetch is skipped entirely (source="override"). * Logs the resolved prices at INFO ("prompt=\$X.XX/MTok completion= \$Y.YY/MTok source=...") and a WARNING when the fetch resolved to \$0 for a model the user explicitly named. CLI: * --prompt-price-per-mtok / --completion-price-per-mtok on both run_scenario.py and run_benchmark.py. Per-MTok is the standard human-readable unit on OpenRouter's pricing page; the scripts convert to per-token before handing to create_cost_tracker. Tests: 398 pass (+3 new pricing-source / override tests; +1 nemotron-120B reconciliation test that pins the exact math from the docker-benchmark cost recording). ruff + mypy strict clean. Reconciliation against an actual billed amount is still a follow-up: needs an authenticated query to OpenRouter's /credits or per-generation endpoints. For now the override is the unblock. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

CI failed every scenario-loader test on PR #20: pinned save_sha256 values matched the on-disk hash on Windows but not on Linux. Root cause was Git's autocrlf=true converting CRLF<->LF on commit/checkout for the .rws saves (which are XML text). The repo wound up storing LF bytes while my Windows working tree had CRLF, so the SHA-256 differed between OSes and the loader rejected every scenario. .gitattributes now declares *.rws as binary, which disables line-ending translation. Re-staged the 6 save files so the repo stores the same bytes as the working tree (CRLF, as RimWorld writes them on Windows). File sizes grew ~3% as expected. The pinned SHAs in the YAMLs were already the CRLF hashes, so no YAML changes are needed. Cross-platform contract going forward: the bytes in docker/saves/ are the canonical bytes the scenario loader hashes against, regardless of OS. RimWorld instances writing into AppData are still free to use any line endings — the loader compares against docker/saves/, not against the AppData copy. Tests: tests/unit/test_scenario_loader.py — 15 pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…export Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source Restores per-commit history for #20 (originally landed as a single squash commit 61bce01; this merge re-establishes the 9 individual commits + the cross-platform .gitattributes CI fix as separate parents-of-master so they remain bisectable). * Restore per-scenario *_deliberations.jsonl export in run_scenario (A1) * Expand DELIBERATION event payload with actions + summary (A2) * Surface raw LLM output via PROVIDER_CALL events (A3) * Add replay-grade metadata + --seed flag + per-scenario summary JSON (A6.1) * Pin canonical save SHA-256 into scenario YAMLs + loader validation (A6.2) * Per-task LLM timeout via asyncio.wait_for (A7) * Populate benchmark_history.jsonl from run_scenario.py too (A8) * Cost tracker: surface pricing source + manual override flags (A9) * Treat .rws saves as binary so scenario SHA pins are cross-platform (CI fix)

jkbennitt and others added 8 commits May 16, 2026 02:43

Base automatically changed from fix/post-live-test-findings to master June 6, 2026 06:09

jkbennitt and others added 2 commits June 6, 2026 02:11

Merge branch 'master' into feat/restore-deliberations-export

180179a

jkbennitt merged commit 61bce01 into master Jun 6, 2026
3 checks passed

jkbennitt deleted the feat/restore-deliberations-export branch June 6, 2026 06:25

jkbennitt restored the feat/restore-deliberations-export branch June 6, 2026 06:32

This was referenced Jun 9, 2026

Agents must beat unmanaged baseline #6

Open

Create benchmark save files for all 6 scenarios #7

Open

RLE v1.0: Multi-model colony management leaderboard #8

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20
jkbennitt merged 10 commits into
masterfrom
feat/restore-deliberations-export

jkbennitt commented May 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jkbennitt commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test results

Why this matters for the benchmark

Empirical validation (A10 partial run, killed at tick 5)

Known artifact bug (followup)

What's still open in Phase A

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jkbennitt commented May 23, 2026 •

edited

Loading