Skip to content

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20

Merged
jkbennitt merged 10 commits into
masterfrom
feat/restore-deliberations-export
Jun 6, 2026
Merged

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20
jkbennitt merged 10 commits into
masterfrom
feat/restore-deliberations-export

Conversation

@jkbennitt

@jkbennitt jkbennitt commented May 23, 2026

Copy link
Copy Markdown
Member

Summary

Phase A of the observability-first restructuring plan. Nine commits, +969 lines of source, +56 lines of cost-tracker docs/tests. Goal: the next live benchmark run produces a fully reviewable, replayable artifact.

Originally stacked on #19; rebased against master after #19 merged.

# Commit What
A1 Restore per-scenario *_deliberations.jsonl export — was in docker-benchmark (Apr 12), lost in pr17-live. New RLEGameLoop.deliberation_log public property; both run_benchmark.py and run_scenario.py now write the JSONL.
A2 Expand EventType.DELIBERATION payload — events.jsonl now carries the full actions array + summary + reason text (was only latency/confidence/num_actions). Truncation constants _ACTION_REASON_CHARS=200, _PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500 consolidated at module top.
A3 _last_raw_output exported via PROVIDER_CALL events (4 KB head-truncated). Means a weird deliberation is grep-able without keeping the process alive.
A6 Replay-grade metadata: SCORING_VERSION="1.0", file_sha256(), RIMAPI DLL hash via Workshop path lookup, RIMAPI fork commit lookup, --seed CLI flag. Plus save_sha256 on every scenario YAML (all 6 now pinned) + loader validation. scripts/hash_saves.py helper.
A7 Per-task LLM timeout via asyncio.wait_for. Hung deliberation emits deliberation_timeout ERROR event and the tick proceeds. Default 60s. All three deliberation paths covered (parallel, sequential, MapAnalyst).
A8 run_scenario.py now appends to results/benchmark_history.jsonl with run_type: "scenario" so single-scenario runs are distinguishable from full batteries. Skipped when no LLM was actually called (smoke tests).
A9 Cost tracker surfaces pricing_source (openrouter_api / override / unknown) on every snapshot, plus the actual prices used. New --prompt-price-per-mtok / --completion-price-per-mtok CLI overrides for when OpenRouter's /models pricing diverges from billed cost.
CI fix .gitattributes marks *.rws binary so save_sha256 pins are cross-platform. Initial commit had Git's autocrlf=true silently converting CRLF↔LF on commit/checkout, so Windows-pinned SHAs didn't match Linux CI's on-disk bytes. Re-staged the 6 saves so the repo stores CRLF directly (matching what RimWorld writes on Windows); the pinned SHAs in the YAMLs were already the CRLF hashes, no YAML changes needed.

Test results

pytest398 passed (+13 new tests covering deliberation log, raw-output event, save SHA validation, metadata, timeout, pricing source, overrides). ruff + mypy strict clean. CI green (lint + test + smoke-test).

Why this matters for the benchmark

The original review found that pr17-live (the post-PR-#17 live run) had:

After this PR, a fresh single-scenario run produces:

  • <scenario>.csv (per-tick time series)
  • <scenario>_deliberations.jsonl (full agent rationale per tick)
  • <scenario>_summary.json (metadata + outcome + cost + event summary)
  • events.jsonl (every tick event, with actions/reasons/raw LLM output inline)
  • An entry in results/benchmark_history.jsonl tagged run_type: "scenario"

That's the artifact format Phase B (baseline calibration + first real long run) will produce.

Empirical validation (A10 partial run, killed at tick 5)

Ran the new observability stack against a live Nemotron 120B / Crashlanded smoke test before opening this PR. Five ticks of data confirmed:

  • A7 timeout protection works: 26 deliberation_timeout events fired; tick advanced cleanly each time (would have hung indefinitely pre-A7).
  • Active intervention hurts: when agents did get through at tick 3, they executed 6 actions and the composite dropped −0.103 in that single tick. Other ticks (where agents timed out) drifted ~−0.013/tick from RimWorld's natural decay. Clean replication of the GitHub Agents must beat unmanaged baseline #6 "agents lose to baseline" finding, now with per-action observability.
  • PR Surface RIMAPI action errors to agents + fix load-settle race #19's ActionOutcome capture works: 2 RIMAPI 500s in tick 3 (Bionics already-researched, work-priority on disabled skill) were caught with structured reasons.
  • 60s timeout is too aggressive for Nemotron 120B paid tier: observed deliberation latencies 36s / 89s / 130s on tick 0. Bumping the timeout makes the architectural mismatch worse, not better — the next architectural decision (pause vs no-pause + model choice) is what unblocks Phase B.

Known artifact bug (followup)

A7's wait_for cancels the orchestrator's await but the worker thread keeps running (Python threads can't be force-killed). When it eventually completes, it still emits a DELIBERATION event — so a single (tick, agent) can have both an ERROR (timeout) and a DELIBERATION event. The orchestrator's plan list is correct (it returns (agent, None) at the timeout); the artifact just needs a post_timeout: true flag on late-arriving DELIBERATION events so downstream consumers can deduplicate. Tracked as a followup, not a correctness bug.

What's still open in Phase A

  • A10: live smoke comparison run — partially executed (5/10 ticks before kill); the run config itself needs a structural fix before the run is meaningful (current config has agents acting on stale game state because game advances during deliberation).
  • A11: upstream RIMAPI PR to tighten ResponceBuilder.DetermineStatusCode. Low urgency since the RLE-side ActionOutcome workaround from Surface RIMAPI action errors to agents + fix load-settle race #19 is in place.
  • A12: wire pawn-social endpoints (opinions / relations / interactions) into SocialOverseer, filtered to mood<0.4. Should be its own PR.

Test plan

  • pytest (398 pass)
  • ruff check src/ tests/ scripts/ (clean)
  • mypy src/ (clean, 42 source files)
  • python scripts/hash_saves.py --print (round-trips all 6 scenario saves)
  • python scripts/run_scenario.py --help shows --seed, --prompt-price-per-mtok, --completion-price-per-mtok
  • CI lint + test + smoke-test green
  • Live partial run validates A1-A8 produce the expected artifacts

🤖 Generated with Claude Code

jkbennitt and others added 8 commits May 16, 2026 02:43
Plan item A1. docker-benchmark (Apr 12) wrote *_deliberations.jsonl per
scenario via run_benchmark.py; pr17-live (run via run_scenario.py)
silently dropped the export, so post-PR-#17 runs lost the per-tick
agent rationale (actions, reasons, summary). Re-add it via a public
deliberation_log property on RLEGameLoop so both scripts use the same
contract.

- RLEGameLoop.deliberation_log: public property returning a shallow copy
  of the per-tick records (tick, agent, status, plus actions/summary/
  confidence for success and raw/reason for failures).
- run_scenario.py: writes <scenario>_deliberations.jsonl alongside the
  CSV when --output is given, off-thread via asyncio.to_thread.
- run_benchmark.py: switched to the public property (was reading the
  private _deliberation_log attribute).
- Integration test asserts record count and shallow-copy semantics.

Tests: 383 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A2. The DELIBERATION events emitted to events.jsonl previously
carried only latency_ms / confidence / num_actions; the rich payload
(actions array with type/target/priority/reason, plus the plan summary)
lived only in the in-memory _deliberation_log mirror. Now both writers
emit the same data so events.jsonl is self-contained for downstream
analysis without needing to cross-reference *_deliberations.jsonl.

Also surfaces the raw LLM content (first 500 chars) on parse_failure
ERROR events, matching what _deliberation_log already records.

Extracted three truncation constants (_ACTION_REASON_CHARS=200,
_PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500) so the two
writers share the same shape and the limits are documented in one place.

Tests: 384 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A3. _last_raw_output (the full LLM completion text) was stored
in-memory on each agent but never exported, so post-hoc analysis of a
weird deliberation required keeping the process alive. Now every
PROVIDER_CALL event carries the agent's raw completion text (head-
truncated to 4 KB) plus a raw_output_truncated bool so consumers know
whether the JSONL has the full response or just a prefix.

PROVIDER_CALL is the right semantic home — it's the provider's response
to the call we already log token counts for. The truncation length is
configurable via _RAW_OUTPUT_CHARS (matches the other event-log
truncation constants from A2). 4 KB covers typical action-plan JSON
plus a reasoning preamble.

Tests: 385 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A6 (commit 1 of 2). Surfaces the metadata a benchmark replay
kit needs: a versioned scoring identifier, the deployed RIMAPI DLL hash,
the RIMAPI fork commit, and an explicit seed for RLE-side stochasticity.
Scenario save_sha256 + loader validation lands in a follow-up.

metadata.py:
  * SCORING_VERSION constant ("1.0") — bumped when DEFAULT_WEIGHTS,
    metric implementations, or composite math change in a way that makes
    older runs not directly comparable. Surfaced in every run summary.
  * file_sha256(path): stdlib-only chunked SHA-256, returns None on
    missing/unreadable so the metadata dict stays serializable.
  * _rimapi_dll_path(): env override ($RIMAPI_DLL_PATH) → Workshop
    default (Steam path). Hashed via file_sha256.
  * _rimapi_fork_commit(): env override ($RIMAPI_FORK_PATH) → sibling
    ../RIMAPI checkout. Empty string when not reachable.
  * collect_metadata(random_seed=None): threads the seed through so the
    summary records what was actually set. Docstring notes that the
    seed only affects RLE-side randomness, not RimWorld's internal RNG.

CLI:
  * Both run_benchmark.py and run_scenario.py gain --seed INT. When
    set, random.seed() is called before any work and the value is
    threaded through to collect_metadata() at every summary write.

run_scenario.py:
  * Writes <scenario>_summary.json alongside the CSV/deliberations,
    including the metadata block, model/provider config, outcome,
    final composite, ticks_run, cost_snapshot, and event_summary.
    The new helper _build_run_summary keeps main() from getting longer.

Tests: 391 pass (+6 new metadata tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A6 (commit 2 of 2). Live game runs can silently drift from the
canonical docker/saves/ mirror (RimWorld saves into AppData on Windows;
nothing forces it to match the repo). When the agent + scoring code are
working off a different save than the one the leaderboard tracks, every
result is uncomparable. Pin and verify.

schema.py:
  * ScenarioConfig.save_sha256: str | None — optional pinned hash of
    the docker/saves/<save_name>.rws file.

loader.py:
  * ScenarioSaveMismatchError — distinct exception so callers can
    handle this specifically (e.g., a CLI flag --allow-unpinned).
  * canonical_save_path(save_name) — resolves to <repo>/docker/saves/.
  * load_scenario(path, allow_unpinned=False) — if save_sha256 is set
    AND the canonical save exists on disk AND the actual SHA doesn't
    match, raises. allow_unpinned=True is the intentional escape hatch.
    Short-circuits silently when the canonical file is missing (CI
    without docker volumes) — that's a separate failure mode caught
    at game-load time.

scripts/hash_saves.py:
  * New helper that hashes every canonical save and updates the matching
    scenario YAML's save_sha256 field. --print for dry-run.

definitions/*.yaml:
  * All 6 scenarios now carry pinned save_sha256 fields (the actual
    hashes of the current docker/saves/*.rws files). Re-run
    scripts/hash_saves.py after rebuilding saves to re-pin.

Tests: 394 pass (+3 new SHA validation tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A7. A single hung LLM call previously blocked the entire tick
because asyncio.gather has no per-task timeout. Now each deliberation
runs under a wait_for(role_timeout_s) wrapper; on timeout the orchestrator
emits a deliberation_timeout ERROR event, records the agent's empty
contribution in the deliberation log, and proceeds. The hung thread keeps
running in the background until the provider's own timeout cleans it up
(Python threads can't be force-killed) but the tick advances regardless.

Why it matters for the benchmark: cheap and premium models can be in the
same matrix only if tail latency is bounded. Without this, a stuck
30B/120B request stalls every other agent's runtime measurement.

config.py:
  * RLEConfig.role_timeout_s: float = 60.0 (~8x the 7s avg deliberation
    observed in docker-benchmark; tunable via .env or CLI override).

game_loop.py:
  * _deliberate_agent_with_timeout(): new async wrapper, runs
    _deliberate_agent in a worker thread under asyncio.wait_for.
    Emits EventType.ERROR with error_type=deliberation_timeout +
    timeout_s on TimeoutError; returns (agent, None).
  * _deliberate_parallel(): collapsed the inline _run closure into the
    new wrapper.
  * _deliberate_sequential(): is now async and awaits the wrapper.
    Sequential-mode dispatch in run_tick() now awaited.
  * MapAnalyst dispatch also goes through the timeout wrapper.

Test: TestMultiAgent.test_role_timeout_emits_event_and_other_agents_continue
injects a 5s-sleep provider with role_timeout_s=0.05; asserts 7
deliberation_timeout events fire, merged plan has 0 actions, tick
completes successfully.

Tests: 395 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A8. run_benchmark.py already calls append_history (line 711);
run_scenario.py never did. Combined with the fact that the user has been
doing live smoke testing via run_scenario.py (not run_benchmark.py),
results/benchmark_history.jsonl stayed at 1 byte even though dozens of
runs were persisted in per-run subdirectories.

Now run_scenario.py appends one history entry per --output run, tagged
run_type: "scenario" so consumers can distinguish single-scenario runs
from full benchmark batteries (which would tag run_type: "benchmark"
when run_benchmark.py is updated; that's a separate trivial follow-up).

Skips the history append when cost_tracker.num_calls == 0 (smoke tests
where the LLM was never reached, e.g. today's broken nano-4b run that
generated parse failures across the board because LM Studio was down).
Those rows would skew the leaderboard's avg-score-per-model otherwise.

Tests: 395 pass. ruff + mypy src/ clean. (Script-level mypy errors at
run_scenario.py:171, :175 are pre-existing in untouched code.)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Plan item A9. The live smoke test reported \$0.044 internal vs \$0.168
actual OpenRouter charge. Internal estimates depend on the public
/models pricing, which can diverge from billed cost due to provider
routing surcharges, BYOK markups, or stale prices. The tracker would
silently fall back to \$0/token if the model wasn't found or the API
was unreachable. Now those failure modes are visible, and operators
can plug in the actual price they're seeing on their bill.

cost_tracker.py:
  * CostSnapshot gains prompt_price_per_token, completion_price_per_token,
    and pricing_source (one of: openrouter_api / override / unknown).
    "unknown" flags that estimated_cost_usd is \$0 because pricing
    resolution failed — consumers should not aggregate it.
  * create_cost_tracker() accepts prompt_price_override /
    completion_price_override (per-token USD); when both are passed,
    the live fetch is skipped entirely (source="override").
  * Logs the resolved prices at INFO ("prompt=\$X.XX/MTok completion=
    \$Y.YY/MTok source=...") and a WARNING when the fetch resolved
    to \$0 for a model the user explicitly named.

CLI:
  * --prompt-price-per-mtok / --completion-price-per-mtok on both
    run_scenario.py and run_benchmark.py. Per-MTok is the standard
    human-readable unit on OpenRouter's pricing page; the scripts
    convert to per-token before handing to create_cost_tracker.

Tests: 398 pass (+3 new pricing-source / override tests; +1
nemotron-120B reconciliation test that pins the exact math from
the docker-benchmark cost recording). ruff + mypy strict clean.

Reconciliation against an actual billed amount is still a follow-up:
needs an authenticated query to OpenRouter's /credits or per-generation
endpoints. For now the override is the unblock.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Base automatically changed from fix/post-live-test-findings to master June 6, 2026 06:09
jkbennitt and others added 2 commits June 6, 2026 02:11
CI failed every scenario-loader test on PR #20: pinned save_sha256 values
matched the on-disk hash on Windows but not on Linux. Root cause was
Git's autocrlf=true converting CRLF<->LF on commit/checkout for the .rws
saves (which are XML text). The repo wound up storing LF bytes while my
Windows working tree had CRLF, so the SHA-256 differed between OSes and
the loader rejected every scenario.

.gitattributes now declares *.rws as binary, which disables line-ending
translation. Re-staged the 6 save files so the repo stores the same
bytes as the working tree (CRLF, as RimWorld writes them on Windows).
File sizes grew ~3% as expected. The pinned SHAs in the YAMLs were
already the CRLF hashes, so no YAML changes are needed.

Cross-platform contract going forward: the bytes in docker/saves/ are
the canonical bytes the scenario loader hashes against, regardless of
OS. RimWorld instances writing into AppData are still free to use any
line endings — the loader compares against docker/saves/, not against
the AppData copy.

Tests: tests/unit/test_scenario_loader.py — 15 pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jkbennitt jkbennitt merged commit 61bce01 into master Jun 6, 2026
3 checks passed
@jkbennitt jkbennitt deleted the feat/restore-deliberations-export branch June 6, 2026 06:25
@jkbennitt jkbennitt restored the feat/restore-deliberations-export branch June 6, 2026 06:32
jkbennitt added a commit that referenced this pull request Jun 6, 2026
…export

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source

Restores per-commit history for #20 (originally landed as a single squash commit
61bce01; this merge re-establishes the 9 individual commits + the cross-platform
.gitattributes CI fix as separate parents-of-master so they remain bisectable).

* Restore per-scenario *_deliberations.jsonl export in run_scenario (A1)
* Expand DELIBERATION event payload with actions + summary (A2)
* Surface raw LLM output via PROVIDER_CALL events (A3)
* Add replay-grade metadata + --seed flag + per-scenario summary JSON (A6.1)
* Pin canonical save SHA-256 into scenario YAMLs + loader validation (A6.2)
* Per-task LLM timeout via asyncio.wait_for (A7)
* Populate benchmark_history.jsonl from run_scenario.py too (A8)
* Cost tracker: surface pricing source + manual override flags (A9)
* Treat .rws saves as binary so scenario SHA pins are cross-platform (CI fix)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant