Strict action-plan schema + alias resolution in parse filter#24
Merged
Live Fable 5 preflights exposed two silent measurement gaps:

- A schema-violating but valid JSON response (model-invented shape with no "actions" key) parsed into an empty plan with no error event. The correction retry never fired, and a schema violation was indistinguishable from a deliberate no-op. Missing "actions" now raises ActionPlanParseError whose message carries the exact expected schema (the retry prompt embeds it), with the shape extracted to _ACTION_PLAN_SCHEMA_EXAMPLE as single source of truth.
- The per-action ALLOWED_ACTIONS filter checked raw action_type names before alias resolution, so set_work_priority was dropped even though ACTION_TYPE_ALIASES maps it to the allowed work_priority. The filter now accepts either the raw name or its catalog alias; added create_growing_zone / create_stockpile_zone aliases observed live.

Plus two prompt/observability changes informed by the same preflights: a recency-positioned RESPONSE FORMAT reminder with the role's literal valid action_type list at the end of every user prompt (Fable through claude -p drifts into markdown reports when the schema sits ~8KB back in the system prompt; A/B verified), and PROVIDER_CALL raw_output capture bumped 4KB -> 16KB so frontier-length completions aren't clipped.

Preflight results (2 ticks, Fable 5, live game):

- Before: 0 actions, silent.
- After: all 7 agents proposing (33 actions tick 0), 11 executed across both ticks, zero parse retries, and the DO-NOT-REPEAT loop visibly correcting a failed research_target between ticks.

Co-Authored-By: Claude Fable 5 <[email protected]>
Why
Live 2-tick Fable 5 preflights (via the new claude-code provider) exposed silent measurement gaps: 14 successful LLM calls produced zero actions and zero error events. Two distinct holes:
- Fable returned valid JSON in a model-invented shape with no "actions" key (e.g. threat_assessment/defensive_actions). data.get("actions", []) swallowed the missing key: no parse error, no correction retry, confidence 0.0, and the benchmark silently measured nothing.
- Even when a response did include actions, every entry was dropped: the ALLOWED_ACTIONS check ran on raw names before alias resolution, so set_work_priority was rejected despite our own ACTION_TYPE_ALIASES mapping it to the allowed work_priority.

What
- Missing "actions" key → ActionPlanParseError carrying the exact expected schema (feeds the existing correction-retry prompt). Explicit actions: [] remains valid (deliberate no-op). Schema example extracted to _ACTION_PLAN_SCHEMA_EXAMPLE (single source of truth for system prompt, reminder, and error message).
- The ALLOWED_ACTIONS filter now accepts either the raw name or its catalog alias; added create_growing_zone/create_stockpile_zone aliases observed live.
- RESPONSE FORMAT reminder + the role's literal valid action_type list appended to every user prompt (recency position). A/B verified live: without it Fable-via-claude-p writes markdown reports; with it, compliant JSON.
- _RAW_OUTPUT_CHARS 4096 → 16384: Fable's completions are multi-KB and the verbatim transcripts are first-class artifacts.

Preflight evidence (live game, seed 42, 2 ticks each)
Bonus: tick 0's research_target failed ("ShieldBelt not available"), the DO-NOT-REPEAT broadcast fired, and tick 1's research_target succeeded. This is the first live confirmation that the PR #19 feedback loop changes behavior.

423 tests pass (4 new), ruff clean, mypy strict clean.
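The alias-aware filter described under What could look something like the sketch below. It is illustrative only: the function name `filter_allowed` and the contents of both tables are assumptions (the only mapping taken from the description is set_work_priority to work_priority), and whether the real filter rewrites names or leaves that to a later step is not stated, so this version passes accepted actions through unchanged.

```python
from typing import Any

# Illustrative tables; the real catalog is larger and its contents are not shown here.
ACTION_TYPE_ALIASES: dict[str, str] = {"set_work_priority": "work_priority"}
ALLOWED_ACTIONS: frozenset[str] = frozenset({"work_priority"})


def filter_allowed(
    actions: list[dict[str, Any]],
    allowed: frozenset[str] = ALLOWED_ACTIONS,
    aliases: dict[str, str] = ACTION_TYPE_ALIASES,
) -> list[dict[str, Any]]:
    """Keep actions whose raw name, or the name it aliases to, is allowed."""
    kept: list[dict[str, Any]] = []
    for action in actions:
        raw_name = action.get("action_type")
        if not isinstance(raw_name, str):
            continue  # malformed entry: no usable action_type
        # Resolve the alias before checking, instead of checking raw names only.
        canonical = aliases.get(raw_name, raw_name)
        if raw_name in allowed or canonical in allowed:
            kept.append(action)
    return kept
```

Under these assumed tables, an entry named set_work_priority survives the filter alongside work_priority, while an unknown name with no alias is still dropped.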
🤖 Generated with Claude Code