fix(cli): recover from context-limit errors by compacting + retrying #86
Merged
The bug
A Codex CLI run could fail outright with:
The conversation then surfaced the error to the user. Re-sending the message sometimes appeared to recover, but only incidentally (a later run might cross the proactive-compaction threshold, or the failed run's empty placeholder got discarded) — there was no deliberate recovery.
Root cause
Context-window recovery existed only on the HTTP provider path (`engine.rs:318`: on a context-limit provider error it compacts with `CompactionTrigger::ErrorRecovery` and retries once). The CLI path (`local_agent.rs`, shared by Codex / Claude Code / OpenCode) had no equivalent branch. Its only retry was for lost CLI sessions (`no rollout found`, `thread/resume failed`), which does not match a context-window error.

Claude Code rarely surfaces this because it auto-compacts internally before CLAI sees an error; Codex returns the failure to us instead. But the gap was provider-agnostic: any CLI provider returning a context-limit error would have failed the same way.
The fix (provider-agnostic, `local_agent.rs` only)
Inside the existing bounded CLI retry loop, add a second recovery branch: on a `LocalAgentRunError::Failed` whose message matches `compaction::is_context_limit_error`, do exactly what the HTTP path does: compact with `CompactionTrigger::ErrorRecovery` (force), emit a `SessionCompacted` event, and retry.

Bounded: the context-recovery retry, like the session-lost retry, fires at most once (it has its own boolean flag), for a maximum of 2 attempts. Usage/rate-limit errors and ordinary CLI failures are unaffected and still surface.
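The control flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in `local_agent.rs`: the `Step` enum, `run_with_recovery`, the closure-based runner, and the single-needle classifiers are all assumptions made for the example, and real compaction and event emission are replaced by recorded steps.

```rust
/// Hypothetical stand-in for the CLI run error type named in this PR.
#[derive(Debug, Clone, PartialEq)]
enum LocalAgentRunError {
    Failed(String),
}

/// Simplified stand-in classifiers; the needles are assumptions.
fn is_session_lost_error(msg: &str) -> bool {
    let m = msg.to_lowercase();
    m.contains("no rollout found") || m.contains("thread/resume failed")
}

fn is_context_limit_error(msg: &str) -> bool {
    msg.to_lowercase().contains("context window")
}

/// Recovery actions the loop took, recorded so the flow is observable.
#[derive(Debug, PartialEq)]
enum Step {
    ResetSession,
    CompactedForErrorRecovery,
}

/// Runs `run_cli`, allowing each recovery branch to fire at most once.
fn run_with_recovery<F>(mut run_cli: F) -> (Result<String, LocalAgentRunError>, Vec<Step>)
where
    F: FnMut(u32) -> Result<String, LocalAgentRunError>,
{
    let mut steps = Vec::new();
    let mut session_retry_used = false;
    let mut context_retry_used = false;
    let mut attempt = 0;
    loop {
        attempt += 1;
        match run_cli(attempt) {
            Ok(out) => return (Ok(out), steps),
            Err(LocalAgentRunError::Failed(msg)) => {
                if !session_retry_used && is_session_lost_error(&msg) {
                    // Existing branch: the CLI session was lost, start fresh.
                    session_retry_used = true;
                    steps.push(Step::ResetSession);
                    continue;
                }
                if !context_retry_used && is_context_limit_error(&msg) {
                    // New branch: force a compaction (the real code would also
                    // emit a SessionCompacted event here), then retry.
                    context_retry_used = true;
                    steps.push(Step::CompactedForErrorRecovery);
                    continue;
                }
                // Usage limits, generic failures, or a repeated error surface.
                return (Err(LocalAgentRunError::Failed(msg)), steps);
            }
        }
    }
}
```

Because each branch is guarded by its own flag, a run that keeps hitting the context limit stops after the second attempt and surfaces the error instead of looping.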
Tests
Classifier unit tests for all three CLI providers (Codex / Claude Code / OpenCode), positive (context-limit phrasings recover) and negative (usage-limit, session-lost, generic exit codes do not trigger context recovery).
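A table-driven shape for these classifier tests might look like the sketch below. The needle list, the two-argument signature, the provider labels, and every message string are hypothetical; the actual phrasings matched by `compaction::is_context_limit_error` are not shown in this PR.

```rust
/// Hypothetical needles, for illustration only.
const CONTEXT_LIMIT_NEEDLES: &[&str] = &[
    "context window",
    "context length",
    "context limit",
];

/// Sketch of a provider-agnostic classifier. `_provider_runtime` is unused
/// here; it is kept only to leave room for per-provider needles.
fn is_context_limit_error(_provider_runtime: &str, message: &str) -> bool {
    let lowered = message.to_lowercase();
    CONTEXT_LIMIT_NEEDLES
        .iter()
        .any(|needle| lowered.contains(*needle))
}

/// (provider, message, expected) cases; all message texts are made up.
fn classifier_cases() -> Vec<(&'static str, &'static str, bool)> {
    vec![
        // Positive: context-limit phrasings should trigger recovery.
        ("codex", "Ran out of room in the model's context window", true),
        ("claude-code", "Prompt exceeds the maximum context length", true),
        ("opencode", "Request rejected: context limit reached", true),
        // Negative: usage-limit, session-lost, and generic exit codes.
        ("codex", "You have hit your usage limit", false),
        ("codex", "no rollout found for thread", false),
        ("opencode", "process exited with code 1", false),
    ]
}
```

Iterating over `classifier_cases()` and asserting that the classifier returns the expected flag for each row covers the positive and negative sides in a single test.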
Verification
`cargo fmt --check`, `cargo clippy -- -D warnings` clean.
Independent review
Static review verdict: production_quality, with no blocker/major findings. Non-blocking minors noted (dead `_provider_runtime` param kept for symmetry with `is_session_lost_error` and future per-provider needles; recovery pattern now mirrored in HTTP + CLI paths, flagged for a possible future shared helper).
Note
Needs a Codex run that actually overflows the context window to confirm end-to-end; the classifier + retry wiring is unit-tested and mirrors the proven HTTP path.