fix(chat): bound CLI runner with idle watchdog + clear stuck chat spinner (#271 a+b) #276

Open
ProfSynapse wants to merge 2 commits into main from fix/issue-271-chat-loading-state

@ProfSynapse

Owner

Summary

Two provider-agnostic chat-state bugs surfaced via #271 (the AGY work) and were verified to affect every provider, not just gemini/AGY. This PR covers only the chat-state half of #271; the gemini→Antigravity (agy) runtime re-route is a separate follow-up.

Re: #271 (partial — chat-loading-state fixes a + b). Issue intentionally left open for the AGY re-route + manual verification.

Fixes

(a) Unbounded CLI process → idle watchdog

runCliProcess (src/utils/cliProcessRunner.ts) resolved only on the child's close/error events, so a CLI that streamed progress but never exited hung the awaiter forever.

  • Added a configurable idle/inactivity watchdog (default 120s, re-armed on every stdout/stderr chunk so legitimate long streaming runs are never cut; 0 disables).
  • On fire: best-effort kill + resolves with errorCode: 'PROVIDER_TIMEOUT' — the resolve-only contract is preserved (no caller is converted to handle a rejection). The LLMProviderError(PROVIDER_TIMEOUT) is constructed in the gemini adapter's mapCliProcessFailure, not the runner. No --print-timeout arg added.
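The re-arming behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/utils/cliProcessRunner.ts: `IdleWatchdog`, `Scheduler`, `realScheduler`, and `DEFAULT_IDLE_TIMEOUT_MS` are hypothetical names introduced here. Only the 120s default, the re-arm on every output chunk, and "0 disables" come from this PR.

```typescript
// Hypothetical sketch of an idle (inactivity) watchdog. Names are illustrative
// and are NOT taken from the actual cliProcessRunner implementation.

/** Minimal timer abstraction so the watchdog can be driven by fake timers. */
interface Scheduler {
  set(callback: () => void, delayMs: number): unknown;
  clear(handle: unknown): void;
}

const realScheduler: Scheduler = {
  set: (callback, delayMs) => setTimeout(callback, delayMs),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// 120s of silence, matching the default stated in the PR description.
const DEFAULT_IDLE_TIMEOUT_MS = 120000;

class IdleWatchdog {
  private handle: unknown = null;
  private finished = false;
  private readonly onIdle: () => void;
  private readonly idleTimeoutMs: number;
  private readonly scheduler: Scheduler;

  constructor(
    onIdle: () => void,
    idleTimeoutMs: number = DEFAULT_IDLE_TIMEOUT_MS,
    scheduler: Scheduler = realScheduler,
  ) {
    this.onIdle = onIdle;
    this.idleTimeoutMs = idleTimeoutMs;
    this.scheduler = scheduler;
  }

  /** Start or restart the countdown: call once at spawn and on every output chunk. */
  touch(): void {
    // A timeout of 0 disables the watchdog; a fired or stopped watchdog never re-arms.
    if (this.finished || this.idleTimeoutMs <= 0) return;
    this.cancelTimer();
    this.handle = this.scheduler.set(() => {
      this.finished = true;
      this.handle = null;
      this.onIdle();
    }, this.idleTimeoutMs);
  }

  /** Stop permanently, e.g. when the child emits close or error. */
  stop(): void {
    this.finished = true;
    this.cancelTimer();
  }

  private cancelTimer(): void {
    if (this.handle !== null) {
      this.scheduler.clear(this.handle);
      this.handle = null;
    }
  }
}
```

In a runner built this way, `touch()` would be called right after spawning and inside the child's stdout/stderr `data` handlers, and `stop()` on `close`/`error`. The `onIdle` callback would kill the child on a best-effort basis and resolve (not reject) the runner's promise with `errorCode: 'PROVIDER_TIMEOUT'`, which keeps the resolve-only contract intact.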

(b) Stuck chat spinner on empty/error-before-first-token

The assistant placeholder inits isLoading: true and was only cleared inside the first-token branch — so empty-complete, the post-loop safety-net, and non-abort errors left the spinner frozen forever.

  • isLoading: false is now set on the empty-complete branch + post-loop safety-net (MessageStreamHandler).
  • New AbortHandler.finalizeErroredPlaceholder (sets state: 'invalid', clears isLoading, preserves partial content) is called from both MessageManager non-abort catch branches.
  • The genuine-abort path is untouched (byte-for-byte vs main), and this is distinct from the manager-level setLoading input spinner.
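To make the intended end state concrete, here is a minimal sketch of what a helper like finalizeErroredPlaceholder might do. The `ChatMessage` shape, the `MessageState` values other than `'invalid'`, and the function signature are assumptions for illustration; the real AbortHandler method may differ.

```typescript
// Hypothetical message shape: only isLoading and state: 'invalid' are named in
// this PR. The other fields and state values are assumptions.
type MessageState = "draft" | "streaming" | "complete" | "invalid";

interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  content: string;
  isLoading: boolean;
  state: MessageState;
}

/**
 * Sketch of a finalizeErroredPlaceholder-style helper: marks the assistant
 * placeholder as invalid and clears its spinner, leaving any partial content
 * untouched. Returns false when no matching placeholder exists.
 */
function finalizeErroredPlaceholder(
  messages: ChatMessage[],
  placeholderId: string,
): boolean {
  const placeholder = messages.find(
    (m) => m.id === placeholderId && m.role === "assistant",
  );
  if (!placeholder) return false;

  placeholder.isLoading = false; // stop the per-message spinner
  placeholder.state = "invalid"; // flag the message as errored
  // content is deliberately not modified, so partial output survives
  return true;
}
```

Because the helper only touches `isLoading` and `state`, calling it after a mid-stream error leaves the already-streamed text in place, and calling it when the spinner was already cleared changes nothing on the loading flag.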

Scope note — please read

The auth-status checks now also inherit the 120s idle watchdog (previously unbounded); both callers degrade gracefully on timeout.

Known residual (tracked as a follow-up issue)

AnthropicClaudeCodeAdapter's streaming generation spawns its process directly (bypassing runCliProcess), so a wedged, never-closing Claude Code process is guarded by neither fix. The chat-state clear still cures the common terminal-but-no-first-token case for all providers. A separate issue will be filed to bring Claude Code's own spawn under an idle timeout.

Tests

New: idle-watchdog fires → PROVIDER_TIMEOUT, watchdog resets on output (no mid-stream cut), isLoading cleared on each terminal-without-first-token path, and 3 added in review covering the MessageManager → finalizeErroredPlaceholder wiring through the real catch branches (both call sites).

Verification

  • npm run build clean · npm run lint clean
  • Full jest: 3713 pass / 21 skip / 1 fail — the lone failure is the pre-existing, documented TaskBoardEditCoordinator jsdom-Modal issue (untouched files; not a regression).

Review

Peer-reviewed (APPROVE, 0 blocking): docs/review/pr271-chat-loading-state-2026-06-22.md, docs/review/pr271-fe-chat-state-lifecycle-2026-06-22.md, docs/review/test-coverage-271-272-2026-06-22.md.

🤖 Generated with Claude Code

ProfSynapse and others added 2 commits June 22, 2026 11:04
…nner

Two provider-agnostic fixes for the "chat spins forever" report (issue #271).
Both root causes were verified in current code, not just gemini/AGY.

(a) cliProcessRunner had no timeout — a CLI that emits progress but never
exits hung the awaiter forever. Add an INACTIVITY (idle) watchdog to the
shared runner: the timer resets on every stdout/stderr chunk, so a streaming
process is never cut off, but a wedged/silent one is killed and the promise
RESOLVES with errorCode PROVIDER_TIMEOUT (resolve-only contract preserved, so
auth-status callers don't break). Default 120s of silence, configurable;
0 disables. GoogleGeminiCliAdapter.mapCliProcessFailure maps the code to a
user-visible LLMProviderError.
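
The split between a resolve-only runner and an adapter-side error mapping could look something like the sketch below. All shapes here are assumptions: `CliProcessResult`, the local `LLMProviderError` class, the provider string, the message text, and the `PROVIDER_ERROR` fallback code are hypothetical stand-ins, and the real mapCliProcessFailure may throw instead of returning.

```typescript
// Hypothetical result shape for a resolve-only CLI runner (illustrative only).
interface CliProcessResult {
  exitCode: number | null;
  stdout: string;
  stderr: string;
  errorCode?: string; // e.g. 'PROVIDER_TIMEOUT' when the idle watchdog fired
}

// Local stand-in for the project's LLMProviderError; the real class may differ.
class LLMProviderError extends Error {
  readonly provider: string;
  readonly code: string;

  constructor(message: string, provider: string, code: string) {
    super(message);
    this.name = "LLMProviderError";
    this.provider = provider;
    this.code = code;
  }
}

/**
 * Sketch of an adapter-side mapping: the runner never rejects, so the adapter
 * inspects the resolved result and builds a user-visible error when needed.
 * Returns null when the process succeeded.
 */
function mapCliProcessFailure(result: CliProcessResult): LLMProviderError | null {
  if (result.errorCode === "PROVIDER_TIMEOUT") {
    return new LLMProviderError(
      "The CLI produced no output for too long and was stopped.",
      "google-gemini-cli", // assumed provider identifier
      "PROVIDER_TIMEOUT",
    );
  }
  if (result.exitCode !== 0) {
    const detail = result.stderr.trim() || `CLI exited with code ${result.exitCode}`;
    return new LLMProviderError(detail, "google-gemini-cli", "PROVIDER_ERROR"); // assumed fallback code
  }
  return null;
}
```

Keeping the error construction in the adapter means callers that only need a status, such as the auth-status checks, can read `errorCode` from the resolved result without having to catch anything.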

(b) The assistant placeholder inits isLoading:true and it was only cleared on
the first streamed token, so any terminal path that produced no token left the
spinner stuck. Clear isLoading on all three stuck paths: the empty-complete
final branch and the post-loop safety net in MessageStreamHandler, and the
non-abort error path via a new AbortHandler.finalizeErroredPlaceholder() called
from both MessageManager catch blocks. The genuine-abort path is untouched.
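
The two stream-handler paths can be illustrated with a simplified loop. This is a sketch under assumptions, not MessageStreamHandler itself: `StreamChunk`, `Placeholder`, and `consumeStream` are hypothetical names, and a synchronous iterable stands in for what is presumably an async stream.

```typescript
// Hypothetical chunk and placeholder shapes, for illustration only.
interface StreamChunk {
  content: string;
  complete: boolean;
}

interface Placeholder {
  content: string;
  isLoading: boolean;
}

/**
 * Simplified stream consumer showing where isLoading is cleared:
 *  1. on the first token (the behaviour that already existed),
 *  2. on a complete chunk that arrives before any token (empty-complete),
 *  3. after the loop if the stream ends without a complete chunk (safety net).
 */
function consumeStream(chunks: Iterable<StreamChunk>, placeholder: Placeholder): void {
  let sawFirstToken = false;

  for (const chunk of chunks) {
    if (chunk.content.length > 0) {
      if (!sawFirstToken) {
        sawFirstToken = true;
        placeholder.isLoading = false; // path 1: first streamed token
      }
      placeholder.content += chunk.content;
    }
    if (chunk.complete) {
      placeholder.isLoading = false; // path 2: covers empty-complete
      return;
    }
  }

  placeholder.isLoading = false; // path 3: post-loop safety net
}
```

With only path 1 in place, a stream that completes with no tokens, or ends without ever signalling completion, would leave `isLoading` at its initial `true`, which is the frozen spinner described above.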

Scope note: AnthropicClaudeCodeAdapter's streaming generation spawns directly
and bypasses the shared runner, so its never-closing case is caught by neither
fix — tracked as a follow-up. The shared-runner timeout still guards gemini
generation and both CLI auth-status checks; the chat-state clear cures the
common terminal-but-no-first-token case for all providers.

Tests: cliProcessRunner idle-watchdog (fire/reset/disable/default), AbortHandler
finalizeErroredPlaceholder, MessageStreamHandler empty-complete + safety-net
isLoading clear, gemini PROVIDER_TIMEOUT mapping. Full suite green except the
pre-existing TaskBoardEditCoordinator jsdom-Modal failure (unrelated).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01NxKeRz1gihguL9wcidm78m

Peer-review of 292bc51 flagged that finalizeErroredPlaceholder was only
exercised in isolation (AbortHandler.test.ts) — the WIRING at the two new
MessageManager catch sites was untested, so the mock hid the seam.

Adds three MessageManager-level tests driving a real non-abort error through
the generation path and asserting the placeholder END STATE via the actual
catch-branch wiring (not by calling finalizeErroredPlaceholder directly):
- send path, error before first token: placeholder ends isLoading:false,
  state:'invalid'.
- send path, error mid-stream: partial content preserved, isLoading:false
  (first-token path already cleared the spinner, so finalize is a no-op).
- regenerate path (handleRetryMessage -> regenerateAIResponse ->
  generateFreshAIResponse): covers the second new call site.

Test-only; no production-code change. Full suite 3713 pass / 21 skip; the lone
TaskBoardEditCoordinator jsdom-Modal failure is the known pre-existing one.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01NxKeRz1gihguL9wcidm78m