
Recover from transient stream failures: provider auto-retry + error/retry UI #135

Open
oratis wants to merge 2 commits into main from claude/silly-elion-7babb6

@oratis oratis commented Jun 23, 2026

Owner

What & why

Diagnosed a local Lisa failure: a chat turn died with [error] request ended without sending any chunks (twice), and the same error hit [idle] reflection in backend.log. The Anthropic SDK throws this when a streaming response opens (HTTP 200) but the SSE body closes before any event arrives — here, the Clash/mihomo proxy (127.0.0.1:7897) tearing down an idle CONNECT tunnel during a long time-to-first-byte. The SDK's own retries don't cover it (thrown while iterating a stream that already "succeeded"), so a momentary blip became a hard, user-visible error with no recovery.

Two complementary layers of fix:

1. Provider auto-retry (silent, proxy-agnostic) — fix(providers)

  • New shared withStreamRetry() (stream-retry.ts) wired into all three providers (anthropic / openai / gemini).
  • Retries the empty-stream error + connection resets / premature closes, only while no delta has been forwarded yet — so already-streamed output is never duplicated, and a user abort is never retried.
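
For illustration, the "retry only before the first delta" rule above could look roughly like the sketch below. This is not the code in stream-retry.ts: the option names, defaults, backoff, and the exact error patterns matched are all assumptions.

```typescript
// Hypothetical sketch; signatures, defaults, and matched errors are assumptions.
type StreamFactory<T> = (attempt: number) => AsyncIterable<T>;

interface RetryOptions {
  maxRetries?: number;  // extra attempts after the first (assumed default: 2)
  baseDelayMs?: number; // linear backoff base (assumed default: 250)
  signal?: AbortSignal; // user abort; an aborted turn is never retried
}

// Assumed classifier: the empty-stream message quoted above plus common reset codes.
function isRetryableStreamError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  if (err.name === "AbortError") return false;
  const code = (err as { code?: unknown }).code;
  if (code === "ECONNRESET" || code === "EPIPE" || code === "ERR_STREAM_PREMATURE_CLOSE") {
    return true;
  }
  return /ended without sending any chunks|premature close|socket hang up/i.test(err.message);
}

async function* withStreamRetry<T>(
  open: StreamFactory<T>,
  opts: RetryOptions = {},
): AsyncGenerator<T> {
  const { maxRetries = 2, baseDelayMs = 250, signal } = opts;
  for (let attempt = 0; ; attempt++) {
    let forwarded = false; // flips once any delta has reached the caller
    try {
      for await (const delta of open(attempt)) {
        forwarded = true;
        yield delta;
      }
      return; // stream finished normally
    } catch (err) {
      const canRetry =
        !forwarded &&
        !signal?.aborted &&
        attempt < maxRetries &&
        isRetryableStreamError(err);
      if (!canRetry) throw err;
      // Wait briefly, then reopen the stream from scratch.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * (attempt + 1)));
    }
  }
}
```

A provider would wrap its stream-opening call in the factory, so each retry issues a fresh request. Because `forwarded` is set before the first `yield`, a failure after any output is always rethrown rather than retried.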

2. Error detail + retry button (manual fallback) — feat(web)

  • Replaces the bare [error] … line with an .err-block: error detail + ↻ 重试 (Retry) button that re-runs the same message + files. Covers what auto-retry can't (non-retryable errors, or retries exhausted).
  • Fixes double-rendered error: runAgent emits an error event and rethrows, and the server catch sent a second identical event (this is why the screenshot showed it twice). Deduped server-side (errorSent) and client-side.
  • Fixes retry duplicating the user message: the user message was persisted before the provider call, so an immediate stream failure orphaned it on disk and a retry wrote it again. Now persisted lazily, only after the first provider response commits the turn.
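
As a rough illustration of the server-side errorSent dedupe in the list above: the sketch below assumes a simplified event shape and a hypothetical `handleTurn` wrapper. Only the `errorSent` flag name comes from this PR; the rest is invented for the example.

```typescript
// Hypothetical sketch; event shape and function names are assumptions.
type TurnEvent =
  | { type: "delta"; text: string }
  | { type: "error"; message: string };

async function handleTurn(
  runAgent: (emit: (e: TurnEvent) => void) => Promise<void>,
  send: (e: TurnEvent) => void,
): Promise<void> {
  let errorSent = false;

  // Every event goes through this guard, so at most one error reaches the client per turn.
  const emit = (e: TurnEvent): void => {
    if (e.type === "error") {
      if (errorSent) return;
      errorSent = true;
    }
    send(e);
  };

  try {
    await runAgent(emit);
  } catch (err) {
    // runAgent may already have emitted this error before rethrowing;
    // the guard drops the repeat but still reports errors that were never emitted.
    emit({ type: "error", message: err instanceof Error ? err.message : String(err) });
  }
}
```

Routing both the agent's own events and the catch-block event through one guarded `emit` keeps the dedupe in a single place.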

Testing

  • New: withStreamRetry / isRetryableStreamError unit tests, Anthropic provider retry integration tests, and a runAgent test asserting zero persistence when the first provider call throws.
  • Existing provider/agent persistence-order tests unchanged and green.
  • html-syntax.test.ts confirms the inline <script> still parses; snapshot recomputed.
  • Full suite: 829 pass / 1 skip / 1 fail. The lone failure is src/cli/pair.test.ts, a pre-existing qrcode-terminal module-not-found in this checkout — zero overlap with these changes.

Note: environment-side, pinning api.anthropic.com to a stable Clash node (vs url-test/fallback flapping) reduces the transient drops at the source; this PR makes Lisa resilient to them regardless.

🤖 Generated with Claude Code

oratis and others added 2 commits June 23, 2026 17:39
The Anthropic SDK throws "request ended without sending any chunks" when a
streaming response opens (HTTP 200) but the SSE body closes before any event
arrives — commonly an HTTP proxy (Clash, one-api relay) tearing down an idle
CONNECT tunnel during a long time-to-first-byte on a large request. The SDK's
own retries don't cover it: the error is thrown while iterating the stream,
after the request already succeeded, so a momentary blip reached the user as a
hard error and aborted the turn (idle reflection silently gave up too).

Add a shared withStreamRetry() helper and wire it into all three providers
(anthropic, openai, gemini). It retries this class of error plus connection
resets / premature closes, but only while no delta has been forwarded yet — so
already-streamed output is never duplicated, and a user abort is never retried.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Replace the bare "[error] …" line with an .err-block that shows the error
detail and a ↻ retry button which re-runs the same message + files. Covers the
cases provider-level auto-retry can't: a non-retryable error, or auto-retry
exhausted. send() is split into send() (input/bubble/attachments) and runChat()
(fetch/stream) so retry never re-appends the user bubble or re-reads the
already-cleared attachment tray.
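
A DOM-free sketch of the send()/runChat() split described above, to show why retry does not re-append the bubble. The `ChatUI` interface and `createChat` factory are hypothetical stand-ins for the real inline script; only the send/runChat names come from this commit.

```typescript
// Hypothetical sketch; ChatUI and createChat are invented for illustration.
interface ChatRequest {
  message: string;
  files: string[];
}

interface ChatUI {
  takeInput(): ChatRequest; // reads AND clears the input box and attachment tray
  appendUserBubble(req: ChatRequest): void;
  showError(detail: string, onRetry: () => Promise<void>): void;
}

function createChat(ui: ChatUI, stream: (req: ChatRequest) => Promise<void>) {
  // Network half: safe to call repeatedly with the same captured request.
  async function runChat(req: ChatRequest): Promise<void> {
    try {
      await stream(req);
    } catch (err) {
      const detail = err instanceof Error ? err.message : String(err);
      // The retry callback closes over req, so it never touches the cleared tray.
      ui.showError(detail, () => runChat(req));
    }
  }

  // UI half: runs once per user submission.
  async function send(): Promise<void> {
    const req = ui.takeInput();
    ui.appendUserBubble(req);
    await runChat(req);
  }

  return { send, runChat };
}
```

The retry button only ever calls `runChat` with the request captured at submit time, which is what keeps the user bubble and attachments from being processed twice.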

Also fixes two issues this surfaced:
- The same error rendered twice: runAgent emits an error event and rethrows,
  and the server's turn catch then sent a second identical event. Dedupe both
  server-side (errorSent guard) and client-side (one block per turn).
- Retry could duplicate the user message in the session file: it was persisted
  before the provider call, so an immediate stream failure orphaned it on disk
  and the retry wrote it again. Persist lazily — only after the first provider
  response commits the turn (file order stays user→assistant).
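
The lazy-persistence ordering above can be sketched as follows. This is a simplified, non-streaming stand-in for runAgent: `runTurn`, `callProvider`, and `persist` are hypothetical names, and the real code presumably commits on the first provider response rather than on a fully awaited reply.

```typescript
// Hypothetical sketch; function and parameter names are assumptions.
interface Message {
  role: "user" | "assistant";
  content: string;
}

async function runTurn(
  userText: string,
  callProvider: (userText: string) => Promise<string>,
  persist: (m: Message) => void,
): Promise<string> {
  // Nothing is written yet: if the provider call throws here, the session
  // file is untouched, so a retry cannot duplicate the user message.
  const reply = await callProvider(userText);

  // The provider responded, so the turn is committed. Writing user first
  // keeps the on-disk order user -> assistant.
  persist({ role: "user", content: userText });
  persist({ role: "assistant", content: reply });
  return reply;
}
```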

Snapshot (lisa-html-snapshot.test.ts) recomputed for the new markup/CSS.

Co-Authored-By: Claude Opus 4.8 <[email protected]>