fix: recover & rotate endpoints when an RPC endpoint misbehaves #750

Open
sinzii wants to merge 2 commits into main from fix/ws-provider-multi-endpoint-failover

sinzii (Member) commented Jun 14, 2026

Problem

With a multi-endpoint WsProvider, if the endpoint picked at connect time is unreachable or misbehaving, the client crashes with Error: [object Object] and stops instead of retrying a different endpoint. Reported via examples/scripts/reconnection.ts.

Root cause

Endpoint rotation itself worked. The failure was one layer up: WsProvider emitted the raw WebSocket Event (not an Error) on socket errors, and BaseSubstrateClient rejected the pending connect() on the first error — killing the client while the provider was about to rotate. A few related gaps also left a client stuck on a connected-but-broken endpoint.
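
To illustrate the first half of this root cause, here is a minimal sketch of one plausible way an unreadable `Error: [object Object]` arises, and of how wrapping the raw event in an Error subclass with endpoint context produces a readable message. The field layout of `WsConnectionError` is an assumption (the PR names the class, but this page does not show its definition); `rawEvent` and the endpoint URL are hypothetical stand-ins.

```typescript
// Hypothetical stand-in for the raw event object a socket emits on error.
const rawEvent = { type: 'error' };

// Passing a non-Error object where a message is expected stringifies it,
// which loses all useful detail.
const naive = new Error(String(rawEvent));

// Assumed shape: an Error subclass that carries the failing endpoint
// and keeps the original event for debugging.
class WsConnectionError extends Error {
  endpoint: string;
  original: unknown;

  constructor(endpoint: string, original: unknown) {
    super(`WebSocket connection error on ${endpoint}`);
    this.name = 'WsConnectionError';
    this.endpoint = endpoint;
    this.original = original;
  }
}

const wrapped = new WsConnectionError('wss://bad.example', rawEvent);

console.log(naive.message);
console.log(`${wrapped.name}: ${wrapped.message}`);
```

Because the wrapper is a real `Error`, callers can log it or branch on its `name` without inspecting a raw event object.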

Changes

  • Wrap socket errors in a WsConnectionError (with endpoint context) instead of emitting a raw Event.
  • connect() no longer fails fast on transient provider errors; it rejects only on init errors or MaxRetryAttemptedError, letting the provider rotate.
  • Escalate init failures to an endpoint switch (disconnect(true)) when retry is enabled; JsonRpcV2NotSupportedError still rejects so the v2→legacy fallback is preserved.
  • ChainHead stop: bounded re-follow retries, then switch endpoint instead of dead-ending.
  • Staling watchdog armed right after init (not only on the next block).
  • Provider hardening: immediate retries count toward maxRetryAttempts; attempt counter resets only after a connection proves healthy (grace window / first message); array failover remembers all recently-failed endpoints; new connectTimeoutMs (default 30s) force-closes a stalled handshake so reconnection rotates onward.
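
The "array failover remembers all recently-failed endpoints" item can be pictured with a small sketch. This is not the provider's actual code: `EndpointRotator`, its method names, the deterministic first-candidate choice, and the example URLs are all assumptions made for illustration. The idea shown is only that every endpoint that failed recently is skipped, and that the memory is cleared once all of them have failed.

```typescript
// Hypothetical helper: picks the next endpoint while skipping recent failures.
class EndpointRotator {
  private readonly endpoints: string[];
  private readonly recentlyFailed = new Set<string>();

  constructor(endpoints: string[]) {
    if (endpoints.length === 0) {
      throw new Error('at least one endpoint is required');
    }
    this.endpoints = [...endpoints];
  }

  // Remember that this endpoint failed so later picks avoid it.
  markFailed(endpoint: string): void {
    this.recentlyFailed.add(endpoint);
  }

  // Return the first endpoint that has not failed recently.
  next(): string {
    let candidates = this.endpoints.filter((e) => !this.recentlyFailed.has(e));
    if (candidates.length === 0) {
      // Every endpoint failed recently: forget the history and start over.
      this.recentlyFailed.clear();
      candidates = this.endpoints;
    }
    return candidates[0];
  }
}

const rotator = new EndpointRotator(['wss://a.example', 'wss://b.example']);
rotator.markFailed(rotator.next());
console.log(rotator.next());
```

Remembering the whole set, rather than only the last failure, avoids bouncing between two broken endpoints while a third healthy one is never tried.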

Behavior changes to note

  • maxRetryAttempts now also counts immediate retries (e.g. disconnect(true) switches).
  • Init errors during the initial connect() no longer fail fast when retry is enabled — they rotate endpoints (bounded by maxRetryAttempts if set).
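
A minimal sketch of the retry accounting described above, under stated assumptions: `RetryBudget` and its methods are hypothetical, `MaxRetryAttemptedError` is a local stand-in that only shares its name with the error mentioned in this PR, and treating `maxRetryAttempts = n` as "allow n retries, fail on the next one" is a guess at the exact boundary.

```typescript
// Local stand-in for the error named in the PR; the real class may differ.
class MaxRetryAttemptedError extends Error {
  constructor(max: number) {
    super(`Gave up after ${max} retry attempts`);
    this.name = 'MaxRetryAttemptedError';
  }
}

// Hypothetical counter shared by delayed retries and immediate switches.
class RetryBudget {
  private attempts = 0;
  private readonly max: number | undefined;

  constructor(maxRetryAttempts?: number) {
    this.max = maxRetryAttempts;
  }

  // Call once per retry, whether delayed or immediate
  // (for example an endpoint switch via disconnect(true)).
  recordRetry(): void {
    this.attempts += 1;
    if (this.max !== undefined && this.attempts > this.max) {
      throw new MaxRetryAttemptedError(this.max);
    }
  }

  // Call only once the connection has proven healthy.
  markHealthy(): void {
    this.attempts = 0;
  }

  get used(): number {
    return this.attempts;
  }
}

const budget = new RetryBudget(2);
budget.recordRetry(); // delayed retry
budget.recordRetry(); // immediate endpoint switch, counted the same way
console.log(budget.used);
```

Resetting the counter only in `markHealthy()` means a socket that opens and then drops right away still consumes the budget, so rotation stays bounded.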

Testing

  • Test suites pass: providers 60/60, api 448/448. New unit tests cover each behavior listed above.
  • Manual repro with a bad endpoint first now logs a readable WsConnectionError, rotates to a working endpoint, and returns the correct genesis hash — no crash.

🤖 Generated with Claude Code

sinzii and others added 2 commits June 14, 2026 22:45
With a multi-endpoint WsProvider, if the endpoint picked at connect time was
unreachable/misbehaving, the client crashed with `Error: [object Object]` and
stopped instead of retrying a different endpoint.

- Wrap socket errors in a WsConnectionError (with endpoint context) instead of
  emitting the raw WebSocket Event.
- connect() no longer fails fast on transient provider errors; it rejects only
  on init errors or MaxRetryAttemptedError, letting the provider rotate.
- Escalate init failures to an endpoint switch (disconnect(true)) when retry is
  enabled; JsonRpcV2NotSupportedError still rejects (preserves v2->legacy fallback).
- ChainHead stop: bounded re-follow retries, then switch endpoint.
- Arm the staling watchdog right after init, not only on the next block.
- Provider hardening: immediate retries count toward maxRetryAttempts; attempt
  counter resets only after a healthy connection; array failover remembers all
  recently-failed endpoints; new connectTimeoutMs (default 30s) force-closes a
  stalled handshake so reconnection rotates onward.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

On a stop event the #recovering deferred is rejected when re-follow fails, but it
only has an awaiting consumer when an operation is in flight (#ensureFollowed).
When recovery fails with no pending request, the rejection had no handler and
surfaced as an unhandled rejection. Attach a no-op catch on creation; real
awaiters still receive the rejection through their own await.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
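
The unhandled-rejection fix in the second commit follows a common pattern, sketched below. `createGuardedDeferred` and the `Deferred` interface are hypothetical names, not the library's actual helper; the sketch only shows that a no-op `catch` attached at creation marks the rejection as handled, while consumers that await the same promise still receive it.

```typescript
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (reason: unknown) => void;
}

// Hypothetical helper: a deferred whose rejection never surfaces as
// "unhandled" when nobody happens to be awaiting it.
function createGuardedDeferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  // No-op handler on a derived promise: the original promise stays
  // rejected, so real awaiters still observe the error.
  promise.catch(() => {});
  return { promise, resolve, reject };
}

const recovering = createGuardedDeferred<void>();
recovering.reject(new Error('re-follow failed'));

// A consumer that attaches later still sees the rejection.
recovering.promise.catch((e) => console.log(e.message));
```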