fix: dealer websocket reconnect leaving spirc hung on stale channels
When the dealer websocket connection drops and reconnects internally,
spirc's tokio::select! loop remains blocked on subscription streams
(connection_id_update, connect_state_update, etc.) that will never
receive new messages. The mpsc senders in the SubscriberMap are not
cleaned up on reconnect, so spirc hangs indefinitely — requiring a
manual process restart.
A second failure mode occurs when the dealer cannot reconnect because
get_url() (which resolves the dealer endpoint and fetches an auth
token via the session) hangs forever on a dead session TCP connection,
with no timeout.
Root cause analysis
-------------------
The dealer's run() loop (core/src/dealer/mod.rs) coordinates
reconnecting: when the websocket drops, it calls get_url() to resolve
a new dealer endpoint, then connect(). However:
1. The subscription channels (mpsc::UnboundedSender<Message>) stored
   in DealerShared::message_handlers survive reconnects. Spirc's
   .next() calls on the receiver side never return None because the
   senders are still alive in the map; they simply never send again.
2. get_url() calls session.apresolver().resolve("dealer") and
   session.login5().auth_token(), both of which need the session's
   TCP connection. When that connection is dead ("Connection to server
   closed"), these calls hang forever with no timeout.
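The first point can be illustrated with a minimal, std-only sketch. It uses std::sync::mpsc as a stand-in for the tokio channels named above, and the Handlers type and probe() helper are hypothetical names for illustration, not the librespot API. A receiver only learns that its stream has ended once every sender is dropped, so a map that keeps stale senders leaves the receiver waiting:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

/// Hypothetical stand-in for the dealer's subscriber map: topic -> sender.
pub struct Handlers {
    senders: HashMap<String, Sender<String>>,
}

impl Handlers {
    pub fn new() -> Self {
        Self { senders: HashMap::new() }
    }

    /// Registers a subscription and hands the receiving end to the consumer.
    pub fn subscribe(&mut self, topic: &str) -> Receiver<String> {
        let (tx, rx) = channel();
        self.senders.insert(topic.to_string(), tx);
        rx
    }

    /// Dropping the stored senders is what lets receivers observe the end
    /// of the stream. The bug described above is that nothing did this on
    /// reconnect.
    pub fn clear(&mut self) {
        self.senders.clear();
    }
}

/// Non-blocking look at what the consumer would see right now:
/// `Empty` while a sender is still alive, `Disconnected` once all are gone.
pub fn probe(rx: &Receiver<String>) -> Result<String, TryRecvError> {
    rx.try_recv()
}
```

With a live but silent sender, probe() reports Empty (a blocking receive would wait forever); only after clear() does it report Disconnected.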
Before fix — log evidence of hangs requiring manual restart
-----------------------------------------------------------
Feb 17 01:12 — "Websocket peer does not respond."
[63.5 hour gap — process completely unresponsive]
Feb 19 16:44 — Manual restart: "librespot 0.8.0 ..."
Feb 23 08:41 — "Websocket peer does not respond."
[32.2 hour gap — process completely unresponsive]
Feb 24 16:51 — Manual restart: "librespot 0.8.0 ..."
Dec 15 20:53-21:07 — Rapid reconnect storm: 12 "peer does not
respond" in 50 minutes, with "starting dealer failed: Websocket
couldn't be started because: Handshake not finished" errors.
Feb 22 — Session TCP died at 05:55, spirc didn't notice for 7+
hours (no dealer reconnect signal), finally shut down at 22:11.
Fix
---
Add a watch::Sender<u64> generation counter shared between the dealer
and its consumers. The dealer increments it when:
- It successfully reconnects after a connection loss
- get_url() times out (30s RECONNECT_URL_TIMEOUT)
- get_url() returns an error
Spirc obtains its watch::Receiver before calling dealer.start() to
avoid a lost-wakeup race: watch retains the latest value, so a change
made before spirc starts waiting is still observed, whereas Notify's
notify_waiters() only wakes tasks that are already waiting. In its
select! loop, spirc watches for changes and breaks out, triggering the
existing "Spirc shut down unexpectedly" -> auto-reconnect path in main.rs.
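The lost-wakeup argument can be sketched without tokio. The Generation type below is a hypothetical std-only stand-in for the watch channel (a Mutex<u64> plus Condvar), not the code in this commit. Because the consumer compares the stored counter against the value it last saw, a bump that happens before it starts waiting is still detected:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical std-only stand-in for a watch channel carrying a
/// generation counter. Cloning shares the same underlying counter.
#[derive(Clone, Default)]
pub struct Generation {
    inner: Arc<(Mutex<u64>, Condvar)>,
}

impl Generation {
    /// Dealer side: increment the counter and wake any waiters.
    pub fn bump(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_all();
    }

    /// Consumer side: record the current value when subscribing.
    pub fn current(&self) -> u64 {
        *self.inner.0.lock().unwrap()
    }

    /// Consumer side: block until the counter differs from `seen`, or
    /// until `timeout` elapses. Returns the new value if it changed.
    pub fn wait_changed(&self, seen: u64, timeout: Duration) -> Option<u64> {
        let (lock, cvar) = &*self.inner;
        let guard = lock.lock().unwrap();
        // The predicate is checked before sleeping, so an earlier bump
        // is never missed.
        let (guard, _) = cvar
            .wait_timeout_while(guard, timeout, |value| *value == seen)
            .unwrap();
        if *guard != seen {
            Some(*guard)
        } else {
            None
        }
    }
}
```

A consumer that calls current() first and later wait_changed(seen, ...) returns immediately if the dealer bumped the counter in between, which is the property the paragraph above relies on.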
The get_url() error handling also fixes a pre-existing issue where
get_url() failures would propagate via ? and terminate the dealer
background task entirely, rather than retrying.
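A rough sketch of the retry-and-signal shape described above, again std-only and with hypothetical names (UrlError, resolve_with_timeout, url_with_retry). The actual change works on async code inside run(); here a helper thread and recv_timeout() stand in for the async timeout, and the retry loop is bounded so the example terminates:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum UrlError {
    Timeout,
    Failed(String),
}

/// Runs `resolve` on a helper thread and stops waiting after `timeout`.
/// The helper thread is left to finish on its own.
pub fn resolve_with_timeout<F>(resolve: F, timeout: Duration) -> Result<String, UrlError>
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(resolve());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(url)) => Ok(url),
        Ok(Err(e)) => Err(UrlError::Failed(e)),
        Err(RecvTimeoutError::Timeout) => Err(UrlError::Timeout),
        Err(RecvTimeoutError::Disconnected) => {
            Err(UrlError::Failed("resolver thread exited".to_string()))
        }
    }
}

/// Calls `get_url` up to `max_attempts` times. On a timeout or error it
/// bumps the generation counter (signalling consumers) and retries,
/// instead of returning the error and ending the task.
pub fn url_with_retry<F>(mut get_url: F, generation: &AtomicU64, max_attempts: u32) -> Option<String>
where
    F: FnMut() -> Result<String, UrlError>,
{
    for _ in 0..max_attempts {
        match get_url() {
            Ok(url) => return Some(url),
            Err(_) => {
                generation.fetch_add(1, Ordering::SeqCst);
            }
        }
    }
    None
}
```

The design point is that a failed or stuck lookup becomes an observable signal for consumers rather than a silent end of the background task.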
Changes:
- core/src/dealer/mod.rs: Add watch channel plumbing to Dealer,
  Builder, create_dealer! macro, and run(). Add 30s timeout on
  get_url(). Handle get_url() errors with retry+signal instead of
  fatal ? propagation. Signal consumers on reconnect.
- core/src/dealer/manager.rs: Store watch::Sender in
  DealerManagerInner, pass to Builder::launch(), expose
  reconnect_receiver() for consumers.
- connect/src/spirc.rs: Subscribe to reconnect watch before
  dealer.start(). Add select! branch to break on dealer reconnect.
After fix — 9 days of logs showing automatic recovery
-----------------------------------------------------
Websocket failures now recover automatically, typically within 2-7
seconds (91 seconds in one case where the network itself was unreachable):
Mar 01 15:45 — "Websocket connection failed: Connection reset"
Mar 01 15:45 — "Dealer reconnected; notifying consumers."
Mar 01 15:45 — "Dealer reconnected; restarting spirc to refresh subscriptions."
Mar 01 15:46 — "Spirc shut down unexpectedly"
Mar 01 15:46 — "active device is <> with session <...>" [7s recovery]
Mar 03 10:21 — "Websocket peer does not respond."
Mar 03 10:21 — "Dealer reconnected; notifying consumers."
Mar 03 10:21 — "restarting spirc to refresh subscriptions."
Mar 03 10:21 — "active device is <> with session <...>" [7s recovery]
Mar 06 09:42 — "Websocket peer does not respond."
Mar 06 09:42 — "Error while connecting: Network is unreachable"
Mar 06 09:43 — [retries for ~1 min while network recovers]
Mar 06 09:43 — "Dealer reconnected; notifying consumers."
Mar 06 09:43 — "active device is <> with session <...>" [91s recovery]
Summary over 9 days post-fix (Feb 28 - Mar 8):
- 0 manual restarts needed (vs 2 in 7 days before fix)
- 9 dealer reconnect events, all recovered in 2-91 seconds
- 14 session TCP closures also recovered (via existing path)
- 0 get_url() timeouts fired (websocket errors caught first)
- Process running continuously for 9+ days
Co-authored-by: Copilot <[email protected]>