fix: dealer reconnect and session loss recovery without playback interruption #1692
antoinecellerier wants to merge 7 commits into librespot-org:dev
Conversation
Pull request overview
This PR addresses a hang in the Connect “spirc” task after dealer websocket reconnects by adding an explicit reconnect notification mechanism so consumers can tear down and re-subscribe, and by bounding dealer URL resolution time during reconnect attempts.
Changes:
- Add a watch-based reconnect "generation" signal, plumbed through the dealer builder/manager and emitted on reconnect and on `get_url()` failures/timeouts.
- Add a 30s timeout around `get_url()` during the dealer reconnect loop, and retry instead of terminating the dealer task.
- Update `SpircTask` to subscribe to the reconnect signal before `dealer.start()` and break out of its `select!` loop when a reconnect is observed.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/src/dealer/mod.rs | Adds reconnect watch plumbing and get_url() timeout/handling in the reconnect loop; exposes a reconnect receiver. |
| core/src/dealer/manager.rs | Stores and exposes the reconnect watch sender/receiver; passes sender into dealer launch. |
| connect/src/spirc.rs | Subscribes to reconnect notifications and restarts spirc when dealer reconnects to avoid stale subscription hangs. |
Force-pushed d40e7ec to eb172e9
When the dealer websocket connection drops and reconnects internally,
spirc's tokio::select! loop remains blocked on subscription streams
(connection_id_update, connect_state_update, etc.) that will never
receive new messages. The mpsc senders in the SubscriberMap are not
cleaned up on reconnect, so spirc hangs indefinitely — requiring a
manual process restart.
A second failure mode occurs when the dealer cannot reconnect because
get_url() (which resolves the dealer endpoint and fetches an auth
token via the session) hangs forever on a dead session TCP connection,
with no timeout.
Root cause analysis
-------------------
The dealer's run() loop (core/src/dealer/mod.rs) coordinates
reconnecting: when the websocket drops, it calls get_url() to resolve
a new dealer endpoint, then connect(). However:
1. The subscription channels (mpsc::UnboundedSender<Message>) stored
in DealerShared::message_handlers survive reconnects. Spirc's
.next() calls on the receiver side never return None because the
senders are still alive in the map — they just never send again.
2. get_url() calls session.apresolver().resolve("dealer") and
session.login5().auth_token(), both of which need the session's
TCP connection. When that connection is dead ("Connection to server
closed"), these calls hang forever with no timeout.
Before fix — log evidence of hangs requiring manual restart
-----------------------------------------------------------
Feb 17 01:12 — "Websocket peer does not respond."
[63.5 hour gap — process completely unresponsive]
Feb 19 16:44 — Manual restart: "librespot 0.8.0 ..."
Feb 23 08:41 — "Websocket peer does not respond."
[32.2 hour gap — process completely unresponsive]
Feb 24 16:51 — Manual restart: "librespot 0.8.0 ..."
Dec 15 20:53-21:07 — Rapid reconnect storm: 12 "peer does not
respond" in 50 minutes, with "starting dealer failed: Websocket
couldn't be started because: Handshake not finished" errors.
Feb 22 — Session TCP died at 05:55, spirc didn't notice for 7+
hours (no dealer reconnect signal), finally shut down at 22:11.
Fix
---
Add a watch::Sender<u64> generation counter shared between the dealer
and its consumers. The dealer increments it when:
- It successfully reconnects after a connection loss
- get_url() times out (30s RECONNECT_URL_TIMEOUT)
- get_url() returns an error
Spirc subscribes to a watch::Receiver before dealer.start() to avoid
a lost-wakeup race (watch retains state, unlike Notify which loses
notifications if no one is awaiting). In its select! loop, spirc
watches for changes and breaks out, triggering the existing "Spirc
shut down unexpectedly" -> auto-reconnect path in main.rs.
The get_url() error handling also fixes a pre-existing issue where
get_url() failures would propagate via ? and terminate the dealer
background task entirely, rather than retrying.
Changes:
- core/src/dealer/mod.rs: Add watch channel plumbing to Dealer,
Builder, create_dealer! macro, and run(). Add 30s timeout on
get_url(). Handle get_url() errors with retry+signal instead of
fatal ? propagation. Signal consumers on reconnect.
- core/src/dealer/manager.rs: Store watch::Sender in
DealerManagerInner, pass to Builder::launch(), expose
reconnect_receiver() for consumers.
- connect/src/spirc.rs: Subscribe to reconnect watch before
dealer.start(). Add select! branch to break on dealer reconnect.
After fix — 9 days of logs showing automatic recovery
-----------------------------------------------------
Websocket failures now recover in 2-7 seconds automatically:
Mar 01 15:45 — "Websocket connection failed: Connection reset"
Mar 01 15:45 — "Dealer reconnected; notifying consumers."
Mar 01 15:45 — "Dealer reconnected; restarting spirc to refresh subscriptions."
Mar 01 15:46 — "Spirc shut down unexpectedly"
Mar 01 15:46 — "active device is <> with session <...>" [7s recovery]
Mar 03 10:21 — "Websocket peer does not respond."
Mar 03 10:21 — "Dealer reconnected; notifying consumers."
Mar 03 10:21 — "restarting spirc to refresh subscriptions."
Mar 03 10:21 — "active device is <> with session <...>" [7s recovery]
Mar 06 09:42 — "Websocket peer does not respond."
Mar 06 09:42 — "Error while connecting: Network is unreachable"
Mar 06 09:43 — [retries for ~1 min while network recovers]
Mar 06 09:43 — "Dealer reconnected; notifying consumers."
Mar 06 09:43 — "active device is <> with session <...>" [91s recovery]
Summary over 9 days post-fix (Feb 28 - Mar 8):
- 0 manual restarts needed (vs 2 in 7 days before fix)
- 9 dealer reconnect events, all recovered in 2-91 seconds
- 14 session TCP closures also recovered (via existing path)
- 0 get_url() timeouts fired (websocket errors caught first)
- Process running continuously for 9+ days
Co-authored-by: Copilot <[email protected]>
Force-pushed eb172e9 to 34d2fd9
Symptoms observed in logs:
WARN librespot_core::dealer Websocket peer does not respond.
WARN librespot_connect::spirc unexpected shutdown
WARN librespot Spirc shut down unexpectedly
When the dealer websocket drops (peer timeout or TLS close_notify),
SpircTask broke out of its event loop so main.rs could tear down and
recreate the entire Spirc. This caused playback to stop on every
transient websocket drop — even though the dealer already
auto-reconnects the websocket.
The subscription streams survive reconnects because they are registered
on the shared DealerShared instance. After reconnect, the server pushes
a new connection_id which handle_connection_id_update already handles.
Changes:
- reconnect_rx.changed(): log and continue instead of breaking
- handle_connection_id_update errors: non-fatal (logged, not breaking)
After fix — 6 days of logs (Mar 14-20) showing ~20 dealer reconnects
handled in-place without restarting spirc or stopping playback:
Mar 16 05:06 — "Dealer reconnected; awaiting new connection_id."
Mar 16 05:06 — "re-registering with active playback state: Paused { ... }"
[no restart, no "Spirc shut down unexpectedly"]
Mar 18 12:14 — "Dealer reconnected; awaiting new connection_id."
Mar 18 12:14 — "re-registering with active playback state: Playing { ... }"
[playback continued uninterrupted]
Mar 19 — 9 dealer reconnects in one day, all handled in-place
Summary: 0 spirc restarts from dealer reconnects (vs ~1/day before fix).
Co-authored-by: Copilot <[email protected]>
Symptoms observed in logs — the TCP session dies, then cleanup fails:
ERROR librespot_core::session Connection to server closed.
WARN librespot_connect::spirc unexpected shutdown
ERROR librespot_core::session Broken pipe (os error 32)
ERROR librespot_core::session Transport endpoint is not connected (os error 107)
WARN librespot Spirc shut down unexpectedly
When SpircTask exits because session.is_invalid(), the post-loop cleanup
called handle_disconnect() (which sets play_status to Stopped and tries
to notify Spotify), delete_connect_state_request(), and dealer().close().
All of these fail because the TCP connection is dead, and setting
play_status to Stopped needlessly kills the Player.
Now we detect session.is_invalid() and skip all server communication in
the post-loop cleanup. The Player runs in a separate thread and continues
playing from its audio buffer. main.rs will create a new session and Spirc.
After fix — the "Broken pipe" and "Transport endpoint is not connected"
errors no longer appear after session loss. The Player continues playing
while the session reconnects (see next commit for state restoration
evidence).
Co-authored-by: Copilot <[email protected]>
When SpircTask exits due to session loss, it now saves its
ConnectState, SpircPlayStatus, and play_request_id into a
SavedPlaybackState. main.rs captures this and passes it to
Spirc::with_saved_state() when creating the replacement Spirc.
The restored SpircTask starts with the saved state. On the first
connection_id_update, it updates the playback position to account for
elapsed time and re-registers with Spotify showing the correct track
and position. The Player is never interrupted.
After fix — 6 days of logs (Mar 14-20) showing 5 TCP session losses
all recovered with playback state preserved:
Mar 18 11:49 — "Connection to server closed."
Mar 18 11:52 — "session lost, saving playback state for recovery:
Playing { nominal_start_time: 1773834169899, ... }"
Mar 18 11:52 — "Spirc shut down with saved playback state, reconnecting"
Mar 18 11:52 — "Spirc[1] restoring saved playback state"
Mar 18 11:52 — "re-registering with active playback state:
Playing { nominal_start_time: 1773834169899, ... }"
[3 second recovery, playback never stopped]
Mar 19 12:21-12:37 — Two session losses during active playback,
both recovered in ~2 seconds with Playing state preserved.
Summary over 6 days post-fix (Mar 14-20):
- 0 "Spirc shut down unexpectedly" (vs ~2-3/day before fix)
- 0 process restarts needed
- 5 session TCP losses, all recovered with state preserved
- ~20 dealer reconnects, all handled in-place
- Process running continuously (Spirc counter reached Spirc[4])
Co-authored-by: Copilot <[email protected]>
Force-pushed 3388445 to f69778d
Works great for me, thanks! I switched ISP before the weekend and since then librespot would lose connection at least once every 15 minutes. Very odd behavior that so far I have not been able to pinpoint to a specific cause (perhaps something to do with poor multicast support of the supplied router?). This PR fixed it!
I think I’ve encountered an asymmetrical issue when switching between accounts that seems specific to this PR. If the device has credentials (username) defined in the config, it becomes impossible to switch back to that "Owner" account after a "Guest" (Discovery/Zeroconf) session has been active or is active (the "Owner" can join the jam but not take over the session). The error it shows is:
to reproduce the error:
The only way to recover is to manually restart the service. Curiously, the "Guest" does not experience the same issues when the "Owner" is playing; it can happily take over. It seems the new recovery logic might be preventing a clean shutdown of the Dealer task during an account handover? It is curious that it only happens in this specific situation, though. Other than this, I haven't experienced any problems (thanks again!)
Only save state and skip cleanup on unintentional session loss (not when shutdown was explicitly requested via spirc.shutdown()). Fixes account handover: when Discovery triggers a new account, session.shutdown() races with the Shutdown command — SpircTask would see session.is_invalid() and skip dealer().close(), leaving a stale dealer with closed command channels.
Clear saved_playback_state on Discovery credential change. Without this, a session loss under Account A saves state, then Account B takes over via Discovery, and when Account A reconnects later, it restores stale state from a different account/session.
Restore break on initial handle_connection_id_update failure. Only tolerate errors after connect_established is true (re-registration after dealer reconnect). Without this, a failed initial registration leaves the device in connect_established=false where local commands are silently ignored.
@artenverho Thanks for the detailed report! Copilot has likely identified the root cause: when Account B takes over via Discovery, session.shutdown() races with the Shutdown command and can leave a stale dealer behind. I've pushed 3 fixup commits (not yet squashed):
Could you test with these changes and let me know if the account switching issue is resolved?
Sorry, it took a few days to find the time for testing. At first glance it seems to solve the issue! I will need to do some more long-term testing, but switching between users is now seamless again. Thanks!
When the dealer websocket connection drops or the session TCP connection dies,
librespot either hangs indefinitely or restarts spirc — stopping playback in
both cases. This PR fixes all observed failure modes so the process recovers
automatically without interrupting audio.
Symptoms before fix
Hang on dealer websocket drop (fixed in 34d2fd9):
Spirc's `tokio::select!` loop blocked on subscription streams that would never receive new messages. The mpsc senders in `SubscriberMap` weren't cleaned up on reconnect, so spirc hung indefinitely — requiring a manual process restart.
Unnecessary spirc restart on dealer reconnect (fixed in 812c972):
After 34d2fd9 added reconnect notification, spirc broke out of its event loop on every dealer websocket reconnect to "refresh subscriptions" — even though the subscription streams survive reconnects on the shared `DealerShared`.
Playback killed on session TCP loss (fixed in 18eb5be + f69778d):
When the session TCP connection dies, `handle_disconnect()` explicitly set `SpircPlayStatus::Stopped` and tried to notify Spotify (which failed anyway). The Player was still playing from its buffer, but the new SpircTask started with a blank `ConnectState` and `SpircPlayStatus::Stopped`.
Commits
34d2fd9 — fix: dealer websocket reconnect leaving spirc hung on stale channels

Add a `watch::Sender<u64>` generation counter shared between the dealer and its consumers. The dealer increments it on successful reconnect, `get_url()` timeout (30s), or `get_url()` error. Spirc subscribes before `dealer.start()` and breaks out of its event loop on change, triggering the existing auto-reconnect path in main.rs.

Also fixes `get_url()` failures propagating via `?` and terminating the dealer background task entirely, rather than retrying.

Changes:
- `core/src/dealer/mod.rs`: watch channel plumbing, 30s timeout on `get_url()`, retry+signal on errors, signal consumers on reconnect
- `core/src/dealer/manager.rs`: store `watch::Sender`, expose `reconnect_receiver()`
- `connect/src/spirc.rs`: subscribe to reconnect watch, add `select!` branch

812c972 — fix: handle dealer reconnect in-place without restarting spirc

The subscription streams survive reconnects because they're registered on the shared `DealerShared` — the new websocket dispatches through the same `message_handlers` map. After reconnect, the server pushes a new `connection_id` which `handle_connection_id_update()` already handles correctly.

Changes:
- `reconnect_rx.changed()`: log and continue instead of `break`
- `handle_connection_id_update` error: log instead of `break`

18eb5be — fix: skip server cleanup on session loss to keep playback alive

When `session.is_invalid()`, skip `handle_disconnect()` (which sets `SpircPlayStatus::Stopped`), `delete_connect_state_request()`, and `dealer().close()` — all of which fail on a dead TCP connection anyway. The Player continues playing from its buffer independently.

f69778d — fix: save and restore playback state across session reconnects

`SpircTask` saves its `ConnectState`, `SpircPlayStatus`, and `play_request_id` into a `SavedPlaybackState` before exiting on session loss. main.rs captures this and passes it to `Spirc::with_saved_state()` when creating the replacement. The restored SpircTask updates the playback position on the first `connection_id_update` and re-registers with Spotify showing the correct track.

Changes:
- `connect/src/model.rs`: add `SavedPlaybackState` struct
- `connect/src/spirc.rs`: `Spirc::with_saved_state()`, save state on session loss, restore on creation, update position in `handle_connection_id_update`
- `src/main.rs`: capture saved state from spirc_task, pass to new Spirc

Evidence after fix
34d2fd9 — 9 days of logs (Feb 28 - Mar 8):
812c972 + 18eb5be + f69778d — 6 days of logs (Mar 14-20):
Dealer reconnects handled in-place (~20 events, zero spirc restarts):
Session TCP losses recovered with state preserved (5 events):
Summary (Mar 14-20):
Use of AI
This PR was created with GitHub Copilot CLI. Copilot assisted with root cause
analysis, implementation, code review, log analysis, and PR description. I'll
admit to not having any knowledge of Rust which means I'm not able to review
Rust specifics. At a high level the changes seem to make conceptual sense to me
and have proven to have positive effects. Do let me know if this is garbage.