fix: dealer reconnect and session loss recovery without playback interruption #1692
antoinecellerier wants to merge 7 commits into librespot-org:dev
Conversation
Pull request overview
This PR addresses a hang in the Connect “spirc” task after dealer websocket reconnects by adding an explicit reconnect notification mechanism so consumers can tear down and re-subscribe, and by bounding dealer URL resolution time during reconnect attempts.
Changes:
- Add a watch-based reconnect "generation" signal, plumbed through the dealer builder/manager and emitted on reconnect and on `get_url()` failures/timeouts.
- Add a 30s timeout around `get_url()` during the dealer reconnect loop, and retry instead of terminating the dealer task.
- Update `SpircTask` to subscribe to the reconnect signal before `dealer.start()` and break out of its `select!` loop when a reconnect is observed.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/src/dealer/mod.rs | Adds reconnect watch plumbing and get_url() timeout/handling in the reconnect loop; exposes a reconnect receiver. |
| core/src/dealer/manager.rs | Stores and exposes the reconnect watch sender/receiver; passes sender into dealer launch. |
| connect/src/spirc.rs | Subscribes to reconnect notifications and restarts spirc when dealer reconnects to avoid stale subscription hangs. |
Force-pushed d40e7ec to eb172e9
When the dealer websocket connection drops and reconnects internally,
spirc's tokio::select! loop remains blocked on subscription streams
(connection_id_update, connect_state_update, etc.) that will never
receive new messages. The mpsc senders in the SubscriberMap are not
cleaned up on reconnect, so spirc hangs indefinitely — requiring a
manual process restart.
A second failure mode occurs when the dealer cannot reconnect because
get_url() (which resolves the dealer endpoint and fetches an auth
token via the session) hangs forever on a dead session TCP connection,
with no timeout.
Root cause analysis
-------------------
The dealer's run() loop (core/src/dealer/mod.rs) coordinates
reconnecting: when the websocket drops, it calls get_url() to resolve
a new dealer endpoint, then connect(). However:
1. The subscription channels (mpsc::UnboundedSender<Message>) stored
in DealerShared::message_handlers survive reconnects. Spirc's
.next() calls on the receiver side never return None because the
senders are still alive in the map — they just never send again.
2. get_url() calls session.apresolver().resolve("dealer") and
session.login5().auth_token(), both of which need the session's
TCP connection. When that connection is dead ("Connection to server
closed"), these calls hang forever with no timeout.
Before fix — log evidence of hangs requiring manual restart
-----------------------------------------------------------
Feb 17 01:12 — "Websocket peer does not respond."
[63.5 hour gap — process completely unresponsive]
Feb 19 16:44 — Manual restart: "librespot 0.8.0 ..."
Feb 23 08:41 — "Websocket peer does not respond."
[32.2 hour gap — process completely unresponsive]
Feb 24 16:51 — Manual restart: "librespot 0.8.0 ..."
Dec 15 20:53-21:07 — Rapid reconnect storm: 12 "peer does not
respond" in 50 minutes, with "starting dealer failed: Websocket
couldn't be started because: Handshake not finished" errors.
Feb 22 — Session TCP died at 05:55, spirc didn't notice for 7+
hours (no dealer reconnect signal), finally shut down at 22:11.
Fix
---
Add a watch::Sender<u64> generation counter shared between the dealer
and its consumers. The dealer increments it when:
- It successfully reconnects after a connection loss
- get_url() times out (30s RECONNECT_URL_TIMEOUT)
- get_url() returns an error
Spirc subscribes to a watch::Receiver before dealer.start() to avoid
a lost-wakeup race (watch retains state, unlike Notify which loses
notifications if no one is awaiting). In its select! loop, spirc
watches for changes and breaks out, triggering the existing "Spirc
shut down unexpectedly" -> auto-reconnect path in main.rs.
The get_url() error handling also fixes a pre-existing issue where
get_url() failures would propagate via ? and terminate the dealer
background task entirely, rather than retrying.
Changes:
- core/src/dealer/mod.rs: Add watch channel plumbing to Dealer,
Builder, create_dealer! macro, and run(). Add 30s timeout on
get_url(). Handle get_url() errors with retry+signal instead of
fatal ? propagation. Signal consumers on reconnect.
- core/src/dealer/manager.rs: Store watch::Sender in
DealerManagerInner, pass to Builder::launch(), expose
reconnect_receiver() for consumers.
- connect/src/spirc.rs: Subscribe to reconnect watch before
dealer.start(). Add select! branch to break on dealer reconnect.
After fix — 9 days of logs showing automatic recovery
-----------------------------------------------------
Websocket failures now recover in 2-7 seconds automatically:
Mar 01 15:45 — "Websocket connection failed: Connection reset"
Mar 01 15:45 — "Dealer reconnected; notifying consumers."
Mar 01 15:45 — "Dealer reconnected; restarting spirc to refresh subscriptions."
Mar 01 15:46 — "Spirc shut down unexpectedly"
Mar 01 15:46 — "active device is <> with session <...>" [7s recovery]
Mar 03 10:21 — "Websocket peer does not respond."
Mar 03 10:21 — "Dealer reconnected; notifying consumers."
Mar 03 10:21 — "restarting spirc to refresh subscriptions."
Mar 03 10:21 — "active device is <> with session <...>" [7s recovery]
Mar 06 09:42 — "Websocket peer does not respond."
Mar 06 09:42 — "Error while connecting: Network is unreachable"
Mar 06 09:43 — [retries for ~1 min while network recovers]
Mar 06 09:43 — "Dealer reconnected; notifying consumers."
Mar 06 09:43 — "active device is <> with session <...>" [91s recovery]
Summary over 9 days post-fix (Feb 28 - Mar 8):
- 0 manual restarts needed (vs 2 in 7 days before fix)
- 9 dealer reconnect events, all recovered in 2-91 seconds
- 14 session TCP closures also recovered (via existing path)
- 0 get_url() timeouts fired (websocket errors caught first)
- Process running continuously for 9+ days
Co-authored-by: Copilot <[email protected]>
Force-pushed eb172e9 to 34d2fd9
Symptoms observed in logs:
WARN librespot_core::dealer Websocket peer does not respond.
WARN librespot_connect::spirc unexpected shutdown
WARN librespot Spirc shut down unexpectedly
When the dealer websocket drops (peer timeout or TLS close_notify),
SpircTask broke out of its event loop so main.rs could tear down and
recreate the entire Spirc. This caused playback to stop on every
transient websocket drop — even though the dealer already
auto-reconnects the websocket.
The subscription streams survive reconnects because they are registered
on the shared DealerShared instance. After reconnect, the server pushes
a new connection_id which handle_connection_id_update already handles.
Changes:
- reconnect_rx.changed(): log and continue instead of breaking
- handle_connection_id_update errors: non-fatal (logged, not breaking)
After fix — 6 days of logs (Mar 14-20) showing ~20 dealer reconnects
handled in-place without restarting spirc or stopping playback:
Mar 16 05:06 — "Dealer reconnected; awaiting new connection_id."
Mar 16 05:06 — "re-registering with active playback state: Paused { ... }"
[no restart, no "Spirc shut down unexpectedly"]
Mar 18 12:14 — "Dealer reconnected; awaiting new connection_id."
Mar 18 12:14 — "re-registering with active playback state: Playing { ... }"
[playback continued uninterrupted]
Mar 19 — 9 dealer reconnects in one day, all handled in-place
Summary: 0 spirc restarts from dealer reconnects (vs ~1/day before fix).
Co-authored-by: Copilot <[email protected]>
Symptoms observed in logs — the TCP session dies, then cleanup fails:
ERROR librespot_core::session Connection to server closed.
WARN librespot_connect::spirc unexpected shutdown
ERROR librespot_core::session Broken pipe (os error 32)
ERROR librespot_core::session Transport endpoint is not connected (os error 107)
WARN librespot Spirc shut down unexpectedly
When SpircTask exits because session.is_invalid(), the post-loop cleanup
called handle_disconnect() (which sets play_status to Stopped and tries
to notify Spotify), delete_connect_state_request(), and dealer().close().
All of these fail because the TCP connection is dead, and setting
play_status to Stopped needlessly kills the Player.
Now we detect session.is_invalid() and skip all server communication in
the post-loop cleanup. The Player runs in a separate thread and continues
playing from its audio buffer. main.rs will create a new session and Spirc.
After fix — the "Broken pipe" and "Transport endpoint is not connected"
errors no longer appear after session loss. The Player continues playing
while the session reconnects (see next commit for state restoration
evidence).
Co-authored-by: Copilot <[email protected]>
When SpircTask exits due to session loss, it now saves its
ConnectState, SpircPlayStatus, and play_request_id into a
SavedPlaybackState. main.rs captures this and passes it to
Spirc::with_saved_state() when creating the replacement Spirc.
The restored SpircTask starts with the saved state. On the first
connection_id_update, it updates the playback position to account for
elapsed time and re-registers with Spotify showing the correct track
and position. The Player is never interrupted.
After fix — 6 days of logs (Mar 14-20) showing 5 TCP session losses
all recovered with playback state preserved:
Mar 18 11:49 — "Connection to server closed."
Mar 18 11:52 — "session lost, saving playback state for recovery:
Playing { nominal_start_time: 1773834169899, ... }"
Mar 18 11:52 — "Spirc shut down with saved playback state, reconnecting"
Mar 18 11:52 — "Spirc[1] restoring saved playback state"
Mar 18 11:52 — "re-registering with active playback state:
Playing { nominal_start_time: 1773834169899, ... }"
[3 second recovery, playback never stopped]
Mar 19 12:21-12:37 — Two session losses during active playback,
both recovered in ~2 seconds with Playing state preserved.
Summary over 6 days post-fix (Mar 14-20):
- 0 "Spirc shut down unexpectedly" (vs ~2-3/day before fix)
- 0 process restarts needed
- 5 session TCP losses, all recovered with state preserved
- ~20 dealer reconnects, all handled in-place
- Process running continuously (Spirc counter reached Spirc[4])
Co-authored-by: Copilot <[email protected]>
Force-pushed 3388445 to f69778d
Works great for me, thanks! I switched ISP before the weekend and since then librespot would lose connection at least once every 15 minutes. Very odd behavior that so far I have not been able to pinpoint to a specific cause (perhaps something to do with poor multicast support of the supplied router?). This PR fixed it!
I think I’ve encountered an asymmetrical issue when switching between accounts that seems specific to this PR. If the device has credentials (username) defined in the config, it becomes impossible to switch back to that "Owner" account after a "Guest" (Discovery/Zeroconf) session has been active or is active (the "Owner" can join the jam but not take over the session). The error it shows is:
to reproduce the error:
The only way to recover is to manually restart the service. Curiously, the "Guest" does not experience the same issues when the "Owner" is playing; it can happily take over. It seems the new recovery logic might be preventing a clean shutdown of the Dealer task during an account handover? It is curious that it only happens in this specific situation, though. Other than this, I haven't experienced any problems (thanks again!)
Only save state and skip cleanup on unintentional session loss (not when shutdown was explicitly requested via spirc.shutdown()). Fixes account handover: when Discovery triggers a new account, session.shutdown() races with the Shutdown command — SpircTask would see session.is_invalid() and skip dealer().close(), leaving a stale dealer with closed command channels.
Clear saved_playback_state on Discovery credential change. Without this, a session loss under Account A saves state, then Account B takes over via Discovery, and when Account A reconnects later, it restores stale state from a different account/session.
Restore break on initial handle_connection_id_update failure. Only tolerate errors after connect_established is true (re-registration after dealer reconnect). Without this, a failed initial registration leaves the device in connect_established=false where local commands are silently ignored.
@artenverho Thanks for the detailed report! Copilot has likely identified the root cause: when Account B takes over via Discovery, session.shutdown() races with the Shutdown command and can leave a stale dealer behind. I've pushed 3 fixup commits (not yet squashed):
Could you test with these changes and let me know if the account switching issue is resolved?
Sorry, it took a few days to find the time for testing. At first glance it seems to solve the issue! I will need to do some more long-term testing, but switching between users is now seamless again. Thanks!
When the dealer websocket connection drops or the session TCP connection dies,
librespot either hangs indefinitely or restarts spirc — stopping playback in
both cases. This PR fixes all observed failure modes so the process recovers
automatically without interrupting audio.
Symptoms before fix
Hang on dealer websocket drop (fixed in 34d2fd9):
Spirc's `tokio::select!` loop blocked on subscription streams that would never receive new messages. The mpsc senders in `SubscriberMap` weren't cleaned up on reconnect, so spirc hung indefinitely — requiring a manual process restart.
Unnecessary spirc restart on dealer reconnect (fixed in 812c972):
After 34d2fd9 added reconnect notification, spirc broke out of its event loop on every dealer websocket reconnect to "refresh subscriptions" — even though the subscription streams survive reconnects on the shared `DealerShared`.
Playback killed on session TCP loss (fixed in 18eb5be + f69778d):
When the session TCP connection dies, `handle_disconnect()` explicitly set `SpircPlayStatus::Stopped` and tried to notify Spotify (which failed anyway). The Player was still playing from its buffer, but the new SpircTask started with a blank `ConnectState` and `SpircPlayStatus::Stopped`.
Commits
34d2fd9 — fix: dealer websocket reconnect leaving spirc hung on stale channels

Add a `watch::Sender<u64>` generation counter shared between the dealer and its consumers. The dealer increments it on successful reconnect, `get_url()` timeout (30s), or `get_url()` error. Spirc subscribes before `dealer.start()` and breaks out of its event loop on change, triggering the existing auto-reconnect path in main.rs.

Also fixes `get_url()` failures propagating via `?` and terminating the dealer background task entirely, rather than retrying.

Changes:
- `core/src/dealer/mod.rs`: watch channel plumbing, 30s timeout on `get_url()`, retry+signal on errors, signal consumers on reconnect
- `core/src/dealer/manager.rs`: store `watch::Sender`, expose `reconnect_receiver()`
- `connect/src/spirc.rs`: subscribe to reconnect watch, add `select!` branch

812c972 — fix: handle dealer reconnect in-place without restarting spirc

The subscription streams survive reconnects because they're registered on the shared `DealerShared` — the new websocket dispatches through the same `message_handlers` map. After reconnect, the server pushes a new `connection_id` which `handle_connection_id_update()` already handles correctly.

Changes:
- `reconnect_rx.changed()`: log and continue instead of `break`
- `handle_connection_id_update` error: log instead of `break`

18eb5be — fix: skip server cleanup on session loss to keep playback alive

When `session.is_invalid()`, skip `handle_disconnect()` (which sets `SpircPlayStatus::Stopped`), `delete_connect_state_request()`, and `dealer().close()` — all of which fail on a dead TCP connection anyway. The Player continues playing from its buffer independently.

f69778d — fix: save and restore playback state across session reconnects

`SpircTask` saves its `ConnectState`, `SpircPlayStatus`, and `play_request_id` into a `SavedPlaybackState` before exiting on session loss. main.rs captures this and passes it to `Spirc::with_saved_state()` when creating the replacement. The restored SpircTask updates the playback position on the first `connection_id_update` and re-registers with Spotify showing the correct track.

Changes:
- `connect/src/model.rs`: add `SavedPlaybackState` struct
- `connect/src/spirc.rs`: `Spirc::with_saved_state()`, save state on session loss, restore on creation, update position in `handle_connection_id_update`
- `src/main.rs`: capture saved state from spirc_task, pass to new Spirc

Evidence after fix
34d2fd9 — 9 days of logs (Feb 28 - Mar 8):
812c972 + 18eb5be + f69778d — 6 days of logs (Mar 14-20):
Dealer reconnects handled in-place (~20 events, zero spirc restarts):
Session TCP losses recovered with state preserved (5 events):
Summary (Mar 14-20):
Use of AI
This PR was created with GitHub Copilot CLI. Copilot assisted with root cause
analysis, implementation, code review, log analysis, and PR description. I'll
admit to not having any knowledge of Rust which means I'm not able to review
Rust specifics. At a high level the changes seem to make conceptual sense to me
and have proven to have positive effects. Do let me know if this is garbage.