
fix: dealer reconnect and session loss recovery without playback interruption #1692

Open
antoinecellerier wants to merge 7 commits into librespot-org:dev from antoinecellerier:fix/dealer-reconnect-hang

Conversation

@antoinecellerier

@antoinecellerier antoinecellerier commented Mar 8, 2026

When the dealer websocket connection drops or the session TCP connection dies,
librespot either hangs indefinitely or restarts spirc — stopping playback in
both cases. This PR fixes all observed failure modes so the process recovers
automatically without interrupting audio.

Symptoms before fix

Hang on dealer websocket drop (fixed in 34d2fd9):
Spirc's tokio::select! loop blocked on subscription streams that would never
receive new messages. The mpsc senders in SubscriberMap weren't cleaned up on
reconnect, so spirc hung indefinitely — requiring a manual process restart.

Feb 17 01:12 — "Websocket peer does not respond."
[63.5 hour gap — process completely unresponsive]
Feb 19 16:44 — Manual restart

Unnecessary spirc restart on dealer reconnect (fixed in 812c972):
After 34d2fd9 added reconnect notification, spirc broke out of its event loop
on every dealer websocket reconnect to "refresh subscriptions" — even though
the subscription streams survive reconnects on the shared DealerShared.

WARN  Dealer reconnected; restarting spirc to refresh subscriptions.
WARN  unexpected shutdown
WARN  Spirc shut down unexpectedly

Playback killed on session TCP loss (fixed in 18eb5be + f69778d):
When the session TCP connection dies, handle_disconnect() explicitly set
SpircPlayStatus::Stopped and tried to notify Spotify (which failed anyway).
The Player was still playing from its buffer, but the new SpircTask started
with a blank ConnectState and SpircPlayStatus::Stopped.

ERROR Connection to server closed.
WARN  unexpected shutdown
ERROR Broken pipe (os error 32)
ERROR Audio key response timeout
ERROR Unable to read audio file: end of stream
WARN  Spirc shut down unexpectedly

Commits

34d2fd9 fix: dealer websocket reconnect leaving spirc hung on stale channels

Add a watch::Sender<u64> generation counter shared between the dealer and its
consumers. The dealer increments it on successful reconnect, get_url() timeout
(30s), or get_url() error. Spirc subscribes before dealer.start() and breaks
out of its event loop on change, triggering the existing auto-reconnect path in
main.rs.

Also fixes get_url() failures propagating via ? and terminating the dealer
background task entirely, rather than retrying.

Changes:

  • core/src/dealer/mod.rs: watch channel plumbing, 30s timeout on get_url(),
    retry+signal on errors, signal consumers on reconnect
  • core/src/dealer/manager.rs: Store watch::Sender, expose reconnect_receiver()
  • connect/src/spirc.rs: Subscribe to reconnect watch, add select! branch
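The generation-counter idea can be modeled without tokio. Below is a minimal std-only sketch of the semantics; the actual PR uses tokio::sync::watch::Sender<u64>/Receiver<u64>, and all names here are illustrative, not the PR's code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Std-only model of the "reconnect generation" counter; the real code
// uses a tokio watch channel. Names are illustrative.
#[derive(Clone)]
struct ReconnectSignal(Arc<AtomicU64>);

struct ReconnectReceiver {
    signal: ReconnectSignal,
    seen: u64,
}

impl ReconnectSignal {
    fn new() -> Self {
        ReconnectSignal(Arc::new(AtomicU64::new(0)))
    }

    // The dealer bumps the generation on successful reconnect,
    // get_url() timeout, or get_url() error.
    fn notify_reconnect(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }

    // Consumers subscribe before dealer.start(): because the counter
    // retains its value, a bump that lands before the first poll is
    // still observed later (no lost wakeup, unlike a bare Notify).
    fn subscribe(&self) -> ReconnectReceiver {
        let seen = self.0.load(Ordering::SeqCst);
        ReconnectReceiver { signal: self.clone(), seen }
    }
}

impl ReconnectReceiver {
    // Analogue of watch::Receiver::changed(): true iff the generation
    // advanced since the last check.
    fn changed(&mut self) -> bool {
        let now = self.signal.0.load(Ordering::SeqCst);
        let changed = now != self.seen;
        self.seen = now;
        changed
    }
}
```

In the real async code, spirc awaits `reconnect_rx.changed()` as one branch of its `select!` loop rather than polling.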

812c972 fix: handle dealer reconnect in-place without restarting spirc

The subscription streams survive reconnects because they're registered on the
shared DealerShared — the new websocket dispatches through the same
message_handlers map. After reconnect, the server pushes a new connection_id
which handle_connection_id_update() already handles correctly.

Changes:

  • reconnect_rx.changed(): log and continue instead of break
  • handle_connection_id_update error: log instead of break
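The control-flow change can be sketched as a pure decision function. This is an illustrative model of the loop's behavior after this commit, not the PR's actual code (the real loop is a tokio `select!`, and these enum names are assumptions):

```rust
// Illustrative model: after 812c972, a dealer reconnect is handled in
// place (log and continue) instead of breaking out of the spirc event
// loop and restarting spirc. Genuine exits still break.
#[derive(Debug, PartialEq)]
enum LoopAction {
    Continue,
    Break,
}

#[derive(Debug)]
enum SpircEvent {
    DealerReconnected,  // transient websocket drop, already auto-reconnected
    SessionInvalid,     // session TCP connection died
    ShutdownRequested,  // explicit spirc.shutdown()
}

fn handle_event(event: &SpircEvent) -> LoopAction {
    match event {
        // Before the fix this broke out of the loop, stopping playback
        // on every transient websocket drop.
        SpircEvent::DealerReconnected => LoopAction::Continue,
        SpircEvent::SessionInvalid | SpircEvent::ShutdownRequested => LoopAction::Break,
    }
}
```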

18eb5be fix: skip server cleanup on session loss to keep playback alive

When session.is_invalid(), skip handle_disconnect() (which sets
SpircPlayStatus::Stopped), delete_connect_state_request(), and
dealer().close() — all of which fail on a dead TCP connection anyway.
The Player continues playing from its buffer independently.
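A sketch of the cleanup decision, with the three skipped steps made explicit. The enum and function are illustrative stand-ins for the post-loop cleanup path, not the PR's actual code:

```rust
// Illustrative model of the 18eb5be post-loop cleanup decision: when
// the session TCP connection is already dead, all server-side cleanup
// is skipped so the Player keeps draining its audio buffer.
#[derive(Debug, PartialEq)]
enum CleanupStep {
    HandleDisconnect,          // sets SpircPlayStatus::Stopped, killing playback
    DeleteConnectStateRequest, // HTTP request that fails on a dead connection
    CloseDealer,               // websocket close that fails on a dead connection
}

fn cleanup_plan(session_is_invalid: bool) -> Vec<CleanupStep> {
    if session_is_invalid {
        // Skip all server communication; the Player runs in its own
        // thread and main.rs creates a fresh session and Spirc.
        Vec::new()
    } else {
        vec![
            CleanupStep::HandleDisconnect,
            CleanupStep::DeleteConnectStateRequest,
            CleanupStep::CloseDealer,
        ]
    }
}
```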

f69778d fix: save and restore playback state across session reconnects

SpircTask saves its ConnectState, SpircPlayStatus, and play_request_id
into a SavedPlaybackState before exiting on session loss. main.rs captures
this and passes it to Spirc::with_saved_state() when creating the replacement.
The restored SpircTask updates the playback position on the first
connection_id_update and re-registers with Spotify showing the correct track.

Changes:

  • connect/src/model.rs: Add SavedPlaybackState struct
  • connect/src/spirc.rs: Spirc::with_saved_state(), save state on session
    loss, restore on creation, update position in handle_connection_id_update
  • src/main.rs: Capture saved state from spirc_task, pass to new Spirc
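A hedged sketch of what the SavedPlaybackState shape might look like, going by the description above. ConnectState and SpircPlayStatus are simplified stand-ins here; the real types live in the connect crate and carry more fields and variants:

```rust
// Simplified stand-ins for the real connect-crate types (assumption).
#[derive(Debug, Clone, PartialEq)]
struct ConnectState {
    track_uri: String, // stand-in for the full context/track state
    position_ms: u32,
}

#[derive(Debug, Clone, PartialEq)]
enum SpircPlayStatus {
    Stopped,
    Playing { nominal_start_time: i64 },
}

// Snapshot taken by SpircTask on session loss and handed back to the
// replacement Spirc via (per the PR) Spirc::with_saved_state().
#[derive(Debug, Clone)]
struct SavedPlaybackState {
    connect_state: ConnectState,
    play_status: SpircPlayStatus,
    play_request_id: Option<u64>,
}
```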

Evidence after fix

34d2fd9 — 9 days of logs (Feb 28 - Mar 8):

  • 0 manual restarts needed (vs 2 in 7 days before fix)
  • 9 dealer reconnect events, all recovered in 2-91 seconds
  • Process running continuously for 9+ days

812c972 + 18eb5be + f69778d — 6 days of logs (Mar 14-20):

Dealer reconnects handled in-place (~20 events, zero spirc restarts):

Mar 18 12:14 — "Dealer reconnected; awaiting new connection_id."
Mar 18 12:14 — "re-registering with active playback state: Playing { ... }"
[playback continued uninterrupted]

Session TCP losses recovered with state preserved (5 events):

Mar 18 11:49 — "Connection to server closed."
Mar 18 11:52 — "session lost, saving playback state for recovery:
                 Playing { nominal_start_time: 1773834169899, ... }"
Mar 18 11:52 — "Spirc shut down with saved playback state, reconnecting"
Mar 18 11:52 — "Spirc[1] restoring saved playback state"
Mar 18 11:52 — "re-registering with active playback state:
                 Playing { nominal_start_time: 1773834169899, ... }"
[3 second recovery, playback never stopped]

Summary (Mar 14-20):

  • 0 "Spirc shut down unexpectedly" (vs ~2-3/day before)
  • 0 process restarts needed
  • 5 session TCP losses, all recovered with state preserved
  • ~20 dealer reconnects, all handled in-place
  • Process running continuously (Spirc counter reached Spirc[4])

Use of AI

This PR was created with GitHub Copilot CLI. Copilot assisted with root cause
analysis, implementation, code review, log analysis, and the PR description.
I'll admit to not having any knowledge of Rust, which means I'm not able to
review the Rust specifics. At a high level the changes make conceptual sense to
me and have proven to have positive effects. Do let me know if this is garbage.

Copilot AI review requested due to automatic review settings March 8, 2026 18:50
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a hang in the Connect “spirc” task after dealer websocket reconnects by adding an explicit reconnect notification mechanism so consumers can tear down and re-subscribe, and by bounding dealer URL resolution time during reconnect attempts.

Changes:

  • Add a watch-based reconnect “generation” signal plumbed through dealer builder/manager and emitted on reconnect and get_url() failures/timeouts.
  • Add a 30s timeout around get_url() during the dealer reconnect loop and retry instead of terminating the dealer task.
  • Update SpircTask to subscribe to the reconnect signal before dealer.start() and break out of its select! loop when a reconnect is observed.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
core/src/dealer/mod.rs Adds reconnect watch plumbing and get_url() timeout/handling in the reconnect loop; exposes a reconnect receiver.
core/src/dealer/manager.rs Stores and exposes the reconnect watch sender/receiver; passes sender into dealer launch.
connect/src/spirc.rs Subscribes to reconnect notifications and restarts spirc when dealer reconnects to avoid stale subscription hangs.


Comment thread core/src/dealer/mod.rs
Comment thread core/src/dealer/manager.rs
@antoinecellerier
Author

@antoinecellerier force-pushed the fix/dealer-reconnect-hang branch from d40e7ec to eb172e9 on March 8, 2026 19:09
When the dealer websocket connection drops and reconnects internally,
spirc's tokio::select! loop remains blocked on subscription streams
(connection_id_update, connect_state_update, etc.) that will never
receive new messages. The mpsc senders in the SubscriberMap are not
cleaned up on reconnect, so spirc hangs indefinitely — requiring a
manual process restart.

A second failure mode occurs when the dealer cannot reconnect because
get_url() (which resolves the dealer endpoint and fetches an auth
token via the session) hangs forever on a dead session TCP connection,
with no timeout.

Root cause analysis
-------------------

The dealer's run() loop (core/src/dealer/mod.rs) coordinates
reconnecting: when the websocket drops, it calls get_url() to resolve
a new dealer endpoint, then connect(). However:

1. The subscription channels (mpsc::UnboundedSender<Message>) stored
   in DealerShared::message_handlers survive reconnects. Spirc's
   .next() calls on the receiver side never return None because the
   senders are still alive in the map — they just never send again.

2. get_url() calls session.apresolver().resolve("dealer") and
   session.login5().auth_token(), both of which need the session's
   TCP connection. When that connection is dead ("Connection to server
   closed"), these calls hang forever with no timeout.

Before fix — log evidence of hangs requiring manual restart
-----------------------------------------------------------

  Feb 17 01:12 — "Websocket peer does not respond."
  [63.5 hour gap — process completely unresponsive]
  Feb 19 16:44 — Manual restart: "librespot 0.8.0 ..."

  Feb 23 08:41 — "Websocket peer does not respond."
  [32.2 hour gap — process completely unresponsive]
  Feb 24 16:51 — Manual restart: "librespot 0.8.0 ..."

  Dec 15 20:53-21:07 — Rapid reconnect storm: 12 "peer does not
  respond" in 50 minutes, with "starting dealer failed: Websocket
  couldn't be started because: Handshake not finished" errors.

  Feb 22 — Session TCP died at 05:55, spirc didn't notice for 7+
  hours (no dealer reconnect signal), finally shut down at 22:11.

Fix
---

Add a watch::Sender<u64> generation counter shared between the dealer
and its consumers. The dealer increments it when:

  - It successfully reconnects after a connection loss
  - get_url() times out (30s RECONNECT_URL_TIMEOUT)
  - get_url() returns an error

Spirc subscribes to a watch::Receiver before dealer.start() to avoid
a lost-wakeup race (watch retains state, unlike Notify which loses
notifications if no one is awaiting). In its select! loop, spirc
watches for changes and breaks out, triggering the existing "Spirc
shut down unexpectedly" -> auto-reconnect path in main.rs.

The get_url() error handling also fixes a pre-existing issue where
get_url() failures would propagate via ? and terminate the dealer
background task entirely, rather than retrying.

Changes:
  - core/src/dealer/mod.rs: Add watch channel plumbing to Dealer,
    Builder, create_dealer! macro, and run(). Add 30s timeout on
    get_url(). Handle get_url() errors with retry+signal instead of
    fatal ? propagation. Signal consumers on reconnect.
  - core/src/dealer/manager.rs: Store watch::Sender in
    DealerManagerInner, pass to Builder::launch(), expose
    reconnect_receiver() for consumers.
  - connect/src/spirc.rs: Subscribe to reconnect watch before
    dealer.start(). Add select! branch to break on dealer reconnect.
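The retry-and-signal handling around get_url() can be modeled without the async machinery. In the real change get_url() is wrapped in a 30s tokio timeout inside the reconnect loop; here the three outcomes are an enum so the logic is visible. Names are illustrative, not the PR's exact code:

```rust
use std::time::Duration;

// Assumed constant matching the 30s RECONNECT_URL_TIMEOUT described above.
const RECONNECT_URL_TIMEOUT: Duration = Duration::from_secs(30);

enum UrlOutcome {
    Ok(String),  // resolved a dealer endpoint
    Err(String), // apresolver/login5 returned an error
    TimedOut,    // no answer within RECONNECT_URL_TIMEOUT
}

// Returns (url to connect to, whether consumers must be signalled).
// Before the fix the Err case propagated via `?` and terminated the
// dealer background task; now both failure cases signal and retry.
fn handle_url_outcome(outcome: UrlOutcome) -> (Option<String>, bool) {
    match outcome {
        UrlOutcome::Ok(url) => (Some(url), false),
        UrlOutcome::Err(e) => {
            eprintln!("dealer get_url() failed, will retry: {e}");
            (None, true)
        }
        UrlOutcome::TimedOut => {
            eprintln!("dealer get_url() timed out after {:?}, will retry", RECONNECT_URL_TIMEOUT);
            (None, true)
        }
    }
}
```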

After fix — 9 days of logs showing automatic recovery
-----------------------------------------------------

Websocket failures now recover in 2-7 seconds automatically:

  Mar 01 15:45 — "Websocket connection failed: Connection reset"
  Mar 01 15:45 — "Dealer reconnected; notifying consumers."
  Mar 01 15:45 — "Dealer reconnected; restarting spirc to refresh subscriptions."
  Mar 01 15:46 — "Spirc shut down unexpectedly"
  Mar 01 15:46 — "active device is <> with session <...>"  [7s recovery]

  Mar 03 10:21 — "Websocket peer does not respond."
  Mar 03 10:21 — "Dealer reconnected; notifying consumers."
  Mar 03 10:21 — "restarting spirc to refresh subscriptions."
  Mar 03 10:21 — "active device is <> with session <...>"  [7s recovery]

  Mar 06 09:42 — "Websocket peer does not respond."
  Mar 06 09:42 — "Error while connecting: Network is unreachable"
  Mar 06 09:43 — [retries for ~1 min while network recovers]
  Mar 06 09:43 — "Dealer reconnected; notifying consumers."
  Mar 06 09:43 — "active device is <> with session <...>"  [91s recovery]

Summary over 9 days post-fix (Feb 28 - Mar 8):
  - 0 manual restarts needed (vs 2 in 7 days before fix)
  - 9 dealer reconnect events, all recovered in 2-91 seconds
  - 14 session TCP closures also recovered (via existing path)
  - 0 get_url() timeouts fired (websocket errors caught first)
  - Process running continuously for 9+ days

Co-authored-by: Copilot <[email protected]>
@antoinecellerier force-pushed the fix/dealer-reconnect-hang branch from eb172e9 to 34d2fd9 on March 8, 2026 19:19
@antoinecellerier antoinecellerier marked this pull request as draft March 14, 2026 12:51
antoinecellerier and others added 2 commits March 20, 2026 10:43
Symptoms observed in logs:

  WARN  librespot_core::dealer  Websocket peer does not respond.
  WARN  librespot_connect::spirc  unexpected shutdown
  WARN  librespot  Spirc shut down unexpectedly

When the dealer websocket drops (peer timeout or TLS close_notify),
SpircTask broke out of its event loop so main.rs could tear down and
recreate the entire Spirc. This caused playback to stop on every
transient websocket drop — even though the dealer already
auto-reconnects the websocket.

The subscription streams survive reconnects because they are registered
on the shared DealerShared instance. After reconnect, the server pushes
a new connection_id which handle_connection_id_update already handles.

Changes: reconnect_rx.changed() logs and continues instead of breaking.
handle_connection_id_update errors are non-fatal (logged, not breaking).

After fix — 6 days of logs (Mar 14-20) showing ~20 dealer reconnects
handled in-place without restarting spirc or stopping playback:

  Mar 16 05:06 — "Dealer reconnected; awaiting new connection_id."
  Mar 16 05:06 — "re-registering with active playback state: Paused { ... }"
  [no restart, no "Spirc shut down unexpectedly"]

  Mar 18 12:14 — "Dealer reconnected; awaiting new connection_id."
  Mar 18 12:14 — "re-registering with active playback state: Playing { ... }"
  [playback continued uninterrupted]

  Mar 19 — 9 dealer reconnects in one day, all handled in-place

Summary: 0 spirc restarts from dealer reconnects (vs ~1/day before fix).

Co-authored-by: Copilot <[email protected]>
Symptoms observed in logs — the TCP session dies, then cleanup fails:

  ERROR librespot_core::session  Connection to server closed.
  WARN  librespot_connect::spirc  unexpected shutdown
  ERROR librespot_core::session  Broken pipe (os error 32)
  ERROR librespot_core::session  Transport endpoint is not connected (os error 107)
  WARN  librespot  Spirc shut down unexpectedly

When SpircTask exits because session.is_invalid(), the post-loop
cleanup called handle_disconnect() (which sets play_status to Stopped
and tries to notify Spotify), delete_connect_state_request(), and
dealer().close(). All of these fail because the TCP connection is dead,
and setting play_status to Stopped needlessly kills the Player.

Now we detect session.is_invalid() and skip all server communication
in the post-loop cleanup. The Player runs in a separate thread and
continues playing from its audio buffer. main.rs will create a new
session and Spirc.

After fix — the "Broken pipe" and "Transport endpoint is not
connected" errors no longer appear after session loss. The Player
continues playing while the session reconnects (see next commit for
state restoration evidence).

Co-authored-by: Copilot <[email protected]>
@antoinecellerier antoinecellerier changed the title fix: dealer websocket reconnect leaving spirc hung on stale channels fix: dealer reconnect and session loss recovery without playback interruption Mar 20, 2026
When SpircTask exits due to session loss, it now saves its
ConnectState, SpircPlayStatus, and play_request_id into a
SavedPlaybackState. main.rs captures this and passes it to
Spirc::with_saved_state() when creating the replacement Spirc.

The restored SpircTask starts with the saved state. On the first
connection_id_update, it updates the playback position to account for
elapsed time and re-registers with Spotify showing the correct track
and position. The Player is never interrupted.
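The position update described above can be sketched in one line: librespot's nominal_start_time is the instant playback would have started had it run continuously, so the current position is simply "now" minus that timestamp. The function name is an assumption for illustration, not the PR's code:

```rust
// Current position derived from nominal_start_time; clamped at zero to
// guard against clock skew. Name and signature are illustrative.
fn restored_position_ms(nominal_start_time_ms: i64, now_ms: i64) -> u32 {
    (now_ms - nominal_start_time_ms).max(0) as u32
}
```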

After fix — 6 days of logs (Mar 14-20) showing 5 TCP session losses
all recovered with playback state preserved:

  Mar 18 11:49 — "Connection to server closed."
  Mar 18 11:52 — "session lost, saving playback state for recovery:
                   Playing { nominal_start_time: 1773834169899, ... }"
  Mar 18 11:52 — "Spirc shut down with saved playback state, reconnecting"
  Mar 18 11:52 — "Spirc[1] restoring saved playback state"
  Mar 18 11:52 — "re-registering with active playback state:
                   Playing { nominal_start_time: 1773834169899, ... }"
  [3 second recovery, playback never stopped]

  Mar 19 12:21-12:37 — Two session losses during active playback,
  both recovered in ~2 seconds with Playing state preserved.

Summary over 6 days post-fix (Mar 14-20):
  - 0 "Spirc shut down unexpectedly" (vs ~2-3/day before fix)
  - 0 process restarts needed
  - 5 session TCP losses, all recovered with state preserved
  - ~20 dealer reconnects, all handled in-place
  - Process running continuously (Spirc counter reached Spirc[4])

Co-authored-by: Copilot <[email protected]>
@antoinecellerier force-pushed the fix/dealer-reconnect-hang branch from 3388445 to f69778d on March 20, 2026 10:09
@antoinecellerier antoinecellerier marked this pull request as ready for review March 20, 2026 10:13
@artenverho

artenverho commented Mar 29, 2026

Works great for me, thanks! I switched ISP before the weekend and since then librespot would lose connection at least once every 15 minutes. Very odd behavior that so far I have not been able to pinpoint to a specific cause (perhaps something to do with poor multicast support in the supplied router?). This PR fixed it!

@artenverho

artenverho commented Apr 5, 2026

I think I’ve encountered an asymmetrical issue when switching between accounts that seems specific to this PR. If the device has credentials (username) defined in the config, it becomes impossible to switch back to that "Owner" account after a "Guest" (Discovery/Zeroconf) session has been active or is active (the "Owner" can join the jam but not take over the session). The error it shows is:

[ERROR librespot_core::dealer::manager] failed sending dealer request channel closed

to reproduce the error:

  • Configure librespot with USERNAME and credentials saved (Account A).
  • Start the service. It successfully logs in as Account A.
  • Connect to the device using a different Spotify account (Account B) via Discovery/Zeroconf. (This works fine).
  • Optional: Stop playback or Disconnect Account B.
  • Attempt to trigger playback via Account A via the Spotify App.
  • Result: Account A fails to connect.

The only way to recover is to manually restart the service. Curiously, the "Guest" does not experience the same issue when the "Owner" is playing; it can happily take over.

It seems the new recovery logic might be preventing a clean shutdown of the Dealer task during an account handover? It is curious that it only happens in this specific situation though. Other than this, I haven't experienced any problems (thanks again!)

Only save state and skip cleanup on unintentional session loss (not when shutdown was explicitly requested via spirc.shutdown()). Fixes account handover: when Discovery triggers a new account, session.shutdown() races with the Shutdown command — SpircTask would see session.is_invalid() and skip dealer().close(), leaving a stale dealer with closed command channels.
Clear saved_playback_state on Discovery credential change. Without this, a session loss under Account A saves state, then Account B takes over via Discovery, and when Account A reconnects later, it restores stale state from a different account/session.
Restore break on initial handle_connection_id_update failure. Only tolerate errors after connect_established is true (re-registration after dealer reconnect). Without this, a failed initial registration leaves the device in connect_established=false where local commands are silently ignored.
@antoinecellerier
Author

@artenverho Thanks for the detailed report! Copilot has likely identified the root cause and I've pushed 3 fixup commits.

The problem: When Account B takes over via Discovery, spirc.shutdown() and session.shutdown() race. The session gets marked invalid before the SpircTask processes the Shutdown command, so our recovery code kicks in — it skips dealer().close() (thinking the session died unexpectedly) and saves playback state. This leaves a stale dealer with closed command channels, causing failed sending dealer request channel closed when Account A tries to reconnect.

Fixes pushed (as fixup commits, not yet squashed):

  • 2ac494b — Only save state and skip dealer cleanup on unintentional session loss (&& !self.shutdown). When shutdown was explicitly requested (account handover), fall through to normal cleanup including dealer().close().

  • 48adba0 — Clear saved playback state when Discovery provides new credentials, so stale state from a different account is never restored into the new Spirc.

  • 303026b — Restore break on initial handle_connection_id_update failure (unrelated to your report, but another edge case found during review). After dealer reconnect, errors are tolerated since the next connection_id push will retry.
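The 2ac494b condition boils down to a single predicate. A minimal sketch, with illustrative parameter names rather than the actual field names in SpircTask:

```rust
// Sketch of the 2ac494b guard: playback state is saved and server
// cleanup skipped only on *unintentional* session loss, i.e. the
// session is invalid AND no explicit shutdown (account handover) was
// requested. Names are illustrative.
fn should_save_and_skip_cleanup(session_is_invalid: bool, shutdown_requested: bool) -> bool {
    session_is_invalid && !shutdown_requested
}
```

On an account handover, session.shutdown() may mark the session invalid before the Shutdown command is processed; the shutdown flag keeps that case on the normal cleanup path, including dealer().close().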

Could you test with these changes and let me know if the account switching issue is resolved?

@artenverho

Sorry, it took a few days to find the time for testing. At first glance it seems to solve the issue! I will need to do some more long-term testing, but switching between users is now seamless again. Thanks!
