fix: dealer websocket reconnect leaving spirc hung on stale channels

antoinecellerier · Copilot · antoinecellerier · commit 34d2fd94bdea · 2026-03-08T20:19:47.000+01:00
When the dealer websocket connection drops and reconnects internally,
spirc's tokio::select! loop remains blocked on subscription streams
(connection_id_update, connect_state_update, etc.) that will never
receive new messages. The mpsc senders in the SubscriberMap are not
cleaned up on reconnect, so spirc hangs indefinitely — requiring a
manual process restart.

A second failure mode occurs when the dealer cannot reconnect because
get_url() (which resolves the dealer endpoint and fetches an auth
token via the session) hangs forever on a dead session TCP connection,
with no timeout.

Root cause analysis
-------------------

The dealer's run() loop (core/src/dealer/mod.rs) coordinates
reconnecting: when the websocket drops, it calls get_url() to resolve
a new dealer endpoint, then connect(). However:

1. The subscription channels (mpsc::UnboundedSender<Message>) stored
   in DealerShared::message_handlers survive reconnects. Spirc's
   .next() calls on the receiver side never return None because the
   senders are still alive in the map — they just never send again.

2. get_url() calls session.apresolver().resolve("dealer") and
   session.login5().auth_token(), both of which need the session's
   TCP connection. When that connection is dead ("Connection to server
   closed"), these calls hang forever with no timeout.
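
As an illustration of point 1, the following is a minimal std-only
sketch, not the librespot code. It assumes std::sync::mpsc in place of
tokio's mpsc, and the SubscriberMap alias, the function name and the
"hm://example" URI are hypothetical. While a sender is still stored in
the map, the receiver can only observe "nothing yet"; it observes
"closed" only after the sender has been dropped.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::time::Duration;

/// Hypothetical stand-in for the dealer's subscriber map: URI -> senders.
type SubscriberMap = HashMap<String, Vec<Sender<String>>>;

/// Returns what a receiver observes (a) while its sender is still stored in
/// the map and (b) after the map entry has been dropped.
fn observe_stale_then_closed() -> (RecvTimeoutError, RecvTimeoutError) {
    let mut handlers: SubscriberMap = HashMap::new();
    let (tx, rx) = mpsc::channel::<String>();
    handlers
        .entry("hm://example".to_string())
        .or_default()
        .push(tx);

    // Nobody sends any more, but the sender is still alive in the map:
    // the receiver cannot tell a quiet channel from a dead one.
    let while_held = rx.recv_timeout(Duration::from_millis(50)).unwrap_err();

    // Dropping the senders is what lets the receiver see the channel close.
    handlers.clear();
    let after_drop = rx.recv_timeout(Duration::from_millis(50)).unwrap_err();

    (while_held, after_drop)
}
```

The first value is the std equivalent of a stream whose .next() stays
pending; the second is the equivalent of .next() returning None.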

Before fix — log evidence of hangs requiring manual restart
-----------------------------------------------------------

  Feb 17 01:12 — "Websocket peer does not respond."
  [63.5 hour gap — process completely unresponsive]
  Feb 19 16:44 — Manual restart: "librespot 0.8.0 ..."

  Feb 23 08:41 — "Websocket peer does not respond."
  [32.2 hour gap — process completely unresponsive]
  Feb 24 16:51 — Manual restart: "librespot 0.8.0 ..."

  Dec 15 20:53-21:07 — Rapid reconnect storm: 12 "peer does not
  respond" in 14 minutes, with "starting dealer failed: Websocket
  couldn't be started because: Handshake not finished" errors.

  Feb 22 — Session TCP died at 05:55, spirc didn't notice for 16+
  hours (no dealer reconnect signal), finally shut down at 22:11.

Fix
---

Add a watch::Sender<u64> generation counter shared between the dealer
and its consumers. The dealer increments it when:

  - It successfully reconnects after a connection loss
  - get_url() times out (30s RECONNECT_URL_TIMEOUT)
  - get_url() returns an error

Spirc subscribes to a watch::Receiver before dealer.start() to avoid
a lost-wakeup race (watch retains the latest value, whereas
Notify::notify_waiters() only wakes tasks that are already waiting).
In its select! loop, spirc watches for changes and breaks out,
triggering the existing "Spirc shut down unexpectedly" -> auto-reconnect
path in main.rs.
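
The reason a stateful counter avoids the lost wakeup can be sketched
without tokio. This is a std-only analogue built on Mutex and Condvar,
not the watch channel itself; the Generation type and its bump,
subscribe and wait_changed methods are hypothetical names chosen for
this example. The consumer records the generation it saw when it
subscribed, so a bump that lands before it starts waiting is still
detected.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Std-only analogue of a watch channel carrying a generation counter.
#[derive(Default)]
struct Generation {
    value: Mutex<u64>,
    changed: Condvar,
}

impl Generation {
    /// Producer side: bump the counter and wake every waiter.
    fn bump(&self) {
        *self.value.lock().unwrap() += 1;
        self.changed.notify_all();
    }

    /// Consumer side: remember the generation visible at subscribe time.
    fn subscribe(&self) -> u64 {
        *self.value.lock().unwrap()
    }

    /// Blocks until the counter differs from `seen` or `timeout` elapses.
    /// Returns true if a change was observed.
    fn wait_changed(&self, seen: u64, timeout: Duration) -> bool {
        let guard = self.value.lock().unwrap();
        let (guard, _) = self
            .changed
            .wait_timeout_while(guard, timeout, |current| *current == seen)
            .unwrap();
        *guard != seen
    }
}

/// The bump happens before the consumer starts waiting. Because the counter
/// is stored state rather than a one-shot event, the change is still seen.
fn bump_before_wait_is_not_lost() -> bool {
    let generation = Arc::new(Generation::default());
    let seen = generation.subscribe(); // subscribe first, as spirc does

    let producer = Arc::clone(&generation);
    thread::spawn(move || producer.bump()).join().unwrap();

    generation.wait_changed(seen, Duration::from_millis(100))
}
```

Comparing against a remembered value is also why the counter only needs
to change, not to carry any particular number.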

The get_url() error handling also fixes a pre-existing issue where
get_url() failures would propagate via ? and terminate the dealer
background task entirely, rather than retrying.
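
The shape of that change, bounding each attempt and retrying instead of
returning on the first error, can be sketched as follows. This is a
std-only approximation under stated assumptions: a helper thread plus
recv_timeout stands in for wrapping a future in a timeout, the Attempt
enum and both function names are hypothetical, and the max_attempts
bound exists only so the example terminates (the real loop runs until
the dealer is closed, sleeps between attempts, and only signals once an
initial connection has existed).

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of one resolution attempt, mirroring the three match arms.
#[derive(Debug, PartialEq)]
enum Attempt {
    Url(String),
    Failed(String),
    TimedOut,
}

/// Runs `resolve` on a helper thread and waits at most `timeout` for it.
fn attempt_with_timeout<F>(resolve: F, timeout: Duration) -> Attempt
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(resolve());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(url)) => Attempt::Url(url),
        Ok(Err(e)) => Attempt::Failed(e),
        Err(RecvTimeoutError::Timeout) => Attempt::TimedOut,
        Err(RecvTimeoutError::Disconnected) => {
            Attempt::Failed("resolver exited without a result".to_string())
        }
    }
}

/// Retries instead of bailing out on the first failure. Returns the URL, if
/// any, and how many failures were signalled through the counter.
fn retry_until_resolved<F>(mut attempt: F, max_attempts: u32) -> (Option<String>, u64)
where
    F: FnMut(u32) -> Attempt,
{
    let mut generation = 0u64;
    for n in 0..max_attempts {
        match attempt(n) {
            Attempt::Url(url) => return (Some(url), generation),
            Attempt::Failed(_) | Attempt::TimedOut => {
                generation += 1; // stand-in for notifying consumers
            }
        }
    }
    (None, generation)
}
```

With the previous ? propagation, the first Failed outcome would have
ended the loop; here it only costs one iteration.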

Changes:
  - core/src/dealer/mod.rs: Add watch channel plumbing to Dealer,
    Builder, create_dealer! macro, and run(). Add 30s timeout on
    get_url(). Handle get_url() errors with retry+signal instead of
    fatal ? propagation. Signal consumers on reconnect.
  - core/src/dealer/manager.rs: Store watch::Sender in
    DealerManagerInner, pass to Builder::launch(), expose
    reconnect_receiver() for consumers.
  - connect/src/spirc.rs: Subscribe to reconnect watch before
    dealer.start(). Add select! branch to break on dealer reconnect.

After fix — 9 days of logs showing automatic recovery
-----------------------------------------------------

Websocket failures now typically recover in 2-7 seconds automatically
(91 seconds in the one case where the network itself was unreachable):

  Mar 01 15:45 — "Websocket connection failed: Connection reset"
  Mar 01 15:45 — "Dealer reconnected; notifying consumers."
  Mar 01 15:45 — "Dealer reconnected; restarting spirc to refresh subscriptions."
  Mar 01 15:46 — "Spirc shut down unexpectedly"
  Mar 01 15:46 — "active device is <> with session <...>"  [7s recovery]

  Mar 03 10:21 — "Websocket peer does not respond."
  Mar 03 10:21 — "Dealer reconnected; notifying consumers."
  Mar 03 10:21 — "restarting spirc to refresh subscriptions."
  Mar 03 10:21 — "active device is <> with session <...>"  [7s recovery]

  Mar 06 09:42 — "Websocket peer does not respond."
  Mar 06 09:42 — "Error while connecting: Network is unreachable"
  Mar 06 09:43 — [retries for ~1 min while network recovers]
  Mar 06 09:43 — "Dealer reconnected; notifying consumers."
  Mar 06 09:43 — "active device is <> with session <...>"  [91s recovery]

Summary over 9 days post-fix (Feb 28 - Mar 8):
  - 0 manual restarts needed (vs 2 in 7 days before fix)
  - 9 dealer reconnect events, all recovered in 2-91 seconds
  - 14 session TCP closures also recovered (via existing path)
  - 0 get_url() timeouts fired (websocket errors caught first)
  - Process running continuously for 9+ days

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -22,6 +22,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [main] Fixed `--volume-ctrl fixed` not disabling volume control
 - [core] Fix default permissions on credentials file and warn user if file is world readable
 - [core] Try all resolved addresses for the dealer connection instead of failing after the first one.
+- [core] Fix dealer websocket reconnect leaving spirc hung on stale subscription channels.
 
 ## [0.8.0] - 2025-11-10
 
diff --git a/connect/src/spirc.rs b/connect/src/spirc.rs
@@ -458,6 +458,9 @@ impl SpircTask {
             };
         }
 
+        // Subscribe before start() so we can't miss a reconnect notification.
+        let mut reconnect_rx = self.session.dealer().reconnect_receiver();
+
         if let Err(why) = self.session.dealer().start().await {
             error!("starting dealer failed: {why}");
             return;
@@ -585,6 +588,12 @@ impl SpircTask {
                         }
                     }
                 },
+                // dealer reconnected after a connection loss — our subscription
+                // streams are stale, so break out and let main.rs re-create spirc
+                Ok(()) = reconnect_rx.changed() => {
+                    warn!("Dealer reconnected; restarting spirc to refresh subscriptions.");
+                    break;
+                },
                 else => break
             }
         }
diff --git a/core/src/dealer/manager.rs b/core/src/dealer/manager.rs
@@ -2,7 +2,7 @@ use futures_core::Stream;
 use futures_util::StreamExt;
 use std::{pin::Pin, str::FromStr, sync::OnceLock};
 use thiserror::Error;
-use tokio::sync::mpsc;
+use tokio::sync::{mpsc, watch};
 use tokio_stream::wrappers::UnboundedReceiverStream;
 use url::Url;
 
@@ -16,6 +16,7 @@ component! {
     DealerManager: DealerManagerInner {
         builder: OnceLock<Builder> = OnceLock::from(Builder::new()),
         dealer: OnceLock<Dealer> = OnceLock::new(),
+        reconnect_tx: watch::Sender<u64> = watch::Sender::new(0),
     }
 }
 
@@ -153,10 +154,12 @@ impl DealerManager {
         // and the token is expired we will just get 401 error
         let get_url = move || Self::get_url(session.clone());
 
+        let reconnect_tx = self.lock(|inner| inner.reconnect_tx.clone());
+
         let dealer = self
             .lock(move |inner| inner.builder.take())
             .ok_or(DealerError::BuilderNotAvailable)?
-            .launch(get_url, None)
+            .launch(get_url, None, reconnect_tx)
             .await
             .map_err(DealerError::LaunchFailure)?;
 
@@ -171,4 +174,8 @@ impl DealerManager {
             dealer.close().await
         }
     }
+
+    pub fn reconnect_receiver(&self) -> watch::Receiver<u64> {
+        self.lock(|inner| inner.reconnect_tx.subscribe())
+    }
 }
diff --git a/core/src/dealer/mod.rs b/core/src/dealer/mod.rs
@@ -21,6 +21,7 @@ use tokio::{
     sync::{
         Semaphore,
         mpsc::{self, UnboundedReceiver},
+        watch,
     },
     task::JoinHandle,
 };
@@ -55,6 +56,7 @@ const PING_INTERVAL: Duration = Duration::from_secs(30);
 const PING_TIMEOUT: Duration = Duration::from_secs(3);
 
 const RECONNECT_INTERVAL: Duration = Duration::from_secs(10);
+const RECONNECT_URL_TIMEOUT: Duration = Duration::from_secs(30);
 
 const DEALER_REQUEST_HANDLERS_POISON_MSG: &str =
     "dealer request handlers mutex should not be poisoned";
@@ -261,7 +263,7 @@ struct Builder {
 }
 
 macro_rules! create_dealer {
-    ($builder:expr, $shared:ident -> $body:expr) => {
+    ($builder:expr, $reconnect_tx:expr, $shared:ident -> $body:expr) => {
         match $builder {
             builder => {
                 let shared = Arc::new(DealerShared {
@@ -270,6 +272,8 @@ macro_rules! create_dealer {
                     notify_drop: Semaphore::new(0),
                 });
 
+                let reconnect_tx: watch::Sender<u64> = $reconnect_tx;
+
                 let handle = {
                     let $shared = Arc::clone(&shared);
                     tokio::spawn($body)
@@ -278,6 +282,7 @@ macro_rules! create_dealer {
                 Dealer {
                     shared,
                     handle: TimeoutOnDrop::new(handle, WEBSOCKET_CLOSE_TIMEOUT),
+                    reconnect_tx,
                 }
             }
         }
@@ -301,26 +306,38 @@ impl Builder {
         handles(&self.request_handlers, &self.message_handlers, uri)
     }
 
-    pub fn launch_in_background<Fut, F>(self, get_url: F, proxy: Option<Url>) -> Dealer
+    pub fn launch_in_background<Fut, F>(
+        self,
+        get_url: F,
+        proxy: Option<Url>,
+        reconnect_tx: watch::Sender<u64>,
+    ) -> Dealer
     where
         Fut: Future<Output = GetUrlResult> + Send + 'static,
         F: (Fn() -> Fut) + Send + 'static,
     {
-        create_dealer!(self, shared -> run(shared, None, get_url, proxy))
+        let tx = reconnect_tx.clone();
+        create_dealer!(self, reconnect_tx, shared -> run(shared, None, get_url, proxy, tx))
     }
 
-    pub async fn launch<Fut, F>(self, get_url: F, proxy: Option<Url>) -> WsResult<Dealer>
+    pub async fn launch<Fut, F>(
+        self,
+        get_url: F,
+        proxy: Option<Url>,
+        reconnect_tx: watch::Sender<u64>,
+    ) -> WsResult<Dealer>
     where
         Fut: Future<Output = GetUrlResult> + Send + 'static,
         F: (Fn() -> Fut) + Send + 'static,
     {
-        let dealer = create_dealer!(self, shared -> {
+        let tx = reconnect_tx.clone();
+        let dealer = create_dealer!(self, reconnect_tx, shared -> {
             // Try to connect.
             let url = get_url().await?;
             let tasks = connect(&url, proxy.as_ref(), &shared).await?;
 
             // If a connection is established, continue in a background task.
-            run(shared, Some(tasks), get_url, proxy)
+            run(shared, Some(tasks), get_url, proxy, tx)
         });
 
         Ok(dealer)
@@ -426,6 +443,7 @@ impl DealerShared {
 struct Dealer {
     shared: Arc<DealerShared>,
     handle: TimeoutOnDrop<Result<(), Error>>,
+    reconnect_tx: watch::Sender<u64>,
 }
 
 impl Dealer {
@@ -482,6 +500,10 @@ impl Dealer {
         )
     }
 
+    pub fn reconnect_receiver(&self) -> watch::Receiver<u64> {
+        self.reconnect_tx.subscribe()
+    }
+
     pub async fn close(mut self) {
         debug!("closing dealer");
 
@@ -665,19 +687,24 @@ async fn run<F, Fut>(
     initial_tasks: Option<(JoinHandle<()>, JoinHandle<()>)>,
     mut get_url: F,
     proxy: Option<Url>,
+    reconnect_tx: watch::Sender<u64>,
 ) -> Result<(), Error>
 where
     Fut: Future<Output = GetUrlResult> + Send + 'static,
     F: (FnMut() -> Fut) + Send + 'static,
 {
     let init_task = |t| Some(TimeoutOnDrop::new(t, WEBSOCKET_CLOSE_TIMEOUT));
 
+    let has_had_initial_connection = initial_tasks.is_some();
+
     let mut tasks = if let Some((s, r)) = initial_tasks {
         (init_task(s), init_task(r))
     } else {
         (None, None)
     };
 
+    let mut has_connected = has_had_initial_connection;
+
     while !shared.is_closed() {
         match &mut tasks {
             (Some(t0), Some(t1)) => {
@@ -702,11 +729,38 @@ where
                     () = shared.closed() => {
                         break
                     },
-                    e = get_url() => e
-                }?;
+                    result = tokio::time::timeout(RECONNECT_URL_TIMEOUT, get_url()) => {
+                        match result {
+                            Ok(Ok(url)) => url,
+                            Ok(Err(e)) => {
+                                error!("Failed to resolve dealer URL: {e}");
+                                if has_connected {
+                                    reconnect_tx.send_modify(|n| *n += 1);
+                                }
+                                tokio::time::sleep(RECONNECT_INTERVAL).await;
+                                continue;
+                            }
+                            Err(_) => {
+                                error!("Timed out resolving dealer URL.");
+                                if has_connected {
+                                    reconnect_tx.send_modify(|n| *n += 1);
+                                }
+                                tokio::time::sleep(RECONNECT_INTERVAL).await;
+                                continue;
+                            }
+                        }
+                    }
+                };
 
                 match connect(&url, proxy.as_ref(), &shared).await {
-                    Ok((s, r)) => tasks = (init_task(s), init_task(r)),
+                    Ok((s, r)) => {
+                        tasks = (init_task(s), init_task(r));
+                        if has_connected {
+                            warn!("Dealer reconnected; notifying consumers.");
+                            reconnect_tx.send_modify(|n| *n += 1);
+                        }
+                        has_connected = true;
+                    }
                     Err(e) => {
                         error!("Error while connecting: {e}");
                         tokio::time::sleep(RECONNECT_INTERVAL).await;
