Fix cluster pub/sub eviction and re-home races #155

Merged: ghostdogpr merged 1 commit into main from fix/cluster-pubsub-eviction-races on Jul 2, 2026

ghostdogpr (Owner) commented Jul 2, 2026

Three cluster pub/sub concurrency bugs in the shard-subscription manager and its placement ledger.

H3 — eviction racing an in-flight subscribe silently kills the stream (reported healthy)

evictEmptyShardConns snapshotted empty connections under the lock but closed them outside it. A concurrent place/reconcile holding a stale connection reference (from an earlier ensure) could bind a sink in that window. The close then terminated the just-registered sink, so the stream ended as if complete, while awaitActive returned normally on Closed (cluster attaches don't use failIfUnconfirmed). Placement recorded the phantom, fullyPlaced passed, and no retry was scheduled: a dead subscription reported healthy.

Fix: SubscriptionConnection.closeIfEmpty() decides emptiness and the Closed transition in one critical section. A racing attach either registers first (connection stays non-empty, kept) or observes Closed and throws before registering — so its sink is never bound to the dying connection, and place leaves it unplaced to retry onto a fresh one. evictEmptyShardConns re-checks emptiness through closeIfEmpty at close time.
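
The single-critical-section idea can be illustrated with a minimal sketch. This is not the actual SubscriptionConnection code: the class name `SketchConnection`, the method signatures, and the use of plain strings for channels and sinks are assumptions made for illustration, and the real close also has to release the socket.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SubscriptionConnection; names and signatures
// are illustrative only, not the library's actual API.
final class SketchConnection {
  private val lock   = new Object
  private val sinks  = mutable.Map.empty[String, String] // channel -> sink id
  private var closed = false

  /** Registers a sink, or throws if the connection is already Closed. */
  def attach(channel: String, sink: String): Unit = lock.synchronized {
    if (closed) throw new IllegalStateException("connection closed")
    sinks.update(channel, sink)
  }

  def detach(channel: String): Unit = lock.synchronized {
    sinks.remove(channel)
    ()
  }

  /** Checks emptiness and flips to Closed under the same lock, so no attach
    * can slip in between the check and the transition.
    * Returns true only if this call closed the connection.
    */
  def closeIfEmpty(): Boolean = lock.synchronized {
    if (closed || sinks.nonEmpty) false
    else { closed = true; true }
  }

  def isClosed: Boolean = lock.synchronized(closed)
}
```

With this shape, an attach that wins the lock first keeps the connection alive, and one that loses it fails before registering, so the caller can leave the channel unplaced and retry.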

M2 — rehomeClassic leaks the dropped connection on partial failure

On partial failure, rehomeClassic nulled classicConn and retried without closing conn, which by then could be Live and carrying successfully re-attached subs. The result was duplicate deliveries on two live sockets, plus a leaked socket and watchdog.

Fix: SubscriptionConnection.shutdown() closes the socket and watchdog without terminating the manager-owned sinks, so the retry can re-attach them. Because transport.close() interrupts the reader, a backpressured sink can't hang the join. rehomeClassic calls shutdown() before retrying.
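
The difference between a full close and a shutdown might look roughly like the sketch below. Everything here is hypothetical: `SketchSink`, `SketchConn`, and the `transportOpen` flag (standing in for socket plus watchdog) are illustrative names, not the real types.

```scala
import scala.collection.mutable

// Hypothetical manager-owned sink; `terminated` marks end-of-stream.
final class SketchSink(val channel: String) {
  @volatile var terminated = false
}

// Hypothetical connection; `transportOpen` stands in for socket + watchdog.
final class SketchConn {
  private val sinks = mutable.ListBuffer.empty[SketchSink]
  @volatile var transportOpen = true

  def attach(sink: SketchSink): Unit = synchronized {
    sinks += sink
    ()
  }

  /** Full close: releases the transport and ends every attached stream. */
  def close(): Unit = synchronized {
    transportOpen = false
    sinks.foreach(s => s.terminated = true)
    sinks.clear()
  }

  /** Shutdown: releases the transport but leaves the sinks untouched, so
    * the caller can re-attach them to a fresh connection.
    */
  def shutdown(): Unit = synchronized {
    transportOpen = false
    sinks.clear()
  }
}
```

In this sketch a sink that was attached to a connection later shut down is still usable: attaching it to a new `SketchConn` continues the same stream instead of ending it.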

M3 — fullyPlaced summed per-node set sizes

A channel double-recorded on two nodes (possible because place computes its plan outside the reconcile single-flight, against a stale topology) inflated the sum, masking another channel that never landed — stranded with no retry.

Fix: count the distinct union of placed channels. The double-placement then self-heals on the next reconcile.
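
A small sketch of why the sum is misleading. The ledger shape (node id mapped to a set of channel names), the object name `PlacementSketch`, and the size comparison against the wanted set are assumptions for illustration, not the actual Placement code.

```scala
// Hypothetical placement ledger: node id -> channels recorded on that node.
object PlacementSketch {
  type Ledger = Map[String, Set[String]]

  /** Pre-fix shape: sums per-node set sizes, so a channel recorded on two
    * nodes is counted twice.
    */
  def fullyPlacedBySum(ledger: Ledger, wanted: Set[String]): Boolean =
    ledger.values.map(_.size).sum == wanted.size

  /** Post-fix shape: counts the distinct union of placed channels. */
  def fullyPlacedByUnion(ledger: Ledger, wanted: Set[String]): Boolean =
    ledger.values.flatten.toSet.size == wanted.size
}
```

For example, with `c1` recorded on both `node-a` and `node-b` and `c2` never placed, the sum is 2 and matches the two wanted channels, while the distinct union has size 1 and correctly reports the placement as incomplete.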

Contributing bug (also fixed): an error reply to SSUBSCRIBE (e.g. MOVED) was misrouted by onFrame as a bootstrap/PONG reply and silently swallowed, so an unconfirmed subscribe was recorded as placed. A non-bootstrap error frame now drops the connection (off the reader thread) so the manager re-homes (cluster) or reconnects (standalone). Note: this also changes standalone behavior — an unexpected error on a subscription socket, previously ignored, now triggers a reconnect; a post-bootstrap error there is always abnormal.
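
The routing change can be pictured with the following sketch. The frame and action types, the `FrameRouting.route` function, and the `pendingBootstrap` counter are all hypothetical; the real onFrame is not shown in this PR description, so this only illustrates the rule "an error frame outside bootstrap drops the connection".

```scala
// Hypothetical frame and action types, for illustration only.
sealed trait Frame
final case class ErrorReply(message: String) extends Frame
final case class SimpleReply(text: String)   extends Frame

sealed trait Action
case object HandleAsBootstrapReply extends Action
case object DropConnection         extends Action
case object ConsumeReply           extends Action

object FrameRouting {

  /** `pendingBootstrap` is the number of bootstrap replies still expected.
    * While it is positive, any reply belongs to the bootstrap exchange.
    * Afterwards an error reply is abnormal and drops the connection rather
    * than being consumed like a PONG.
    */
  def route(frame: Frame, pendingBootstrap: Int): Action = frame match {
    case _ if pendingBootstrap > 0 => HandleAsBootstrapReply
    case ErrorReply(_)             => DropConnection
    case SimpleReply(_)            => ConsumeReply
  }
}
```

Under these assumptions a `MOVED` error arriving after bootstrap maps to `DropConnection`, which is what lets the manager re-home or reconnect instead of recording the subscribe as placed.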

Tests

  • PlacementSpec: fullyPlaced counts distinct channels, so a double-recorded one can't mask an unplaced channel.
  • SubscriptionConnectionSpec: closeIfEmpty keeps a connection holding a sink / closes once empty / rejects a racing attach; shutdown drops the socket but leaves the sink usable for re-attach; an unexpected error reply drops the connection instead of being swallowed.

All six client backend cells compile; all 141 sage.client.internal unit tests pass; scalafmt clean. The M3 and MOVED tests were confirmed to fail against the pre-fix code.

…e miscount

H3: shard-connection eviction snapshotted empty connections under the lock but
closed them outside it, so a concurrent place/reconcile holding a stale
connection reference could bind a sink in that window; the close then
terminated the just-registered sink while awaitActive returned normally on
Closed, recording a phantom placement with no retry. SubscriptionConnection
gains closeIfEmpty, which decides emptiness and the Closed transition in one
critical section: a racing attach either registers first (connection kept) or
observes Closed and fails before registering, so eviction never terminates a
sink an in-flight attach just bound.

M2: rehomeClassic nulled classicConn on partial failure without closing the
connection, which could be Live carrying re-attached subs — duplicate
deliveries on two live sockets plus a leaked socket and watchdog. It now calls
a new shutdown that closes the socket/watchdog without terminating the
manager-owned sinks, so the retry re-homes them onto a fresh connection.

M3: fullyPlaced summed per-node set sizes, so a channel double-recorded on two
nodes masked another that never landed. It now counts the distinct union of
placed channels. A contributing bug: an error reply to SSUBSCRIBE (e.g. MOVED)
was misrouted by onFrame as a bootstrap/PONG reply and swallowed, recording an
unconfirmed subscribe as placed; an unexpected error frame now drops the
connection so the manager re-homes or reconnects.

Regression tests in PlacementSpec and SubscriptionConnectionSpec.
@ghostdogpr ghostdogpr changed the title Fix cluster pub/sub eviction races, classic re-home leak, and coverage miscount Fix cluster pub/sub eviction and re-home races Jul 2, 2026
@ghostdogpr ghostdogpr force-pushed the fix/cluster-pubsub-eviction-races branch from 535851d to 4fe6ffa July 2, 2026 09:17
@ghostdogpr ghostdogpr merged commit 9627f07 into main Jul 2, 2026
9 checks passed
@ghostdogpr ghostdogpr deleted the fix/cluster-pubsub-eviction-races branch July 2, 2026 09:21