Fix cluster pub/sub eviction and re-home races by ghostdogpr · Pull Request #155 · ghostdogpr/sage

ghostdogpr · 2026-07-02T09:16:07Z

Three cluster pub/sub concurrency bugs in the shard-subscription manager and its placement ledger.

H3 — eviction racing an in-flight subscribe silently kills the stream (reported healthy)

evictEmptyShardConns snapshotted empty connections under the lock but closed them outside it. A concurrent place/reconcile holding a stale connection reference (from an earlier ensure) could bind a sink in that window. The close then terminated the just-registered sink (the stream ends as if complete), while awaitActive returned normally on Closed (cluster attaches don't use failIfUnconfirmed). Placement recorded the phantom, fullyPlaced passed, and no retry was scheduled: a dead subscription reported healthy.

Fix: SubscriptionConnection.closeIfEmpty() decides emptiness and the Closed transition in one critical section. A racing attach either registers first (connection stays non-empty, kept) or observes Closed and throws before registering — so its sink is never bound to the dying connection, and place leaves it unplaced to retry onto a fresh one. evictEmptyShardConns re-checks emptiness through closeIfEmpty at close time.
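The pattern behind this fix can be sketched in isolation. The following is a minimal model, not the sage implementation: `SketchConnection`, its `attach`/`detach` methods, and the use of plain string ids for sinks are all assumptions made for illustration. The point it shows is that the emptiness check and the Closed transition share one critical section with attach, so an attach either lands first or fails.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the closeIfEmpty pattern (not sage's code).
final class SketchConnection {
  private val lock   = new Object
  private var closed = false
  private val sinks  = mutable.Set.empty[String]

  /** Registers a sink, or throws if the connection was already closed. */
  def attach(sinkId: String): Unit = lock.synchronized {
    if (closed) throw new IllegalStateException("connection is closed")
    sinks += sinkId
    ()
  }

  /** Removes a sink; the connection may become eligible for eviction. */
  def detach(sinkId: String): Unit = lock.synchronized {
    sinks -= sinkId
    ()
  }

  /** Closes only if no sink is registered. The check and the transition
    * happen under the same lock as attach, so they cannot interleave.
    * Returns true only when this call performed the close.
    */
  def closeIfEmpty(): Boolean = lock.synchronized {
    if (!closed && sinks.isEmpty) { closed = true; true } else false
  }

  def isClosed: Boolean = lock.synchronized(closed)
}
```

In this model, an evictor that calls `closeIfEmpty()` after a racing `attach` has registered gets `false` and keeps the connection; if the evictor wins, the late `attach` throws and the caller can retry on a fresh connection.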

M2 — `rehomeClassic` leaks the dropped connection on partial failure

It nulled classicConn and retried without closing conn, which by then can be Live carrying successfully re-attached subs: duplicate deliveries on two live sockets, plus a leaked socket and watchdog.

Fix: SubscriptionConnection.shutdown() closes the socket + watchdog without terminating the manager-owned sinks (so the retry re-attaches them; transport.close() interrupts the reader, so a backpressured sink can't hang the join). rehomeClassic calls it before retrying.
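The difference between a full close and the new shutdown can be illustrated with a small model. Everything here (`SketchTransport`, `SketchSink`, `SketchSubConnection`) is a hypothetical stand-in invented for the example; the watchdog is omitted, and the real classes are certainly richer.

```scala
// Hypothetical stand-ins; the names and shapes are assumptions, not sage's API.
final class SketchTransport {
  @volatile private var open = true
  def close(): Unit = { open = false }
  def isOpen: Boolean = open
}

final class SketchSink {
  @volatile private var terminated = false
  def terminate(): Unit = { terminated = true }
  def isTerminated: Boolean = terminated
}

final class SketchSubConnection(transport: SketchTransport, sinks: List[SketchSink]) {

  /** Full close: drops the transport and ends every stream bound to it. */
  def close(): Unit = {
    transport.close()
    sinks.foreach(_.terminate())
  }

  /** Shutdown: drops the transport only. The sinks belong to the manager
    * and stay usable, so a retry can re-attach them elsewhere.
    */
  def shutdown(): Unit = transport.close()
}
```

Under this model, a re-home that fails part-way would call `shutdown()` on the old connection before retrying, so no second live socket keeps delivering to the same sinks.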

M3 — `fullyPlaced` summed per-node set sizes

A channel double-recorded on two nodes (possible because place computes its plan outside the reconcile single-flight, against a stale topology) inflated the sum, masking another channel that never landed — stranded with no retry.

Fix: count the distinct union of placed channels. The double-placement then self-heals on the next reconcile.
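A small worked example makes the miscount concrete. The ledger shape below (node id to set of channel names) and the `PlacementSketch` helpers are assumptions for illustration, not the actual Placement code.

```scala
// Hypothetical ledger model: node id -> channels recorded as placed on that node.
object PlacementSketch {
  type Ledger = Map[String, Set[String]]

  /** Pre-fix style count: sums per-node set sizes, so duplicates inflate it. */
  def placedBySum(ledger: Ledger): Int =
    ledger.valuesIterator.map(_.size).sum

  /** Post-fix style count: size of the distinct union of placed channels. */
  def placedDistinct(ledger: Ledger): Int =
    ledger.valuesIterator.foldLeft(Set.empty[String])(_ ++ _).size

  /** Coverage check based on the distinct count. */
  def fullyPlaced(wanted: Set[String], ledger: Ledger): Boolean =
    placedDistinct(ledger) == wanted.size
}
```

With `wanted = Set("ch1", "ch2", "ch3")` and a ledger of `Map("node-a" -> Set("ch1", "ch2"), "node-b" -> Set("ch1"))`, the sum is 3 and would match `wanted.size` even though `ch3` never landed, while the distinct count is 2 and leaves the placement incomplete so a retry can be scheduled.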

Contributing bug (also fixed): an error reply to SSUBSCRIBE (e.g. MOVED) was misrouted by onFrame as a bootstrap/PONG reply and silently swallowed, so an unconfirmed subscribe was recorded as placed. A non-bootstrap error frame now drops the connection (off the reader thread) so the manager re-homes (cluster) or reconnects (standalone). Note: this also changes standalone behavior — an unexpected error on a subscription socket, previously ignored, now triggers a reconnect; a post-bootstrap error there is always abnormal.
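The routing decision can be sketched as a pure function. This is a rough model only: the frame and action types, the `bootstrapPending` flag, and the way bootstrap replies are represented are all assumptions, and the sketch returns a decision rather than dropping anything, whereas the PR performs the drop off the reader thread.

```scala
// Hypothetical frame and action types; not sage's protocol model.
sealed trait SketchFrame
object SketchFrame {
  final case class Error(message: String)                 extends SketchFrame
  case object Pong                                        extends SketchFrame
  final case class Push(channel: String, payload: String) extends SketchFrame
}

sealed trait SketchAction
object SketchAction {
  final case class BootstrapReply(frame: SketchFrame)        extends SketchAction
  final case class DropConnection(reason: String)            extends SketchAction
  final case class Deliver(channel: String, payload: String) extends SketchAction
}

object FrameRouterSketch {
  import SketchFrame._
  import SketchAction._

  /** Decides what to do with an incoming frame. An error that arrives while
    * no bootstrap reply is pending is treated as abnormal and drops the
    * connection instead of being consumed as a bootstrap/PONG reply.
    */
  def route(frame: SketchFrame, bootstrapPending: Boolean): SketchAction =
    frame match {
      case e: Error if bootstrapPending => BootstrapReply(e)
      case Error(message)               => DropConnection(message)
      case Pong                         => BootstrapReply(Pong)
      case Push(channel, payload)       => Deliver(channel, payload)
    }
}
```

In this model a MOVED reply to a subscribe issued after bootstrap yields `DropConnection`, which is the signal for the owner to re-home or reconnect.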

Tests

PlacementSpec: fullyPlaced counts distinct channels, so a double-recorded one can't mask an unplaced channel.
SubscriptionConnectionSpec: closeIfEmpty keeps a connection holding a sink / closes once empty / rejects a racing attach; shutdown drops the socket but leaves the sink usable for re-attach; an unexpected error reply drops the connection instead of being swallowed.

All six client backend cells compile; all 141 sage.client.internal unit tests pass; scalafmt clean. The M3 and MOVED tests were confirmed to fail against the pre-fix code.

…e miscount

H3: shard-connection eviction snapshotted empty connections under the lock but closed them outside it, so a concurrent place/reconcile holding a stale connection reference could bind a sink in that window; the close then terminated the just-registered sink while awaitActive returned normally on Closed, recording a phantom placement with no retry. SubscriptionConnection gains closeIfEmpty, which decides emptiness and the Closed transition in one critical section: a racing attach either registers first (connection kept) or observes Closed and fails before registering, so eviction never terminates a sink an in-flight attach just bound.

M2: rehomeClassic nulled classicConn on partial failure without closing the connection, which could be Live carrying re-attached subs: duplicate deliveries on two live sockets plus a leaked socket and watchdog. It now calls a new shutdown that closes the socket/watchdog without terminating the manager-owned sinks, so the retry re-homes them onto a fresh connection.

M3: fullyPlaced summed per-node set sizes, so a channel double-recorded on two nodes masked another that never landed. It now counts the distinct union of placed channels.

A contributing bug: an error reply to SSUBSCRIBE (e.g. MOVED) was misrouted by onFrame as a bootstrap/PONG reply and swallowed, recording an unconfirmed subscribe as placed; an unexpected error frame now drops the connection so the manager re-homes or reconnects.

Regression tests in PlacementSpec and SubscriptionConnectionSpec.

ghostdogpr changed the title ~~Fix cluster pub/sub eviction races, classic re-home leak, and coverage miscount~~ Fix cluster pub/sub eviction and re-home races Jul 2, 2026

ghostdogpr force-pushed the fix/cluster-pubsub-eviction-races branch from 535851d to 4fe6ffa Compare July 2, 2026 09:17

ghostdogpr merged commit 9627f07 into main Jul 2, 2026
9 checks passed

ghostdogpr deleted the fix/cluster-pubsub-eviction-races branch July 2, 2026 09:21

ghostdogpr mentioned this pull request Jul 2, 2026

Give each subscription connection its own bootstrap waiter #156

Merged

ghostdogpr merged 1 commit into main from fix/cluster-pubsub-eviction-races
