Fix timed notification waits stranded when suspended#116
Open
swlynch99 wants to merge 2 commits into
Open
Conversation
… Clock A task that waits on a notification with a timeout and whose in-memory suspend timeout elapses first was suspended via `suspend_task_no_wakeup`, i.e. with no `wakeup_at`. Such a task had no timer fallback and could only be revived by a re-delivered notification; a single missed delivery stranded it indefinitely. This also surfaced under a custom `Clock` (DST) because the storage timers compared against the database wall clock rather than the injected clock. Changes: - notify.rs: the timed wait now durably records its absolute deadline (computed from the injected `Clock`) as a recorded event before the result, so it survives suspend/replay. On suspend it records that deadline as `wakeup_at` (via `suspend_task`) so the timer revives the task even if the notification is never re-delivered, and on replay it resolves the remaining time against the clock and returns `None` once the deadline has passed. - storage: thread the injected clock's `now()` into the heartbeat (`insert_worker`, `heartbeat_worker`, `delete_*_expired_worker*`) and suspend-wakeup (`wake_suspended_tasks`) queries, replacing `CURRENT_TIMESTAMP`/`NOW()` so a `DstClock` fully controls these timers. - worker leader: fix the inverted wakeup delay (`now - wakeup_at`, always zero for a future wakeup) to sleep until `wakeup_at + suspend_margin`. - task replay: advance `txn_index` on a cached `enter` hit, matching the database and record paths. Without this a second cached `enter` in one host call re-reads the same event; the timed-wait deadline event is the first operation to record two events across a suspend. - Add a DST regression test plus a workflow that times out, driving the clock past the deadline to prove timer-based wakeup fires under a custom clock with no notification delivered. Regenerate the sqlx offline query cache for the changed queries.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes a critical bug where timed notification waits could become stranded indefinitely when they suspend. The issue was that suspended tasks with timeouts were not recording a
wakeup_atdeadline, making them unable to be revived by the timer mechanism if their notification was never re-delivered.Key Changes
Notification timeout handling: Modified
notification_blocking_timeoutto durably record the absolute deadline (computed from the injectedClock) before suspending. This ensures:DstClockto drive wakeups)Worker clock integration: Updated all worker-related storage operations to accept and use the injected clock's current time instead of
CURRENT_TIMESTAMP:insert_worker: Seeds initial heartbeat from injected clockheartbeat_worker: Records heartbeats on injected clock timelinedelete_following_expired_workeranddelete_other_expired_workers: Measure heartbeat expiry against injected clockwake_suspended_tasks: Determines task wakeup eligibility using injected clockTimer wakeup calculation: Fixed the sleep duration calculation in the worker's main loop to correctly compute the delay until
wakeup_at + suspend_marginrather than using a negative duration that would collapse to zero and cause busy-looping.Replay event handling: Fixed
TaskState::enterto advance the transaction index when replaying cached events, preventing duplicate reads of the same event on subsequententercalls.Test coverage: Added comprehensive regression test
dst_notify_timeout_timer_wakeup_under_clockthat verifies:wakeup_atdeadlineNone(timeout) when revived by the timerImplementation Details
The core fix ensures that timed notification waits follow the same durable event pattern as other blocking operations: the deadline is recorded as a durable event before suspension, making it replay-safe and clock-driven. The worker's timer mechanism then uses this persisted deadline to revive suspended tasks, with all time comparisons resolved against the injected clock rather than the database wall clock.
https://claude.ai/code/session_013eQS3qJ3EbkkBifNTPwDP2