Skip to content

Fix timed notification waits stranded when suspended#116

Open
swlynch99 wants to merge 2 commits into
mainfrom
claude/beautiful-cerf-rafzlo
Open

Fix timed notification waits stranded when suspended#116
swlynch99 wants to merge 2 commits into
mainfrom
claude/beautiful-cerf-rafzlo

Conversation

@swlynch99

Copy link
Copy Markdown
Contributor

Summary

This PR fixes a critical bug where timed notification waits could become stranded indefinitely when they suspend. The issue was that suspended tasks with timeouts were not recording a wakeup_at deadline, making them unable to be revived by the timer mechanism if their notification was never re-delivered.

Key Changes

  • Notification timeout handling: Modified notification_blocking_timeout to durably record the absolute deadline (computed from the injected Clock) before suspending. This ensures:

    • The deadline is persisted and survives suspend/replay cycles
    • The deadline is resolved against the injected clock (enabling DstClock to drive wakeups)
    • Suspended tasks can be revived by the timer even without a re-delivered notification
  • Worker clock integration: Updated all worker-related storage operations to accept and use the injected clock's current time instead of CURRENT_TIMESTAMP:

    • insert_worker: Seeds initial heartbeat from injected clock
    • heartbeat_worker: Records heartbeats on injected clock timeline
    • delete_following_expired_worker and delete_other_expired_workers: Measure heartbeat expiry against injected clock
    • wake_suspended_tasks: Determines task wakeup eligibility using injected clock
  • Timer wakeup calculation: Fixed the sleep duration calculation in the worker's main loop to correctly compute the delay until wakeup_at + suspend_margin rather than using a negative duration that would collapse to zero and cause busy-looping.

  • Replay event handling: Fixed TaskState::enter to advance the transaction index when replaying cached events, preventing duplicate reads of the same event on subsequent enter calls.

  • Test coverage: Added comprehensive regression test dst_notify_timeout_timer_wakeup_under_clock that verifies:

    • Suspended tasks record a wakeup_at deadline
    • The runtime can wake tasks purely via clock advancement (without notifications)
    • Timed waits correctly return None (timeout) when revived by the timer

Implementation Details

The core fix ensures that timed notification waits follow the same durable event pattern as other blocking operations: the deadline is recorded as a durable event before suspension, making it replay-safe and clock-driven. The worker's timer mechanism then uses this persisted deadline to revive suspended tasks, with all time comparisons resolved against the injected clock rather than the database wall clock.

https://claude.ai/code/session_013eQS3qJ3EbkkBifNTPwDP2

claude added 2 commits June 12, 2026 23:56
… Clock

A task that waits on a notification with a timeout and whose in-memory
suspend timeout elapses first was suspended via `suspend_task_no_wakeup`,
i.e. with no `wakeup_at`. Such a task had no timer fallback and could only
be revived by a re-delivered notification; a single missed delivery
stranded it indefinitely. This also surfaced under a custom `Clock` (DST)
because the storage timers compared against the database wall clock rather
than the injected clock.

Changes:

- notify.rs: the timed wait now durably records its absolute deadline
  (computed from the injected `Clock`) as a recorded event before the
  result, so it survives suspend/replay. On suspend it records that
  deadline as `wakeup_at` (via `suspend_task`) so the timer revives the
  task even if the notification is never re-delivered, and on replay it
  resolves the remaining time against the clock and returns `None` once the
  deadline has passed.

- storage: thread the injected clock's `now()` into the heartbeat
  (`insert_worker`, `heartbeat_worker`, `delete_*_expired_worker*`) and
  suspend-wakeup (`wake_suspended_tasks`) queries, replacing
  `CURRENT_TIMESTAMP`/`NOW()` so a `DstClock` fully controls these timers.

- worker leader: fix the inverted wakeup delay (`now - wakeup_at`, always
  zero for a future wakeup) to sleep until `wakeup_at + suspend_margin`.

- task replay: advance `txn_index` on a cached `enter` hit, matching the
  database and record paths. Without this a second cached `enter` in one
  host call re-reads the same event; the timed-wait deadline event is the
  first operation to record two events across a suspend.

- Add a DST regression test plus a workflow that times out, driving the
  clock past the deadline to prove timer-based wakeup fires under a custom
  clock with no notification delivered.

Regenerate the sqlx offline query cache for the changed queries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants