[core] Turbo mode: fast-path the first invocation #2526
🦋 Changeset detected. Latest commit: 0cdbdfd. The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
c46f8a7 to 2962fd6
🧪 E2E Test Results: ❌ Some tests failed
Failed tests: ▲ Vercel Production (1 failed), express (1 failed).
Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
On the first delivery of a run's first invocation, background run_started, skip the initial event-log load, and force optimistic inline start so the run reaches its first steps with no preceding network round-trips. Safe because the first delivery has no concurrent handler to race the step create-claim; turbo exits the moment a suspension creates a hook or wait, and is a no-op for every other invocation. On by default; disable with WORKFLOW_TURBO=0. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
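The ordering described in this commit message can be sketched with plain promises. This is a minimal illustration, not the runtime's code: `firstDeliverySketch`, `write`, and the trace strings are hypothetical names, and the 5 ms timer only stands in for a network write.

```typescript
// Hypothetical sketch: none of these names come from the runtime's real code.
type EventName = "run_started" | "step_started" | "step_completed";

async function firstDeliverySketch(): Promise<string[]> {
  const trace: string[] = [];

  // Stand-in for a network write that takes a moment to commit.
  const write = (event: EventName): Promise<void> =>
    new Promise((resolve) => {
      setTimeout(() => {
        trace.push(`write:${event}`);
        resolve();
      }, 5);
    });

  // 1. Background run_started: start the write, but do not await it yet.
  const runReady = write("run_started");

  // 2. The step body starts immediately, before run_started is confirmed.
  trace.push("body:start");
  const bodyResult = Promise.resolve("result");

  // 3. step_started is chained on the barrier, so it never precedes run_started.
  const stepStarted = runReady.then(() => write("step_started"));

  // 4. step_completed is still awaited before the workflow would advance.
  await bodyResult;
  await stepStarted;
  await write("step_completed");
  return trace;
}
```

The body is recorded first, while the persisted writes keep the `run_started → step_started → step_completed` order.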
…rations of the same delivery after a hook/wait was already opened, because it is recomputed solely from the current batch's `suspensionResult` with no latch over the cumulative event log.
This commit fixes the issue reported at packages/core/src/runtime.ts:1534
## Bug
In `packages/core/src/runtime.ts` (~line 1534), `forceOptimisticStart` is computed **per replay-loop iteration** from the *current* batch only:
```ts
const forceOptimisticStart =
  turbo &&
  !suspensionResult.waitTimeout &&
  !suspensionResult.hasHookEvents &&
  !suspensionResult.hasAttributeEvents &&
  !suspensionResult.hasAwaitedHookCreation;
```
All `suspensionResult.*` flags reflect **only what the current suspension batch created**. Confirmed in `suspension-handler.ts`:
* `hasHookEvents: hookEvents.length > 0` where `hookEvents` is built from `hooksNeedingCreation = allHookItems.filter(item => !item.hasCreatedEvent)` — i.e. only hooks that do **not** yet have a `hook_created` event. A hook created in an *earlier* iteration already has its created event, so it is excluded and `hasHookEvents` is `false` on subsequent iterations.
`turbo` is computed once per delivery (~line 516), and the replay loop runs many iterations within a single delivery.
### Reachability
A fire-and-forget hook (`createHook('h')` not awaited) writes `hook_created` but does **not** block the workflow, so the replay loop continues to later pure-step suspensions in the **same delivery**:
* Iteration A: `hook_created` written → `hasHookEvents = true` → `forceOptimisticStart = false`. ✅
* Iteration B (a later pure-step suspension): the hook already has its created event, so it is not in `hooksNeedingCreation` → `hasHookEvents = false` → `forceOptimisticStart = true` again. ❌
This directly contradicts the invariant documented in the surrounding comment ("The moment a hook or wait ... is created ... the single-handler guarantee that makes forced optimistic start safe no longer holds — turbo exits"). Because the hook is open, a concurrent resume handler can be triggered and race the inline create-claim, and `forceOptimisticStart` overrides the user's `WORKFLOW_OPTIMISTIC_INLINE_START=0` kill switch (see `step-executor.ts:332`), risking double-execution of a non-idempotent step body.
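The two iterations above can be reproduced in isolation. This is a minimal sketch: the `HookItem` shape and the `hasHookEventsSketch` helper are hypothetical, and only the `!item.hasCreatedEvent` filter mirrors the code quoted from `suspension-handler.ts`.

```typescript
// Hypothetical shape; only the hasCreatedEvent filter mirrors the quoted code.
interface HookItem {
  token: string;
  hasCreatedEvent: boolean;
}

// Per-batch flag: true only when some hook still needs its hook_created event.
function hasHookEventsSketch(allHookItems: HookItem[]): boolean {
  const hooksNeedingCreation = allHookItems.filter((item) => !item.hasCreatedEvent);
  return hooksNeedingCreation.length > 0;
}

const turboOn = true;
const hook: HookItem = { token: "h", hasCreatedEvent: false };

// Iteration A: the hook has no hook_created event yet, so the batch creates one.
const hasHookEventsA = hasHookEventsSketch([hook]);
const forceOptimisticStartA = turboOn && !hasHookEventsA;

// The hook_created event is now persisted; the hook stays open.
hook.hasCreatedEvent = true;

// Iteration B: the same open hook is filtered out, so the batch-only gate reopens.
const hasHookEventsB = hasHookEventsSketch([hook]);
const forceOptimisticStartB = turboOn && !hasHookEventsB;
```

`forceOptimisticStartA` is `false` and `forceOptimisticStartB` is `true`, even though the hook is still open in iteration B.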
## Fix
Latch turbo off permanently once any hook or wait is open anywhere in the cumulative event log, using the existing `hasOpenHookOrWait` helper over `cachedEvents` (the cumulative replay log, set to `events` each iteration at ~line 1148):
```ts
const forceOptimisticStart =
  turbo &&
  !suspensionResult.waitTimeout &&
  !suspensionResult.hasHookEvents &&
  !suspensionResult.hasAttributeEvents &&
  !suspensionResult.hasAwaitedHookCreation &&
  !hasOpenHookOrWait(cachedEvents ?? []);
```
This mirrors the exact gate already applied to `requestInlineDelta` a few lines above (line 1522), where `!hasOpenHookOrWait(cachedEvents ?? [])` is used for the same "no out-of-band concurrent writer" safety reasoning. By keying off the cumulative log rather than the current batch, turbo now exits the moment a hook/wait exists and stays off for the remainder of the run, matching the documented single-handler guarantee.
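The effect of keying the gate off the cumulative log can be sketched as follows. The real `hasOpenHookOrWait` implementation is not shown in this PR, so `hasOpenHookOrWaitSketch` is a stand-in, and the event names `hook_disposed`, `wait_created`, and `wait_completed` are assumptions made for illustration.

```typescript
// Hypothetical event shape and event names (assumptions, not the real schema).
type ReplayEvent = { eventType: string; correlationId: string };

// A hook or wait counts as "open" once created and not yet closed in the log.
function hasOpenHookOrWaitSketch(events: ReplayEvent[]): boolean {
  const open = new Set<string>();
  for (const event of events) {
    if (event.eventType === "hook_created" || event.eventType === "wait_created") {
      open.add(event.correlationId);
    }
    if (event.eventType === "hook_disposed" || event.eventType === "wait_completed") {
      open.delete(event.correlationId);
    }
  }
  return open.size > 0;
}

// The batch flag alone would reopen the gate; the cumulative check latches it shut.
function forceOptimisticStartSketch(
  turbo: boolean,
  batchHasHookEvents: boolean,
  cachedEvents: ReplayEvent[] | undefined,
): boolean {
  return turbo && !batchHasHookEvents && !hasOpenHookOrWaitSketch(cachedEvents ?? []);
}

const logWithOpenHook: ReplayEvent[] = [
  { eventType: "hook_created", correlationId: "hook_1" },
];
```

With `logWithOpenHook`, the gate stays closed on a later pure-step batch (`batchHasHookEvents === false`), which is the latch behavior described above.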
Co-authored-by: Vercel <vercel[bot]@users.noreply.github.com>
Co-authored-by: VaguelySerious <[email protected]>
e1d4449 to d446c4d
📊 Benchmark Results (result tables not preserved). Each benchmark ran on 💻 Local Development and ▲ Production (Vercel), with Express observability links:
- workflow with no steps; workflow with 1 step; workflow with 10, 25, and 50 sequential steps
- Promise.all and Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB); workflow with 10, 25, and 50 concurrent data payload steps (10KB)
- Stream benchmarks (including TTFB metrics): workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)
Summary: Fastest Framework by World and Fastest World by Framework (winner determined by most benchmark wins).
❌ Some benchmark jobs failed. Check the workflow run for details.
…urbo mode the first delivery synthesizes `startedAt` from the runtime-local clock while later non-turbo deliveries load the server-canonical `startedAt`, so replay regenerates different correlation IDs and throws `ReplayDivergenceError`.
This commit fixes the issue reported at packages/core/src/runtime.ts:763
## The bug
In `packages/core/src/workflow.ts` (`runWorkflow`), the workflow orchestrator context exposed:
```ts
generateUlid: () => ulid(+startedAt),
```
where `startedAt = workflowRun.startedAt`. Every durable correlation ID is derived from this:
- `step.ts:25` → `step_${ctx.generateUlid()}`
- `workflow/hook.ts:73` → `hook_${ctx.generateUlid()}`
- `workflow/sleep.ts:18` → `wait_${ctx.generateUlid()}`
- `workflow/attribute-dispatcher.ts:20` → `attr_${ctx.generateUlid()}`
The 48-bit time prefix of every correlation ID therefore equals `+startedAt`. For replay to succeed, the value fed to `ulid()` **must be identical on every delivery** — otherwise `EventsConsumer.onUnconsumedEvent` fires and rejects with `ReplayDivergenceError`.
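To make the time-prefix dependency concrete, here is a minimal sketch of how a ULID encodes its 48-bit timestamp as ten Crockford base32 characters. `encodeTimePrefix` is a hypothetical stand-alone helper written for illustration, not the `ulid` package's API.

```typescript
// Crockford base32 alphabet used by ULIDs (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode an epoch-millisecond timestamp as the 10-character ULID time prefix.
// Plain arithmetic is used because 48-bit values exceed 32-bit bitwise ops.
function encodeTimePrefix(epochMs: number): string {
  let remaining = epochMs;
  let prefix = "";
  for (let i = 0; i < 10; i++) {
    const digit = remaining % 32;
    prefix = CROCKFORD.charAt(digit) + prefix;
    remaining = (remaining - digit) / 32;
  }
  return prefix;
}

// Two millisecond-resolution clocks that disagree by 1 ms.
const localNow = Date.UTC(2024, 0, 1); // 1704067200000
const serverNow = localNow + 1;

const localPrefix = encodeTimePrefix(localNow); // "01HK153X00"
const serverPrefix = encodeTimePrefix(serverNow); // "01HK153X01"
```

A 1 ms disagreement already changes the prefix, so every ID derived from the second clock fails to match the IDs persisted under the first.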
## Why turbo breaks it
`startedAt` is **not** replay-stable under turbo:
- **Turbo first delivery** (`runtime.ts` ~L753): the run is synthesized locally with `startedAt: now`, where `now = new Date()` is the runtime-local clock. The first delivery's `generateUlid` thus encodes the local `now`, and any `step_started`/`wait`/`hook_created` events persisted in this delivery carry correlation IDs encoding that local `now`.
- **Backend persistence**: the backgrounded `run_started` write records the storage layer's own clock as the canonical `startedAt` (`world-local events-storage` uses `currentRun.startedAt ?? now`), which differs from the runtime's local `now`.
- **Next (non-turbo) delivery**: the run is loaded from the backend with the server-canonical `startedAt`. `generateUlid()` now produces ULIDs with a different time prefix, so the regenerated correlation IDs no longer match the persisted ones → `ReplayDivergenceError`.
The divergence only requires a ≥1 ms difference between the two ms-resolution clocks, so it is intermittent but real — and turbo is on by default.
This was already a known hazard: the RNG `seed` and the VM clock `fixedTimestamp` were *deliberately* decoupled from `startedAt`/`createdAt` (see the comment "Dropping the timestamp means the seed no longer depends on startedAt/createdAt, so it ... can be computed before any server round-trip"). `generateUlid` was simply missed in that refactor.
## The fix
Feed `generateUlid` the same replay-stable value already used for the seed and VM clock:
```ts
generateUlid: () => ulid(fixedTimestamp),
```
where `fixedTimestamp = runIdCreatedAt(workflowRun.runId) ?? +workflowRun.createdAt`. Production run IDs are always `wrun_<ulid>` (minted client-side in `start()`), so `runIdCreatedAt` recovers the same epoch-ms value the instant the queue message arrives — identical on turbo and non-turbo deliveries alike. Correlation IDs become replay-stable in all delivery paths.
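The recovery of a replay-stable timestamp from the run ID can be sketched like this. The real `runIdCreatedAt` is not shown in this PR; `runIdCreatedAtSketch` and `fixedTimestampSketch` are hypothetical stand-ins that assume the `wrun_<ulid>` format described above.

```typescript
const ULID_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";
const RUN_ID_PATTERN = /^wrun_[0-9A-HJKMNP-TV-Z]{26}$/;

// Decode the 10-character time prefix of a wrun_<ulid> run ID to epoch ms.
// Returns undefined when the run ID does not embed a ULID.
function runIdCreatedAtSketch(runId: string): number | undefined {
  if (!RUN_ID_PATTERN.test(runId)) return undefined;
  const timeChars = runId.slice("wrun_".length, "wrun_".length + 10);
  let epochMs = 0;
  for (let i = 0; i < timeChars.length; i++) {
    epochMs = epochMs * 32 + ULID_ALPHABET.indexOf(timeChars.charAt(i));
  }
  return epochMs;
}

// Same fallback shape as the fix: run-ID time first, createdAt otherwise.
function fixedTimestampSketch(run: { runId: string; createdAt: Date }): number {
  return runIdCreatedAtSketch(run.runId) ?? +run.createdAt;
}

// A ULID run ID yields the same value on every delivery, whatever startedAt says.
const ulidRun = { runId: "wrun_01HK153X000000000000000000", createdAt: new Date(0) };
// A non-ULID fixture ID falls back to createdAt.
const fixtureRun = { runId: "wrun_test", createdAt: new Date(1234) };
```

Because the value depends only on the run ID (or `createdAt` for non-ULID fixtures), neither the runtime-local clock nor the storage layer's clock can influence it.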
`workflowStartedAt` (line 296, a user-facing `Date` exposed to workflow code) intentionally keeps using `startedAt` — it is not a correlation ID and is not part of replay matching.
## Test compatibility
The two integration tests that compute expected correlation IDs through the real `runWorkflow` path use non-ULID run IDs (`wrun_stale_wait_replay`, `wrun_test`). For those, `runIdCreatedAt` returns `undefined` and `fixedTimestamp` falls back to `+createdAt`, which equals `+startedAt` in those fixtures — so `ulid(fixedTimestamp)` yields the same IDs as before and the assertions still hold. The unit-test fixtures that hand-build their own `generateUlid: () => ulid(workflowStartedAt)` do not go through `runWorkflow` and are unaffected.
Co-authored-by: Vercel <vercel[bot]@users.noreply.github.com>
Co-authored-by: VaguelySerious <[email protected]>
…ch step_completed Turbo overlaps start round-trips with step bodies but still awaits each step_completed before advancing. Documents the considered "run-ahead" extension (defer step writes to a background queue, run sequential steps ahead) and why it was not pursued: crash re-execution blast radius, and divergent branches when a step runs against a non-durable result a redelivery can re-decide. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… IDs The generateUlid fix (correlation IDs keyed on the replay-stable fixedTimestamp = runIdCreatedAt(runId) instead of startedAt) changes the ULID time prefix for fixtures whose ULID runId encodes a different time than their startedAt. This race-replay fixture used a 2025 ULID runId with a 2024 startedAt, so its step_ correlation ID prefixes move from the startedAt-derived 01HK153X00 to the runId-derived 01K75533W5 (suffixes, seed-derived, are unchanged). Realigns the fixture; no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…tered-step barrier, docs)
- forceOptimisticStart now defers to an explicit WORKFLOW_OPTIMISTIC_INLINE_START=0:
turbo still forces optimistic start when the flag is unset, but an operator's
explicit opt-out (the "body runs before start is confirmed" property) wins.
Adds isOptimisticInlineStartExplicitlyDisabled().
- Gate the unregistered-step ("step not found") lazy step_started on
runReadyBarrier so it never precedes the backgrounded run_started under turbo.
- Document that the forced-optimistic first step body's stream/ops writes run
before run_started (stream-safety caveat + WORKFLOW_TURBO=0), and that a run
cancelled/expired before its first delivery still runs the first step body
(reconciled away) under turbo.
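
The first item's precedence rule can be sketched as below. The body of `isOptimisticInlineStartExplicitlyDisabled()` is not shown in this PR, so `isExplicitlyDisabledSketch` is a guess that treats only the literal value `0` as an explicit opt-out; the real helper may accept other spellings.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: only the literal "0" counts as an explicit opt-out.
function isExplicitlyDisabledSketch(env: Env): boolean {
  return env["WORKFLOW_OPTIMISTIC_INLINE_START"] === "0";
}

// Turbo forces optimistic start when the flag is unset,
// but an operator's explicit opt-out wins.
function shouldForceOptimisticStartSketch(turboGateOpen: boolean, env: Env): boolean {
  return turboGateOpen && !isExplicitlyDisabledSketch(env);
}
```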
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The two turbo tests that wait for the inline step body to run before releasing the run-ready barrier used vi.waitFor's 1s default, which the full VM replay can exceed on cold Windows CI (intermittent "expected [] to include 'body'"). Bump to 15s, matching the existing queue_dispatch_start waitFor in the same suite. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…dge)
Turbo's immediate re-invoke exits returned `{ timeoutSeconds: 0 }`, which makes
the queue reschedule the CURRENT delivery's message. That message carries
`runInput`, and on async queues (graphile-worker / world-vercel) a reschedule
comes back as delivery attempt 1 — so turbo re-engaged, skipped the event-log
load again, replayed against an empty log, never observed the hook/attr event it
had just written, re-suspended, and rescheduled forever. The run wedged (every
hook + experimental_setAttributes e2e test timed out on world-postgres and
world-vercel; world-local's reschedule increments the attempt, so it was unaffected).
Turbo now re-invokes via an explicit continuation that carries NO `runInput`
(`reinvoke()`), so the next delivery is a normal non-turbo load-and-replay that
observes the committed events and makes progress. Applies to the hasHookConflict,
hasAttributeEvents, hasAwaitedHookCreation, and throttle re-invoke exits.
Verified against world-postgres: hook.getConflict() + experimental_setAttributes
workflows that previously wedged now complete.
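
The wedge and its fix can be sketched with a toy message shape. All names here (`QueueMessage`, `isTurboEligibleSketch`, `rescheduleSameMessage`, `reinvokeSketch`) are hypothetical, and the eligibility rule is an assumption inferred from this commit message: turbo engages when a delivery is attempt 1 and still carries `runInput`.

```typescript
// Hypothetical message shape; the real queue payload is not shown in this PR.
interface QueueMessage {
  runId: string;
  attempt: number;
  runInput?: { input: unknown };
}

// Assumed eligibility rule: first attempt and the message still carries runInput.
function isTurboEligibleSketch(message: QueueMessage): boolean {
  return message.attempt === 1 && message.runInput !== undefined;
}

// Old exit: { timeoutSeconds: 0 } reschedules the same message. On async queues
// it comes back as attempt 1 with runInput intact, so turbo re-engages.
function rescheduleSameMessage(message: QueueMessage): QueueMessage {
  return { ...message, attempt: 1 };
}

// New exit: an explicit continuation that carries no runInput.
function reinvokeSketch(message: QueueMessage): QueueMessage {
  return { runId: message.runId, attempt: 1 };
}

const firstDelivery: QueueMessage = {
  runId: "wrun_example",
  attempt: 1,
  runInput: { input: null },
};
```

`isTurboEligibleSketch(rescheduleSameMessage(firstDelivery))` stays `true` indefinitely, which is the wedge; `isTurboEligibleSketch(reinvokeSketch(firstDelivery))` is `false`, so the next delivery loads the event log and replays normally.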
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
executionContext: runInput.executionContext,
input: runInput.input,
attributes: runInput.attributes ?? {},
startedAt: now,
Follow-up to the (now-resolved) correlation-ID divergence: the replay-matching fix is solid. `seed`/`fixedTimestamp`/`generateUlid` are all runId-derived now, so step/wait/hook IDs are stable across deliveries. ✅
But one residual remains from this synthesized `startedAt: now`. It still flows into the orchestrator-visible `getWorkflowMetadata().workflowStartedAt` (`workflow.ts:296`, `new Date(+startedAt)` → `WORKFLOW_CONTEXT_SYMBOL`), and `getWorkflowMetadata()` is replayed user code. On the turbo first delivery that value is the local clock; on any later (non-turbo) delivery it's the server-canonical `startedAt`. A workflow that branches on it (e.g. `if (Date.now() - +meta.workflowStartedAt > THRESHOLD) …`, where `Date.now()` is now `fixedTimestamp`-stable but `workflowStartedAt` is not) can therefore take a different path on resume → `ReplayDivergenceError`.
Much narrower than the original bug (only workflows reading `workflowStartedAt` in replayed control flow), but it means the "replay is fully decoupled from startedAt" framing isn't quite complete: the user-facing value is still delivery-dependent. Worth either deriving the synthesized `workflowStartedAt` from the same replay-stable `runIdCreatedAt(runId)` value, or adding a one-line caveat that `workflowStartedAt` may differ by the start→first-delivery latency on the first invocation and shouldn't drive replayed branching.
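A minimal sketch of the residual hazard, using made-up numbers: `takesSlowPath` is a hypothetical stand-in for user workflow code, and the timestamps are illustrative only.

```typescript
// Hypothetical user-code branch that reads workflowStartedAt in replayed control flow.
function takesSlowPath(
  fixedNowMs: number, // Date.now() inside the workflow VM, replay-stable
  workflowStartedAtMs: number, // delivery-dependent under turbo
  thresholdMs: number,
): boolean {
  return fixedNowMs - workflowStartedAtMs > thresholdMs;
}

const fixedNow = 10_000;
const threshold = 1_005;

// Turbo first delivery: startedAt synthesized from the runtime-local clock.
const firstDeliveryBranch = takesSlowPath(fixedNow, 9_000, threshold);

// Later non-turbo delivery: server-canonical startedAt, 10 ms earlier here.
const laterDeliveryBranch = takesSlowPath(fixedNow, 8_990, threshold);
```

The first delivery evaluates `1000 > 1005` and the later one `1010 > 1005`, so the same workflow code takes different branches on replay.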
Pushed 0cdbdfdb8 to address this so it is not left silently open:
- Doc caveat in the turbo-mode changelog (`### workflowStartedAt reflects the first delivery's clock`): explicitly says to treat it as an approximate human-facing timestamp and not branch replayed control flow on it.
- Regression test in `workflow.test.ts` proving step correlation IDs are regenerated from the run-ID-derived `fixedTimestamp` (not `startedAt`) and stay stable across deliveries. Verified it fails (`ReplayDivergenceError`) when `generateUlid` is reverted to `ulid(+startedAt)`.
Left the deeper code fix — deriving the orchestrator-visible workflowStartedAt from fixedTimestamp so the value itself is replay-stable rather than just documented — as your call, since it changes the public getWorkflowMetadata().workflowStartedAt semantics for non-turbo runs too (it would become run-creation time instead of run_started time).
Add a workflow.test.ts regression that replays a recorded step under a startedAt that diverges from createdAt, proving step correlation IDs are regenerated from the run-ID-derived fixedTimestamp (not startedAt) and so stay stable across deliveries. Reverting generateUlid to ulid(+startedAt) fails this test. Document in the turbo-mode changelog that getWorkflowMetadata().workflowStartedAt reflects the first delivery's clock under turbo (local on the first delivery, server-canonical on later ones) and must not drive replayed control flow. Co-Authored-By: Claude Opus 4.8 <[email protected]>
Adds turbo mode (on by default) to fast-path the very first delivery of a run's first invocation — where time-to-first-step matters most. Stacked on #2516.
On that first delivery the runtime:
- Backgrounds `run_started`: synthesizes the run entity locally from the queued run input so replay begins immediately; the round-trip overlaps replay (reuses the resilient-start create-on-the-fly contract).
- Forces `WORKFLOW_OPTIMISTIC_INLINE_START`: the step body runs immediately; only the `step_started` write waits on the backgrounded `run_started`.

Net effect: the first step body starts after just the in-process replay, with `run_started`/`step_started` happening around it and no `events.list` before it.

## Why it's safe
- `attempt === 1` is the first delivery (plus: not a background-step or recovery replay). No new message field, no world/backend change.
- `step_started` is chained on the backgrounded `run_started` (body still runs immediately), the suspension handler awaits it before any eager write, and terminal run writes await it too. The log stays `run_created → run_started → step_created → step_started → step_completed`.
- …`WORKFLOW_TURBO=0`).

## Config
On by default; `WORKFLOW_TURBO=0`/`false` disables it (kill-switch for non-idempotent/stream-unsafe first-step bodies).

## Docs
New changelog page `docs/content/docs/v5/changelog/turbo-mode.md` (+ meta). Preview links to follow once the docs deployment is up.
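
For reference, the kill-switch behavior described under Config can be sketched as an environment check. `isTurboEnabledSketch` is a hypothetical helper; the runtime's actual parsing (for example case handling or other accepted spellings) is not shown in this PR.

```typescript
// Assumption: only the exact strings "0" and "false" disable turbo.
function isTurboEnabledSketch(env: Record<string, string | undefined>): boolean {
  const raw = env["WORKFLOW_TURBO"];
  return raw !== "0" && raw !== "false";
}
```

Under this sketch an unset variable leaves turbo on, matching the on-by-default behavior.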