feat(recovery): start the World at server boot to recover in-flight runs #2544
pranaygp wants to merge 10 commits into
Self-hosted worlds (local, postgres) run boot-time recovery (`reenqueueActiveRuns`) inside `world.start()`, but nothing called it at server startup, so a process that restarted mid-flight never resumed its in-flight runs without a workflow operation.

- core: add idempotent `ensureWorldStarted()` (once-per-process), exported from `@workflow/core/runtime` and `workflow/runtime`.
- world-vercel: add a no-op `start()` for interface compliance (push-based; VQS redelivers, no boot recovery needed). Document the `start()` contract.
- framework startup wiring (un-gated; no-op on Vercel): Next workbench `instrumentation.ts`, a Nitro server plugin (covers express/hono/fastify/nuxt), un-gate SvelteKit `init` + Nest `bootstrap`, Astro middleware.
- test: kill/restart e2e proving an in-flight sleeping run resumes after a hard restart with no workflow op; fails if startup wiring is removed. Covers local + postgres. New `e2e-restart-recovery` CI job.
- docs: deploying/recovering-in-flight-runs (v4 + v5).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
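To make the failure mode concrete, here is a minimal sketch of the idea, not the actual implementation. The `Run` and `World` shapes, the `InMemoryWorld` class, and passing the world as an argument are all assumptions made for illustration (the real helper is described as resolving the world itself). It shows that recovery only happens once something calls `start()`, and that a once-per-process guard makes repeated calls safe.

```typescript
// Hypothetical, simplified shapes; the real World interface is larger.
type RunStatus = "pending" | "running" | "completed";

interface Run {
  id: string;
  status: RunStatus;
}

interface World {
  // Optional: a push-based world may omit start() or make it a no-op.
  start?(): Promise<void>;
}

class InMemoryWorld implements World {
  readonly queue: string[] = [];
  private readonly runs: Run[];

  constructor(runs: Run[]) {
    this.runs = runs;
  }

  // Boot-time recovery: put every run that was still active back on the queue.
  private reenqueueActiveRuns(): void {
    for (const run of this.runs) {
      if (run.status === "pending" || run.status === "running") {
        this.queue.push(run.id);
      }
    }
  }

  async start(): Promise<void> {
    this.reenqueueActiveRuns();
  }
}

// Once-per-process guard: the first call starts the World,
// later calls reuse the same promise instead of starting it again.
let startPromise: Promise<void> | undefined;

function ensureWorldStarted(world: World): Promise<void> {
  startPromise ??= world.start?.() ?? Promise.resolve();
  return startPromise;
}

// State as it might look right after a restart: two runs were still in flight.
const world = new InMemoryWorld([
  { id: "run_a", status: "running" },
  { id: "run_b", status: "completed" },
  { id: "run_c", status: "pending" },
]);

const first = ensureWorldStarted(world);
const second = ensureWorldStarted(world); // no second start, same promise
```

After the two calls, `world.queue` holds `run_a` and `run_c` exactly once each; without any call to `start()` the queue would stay empty, which is the stuck-run behavior this PR addresses.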
🦋 Changeset detected
Latest commit: aadabdd
The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
📊 Benchmark Results
[Result tables were not preserved. Benchmarks covered, each on 💻 Local Development and ▲ Production (Vercel), with observability links for Nitro and Next.js (Turbopack):]
- workflow with no steps; workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
- Stream benchmarks (include TTFB metrics): workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)
❌ Some benchmark jobs failed. Check the workflow run for details.
🧪 E2E Test Results
❌ Some tests failed.

Failed tests:
- 📦 Local Production: 127 failed (nuxt-stable: 127 failed)
- 🐘 Local Postgres: 126 failed (nuxt-stable: 126 failed)

Details by category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ❌ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
You can call this regardless of which World you target. On the [Vercel World](/docs/deploying/world/vercel-world) it is a no-op: delivery is push-based and the queue redelivers in-flight messages on its own, so there is no long-lived process to recover.

## Wiring it per framework
We should also have these be an optional accordion/compressed setup step mentioned in each framework's getting-started guide. The step should state that this is not required for Vercel deployments (serverless/push-based queue worlds) but is required for pull-based/worker-based Workflow SDK deployments, and it can link to this docs page for details.
same for v4 and v5 docs
Done in 6116307 — added an optional, collapsed accordion ("Recover in-flight runs after a restart") to each framework's getting-started guide (Next/Nitro/Express/Hono/Fastify/Nuxt/Vite/TanStack Start/SvelteKit/Nest/Astro, v4 + v5). Each shows the framework's startup snippet, notes it is not required for Vercel deployments, and links to the full Recovering in-flight runs page.
  {
    "title": "Deploying",
-   "pages": ["...deploying", "building-a-world"]
+   "pages": ["...deploying", "building-a-world", "recovering-in-flight-runs"]
put this before "building a world"
same for v4 and v5
Done in 6116307 — moved recovering-in-flight-runs before building-a-world in the Deploying nav (v4 + v5).
CI surfaced two failures from un-gating `world.start()` at server boot:
- world-local: `initDataDir` ran `parseVersion()` on the `bundled` sentinel that `getPackageInfo()` returns in framework server bundles, throwing `Invalid version string: "bundled"` and crashing startup (500s on sveltekit/astro/vite/etc.). Skip version-compat enforcement when the version is `bundled`.
- nitro: the startup plugin's bare static `import "@workflow/core/runtime"` couldn't be resolved by Rollup/Vite, breaking every Nitro(+Vite) build (503s on nitro/express/hono/fastify/tanstack-start). Resolve the runtime to a `file://` URL at build time and dynamic-import it (vite-/webpack-ignored), mirroring the dashboard handler; gate the plugin to non-Vercel builds (the Vercel World's start() is a no-op and the file:// path wouldn't resolve in a serverless function).

Verified: express (Nitro, bundled build) boots and serves the manifest 200 with no version/world-start errors.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
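The world-local fix can be pictured roughly as follows. This is a sketch under assumptions, not the code from the commit: `parseVersion` and `isCompatible` here are simplified stand-ins, and the "same major version" rule is invented purely to have something to enforce. The point is only the early return for the `bundled` sentinel, which is not a parseable version.

```typescript
// Sentinel reported in bundled server builds, as described in the commit message.
const BUNDLED = "bundled";

// Simplified stand-in for a version parser: accepts "MAJOR.MINOR.PATCH" prefixes.
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Invalid version string: "${version}"`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Hypothetical compatibility check (assumed rule: major versions must match).
function isCompatible(dataDirVersion: string, packageVersion: string): boolean {
  // A bundled build has no real version to compare, so skip enforcement
  // instead of feeding the sentinel to parseVersion() and crashing startup.
  if (packageVersion === BUNDLED) {
    return true;
  }
  const [dataMajor] = parseVersion(dataDirVersion);
  const [packageMajor] = parseVersion(packageVersion);
  return dataMajor === packageMajor;
}
```

With this shape, `isCompatible("4.1.0", "bundled")` returns `true` without parsing, while calling `parseVersion("bundled")` directly still throws the `Invalid version string` error seen in CI.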
… order
- Move "Recovering in-flight runs" before "Building a World" in the Deploying
nav (v4 + v5).
- Add an optional, collapsed accordion ("Recover in-flight runs after a
restart") to each framework's getting-started guide (Next/Nitro/Express/
Hono/Fastify/Nuxt/Vite/TanStack Start/SvelteKit/Nest/Astro, v4 + v5) with the
framework-specific startup snippet, noting it is NOT required for Vercel
deployments and linking to the full Recovering in-flight runs page.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
## Deploying to Production

<Accordion type="single" collapsible>
This isn't the right place. Actually, let's not add anything to each individual framework's getting-started guide (the "recovering runs" section is good enough imo). Instead, let's just ensure that the postgres world and local world docs mention that they need the world to be started at instantiation, with the right framework-wide docs (just like we did in recovering in-flight runs) along with the code snippet, since only those worlds need it.
Done in 54b5b2e. Reverted the per-framework getting-started accordions, and instead added a "Starting the World" note to the local-world and postgres-world docs (only the self-hosted worlds need it): each states the World must be started at server boot, shows the ensureWorldStarted() snippet, and links to Recovering in-flight runs for per-framework wiring. Also consolidated postgres-world's prior per-framework world.start() tabs into that single snippet + link (now using the idempotent helper).
Per review: revert the per-framework getting-started accordions (the Recovering in-flight runs page is enough) and instead document the startup requirement on the worlds that actually need it.
- local-world / postgres-world: add a "Starting the World" note that the World must be started at server boot, with the `ensureWorldStarted()` snippet and a link to Recovering in-flight runs for per-framework wiring.
- postgres-world: consolidate the prior per-framework `world.start()` tabs into that single snippet + link (DRY; uses the idempotent helper).
- Remove the getting-started accordions added in the previous commit.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Re-introduce the dropped auto startup plugin so self-hosted Nitro apps (Nitro v2/v3, Nuxt, Express/Hono/Fastify on Nitro) recover in-flight runs after a restart with no manual wiring.

The previously-reverted version imported the runtime via a build-time `file://` URL, which collided with the bundled flow handler's copy of the same file (CJS/ESM dual-load -> ERR_INTERNAL_ASSERTION, 500ing the flow route). This version instead emits a real plugin file in the build dir that imports `workflow/runtime` via a *bare* dynamic import, mirroring a hand-written Nitro plugin, so the bundler resolves and dedupes it with the flow handler's runtime. `ensureWorldStarted()` caches its start promise on `globalThis`, so the World starts exactly once. Gated off Vercel deploys (the Vercel World's start() is a no-op).

Removes the manual `start-pg-world.ts` workbench workaround and updates the recovering-in-flight-runs docs to note Nitro starts the World automatically.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
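Why cache on `globalThis` rather than in a module-level variable? A module-level variable is private to each physical copy of the module, so if the runtime ever ends up loaded twice, each copy would start the World on its own. The sketch below simulates two copies with a factory function; the key name, `startWorld`, and `makeRuntimeCopy` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical cache key; the key used by the real helper is not shown in this PR.
const START_KEY = "__exampleWorldStartPromise";

let startCalls = 0;

// Stand-in for the real world start; counts how often it actually runs.
async function startWorld(): Promise<void> {
  startCalls += 1;
}

// Simulates one bundled copy of the runtime module. Each copy has its own
// scope, but every copy sees the same globalThis.
function makeRuntimeCopy() {
  return {
    ensureWorldStarted(): Promise<void> {
      const cache = globalThis as unknown as Record<string, Promise<void> | undefined>;
      // Reuse the shared promise if any copy has already started the World.
      return (cache[START_KEY] ??= startWorld());
    },
  };
}

const bootPlugin = makeRuntimeCopy();
const flowHandler = makeRuntimeCopy();

const fromPlugin = bootPlugin.ensureWorldStarted();
const fromHandler = flowHandler.ensureWorldStarted();
```

Both calls return the same promise and `startCalls` ends at 1, even though the two "copies" share no module state.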
…-sleep

The existing case kills during a workflow `sleep()` (a delayed, unlocked queue job). This adds a case that kills while a STEP is executing, so the step's queue job is held/locked at crash time (the postgres graphile-worker case from #679).

Adds `longStepWorkflow` (one ~12s step) and a second restart-recovery test that kills mid-step and asserts the run resumes after a restart with no workflow op.

Verified locally on both the local and postgres worlds: the run recovers (reenqueueActiveRuns re-drives the flow; the replayed step is re-dispatched and graphile schedules a fresh run rather than stalling on the locked job).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
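The "asserts the run resumes with no workflow op" part implies the test only reads run state after the restart. One way to write such a read-only assertion is a polling helper like the one below. This is a sketch of the general technique, not the helper used in `restart-recovery.test.ts`; `waitForCompletion`, its parameters, and the status names are assumptions.

```typescript
type RunStatus = "pending" | "running" | "completed" | "failed";

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls a read-only status source until the run reaches a terminal state,
// or throws once the timeout has elapsed. It never writes or enqueues anything,
// so it cannot itself trigger recovery.
async function waitForCompletion(
  readStatus: () => Promise<RunStatus>,
  timeoutMs: number,
  intervalMs = 50,
): Promise<RunStatus> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await readStatus();
    if (status === "completed" || status === "failed") {
      return status;
    }
    if (Date.now() >= deadline) {
      throw new Error(`run still "${status}" after ${timeoutMs}ms`);
    }
    await sleep(intervalMs);
  }
}
```

If startup wiring is missing, a helper like this would keep reading `running` until the timeout and fail, which matches the "fails if the wiring is removed" property the PR describes.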
The postgres-world "Starting the World" section listed "a Nitro server plugin" as a place to call ensureWorldStarted(), which is now misleading: @workflow/nitro starts the World automatically. Clarify that Nitro/Nuxt apps need no manual call (consistent with the recovering-in-flight-runs page and local-world docs).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
…s it)

`@workflow/nuxt` registers `@workflow/nitro`, which now auto-starts the World at boot. The pre-existing `server/plugins/start-pg-world.ts` called `world.start()` directly (bypassing the `ensureWorldStarted` once-guard), so with the auto-plugin nuxt+postgres started the World twice (double `queue.start()`). Remove it; boot recovery is handled automatically.

Verified: nuxt dev (local world) starts the World at boot via the auto-plugin with no manual plugin present.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Closes #679
Related: #1531 (closed; same root cause)
Context
Self-hosted worlds (`local`, `postgres`) run boot-time recovery (`reenqueueActiveRuns()` re-enqueues `pending`/`running` runs), but it lives only inside `world.start()`, and nothing called `world.start()` at server startup. So a self-hosted server that restarted while a run was in flight (sleeping, waiting on a hook, between steps) never resumed that run without a subsequent workflow operation. The Vercel World didn't even implement `start()`.

This wires `world.start()` to run once at server boot across the framework integrations, makes the Vercel World's `start()` an explicit no-op (push-based: VQS redelivers, no boot recovery needed), and adds an e2e test that proves recovery happens on startup with no workflow operation, and fails if the wiring is removed.

Changes
- `@workflow/core`: add idempotent `ensureWorldStarted()` (once-per-process guard; `getWorld()` → `world.start?.()`), exported from `@workflow/core/runtime` and `workflow/runtime`.
- `@workflow/world-vercel`: add a no-op `async start()` for interface compliance; expand the `World.start()` contract doc (must be idempotent; may be a no-op for push-based worlds).
- Next workbench: `instrumentation.ts` calls `ensureWorldStarted()` from `workflow/runtime`, guarded by `NEXT_RUNTIME === 'nodejs'`.
- Nitro: `@workflow/nitro` auto-registers a Nitro server plugin that runs at app boot, so no manual wiring is required. (See the "Nitro auto-start" update below for the implementation detail.)
- SvelteKit `init` + Nest `bootstrap`: un-gate the `world.start()` calls (were `@workflow/world-postgres`-only, so local never recovered) and route through `ensureWorldStarted()`.
- Astro: `src/middleware.ts` once-guard (Astro has no all-adapter startup hook).
- Test: `packages/core/e2e/restart-recovery.test.ts` starts a sleeping run server-side, hard-kills the server mid-sleep, restarts it, and asserts the run completes with no workflow op. Pure-reader test process; robust process-group kill. Gated by `RESTART_RECOVERY_TEST=1`; covers local + postgres on nextjs-turbopack.
- CI: `e2e-restart-recovery` job (matrix local + postgres; owns the server lifecycle, so it does not pre-start the server) added to the summary + `e2e-required-check` gates.
- Docs: `deploying/recovering-in-flight-runs.mdx` (v4 + v5).

Verification
Ran the restart e2e locally, all four combinations:
[Results table lost in extraction; surviving cells: "`running`, 120s timeout" (twice)]

The run is killed mid-sleep (~0.5s into an 8s sleep) and only recovers after restart via the startup `ensureWorldStarted()`. Postgres fails-without-fix because the graphile worker only auto-boots on the next enqueue, and the test issues no post-restart op.

`pnpm build` / `typecheck` pass for all changed packages.

Note
The plan called for a `@workflow/next` `register` helper, but consuming it via `workflow/next/instrumentation` broke Turbopack (a CJS re-export double-hop hit `@workflow/core/dist/runtime` exports-encapsulation and dropped the named export through interop). Using `ensureWorldStarted()` from `workflow/runtime` directly is robust and consistent with how SvelteKit/Nest/Astro are wired.

Docs Preview
(Preview sits behind deployment protection — requires Vercel team access.)
Update (CI fixes)
Initial CI surfaced two issues from un-gating `world.start()`, both fixed:
- world-local: `initDataDir` threw `Invalid version string: "bundled"` in bundled server builds; it now skips version-compat when the version is the `bundled` sentinel.
- Nitro: the auto-registered startup plugin hit `ERR_INTERNAL_ASSERTION`. It was briefly removed in favor of manual wiring, then restored with a fix (see below).

Update (Nitro auto-start, restored & fixed)
Nitro now starts the World automatically again — no manual server plugin needed.
The original auto-plugin imported the runtime via a build-time `file://` URL, which collided with the bundled flow handler's `require()` of the same file (CJS/ESM dual-load → `ERR_INTERNAL_ASSERTION`, 500ing the flow route). The fix: `@workflow/nitro` now emits a real plugin file in the build dir that imports `workflow/runtime` via a bare dynamic import, mirroring a hand-written Nitro plugin, so the bundler resolves and dedupes it with the flow handler's runtime (no second physical module, no dual-load). `ensureWorldStarted()` caches its start promise on `globalThis`, so the World starts exactly once. Registration is gated off Vercel deploys (the Vercel World's `start()` is a no-op).

Also removes the temporary manual `workbench/nitro-v3/plugins/start-pg-world.ts` workaround + its config entry, and reverts the Nitro/Nuxt section of the recovery docs back to "no action required."

Verified on `workbench/nitro-v3` (local world), dev and production build: `world.start()` runs at true boot before any request (creates the data dir with no request made), and the flow handler serves HTTP 200 with zero `ERR_INTERNAL_ASSERTION` after the boot plugin has loaded the runtime, which is the exact regression that caused the original removal. The nitro+postgres path should be confirmed green in CI.

🤖 Generated with Claude Code