5 changes: 5 additions & 0 deletions .changeset/nitro-auto-world-start.md
@@ -0,0 +1,5 @@
---
'@workflow/nitro': minor
---

Start the workflow World automatically at server boot via a generated Nitro plugin, so self-hosted Nitro apps (Nitro v2/v3, Nuxt, Express/Hono/Fastify on Nitro) recover in-flight runs after a restart with no manual wiring. Skipped on Vercel deploys.
5 changes: 5 additions & 0 deletions .changeset/world-local-bundled-version.md
@@ -0,0 +1,5 @@
---
'@workflow/world-local': patch
---

Skip data-dir version-compat enforcement when the package version is the `bundled` sentinel (framework server bundles), so `world.start()` at server startup no longer throws `Invalid version string: "bundled"`.
6 changes: 6 additions & 0 deletions .changeset/world-start-recovery-core.md
@@ -0,0 +1,6 @@
---
'workflow': minor
'@workflow/core': minor
---

Add `ensureWorldStarted()` (exported from `workflow/runtime`) which starts the World once per process at server startup, running boot-time recovery of in-flight runs for self-hosted worlds. Call it from your framework's startup hook (e.g. a Next.js `instrumentation.ts`).
5 changes: 5 additions & 0 deletions .changeset/world-start-recovery-vercel.md
@@ -0,0 +1,5 @@
---
'@workflow/world-vercel': minor
---

Add a no-op `start()` for World-interface compliance. The Vercel World is push-based (VQS redelivery), so it needs no boot-time recovery.
5 changes: 5 additions & 0 deletions .changeset/world-start-recovery-world.md
@@ -0,0 +1,5 @@
---
'@workflow/world': patch
---

Document the `start()` contract: it must be idempotent and may be a no-op for push-based/serverless worlds, and is where queue-backed worlds run boot-time recovery.
115 changes: 112 additions & 3 deletions .github/workflows/tests.yml
@@ -931,11 +931,114 @@ jobs:
env-vars: ${{ matrix.world.env-vars }}
secrets: inherit

  # Restart recovery: prove an in-flight (sleeping) run resumes after a hard
  # restart with NO workflow operation, i.e. server startup wiring
  # (instrumentation.ts -> ensureWorldStarted) runs boot-time recovery. Unlike
  # the other e2e jobs, this one does NOT pre-start the server; the test owns
  # the server lifecycle (spawn, SIGKILL, respawn). Covers local + postgres on
  # nextjs-turbopack.
  e2e-restart-recovery:
    name: E2E Restart Recovery (nextjs-turbopack - ${{ matrix.world }})
    runs-on: ubuntu-latest
    timeout-minutes: 30
    if: ${{ needs.ci-scope.outputs.fast-path != 'true' && !contains(github.event.pull_request.labels.*.name, 'workflow-server-test') }}
    needs: [ci-scope, e2e-package-build]
    strategy:
      fail-fast: false
      matrix:
        world: [local, postgres]

    # Defined unconditionally; the local matrix entry simply does not use it.
    services:
      postgres:
        image: postgres:18-alpine
        env:
          POSTGRES_USER: world
          POSTGRES_PASSWORD: world
          POSTGRES_DB: world
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    env:
      TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
      TURBO_TEAM: ${{ vars.TURBO_TEAM }}
      WORKFLOW_PUBLIC_MANIFEST: '1'
      APP_NAME: nextjs-turbopack
      RESTART_RECOVERY_TEST: '1'
      DEPLOYMENT_URL: 'http://localhost:3000'
      # Empty for the local matrix entry -> resolveWorkflowTargetWorld() falls
      # back to the local world.
      WORKFLOW_TARGET_WORLD: ${{ matrix.world == 'postgres' && '@workflow/world-postgres' || '' }}
      WORKFLOW_POSTGRES_URL: ${{ matrix.world == 'postgres' && 'postgres://world:world@localhost:5432/world' || '' }}

    steps:
      - name: Checkout Repo
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha || github.sha }}

      - name: Setup environment
        uses: ./.github/actions/setup-workflow-dev
        with:
          install-dependencies: 'false'
          build-packages: 'false'

      - name: Install Dependencies
        run: pnpm install --frozen-lockfile

      - name: Download shared package builds
        uses: actions/download-artifact@v4
        with:
          name: e2e-package-build-artifacts
          path: packages

      - name: Setup PostgreSQL Database
        if: ${{ matrix.world == 'postgres' }}
        run: ./packages/world-postgres/bin/setup.js

      - name: Prepare workbench path
        id: prepare-workbench
        uses: ./.github/actions/prepare-workbench-path
        with:
          app-name: nextjs-turbopack

      - name: Build workbench
        run: pnpm vitest run packages/core/e2e/local-build.test.ts
        env:
          APP_NAME: nextjs-turbopack
          WORKBENCH_APP_PATH: ${{ steps.prepare-workbench.outputs.workbench_app_path }}

      - name: Run Restart Recovery Test
        run: |
          pnpm vitest run packages/core/e2e/restart-recovery.test.ts --reporter=default --reporter=json --reporter=./packages/core/e2e/github-reporter.ts --outputFile=e2e-restart-recovery-${{ matrix.world }}.json
        env:
          NODE_OPTIONS: "--enable-source-maps"
          APP_NAME: nextjs-turbopack
          WORKBENCH_APP_PATH: ${{ steps.prepare-workbench.outputs.workbench_app_path }}

      - name: Generate E2E summary
        if: always()
        run: node .github/scripts/aggregate-e2e-results.js . --job-name "E2E Restart Recovery (nextjs-turbopack - ${{ matrix.world }})" >> $GITHUB_STEP_SUMMARY || true

      - name: Upload E2E results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-results-restart-recovery-${{ matrix.world }}
          path: e2e-restart-recovery-${{ matrix.world }}.json
          retention-days: 7
          if-no-files-found: ignore

# Final job: Aggregate all E2E results and update PR comment
summary:
name: E2E Summary
runs-on: ubuntu-latest
- needs: [ci-scope, e2e-vercel-prod, e2e-local-dev, e2e-local-prod, e2e-local-postgres, e2e-windows]
+ needs: [ci-scope, e2e-vercel-prod, e2e-local-dev, e2e-local-prod, e2e-local-postgres, e2e-restart-recovery, e2e-windows]
if: always() && !cancelled() && needs.ci-scope.outputs.fast-path != 'true'
timeout-minutes: 10

@@ -966,15 +1069,17 @@ jobs:
LOCAL_DEV_STATUS="${{ needs.e2e-local-dev.result }}"
LOCAL_PROD_STATUS="${{ needs.e2e-local-prod.result }}"
POSTGRES_STATUS="${{ needs.e2e-local-postgres.result }}"
RESTART_RECOVERY_STATUS="${{ needs.e2e-restart-recovery.result }}"
WINDOWS_STATUS="${{ needs.e2e-windows.result }}"

echo "vercel=$VERCEL_STATUS" >> $GITHUB_OUTPUT
echo "local-dev=$LOCAL_DEV_STATUS" >> $GITHUB_OUTPUT
echo "local-prod=$LOCAL_PROD_STATUS" >> $GITHUB_OUTPUT
echo "postgres=$POSTGRES_STATUS" >> $GITHUB_OUTPUT
echo "restart-recovery=$RESTART_RECOVERY_STATUS" >> $GITHUB_OUTPUT
echo "windows=$WINDOWS_STATUS" >> $GITHUB_OUTPUT

- if [[ "$VERCEL_STATUS" == "failure" || "$LOCAL_DEV_STATUS" == "failure" || "$LOCAL_PROD_STATUS" == "failure" || "$POSTGRES_STATUS" == "failure" || "$WINDOWS_STATUS" == "failure" ]]; then
+ if [[ "$VERCEL_STATUS" == "failure" || "$LOCAL_DEV_STATUS" == "failure" || "$LOCAL_PROD_STATUS" == "failure" || "$POSTGRES_STATUS" == "failure" || "$RESTART_RECOVERY_STATUS" == "failure" || "$WINDOWS_STATUS" == "failure" ]]; then
echo "has_failures=true" >> $GITHUB_OUTPUT
else
echo "has_failures=false" >> $GITHUB_OUTPUT
@@ -1001,6 +1106,7 @@ jobs:
- Local Dev: ${{ needs.e2e-local-dev.result }}
- Local Prod: ${{ needs.e2e-local-prod.result }}
- Local Postgres: ${{ needs.e2e-local-postgres.result }}
- Restart Recovery: ${{ needs.e2e-restart-recovery.result }}
- Windows: ${{ needs.e2e-windows.result }}

Check the [workflow run](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) for details.
@@ -1010,7 +1116,7 @@ jobs:
e2e-required-check:
name: E2E Required Check
runs-on: ubuntu-latest
- needs: [ci-scope, unit, e2e-package-build, e2e-vercel-prod, e2e-local-dev, e2e-local-prod, e2e-local-postgres, e2e-windows]
+ needs: [ci-scope, unit, e2e-package-build, e2e-vercel-prod, e2e-local-dev, e2e-local-prod, e2e-local-postgres, e2e-restart-recovery, e2e-windows]
if: always()
timeout-minutes: 5

@@ -1023,6 +1129,7 @@ jobs:
LOCAL_DEV_STATUS: ${{ needs.e2e-local-dev.result }}
LOCAL_PROD_STATUS: ${{ needs.e2e-local-prod.result }}
POSTGRES_STATUS: ${{ needs.e2e-local-postgres.result }}
RESTART_RECOVERY_STATUS: ${{ needs.e2e-restart-recovery.result }}
WINDOWS_STATUS: ${{ needs.e2e-windows.result }}
FAST_PATH: ${{ needs.ci-scope.outputs.fast-path }}
VALIDATION_FAST_PATH: ${{ needs.ci-scope.outputs.validation-fast-path }}
@@ -1050,6 +1157,7 @@ jobs:
[[ "$LOCAL_DEV_STATUS" == "skipped" ]] || echo "Warning: e2e-local-dev was not skipped ($LOCAL_DEV_STATUS)"
[[ "$LOCAL_PROD_STATUS" == "skipped" ]] || echo "Warning: e2e-local-prod was not skipped ($LOCAL_PROD_STATUS)"
[[ "$POSTGRES_STATUS" == "skipped" ]] || echo "Warning: e2e-local-postgres was not skipped ($POSTGRES_STATUS)"
[[ "$RESTART_RECOVERY_STATUS" == "skipped" ]] || echo "Warning: e2e-restart-recovery was not skipped ($RESTART_RECOVERY_STATUS)"
[[ "$WINDOWS_STATUS" == "skipped" ]] || echo "Warning: e2e-windows was not skipped ($WINDOWS_STATUS)"
else
echo "Standard PR - checking all jobs"
@@ -1059,6 +1167,7 @@ jobs:
[[ "$LOCAL_DEV_STATUS" == "success" ]] || FAILED_JOBS+=("e2e-local-dev ($LOCAL_DEV_STATUS)")
[[ "$LOCAL_PROD_STATUS" == "success" ]] || FAILED_JOBS+=("e2e-local-prod ($LOCAL_PROD_STATUS)")
[[ "$POSTGRES_STATUS" == "success" ]] || FAILED_JOBS+=("e2e-local-postgres ($POSTGRES_STATUS)")
[[ "$RESTART_RECOVERY_STATUS" == "success" ]] || FAILED_JOBS+=("e2e-restart-recovery ($RESTART_RECOVERY_STATUS)")
[[ "$WINDOWS_STATUS" == "success" ]] || FAILED_JOBS+=("e2e-windows ($WINDOWS_STATUS)")
fi

2 changes: 1 addition & 1 deletion docs/content/docs/v4/deploying/meta.json
@@ -1,4 +1,4 @@
{
"title": "Deploying",
- "pages": ["...deploying", "building-a-world"]
+ "pages": ["...deploying", "recovering-in-flight-runs", "building-a-world"]
}
88 changes: 88 additions & 0 deletions docs/content/docs/v4/deploying/recovering-in-flight-runs.mdx
@@ -0,0 +1,88 @@
---
title: Recovering in-flight runs
description: Start the World at server boot so runs that were in flight when the process stopped resume after a restart.
type: guide
summary: Call the World's start() at server boot to recover in-flight runs after a restart.
prerequisites:
- /docs/deploying
related:
- /docs/deploying/world/local-world
- /docs/deploying/world/postgres-world
- /docs/deploying/world/vercel-world
---

When you self-host a workflow app on a long-lived server (using the [local](/docs/deploying/world/local-world) or [Postgres](/docs/deploying/world/postgres-world) world), a run can be mid-flight when the process stops or crashes: sleeping, waiting on a hook, or between steps. To resume those runs, the World's `start()` method runs **boot-time recovery**: it re-enqueues every `pending` or `running` run so execution continues.
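
As a rough mental model only, boot-time recovery can be pictured as a filter over stored runs followed by a re-enqueue. The sketch below is not the World implementation: `RunRecord`, `recoverInFlightRuns`, and the `completed`/`failed` statuses are hypothetical names introduced for illustration. Only the `pending` and `running` statuses come from this page.

```ts
// Hypothetical types for illustration; the real World API may differ.
type RunStatus = 'pending' | 'running' | 'completed' | 'failed';

interface RunRecord {
  id: string;
  status: RunStatus;
}

// Re-enqueues every run that had not finished when the process stopped.
// Returns the ids it re-enqueued, in input order.
function recoverInFlightRuns(
  runs: readonly RunRecord[],
  enqueue: (runId: string) => void,
): string[] {
  const recovered: string[] = [];
  for (const run of runs) {
    // Finished runs are left alone; only in-flight runs need resuming.
    if (run.status === 'pending' || run.status === 'running') {
      enqueue(run.id);
      recovered.push(run.id);
    }
  }
  return recovered;
}
```

A real World would read the runs from its own storage and push them onto its own queue; the point of the sketch is only that recovery selects by status and nothing else has to trigger it.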

Recovery only happens if `start()` is actually called, and it must be called **once at server startup** — not in response to a request. Otherwise an idle server that restarted with in-flight runs would never pick them back up.

## `ensureWorldStarted()`

Call `ensureWorldStarted()` from `workflow/runtime` in your framework's server-startup hook:

```ts
import { ensureWorldStarted } from 'workflow/runtime';

await ensureWorldStarted();
```

It is **idempotent** — it starts the World at most once per process, so it is safe to call from a hook that may run more than once. Re-enqueuing a run that is already progressing is harmless: the workflow handler is replay-idempotent, so duplicate enqueues converge rather than double-execute.
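
One common way to get "at most once per process" behavior is to cache the promise returned by the first start. The sketch below illustrates that pattern under assumptions: `StartableWorld` and `createEnsureStarted` are hypothetical names, and this is not necessarily how `ensureWorldStarted()` is implemented.

```ts
// Hypothetical minimal shape; only `start()` matters for this sketch.
interface StartableWorld {
  start(): Promise<void>;
}

// Returns a function that starts the world on the first call and hands
// every later caller the same promise.
function createEnsureStarted(world: StartableWorld): () => Promise<void> {
  let started: Promise<void> | undefined;
  return () => {
    // Cache the promise rather than a boolean, so callers that arrive
    // while the first start is still pending share it instead of racing.
    if (started === undefined) {
      started = world.start();
    }
    return started;
  };
}
```

Note that this sketch also caches a rejected promise, so a failed start is not retried. Whether the real function retries after a failure is not stated here.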

You can call this regardless of which World you target. On the [Vercel World](/docs/deploying/world/vercel-world) it is a no-op — delivery is push-based and the queue redelivers in-flight messages on its own, so there is no long-lived process to recover.

## Wiring it per framework

Contributor Author

We should also add this as an optional accordion (collapsed) setup step in each framework's getting-started guide. The step should state that this is not required for Vercel deployments (serverless, push-based queue worlds) but is required for pull-based or worker-based Workflow SDK deployments, and it can link to this docs page for details.

Contributor Author

Same for the v4 and v5 docs.

Contributor Author

Done in 6116307: added an optional, collapsed accordion ("Recover in-flight runs after a restart") to each framework's getting-started guide (Next/Nitro/Express/Hono/Fastify/Nuxt/Vite/TanStack Start/SvelteKit/Nest/Astro, v4 + v5). Each shows the framework's startup snippet, notes it is not required for Vercel deployments, and links to the full Recovering in-flight runs page.


### Next.js

Add an `instrumentation.ts` at your project root (or inside `src/` if your app uses a `src` directory). Guard on the Node.js runtime: `instrumentation.ts` also runs in the Edge runtime, which can't load the world modules:

```ts title="instrumentation.ts"
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { ensureWorldStarted } = await import('workflow/runtime');
    await ensureWorldStarted();
  }
}
```

### Nitro, Nuxt, Express, Hono, Fastify (Nitro)

No action required: the `@workflow/nitro` integration registers a Nitro server plugin that starts the World at boot for you. The plugin is skipped on Vercel deploys, where the push-based Vercel World needs no boot recovery.

### SvelteKit

Use the [`init`](https://svelte.dev/docs/kit/hooks#Shared-hooks-init) server hook:

```ts title="src/hooks.server.ts"
import type { ServerInit } from '@sveltejs/kit';

export const init: ServerInit = async () => {
  const { ensureWorldStarted } = await import('workflow/runtime');
  await ensureWorldStarted();
};
```

### NestJS

Call it in your `bootstrap()` before listening:

```ts title="src/main.ts"
async function bootstrap() {
  const { ensureWorldStarted } = await import('workflow/runtime');
  await ensureWorldStarted();
  // ...create and listen
}
```

### Astro

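Astro has no startup hook that works across all adapters, so start the World from middleware. `ensureWorldStarted()` is idempotent, so it only does real work on the first request. Note the trade-off: recovery then runs when the first request arrives rather than at boot, so a restarted server that receives no traffic will not resume in-flight runs until it does: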

```ts title="src/middleware.ts"
import { defineMiddleware } from 'astro:middleware';

export const onRequest = defineMiddleware(async (_context, next) => {
  const { ensureWorldStarted } = await import('workflow/runtime');
  await ensureWorldStarted();
  return next();
});
```
12 changes: 12 additions & 0 deletions docs/content/docs/v4/deploying/world/local-world.mdx
@@ -18,6 +18,18 @@ To explicitly use the local world in any environment, set the environment variab
WORKFLOW_TARGET_WORLD=local
```

## Starting the World

The Local World keeps its queue in memory, so a run that is in flight (for example, sleeping) when the process stops only resumes if the World is started again on boot. Start it once at server startup so in-flight runs recover after a restart:

```ts
import { ensureWorldStarted } from 'workflow/runtime';

await ensureWorldStarted();
```

Where to call this depends on your framework — see [Recovering in-flight runs](/docs/deploying/recovering-in-flight-runs).

## Observability

The `workflow` CLI uses the local world by default. Running these commands inside your workflow project will show your local development workflows: