feat: wire GMS checkpoint restore flow by galletas1712 · Pull Request #8833 · ai-dynamo/dynamo

galletas1712 · 2026-04-29T07:46:34Z

Summary

Stacked on #8829 (573746394b). This wires checkpoint/restore for GMS without enabling failover yet.

Allows checkpoint + GMS admission for non-failover services and removes the DynamoCheckpoint CRD-level GMS block.
Injects --load-format gms for vLLM whenever GMS is enabled, including auto/manual checkpoint jobs.
Adds GMS saver/loader lifecycle barriers so snapshot-agent does not publish snapshot-complete / restore-complete until GMS tensor save/load is done.
Keeps the GMS sentinel handoff fast during restore by polling the GMS completion and snapshot-control sentinels every 50 ms.
Keeps GMS tensor artifacts on the snapshot PVC under /checkpoints/gms/<checkpoint-hash>/versions/<artifact-version>/device-*, separate from the CRIU tree.
Adds an NVML-avoidance fast path for GMS socket UUIDs via GMS_GPU_UUIDS, GMS_GPU_UUID_<index>, or UUID-valued CUDA_VISIBLE_DEVICES / NVIDIA_VISIBLE_DEVICES.
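For reviewers who want the artifact layout in concrete form, here is a minimal sketch of how the per-device directory could be composed. The helper name `gms_device_dir` is hypothetical and not taken from this PR, and treating the `device-*` suffix as a numeric device index is an assumption.

```python
from pathlib import PurePosixPath

# Root taken from the layout described above; the helper itself is hypothetical.
GMS_ROOT = PurePosixPath("/checkpoints/gms")


def gms_device_dir(checkpoint_hash: str, artifact_version: str, device_index: int) -> PurePosixPath:
    """Return the per-device GMS artifact directory for one checkpoint version."""
    for name, value in (("checkpoint_hash", checkpoint_hash),
                        ("artifact_version", artifact_version)):
        # Reject empty or path-like components so a bad value cannot escape the tree.
        if not value or "/" in value or value in (".", ".."):
            raise ValueError(f"invalid {name}: {value!r}")
    if device_index < 0:
        raise ValueError(f"invalid device_index: {device_index}")
    # Assumption: the device-* suffix is the numeric device index.
    return GMS_ROOT / checkpoint_hash / "versions" / artifact_version / f"device-{device_index}"
```

Because the GMS tree sits under its own `gms/` prefix, nothing in this path overlaps the CRIU tree on the same PVC.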

Deployment behavior

Intra-pod GMS

Checkpoint job:

The operator adds the usual checkpoint PVC/control-volume wiring to the main worker.
It adds gms-server as a restartable init sidecar with a startup probe that waits for all per-device GMS sockets.
It adds gms-saver as a regular Job container. The saver waits on the GMS RO lock for committed weights, writes GMS tensors to the snapshot PVC, writes a pod-UID-scoped gms-save-complete-* sentinel, then exits so the Job can complete.
snapshot-agent waits for that sentinel before waking the checkpointed workload.
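The sentinel wait can be pictured as a short polling loop. The sketch below is illustrative only: snapshot-agent lives under deploy/snapshot and is not Python, and the function name, the control directory argument, and the exact `<prefix><pod-uid>` file name are assumptions rather than the PR's actual contract. The 50 ms interval matches the cadence mentioned in the summary.

```python
import time
from pathlib import Path


def wait_for_sentinel(control_dir: Path, prefix: str, pod_uid: str,
                      timeout_s: float = 300.0, interval_s: float = 0.05) -> Path:
    """Block until <control_dir>/<prefix><pod_uid> exists, polling every interval_s."""
    # Assumption: the sentinel file name is the prefix followed by the pod UID.
    sentinel = control_dir / f"{prefix}{pod_uid}"
    deadline = time.monotonic() + timeout_s
    while True:
        if sentinel.exists():
            return sentinel
        if time.monotonic() >= deadline:
            raise TimeoutError(f"sentinel {sentinel} not seen within {timeout_s}s")
        time.sleep(interval_s)
```

Scoping the file name by pod UID means a sentinel left behind by an earlier pod on the same volume does not satisfy the wait for a new one.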

Restore pod:

The operator adds gms-server as a restartable init sidecar with the same socket-readiness startup probe.
It adds gms-loader as a regular sidecar container on the restore target pod. Because Kubernetes only starts regular containers after the init sidecar startup probe succeeds, the loader does not carry a separate socket wait loop.
The loader reads from the same GMS artifact directory, waits on the GMS RW lock as needed, writes a pod-UID-scoped gms-load-complete-* sentinel, then stays alive with the pod.
snapshot-agent waits for that sentinel before writing restore-complete.
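On the writer side, one way to publish a completion sentinel so that a poller never observes a half-written file is to write to a temporary name and rename it into place. This is a sketch under assumptions: `write_sentinel`, the file contents, and the `<prefix><pod-uid>` naming are hypothetical, and the PR's loader may do this differently.

```python
import os
from pathlib import Path


def write_sentinel(control_dir: Path, prefix: str, pod_uid: str) -> Path:
    """Publish <prefix><pod_uid> atomically so a poller never sees a partial file."""
    final = control_dir / f"{prefix}{pod_uid}"
    tmp = control_dir / f".{final.name}.tmp"
    with open(tmp, "w") as f:
        f.write("done\n")
        f.flush()
        os.fsync(f.fileno())  # make the contents durable before the rename
    # os.replace is an atomic rename when both paths are on the same filesystem.
    os.replace(tmp, final)
    return final
```

For example, `write_sentinel(Path("/tmp/control"), "gms-load-complete-", pod_uid)` would be called only after the tensor load has finished, which is the ordering the lifecycle barrier relies on.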

Inter-pod GMS

Restore:

The engine pod is shaped as the CRIU restore target, but it does not get an intra-pod/private GMS server.
The dedicated RoleGMS weight-server pod gets a regular gms-loader sidecar that mounts the snapshot PVC and loads into the real inter-pod GMS server at /run/gms/shared.
The engine pod carries a shared GMS completion-mode annotation so snapshot-agent waits for the RoleGMS loader sentinel before resuming the restored engine.

Checkpoint jobs still run with a local GMS server+saver pair so the capture pod can load through GMS and persist the tensors for restore.

NVML / GPU UUID note

The GMS Python path now prefers UUIDs supplied by env before querying NVML. The operator cannot know the final per-pod DRA allocation at template render time, so the direct containerd/ResourceClaim-to-env injection should happen from the runtime/DRA side. Once that env exists, GMS will avoid the slow NVML call in the startup path.
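A rough sketch of the env-first lookup follows. The variable names come from the summary above; the function name, the precedence order, the comma-separated list format, and the `GPU-` / `MIG-` prefix check used to tell UUIDs from plain indices are assumptions, not the code in lib/gpu_memory_service.

```python
import os
from typing import Mapping, Optional


def gpu_uuid_from_env(index: int, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the UUID for device `index` from env, or None to signal an NVML fallback."""
    env = os.environ if env is None else env

    def nth(raw: str) -> Optional[str]:
        # Assumption: list-valued variables are comma-separated.
        items = [s.strip() for s in raw.split(",") if s.strip()]
        return items[index] if 0 <= index < len(items) else None

    # 1. Explicit list of UUIDs for all devices.
    value = nth(env.get("GMS_GPU_UUIDS", ""))
    if value:
        return value
    # 2. Per-index override.
    value = env.get(f"GMS_GPU_UUID_{index}", "").strip()
    if value:
        return value
    # 3. Visible-devices variables, used only when they hold UUIDs rather than indices.
    for name in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        value = nth(env.get(name, ""))
        if value and value.startswith(("GPU-", "MIG-")):
            return value
    return None  # nothing usable in env; the caller would query NVML
```

Returning `None` rather than raising keeps the NVML query as the fallback, so pods without runtime-injected env behave as before.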

Out of scope

GMS + failover is not enabled here. Admission allows checkpoint+GMS when failover is disabled; failover remains guarded by the internal bypass from #8829 (feat(operator): gate GMS + Snapshot via Helm config).
I did not change the actual containerd/DRA env injection mechanism for GPU UUIDs; this PR adds the GMS-side env contract and fallback behavior.

Static review

Static review found four issues: service admission still blocking GMS, inter-pod restore loading a private server, stale restore sentinels, and intra-pod failover targeting. This patch addresses them by allowing non-failover admission, moving inter-pod loading to RoleGMS, making sentinel naming pod-scoped or explicitly shared, and keeping failover out of scope.

Dev image

Operator image for 3d172262bef: nvcr.io/nvidian/dynamo-dev/schwinns:kubernetes-operator-3d172262bef-gms-init-server-20260430
Digest: sha256:137c7b9a29bbea886d197ee84a168bd8d275465e7e7bd68d654e4008a0a25843
Real GMS checkpoint runs with this operator also need matching runtime/placeholder images from 3d172262bef or newer, because gms-saver now exits in the image code path.

Tests

CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/dynamo ./internal/webhook/validation from deploy/operator
CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/gms from deploy/operator
CGO_ENABLED=0 go test -count=1 -skip TestControllers ./internal/controller from deploy/operator
python3 -m py_compile lib/gpu_memory_service/common/utils.py lib/gpu_memory_service/cli/snapshot/loader.py lib/gpu_memory_service/cli/snapshot/saver.py
CGO_ENABLED=0 go test -count=1 ./protocol from deploy/snapshot
GOOS=linux CGO_ENABLED=0 go test -c ./internal/controller -o /tmp/dynamo-snapshot-controller.test from deploy/snapshot
uv run --project . --extra test pytest -q -c pyproject.toml tests/test_common_utils.py from lib/gpu_memory_service
go test ./internal/controller from deploy/snapshot
python -m py_compile components/src/dynamo/common/utils/snapshot.py


pull-request-size Bot added the size/XL label Apr 29, 2026

github-actions Bot added the feat and deployment::k8s labels Apr 29, 2026

pull-request-size Bot added size/XXL and removed size/XL labels Apr 30, 2026

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 5737463 to 346cac5 May 5, 2026 21:15

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch 2 times, most recently from e855c4e to 7bb557a May 5, 2026 23:45

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 346cac5 to cff2def May 5, 2026 23:45

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7bb557a to 2443fa4 May 5, 2026 23:48

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch 2 times, most recently from ab13f45 to 9a8fe70 May 5, 2026 23:52

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 2443fa4 to a2d0267 May 5, 2026 23:52

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 9a8fe70 to 925ecc3 May 5, 2026 23:54

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from a2d0267 to 7e1e56a May 5, 2026 23:54

copy-pr-bot Bot temporarily deployed to GITLAB May 5, 2026 23:54 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 00:19 Inactive

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 925ecc3 to 382c118 May 6, 2026 07:33

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7e1e56a to b7c308f May 6, 2026 07:33

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 07:33 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 07:44 Inactive

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 382c118 to d6f4f5b May 6, 2026 07:54

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from b7c308f to f853475 May 6, 2026 07:54

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 07:54 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 08:00 Inactive

galletas1712 added 4 commits May 6, 2026 01:01

feat: wire GMS checkpoint restore flow

1efc754

fix(snapshot): use fast startup probe cadence on restore

24f3151

fix(snapshot): reduce GMS sentinel handoff delay

8daf70c

fix(snapshot): reuse cached images for restore placeholders

3ebd59d

galletas1712 added 4 commits May 6, 2026 01:01

checkpoint: run gms restore helpers as sidecars

da0fe8a

checkpoint: keep gms server as init sidecar

dd03987

snapshot: checkpoint when target container is ready

838099d

snapshot: reduce GMS restore startup wait

2cca8b5

galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from d6f4f5b to 0a37aac May 6, 2026 08:02

galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from f853475 to 2cca8b5 May 6, 2026 08:02

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 08:02 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 6, 2026 08:11 Inactive

galletas1712 wants to merge 8 commits into schwinns/gms-checkpoint-admission-bypass from codex/interpod-gms-checkpoint-restore
