feat: wire GMS checkpoint restore flow #8833

Draft
galletas1712 wants to merge 8 commits into schwinns/gms-checkpoint-admission-bypass from codex/interpod-gms-checkpoint-restore

galletas1712 commented Apr 29, 2026

Summary

Stacked on #8829 (573746394b). This wires checkpoint/restore for GMS without enabling failover yet.

  • Allows checkpoint + GMS admission for non-failover services and removes the DynamoCheckpoint CRD-level GMS block.
  • Injects --load-format gms for vLLM whenever GMS is enabled, including auto/manual checkpoint jobs.
  • Adds GMS saver/loader lifecycle barriers so snapshot-agent does not publish snapshot-complete / restore-complete until GMS tensor save/load is done.
  • Keeps the GMS sentinel handoff fast during restore by polling GMS completion and snapshot-control sentinels at 50 ms intervals.
  • Keeps GMS tensor artifacts on the snapshot PVC under /checkpoints/gms/<checkpoint-hash>/versions/<artifact-version>/device-*, separate from the CRIU tree.
  • Adds an NVML-avoidance fast path for GMS socket UUIDs via GMS_GPU_UUIDS, GMS_GPU_UUID_<index>, or UUID-valued CUDA_VISIBLE_DEVICES / NVIDIA_VISIBLE_DEVICES.
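The artifact layout described in the list above can be pictured with a small path helper. This is a minimal sketch, not the code in this PR: `gms_artifact_dir` and `device_dirs` are hypothetical names, and the only thing taken from the description is the directory pattern `/checkpoints/gms/<checkpoint-hash>/versions/<artifact-version>/device-*`.

```python
from pathlib import Path

# Root taken from the layout described above; the helper names are illustrative.
GMS_ARTIFACT_ROOT = Path("/checkpoints/gms")


def gms_artifact_dir(checkpoint_hash: str, artifact_version: str,
                     root: Path = GMS_ARTIFACT_ROOT) -> Path:
    """Return <root>/<checkpoint-hash>/versions/<artifact-version>."""
    if not checkpoint_hash or not artifact_version:
        raise ValueError("checkpoint_hash and artifact_version must be non-empty")
    return root / checkpoint_hash / "versions" / artifact_version


def device_dirs(artifact_dir: Path) -> list[Path]:
    """List the per-device subdirectories (device-*) in name order."""
    return sorted(p for p in artifact_dir.glob("device-*") if p.is_dir())
```

Keeping the tensor artifacts under their own root, rather than inside the CRIU tree, means a saver and a loader only need to agree on the checkpoint hash and artifact version to find each other's data.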

Deployment behavior

Intra-pod GMS

Checkpoint job:

  • The operator adds the usual checkpoint PVC/control-volume wiring to the main worker.
  • It adds gms-server as a restartable init sidecar with a startup probe that waits for all per-device GMS sockets.
  • It adds gms-saver as a regular Job container. The saver waits on the GMS RO lock for committed weights, writes GMS tensors to the snapshot PVC, writes a pod-UID-scoped gms-save-complete-* sentinel, then exits so the Job can complete.
  • snapshot-agent waits for that sentinel before waking the checkpointed workload.

Restore pod:

  • The operator adds gms-server as a restartable init sidecar with the same socket-readiness startup probe.
  • It adds gms-loader as a regular sidecar container on the restore target pod. Because Kubernetes only starts regular containers after the init sidecar startup probe succeeds, the loader does not carry a separate socket wait loop.
  • The loader reads from the same GMS artifact directory, waits on the GMS RW lock as needed, writes a pod-UID-scoped gms-load-complete-* sentinel, then stays alive with the pod.
  • snapshot-agent waits for that sentinel before writing restore-complete.
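The sentinel barrier used in both flows above amounts to polling for a file until it appears. The sketch below illustrates that idea under stated assumptions: the control directory argument, the `wait_for_sentinel` name, the timeout, and the exact `<prefix><pod-uid>` file name are all hypothetical, since the description only says the sentinels are pod-UID-scoped `gms-save-complete-*` / `gms-load-complete-*` files polled at 50 ms.

```python
import time
from pathlib import Path


def wait_for_sentinel(control_dir: str, prefix: str, pod_uid: str,
                      timeout_s: float = 300.0, poll_s: float = 0.05) -> Path:
    """Block until <control_dir>/<prefix><pod_uid> exists, polling every poll_s.

    Assumption: the sentinel file name is the prefix followed by the pod UID.
    Raises TimeoutError if the file has not appeared after timeout_s seconds.
    """
    sentinel = Path(control_dir) / f"{prefix}{pod_uid}"
    deadline = time.monotonic() + timeout_s
    while True:
        if sentinel.exists():
            return sentinel
        if time.monotonic() >= deadline:
            raise TimeoutError(f"sentinel {sentinel} not seen within {timeout_s}s")
        time.sleep(poll_s)
```

A caller on the checkpoint side might use `wait_for_sentinel(control_dir, "gms-save-complete-", pod_uid)`, and on the restore side the same call with `"gms-load-complete-"`. Scoping the name to the pod UID is what keeps a sentinel left behind by an earlier pod from satisfying the wait.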

Inter-pod GMS

Restore:

  • The engine pod is shaped as the CRIU restore target, but it does not get an intra-pod/private GMS server.
  • The dedicated RoleGMS weight-server pod gets a regular gms-loader sidecar that mounts the snapshot PVC and loads into the real inter-pod GMS server at /run/gms/shared.
  • The engine pod carries a shared GMS completion-mode annotation so snapshot-agent waits for the RoleGMS loader sentinel before resuming the restored engine.
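One way to picture the completion-mode switch is as a choice of which sentinel name the agent waits on. The following is only an illustrative sketch: the annotation key, the mode value `"shared"`, and the shared sentinel name are hypothetical placeholders, because the description does not spell them out.

```python
from typing import Mapping

# Hypothetical placeholders: the real annotation key and shared sentinel name
# are not given in the PR description.
COMPLETION_MODE_ANNOTATION = "example.invalid/gms-completion-mode"
SHARED_MODE = "shared"
SHARED_LOAD_SENTINEL = "gms-load-complete-shared"


def load_sentinel_name(annotations: Mapping[str, str], pod_uid: str) -> str:
    """Pick the load-complete sentinel to wait on for a restored pod.

    Shared mode: wait on one sentinel written by the RoleGMS loader.
    Otherwise: wait on the sentinel scoped to this pod's own UID.
    """
    if annotations.get(COMPLETION_MODE_ANNOTATION) == SHARED_MODE:
        return SHARED_LOAD_SENTINEL
    return f"gms-load-complete-{pod_uid}"
```

The point of the distinction is that, in the inter-pod case, the engine pod has no loader of its own, so a sentinel keyed to the engine pod's UID would never be written.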

Checkpoint jobs still run with a local GMS server+saver pair so the capture pod can load through GMS and persist the tensors for restore.

NVML / GPU UUID note

The GMS Python path now prefers UUIDs supplied by env before querying NVML. The operator cannot know the final per-pod DRA allocation at template render time, so the direct containerd/ResourceClaim-to-env injection should happen from the runtime/DRA side. Once that env exists, GMS will avoid the slow NVML call in the startup path.
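The env-first lookup can be sketched as follows. This is a hedged illustration rather than the code in `lib/gpu_memory_service/common/utils.py`: the function names are hypothetical, and the comma-separated format of `GMS_GPU_UUIDS` and the precedence order between the variables are assumptions. The `GPU-` and `MIG-` prefixes are the standard NVIDIA UUID forms.

```python
import os
from typing import Mapping, Optional


def _split(value: str) -> list[str]:
    """Split a comma-separated env value, dropping empty entries."""
    return [v.strip() for v in value.split(",") if v.strip()]


def _all_uuids(items: list[str]) -> bool:
    """True if every entry looks like an NVIDIA UUID rather than an index."""
    return bool(items) and all(v.startswith(("GPU-", "MIG-")) for v in items)


def gpu_uuid_from_env(index: int,
                      env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the UUID for device `index` from env, or None to fall back to NVML."""
    env = os.environ if env is None else env
    if index < 0:
        return None
    # 1. Explicit list (assumed comma-separated, ordered by device index).
    uuids = _split(env.get("GMS_GPU_UUIDS", ""))
    if index < len(uuids):
        return uuids[index]
    # 2. Per-index variable.
    single = env.get(f"GMS_GPU_UUID_{index}", "").strip()
    if single:
        return single
    # 3. Visibility variables, used only when they hold UUIDs, not indices.
    for name in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        items = _split(env.get(name, ""))
        if _all_uuids(items) and index < len(items):
            return items[index]
    return None
```

When this returns `None`, the caller would query NVML as before, so the slow path remains the fallback rather than being removed.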

Out of scope

  • GMS + failover is not enabled here. Admission allows checkpoint+GMS when failover is disabled; failover remains guarded by the internal bypass from #8829 (feat(operator): gate GMS + Snapshot via Helm config).
  • I did not change the actual containerd/DRA env injection mechanism for GPU UUIDs; this PR adds the GMS-side env contract and fallback behavior.

Static review

Static review found service admission still blocking GMS, inter-pod restore loading a private server, stale restore sentinels, and intra-pod failover targeting. This patch addresses those by allowing non-failover admission, moving inter-pod loading to RoleGMS, making sentinel naming pod-scoped or explicitly shared, and keeping failover out of scope.

Dev image

  • Operator image for 3d172262bef: nvcr.io/nvidian/dynamo-dev/schwinns:kubernetes-operator-3d172262bef-gms-init-server-20260430
  • Digest: sha256:137c7b9a29bbea886d197ee84a168bd8d275465e7e7bd68d654e4008a0a25843
  • Real GMS checkpoint runs with this operator also need matching runtime/placeholder images from 3d172262bef or newer, because gms-saver now exits in the image code path.

Tests

  • CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/dynamo ./internal/webhook/validation from deploy/operator
  • CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/gms from deploy/operator
  • CGO_ENABLED=0 go test -count=1 -skip TestControllers ./internal/controller from deploy/operator
  • python3 -m py_compile lib/gpu_memory_service/common/utils.py lib/gpu_memory_service/cli/snapshot/loader.py lib/gpu_memory_service/cli/snapshot/saver.py
  • CGO_ENABLED=0 go test -count=1 ./protocol from deploy/snapshot
  • GOOS=linux CGO_ENABLED=0 go test -c ./internal/controller -o /tmp/dynamo-snapshot-controller.test from deploy/snapshot
  • uv run --project . --extra test pytest -q -c pyproject.toml tests/test_common_utils.py from lib/gpu_memory_service
  • go test ./internal/controller from deploy/snapshot
  • python -m py_compile components/src/dynamo/common/utils/snapshot.py

copy-pr-bot (bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions (bot) added the feat and deployment::k8s (Relates to dynamo deployment in kubernetes) labels Apr 29, 2026
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 5737463 to 346cac5 (May 5, 2026 21:15)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch 2 times, most recently from e855c4e to 7bb557a (May 5, 2026 23:45)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 346cac5 to cff2def (May 5, 2026 23:45)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7bb557a to 2443fa4 (May 5, 2026 23:48)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch 2 times, most recently from ab13f45 to 9a8fe70 (May 5, 2026 23:52)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 2443fa4 to a2d0267 (May 5, 2026 23:52)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 9a8fe70 to 925ecc3 (May 5, 2026 23:54)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from a2d0267 to 7e1e56a (May 5, 2026 23:54)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 925ecc3 to 382c118 (May 6, 2026 07:33)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7e1e56a to b7c308f (May 6, 2026 07:33)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from 382c118 to d6f4f5b (May 6, 2026 07:54)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from b7c308f to f853475 (May 6, 2026 07:54)
galletas1712 force-pushed the schwinns/gms-checkpoint-admission-bypass branch from d6f4f5b to 0a37aac (May 6, 2026 08:02)
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from f853475 to 2cca8b5 (May 6, 2026 08:02)
Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), feat, size/XXL
