feat: wire GMS checkpoint restore flow #8833
Draft
galletas1712 wants to merge 8 commits into schwinns/gms-checkpoint-admission-bypass
Summary
Stacked on #8829 (573746394b). This wires checkpoint/restore for GMS without enabling failover yet.
- `--load-format gms` for vLLM whenever GMS is enabled, including auto/manual checkpoint jobs.
- `snapshot-complete`/`restore-complete` until GMS tensor save/load is done.
- `/checkpoints/gms/<checkpoint-hash>/versions/<artifact-version>/device-*`, separate from the CRIU tree.
- `GMS_GPU_UUIDS`, `GMS_GPU_UUID_<index>`, or UUID-valued `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES`.
Deployment behavior
Intra-pod GMS
Checkpoint job:
- `gms-server` as a restartable init sidecar with a startup probe that waits for all per-device GMS sockets.
- `gms-saver` as a regular Job container. The saver waits on the GMS RO lock for committed weights, writes GMS tensors to the snapshot PVC, writes a pod-UID-scoped `gms-save-complete-*` sentinel, then exits so the Job can complete.
Restore pod:
- `gms-server` as a restartable init sidecar with the same socket-readiness startup probe.
- `gms-loader` as a regular sidecar container on the restore target pod. Because Kubernetes only starts regular containers after the init sidecar startup probe succeeds, the loader does not carry a separate socket wait loop.
- `gms-load-complete-*` sentinel, then stays alive with the pod.
- `restore-complete`.
Inter-pod GMS
Restore:
- `gms-loader` sidecar that mounts the snapshot PVC and loads into the real inter-pod GMS server at `/run/gms/shared`.
Checkpoint jobs still run with a local GMS server+saver pair so the capture pod can load through GMS and persist the tensors for restore.
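Both the saver and the loader above work against the GMS artifact tree on the snapshot PVC, which the Summary places at `/checkpoints/gms/<checkpoint-hash>/versions/<artifact-version>/device-*`. As a rough illustration of that layout (not the code in this PR), the sketch below builds those paths. The helper names `gms_artifact_dir` and `gms_device_dir` are hypothetical, and reading `device-*` as `device-<index>` is an assumption.

```python
from pathlib import PurePosixPath

# Root directory taken from the PR summary; everything else is illustrative.
GMS_CHECKPOINT_ROOT = PurePosixPath("/checkpoints/gms")


def gms_artifact_dir(checkpoint_hash: str, artifact_version: str) -> PurePosixPath:
    """Return the versioned GMS artifact directory for one checkpoint."""
    for name, value in (
        ("checkpoint_hash", checkpoint_hash),
        ("artifact_version", artifact_version),
    ):
        # Reject empty or path-like components so the result stays under the root.
        if not value or "/" in value or value in (".", ".."):
            raise ValueError(f"invalid {name}: {value!r}")
    return GMS_CHECKPOINT_ROOT / checkpoint_hash / "versions" / artifact_version


def gms_device_dir(
    checkpoint_hash: str, artifact_version: str, device_index: int
) -> PurePosixPath:
    """Return the per-device directory (assumes `device-*` means `device-<index>`)."""
    if device_index < 0:
        raise ValueError("device_index must be non-negative")
    return gms_artifact_dir(checkpoint_hash, artifact_version) / f"device-{device_index}"
```

Keeping the version as its own path component means a new artifact version can be written next to an old one without touching it, which fits the stated goal of keeping this tree separate from the CRIU tree.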
NVML / GPU UUID note
The GMS Python path now prefers UUIDs supplied by env before querying NVML. The operator cannot know the final per-pod DRA allocation at template render time, so the direct containerd/ResourceClaim-to-env injection should happen from the runtime/DRA side. Once that env exists, GMS will avoid the slow NVML call in the startup path.
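The env-first lookup described above might look roughly like the following. This is a minimal sketch, not the code in `lib/gpu_memory_service/common/utils.py`: the function name `gpu_uuids_from_env`, the comma delimiter for `GMS_GPU_UUIDS`, the precedence order (taken from the order the variables are listed in the Summary), and the rule that indexed variables must be contiguous from 0 are all assumptions. A `None` result stands for "fall back to NVML".

```python
import os
from typing import Mapping, Optional

# NVIDIA device UUID strings start with "GPU-" (or "MIG-" for MIG instances).
_UUID_PREFIXES = ("GPU-", "MIG-")


def _split_list(raw: str) -> list[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def gpu_uuids_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[list[str]]:
    """Return GPU UUIDs found in the environment, or None if NVML is needed."""
    env = os.environ if env is None else env

    # 1. Explicit list (assumed comma-separated).
    explicit = _split_list(env.get("GMS_GPU_UUIDS", ""))
    if explicit:
        return explicit

    # 2. Indexed variables GMS_GPU_UUID_0, GMS_GPU_UUID_1, ... (assumed contiguous).
    indexed: list[str] = []
    while True:
        value = env.get(f"GMS_GPU_UUID_{len(indexed)}", "").strip()
        if not value:
            break
        indexed.append(value)
    if indexed:
        return indexed

    # 3. Visibility variables, but only when every entry is a UUID rather than
    #    an ordinal such as "0,1" or a keyword such as "all".
    for name in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        entries = _split_list(env.get(name, ""))
        if entries and all(e.startswith(_UUID_PREFIXES) for e in entries):
            return entries

    # Nothing usable in the environment: the caller queries NVML instead.
    return None
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.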
Out of scope
Static review
Static review found service admission still blocking GMS, inter-pod restore loading a private server, stale restore sentinels, and intra-pod failover targeting. This patch addresses those by allowing non-failover admission, moving inter-pod loading to RoleGMS, making sentinel naming pod-scoped or explicitly shared, and keeping failover out of scope.
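To illustrate why pod-scoped sentinel names address the stale-sentinel finding above, here is a small sketch under stated assumptions. The exact file name format (`gms-<kind>-complete-<pod-uid>`), the polling approach, and the helper names are hypothetical; the PR only states that the sentinels are `gms-save-complete-*` / `gms-load-complete-*` and pod-UID-scoped.

```python
import time
from pathlib import Path


def sentinel_path(directory: Path, kind: str, pod_uid: str) -> Path:
    """Build a pod-scoped sentinel path; `kind` is "save" or "load"."""
    if kind not in ("save", "load"):
        raise ValueError(f"unknown sentinel kind: {kind!r}")
    if not pod_uid:
        raise ValueError("pod_uid must be non-empty")
    return directory / f"gms-{kind}-complete-{pod_uid}"


def write_sentinel(directory: Path, kind: str, pod_uid: str) -> Path:
    """Write the sentinel via a temp file and rename, so readers never see a partial file."""
    path = sentinel_path(directory, kind, pod_uid)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text("done\n")
    tmp.replace(path)  # rename within the same directory
    return path


def wait_for_sentinel(
    directory: Path, kind: str, pod_uid: str,
    timeout_s: float = 5.0, poll_s: float = 0.05,
) -> bool:
    """Poll until this pod's sentinel exists; return False on timeout."""
    path = sentinel_path(directory, kind, pod_uid)
    deadline = time.monotonic() + timeout_s
    while True:
        if path.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Because the waiter looks only for its own pod UID, a sentinel left behind by an earlier pod on the same volume does not satisfy the wait.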
Dev image
- `3d172262bef`: `nvcr.io/nvidian/dynamo-dev/schwinns:kubernetes-operator-3d172262bef-gms-init-server-20260430`
- `sha256:137c7b9a29bbea886d197ee84a168bd8d275465e7e7bd68d654e4008a0a25843`
- `3d172262bef` or newer, because `gms-saver` now exits in the image code path.
Tests
- `CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/dynamo ./internal/webhook/validation` from `deploy/operator`
- `CGO_ENABLED=0 go test -count=1 ./internal/checkpoint ./internal/gms` from `deploy/operator`
- `CGO_ENABLED=0 go test -count=1 -skip TestControllers ./internal/controller` from `deploy/operator`
- `python3 -m py_compile lib/gpu_memory_service/common/utils.py lib/gpu_memory_service/cli/snapshot/loader.py lib/gpu_memory_service/cli/snapshot/saver.py`
- `CGO_ENABLED=0 go test -count=1 ./protocol` from `deploy/snapshot`
- `GOOS=linux CGO_ENABLED=0 go test -c ./internal/controller -o /tmp/dynamo-snapshot-controller.test` from `deploy/snapshot`
- `uv run --project . --extra test pytest -q -c pyproject.toml tests/test_common_utils.py` from `lib/gpu_memory_service`
- `go test ./internal/controller` from `deploy/snapshot`
- `python -m py_compile components/src/dynamo/common/utils/snapshot.py`