feat(gms): add pluggable snapshot transfer backends #8900

Draft
galletas1712 wants to merge 5 commits into codex/interpod-gms-checkpoint-restore from schwinns/gms-nixl-pr8833

galletas1712 commented Apr 30, 2026

Summary

  • split GMS snapshot restore byte movement behind a TransferBackend interface while keeping allocation layout, metadata restore, and commit semantics in GMSStorageClient
  • keep the default CPU-staged restore path and add selectable aio, nixl-gds, cufile-gds, local-ssd-striped, and local-ssd-pinned backends
  • pass GMS transfer/backend tuning env vars from restore/checkpoint workloads into gms-loader / gms-saver, including local SSD roots and shard sizing
  • support separate GMS artifact storage from the snapshot control PVC while preserving snapshot-agent-visible sentinel semantics
  • reduce restore loader startup overhead by avoiding eager NIXL/PyTorch imports and discovering devices from checkpoint layout before falling back to NVML
  • add transfer timing logs and focused tests for the storage-client/backend boundary and backend transfer setup
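The backend split described above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: only the `TransferBackend` name and the `GMS_TRANSFER_BACKEND` variable come from this PR, while the `read_into` method, the `CpuStagedBackend` class, the `cpu-staged` registry key, and `select_backend` are hypothetical names chosen for illustration.

```python
import os
from typing import Callable, Dict, Mapping, Protocol


class TransferBackend(Protocol):
    """Hypothetical interface: copy bytes from a file range into a buffer."""

    def read_into(self, path: str, offset: int, dest: memoryview) -> int:
        ...


class CpuStagedBackend:
    """Stand-in for a default path: a plain buffered read into host memory."""

    def read_into(self, path: str, offset: int, dest: memoryview) -> int:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.readinto(dest)


# Registry of factories rather than instances. A factory is also a natural
# place to defer a heavy import until its backend is actually selected.
_BACKENDS: Dict[str, Callable[[], TransferBackend]] = {
    "cpu-staged": CpuStagedBackend,  # hypothetical key for the default path
}


def select_backend(env: Mapping[str, str] = os.environ) -> TransferBackend:
    """Pick a backend by name from the environment, defaulting to cpu-staged."""
    name = env.get("GMS_TRANSFER_BACKEND", "cpu-staged")
    try:
        factory = _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"unknown transfer backend {name!r}; known: {known}") from None
    return factory()
```

In a layout like this the caller keeps ownership of allocation and commit logic and only asks the selected backend to fill a destination buffer, which is the boundary the first summary bullet describes.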

Validation

  • PYTHONPATH=lib python3 -m py_compile lib/gpu_memory_service/common/utils.py lib/gpu_memory_service/snapshot/disk.py lib/gpu_memory_service/snapshot/transfer.py lib/gpu_memory_service/snapshot/storage_client.py lib/gpu_memory_service/cli/storage_runner.py lib/gpu_memory_service/cli/snapshot/loader.py lib/gpu_memory_service/cli/snapshot/saver.py
  • PYTHONPATH=/home/schwinns/dynamo-gms-nixl-pr8833/lib /home/schwinns/dynamo-pr8833/lib/gpu_memory_service/.venv/bin/python -m pytest -c /dev/null /home/schwinns/dynamo-gms-nixl-pr8833/lib/gpu_memory_service/tests/test_snapshot_transfer.py -q -o cache_dir=/tmp/pytest-cache-gms-pr8900 (8 passed)
  • go test ./internal/checkpoint ./internal/controller from deploy/operator

nscale benchmark notes

AIO smoke

Same-node nscale DRA node cluster-0967a26d-pool-14bee067-prctr-s2877, GMS_TRANSFER_BACKEND=aio, GMS_LOAD_WORKERS=8.

Artifacts: /home/schwinns/blog-post-bench/results/gms-aio-samenode-s2877-20260430T061030Z

| model | default preadv GMS load | AIO GMS load | result |
| --- | --- | --- | --- |
| Qwen3-0.6B | 0.530s | 0.700s | coherent, length-capped |
| Qwen2.5-72B | 20.210s | 19.590s | coherent, finish_reason=stop |

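The AIO 72B loader logs confirmed modes=odirect-libaio:41, so the run used native AIO rather than the buffered fallback. On Qwen2.5-72B, AIO is only about 3% faster than the default CPU-staged path (19.590s vs 20.210s), a small difference that is likely noise; on Qwen3-0.6B it is slower (0.700s vs 0.530s).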
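For readers unfamiliar with the direct-versus-buffered distinction the loader log reports, here is a rough sketch of the mode decision. It is an illustration under stated assumptions, not the PR's AIO backend: it uses synchronous `os.preadv` instead of libaio, and the function names and the `"odirect"` / `"buffered"` labels are hypothetical.

```python
import mmap
import os


def open_for_read(path):
    """Try O_DIRECT first; fall back to a buffered open if it is refused."""
    o_direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if o_direct:
        try:
            return os.open(path, os.O_RDONLY | o_direct), "odirect"
        except OSError:
            pass  # e.g. a filesystem that rejects O_DIRECT
    return os.open(path, os.O_RDONLY), "buffered"


def read_all(path, block=1 << 20):
    """Read a regular file in aligned blocks and report which mode was used."""
    fd, mode = open_for_read(path)
    buf = mmap.mmap(-1, block)  # anonymous mmap is page-aligned, as O_DIRECT needs
    out = bytearray()
    offset = 0
    try:
        while True:
            try:
                n = os.preadv(fd, [buf], offset)
            except OSError:
                if mode != "odirect":
                    raise
                # The direct read was rejected: reopen buffered and retry.
                new_fd = os.open(path, os.O_RDONLY)
                os.close(fd)
                fd, mode = new_fd, "buffered"
                continue
            out += buf[:n]
            offset += n
            if n < block:  # a short read on a regular file means end of file
                break
        return bytes(out), mode
    finally:
        buf.close()
        os.close(fd)
```

Returning the mode alongside the data is what makes a check like the one above possible: a benchmark can confirm that the direct path was actually exercised instead of silently measuring the fallback.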

Local SSD pinned matrix

Same-node nscale DRA node cluster-0967a26d-pool-14bee067-prctr-l9nsv, GMS_TRANSFER_BACKEND=local-ssd-pinned, GMS_{SAVE,LOAD}_WORKERS=8, 1 GiB shards, /mnt/nvme2..9/gms.

Artifacts: /home/schwinns/blog-post-bench/results/gms-local-ssd-pinned-posix-full-l9nsv-20260503T010828Z/
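To make the shard and root settings concrete, the sketch below lays out fixed-size shards round-robin across several SSD roots. It is a hypothetical layout, not necessarily the one the local-ssd-pinned backend uses: `Shard`, `plan_shards`, and the round-robin policy are assumptions, and it reads `/mnt/nvme2..9/gms` as eight roots, `/mnt/nvme2/gms` through `/mnt/nvme9/gms`.

```python
from dataclasses import dataclass
from typing import List, Sequence

GIB = 1 << 30


@dataclass(frozen=True)
class Shard:
    index: int   # position of the shard in the snapshot byte stream
    root: str    # SSD root directory that would hold this shard
    offset: int  # starting byte offset within the snapshot
    length: int  # shard size in bytes (the last shard may be shorter)


def plan_shards(total_bytes: int, roots: Sequence[str], shard_bytes: int = GIB) -> List[Shard]:
    """Split total_bytes into fixed-size shards assigned round-robin to roots."""
    if total_bytes < 0 or shard_bytes <= 0 or not roots:
        raise ValueError("need total_bytes >= 0, shard_bytes > 0, and at least one root")
    shards: List[Shard] = []
    offset = 0
    while offset < total_bytes:
        length = min(shard_bytes, total_bytes - offset)
        index = len(shards)
        shards.append(Shard(index, roots[index % len(roots)], offset, length))
        offset += length
    return shards


# Assumed expansion of /mnt/nvme2..9/gms into eight roots.
ROOTS = [f"/mnt/nvme{i}/gms" for i in range(2, 10)]
```

With eight roots and eight workers, a plan like this lets each worker stream from a different device at the same time, so aggregate bandwidth can exceed what a single drive delivers.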

| model | GMS load | Phase B | Phase B BW | result |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 0.740s | 0.243s | 4.84 GiB/s | coherent, stop |
| Qwen3-8B | 1.270s | 0.638s | 24.01 GiB/s | coherent, stop |
| Qwen3-14B | 1.590s | 0.987s | 27.95 GiB/s | coherent, stop |
| Qwen3-32B | 2.450s | 1.729s | 35.34 GiB/s | coherent, stop |
| Qwen2.5-72B | 4.690s | 3.949s | 34.34 GiB/s | coherent, stop |
| GPT-OSS-120B | 2.980s | 2.254s | 32.86 GiB/s | coherent, stop |
| Llama 3.3 70B FP8 | 2.680s | 1.966s | 34.50 GiB/s | coherent, stop |
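Two derived quantities help when reading these rows. The sketch below assumes that Phase B BW is bytes moved divided by Phase B time (the PR does not define it here), so the product of the two columns estimates the payload size, and it treats GMS load minus Phase B as the time spent outside the bulk transfer. The `derive` helper is hypothetical.

```python
# (model, GMS load s, Phase B s, Phase B BW GiB/s), copied from the matrix above.
ROWS = [
    ("Qwen3-0.6B", 0.740, 0.243, 4.84),
    ("Qwen3-8B", 1.270, 0.638, 24.01),
    ("Qwen3-14B", 1.590, 0.987, 27.95),
    ("Qwen3-32B", 2.450, 1.729, 35.34),
    ("Qwen2.5-72B", 4.690, 3.949, 34.34),
    ("GPT-OSS-120B", 2.980, 2.254, 32.86),
    ("Llama 3.3 70B FP8", 2.680, 1.966, 34.50),
]


def derive(rows):
    """Return {model: (implied payload in GiB, non-Phase-B seconds)}."""
    out = {}
    for model, load_s, phase_b_s, bw_gib_s in rows:
        implied_gib = phase_b_s * bw_gib_s  # assumes BW = bytes / Phase B time
        other_s = load_s - phase_b_s        # load time outside Phase B
        out[model] = (implied_gib, other_s)
    return out


if __name__ == "__main__":
    for model, (gib, other) in derive(ROWS).items():
        print(f"{model:20s} payload ~{gib:7.2f} GiB   outside Phase B {other:.3f}s")
```

Under that assumption the Qwen2.5-72B row implies roughly 135.6 GiB moved in Phase B, and the time outside Phase B stays between about 0.50s and 0.74s across all seven models, so for the smallest models the fixed portion dominates total load time.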

The earlier 7.52s Qwen2.5-72B local-ssd-pinned result did not reproduce and appears to have been noise: rerunning the same code path gave 4.93s, the previous image gave 5.14s, and the final posix-buffer image gave 4.69s.

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions (bot) added the deployment::k8s and feat labels on Apr 30, 2026
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from 744fd74 to 39de4e6 on April 30, 2026 06:26
github-actions (bot) added the documentation label on May 2, 2026
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from b938488 to c6d01bd on May 3, 2026 04:54
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 1aaefc6 to e855c4e on May 5, 2026 21:15
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from c6d01bd to 0a4fcf2 on May 5, 2026 21:15
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from e855c4e to 7bb557a on May 5, 2026 23:45
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from 0a4fcf2 to b7f08cd on May 5, 2026 23:45
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7bb557a to 2443fa4 on May 5, 2026 23:48
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch 2 times, most recently from 7dda089 to bd786d9 on May 5, 2026 23:52
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch 2 times, most recently from a2d0267 to 7e1e56a on May 5, 2026 23:54
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from bd786d9 to 1d16b6e on May 5, 2026 23:54
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from 7e1e56a to b7c308f on May 6, 2026 07:33
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from 1d16b6e to ded951d on May 6, 2026 07:33
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from b7c308f to f853475 on May 6, 2026 07:54
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from ded951d to 838e62e on May 6, 2026 07:54
galletas1712 force-pushed the codex/interpod-gms-checkpoint-restore branch from f853475 to 2cca8b5 on May 6, 2026 08:02
galletas1712 force-pushed the schwinns/gms-nixl-pr8833 branch from 838e62e to 81a5db7 on May 6, 2026 08:02
Labels

deployment::k8s, documentation, feat, size/XXL
