feat(gms): add pluggable snapshot transfer backends #8900
Draft
galletas1712 wants to merge 5 commits into codex/interpod-gms-checkpoint-restore from
Signed-off-by: Schwinn Saereesitthipitak <[email protected]>
Summary
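As a rough picture of the split this summary describes, here is a minimal sketch of a pluggable transfer backend selected by name. Only the `TransferBackend` name and the `GMS_TRANSFER_BACKEND` variable come from this PR; the method names, the registry, and the `buffered-demo` backend are hypothetical and are not the PR's actual API.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional


class TransferBackend(ABC):
    """Hypothetical interface: only moves bytes to and from storage.

    Allocation layout, metadata restore, and commit semantics would stay
    in the caller (GMSStorageClient in the PR), not in the backend.
    """

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        ...

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


# Hypothetical registry mapping backend names to factories.
_BACKENDS: Dict[str, Callable[[], TransferBackend]] = {}


def register(name: str):
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register("buffered-demo")  # illustrative name, not one of the PR's backends
class BufferedFileBackend(TransferBackend):
    def write(self, path: str, data: bytes) -> None:
        with open(path, "wb") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()


def make_backend(name: Optional[str] = None) -> TransferBackend:
    # Falls back to the environment variable named in the benchmark notes.
    name = name or os.environ.get("GMS_TRANSFER_BACKEND", "buffered-demo")
    if name not in _BACKENDS:
        raise ValueError(f"unknown transfer backend: {name!r}")
    return _BACKENDS[name]()


if __name__ == "__main__":
    backend = make_backend("buffered-demo")
    with tempfile.TemporaryDirectory() as root:
        shard = os.path.join(root, "shard-0")
        backend.write(shard, b"snapshot bytes")
        print(backend.read(shard))
```

Keeping the interface this narrow is what would let a new backend be added without touching layout or commit logic.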
- `TransferBackend` interface while keeping allocation layout, metadata restore, and commit semantics in `GMSStorageClient`
- `aio`, `nixl-gds`, `cufile-gds`, `local-ssd-striped`, and `local-ssd-pinned` backends
- `gms-loader`/`gms-saver`, including local SSD roots and shard sizing

Validation
- `PYTHONPATH=lib python3 -m py_compile lib/gpu_memory_service/common/utils.py lib/gpu_memory_service/snapshot/disk.py lib/gpu_memory_service/snapshot/transfer.py lib/gpu_memory_service/snapshot/storage_client.py lib/gpu_memory_service/cli/storage_runner.py lib/gpu_memory_service/cli/snapshot/loader.py lib/gpu_memory_service/cli/snapshot/saver.py`
- `PYTHONPATH=/home/schwinns/dynamo-gms-nixl-pr8833/lib /home/schwinns/dynamo-pr8833/lib/gpu_memory_service/.venv/bin/python -m pytest -c /dev/null /home/schwinns/dynamo-gms-nixl-pr8833/lib/gpu_memory_service/tests/test_snapshot_transfer.py -q -o cache_dir=/tmp/pytest-cache-gms-pr8900` (8 passed)
- `go test ./internal/checkpoint ./internal/controller` from `deploy/operator`

nscale benchmark notes
AIO smoke
Same-node nscale DRA node `cluster-0967a26d-pool-14bee067-prctr-s2877`, `GMS_TRANSFER_BACKEND=aio`, `GMS_LOAD_WORKERS=8`.

Artifacts: `/home/schwinns/blog-post-bench/results/gms-aio-samenode-s2877-20260430T061030Z`

Recorded values (row and column labels missing): `0.530s`, `0.700s`, `20.210s`, `19.590s`, `finish_reason=stop`

The AIO 72B loader logs confirmed `modes=odirect-libaio:41`, so the run used native AIO rather than the buffered fallback. The result is only a small, noisy improvement over the default CPU-staged path.

Local SSD pinned matrix
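Same-node nscale DRA node `cluster-0967a26d-pool-14bee067-prctr-l9nsv`, `GMS_TRANSFER_BACKEND=local-ssd-pinned`, `GMS_{SAVE,LOAD}_WORKERS=8`, 1 GiB shards, `/mnt/nvme2..9/gms`.

Artifacts: `/home/schwinns/blog-post-bench/results/gms-local-ssd-pinned-posix-full-l9nsv-20260503T010828Z/`

Row labels and the original column headers are missing; the headers below are placeholders.

| Timing 1 | Timing 2 | Throughput | Finish reason |
| --- | --- | --- | --- |
| `0.740s` | `0.243s` | `4.84 GiB/s` | `stop` |
| `1.270s` | `0.638s` | `24.01 GiB/s` | `stop` |
| `1.590s` | `0.987s` | `27.95 GiB/s` | `stop` |
| `2.450s` | `1.729s` | `35.34 GiB/s` | `stop` |
| `4.690s` | `3.949s` | `34.34 GiB/s` | `stop` |
| `2.980s` | `2.254s` | `32.86 GiB/s` | `stop` |
| `2.680s` | `1.966s` | `34.50 GiB/s` | `stop` |

The earlier Qwen2.5-72B `7.52s` local-SSD-pinned row did not reproduce and appears to have been noise: rerunning the same code path gave `4.93s`, the previous image gave `5.14s`, and the final posix-buffer image gave `4.69s`.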
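Each row of the local-SSD-pinned results pairs two timings with a throughput. As a sanity check when reading them, the sketch below multiplies a duration by a rate to get the implied amount of data moved. It assumes, without confirmation from the PR, that the throughput figure was computed from the second (shorter) timing in each row; the helper names are illustrative.

```python
# Rows copied from the local-SSD-pinned results:
# (first timing in s, second timing in s, throughput in GiB/s).
ROWS = [
    (0.740, 0.243, 4.84),
    (1.270, 0.638, 24.01),
    (1.590, 0.987, 27.95),
    (2.450, 1.729, 35.34),
    (4.690, 3.949, 34.34),
    (2.980, 2.254, 32.86),
    (2.680, 1.966, 34.50),
]


def implied_gib(seconds: float, gib_per_s: float) -> float:
    """Data moved (GiB) implied by a duration and a throughput."""
    return seconds * gib_per_s


def implied_sizes(rows):
    # Assumption: the throughput pairs with the second timing column.
    return [round(implied_gib(t2, rate), 2) for _, t2, rate in rows]


if __name__ == "__main__":
    for (t1, t2, rate), size in zip(ROWS, implied_sizes(ROWS)):
        print(f"{t1:.3f}s  {t2:.3f}s  {rate:5.2f} GiB/s  ->  ~{size:.2f} GiB")
```

If the implied sizes line up with the known snapshot sizes of the benchmarked models, that supports the pairing. If they do not, the throughput was likely derived from a different interval.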