fix: prefer NVML v2 memory info for inference setup #41

Open
Rohithmatham12 wants to merge 3 commits into NVIDIA:main from Rohithmatham12:fix-nvml-v2-memory-info

Conversation

@Rohithmatham12


Summary

  • try pynvml.nvmlDeviceGetMemoryInfo_v2() before the legacy v1 memory-info API in inference setup
  • keep compatibility with older pynvml builds by falling back to nvmlDeviceGetMemoryInfo() when v2 is unavailable
  • keep the existing torch/default memory fallback and ensure nvmlShutdown() runs even when NVML probing fails
  • add unit coverage for v2 success when v1 would raise NVMLError_NotSupported, plus the older-pynvml fallback path
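
The probing order described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in cosmos_framework/inference/args.py: the function name `probe_total_memory` and the idea of passing the NVML module in as a parameter are assumptions made so the sketch can run against a stand-in object, and returning None stands in for handing control to the existing torch/default fallback.

```python
from typing import Any, Optional


def probe_total_memory(nvml: Any, device_index: int = 0) -> Optional[int]:
    """Return total device memory in bytes, or None if NVML probing fails.

    `nvml` is a pynvml-like module object (hypothetical parameter, injected
    so the sketch does not require a GPU or a specific pynvml build).
    """
    # Use the module's NVMLError if it defines one, else catch Exception.
    nvml_error = getattr(nvml, "NVMLError", Exception)
    try:
        nvml.nvmlInit()
    except nvml_error:
        return None  # caller falls back to torch/default memory
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(device_index)
        # Prefer the v2 entry point when this build exposes it.
        get_v2 = getattr(nvml, "nvmlDeviceGetMemoryInfo_v2", None)
        if get_v2 is not None:
            return int(get_v2(handle).total)
        # Older builds: fall back to the legacy v1 call.
        return int(nvml.nvmlDeviceGetMemoryInfo(handle).total)
    except nvml_error:
        return None  # caller falls back to torch/default memory
    finally:
        # Runs on success and on failure once nvmlInit() has succeeded.
        nvml.nvmlShutdown()
```

In this sketch the v1 call is only reached when the v2 symbol is missing, so a platform where v1 raises NVMLError_NotSupported never touches it as long as v2 is present.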

Why
DGX Spark / GB10 platforms can report pynvml.NVMLError_NotSupported from the legacy v1 nvmlDeviceGetMemoryInfo() call during Cosmos3 inference setup. The v2 NVML memory-info API is the supported path there, while older environments still need the v1 fallback.

Testing

  • python3 -m py_compile cosmos_framework/inference/args.py cosmos_framework/inference/args_test.py
  • git diff --check

Not run locally:

  • python3 -m pytest cosmos_framework/inference/args_test.py -q, because pytest is not installed in this local environment
  • import smoke test, because this local environment is missing framework dependencies such as pydantic

Related to NVIDIA/cosmos#180

lfengad previously approved these changes Jun 15, 2026
@Rohithmatham12

Author

Pushed a test-only follow-up for the unittest failure. CI's pynvml build does not expose nvmlDeviceGetMemoryInfo_v2, so the regression test now monkeypatches that symbol with raising=False before exercising the v2-preferred path.

Local validation after the change:

  • python3 -m py_compile cosmos_framework/inference/args.py cosmos_framework/inference/args_test.py
  • git diff --check

I also tried the targeted pytest locally, but this environment does not have pytest installed.
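
For readers unfamiliar with patching a symbol that does not exist on the target, here is a rough stdlib analogue. It deliberately swaps in unittest.mock.patch.object(..., create=True) rather than pytest's monkeypatch.setattr(..., raising=False), so it runs without pytest; it is not the test from this PR. The module object `fake_pynvml` and the helper names are hypothetical.

```python
import types
from unittest import mock

# Stand-in for a pynvml build that lacks the v2 symbol (hypothetical object).
fake_pynvml = types.ModuleType("fake_pynvml")


def fake_memory_info_v2(handle):
    """Pretend v2 result exposing only the field the sketch reads."""
    return types.SimpleNamespace(total=128 * 1024**3)


def run_patched():
    # create=True allows patching an attribute that is not defined yet;
    # the attribute is removed again when the context manager exits.
    with mock.patch.object(
        fake_pynvml, "nvmlDeviceGetMemoryInfo_v2", fake_memory_info_v2, create=True
    ):
        return fake_pynvml.nvmlDeviceGetMemoryInfo_v2("handle").total
```

Without create=True (or raising=False in pytest), the patch itself fails with AttributeError on a build that does not define the symbol, which is the failure mode described above.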

@Rohithmatham12

Author

Sorry, the approval became stale because I pushed a one-line test-only follow-up to fix the failing unittest on CI. The implementation code is unchanged from the approved version.

2 participants