test(omnidreams): same-seed bitwise-reproducibility GPU test + nightly CI#324
jmccaffrey-nv wants to merge 1 commit into
…ghtly CI

Add a GPU test that runs the distilled omnidreams runner twice at the same seed under PyTorch's strict-determinism flags (torch.use_deterministic_algorithms(True, warn_only=True), CUBLAS_WORKSPACE_CONFIG=:4096:8, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) and asserts the two output MP4s are byte-identical (sha256). Unlike a golden-hash regression, it compares the two runs against each other, so it needs no committed digest and stays green across toolchain/hardware changes.

The test carries the ci_gpu tier marker but skips unless OMNIDREAMS_REPRO_RUN is set, so the per-PR `pytest -m ci_gpu` job collects-and-skips it instantly. A new nightly workflow (.github/workflows/determinism.yml, schedule + workflow_dispatch) sets OMNIDREAMS_REPRO_RUN=1 to actually run it on the GPU runner.

Verified passing on an H100 (slurm, cu130 container): two seed=1 rollouts produced an identical MP4 (1 passed in 384.73s).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Greptile Summary
Adds a nightly GPU job that verifies bitwise reproducibility of the OmniDreams distilled runner by running it twice at the same seed and asserting sha256 equality of the two output MP4s; no committed golden hash is needed.
Confidence Score: 4/5
Safe to merge: the test logic and subprocess isolation are sound; the only issues are a stale comment in the workflow and a floating action ref.
The test correctly isolates each rollout in a fresh subprocess, sets all determinism knobs before the first CUDA context, and compares sha256 digests with clear failure diagnostics. The workflow header comment (.github/workflows/determinism.yml) says the test is marked `manual`, which is stale.
Sequence Diagram
sequenceDiagram
participant CI as Nightly CI (determinism.yml)
participant Test as test_same_seed_is_bitwise_reproducible
participant SubA as Subprocess A (python -c bootstrap)
participant SubB as Subprocess B (python -c bootstrap)
participant GPU as GPU / CUDA
CI->>Test: "pytest (OMNIDREAMS_REPRO_RUN=1, seed=1)"
Test->>SubA: spawn with CUBLAS_WORKSPACE_CONFIG + PYTORCH_CUDA_ALLOC_CONF
SubA->>SubA: os.environ.setdefault CUBLAS_WORKSPACE_CONFIG
SubA->>SubA: "torch.use_deterministic_algorithms(True, warn_only=True)"
SubA->>GPU: "run distilled runner (seed=1, total_blocks=4)"
GPU-->>SubA: inference output
SubA-->>Test: "rc=0, run_a/recipe.mp4"
Test->>SubB: "spawn identical env (seed=1)"
SubB->>SubB: same determinism bootstrap
SubB->>GPU: "run distilled runner (seed=1, total_blocks=4)"
GPU-->>SubB: inference output
SubB-->>Test: "rc=0, run_b/recipe.mp4"
Test->>Test: "sha256(mp4_a) == sha256(mp4_b)?"
alt byte-identical
Test-->>CI: PASS
else digest mismatch
Test-->>CI: FAIL (diff hint + tmp paths retained)
end
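The spawn step in the diagram could be sketched roughly as below. This is a minimal illustration, not the code from the PR: `determinism_env`, `run_in_fresh_process`, and the `BOOTSTRAP` string are hypothetical names, and the actual runner invocation is left out because it does not appear on this page. The environment values and the `torch.use_deterministic_algorithms(True, warn_only=True)` call are the ones quoted in the PR description.

```python
import os
import subprocess
import sys

# Environment values quoted in the PR description.
DETERMINISM_ENV = {
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
}

# Hypothetical bootstrap passed to `python -c`. It sets the cuBLAS knob and
# enables deterministic algorithms before any CUDA work starts. The call that
# launches the distilled runner is omitted (not shown on this page).
BOOTSTRAP = """
import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
import torch
torch.use_deterministic_algorithms(True, warn_only=True)
# ... invoke the distilled runner here ...
"""


def determinism_env(base=None):
    """Return a copy of the environment with the determinism knobs applied."""
    env = dict(os.environ if base is None else base)
    # Assumption: the parent overrides any inherited value so both runs match.
    env.update(DETERMINISM_ENV)
    return env


def run_in_fresh_process(code, env):
    """Run `code` in a new interpreter so no CUDA context is inherited."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )
```

Calling `run_in_fresh_process(BOOTSTRAP, determinism_env())` twice gives each rollout its own interpreter, which is what keeps the flags ahead of the first CUDA context.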
Reviews (1): Last reviewed commit: "test(omnidreams): add same-seed bitwise-..."
# The same-seed reproducibility test is marked `manual`: it needs a real GPU,
# downloads the distilled checkpoint + an example HDMap clip from HF, and runs
# two full rollouts, so it is too heavy for the per-PR `ci_gpu` job. Instead it
# runs here on a nightly schedule (and on demand via the Actions "Run workflow"
# button). Scheduled runs execute on the default branch only.
The header comment says the test is marked `manual`, but the test file uses `pytestmark = pytest.mark.ci_gpu` with an env-gate skip, not `manual`. The PR description even explains why `manual` was deliberately avoided (the pytest-manual-marker plugin would xfail it at setup). A reader consulting only this file would have a wrong mental model of why the nightly job is needed and how the per-PR gate works.
Suggested change
- # The same-seed reproducibility test is marked `manual`: it needs a real GPU,
- # downloads the distilled checkpoint + an example HDMap clip from HF, and runs
- # two full rollouts, so it is too heavy for the per-PR `ci_gpu` job. Instead it
- # runs here on a nightly schedule (and on demand via the Actions "Run workflow"
- # button). Scheduled runs execute on the default branch only.
+ # The same-seed reproducibility test carries the `ci_gpu` marker but skips
+ # unless OMNIDREAMS_REPRO_RUN is set, so the per-PR `pytest -m ci_gpu` run
+ # collects it and skips it in milliseconds. This job sets that env var so the
+ # test actually executes. (The `manual` marker was deliberately avoided:
+ # pytest-manual-marker xfails every `manual` test at setup, so a `manual` test
+ # never executes in automation.) Scheduled runs target the default branch only.
rm -rf /var/lib/apt/lists/*

- name: Setup proxy cache
  uses: nv-gha-runners/setup-proxy-cache@main
nv-gha-runners/setup-proxy-cache@main pins to a floating branch head. Any future push to main in that repo changes what runs here without a diff in this file. For internal infra actions this is often intentional, but it's worth confirming whether pinning to a commit SHA or a version tag is feasible — if the action is updated in a breaking or unexpected way, this job will silently change behavior.
What
Adds a GPU test that pins bitwise reproducibility at the same seed for the OmniDreams distilled runner, plus a nightly CI job that runs it.
The test (`integrations/omnidreams/tests/test_omnidreams_same_seed_reproducibility.py`) runs the distilled runner (`omnidreams-sv-2steps-chunk2-loc6-lightvae-lighttae`) twice at the same seed and asserts the two output MP4s are byte-identical (sha256), under PyTorch's strict-determinism flags:
- `CUBLAS_WORKSPACE_CONFIG=:4096:8`
- `torch.use_deterministic_algorithms(True, warn_only=True)`
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

(flags set in a fresh subprocess before the first CUDA context, per the PyTorch reproducibility notes)
Unlike a golden-hash regression, it compares the two runs against each other, so there is no committed digest to maintain and it stays green across toolchain/hardware changes.
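For readers skimming the PR, the run-against-run comparison boils down to something like the sketch below. The helper names `sha256_of` and `assert_byte_identical` are hypothetical and chosen for illustration; the actual test file may structure this differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large MP4 is never loaded into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_byte_identical(path_a, path_b):
    """Compare the two runs against each other; no committed digest is involved."""
    digest_a, digest_b = sha256_of(path_a), sha256_of(path_b)
    assert digest_a == digest_b, (
        "same-seed outputs differ:\n"
        f"  {path_a}: {digest_a}\n"
        f"  {path_b}: {digest_b}"
    )
```

Because both digests come from the same toolchain and hardware, a driver or library upgrade that changes the output bytes for both runs still passes; only run-to-run nondeterminism fails.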
How it's gated
- Marked `ci_gpu` but skips unless `OMNIDREAMS_REPRO_RUN` is set, so the per-PR `pytest -m ci_gpu` job collects-and-skips it instantly.
- `.github/workflows/determinism.yml` (cron + `workflow_dispatch`) sets `OMNIDREAMS_REPRO_RUN=1` to actually run it on the GPU runner.

(We avoid the `manual` marker on purpose: `pytest-manual-marker` xfails every `manual` test at setup, so it would never execute as a CI guard.)

Verification
Ran on an H100 (slurm, cu130 container): two `seed=1` rollouts produced an identical MP4 (`1 passed in 384.73s`).

🤖 Generated with Claude Code
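As a rough illustration of the env gate described under "How it's gated", the marker plus skip could be wired up as follows. This is a sketch under assumptions: the `repro_enabled` helper and the use of `skipif` are illustrative, the test body is elided, and the real file may call `pytest.skip` inside the test instead. It also assumes `ci_gpu` is a marker registered in the repo's pytest configuration.

```python
import os

import pytest

REPRO_ENV_VAR = "OMNIDREAMS_REPRO_RUN"


def repro_enabled(environ=os.environ):
    """True when the gate variable is set to any non-empty value."""
    # Assumption: any non-empty value (including "0") counts as "set".
    return bool(environ.get(REPRO_ENV_VAR))


# Module-level tier marker, so `pytest -m ci_gpu` still collects this file.
pytestmark = pytest.mark.ci_gpu


@pytest.mark.skipif(
    not repro_enabled(),
    reason=f"set {REPRO_ENV_VAR}=1 to run the same-seed reproducibility test",
)
def test_same_seed_is_bitwise_reproducible(tmp_path):
    # The two rollouts and the digest comparison would go here.
    ...
```

With this shape the per-PR job pays only collection cost, while the nightly workflow flips the gate by exporting `OMNIDREAMS_REPRO_RUN=1`.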