fix(omnidreams): robust native build locking + memory-aware nvcc fanout by jmccaffrey-nv · Pull Request #332 · NVIDIA/flashdreams

jmccaffrey-nv · 2026-06-12T21:54:42Z

The single-view native extension loader stalled indefinitely when a prior build was interrupted (Ctrl-C, OOM, machine crash). PyTorch's FileBaton polls a leftover `<build_dir>/lock` forever because nothing ever reclaims an orphaned lock. Compounding this, MAX_JOBS was capped only by CPU count, so on memory-constrained hosts many parallel nvcc processes exhaust RAM+swap and freeze the machine -- which is exactly how the orphaned locks get left behind in the first place.

- Add a non-blocking flock build guard around cpp_extension.load. The kernel auto-releases an flock when its holder dies, so an interrupted build never wedges the next run. While we hold the guard, any leftover `lock`/`.ninja_lock` is provably stale and is reclaimed (logged on deletion). If a *live* process holds the guard we fail fast with NativeBuildBusyError and a clear message instead of polling. No-op on non-POSIX (Windows) -- preserves prior behavior.
- Cap compile parallelism by host RAM as well as CPU: min(cpu_count, 8, RAM_GB // per_job). per_job defaults to 8 GiB (realistic for the -O3 CUTLASS/SageAttention TUs) and is tunable via OMNIDREAMS_SINGLEVIEW_NATIVE_MEM_PER_JOB_GB. Explicit MAX_JOBS / max_jobs= still win.
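
For readers who want the arithmetic spelled out, here is a minimal sketch of the RAM-aware cap described in the second bullet. It is not the code in this PR: the function names `resolve_max_jobs` and `host_ram_gb`, the floor of one job when RAM is below `per_job`, and the choice to let the `max_jobs=` argument take precedence over the `MAX_JOBS` environment variable are all assumptions made for illustration.

```python
import os

# Environment variable named in the PR description.
ENV_MEM_PER_JOB = "OMNIDREAMS_SINGLEVIEW_NATIVE_MEM_PER_JOB_GB"


def host_ram_gb():
    """Best-effort physical RAM in GiB; returns None where sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, ValueError, OSError):
        return None


def resolve_max_jobs(cpu_count, ram_gb, environ=None, max_jobs=None,
                     hard_cap=8, default_per_job_gb=8.0):
    """Return the number of parallel compile jobs to allow.

    Implements min(cpu_count, 8, RAM_GB // per_job) from the description.
    Assumptions (not confirmed by the PR): explicit max_jobs beats the
    MAX_JOBS env var, and the result never drops below 1.
    """
    env = os.environ if environ is None else environ

    # Explicit overrides win over the computed cap.
    if max_jobs is not None:
        return max(1, int(max_jobs))
    if env.get("MAX_JOBS"):
        return max(1, int(env["MAX_JOBS"]))

    per_job = float(env.get(ENV_MEM_PER_JOB, default_per_job_gb))
    by_ram = int(ram_gb // per_job)
    return max(1, min(cpu_count, hard_cap, by_ram))
```

With the defaults, a 32-core host with 16 GiB of RAM gets `16 // 8 = 2` jobs rather than 8, which is the case the PR is aimed at.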

Adds 7 tests; full `pytest -m ci_cpu` is green (508 passed).
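
The build guard from the first bullet can be sketched as follows. This is an illustration of the technique (a non-blocking `flock` that the kernel drops when the holder exits), not the implementation in this PR: the guard file name `.build_guard`, the `build_guard` context manager, the error message, and the use of `logging` are hypothetical. Only the `lock` and `.ninja_lock` file names and the `NativeBuildBusyError` name come from the description.

```python
import contextlib
import logging
import os

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows: the guard becomes a no-op
    fcntl = None

log = logging.getLogger(__name__)

# Lock files named in the PR description as reclaimable while the guard is held.
STALE_LOCK_NAMES = ("lock", ".ninja_lock")


class NativeBuildBusyError(RuntimeError):
    """Raised when another live process already holds the build guard."""


@contextlib.contextmanager
def build_guard(build_dir, guard_name=".build_guard"):
    """Hold an exclusive, non-blocking flock on build_dir for the build's duration.

    Yields the list of stale lock files that were removed.
    """
    if fcntl is None:
        yield []  # non-POSIX: preserve prior behavior
        return

    os.makedirs(build_dir, exist_ok=True)
    guard_path = os.path.join(build_dir, guard_name)
    fd = os.open(guard_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes this fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise NativeBuildBusyError(
                f"another process is building in {build_dir}; retry when it finishes"
            ) from exc

        # We hold the guard, so any leftover lock file has no live owner.
        reclaimed = []
        for name in STALE_LOCK_NAMES:
            path = os.path.join(build_dir, name)
            if os.path.exists(path):
                os.remove(path)
                log.warning("removed stale build lock %s", path)
                reclaimed.append(path)
        yield reclaimed
    finally:
        # Closing the descriptor releases the flock; the kernel does the
        # same automatically if the process dies mid-build.
        os.close(fd)
```

A caller would wrap the extension load in `with build_guard(build_dir): ...`, so that a crashed build leaves nothing behind that the next run cannot reclaim.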

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

copy-pr-bot · 2026-06-12T21:54:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jmccaffrey-nv wants to merge 1 commit into main from dev/jmccaffrey/native-build-flock-guard
