fix(omnidreams): robust native build locking + memory-aware nvcc fanout #332

Draft
jmccaffrey-nv wants to merge 1 commit into main from dev/jmccaffrey/native-build-flock-guard

@jmccaffrey-nv

Collaborator

The single-view native extension loader stalled indefinitely when a prior
build was interrupted (Ctrl-C, OOM, machine crash). PyTorch's FileBaton
polls a leftover `<build_dir>/lock` forever because nothing ever reclaims
an orphaned lock. Compounding this, MAX_JOBS was capped only by CPU count,
so on memory-constrained hosts many parallel nvcc processes could exhaust
RAM and swap and freeze the machine -- which is exactly how the orphaned
locks get left behind in the first place.

- Add a non-blocking flock build guard around `cpp_extension.load`. The kernel
  auto-releases an flock when its holder dies, so an interrupted build never
  wedges the next run. While we hold the guard, any leftover `lock` or
  `.ninja_lock` is provably stale and is reclaimed (logged on deletion). If a
  *live* process holds the guard, we fail fast with `NativeBuildBusyError` and
  a clear message instead of polling. No-op on non-POSIX (Windows) -- preserves
  prior behavior.
- Cap compile parallelism by host RAM as well as CPU:
  `min(cpu_count, 8, RAM_GB // per_job)`. `per_job` defaults to 8 GiB
  (realistic for the -O3 CUTLASS/SageAttention TUs) and is tunable via
  `OMNIDREAMS_SINGLEVIEW_NATIVE_MEM_PER_JOB_GB`. An explicit `MAX_JOBS` or
  `max_jobs=` still wins.
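
For reviewers, the job cap described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper names `capped_jobs` and `resolve_jobs` are hypothetical, clamping the result to at least 1 is an assumption (the formula alone yields 0 when RAM is below `per_job`), and the relative precedence of the `max_jobs=` argument over the `MAX_JOBS` environment variable is also assumed.

```python
import os


def capped_jobs(cpu_count, ram_gb, per_job_gb=8, hard_cap=8):
    """min(cpu_count, hard_cap, ram_gb // per_job_gb).

    Assumption: clamp to at least 1 so a small host still compiles serially.
    """
    by_memory = int(ram_gb // per_job_gb)
    return max(1, min(cpu_count, hard_cap, by_memory))


def resolve_jobs(cpu_count, ram_gb, env=None, max_jobs=None):
    """Explicit settings win; otherwise fall back to the memory-aware cap."""
    env = os.environ if env is None else env
    if max_jobs is not None:  # explicit keyword argument (assumed to rank first)
        return int(max_jobs)
    if env.get("MAX_JOBS"):  # explicit environment override
        return int(env["MAX_JOBS"])
    per_job = float(env.get("OMNIDREAMS_SINGLEVIEW_NATIVE_MEM_PER_JOB_GB", 8))
    return capped_jobs(cpu_count, ram_gb, per_job)
```

With these assumptions, a 32-core host with 24 GB of RAM gets 3 jobs rather than 8, which is the case the cap targets.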

Adds 7 tests; full `pytest -m ci_cpu` is green (508 passed).
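
A minimal sketch of the flock guard idea from the first bullet, for readers unfamiliar with the pattern. It is not the implementation in this PR: the `build_guard` name, the `.build_guard` file name, and the return value are hypothetical, and the real code also logs each reclaimed file. Only `NativeBuildBusyError`, `lock`, and `.ninja_lock` come from the description.

```python
import contextlib
import errno
import os

try:
    import fcntl  # POSIX only
except ImportError:
    fcntl = None


class NativeBuildBusyError(RuntimeError):
    """Raised when another live process holds the build guard."""


STALE_LOCK_NAMES = ("lock", ".ninja_lock")


@contextlib.contextmanager
def build_guard(build_dir, guard_name=".build_guard"):
    """Hold an exclusive, non-blocking flock for the duration of a build."""
    if fcntl is None:  # non-POSIX: no-op, prior behavior preserved
        yield []
        return
    os.makedirs(build_dir, exist_ok=True)
    fd = os.open(os.path.join(build_dir, guard_name), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                raise
            raise NativeBuildBusyError(
                f"another live process is building in {build_dir}"
            ) from exc
        # We hold the guard, so any leftover lock file has no live owner.
        reclaimed = []
        for name in STALE_LOCK_NAMES:
            path = os.path.join(build_dir, name)
            if os.path.exists(path):
                os.remove(path)
                reclaimed.append(name)
        yield reclaimed  # the caller would run cpp_extension.load here
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```

Because the kernel drops the flock when the descriptor is closed, including when the holder is killed, a crashed build leaves nothing that the next run has to wait on.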

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
