gpu-gate

Wait for a free GPU, claim it, set CUDA_VISIBLE_DEVICES, and run your command.

On a shared multi-GPU box without a cluster scheduler, starting a job usually means watching nvidia-smi, picking a card by hand, exporting the env var, and remembering to actually launch. gpu-gate is the small wait-pick-export-run loop that does this for you, with a cooperative lock so two invocations on the same host do not grab the same just-freed card. No daemon, no server, nothing to administer.

$ gpu-gate run --min-free-mb 8000 -- python train.py
gpu-gate: waiting for a free GPU ...
# ... blocks until a card has >= 8 GB free, then runs train.py with
# CUDA_VISIBLE_DEVICES set to the chosen index

Install

$ pip install gpu-gate                 # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/gpu-gate   # latest, available now

gpu-gate requires an NVIDIA driver at run time. The NVML binding (nvidia-ml-py) is pulled in automatically; the package still installs and imports on machines without a GPU, so it is safe to add to shared requirements.

Usage

Run a command on a free GPU

$ gpu-gate run -n 1 --min-free-mb 8000 -- python train.py --epochs 50

Everything after -- is the command. gpu-gate blocks until the requirements are met, claims the chosen device(s), exports CUDA_VISIBLE_DEVICES, and execs the command. Its own exit code is the command's exit code, so it drops cleanly into scripts and CI.

Common options:

Option           Meaning
-n, --count      Number of GPUs to claim (default 1)
--min-free-mb    Require at least this much free memory
--max-util       Skip cards busier than this percent
--only 0,1       Restrict the search to these indices
--exclude 2,3    Never pick these indices
--poll           Seconds between checks (default 5)
--timeout        Give up after N seconds (exit 124)

Just wait, then use the result yourself

$ export CUDA_VISIBLE_DEVICES=$(gpu-gate wait --min-free-mb 8000)

Inspect the current state

$ gpu-gate status
idx  name           free        total       util
  0  NVIDIA L40S    44211 MiB   46068 MiB    3%
  1  NVIDIA L40S      812 MiB   46068 MiB   97%

$ gpu-gate status --json

Exit codes

Code   Meaning
0      Command ran and exited 0 (a non-zero exit code from the command is forwarded as-is)
2      Bad invocation (for example, no command after --)
3      Requirements could never be met
4      Could not read GPU state (no driver / NVML error)
124    Timed out waiting for a GPU
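
For scripts that wrap gpu-gate, the codes above can be turned into readable messages. The following is a minimal sketch, not part of gpu-gate: the helper names describe_exit and run_gated are hypothetical, and only flags documented in this README are used. Because the command's own exit code is forwarded, a command that itself exits with 2, 3, 4, or 124 cannot be told apart from a gpu-gate failure this way.

```python
import subprocess

# Codes that gpu-gate itself reports, taken from the table above.
GATE_CODES = {
    2: "bad invocation",
    3: "requirements could never be met",
    4: "could not read GPU state",
    124: "timed out waiting for a GPU",
}


def describe_exit(code):
    """Map an exit code to a short message (hypothetical helper)."""
    if code == 0:
        return "command succeeded"
    # Anything not in the table is assumed to come from the wrapped command.
    return GATE_CODES.get(code, f"command exited with {code}")


def run_gated(command, min_free_mb=8000, timeout=600):
    """Run `command` (a list of strings) through gpu-gate and return (code, message)."""
    argv = [
        "gpu-gate", "run",
        "--min-free-mb", str(min_free_mb),
        "--timeout", str(timeout),
        "--", *command,
    ]
    code = subprocess.run(argv).returncode
    return code, describe_exit(code)
```

A caller might use run_gated(["python", "train.py"]) and retry later when the returned code is 124.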

How selection works

A GPU is eligible when it has enough free memory, is below the utilization ceiling, is not excluded, and is not currently locked by another gpu-gate caller. Eligible cards are ranked by most free memory, then lowest utilization, then index, and the top --count are chosen. The ordering is fully deterministic.
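
The filter-then-rank rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not gpu-gate's actual code: the Gpu record, its field names, and the select function are hypothetical, and a card whose utilization equals the ceiling is assumed to be eligible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    # Hypothetical snapshot of one device; field names are illustrative.
    index: int
    free_mb: int
    util_pct: int
    locked: bool = False


def select(gpus, count=1, min_free_mb=0, max_util=100, only=None, exclude=()):
    """Return the chosen indices, or None if fewer than `count` cards qualify."""
    eligible = [
        g for g in gpus
        if g.free_mb >= min_free_mb
        and g.util_pct <= max_util          # assumption: ceiling is inclusive
        and (only is None or g.index in only)
        and g.index not in exclude
        and not g.locked
    ]
    # Most free memory first, then lowest utilization, then lowest index.
    eligible.sort(key=lambda g: (-g.free_mb, g.util_pct, g.index))
    if len(eligible) < count:
        return None
    return [g.index for g in eligible[:count]]
```

Because the sort key ends with the device index, two cards with identical free memory and utilization are always ordered the same way, which is what makes the result deterministic.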

Locking

While a command runs, gpu-gate holds an advisory file lock per claimed device under $GPU_GATE_LOCK_DIR (a per-user directory by default). Other gpu-gate invocations skip locked devices, which avoids the classic race where two jobs both see the same card free at the same instant. The lock is advisory: it coordinates gpu-gate callers, not arbitrary CUDA programs.
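
One common way to build this kind of per-device advisory lock on Linux or macOS is a non-blocking flock on one file per GPU. The sketch below shows that general technique only; the helper names, the gpu<index>.lock file naming, and the use of flock are assumptions, not a description of gpu-gate's internals.

```python
import fcntl
import os


def try_claim(lock_dir, index):
    """Try to lock one device. Returns an open fd on success, None if already held."""
    os.makedirs(lock_dir, exist_ok=True)
    # Hypothetical naming scheme: one lock file per device index.
    path = os.path.join(lock_dir, f"gpu{index}.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Drop the lock and close the file descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The kernel drops a flock when the last descriptor referring to it is closed, including when the holding process dies, so a crashed job does not leave a stale claim behind. As with any advisory lock, a program that never calls try_claim is not stopped from using the card.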

License

MIT. See LICENSE.
