benchForge opportunistically soaks idle TU Delft DAIC RTX Pro 6000 GPUs
(96 GB, NVIDIA Blackwell, sm_120) to run a small, airtight, reproducible suite
of code/LLM benchmarks. It is a sibling of preCal and inherits its philosophy:
be a polite backfill citizen, decouple the expensive non-deterministic step from the
cheap deterministic one, and publish auditable artifacts rather than a leaderboard
screenshot.
The core idea is a STRICT GPU-generate / CPU-score decouple:
- GPU generate: a vLLM OpenAI-compatible server serves a pinned model from an
  apptainer SIF; thin OpenAI-client harnesses prompt it and write raw
  samples.jsonl. This is the only step that needs a GPU, and it is not bitwise-deterministic.
- CPU score: a CPU-only scoring SIF executes and grades the generated code inside a
  two-layer sandbox and writes results.jsonl. This is deterministic and re-runnable.
benchForge then publishes a per-sample results dataset and a deterministically
recomputed leaderboard to the Hugging Face Hub under the D4vidHuang namespace.
A first run is published — 6 models × up to 5 benchmark cells (27 cells), run on idle
DAIC Pro 6000s. Full table + caveats in RESULTS.md.
- Leaderboard: https://huggingface.co/datasets/D4vidHuang/benchforge-leaderboard
- Per-sample results: https://huggingface.co/datasets/D4vidHuang/benchforge-results
- Leaderboard Space: https://huggingface.co/spaces/D4vidHuang/benchforge-leaderboard-space
Benchmarks: HumanEval+, MBPP+ (EvalPlus), CRUXEval-I/-O, LiveCodeBench (release_v6, with a
contamination-controlled clean subset). Models: Qwen2.5-Coder 7B/14B/32B, Qwen3-Coder-30B-A3B
(Apache-2.0), plus non-OSI DeepSeek-Coder-V2-Lite-Instruct and StarCoder2-15B-Instruct
(license-tagged, osi_approved=False).
Where the working pipeline lives: the scripts that actually produced these results are in
daic/ (see daic/README.md). The benchforge/ package below is the original design scaffold; its cli execution internals are still NotImplementedError stubs. Reconciling the two (folding the daic/ runners into benchforge.cli) is tracked future work.
benchForge is authored on macOS, but its execution target is Linux + CUDA + SLURM on DAIC. The repo runs end-to-end without a cluster for everything that does not need a GPU (config loading, planning, metrics, provenance, CLI dispatch, all docs/yaml/sbatch/shell). Heavy GPU/harness internals are lazy-imported and, where they cannot be exercised off-cluster, are clearly marked stubs that raise
NotImplementedError with a precise docstring of the real implementation.
- Why this design
- Data flow
- The stage table
- Make targets
- The three Hugging Face repos
- Offline / online split
- Quickstart
- Honest limits
- Where to read next
A summary of the locked decisions that shape benchForge. See
DESIGN.md for the full rationale and tables.
- Serving: vLLM, one OpenAI-compatible /v1 endpoint per generate shard, run from an apptainer SIF. Harnesses are thin OpenAI clients; they never touch the engine directly.
  - Primary image nvcr.io/nvidia/vllm:25.09-py3 (sm_120, vLLM 0.10.1.1 / CUDA 13, DAIC-UNVERIFIED until smoke; stamp the resolved digest per row).
  - Fallback image vllm/vllm-openai:latest (stock, >= 0.13.x cu129/130).
  - dtype=bfloat16, quantization=none. Per-shard pins: --max-model-len 16384, --gpu-memory-utilization 0.90, --max-num-seqs 64, --seed <run.seed>.
- Harnesses: per-benchmark official harnesses pinned to a git commit SHA,
  frozen in a CPU-only scoring SIF, wrapped behind benchforge.harness.* adapters (generate -> samples, score -> results). lm-eval-harness was dropped.
- v1 default airtight benchmarks: humanevalplus (evalplus/humanevalplus, Apache-2.0), mbppplus (evalplus/mbppplus, Apache-2.0), livecodebench (livecodebench/code_generation_lite, pinned release_v6, per-model date window + contamination flag), cruxeval (cruxeval-org/cruxeval, MIT, scenarios cruxeval_i and cruxeval_o).
- Opt-in Tier-A-extended (only after a smoke offline-toolchain readiness gate): bigcodebench (generation hard-pinned to the vllm/openai backend, scoring hard-pinned to --execution local, gradio/e2b remote backends disabled), multipl-e, ds1000.
- Deferred out of v1: SWE-bench Verified (Docker-per-instance; v2 spike).
- Core models (Apache-2.0, pinned by HF revision SHA, all fit 96 GB bf16):
  Qwen/Qwen2.5-Coder-7B-Instruct, Qwen/Qwen2.5-Coder-14B-Instruct, Qwen/Qwen2.5-Coder-32B-Instruct, Qwen/Qwen3-Coder-30B-A3B-Instruct. Optional non-OSI (opt-in, hub_license-tagged, output-segregated): deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, bigcode/starcoder2-15b, meta-llama/Llama-3.1-8B-Instruct (gated -> stage on login node). Excluded: mistralai/Codestral-22B-v0.1 (MNPL), meta-llama/Llama-3.3-70B (bf16 > 96 GB).
- Sandbox: two-layer. OUTER = one CPU-only apptainer SIF per benchmark family
  (pinned python + harness@SHA), apptainer exec --net --network none, read-only binds, scratch-only writable tmp. INNER = harness-native per-sample multiprocessing.Process + resource.setrlimit (AS/DATA/STACK/CPU/NOFILE/NPROC) + a hard parent-side wall-clock kill (not signal.alarm). The python blacklist is a soft layer; apptainer + rlimits + --net none are the real boundary.
- Idle / preemption: pure backfill citizenship. GPU array is one GPU per task,
  --qos=$BENCHFORGE_GPU_QOS (default long), --nice=$BENCHFORGE_NICE (default 1000000), --time=$BENCHFORGE_GPU_TIME (default 04:00:00), --requeue, --signal=B:USR1@120, --open-mode=append, --array=0-(N-1)%FREE where FREE is sized at submit from the live idle Pro 6000 count minus compute.headroom_gpus (2). DAIC has live preemption disabled, so a short --time and backfill ordering are the only yield levers; monitor_backfill.sh measures a GPU-second floor.
- Determinism, reframed: vLLM batched inference is not bitwise-deterministic
  even at temperature=0. The guarantee is a reproducible PROTOCOL + auditable artifacts: scores (not tokens) are the reproducible quantity, and bootstrap CIs quantify the residual variance. Exactly one chat template is enforced (templating_mode chat|completions, smoke prompt-diff, chat_template_sha per row).
- Stats: greedy temperature=0.0 pass@1 headline across the matrix; pass@k (k in {5, 10}) only when n_samples >= k via the unbiased Chen et al. (2021) estimator (refuse when n < k); bootstrap-over-problems 95% CI on every aggregate.
- Artifact: three HF repos under D4vidHuang (see below). The leaderboard is recomputed from raw at publish, never a running tally.
- Budget stop: run.max_requeue_per_shard (5) + run.gpu_hour_budget (2000) ceiling; a chronically failing shard is marked failed and surfaced by resubmit_pending.sh.
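The Stats decision above can be illustrated with a short sketch. It assumes the standard combinatorial form of the unbiased pass@k estimator from Chen et al. (2021) and a plain percentile bootstrap over problems; the function names, the resample count, and the fixed seed are illustrative choices, not benchForge's actual code.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n < k:
        # Mirrors the "refuse when n < k" rule: no estimate without enough samples.
        raise ValueError(f"pass@{k} needs n >= k, got n={n}")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(per_problem, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean, resampling problems with replacement."""
    rng = random.Random(seed)  # fixed seed so the CI itself is reproducible
    m = len(per_problem)
    means = sorted(sum(rng.choices(per_problem, k=m)) / m for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    scores = [pass_at_k(10, c, 5) for c in (0, 1, 3, 10)]
    print(sum(scores) / len(scores), bootstrap_ci(scores))
```

Resampling whole problems (rather than individual samples) keeps the interval aligned with the unit the benchmark is averaged over.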
LOGIN / STAGING NODE (internet allowed)
┌──────────────────────────────────────────────────────────────────────────┐
│ stage pull weights+datasets -> $BENCHFORGE_HF_HOME (HF_HOME) │
│ pull-image pull vLLM SIF + scoring SIFs -> $BENCHFORGE_SIF_DIR │
│ plan enumerate model x benchmark x scenario x sample-window │
│ -> shards/manifest.jsonl (SINGLE SOURCE OF TRUTH) │
└──────────────────────────────────────────────────────────────────────────┘
│ manifest.jsonl + staged scratch (HF_HUB_OFFLINE=1)
▼
GPU COMPUTE NODE(s) (offline, backfill array)
┌──────────────────────────────────────────────────────────────────────────┐
│ generate (per gen shard) │
│ start_vllm_server(SIF) -> ONE /v1 endpoint ──► thin OpenAI harness │
│ writes gen/<shard_id>/samples.jsonl (append + fsync + atomic rename) │
│ on row-count reconcile vs manifest n -> manifests/generate-<id>.done │
└──────────────────────────────────────────────────────────────────────────┘
│ samples.jsonl + generate-<id>.done
▼
CPU COMPUTE NODE(s) (offline, distinct cpu_partition)
┌──────────────────────────────────────────────────────────────────────────┐
│ score (per score shard) REFUSES shards lacking generate-<id>.done │
│ OUTER apptainer SIF (--net none, ro binds) │
│ INNER multiprocessing.Process + setrlimit + hard wall-clock kill │
│ writes score/<shard_id>/results.jsonl (per-sample idempotent) │
└──────────────────────────────────────────────────────────────────────────┘
│ results.jsonl
▼
LOGIN / STAGING NODE (internet allowed)
┌──────────────────────────────────────────────────────────────────────────┐
│ aggregate pass@1 / pass@k (Chen et al.) + bootstrap CI (RECOMPUTED) │
│ publish assemble publish/ layout -> upload_large_folder to 3 repos │
└──────────────────────────────────────────────────────────────────────────┘
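The hand-off between generate and score in the diagram can be sketched as follows. This is a minimal illustration, assuming the scratch layout described in this README; the helper names and the marker's contents are hypothetical, and the real runners may reconcile differently.

```python
import os
from pathlib import Path


def count_rows(path: Path) -> int:
    """Count non-empty JSONL rows; a missing file counts as zero."""
    if not path.exists():
        return 0
    with path.open("r", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def done_marker(run_dir: Path, shard_id: str) -> Path:
    return run_dir / "manifests" / f"generate-{shard_id}.done"


def mark_generate_done(run_dir: Path, shard_id: str, expected_n: int) -> bool:
    """Write the done marker only if samples.jsonl matches the manifest's n."""
    rows = count_rows(run_dir / "gen" / shard_id / "samples.jsonl")
    if rows != expected_n:
        return False  # shard is incomplete; leave it pending for a requeue
    marker = done_marker(run_dir, shard_id)
    marker.parent.mkdir(parents=True, exist_ok=True)
    tmp = marker.with_name(marker.name + ".tmp")
    tmp.write_text(f"{rows}\n", encoding="utf-8")
    os.replace(tmp, marker)  # atomic rename within one filesystem
    return True


def require_generate_done(run_dir: Path, shard_id: str) -> None:
    """Score-side guard: refuse a shard that has no generate done marker."""
    if not done_marker(run_dir, shard_id).exists():
        raise RuntimeError(f"shard {shard_id}: generate-{shard_id}.done is missing")
```

Writing the marker through a temporary file and a rename means a scorer never observes a half-written marker.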
The run lives entirely under $BENCHFORGE_SCRATCH/<run.name>/:
$BENCHFORGE_SCRATCH/<run.name>/
shards/manifest.jsonl SINGLE SOURCE OF TRUTH; one JSON line per shard
manifests/<stage>-<shard_id>.done done marker; <stage> in {generate, score}
manifests/<stage>-<shard_id>.committed append-only resume cursor of committed keys
gen/<shard_id>/samples.jsonl GEN output (append + fsync + atomic rename)
score/<shard_id>/results.jsonl SCORE output (append-only, per-sample idempotent)
publish/benchforge-results/ assembled per-sample Parquet (Hive layout)
publish/benchforge-leaderboard/ assembled aggregate Parquet
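One way to read "append-only, per-sample idempotent" together with the .committed resume cursor is sketched below. The key format, the record fields, and the function names are assumptions made for illustration; the sketch covers append + fsync only and does not model the atomic-rename step.

```python
import json
import os
from pathlib import Path


def load_committed(cursor: Path) -> set[str]:
    """Read the set of already-committed sample keys from the resume cursor."""
    if not cursor.exists():
        return set()
    lines = cursor.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def _append_durably(path: Path, line: str) -> None:
    """Append one line and force it to disk before returning."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def commit_result(results: Path, cursor: Path, key: str, record: dict) -> bool:
    """Append a result row unless its key was committed by an earlier attempt."""
    if key in load_committed(cursor):
        return False  # a requeued task re-visiting this sample is a no-op
    _append_durably(results, json.dumps({"key": key, **record}, sort_keys=True))
    _append_durably(cursor, key)  # cursor is written last, after the row is durable
    return True
```

Because the row is written before the cursor, a crash between the two appends can leave one duplicate row after resume, so a reader under this scheme would still deduplicate by key.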
The CLI is python -m benchforge.cli <sub> [--config configs/X.yaml] [--shard-id N] [--resume] [--dry-run] [--force]. Subcommands, in run order:
| Stage | CLI subcommand | Where it runs | Internet | What it does |
|---|---|---|---|---|
| probe | (shell) | login + a probe job | n/a | scripts/daic_autodetect.sh + scripts/sandbox_probe.sh resolve the PLACEHOLDER_* gres/partition/constraint/QOS and the sandbox mode into ~/.benchforge.env. |
| stage | stage | login / staging | yes | Download pinned model revisions + dataset revisions into $BENCHFORGE_HF_HOME. Gated models staged here. |
| pull-image | pull-image | login / staging | yes | Pull the vLLM SIF (primary, then fallback) and the per-family CPU scoring SIFs into $BENCHFORGE_SIF_DIR; record digests. |
| plan | plan | login / staging | no | Deterministically enumerate model x benchmark x scenario x sample-window into shards/manifest.jsonl. |
| smoke | smoke | login + tiny GPU job | no | Tiny end-to-end check: image boots, one chat template, prompt-diff, offline-toolchain readiness gate (also gates opt-in Tier-A-extended). |
| generate | generate | GPU compute (array) | no | Per gen shard: start vLLM endpoint, run OpenAI-client harness, write samples.jsonl, write generate-<id>.done on row-count reconcile. |
| score | score | CPU compute (array) | no | Per score shard: execute/grade in the two-layer sandbox, write results.jsonl. Refuses shards without generate-<id>.done. |
| aggregate | aggregate | login / staging | no | Recompute pass@1 / pass@k + bootstrap CI from raw results.jsonl. |
| publish | publish | login / staging | yes | Assemble publish/ layout and upload_large_folder to the three HF repos. |
| status | status | anywhere | no | Report per-shard pending/running/done/failed from markers + manifest. |
| verify | verify | anywhere | no | Re-check artifact integrity (row counts, schema, done markers). |
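The plan row describes a deterministic enumeration; a minimal sketch of what that can look like follows. The shard field names, the fixed window size, and the input shapes are assumptions for illustration and are not the manifest schema documented in DESIGN.md.

```python
import json


def plan_shards(models, benchmarks, n_problems, window):
    """Enumerate model x benchmark x scenario x sample-window in a fixed order.

    benchmarks maps a benchmark name to its scenarios; n_problems maps a
    benchmark name to its problem count. Sorting every axis makes the output
    independent of the order in which the inputs were listed.
    """
    shards = []
    for model in sorted(models):
        for bench in sorted(benchmarks):
            for scenario in sorted(benchmarks[bench]):
                total = n_problems[bench]
                for start in range(0, total, window):
                    shards.append({
                        "shard_id": len(shards),
                        "model": model,
                        "benchmark": bench,
                        "scenario": scenario,
                        "start": start,
                        "n": min(window, total - start),
                    })
    return shards


def to_manifest_lines(shards):
    """One JSON object per line, with sorted keys so the file diffs cleanly."""
    return [json.dumps(s, sort_keys=True) for s in shards]
```

With a stable order and stable shard ids, rerunning the planner on the same config reproduces the same manifest byte for byte, which is what lets the done markers be trusted across requeues.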
The Makefile wraps the CLI and reads config via scripts/cfg.py. Targets mirror the
stage table:
make stage # download staged weights + datasets (login, online)
make pull-image # pull vLLM + scoring SIFs (login, online)
make plan # write shards/manifest.jsonl (login, offline)
make smoke # tiny end-to-end readiness gate (login + tiny GPU)
make generate # submit the GPU backfill array (sbatch)
make score # submit the CPU scoring array (sbatch)
make aggregate # recompute pass@1/pass@k + CI from raw (login, offline)
make publish # upload to the three HF repos (login, online)
make status # per-shard pending/running/done/failed
make verify # artifact integrity re-check
scripts/submit.sh overrides each sbatch's #SBATCH PLACEHOLDER lines from
~/.benchforge.env at submit time, so the same sbatch files work across partitions.
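Submit time is also when the array throttle is sized: the --array=0-(N-1)%FREE form uses the live idle GPU count minus compute.headroom_gpus. A minimal sketch of that arithmetic is below; the function name is hypothetical, and the floor of one concurrent task is an assumption rather than documented behaviour.

```python
def array_spec(n_shards: int, idle_gpus: int, headroom_gpus: int = 2) -> str:
    """Build the sbatch --array flag with a concurrency cap after the % sign."""
    if n_shards < 1:
        raise ValueError("need at least one shard to submit")
    # Assumption: keep at least one task runnable even when nothing looks idle.
    free = max(1, idle_gpus - headroom_gpus)
    return f"--array=0-{n_shards - 1}%{free}"


if __name__ == "__main__":
    print(array_spec(27, 10))  # 27 shards, 10 idle GPUs, 2 held back
```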
All under the D4vidHuang namespace. The leaderboard is recomputed from raw at
publish, never a tally.
- benchforge-results: per-sample Parquet, one row per generated sample.
  Hive-partitioned as data/benchmark=<b>/model=<org__model>/part-*.parquet, with one README YAML config per benchmark. https://huggingface.co/datasets/D4vidHuang/benchforge-results
- benchforge-leaderboard: aggregate Parquet, one row per model x benchmark, recomputed from raw at publish (never a running tally). https://huggingface.co/datasets/D4vidHuang/benchforge-leaderboard
- benchforge-leaderboard-space: a Gradio Space using the gradio_leaderboard component, calling load_dataset() at startup against the leaderboard repo. https://huggingface.co/spaces/D4vidHuang/benchforge-leaderboard-space
Upload is via HfApi().upload_large_folder(repo_type='dataset', num_workers=16) from a
login node only.
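The Hive layout of the results repo can be sketched as a small path helper. It assumes the org__model segment is the Hub model id with the slash replaced by a double underscore; the zero-padded part number is an illustrative choice that merely satisfies the part-*.parquet pattern.

```python
def partition_path(benchmark: str, model_id: str, part: int = 0) -> str:
    """Relative path of one Parquet part inside the benchforge-results layout."""
    # A slash in the model id would create an extra directory level, so it is
    # flattened to "__" (assumed from the <org__model> placeholder).
    model_dir = model_id.replace("/", "__")
    return f"data/benchmark={benchmark}/model={model_dir}/part-{part:05d}.parquet"


if __name__ == "__main__":
    print(partition_path("humanevalplus", "Qwen/Qwen2.5-Coder-7B-Instruct"))
```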
benchForge enforces a hard split between nodes that may touch the internet and nodes that may not:
- Login / staging nodes (internet allowed): the only place that does model +
  dataset download, SIF pulls, gated fetches, and Hub uploads. This is stage, pull-image, and publish.
- Compute nodes (no internet): run with HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, reading only staged scratch files. This is generate and score. offline.hf_hub_offline=1 is the default.
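A compute-side entry point could enforce the split with a guard like the one below. It is a sketch only: the two environment variable names come from this README, while the helper names and the fail-fast behaviour are assumptions.

```python
import os

OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def offline_env(base=None) -> dict:
    """Copy an environment and force both offline switches on."""
    env = dict(os.environ if base is None else base)
    for name in OFFLINE_VARS:
        env[name] = "1"
    return env


def assert_offline(env=None) -> None:
    """Fail fast if a compute-node stage is started without the offline switches."""
    env = os.environ if env is None else env
    missing = [name for name in OFFLINE_VARS if env.get(name) != "1"]
    if missing:
        raise RuntimeError(f"compute stages must run offline; unset: {missing}")
```

Passing the environment explicitly keeps the check testable on a laptop without mutating the real process environment.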
DAIC-UNVERIFIED caveat box
Several values are not yet confirmed against the live cluster and are written as
PLACEHOLDER_* in configs/default.yaml, each pointing at the probe that resolves it. Do not invent a gres or partition; run the probe first.
- compute.gpu_partition / compute.cpu_partition: PLACEHOLDER_gpu / PLACEHOLDER_cpu, resolved by scripts/daic_autodetect.sh.
- compute.gres: gpu:PLACEHOLDER_pro6000:1, resolved by daic_autodetect.sh.
- compute.constraint: PLACEHOLDER_sm120, resolved by daic_autodetect.sh.
- compute.account: PLACEHOLDER_account, resolved by daic_autodetect.sh.
- BENCHFORGE_SANDBOX_MODE: apptainer-setuid | apptainer-userns | rlimit-only, resolved by scripts/sandbox_probe.sh.
- engine.image (nvcr.io/nvidia/vllm:25.09-py3, sm_120): trusted only after the smoke gate confirms that the image boots on Blackwell; the resolved digest is stamped per row.
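A pre-submit check for unresolved placeholders can be sketched in a few lines. It assumes the config has already been loaded into nested dictionaries; the function name and the dotted-path output are illustrative, not part of benchForge.

```python
def find_placeholders(config: dict, prefix: str = "") -> list:
    """Return dotted paths of every string value that still contains PLACEHOLDER_."""
    hits = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            hits.extend(find_placeholders(value, path))
        elif isinstance(value, str) and "PLACEHOLDER_" in value:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    cfg = {"compute": {"gres": "gpu:PLACEHOLDER_pro6000:1", "headroom_gpus": 2}}
    unresolved = find_placeholders(cfg)
    if unresolved:
        print("run the probes first; unresolved:", ", ".join(unresolved))
```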
This quickstart is for a DAIC login node. See
RUN_ON_DAIC.md for the full runbook (tmux bootstrap, probing, morning troubleshooting).
# 0. one-time: write per-user overrides
cp scripts/benchforge.env.example ~/.benchforge.env # then edit scratch path etc.
source scripts/activate_env.sh # exports BENCHFORGE_* + folds in ~/.benchforge.env
# 1. probe the genuinely-unknown cluster facts into ~/.benchforge.env
bash scripts/daic_autodetect.sh
bash scripts/sandbox_probe.sh
# 2. stage weights/datasets + pull images (ONLINE, login node)
make stage
make pull-image
# 3. plan the run (offline)
make plan
make status
# 4. tiny readiness gate (image boots, one chat template, offline toolchain)
make smoke
# 5. the soak: GPU generate -> CPU score (backfill arrays)
make generate
make score
# 6. recompute + publish (ONLINE, login node)
make aggregate
make publish
benchForge is deliberately honest about what is unproven and what it cannot guarantee.
- vLLM non-determinism. Batched inference is not bitwise-deterministic even at
  temperature=0: token streams can differ run to run. benchForge therefore does not promise identical tokens; it promises a reproducible protocol and auditable artifacts, treats scores as the reproducible quantity, and reports bootstrap CIs to quantify residual variance. Exactly one chat template is pinned and prompt-diffed.
- Contamination. LiveCodeBench is pinned to release_v6 with a per-model date window and a contamination_flag, but no benchmark is contamination-proof; treat cross-model comparisons accordingly and read the flag.
- Sandbox is defense-in-depth, not a proof. All generated code is treated as hostile. The
  real boundary is apptainer (--net none, read-only binds, scratch-only tmp) plus per-sample setrlimit and a hard parent-side wall-clock kill. The python builtin blacklist is a documented soft layer only. Whether the outer layer is setuid-apptainer, userns-apptainer, or rlimit-only depends on sandbox_probe.sh.
- NGC auth. The primary vLLM image lives on NGC (nvcr.io/...) and may require authentication to pull; if it is unavailable, benchForge falls back to vllm/vllm-openai:latest. The sm_120 image is DAIC-unverified until smoke.
- Backfill, not a reservation. benchForge only soaks idle GPUs as a low-priority
  backfill citizen (high --nice, short --time). DAIC has live preemption disabled, so throughput depends entirely on how much idle capacity exists; monitor_backfill.sh measures the realized GPU-second floor and recommends QOS/nice fallbacks.
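The inner sandbox layer named in the list above (per-sample process, setrlimit, parent-side wall-clock kill) can be sketched as follows. This is an illustration under stated assumptions, not the harness-native implementation: it caps only AS and CPU, uses the fork start method, runs the sample with exec, and is Linux-oriented. On its own it is not a security boundary; the outer apptainer layer is what provides isolation.

```python
import multiprocessing
import resource


def _cap(kind: int, value: int) -> None:
    """Lower one rlimit to value, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _child(source: str, mem_bytes: int, cpu_seconds: int) -> None:
    # Limits are applied inside the child, so the parent stays unrestricted.
    _cap(resource.RLIMIT_AS, mem_bytes)
    _cap(resource.RLIMIT_CPU, cpu_seconds)
    exec(compile(source, "<sample>", "exec"), {"__name__": "__sample__"})


def run_sample(source: str, wall_seconds: float = 2.0,
               mem_bytes: int = 2 << 30, cpu_seconds: int = 5) -> str:
    """Run one untrusted sample; return 'passed', 'failed', or 'timeout'."""
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=_child, args=(source, mem_bytes, cpu_seconds))
    proc.start()
    proc.join(wall_seconds)  # the parent owns the wall clock
    if proc.is_alive():
        proc.kill()  # SIGKILL cannot be caught or ignored by the sample
        proc.join()
        return "timeout"
    return "passed" if proc.exitcode == 0 else "failed"


if __name__ == "__main__":
    print(run_sample("assert sum([1, 2, 3]) == 6"))
    print(run_sample("import time; time.sleep(30)", wall_seconds=0.5))
```

The sleeping sample shows why the wall-clock kill sits in the parent: a process that sleeps burns no CPU time, so RLIMIT_CPU never fires, and a signal.alarm inside the child could be overridden by the sample itself.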
- DESIGN.md: full architecture, locked-decisions table, published data schema, sharding, requeue/idempotency, staging, licensing/contamination, open questions to probe, smoke-vs-full.
- RUN_ON_DAIC.md: the one-page runbook: tmux bootstrap, what is genuinely unknown and how it is probed, the full step-by-step, where things live, and a morning troubleshooting list.