benchForge

benchForge opportunistically soaks idle TU Delft DAIC RTX Pro 6000 GPUs (96 GB, NVIDIA Blackwell, sm_120) to run a small, airtight, reproducible suite of code/LLM benchmarks. It is a sibling of preCal and inherits its philosophy: be a polite backfill citizen, decouple the expensive non-deterministic step from the cheap deterministic one, and publish auditable artifacts rather than a leaderboard screenshot.

The core idea is a STRICT decoupling of GPU generation from CPU scoring:

GPU generate — a vLLM OpenAI-compatible server serves a pinned model from an apptainer SIF; thin OpenAI-client harnesses prompt it and write raw samples.jsonl. This is the only step that needs a GPU, and it is not bitwise-deterministic.
CPU score — a CPU-only scoring SIF executes/grades the generated code inside a two-layer sandbox and writes results.jsonl. This is deterministic and re-runnable.

benchForge then publishes a per-sample results dataset and a deterministically recomputed leaderboard to the Hugging Face Hub under the D4vidHuang namespace.

🔩 Live results (v1)

A first run is published: 6 models × up to 5 benchmarks (27 model-benchmark cells in total), run on idle DAIC Pro 6000s. The full table and caveats are in RESULTS.md.

Leaderboard: https://huggingface.co/datasets/D4vidHuang/benchforge-leaderboard
Per-sample results: https://huggingface.co/datasets/D4vidHuang/benchforge-results
Leaderboard Space: https://huggingface.co/spaces/D4vidHuang/benchforge-leaderboard-space

Benchmarks: HumanEval+, MBPP+ (EvalPlus), CRUXEval-I/-O, LiveCodeBench (release_v6, with a contamination-controlled clean subset). Models: Qwen2.5-Coder 7B/14B/32B, Qwen3-Coder-30B-A3B (Apache-2.0), plus non-OSI DeepSeek-Coder-V2-Lite-Instruct and StarCoder2-15B-Instruct (license-tagged, osi_approved=False).

Where the working pipeline lives: the scripts that actually produced these results are in daic/ (see daic/README.md). The benchforge/ package below is the original design scaffold; its CLI execution internals are still NotImplementedError stubs. Reconciling the two (folding the daic/ runners into benchforge.cli) is tracked as future work.

benchForge is authored on macOS but its execution target is Linux + CUDA + SLURM on DAIC. The repo runs end-to-end without a cluster for everything that does not need a GPU (config loading, planning, metrics, provenance, CLI dispatch, all docs/yaml/sbatch/shell). Heavy GPU/harness internals are lazy-imported and, where they cannot be exercised off-cluster, are clearly-marked stubs that raise NotImplementedError with a precise docstring of the real implementation.

Why this design

A summary of the locked decisions that shape benchForge. See DESIGN.md for the full rationale and tables.

Serving: vLLM, one OpenAI-compatible /v1 endpoint per generate shard, run from an apptainer SIF. Harnesses are thin OpenAI clients — they never touch the engine directly.
- Primary image nvcr.io/nvidia/vllm:25.09-py3 (sm_120, vLLM 0.10.1.1 / CUDA 13, DAIC-UNVERIFIED until smoke — stamp the resolved digest per row).
- Fallback image vllm/vllm-openai:latest (stock, >= 0.13.x cu129/130).
- dtype=bfloat16, quantization=none. Per-shard pins: --max-model-len 16384, --gpu-memory-utilization 0.90, --max-num-seqs 64, --seed <run.seed>.
Harnesses: per-benchmark official harnesses pinned to a git commit SHA, frozen in a CPU-only scoring SIF, wrapped behind benchforge.harness.* adapters (generate -> samples, score -> results). lm-eval-harness was dropped.
v1 default airtight benchmarks: humanevalplus (evalplus/humanevalplus, Apache-2.0), mbppplus (evalplus/mbppplus, Apache-2.0), livecodebench (livecodebench/code_generation_lite, pinned release_v6, per-model date window + contamination flag), cruxeval (cruxeval-org/cruxeval, MIT, scenarios cruxeval_i and cruxeval_o).
Opt-in Tier-A-extended (only after a smoke offline-toolchain readiness gate): bigcodebench (generation hard-pinned vllm/openai backend, scoring hard-pinned --execution local, gradio/e2b remote backends disabled), multipl-e, ds1000.
Deferred out of v1: SWE-bench Verified (Docker-per-instance; v2 spike).
Core models (Apache-2.0, pinned by HF revision SHA, all fit 96 GB bf16): Qwen/Qwen2.5-Coder-7B-Instruct, Qwen/Qwen2.5-Coder-14B-Instruct, Qwen/Qwen2.5-Coder-32B-Instruct, Qwen/Qwen3-Coder-30B-A3B-Instruct. Optional non-OSI (opt-in, hub_license-tagged, output-segregated): deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, bigcode/starcoder2-15b, meta-llama/Llama-3.1-8B-Instruct (gated -> stage on login node). Excluded: mistralai/Codestral-22B-v0.1 (MNPL), meta-llama/Llama-3.3-70B (bf16 > 96 GB).
Sandbox: two-layer. OUTER = one CPU-only apptainer SIF per benchmark family (pinned python + harness@SHA), apptainer exec --net --network none, read-only binds, scratch-only writable tmp. INNER = harness-native per-sample multiprocessing.Process + resource.setrlimit (AS/DATA/STACK/CPU/NOFILE/NPROC) + a hard parent-side wall-clock kill (not signal.alarm). The python blacklist is a soft layer; apptainer + rlimits + --net none are the real boundary.
Idle / preemption: pure backfill citizenship. GPU array is one GPU per task, --qos=$BENCHFORGE_GPU_QOS (default long), --nice=$BENCHFORGE_NICE (default 1000000), --time=$BENCHFORGE_GPU_TIME (default 04:00:00), --requeue, --signal=B:USR1@120, --open-mode=append, --array=0-(N-1)%FREE where FREE is sized at submit from the live idle Pro 6000 count minus compute.headroom_gpus (2). DAIC has live preemption disabled -> short --time + backfill ordering are the only yield levers; monitor_backfill.sh measures a GPU-second floor.
Determinism, reframed: vLLM batched inference is not bitwise-deterministic even at temperature=0. The guarantee is a reproducible PROTOCOL + auditable artifacts: scores (not tokens) are the reproducible quantity, and bootstrap CIs quantify the residual variance. Exactly-one chat template is enforced (templating_mode chat|completions, smoke prompt-diff, chat_template_sha per row).
Stats: greedy temperature=0.0 pass@1 headline across the matrix; pass@k (k in {5, 10}) only when n_samples >= k via the unbiased Chen et al. (2021) estimator (refuse when n < k); bootstrap-over-problems 95% CI on every aggregate.
Artifact: three HF repos under D4vidHuang (see below). The leaderboard is recomputed from raw at publish, never a running tally.
Budget stop: run.max_requeue_per_shard (5) + run.gpu_hour_budget (2000) ceiling; a chronically-failing shard is marked failed and surfaced by resubmit_pending.sh.
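
The Stats decision above can be illustrated with a minimal sketch of the unbiased pass@k estimator from Chen et al. (2021) and a percentile bootstrap over problems. The function names, the `n_boot` default, and the fixed seed are illustrative assumptions, not benchForge's actual API.

```python
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), computed in product form.

    n = samples generated for one problem, c = samples that passed.
    """
    if n < k:
        # Mirrors the "refuse when n < k" rule: no estimate without enough samples.
        raise ValueError(f"refusing pass@{k} with only n={n} samples")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling problems with replacement."""
    if not scores:
        raise ValueError("need at least one per-problem score")
    rng = random.Random(seed)  # fixed seed so the interval itself is re-derivable
    m = len(scores)
    means = sorted(sum(rng.choices(scores, k=m)) / m for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

For greedy pass@1 with a single sample per problem, each per-problem score is simply 0 or 1, and `bootstrap_ci` is applied to that list.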

Data flow

                          LOGIN / STAGING NODE  (internet allowed)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  stage        pull weights+datasets -> $BENCHFORGE_HF_HOME (HF_HOME)       │
  │  pull-image   pull vLLM SIF + scoring SIFs -> $BENCHFORGE_SIF_DIR          │
  │  plan         enumerate model x benchmark x scenario x sample-window       │
  │                  -> shards/manifest.jsonl   (SINGLE SOURCE OF TRUTH)       │
  └──────────────────────────────────────────────────────────────────────────┘
            │ manifest.jsonl + staged scratch (HF_HUB_OFFLINE=1)
            ▼
                       GPU COMPUTE NODE(s)  (offline, backfill array)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  generate (per gen shard)                                                  │
  │    start_vllm_server(SIF) -> ONE /v1 endpoint  ──► thin OpenAI harness     │
  │      writes gen/<shard_id>/samples.jsonl  (append + fsync + atomic rename) │
  │    on row-count reconcile vs manifest n  -> manifests/generate-<id>.done   │
  └──────────────────────────────────────────────────────────────────────────┘
            │ samples.jsonl + generate-<id>.done
            ▼
                       CPU COMPUTE NODE(s)  (offline, distinct cpu_partition)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  score (per score shard)   REFUSES shards lacking generate-<id>.done       │
  │    OUTER apptainer SIF  (--net none, ro binds)                             │
  │      INNER multiprocessing.Process + setrlimit + hard wall-clock kill      │
  │      writes score/<shard_id>/results.jsonl  (per-sample idempotent)        │
  └──────────────────────────────────────────────────────────────────────────┘
            │ results.jsonl
            ▼
                          LOGIN / STAGING NODE  (internet allowed)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  aggregate    pass@1 / pass@k (Chen et al.) + bootstrap CI  (RECOMPUTED)   │
  │  publish      assemble publish/ layout -> upload_large_folder to 3 repos   │
  └──────────────────────────────────────────────────────────────────────────┘
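
The hand-off between generate and score in the diagram above rests on a done marker that is written only after the row count reconciles with the manifest. A minimal sketch of that gate follows; the function names are hypothetical, and only the paths (`gen/<shard_id>/samples.jsonl`, `manifests/generate-<id>.done`) come from this README.

```python
import tempfile
from pathlib import Path


def count_rows(jsonl: Path) -> int:
    """Count non-empty lines in a JSONL file."""
    with jsonl.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def mark_generate_done(run_dir: Path, shard_id: str, expected_n: int) -> bool:
    """Write generate-<id>.done only if samples.jsonl has exactly expected_n rows."""
    samples = run_dir / "gen" / shard_id / "samples.jsonl"
    if not samples.is_file() or count_rows(samples) != expected_n:
        return False
    marker = run_dir / "manifests" / f"generate-{shard_id}.done"
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True


def require_generate_done(run_dir: Path, shard_id: str) -> None:
    """Score-side guard: refuse a shard whose generate marker is missing."""
    marker = run_dir / "manifests" / f"generate-{shard_id}.done"
    if not marker.is_file():
        raise RuntimeError(f"shard {shard_id}: generate-{shard_id}.done is missing")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        (run / "gen" / "0").mkdir(parents=True)
        (run / "gen" / "0" / "samples.jsonl").write_text('{"task": "a"}\n{"task": "b"}\n')
        print(mark_generate_done(run, "0", expected_n=3))  # row count does not match
        print(mark_generate_done(run, "0", expected_n=2))  # reconciles, marker written
        require_generate_done(run, "0")
```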

The run lives entirely under $BENCHFORGE_SCRATCH/<run.name>/:

$BENCHFORGE_SCRATCH/<run.name>/
  shards/manifest.jsonl                  SINGLE SOURCE OF TRUTH; one JSON line per shard
  manifests/<stage>-<shard_id>.done      done marker; <stage> in {generate, score}
  manifests/<stage>-<shard_id>.committed append-only resume cursor of committed keys
  gen/<shard_id>/samples.jsonl           GEN output (append + fsync + atomic rename)
  score/<shard_id>/results.jsonl         SCORE output (append-only, per-sample idempotent)
  publish/benchforge-results/            assembled per-sample Parquet (Hive layout)
  publish/benchforge-leaderboard/        assembled aggregate Parquet
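
One way to read "append-only, per-sample idempotent" together with the `.committed` resume cursor is sketched below: each row is appended and fsynced, then its key is appended to the cursor, and a key already in the cursor is skipped on resume. The helper names and the key format are assumptions for illustration; the real writers may differ (for example, the atomic-rename step for `samples.jsonl` is not shown).

```python
import json
import os
import tempfile
from pathlib import Path


def append_durable(path: Path, line: str) -> None:
    """Append one line and force it to disk before returning."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def load_committed(cursor: Path) -> set:
    """Read the set of already-committed sample keys (empty if no cursor yet)."""
    if not cursor.is_file():
        return set()
    return {ln.strip() for ln in cursor.read_text(encoding="utf-8").splitlines() if ln.strip()}


def write_once(results: Path, cursor: Path, key: str, row: dict, committed: set) -> bool:
    """Append row unless key is already committed; return True if written."""
    if key in committed:
        return False
    append_durable(results, json.dumps(row, sort_keys=True))
    append_durable(cursor, key)  # cursor is written after the row
    committed.add(key)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "score" / "0" / "results.jsonl"
        cursor = Path(tmp) / "manifests" / "score-0.committed"
        seen = load_committed(cursor)
        print(write_once(results, cursor, "task-1#0", {"passed": True}, seen))
        print(write_once(results, cursor, "task-1#0", {"passed": True}, seen))
```

In this sketch a crash between the two appends can leave a row without a cursor entry, so a consumer would still need to deduplicate by key when recomputing aggregates.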

The stage table

The CLI is python -m benchforge.cli <sub> [--config configs/X.yaml] [--shard-id N] [--resume] [--dry-run] [--force]. Subcommands, in run order:

| Stage | CLI subcommand | Where it runs | Internet | What it does |
| --- | --- | --- | --- | --- |
| probe | (shell) | login + a probe job | n/a | `scripts/daic_autodetect.sh` + `scripts/sandbox_probe.sh` resolve the `PLACEHOLDER_*` gres/partition/constraint/QOS and the sandbox mode into `~/.benchforge.env`. |
| stage | `stage` | login / staging | yes | Download pinned model revisions + dataset revisions into `$BENCHFORGE_HF_HOME`. Gated models staged here. |
| pull-image | `pull-image` | login / staging | yes | Pull the vLLM SIF (primary, then fallback) and the per-family CPU scoring SIFs into `$BENCHFORGE_SIF_DIR`; record digests. |
| plan | `plan` | login / staging | no | Deterministically enumerate model x benchmark x scenario x sample-window into `shards/manifest.jsonl`. |
| smoke | `smoke` | login + tiny GPU job | no | Tiny end-to-end check: image boots, one chat template, prompt-diff, offline-toolchain readiness gate (also gates opt-in Tier-A-extended). |
| generate | `generate` | GPU compute (array) | no | Per gen shard: start vLLM endpoint, run OpenAI-client harness, write `samples.jsonl`, write `generate-<id>.done` on row-count reconcile. |
| score | `score` | CPU compute (array) | no | Per score shard: execute/grade in the two-layer sandbox, write `results.jsonl`. Refuses shards without `generate-<id>.done`. |
| aggregate | `aggregate` | login / staging | no | Recompute pass@1 / pass@k + bootstrap CI from raw `results.jsonl`. |
| publish | `publish` | login / staging | yes | Assemble `publish/` layout and `upload_large_folder` to the three HF repos. |
| status | `status` | anywhere | no | Report per-shard `pending/running/done/failed` from markers + manifest. |
| verify | `verify` | anywhere | no | Re-check artifact integrity (row counts, schema, done markers). |
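
To make the `plan` row concrete, here is a rough sketch of a deterministic enumeration of model x benchmark x scenario x sample-window. The manifest field names (`shard_id`, `start`, `end`, `n`) and the fixed `window` size are assumptions, not the actual manifest schema; the point is only that sorting the inputs makes the output independent of config ordering.

```python
import json


def plan_shards(models, scenarios, n_problems, window):
    """Enumerate shards in a stable order.

    models: iterable of model ids
    scenarios: dict mapping benchmark -> iterable of scenario names
    n_problems: dict mapping benchmark -> number of problems
    window: number of problems per shard
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    shards = []
    for model in sorted(models):
        for benchmark in sorted(scenarios):
            total = n_problems[benchmark]
            for scenario in sorted(scenarios[benchmark]):
                for start in range(0, total, window):
                    end = min(start + window, total)
                    shards.append({
                        "shard_id": len(shards),  # position in the stable order
                        "model": model,
                        "benchmark": benchmark,
                        "scenario": scenario,
                        "start": start,
                        "end": end,
                        "n": end - start,  # expected row count for reconcile
                    })
    return shards


def manifest_lines(shards):
    """One JSON line per shard, with sorted keys for byte-stable output."""
    return [json.dumps(shard, sort_keys=True) for shard in shards]
```

Calling `plan_shards` twice with the same inputs in a different order yields the same manifest lines, which is what lets the manifest act as a single source of truth across requeues.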

Make targets

The Makefile wraps the CLI and reads config via scripts/cfg.py. Targets mirror the stage table:

make stage          # download staged weights + datasets        (login, online)
make pull-image     # pull vLLM + scoring SIFs                   (login, online)
make plan           # write shards/manifest.jsonl               (login, offline)
make smoke          # tiny end-to-end readiness gate            (login + tiny GPU)
make generate       # submit the GPU backfill array             (sbatch)
make score          # submit the CPU scoring array              (sbatch)
make aggregate      # recompute pass@1/pass@k + CI from raw     (login, offline)
make publish        # upload to the three HF repos              (login, online)
make status         # per-shard pending/running/done/failed
make verify         # artifact integrity re-check

scripts/submit.sh overrides each sbatch's #SBATCH PLACEHOLDER lines from ~/.benchforge.env at submit time, so the same sbatch files work across partitions.
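
The array throttle described under "Idle / preemption" (`--array=0-(N-1)%FREE`, with FREE equal to the live idle GPU count minus `compute.headroom_gpus`) reduces to a small piece of arithmetic. The sketch below is in Python for illustration only; the function name is hypothetical, and returning `None` when no GPU is free is an assumption about what a submit script might do.

```python
def array_spec(n_shards: int, idle_gpus: int, headroom_gpus: int = 2):
    """Build the value for sbatch --array, or None if nothing should be submitted.

    n_shards: number of generate shards (N)
    idle_gpus: idle Pro 6000 count observed at submit time
    headroom_gpus: GPUs deliberately left free for other users
    """
    if n_shards < 1:
        raise ValueError("need at least one shard")
    free = idle_gpus - headroom_gpus
    if free < 1:
        return None  # assumption: skip submission rather than queue with %0
    return f"0-{n_shards - 1}%{free}"
```

With 27 shards and 10 idle GPUs, this gives `0-26%8`: all shards are queued, but at most 8 run at once.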

The three Hugging Face repos

All three live under the D4vidHuang namespace. The leaderboard is recomputed from raw results at publish time, never kept as a running tally.

benchforge-results — per-sample Parquet, one row per generated sample. Hive-partitioned data/benchmark=<b>/model=<org__model>/part-*.parquet, one README YAML config per benchmark. https://huggingface.co/datasets/D4vidHuang/benchforge-results
benchforge-leaderboard — aggregate Parquet, one row per model x benchmark, recomputed from raw at publish (never a running tally). https://huggingface.co/datasets/D4vidHuang/benchforge-leaderboard
benchforge-leaderboard-space — a Gradio Space using the gradio_leaderboard component, calling load_dataset() at startup against the leaderboard repo. https://huggingface.co/spaces/D4vidHuang/benchforge-leaderboard-space

Upload is via HfApi().upload_large_folder(repo_type='dataset', num_workers=16) from a login node only.

Offline / online split

benchForge enforces a hard split between nodes that may touch the internet and nodes that may not:

Login / staging nodes (internet allowed): the only place that does model + dataset download, SIF pulls, gated fetches, and Hub uploads. This is stage, pull-image, and publish.
Compute nodes (no internet): run with HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, reading only staged scratch files. This is generate and score. offline.hf_hub_offline=1 is the default.
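
As a small illustration of the compute-node side of this split, a launcher could build the child-process environment as follows. The helper name is hypothetical; the two offline variables are the ones named above, and mapping `BENCHFORGE_HF_HOME` onto `HF_HOME` follows the data-flow diagram.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with Hub access switched off."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    staged = env.get("BENCHFORGE_HF_HOME")
    if staged:
        env["HF_HOME"] = staged  # read only from the staged cache
    return env
```

A generate or score wrapper would then pass `env=offline_env()` to `subprocess.run`, so the setting does not depend on what the submitting shell happened to export.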

DAIC-UNVERIFIED caveat box

Several values are not yet confirmed against the live cluster and are written as PLACEHOLDER_* in configs/default.yaml, each pointing at the probe that resolves it. Do not invent a gres or partition — run the probe first.

compute.gpu_partition / compute.cpu_partition — PLACEHOLDER_gpu / PLACEHOLDER_cpu, resolved by scripts/daic_autodetect.sh.

compute.gres — gpu:PLACEHOLDER_pro6000:1, resolved by daic_autodetect.sh.

compute.constraint — PLACEHOLDER_sm120, resolved by daic_autodetect.sh.

compute.account — PLACEHOLDER_account, resolved by daic_autodetect.sh.

BENCHFORGE_SANDBOX_MODE — apptainer-setuid | apptainer-userns | rlimit-only, resolved by scripts/sandbox_probe.sh.

engine.image (nvcr.io/nvidia/vllm:25.09-py3, sm_120): treated as unverified until the smoke gate confirms that the image boots and runs on Blackwell; the resolved digest is stamped per row.

Quickstart

This quickstart is for a DAIC login node. See RUN_ON_DAIC.md for the full runbook (tmux bootstrap, probing, morning troubleshooting).

# 0. one-time: write per-user overrides
cp scripts/benchforge.env.example ~/.benchforge.env   # then edit scratch path etc.
source scripts/activate_env.sh                         # exports BENCHFORGE_* + folds in ~/.benchforge.env

# 1. probe the genuinely-unknown cluster facts into ~/.benchforge.env
bash scripts/daic_autodetect.sh
bash scripts/sandbox_probe.sh

# 2. stage weights/datasets + pull images (ONLINE, login node)
make stage
make pull-image

# 3. plan the run (offline)
make plan
make status

# 4. tiny readiness gate (image boots, one chat template, offline toolchain)
make smoke

# 5. the soak: GPU generate -> CPU score (backfill arrays)
make generate
make score

# 6. recompute + publish (ONLINE, login node)
make aggregate
make publish

Honest limits

benchForge is deliberately honest about what is unproven and what it cannot guarantee.

vLLM non-determinism. Batched inference is not bitwise-deterministic even at temperature=0 — token streams can differ run-to-run. benchForge therefore does not promise identical tokens; it promises a reproducible protocol and auditable artifacts, treats scores as the reproducible quantity, and reports bootstrap CIs to quantify residual variance. Exactly-one chat template is pinned and prompt-diffed.
Contamination. LiveCodeBench is pinned to release_v6 with a per-model date window and a contamination_flag, but no benchmark is contamination-proof; treat cross-model comparisons accordingly and read the flag.
Sandbox is defense-in-depth, not a proof. All generated code is treated as hostile. The real boundary is apptainer (--net none, read-only binds, scratch-only tmp) plus per-sample setrlimit and a hard parent-side wall-clock kill. The python builtin blacklist is a documented soft layer only. Whether the outer layer is setuid-apptainer, userns-apptainer, or rlimit-only depends on sandbox_probe.sh.
NGC auth. The primary vLLM image lives on NGC (nvcr.io/...) and may require authentication to pull; if it is unavailable, benchForge falls back to vllm/vllm-openai:latest. The sm_120 image is DAIC-unverified until smoke.
Backfill, not a reservation. benchForge only soaks idle GPUs as a low-priority backfill citizen (high --nice, short --time). DAIC has live preemption disabled, so throughput depends entirely on how much idle capacity exists; monitor_backfill.sh measures the realized GPU-second floor and recommends QOS/nice fallbacks.
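
For the sandbox item above, the INNER layer (a per-sample `multiprocessing.Process`, `resource.setrlimit`, and a parent-side wall-clock kill instead of `signal.alarm`) can be sketched roughly as below. This is not the harness-native implementation: the function names and return labels are hypothetical, the sketch is Linux-only (it uses the `fork` start method), and it is no substitute for the apptainer outer layer.

```python
import multiprocessing as mp
import resource
import time


def _child(target, args, limits):
    """Runs in the child: lower resource limits, then execute the target."""
    for res, value in limits.items():
        _, hard = resource.getrlimit(res)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)  # an unprivileged process cannot raise its hard limit
        resource.setrlimit(res, (value, value))
    target(*args)


def run_limited(target, args=(), wall_s=10.0, limits=None):
    """Run target in a child process; return 'passed', 'failed', or 'timeout'."""
    ctx = mp.get_context("fork")
    proc = ctx.Process(target=_child, args=(target, args, limits or {}))
    proc.start()
    proc.join(wall_s)  # the parent enforces wall-clock time
    if proc.is_alive():
        proc.kill()  # SIGKILL cannot be caught or ignored by the child
        proc.join()
        return "timeout"
    return "passed" if proc.exitcode == 0 else "failed"


if __name__ == "__main__":
    cpu_cap = {resource.RLIMIT_CPU: 5}  # seconds of CPU time
    print(run_limited(time.sleep, (30,), wall_s=0.5, limits=cpu_cap))
```

The wall-clock limit is enforced from the parent because a sleeping or blocked child consumes no CPU time and would never trip `RLIMIT_CPU` on its own.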

Where to read next

DESIGN.md — full architecture, locked-decisions table, published data schema, sharding, requeue/idempotency, staging, licensing/contamination, open questions to probe, smoke-vs-full.
RUN_ON_DAIC.md — the one-page runbook: tmux bootstrap, what is genuinely unknown and how it is probed, the full step-by-step, where things live, and a morning troubleshooting list.
