benchForge

benchForge opportunistically soaks idle TU Delft DAIC RTX Pro 6000 GPUs (96 GB, NVIDIA Blackwell, sm_120) to run a small, airtight, reproducible suite of code/LLM benchmarks. It is a sibling of preCal and inherits its philosophy: be a polite backfill citizen, decouple the expensive non-deterministic step from the cheap deterministic one, and publish auditable artifacts rather than a leaderboard screenshot.

The core idea is a STRICT GPU-generate / CPU-score decouple:

  • GPU generate — a vLLM OpenAI-compatible server serves a pinned model from an apptainer SIF; thin OpenAI-client harnesses prompt it and write raw samples.jsonl. This is the only step that needs a GPU, and it is not bitwise-deterministic.
  • CPU score — a CPU-only scoring SIF executes/grades the generated code inside a two-layer sandbox and writes results.jsonl. This is deterministic and re-runnable.
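The generate side can be pictured as one command line per shard. The sketch below only assembles an argv from the per-shard pins this README lists under Serving; it is not the project's launcher. The `ShardPins` and `vllm_argv` names, the port, and the assumption that the SIF exposes the `vllm serve` entrypoint are illustrative and should be checked against the actual image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardPins:
    """Per-shard engine pins quoted in this README (hypothetical container type)."""
    dtype: str = "bfloat16"
    max_model_len: int = 16384
    gpu_memory_utilization: float = 0.90
    max_num_seqs: int = 64


def vllm_argv(sif_path: str, model: str, revision: str, seed: int,
              port: int = 8000, pins: ShardPins = ShardPins()) -> list[str]:
    """Build the argv for one OpenAI-compatible endpoint served from a SIF.

    Assumption: the image has the `vllm` CLI on PATH. The port is illustrative.
    """
    return [
        "apptainer", "exec", "--nv", sif_path,  # --nv exposes the host NVIDIA driver
        "vllm", "serve", model,
        "--revision", revision,                 # pinned HF revision SHA
        "--dtype", pins.dtype,
        "--max-model-len", str(pins.max_model_len),
        "--gpu-memory-utilization", f"{pins.gpu_memory_utilization:.2f}",
        "--max-num-seqs", str(pins.max_num_seqs),
        "--seed", str(seed),
        "--port", str(port),
    ]
```

Keeping the command as a list (rather than a shell string) makes it easy to log the exact invocation next to each shard's samples.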

benchForge then publishes a per-sample results dataset and a deterministically recomputed leaderboard to the Hugging Face Hub under the D4vidHuang namespace.


🔩 Live results (v1)

A first run is published — 6 models × up to 5 benchmark cells (27 cells), run on idle DAIC Pro 6000s. Full table + caveats in RESULTS.md.

Benchmarks: HumanEval+, MBPP+ (EvalPlus), CRUXEval-I/-O, LiveCodeBench (release_v6, with a contamination-controlled clean subset). Models: Qwen2.5-Coder 7B/14B/32B, Qwen3-Coder-30B-A3B (Apache-2.0), plus non-OSI DeepSeek-Coder-V2-Lite-Instruct and StarCoder2-15B-Instruct (license-tagged, osi_approved=False).

Where the working pipeline lives: the scripts that actually produced these results are in daic/ (see daic/README.md). The benchforge/ package below is the original design scaffold; its CLI execution internals are still NotImplementedError stubs. Reconciling the two (folding the daic/ runners into benchforge.cli) is tracked as future work.

benchForge is authored on macOS but its execution target is Linux + CUDA + SLURM on DAIC. The repo runs end-to-end without a cluster for everything that does not need a GPU (config loading, planning, metrics, provenance, CLI dispatch, all docs/yaml/sbatch/shell). Heavy GPU/harness internals are lazy-imported and, where they cannot be exercised off-cluster, are clearly-marked stubs that raise NotImplementedError with a precise docstring of the real implementation.


Why this design

A summary of the locked decisions that shape benchForge. See DESIGN.md for the full rationale and tables.

  • Serving: vLLM, one OpenAI-compatible /v1 endpoint per generate shard, run from an apptainer SIF. Harnesses are thin OpenAI clients — they never touch the engine directly.
    • Primary image nvcr.io/nvidia/vllm:25.09-py3 (sm_120, vLLM 0.10.1.1 / CUDA 13, DAIC-UNVERIFIED until smoke — stamp the resolved digest per row).
    • Fallback image vllm/vllm-openai:latest (stock, >= 0.13.x cu129/130).
    • dtype=bfloat16, quantization=none. Per-shard pins: --max-model-len 16384, --gpu-memory-utilization 0.90, --max-num-seqs 64, --seed <run.seed>.
  • Harnesses: per-benchmark official harnesses pinned to a git commit SHA, frozen in a CPU-only scoring SIF, wrapped behind benchforge.harness.* adapters (generate -> samples, score -> results). lm-eval-harness was dropped.
  • v1 default airtight benchmarks: humanevalplus (evalplus/humanevalplus, Apache-2.0), mbppplus (evalplus/mbppplus, Apache-2.0), livecodebench (livecodebench/code_generation_lite, pinned release_v6, per-model date window + contamination flag), cruxeval (cruxeval-org/cruxeval, MIT, scenarios cruxeval_i and cruxeval_o).
  • Opt-in Tier-A-extended (only after a smoke offline-toolchain readiness gate): bigcodebench (generation hard-pinned vllm/openai backend, scoring hard-pinned --execution local, gradio/e2b remote backends disabled), multipl-e, ds1000.
  • Deferred out of v1: SWE-bench Verified (Docker-per-instance; v2 spike).
  • Core models (Apache-2.0, pinned by HF revision SHA, all fit 96 GB bf16): Qwen/Qwen2.5-Coder-7B-Instruct, Qwen/Qwen2.5-Coder-14B-Instruct, Qwen/Qwen2.5-Coder-32B-Instruct, Qwen/Qwen3-Coder-30B-A3B-Instruct. Optional non-OSI (opt-in, hub_license-tagged, output-segregated): deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, bigcode/starcoder2-15b, meta-llama/Llama-3.1-8B-Instruct (gated -> stage on login node). Excluded: mistralai/Codestral-22B-v0.1 (MNPL), meta-llama/Llama-3.3-70B (bf16 > 96 GB).
  • Sandbox: two-layer. OUTER = one CPU-only apptainer SIF per benchmark family (pinned python + harness@SHA), apptainer exec --net --network none, read-only binds, scratch-only writable tmp. INNER = harness-native per-sample multiprocessing.Process + resource.setrlimit (AS/DATA/STACK/CPU/NOFILE/NPROC) + a hard parent-side wall-clock kill (not signal.alarm). The python blacklist is a soft layer; apptainer + rlimits + --net none are the real boundary.
  • Idle / preemption: pure backfill citizenship. GPU array is one GPU per task, --qos=$BENCHFORGE_GPU_QOS (default long), --nice=$BENCHFORGE_NICE (default 1000000), --time=$BENCHFORGE_GPU_TIME (default 04:00:00), --requeue, --signal=B:USR1@120, --open-mode=append, --array=0-(N-1)%FREE where FREE is sized at submit from the live idle Pro 6000 count minus compute.headroom_gpus (2). DAIC has live preemption disabled -> short --time + backfill ordering are the only yield levers; monitor_backfill.sh measures a GPU-second floor.
  • Determinism, reframed: vLLM batched inference is not bitwise-deterministic even at temperature=0. The guarantee is a reproducible PROTOCOL + auditable artifacts: scores (not tokens) are the reproducible quantity, and bootstrap CIs quantify the residual variance. Exactly-one chat template is enforced (templating_mode chat|completions, smoke prompt-diff, chat_template_sha per row).
  • Stats: greedy temperature=0.0 pass@1 headline across the matrix; pass@k (k in {5, 10}) only when n_samples >= k via the unbiased Chen et al. (2021) estimator (refuse when n < k); bootstrap-over-problems 95% CI on every aggregate.
  • Artifact: three HF repos under D4vidHuang (see below). The leaderboard is recomputed from raw at publish, never a running tally.
  • Budget stop: run.max_requeue_per_shard (5) + run.gpu_hour_budget (2000) ceiling; a chronically-failing shard is marked failed and surfaced by resubmit_pending.sh.
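The Stats decision above names the unbiased pass@k estimator of Chen et al. (2021), which for one problem with n samples and c correct ones is 1 - C(n-c, k) / C(n, k). A minimal sketch follows, including the refuse-when-n<k rule; the function names are illustrative and not taken from the benchforge package.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if n < k:
        # Mirrors the README rule: pass@k is only reported when n_samples >= k.
        raise ValueError(f"refusing pass@{k}: only n={n} samples")
    # math.comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    if not counts:
        raise ValueError("no problems to aggregate")
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

With greedy decoding and one sample per problem, `pass_at_k(1, c, 1)` reduces to the plain fraction of problems solved, which is the headline pass@1.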

Data flow

                          LOGIN / STAGING NODE  (internet allowed)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  stage        pull weights+datasets -> $BENCHFORGE_HF_HOME (HF_HOME)       │
  │  pull-image   pull vLLM SIF + scoring SIFs -> $BENCHFORGE_SIF_DIR          │
  │  plan         enumerate model x benchmark x scenario x sample-window       │
  │                  -> shards/manifest.jsonl   (SINGLE SOURCE OF TRUTH)       │
  └──────────────────────────────────────────────────────────────────────────┘
            │ manifest.jsonl + staged scratch (HF_HUB_OFFLINE=1)
            ▼
                       GPU COMPUTE NODE(s)  (offline, backfill array)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  generate (per gen shard)                                                  │
  │    start_vllm_server(SIF) -> ONE /v1 endpoint  ──► thin OpenAI harness     │
  │      writes gen/<shard_id>/samples.jsonl  (append + fsync + atomic rename) │
  │    on row-count reconcile vs manifest n  -> manifests/generate-<id>.done   │
  └──────────────────────────────────────────────────────────────────────────┘
            │ samples.jsonl + generate-<id>.done
            ▼
                       CPU COMPUTE NODE(s)  (offline, distinct cpu_partition)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  score (per score shard)   REFUSES shards lacking generate-<id>.done       │
  │    OUTER apptainer SIF  (--net none, ro binds)                             │
  │      INNER multiprocessing.Process + setrlimit + hard wall-clock kill      │
  │      writes score/<shard_id>/results.jsonl  (per-sample idempotent)        │
  └──────────────────────────────────────────────────────────────────────────┘
            │ results.jsonl
            ▼
                          LOGIN / STAGING NODE  (internet allowed)
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  aggregate    pass@1 / pass@k (Chen et al.) + bootstrap CI  (RECOMPUTED)   │
  │  publish      assemble publish/ layout -> upload_large_folder to 3 repos   │
  └──────────────────────────────────────────────────────────────────────────┘
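The INNER layer in the score box combines per-sample resource limits with a wall-clock kill enforced by the parent. The sketch below illustrates that idea only. It swaps in `subprocess` with `preexec_fn` for the harness-native `multiprocessing.Process`, applies a subset of the listed rlimits with illustrative values, and does not reproduce the OUTER apptainer layer. It is POSIX-only.

```python
import resource
import subprocess
import sys


def _cap(kind: int, wanted: int) -> None:
    """Lower one rlimit without asking for more than the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(kind, (wanted, wanted))


def _apply_limits() -> None:
    # Runs in the child between fork and exec. Values are illustrative.
    _cap(resource.RLIMIT_CPU, 10)            # CPU seconds
    _cap(resource.RLIMIT_AS, 2 * 1024 ** 3)  # address space in bytes
    _cap(resource.RLIMIT_NOFILE, 64)         # open file descriptors


def run_untrusted(code: str, wall_s: float = 5.0) -> dict:
    """Run one generated sample; the parent enforces the wall-clock limit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True,
            preexec_fn=_apply_limits,
            timeout=wall_s,  # on expiry the child is killed by the parent
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}
```

The timeout lives in the parent, so a sample that blocks or ignores signals inside the child cannot extend its own deadline, which is the point of preferring a parent-side kill over `signal.alarm`.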

The run lives entirely under $BENCHFORGE_SCRATCH/<run.name>/:

$BENCHFORGE_SCRATCH/<run.name>/
  shards/manifest.jsonl                  SINGLE SOURCE OF TRUTH; one JSON line per shard
  manifests/<stage>-<shard_id>.done      done marker; <stage> in {generate, score}
  manifests/<stage>-<shard_id>.committed append-only resume cursor of committed keys
  gen/<shard_id>/samples.jsonl           GEN output (append + fsync + atomic rename)
  score/<shard_id>/results.jsonl         SCORE output (append-only, per-sample idempotent)
  publish/benchforge-results/            assembled per-sample Parquet (Hive layout)
  publish/benchforge-leaderboard/        assembled aggregate Parquet

The stage table

The CLI is python -m benchforge.cli <sub> [--config configs/X.yaml] [--shard-id N] [--resume] [--dry-run] [--force]. Subcommands, in run order:

| Stage | CLI subcommand | Where it runs | Internet | What it does |
|---|---|---|---|---|
| probe | (shell) | login + a probe job | n/a | scripts/daic_autodetect.sh + scripts/sandbox_probe.sh resolve the PLACEHOLDER_* gres/partition/constraint/QOS and the sandbox mode into ~/.benchforge.env. |
| stage | stage | login / staging | yes | Download pinned model revisions + dataset revisions into $BENCHFORGE_HF_HOME. Gated models are staged here. |
| pull-image | pull-image | login / staging | yes | Pull the vLLM SIF (primary, then fallback) and the per-family CPU scoring SIFs into $BENCHFORGE_SIF_DIR; record digests. |
| plan | plan | login / staging | no | Deterministically enumerate model x benchmark x scenario x sample-window into shards/manifest.jsonl. |
| smoke | smoke | login + tiny GPU job | no | Tiny end-to-end check: image boots, one chat template, prompt-diff, offline-toolchain readiness gate (also gates opt-in Tier-A-extended). |
| generate | generate | GPU compute (array) | no | Per gen shard: start vLLM endpoint, run OpenAI-client harness, write samples.jsonl, write generate-<id>.done on row-count reconcile. |
| score | score | CPU compute (array) | no | Per score shard: execute/grade in the two-layer sandbox, write results.jsonl. Refuses shards without generate-<id>.done. |
| aggregate | aggregate | login / staging | no | Recompute pass@1 / pass@k + bootstrap CI from raw results.jsonl. |
| publish | publish | login / staging | yes | Assemble publish/ layout and upload_large_folder to the three HF repos. |
| status | status | anywhere | no | Report per-shard pending/running/done/failed from markers + manifest. |
| verify | verify | anywhere | no | Re-check artifact integrity (row counts, schema, done markers). |
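The plan stage is described as a deterministic enumeration of model x benchmark x scenario x sample-window. One way to get that property is to sort every axis before taking the product, so the manifest does not depend on config ordering. The sketch below assumes integer shard ids and a fixed window size; the field names and the `plan_shards` signature are hypothetical, not the project's manifest schema.

```python
import itertools
import json


def plan_shards(models, cells, n_problems, window):
    """Enumerate shards in a stable order.

    models:     iterable of model ids
    cells:      iterable of (benchmark, scenario) pairs
    n_problems: dict mapping benchmark -> number of problems
    window:     problems per shard (hypothetical knob)
    """
    shards = []
    for model, (bench, scenario) in itertools.product(sorted(models), sorted(cells)):
        for start in range(0, n_problems[bench], window):
            stop = min(start + window, n_problems[bench])
            shards.append({
                "shard_id": len(shards),  # dense ids suit a 0..N-1 SLURM array
                "model": model,
                "benchmark": bench,
                "scenario": scenario,
                "start": start,
                "stop": stop,
                "n": stop - start,       # expected row count for reconciliation
            })
    return shards


def manifest_lines(shards):
    """Serialize one JSON object per line with sorted keys for stable diffs."""
    return [json.dumps(s, sort_keys=True) for s in shards]
```

Two runs over the same config then produce byte-identical manifests, which is what lets the manifest act as the single source of truth for status and verify.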

Make targets

The Makefile wraps the CLI and reads config via scripts/cfg.py. Targets mirror the stage table:

make stage          # download staged weights + datasets        (login, online)
make pull-image     # pull vLLM + scoring SIFs                   (login, online)
make plan           # write shards/manifest.jsonl               (login, offline)
make smoke          # tiny end-to-end readiness gate            (login + tiny GPU)
make generate       # submit the GPU backfill array             (sbatch)
make score          # submit the CPU scoring array              (sbatch)
make aggregate      # recompute pass@1/pass@k + CI from raw     (login, offline)
make publish        # upload to the three HF repos              (login, online)
make status         # per-shard pending/running/done/failed
make verify         # artifact integrity re-check

scripts/submit.sh overrides each sbatch's #SBATCH PLACEHOLDER lines from ~/.benchforge.env at submit time, so the same sbatch files work across partitions.
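The submit-time sizing of the GPU array can be written down as simple arithmetic: FREE is the live idle GPU count minus the configured headroom, and it becomes the `%` throttle on the array. The sketch below is not scripts/submit.sh; the function names, the clamp to the shard count, and the refusal when nothing is free are assumptions added for illustration. The flag defaults are the ones quoted in this README.

```python
def array_spec(n_shards: int, idle_gpus: int, headroom_gpus: int = 2) -> str:
    """Return --array=0-(N-1)%FREE with FREE = idle GPUs minus headroom."""
    if n_shards < 1:
        raise ValueError("nothing to submit")
    free = idle_gpus - headroom_gpus
    if free < 1:
        # Assumption: with no spare capacity the polite choice is not to submit.
        raise RuntimeError("no idle capacity after headroom")
    free = min(free, n_shards)  # assumption: never throttle above the shard count
    return f"--array=0-{n_shards - 1}%{free}"


def gpu_sbatch_flags(n_shards: int, idle_gpus: int, env: dict) -> list[str]:
    """Assemble the backfill flags, reading overrides from an env mapping."""
    return [
        f"--qos={env.get('BENCHFORGE_GPU_QOS', 'long')}",
        f"--nice={env.get('BENCHFORGE_NICE', '1000000')}",
        f"--time={env.get('BENCHFORGE_GPU_TIME', '04:00:00')}",
        "--requeue",
        "--signal=B:USR1@120",
        "--open-mode=append",
        array_spec(n_shards, idle_gpus),
    ]
```

For example, 40 shards with 10 idle GPUs and the default headroom of 2 yields `--array=0-39%8`.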


The three Hugging Face repos

All three live under the D4vidHuang namespace.

  1. benchforge-results — per-sample Parquet, one row per generated sample. Hive-partitioned data/benchmark=<b>/model=<org__model>/part-*.parquet, one README YAML config per benchmark. https://huggingface.co/datasets/D4vidHuang/benchforge-results
  2. benchforge-leaderboard — aggregate Parquet, one row per model x benchmark, recomputed from raw at publish (never a running tally). https://huggingface.co/datasets/D4vidHuang/benchforge-leaderboard
  3. benchforge-leaderboard-space — a Gradio Space using the gradio_leaderboard component, calling load_dataset() at startup against the leaderboard repo. https://huggingface.co/spaces/D4vidHuang/benchforge-leaderboard-space

Upload is via HfApi().upload_large_folder(repo_type='dataset', num_workers=16) from a login node only.
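As a rough illustration of the results layout and the upload call named above, the sketch below builds one Hive-style partition path and wraps `upload_large_folder` for a dataset repo. The `/` to `__` mapping is inferred from the `<org__model>` placeholder, the zero-padded part number is hypothetical, and both helper names are made up for this example. `huggingface_hub` is imported lazily so the path helper runs without it.

```python
def partition_path(benchmark: str, model_id: str, part: int = 0) -> str:
    """Build data/benchmark=<b>/model=<org__model>/part-*.parquet for one file.

    Assumption: the model directory is the HF model id with '/' replaced by '__'.
    """
    model_dir = model_id.replace("/", "__")
    return f"data/benchmark={benchmark}/model={model_dir}/part-{part:05d}.parquet"


def publish_dataset(repo_id: str, folder_path: str) -> None:
    """Upload an assembled publish/ folder to a dataset repo (login node only)."""
    from huggingface_hub import HfApi  # lazy import: needs network + credentials

    HfApi().upload_large_folder(
        repo_id=repo_id,
        folder_path=folder_path,
        repo_type="dataset",
        num_workers=16,
    )
```

`upload_large_folder` is resumable, so an interrupted publish can be rerun against the same folder.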


Offline / online split

benchForge enforces a hard split between nodes that may touch the internet and nodes that may not:

  • Login / staging nodes (internet allowed): the only place that does model + dataset download, SIF pulls, gated fetches, and Hub uploads. This is stage, pull-image, and publish.
  • Compute nodes (no internet): run with HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, reading only staged scratch files. This is generate and score. offline.hf_hub_offline=1 is the default.

DAIC-UNVERIFIED caveat box

Several values are not yet confirmed against the live cluster and are written as PLACEHOLDER_* in configs/default.yaml, each pointing at the probe that resolves it. Do not invent a gres or partition — run the probe first.

  • compute.gpu_partition / compute.cpu_partition: PLACEHOLDER_gpu / PLACEHOLDER_cpu, resolved by scripts/daic_autodetect.sh.
  • compute.gres: gpu:PLACEHOLDER_pro6000:1, resolved by daic_autodetect.sh.
  • compute.constraint: PLACEHOLDER_sm120, resolved by daic_autodetect.sh.
  • compute.account: PLACEHOLDER_account, resolved by daic_autodetect.sh.
  • BENCHFORGE_SANDBOX_MODE: apptainer-setuid | apptainer-userns | rlimit-only, resolved by scripts/sandbox_probe.sh.
  • engine.image (nvcr.io/nvidia/vllm:25.09-py3, sm_120): treated as unverified until the smoke gate confirms the image boots on Blackwell; the resolved digest is stamped per row.

Quickstart

This quickstart is for a DAIC login node. See RUN_ON_DAIC.md for the full runbook (tmux bootstrap, probing, morning troubleshooting).

# 0. one-time: write per-user overrides
cp scripts/benchforge.env.example ~/.benchforge.env   # then edit scratch path etc.
source scripts/activate_env.sh                         # exports BENCHFORGE_* + folds in ~/.benchforge.env

# 1. probe the genuinely-unknown cluster facts into ~/.benchforge.env
bash scripts/daic_autodetect.sh
bash scripts/sandbox_probe.sh

# 2. stage weights/datasets + pull images (ONLINE, login node)
make stage
make pull-image

# 3. plan the run (offline)
make plan
make status

# 4. tiny readiness gate (image boots, one chat template, offline toolchain)
make smoke

# 5. the soak: GPU generate -> CPU score (backfill arrays)
make generate
make score

# 6. recompute + publish (ONLINE, login node)
make aggregate
make publish

Honest limits

benchForge is deliberately honest about what is unproven and what it cannot guarantee.

  • vLLM non-determinism. Batched inference is not bitwise-deterministic even at temperature=0 — token streams can differ run-to-run. benchForge therefore does not promise identical tokens; it promises a reproducible protocol and auditable artifacts, treats scores as the reproducible quantity, and reports bootstrap CIs to quantify residual variance. Exactly-one chat template is pinned and prompt-diffed.
  • Contamination. LiveCodeBench is pinned to release_v6 with a per-model date window and a contamination_flag, but no benchmark is contamination-proof; treat cross-model comparisons accordingly and read the flag.
  • Sandbox is defense-in-depth, not a proof. All generated code is hostile. The real boundary is apptainer (--net none, read-only binds, scratch-only tmp) plus per-sample setrlimit and a hard parent-side wall-clock kill. The python builtin blacklist is a documented soft layer only. Whether the outer layer is setuid-apptainer, userns-apptainer, or rlimit-only depends on sandbox_probe.sh.
  • NGC auth. The primary vLLM image lives on NGC (nvcr.io/...) and may require authentication to pull; if it is unavailable, benchForge falls back to vllm/vllm-openai:latest. The sm_120 image is DAIC-unverified until smoke.
  • Backfill, not a reservation. benchForge only soaks idle GPUs as a low-priority backfill citizen (high --nice, short --time). DAIC has live preemption disabled, so throughput depends entirely on how much idle capacity exists; monitor_backfill.sh measures the realized GPU-second floor and recommends QOS/nice fallbacks.
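Since scores rather than tokens are the reproducible quantity, the bootstrap CI carries much of the weight in these limits. The sketch below shows a bootstrap over problems using the percentile method; the README does not say which bootstrap variant is used, so the percentile choice, the resample count, and the function name are assumptions.

```python
import random


def bootstrap_ci(per_problem: list[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Return (mean, lo, hi) from a percentile bootstrap over problems.

    per_problem holds one score per problem, e.g. 0.0/1.0 for greedy pass@1.
    """
    if not per_problem:
        raise ValueError("no problems to resample")
    rng = random.Random(seed)  # fixed seed keeps the interval re-computable
    n = len(per_problem)
    means = sorted(
        sum(rng.choices(per_problem, k=n)) / n  # resample problems with replacement
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(per_problem) / n, lo, hi
```

Resampling whole problems (not individual samples) matches the "bootstrap-over-problems" wording and treats the benchmark's problem set as the source of variance.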

Where to read next

  • DESIGN.md — full architecture, locked-decisions table, published data schema, sharding, requeue/idempotency, staging, licensing/contamination, open questions to probe, smoke-vs-full.
  • RUN_ON_DAIC.md — the one-page runbook: tmux bootstrap, what is genuinely unknown and how it is probed, the full step-by-step, where things live, and a morning troubleshooting list.

About

Forge a reproducible code/LLM eval leaderboard on idle TU Delft DAIC RTX Pro 6000 (Blackwell, sm_120) GPUs — sibling of preCal. Strict GPU-generate / CPU-score decouple, requeue-safe SLURM backfill, published to the HF Hub.
