A verified multilingual benchmark for code-completion hallucinations.
Every golden completion compiles. Every hallucination provably doesn't.
📖 Read the preprint on arXiv →
Every Delulu sample ships as a self-contained Docker image. The viewer above lets you browse the dataset, pull a sample's verifier, and re-run `verify golden` / `verify hallucinated` / `verify patch <your-completion>` with one click, so every label is grounded in a real compile / type-check outcome.
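The same entrypoint can be driven from a script. Below is a minimal sketch, assuming the completion is piped on stdin (the pattern used by the metrics runner further down) and that a zero exit code means the project still compiles / type-checks; `image` is a placeholder for a sample's verifier image name, which is not hard-coded here.

```python
import subprocess

def verify_patch(image: str, completion: str) -> bool:
    """Pipe a candidate completion into a sample's Docker verifier.

    Sketch only: assumes the verifier reads the patch from stdin and
    signals success with exit code 0.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", image, "verify", "patch"],
        input=completion.encode("utf-8"),
        capture_output=True,
    )
    return result.returncode == 0
```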
Modern code LLMs hallucinate confidently: they invent functions that don't exist, pass parameters the API never accepted, and import modules nobody wrote. Most "hallucination" benchmarks score this with a textual judge, but plausible-looking code is exactly what these models are good at, so judges disagree and scores drift.
Delulu grounds every label in execution. For each Fill-in-the-Middle context we ship:
- a golden completion that compiles cleanly in the original repository, and
- a hallucinated completion that looks plausible but provably fails to compile: the symbol doesn't exist, the method signature is wrong, the import resolves to nothing.
Both completions are verified inside a per-sample Docker image that runs the project's build/type-check toolchain, so every label is grounded in a real compiler or type-checker outcome, not in a model's opinion.
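To make the contrast concrete, here is a purely hypothetical, Python-flavoured example (not a row from the dataset): the golden completion uses an API that exists, while the hallucinated one invents a method, so the verifier's type check rejects it.

```python
# Hypothetical illustration only; field values are invented for this example.
sample = {
    "prefix": "import json\n\ndef load_config(path):\n    with open(path) as f:\n        return ",
    "suffix": "\n",
    "golden": "json.load(f)",             # real API: passes the type check
    "hallucinated": "json.parse_file(f)",  # plausible, but json has no parse_file
}
```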
- 🧪 Compiler-grounded. Every hallucination has a reproducible compile / type-check failure inside its own Docker image.
- 🌍 7 languages, 1 schema. C++, C#, Go, Java, TypeScript, Python, Rust, with unified `prefix`/`suffix`/`golden`/`hallucinated` columns.
- 🔬 Real repos, real APIs. Samples are mined from permissively licensed GitHub projects, with `license` and `repo_url` recorded per row.
- 🧰 Two evaluation modes out of the box: an LLM-as-judge harness and a compile-based pass@1 + offline-metrics runner.
- 🖥️ Browsable viewer. A local UI (also a Docker image) shows the diff and the compiler error trace, and lets you re-run any patch live.
- 📦 One row, one image. No flaky environment setup: the verifier carries its own toolchain.
| Languages | Samples | Unique FIM contexts | Hallucination types |
|---|---|---|---|
| C++, C#, Go, Java, TypeScript, Python, Rust | 1,947 | 947 | import, method, parameter, undefined variable |
```python
from tools.load import load_delulu

df = load_delulu()
print(df.shape, df["language"].value_counts())
```

Or pull it straight from the Hub:

```python
from datasets import load_dataset

ds = load_dataset("microsoft/delulu-fim-benchmark", split="test")
```

For an end-to-end 5-minute walkthrough on a 14-sample mini set, see examples/quickstart.md.
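As a small follow-up, you can slice the loaded frame by language directly. This assumes the loader returns a pandas DataFrame with the columns listed above and that language values match the names in the table ("Python" here is an assumed value).

```python
# Keep only the Python-language rows and peek at the FIM fields.
python_rows = df[df["language"] == "Python"]
print(python_rows[["prefix", "golden", "hallucinated"]].head())
```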
evaluations/run_delulu_judges.py shows a judge model the prefix, suffix,
and a candidate completion, asks it to score 0 or 1, and counts a sample
as correct only when the judge scores golden=1 AND hallucinated=0.
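A minimal sketch of that pairwise scoring rule, assuming a hypothetical `judge_score(prefix, suffix, completion)` callable that returns 0 or 1; the real prompting and model calls live in evaluations/run_delulu_judges.py.

```python
def sample_correct(judge_score, row) -> bool:
    # Correct only if the judge accepts the golden completion AND
    # rejects the hallucinated one for the same prefix/suffix context.
    accepts_golden = judge_score(row["prefix"], row["suffix"], row["golden"]) == 1
    rejects_halluc = judge_score(row["prefix"], row["suffix"], row["hallucinated"]) == 0
    return accepts_golden and rejects_halluc

def judge_accuracy(judge_score, rows) -> float:
    return sum(sample_correct(judge_score, r) for r in rows) / len(rows)
```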
```bash
cp .env.example .env
pip install -e ".[eval]"
python evaluations/run_delulu_judges.py --models GPT-5.5 Claude-4.5-Sonnet
```

Per-model caches are written under evaluations/results/, so runs are resumable. See evaluations/README.md.
evaluations/run_completion_metrics.py takes a CSV of model predictions (columns: benchmark_id, model_completion) and computes:

- pass@1: compile-based, by piping each completion into the per-sample Docker verifier image (`docker run -i <image> verify patch`), which runs the project's build / type-check toolchain.
- exact_match: strict equality with the golden completion.
- edit_similarity: char-level normalised Levenshtein similarity vs. the golden completion.
- hallucination_rate: share of completions closer to the hallucinated variant than to the golden one.
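For reference, here is a minimal sketch of the two string-based metrics. It assumes edit_similarity is 1 minus the Levenshtein distance divided by the longer string's length, and that hallucination_rate compares that similarity against the golden and hallucinated references; the actual runner's normalisation and tie handling may differ.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming edit distance.
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, gold: str) -> float:
    # Char-level normalised Levenshtein similarity in [0, 1].
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

def hallucination_rate(rows) -> float:
    # Share of predictions strictly closer to the hallucinated variant
    # than to the golden completion.
    closer = sum(
        edit_similarity(r["model_completion"], r["hallucinated"])
        > edit_similarity(r["model_completion"], r["golden"])
        for r in rows
    )
    return closer / len(rows)
```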
```bash
# Fast smoke-test: 2 samples per language (14 total)
python evaluations/run_completion_metrics.py \
    --predictions my_model.csv \
    --model-name my-model \
    --smoke-test

# Full run
python evaluations/run_completion_metrics.py \
    --predictions my_model.csv \
    --model-name my-model
```

Verifier images are pulled from ${DELULU_REGISTRY} (defaults to a pre-release registry; the public Docker Hub mirror is TBA).
A browsable UI for the dataset is shipped as a Docker image; see tools/viewer/README.md:

```bash
docker run --rm -p 127.0.0.1:8000:8000 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    delulubench/delulu-viewer:v1
```

The viewer is what's shown in the screenshot at the top.
```
data/                                       # delulu.csv + datasheet
tools/
  load.py / stats.py / slice.py / show.py   # CLIs for working with the dataset
  viewer/                                   # browser UI (also a Docker image)
evaluations/
  run_delulu_judges.py                      # LLM-as-judge harness (the "judge" tool)
  run_completion_metrics.py                 # pass@1 + offline metrics (the "metrics" tool)
examples/                                   # 5-minute walkthrough on a 14-sample mini set
```
```bibtex
@inproceedings{Erfanian2026Delulu,
  title={Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks},
  author={Mahdi Erfanian and Nelson Troncoso and Aashna Garg and Amabel Gale and Xiaoyu Liu and Pareesa Ameneh Golnari and Shengyu Fu},
  year={2026},
  url={https://arxiv.org/abs/2605.07024}
}
```

A machine-readable citation is in CITATION.cff.
- Code in this repository: MIT (see LICENSE).
- Data: each sample inherits the license of its source repository, recorded in the `license` and `repo_url` columns. See DATA_LICENSE.md.
- Paper (arXiv): https://arxiv.org/abs/2605.07024
- Hugging Face dataset: https://huggingface.co/datasets/microsoft/delulu-fim-benchmark
- Docker Hub: TBA
