Delulu

A verified multilingual benchmark for code-completion hallucinations.
Every golden completion compiles. Every hallucination provably doesn't.


📄 Read the preprint on arXiv →

[Screenshot: Delulu benchmark viewer showing a verified hallucination sample]

Every Delulu sample ships as a self-contained Docker image. The viewer above lets you browse the dataset, pull a sample's verifier, and re-run verify golden / verify hallucinated / verify patch <your-completion> with one click, so every label is grounded in a real compile / type-check outcome.


Why Delulu?

Modern code LLMs hallucinate confidently: they invent functions that don't exist, pass parameters the API never accepted, and import modules nobody wrote. Most "hallucination" benchmarks score this with a textual judge, but plausible-looking code is exactly what these models are good at, so judges disagree and scores drift.

Delulu grounds every label in execution. For each Fill-in-the-Middle context we ship:

  • a golden completion that compiles cleanly in the original repository, and
  • a hallucinated completion that looks plausible but provably fails to compile: the symbol doesn't exist, the method signature is wrong, the import resolves to nothing.

Both completions are verified inside a per-sample Docker image that runs the project's build/type-check toolchain, so every label is grounded in a real compiler or type-checker outcome rather than a model's opinion.
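
For illustration, here is a minimal Python sketch of a single verification call, assuming the per-sample image name is already at hand (the variable name image is just a placeholder) and that the candidate completion is piped to docker run -i <image> verify patch on stdin, as the metrics tool does:

import subprocess

def verify_patch(image: str, completion: str) -> bool:
    # Pipe the candidate completion into the sample's verifier image;
    # exit code 0 means the patched file compiles / type-checks cleanly.
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", image, "verify", "patch"],
        input=completion.encode(),
        capture_output=True,
    )
    return result.returncode == 0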

Highlights

  • 🧪 Compiler-grounded. Every hallucination has a reproducible compile / type-check failure inside its own Docker image.
  • 🌍 7 languages, 1 schema. C++, C#, Go, Java, TypeScript, Python, Rust, with unified prefix / suffix / golden / hallucinated columns.
  • 🔬 Real repos, real APIs. Samples are mined from permissively licensed GitHub projects, with license and repo_url recorded per row.
  • 🧰 Two evaluation modes out of the box: an LLM-as-judge harness and a compile-based pass@1 + offline-metrics runner.
  • 🖥️ Browsable viewer. A local UI (also shipped as a Docker image) shows the diff and the compiler error trace, and lets you re-run any patch live.
  • 📦 One row, one image. No flaky environment setup: the verifier carries its own toolchain.

Stats

Languages: C++, C#, Go, Java, TypeScript, Python, Rust
Samples: 1,947
Unique FIM contexts: 947
Hallucination types: import, method, parameter, undefinedvariable

Samples per language:

  Language     Count
  TypeScript     420
  Python         370
  Go             291
  Rust           252
  C#             246
  Java           243
  C++            125

Samples per hallucination type:

  Hallucination type    Count
  undefinedvariable       576
  import                  476
  method                  460
  parameter               435

Quickstart

from tools.load import load_delulu
df = load_delulu()
print(df.shape, df["language"].value_counts())
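
As a quick example of slicing the frame, you can filter on the per-row columns; language is used above, while the column name hallucination_type here is an assumption, so check df.columns in your copy:

# Keep only Rust samples whose hallucination is a bogus import.
rust_imports = df[(df["language"] == "Rust") & (df["hallucination_type"] == "import")]
print(len(rust_imports))
print(rust_imports[["prefix", "golden", "hallucinated"]].head())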

Or pull it straight from the Hub:

from datasets import load_dataset
ds = load_dataset("microsoft/delulu-fim-benchmark", split="test")

For an end-to-end 5-minute walkthrough on a 14-sample mini set, see examples/quickstart.md.

The two evaluation tools

1. Judge: Can a foundation model tell which completion is hallucinated?

evaluations/run_delulu_judges.py shows a judge model the prefix, suffix, and a candidate completion, asks it to score 0 or 1, and counts a sample as correct only when the judge scores golden=1 AND hallucinated=0.

cp .env.example .env
pip install -e ".[eval]"
python evaluations/run_delulu_judges.py --models GPT-5.5 Claude-4.5-Sonnet

Per-model caches are written under evaluations/results/, so runs are resumable. See evaluations/README.md.
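
For clarity, a minimal sketch of the pairwise scoring rule described above; the record layout (golden_score / hallucinated_score fields) is illustrative, not the harness's actual cache format:

def paired_accuracy(records):
    # A sample is correct only when the judge accepts the golden completion
    # (score 1) AND rejects the hallucinated one (score 0).
    correct = sum(
        1 for r in records
        if r["golden_score"] == 1 and r["hallucinated_score"] == 0
    )
    return correct / len(records)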

2. Metrics: Do the model's completions actually compile? How close are they to the golden truth?

evaluations/run_completion_metrics.py takes a CSV of model predictions (columns: benchmark_id, model_completion) and computes:

  • pass@1: compile-based; each completion is piped into the per-sample Docker verifier image (docker run -i <image> verify patch), which runs the project's build / type-check toolchain.
  • exact_match: strict equality with the golden completion.
  • edit_similarity: character-level normalised Levenshtein similarity against the golden completion.
  • hallucination_rate: share of completions closer to the hallucinated variant than to the golden one.
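
To produce that predictions CSV, a sketch along these lines works; the dataset id column is assumed here to be called benchmark_id to match the expected CSV header, and complete_fim stands in for whatever model call you use:

import csv
from tools.load import load_delulu

df = load_delulu()
with open("my_model.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["benchmark_id", "model_completion"])
    writer.writeheader()
    for row in df.itertuples():
        completion = complete_fim(row.prefix, row.suffix)  # placeholder for your model
        writer.writerow({"benchmark_id": row.benchmark_id,
                         "model_completion": completion})
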
# Fast smoke-test: 2 samples per language (14 total)
python evaluations/run_completion_metrics.py \
    --predictions my_model.csv \
    --model-name my-model \
    --smoke-test

# Full run
python evaluations/run_completion_metrics.py \
    --predictions my_model.csv \
    --model-name my-model
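
For reference, a minimal sketch of the two similarity-based metrics as defined above; the actual runner's edit-distance implementation and tie handling may differ:

def edit_similarity(a: str, b: str) -> float:
    # Character-level normalised Levenshtein similarity in [0, 1].
    if not a and not b:
        return 1.0
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 1.0 - prev[n] / max(m, n)

def hallucination_rate(predictions, goldens, hallucinated):
    # Share of predictions that sit closer to the hallucinated variant
    # than to the golden completion.
    closer = sum(
        1 for p, g, h in zip(predictions, goldens, hallucinated)
        if edit_similarity(p, h) > edit_similarity(p, g)
    )
    return closer / len(predictions)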

Verifier images are pulled from ${DELULU_REGISTRY} (defaults to a pre-release registry; the public Docker Hub mirror is TBA).

Viewer

A browsable UI for the dataset is shipped as a Docker image; see tools/viewer/README.md:

docker run --rm -p 127.0.0.1:8000:8000 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    delulubench/delulu-viewer:v1

The viewer is what's shown in the screenshot at the top.

What's in the repo

data/                # delulu.csv + datasheet
tools/
  load.py / stats.py / slice.py / show.py    # CLIs for working with the dataset
  viewer/                                    # browser UI (also a Docker image)
evaluations/
  run_delulu_judges.py        # LLM-as-judge harness (the "judge" tool)
  run_completion_metrics.py   # pass@1 + offline metrics (the "metrics" tool)
examples/            # 5-minute walkthrough on a 14-sample mini set

Citation

@inproceedings{Erfanian2026Delulu,
  title={Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks},
  author={Mahdi Erfanian and Nelson Troncoso and Aashna Garg and Amabel Gale and Xiaoyu Liu and Pareesa Ameneh Golnari and Shengyu Fu},
  year={2026},
  url={https://arxiv.org/abs/2605.07024}
}

A machine-readable citation is in CITATION.cff.

License

  • Code in this repository: MIT (see LICENSE).
  • Data: each sample inherits the license of its source repository, recorded in the license and repo_url columns. See DATA_LICENSE.md.
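
If you redistribute slices of the data, the per-row provenance can be inspected directly from the frame returned by load_delulu (license and repo_url are the columns noted above):

from tools.load import load_delulu

df = load_delulu()
print(df["license"].value_counts())                          # which source licenses appear
print(df[["repo_url", "license"]].drop_duplicates().head())  # sample of source repos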
