Code and data for reproducing the post-hoc analysis of vacuity-based out-of-distribution (OOD) detection in Evidential Deep Learning (EDL), as described in the accompanying paper: Rethinking Vacuity for OOD Detection in Evidential Deep Learning.
This repository investigates how sensitive vacuity, used as the Uncertainty Mass (UM), is to variation in the number of classes K on multiple-choice QA benchmarks.
The key question: does vacuity-based OOD detection succeed because the model is genuinely more uncertain on OOD data, or because vacuity is an artefact of class cardinality K?
Vacuity is defined as:
vacuity = K / S, where S = sum(alpha_i) = sum(e_i + 1)
Because K appears directly in the numerator, datasets with different numbers of answer choices will produce systematically different vacuity distributions, independent of model uncertainty. The synthetic experiment in synthetic_auroc_analysis.py isolates this effect.
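As a minimal sketch of the formula above, the snippet below computes vacuity for one prediction. The helper name `vacuity` and the evidence values are illustrative assumptions, not taken from the repository. Adding a fifth answer choice with zero evidence raises vacuity even though the model has collected no less evidence:

```python
def vacuity(evidence):
    """Vacuity K / S for one prediction, with S = sum(e_i + 1)."""
    k = len(evidence)
    s = sum(e + 1.0 for e in evidence)
    return k / s

# Hypothetical evidence vectors: same total evidence (8.0), different K.
four_way = [5.0, 1.0, 1.0, 1.0]
five_way = [5.0, 1.0, 1.0, 1.0, 0.0]

v4 = vacuity(four_way)  # 4 / 12
v5 = vacuity(five_way)  # 5 / 13
print(f"K=4: {v4:.3f}  K=5: {v5:.3f}")  # K=4: 0.333  K=5: 0.385
```

With total evidence E held fixed, vacuity equals K / (E + K), which grows with K whenever E > 0.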
.
├── train_ib_edl.py # Fine-tune Llama-3-8B with IB-EDL loss
├── train_standard_edl.py # Fine-tune Llama-3-8B with Standard EDL loss
├── inference_ib_edl.py # Generate predictions with IB-EDL checkpoint
├── inference_standard_edl.py # Generate predictions with Standard EDL checkpoint
│
├── auroc_npz_uncertainty.py # AUROC via uncertainty mass & max probability (NPZ inputs)
├── auroc_entropy_analysis.py # AUROC via Shannon entropy, uncertainty, and max probability (JSON inputs)
├── synthetic_auroc_analysis.py # Synthetic K-expansion experiment (main analysis)
├── synthetic_vacuity_results.csv # Output results table from the synthetic experiment
├── synthetic_auroc.png # AUROC vs K plot (generated output)
├── synthetic_aupr.png # AUPR vs K plot (generated output)
│
├── Implementation_A_ib-edl_sigma_mult_0/ # IB-EDL outputs from the authors' original code (sigma_mult=0)
├── Implementation_A_standard_edl/ # Standard EDL outputs from the authors' original code
├── Implementation_B_ib-edl_sigma_mult_0/ # IB-EDL outputs from our reimplementation (sigma_mult=0)
└── Implementation_B_standard_edl/ # Standard EDL outputs from our reimplementation
Datasets. All benchmarks are publicly available. Prediction files use the following abbreviations:
| Abbreviation | Dataset | Role |
|---|---|---|
| obqa | OpenBookQA (4 classes) | ID |
| arc_c | ARC-Challenge (4 classes) | OOD |
| arc_e | ARC-Easy (4 classes) | OOD |
| csqa | CommonsenseQA (4 or 5 classes) | OOD |
| mmlu_math | MMLU: Abstract Algebra (4 classes) | OOD |
Analysis scripts:
pip install numpy scipy scikit-learn pandas matplotlib
Training and inference scripts:
pip install torch transformers peft datasets bitsandbytes accelerate wandb
A HuggingFace account with access to meta-llama/Meta-Llama-3-8B is required. Set your token before running:
export HF_TOKEN=<your_token>
Both training scripts fine-tune meta-llama/Meta-Llama-3-8B on OpenBookQA using LoRA (target modules: q_proj, v_proj, lm_head; r=8). Training runs for 10,080 steps with batch size 4 and learning rate 5e-5.
Set the output directory via the OUTPUT_DIR environment variable (default: ./out-ib-edl or ./out-standard-edl).
# IB-EDL
python train_ib_edl.py
# Standard EDL
python train_standard_edl.py
WandB logging is enabled by default. Set WANDB_API_KEY in your environment or run wandb login beforehand.
Both inference scripts load a trained checkpoint and generate prediction JSON files for all five evaluation datasets (OBQA, ARC-C, ARC-E, MMLU-Math, CSQA).
Set paths via environment variables:
| Variable | Description | Default |
|---|---|---|
| PEFT_PATH | Path to the trained LoRA checkpoint | ./out-*/checkpoint-10080 |
| OUTPUT_DIR | Directory to write JSON result files | . |
# IB-EDL
PEFT_PATH=./out-ib-edl/checkpoint-10080 python inference_ib_edl.py
# Standard EDL
PEFT_PATH=./out-standard-edl/checkpoint-10080 python inference_standard_edl.py
Each script writes one JSON file per dataset (e.g. ib_edl_obqa_results.json). CSQA produces two files: one with K=4 and one with K=5.
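The environment-variable handling described above might look roughly like the following. This is a sketch under assumptions: the Python variable names are hypothetical, and the IB-EDL default path is one reading of the `./out-*/checkpoint-10080` pattern. The actual scripts may resolve paths differently.

```python
import os

# Fall back to the documented defaults when the variables are unset.
# peft_path / output_dir are illustrative names, not the repository's.
peft_path = os.environ.get("PEFT_PATH", "./out-ib-edl/checkpoint-10080")
output_dir = os.environ.get("OUTPUT_DIR", ".")

# Make sure the output directory exists before writing result files.
os.makedirs(output_dir, exist_ok=True)
results_file = os.path.join(output_dir, "ib_edl_obqa_results.json")
print(f"Loading adapter from {peft_path}; writing {results_file}")
```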
Pre-generated prediction files are included in the Implementation_* directories, so the analysis scripts can be run without re-training.
python synthetic_auroc_analysis.py
Outputs: synthetic_vacuity_results.csv, synthetic_auroc.png, synthetic_aupr.png, and a printed interpretation summary.
python auroc_entropy_analysis.py
Prints per-dataset AUROC and AUPR to stdout. Uncomment the plt.savefig lines to save ROC plots.
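For orientation, the three OOD scores named for this script (Shannon entropy, uncertainty mass, and max probability) can be derived from a Dirichlet parameter vector as sketched below. The function name, the example alpha values, and the choice to take entropy over the expected class probabilities are assumptions for illustration; the script's exact definitions may differ.

```python
import math

def uncertainty_scores(alpha):
    """Return (entropy, uncertainty_mass, max_prob) for Dirichlet parameters alpha."""
    k = len(alpha)
    s = sum(alpha)
    probs = [a / s for a in alpha]  # expected class probabilities
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    uncertainty_mass = k / s        # vacuity
    max_prob = max(probs)
    return entropy, uncertainty_mass, max_prob

# Hypothetical alpha vectors (alpha_i = e_i + 1).
confident = uncertainty_scores([9.0, 1.0, 1.0, 1.0])
uncertain = uncertainty_scores([1.0, 1.0, 1.0, 1.0])
print("confident:", confident)
print("uncertain:", uncertain)
```

Entropy and uncertainty mass increase with uncertainty, whereas max probability decreases, so a score such as `1 - max_prob` is the natural form to use when higher values should indicate OOD.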
python auroc_npz_uncertainty.py
Saves a three-panel AUROC figure to Implementation_A_ib-edl_sigma_mult_0/.
- Baseline (K=4 for both ID and OOD): AUROC ≈ 0.57, i.e. near-random discrimination.
- OOD-only K expansion (ID stays at K=4; OOD inflated to K=5–8): AUROC rises to ≈ 0.92 at K=8, driven entirely by the K term in the vacuity formula.
- Matched expansion (both distributions expanded equally): AUROC remains ≈ 0.57 at all K values, confirming the effect is purely a function of relative class counts rather than genuine uncertainty.
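The mechanism behind these three findings can be illustrated with a toy simulation. It is a sketch only, not the procedure in synthetic_auroc_analysis.py: the gamma-distributed total evidence, the sample sizes, and the assumption that total evidence stays fixed while K changes are all invented for illustration. The AUROC values it prints will therefore not match the figures reported above. AUROC is computed directly as a pairwise rank statistic so the example needs only the standard library.

```python
import random

def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score (ties count 0.5)."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def vacuity(total_evidence, k):
    """Vacuity K / S with S = total evidence + K."""
    return k / (total_evidence + k)

rng = random.Random(0)
# Hypothetical total-evidence samples; OOD receives slightly less evidence on average.
id_evidence = [rng.gammavariate(4.0, 2.0) for _ in range(300)]
ood_evidence = [rng.gammavariate(4.0, 1.8) for _ in range(300)]

id_at_k4 = [vacuity(e, 4) for e in id_evidence]
results = []  # (K, OOD-only AUROC, matched AUROC)
for k in range(4, 9):
    ood_at_k = [vacuity(e, k) for e in ood_evidence]
    id_at_k = [vacuity(e, k) for e in id_evidence]
    results.append((k, auroc(id_at_k4, ood_at_k), auroc(id_at_k, ood_at_k)))

for k, ood_only, matched in results:
    print(f"K={k}: OOD-only AUROC={ood_only:.3f}  matched AUROC={matched:.3f}")
```

Under these assumptions the matched AUROC stays constant because, for a shared K, vacuity is a monotone function of total evidence and the ranking of examples does not change. The OOD-only AUROC can only grow with K, since every OOD score increases while the ID scores stay put.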
If you use this code or data, please cite the accompanying paper:
@misc{mcnamara2026rethinkingvacuityooddetection,
title={Rethinking Vacuity for OOD Detection in Evidential Deep Learning},
author={Claire McNamara},
year={2026},
eprint={2605.06382},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.06382},
}
The benchmark questions included in the JSON files are drawn from publicly available datasets released under their respective licenses (see the dataset links above). The analysis code in this repository is released under the MIT License.