Vacuity Analysis

Code and data for reproducing the post-hoc analysis of vacuity-based out-of-distribution (OOD) detection in Evidential Deep Learning (EDL), as described in the accompanying paper: Rethinking Vacuity for OOD Detection in Evidential Deep Learning.

Overview

This repository investigates how sensitive vacuity, used as the Uncertainty Mass (UM), is to variation in the class count K on multiple-choice QA benchmarks.

The key question: does vacuity-based OOD detection succeed because the model is genuinely more uncertain on OOD data, or because vacuity is an artefact of class cardinality K?

Vacuity is defined as:

vacuity = K / S,   where S = sum(alpha_i) = sum(e_i + 1)

Because K appears directly in the numerator, datasets with different numbers of answer choices will produce systematically different vacuity distributions, independent of model uncertainty. The synthetic experiment in synthetic_auroc_analysis.py isolates this effect.
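As a small worked example of the formula above, the sketch below computes vacuity for one evidence vector and then for the same evidence with a fifth, zero-evidence answer option appended. The function name `dirichlet_opinion` and the evidence values are illustrative assumptions and are not taken from the repository scripts.

```python
def dirichlet_opinion(evidence):
    """Return (belief masses, vacuity) for a non-negative evidence vector.

    alpha_i = e_i + 1, S = sum(alpha_i), belief_i = e_i / S, vacuity = K / S.
    """
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    s = sum(alpha)
    belief = [e / s for e in evidence]
    return belief, k / s


# Hypothetical evidence for a 4-option question.
belief4, vac4 = dirichlet_opinion([6.0, 1.0, 1.0, 0.0])

# Same evidence, plus one extra option that received no evidence at all.
belief5, vac5 = dirichlet_opinion([6.0, 1.0, 1.0, 0.0, 0.0])

print(f"K=4: vacuity = {vac4:.3f}")  # 4 / 12 = 0.333
print(f"K=5: vacuity = {vac5:.3f}")  # 5 / 13 = 0.385
```

The total evidence is identical in both cases, yet vacuity rises from 4/12 to 5/13 purely because K changed. Belief masses and vacuity sum to 1 in each case.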

Repository structure

.
├── train_ib_edl.py                       # Fine-tune Llama-3-8B with IB-EDL loss
├── train_standard_edl.py                 # Fine-tune Llama-3-8B with Standard EDL loss
├── inference_ib_edl.py                   # Generate predictions with IB-EDL checkpoint
├── inference_standard_edl.py             # Generate predictions with Standard EDL checkpoint
│
├── auroc_npz_uncertainty.py              # AUROC via uncertainty mass & max probability (NPZ inputs)
├── auroc_entropy_analysis.py             # AUROC via Shannon entropy, uncertainty, and max probability (JSON inputs)
├── synthetic_auroc_analysis.py           # Synthetic K-expansion experiment (main analysis)
├── synthetic_vacuity_results.csv         # Output results table from the synthetic experiment
├── synthetic_auroc.png                   # AUROC vs K plot (generated output)
├── synthetic_aupr.png                    # AUPR vs K plot (generated output)
│
├── Implementation_A_ib-edl_sigma_mult_0/ # IB-EDL outputs from the authors' original code (sigma_mult=0)
├── Implementation_A_standard_edl/        # Standard EDL outputs from the authors' original code
├── Implementation_B_ib-edl_sigma_mult_0/ # IB-EDL outputs from our reimplementation (sigma_mult=0)
└── Implementation_B_standard_edl/        # Standard EDL outputs from our reimplementation

Datasets. All benchmarks are publicly available. Prediction files use the following abbreviations:

| Abbreviation | Dataset | Role |
| --- | --- | --- |
| `obqa` | OpenBookQA (4 classes) | ID |
| `arc_c` | ARC-Challenge (4 classes) | OOD |
| `arc_e` | ARC-Easy (4 classes) | OOD |
| `csqa` | CommonsenseQA (4 or 5 classes) | OOD |
| `mmlu_math` | MMLU: Abstract Algebra (4 classes) | OOD |

Requirements

Analysis scripts:

pip install numpy scipy scikit-learn pandas matplotlib

Training and inference scripts:

pip install torch transformers peft datasets bitsandbytes accelerate wandb

A HuggingFace account with access to meta-llama/Meta-Llama-3-8B is required. Set your token before running:

export HF_TOKEN=<your_token>

Fine-tuning

Both training scripts fine-tune meta-llama/Meta-Llama-3-8B on OpenBookQA using LoRA (target modules: q_proj, v_proj, lm_head; r=8). Training runs for 10,080 steps with batch size 4 and learning rate 5e-5.

Set the output directory via the OUTPUT_DIR environment variable (default: ./out-ib-edl or ./out-standard-edl).

# IB-EDL
python train_ib_edl.py

# Standard EDL
python train_standard_edl.py

WandB logging is enabled by default. Set WANDB_API_KEY in your environment or run wandb login beforehand.

Inference

Both inference scripts load a trained checkpoint and generate prediction JSON files for all five evaluation datasets (OBQA, ARC-C, ARC-E, MMLU-Math, CSQA).

Set paths via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| `PEFT_PATH` | Path to the trained LoRA checkpoint | `./out-*/checkpoint-10080` |
| `OUTPUT_DIR` | Directory to write JSON result files | `.` |

# IB-EDL
PEFT_PATH=./out-ib-edl/checkpoint-10080 python inference_ib_edl.py

# Standard EDL
PEFT_PATH=./out-standard-edl/checkpoint-10080 python inference_standard_edl.py

Each script writes one JSON file per dataset (e.g. ib_edl_obqa_results.json). CSQA produces two files: one with K=4 and one with K=5.
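For readers wiring these variables into their own tooling, the following is a minimal sketch of how `PEFT_PATH` and `OUTPUT_DIR` could be resolved with the documented defaults. The helper `resolve_inference_paths` and its `variant` argument are hypothetical; the actual inference scripts may read the environment differently.

```python
import os


def resolve_inference_paths(variant, env=None):
    """Resolve checkpoint and output paths from the environment.

    `variant` is assumed to be "ib-edl" or "standard-edl", matching the
    documented default output directories ./out-ib-edl and ./out-standard-edl.
    """
    env = os.environ if env is None else env
    default_peft = f"./out-{variant}/checkpoint-10080"
    return {
        "peft_path": env.get("PEFT_PATH", default_peft),
        "output_dir": env.get("OUTPUT_DIR", "."),
    }


# With nothing set, the documented defaults apply.
defaults = resolve_inference_paths("ib-edl", env={})

# An explicit PEFT_PATH overrides the default checkpoint location.
custom = resolve_inference_paths(
    "standard-edl", env={"PEFT_PATH": "/data/ckpt", "OUTPUT_DIR": "results"}
)
print(defaults)
print(custom)
```

Passing `env` explicitly keeps the helper easy to test without modifying the real process environment.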

Reproducing the analysis

Pre-generated prediction files are included in the Implementation_* directories, so the analysis scripts can be run without re-training.

1. Synthetic K-expansion experiment (primary result)

python synthetic_auroc_analysis.py

Outputs: synthetic_vacuity_results.csv, synthetic_auroc.png, synthetic_aupr.png, and a printed interpretation summary.

2. AUROC via Shannon entropy / uncertainty / max probability (JSON)

python auroc_entropy_analysis.py

Prints per-dataset AUROC and AUPR to stdout. Uncomment the plt.savefig lines to save ROC plots.
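The three scores named in this step can be sketched from a single vector of Dirichlet parameters as below. This is an illustration under stated assumptions, not the code in `auroc_entropy_analysis.py`: it assumes entropy is taken over the expected class probabilities alpha_i / S, and that max probability is negated so that a higher score always means "more likely OOD". The function name `ood_scores` and the dictionary keys are hypothetical.

```python
import math


def ood_scores(alpha):
    """Compute three OOD scores from Dirichlet parameters alpha (all > 0)."""
    k = len(alpha)
    s = sum(alpha)
    probs = [a / s for a in alpha]  # expected class probabilities
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {
        "entropy": entropy,           # Shannon entropy in nats
        "uncertainty_mass": k / s,    # vacuity, K / S
        "neg_max_prob": -max(probs),  # negated so higher = more uncertain
    }


# A confident prediction: most evidence on the first option.
confident = ood_scores([20.0, 1.0, 1.0, 1.0])

# No evidence at all: alpha_i = 1 for every option.
uninformed = ood_scores([1.0, 1.0, 1.0, 1.0])

for name in confident:
    print(f"{name}: confident={confident[name]:.3f} uninformed={uninformed[name]:.3f}")
```

All three scores rank the uninformed case above the confident one. To turn per-example scores into an AUROC, label ID examples 0 and OOD examples 1 and pass the scores to a ranking metric such as `sklearn.metrics.roc_auc_score`.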

3. AUROC via uncertainty mass from NPZ files (authors' original outputs)

python auroc_npz_uncertainty.py

Saves a three-panel AUROC figure to Implementation_A_ib-edl_sigma_mult_0/.

Key findings

Baseline (K=4 for both ID and OOD): AUROC ≈ 0.57 — near-random discrimination.
OOD-only K expansion (ID stays at K=4; OOD inflated to K=5–8): AUROC rises to ≈ 0.92 at K=8, driven entirely by the K term in the vacuity formula.
Matched expansion (both distributions expanded equally): AUROC remains ≈ 0.57 at all K values, confirming the effect is purely a function of relative class counts rather than genuine uncertainty.
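The qualitative pattern in these three findings can be reproduced with a toy simulation. The sketch below is not `synthetic_auroc_analysis.py` and does not aim to match the reported numbers: it assumes that K expansion appends zero-evidence classes to an existing evidence vector, and it draws ID and OOD evidence from the same distribution so that any separation comes from K alone. All function names, the exponential evidence distribution, and the sample sizes are illustrative assumptions.

```python
import random


def vacuity(evidence):
    """Vacuity K / S, with S = sum(e_i + 1)."""
    k = len(evidence)
    return k / (sum(evidence) + k)


def pad_classes(evidence, k_new):
    """Append zero-evidence classes until there are k_new classes (assumed rule)."""
    return list(evidence) + [0.0] * (k_new - len(evidence))


def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score (ties count 0.5)."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


rng = random.Random(0)


def sample_evidence(k=4, mean=5.0):
    """Draw k non-negative evidence values from an exponential distribution."""
    return [rng.expovariate(1.0 / mean) for _ in range(k)]


# ID and OOD evidence come from the SAME distribution by construction.
id_evidence = [sample_evidence() for _ in range(300)]
ood_evidence = [sample_evidence() for _ in range(300)]

ood_only = {}  # ID stays at K=4, OOD padded to K
matched = {}   # both ID and OOD padded to K
for k in (4, 6, 8):
    id_base = [vacuity(e) for e in id_evidence]
    id_pad = [vacuity(pad_classes(e, k)) for e in id_evidence]
    ood_pad = [vacuity(pad_classes(e, k)) for e in ood_evidence]
    ood_only[k] = auroc(id_base, ood_pad)
    matched[k] = auroc(id_pad, ood_pad)
    print(f"K={k}: OOD-only AUROC={ood_only[k]:.3f}  matched AUROC={matched[k]:.3f}")
```

Under this padding rule, vacuity equals K / (E + K) for total evidence E, which is strictly decreasing in E for any fixed K. Matched expansion therefore preserves the ranking of examples and leaves AUROC unchanged, while OOD-only expansion raises every OOD score and pushes AUROC upward even though the underlying evidence is statistically identical.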

Citation

If you use this code or data, please cite the accompanying paper:

@misc{mcnamara2026rethinkingvacuityooddetection,
      title={Rethinking Vacuity for OOD Detection in Evidential Deep Learning}, 
      author={Claire McNamara},
      year={2026},
      eprint={2605.06382},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.06382}, 
}

License

The benchmark questions included in the JSON files are drawn from publicly available datasets released under their respective licenses (see the dataset links above). The analysis code in this repository is released under the MIT License.
