Official PyTorch Implementation of One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection, 2026.
- [2026-06-02]: Thank you for your interest in UniADet. We are truly grateful for the widespread attention and valuable feedback from researchers, students, and CV practitioners worldwide, including those from the University of Cambridge, Texas A&M University, Institute of Automation of the Chinese Academy of Sciences, Xi'an Jiaotong-Liverpool University, Shenzhen University, University of Jinan, and other institutions. The paper is currently under peer review. The full source code and pre-trained models will be officially released upon acceptance. We sincerely appreciate your patience and understanding.
- [2026-02-02]: ✅ UniADet has been independently reproduced by a third-party team, confirming our state-of-the-art results.
- [2026-01-13]: 🚀 Initialized the official UniADet code repository.
- [2026-01-09]: 📄 The UniADet paper is now available on arXiv.
- Introduction
- UniADet Framework
- Language-Free UniADet with Different Foundation Models
- Comparison with State-of-the-Arts
- Complexity and Efficiency Comparisons
- Ablation Studies
- Comparison with Language-Dependent AnomalyCLIP
- ToDo List
- Citation
UniADet is a language-free framework for universal (zero- and few-shot) visual anomaly detection. It outperforms state-of-the-art language-dependent zero- and few-shot AD models while remaining remarkably simple and efficient.
- We rethink vision-language AD methods and find that language prompts and encoders are unnecessary. This insight leads to an embarrassingly simple (language-free + dual decoupling), efficient (0.015M or 0.02M learnable params), effective (SOTA zero-/few-shot), and general (supports both VLMs and pure vision models) framework for universal anomaly detection.
- We fully decouple global anomaly classification and local anomaly segmentation across multi-scale hierarchical features, i.e., learning layer-wise cls/seg weights, effectively mitigating the learning conflict between different feature manifolds and substantially improving AD performance.
- Comprehensive experiments show that our approach achieves state-of-the-art zero-shot and few-shot performance. Notably, our few-shot UniADet is the first to outperform full-shot state-of-the-art methods.
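The decoupling idea in the bullets above can be sketched in plain Python. This is only an illustrative sketch, not the released implementation: the class name `DecoupledHeads`, the cosine-similarity scoring, the temperature of 0.07, the averaging across layers, and the choice of 5 layers with 768-dimensional features are all assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def anomaly_prob(feature, w_normal, w_anomaly, temperature=0.07):
    """Two-way softmax over similarities to the normal / anomaly weights."""
    s_n = cosine(feature, w_normal) / temperature
    s_a = cosine(feature, w_anomaly) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)


class DecoupledHeads:
    """Hypothetical container: one (normal, anomaly) weight pair per layer and per task."""

    def __init__(self, num_layers, dim, seed=0):
        rng = random.Random(seed)

        def pair():
            return ([rng.gauss(0, 1) for _ in range(dim)],
                    [rng.gauss(0, 1) for _ in range(dim)])

        # Classification and segmentation never share weights (task decoupling),
        # and every layer gets its own pair (hierarchical decoupling).
        self.cls = [pair() for _ in range(num_layers)]
        self.seg = [pair() for _ in range(num_layers)]

    def num_params(self):
        return sum(len(w) for head in (self.cls, self.seg) for pr in head for w in pr)

    def image_score(self, global_feats):
        """global_feats[l] is the global feature vector taken from layer l."""
        probs = [anomaly_prob(f, *self.cls[l]) for l, f in enumerate(global_feats)]
        return sum(probs) / len(probs)

    def pixel_scores(self, patch_feats):
        """patch_feats[l] is the list of patch feature vectors taken from layer l."""
        num_patches = len(patch_feats[0])
        scores = [0.0] * num_patches
        for l, patches in enumerate(patch_feats):
            for i, p in enumerate(patches):
                scores[i] += anomaly_prob(p, *self.seg[l])
        return [s / len(patch_feats) for s in scores]


heads = DecoupledHeads(num_layers=5, dim=768)
print(heads.num_params())  # 5 layers * 2 tasks * 2 classes * 768 dims = 15360
```

Under these assumed settings the head holds 15,360 weights (about 0.015M), and 20,480 (about 0.02M) with 1024-dimensional features. That is the same order of magnitude as the learnable-parameter counts quoted above, but the layer count and feature dimensions used here are guesses, not confirmed settings.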
| Backbone | Shot | MVTec-AD | VisA | Real-IAD |
|---|---|---|---|---|
| CLIP (ViT-L/14@336px) | 0 | 92.4 / 42.8 | 88.0 / 28.0 | 78.6 / 33.6 |
| CLIP (ViT-L/14@336px) | 4 | 97.7 / 58.8 | 93.3 / 36.7 | 84.3 / 37.2 |
| DINOv2 (Register ViT-L/14) | 0 | 93.5 / 50.9 | 91.3 / 32.7 | 82.5 / 43.1 |
| DINOv2 (Register ViT-L/14) | 4 | 98.7 / 65.4 | 96.9 / 45.2 | 90.3 / 48.5 |
| DINOv3 (ViT-L/16) | 0 | 94.0 / 52.7 | 91.9 / 32.5 | 81.2 / 41.6 |
| DINOv3 (ViT-L/16) | 4 | 98.2 / 69.0 | 97.1 / 45.5 | 88.5 / 49.8 |
Note: Performance is measured by Image-AUROC / Pixel-AUPR, here and in the tables below.
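For readers unfamiliar with the two metrics, the following is a minimal pure-Python sketch of how they can be computed. It assumes that AUPR is computed as average precision (a common convention) and that scores contain no ties for the precision-recall part; the exact evaluation protocol is not specified here, and the function names are illustrative.

```python
def auroc(labels, scores):
    """Probability that a random anomalous sample scores above a random normal one.

    Ties count as half a win. The pairwise loop is O(P * N), so it is only
    suitable for small inputs such as image-level scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes untied scores)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at each recalled positive
    return ap / total_pos


# Image-level: one label and one score per image.
image_labels = [0, 0, 1, 1]
image_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(image_labels, image_scores))              # 0.75

# Pixel-level: flatten every ground-truth mask and anomaly map into one list.
pixel_labels = [0, 0, 1, 1]
pixel_scores = [0.1, 0.4, 0.35, 0.8]
print(round(average_precision(pixel_labels, pixel_scores), 4))  # 0.8333
```

For real anomaly maps with millions of pixels, a sort-based or library implementation would be preferable to these loops.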
| Methods | Venue | Language-Free | Shots | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| UniADet | ours | ✅ | 0 | 93.5 / 50.9 | 91.3 / 32.7 | 82.5 / 43.1 |
| WinCLIP | CVPR 23 | | 0 | 90.4 / 18.2 | 75.5 / 5.4 | 67.0 / 3.3 |
| APRIL-GAN | CVPRW 23 | | 0 | 86.1 / - | 78.0 / - | - |
| AnomalyCLIP | ICLR 24 | | 0 | 91.6 / 34.5 | 82.0 / 21.3 | 69.5 / 26.7 |
| AdaCLIP | ECCV 24 | | 0 | 90.7 / 39.1 | 81.7 / 31.0 | 73.3 / 30.5 |
| VCPCLIP | ECCV 24 | | 0 | 92.1 / 49.4 | 83.8 / 30.1 | - |
| Bayes-PFL | CVPR 25 | | 0 | 92.5 / 48.3 | 87.0 / 29.8 | 70.0 / 27.6 |
| AA-CLIP | CVPR 25 | | 0 | 90.5 / - | 84.6 / - | - |
| FE-CLIP | ICCV 25 | | 0 | 91.9 / - | 84.6 / - | - |
| FAPrompt | ICCV 25 | | 0 | 91.9 / - | 84.6 / - | - |
| RareCLIP | ICCV 25 | | 0 | 91.5 / 46.1 | 86.1 / 27.0 | - |
| AdaptCLIP | AAAI 26 | | 0 | 93.5 / 38.3 | 84.8 / 26.1 | 74.2 / 28.2 |
| Methods | Venue | Language-Free | Shots | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| UniADet | ours | ✅ | 1 | 97.6 / 63.1 | 95.2 / 42.1 | 88.7 / 48.4 |
| UniADet | ours | ✅ | 2 | 98.0 / 64.1 | 96.1 / 44.2 | 89.0 / 46.7 |
| UniADet | ours | ✅ | 4 | 98.7 / 65.4 | 96.9 / 45.2 | 90.3 / 48.5 |
| MetaUAS | NeurIPS 24 | | 1 | 90.7 / 59.3 | 81.2 / 42.7 | 80.0 / 36.6 |
| APRIL-GAN | CVPRW 23 | | 4 | 92.8 / 54.5 | 92.6 / 32.2 | - |
| PromptAD | CVPR 24 | | 4 | 96.6 / 52.9 | 89.1 / 31.5 | - |
| UniVAD | CVPR 25 | | 1 | 97.8 / 55.6 | 93.5 / 42.8 | 85.1 / 37.6 |
| AdaptCLIP | AAAI 26 | | 1 | 94.5 / 53.7 | 90.5 / 38.9 | 81.8 / 36.6 |
| AdaptCLIP | AAAI 26 | | 2 | 95.7 / 55.1 | 92.2 / 40.7 | 82.9 / 37.8 |
| AdaptCLIP | AAAI 26 | | 4 | 96.6 / 57.2 | 93.1 / 41.8 | 83.9 / 39.1 |
| Methods | Venue | Language-Free | Setting | MVTec | VisA | Real-IAD |
|---|---|---|---|---|---|---|
| Dinomaly | CVPR 25 | | multi-class (full train set) | 99.6 / 69.3 | 98.7 / 53.2 | 89.3 / 42.8 |
| UniAD | NeurIPS 24 | | multi-class (full train set) | 96.5 / 44.7 | 90.8 / 33.6 | 83.0 / 21.1 |
| MuSc | ICLR 24 | | online (full test set) | 97.8 / 62.7 | 92.8 / 45.1 | - |
Important
Note1: If you find that any existing zero-shot/few-shot AD methods are missing from the table above, please feel free to open an issue so we can add them.
Note2: Dinomaly and UniAD are multi-class unsupervised AD algorithms, and they require dataset-specific training with full normal images.
Note3: MuSc is an online algorithm that requires access to statistics from the entire test dataset to evaluate the current image. Therefore, it is not a strictly zero-shot AD method.
| Shots | Methods | Models | Input Size | # Params (M) | Inf. Time (ms) |
|---|---|---|---|---|---|
| 0 | AdaCLIP | CLIP ViT-L/14@336px | 518×518 | 428.8 + 1.1e+1 | 107.4 |
| 0 | AnomalyCLIP | CLIP ViT-L/14@336px | 518×518 | 427.9 + 5.6e+0 | 70.7 |
| 0 | Bayes-PFL | CLIP ViT-L/14@336px | 518×518 | 427.9 + 2.7e+1 | 154.9 |
| 0 | AdaptCLIP-Zero | CLIP ViT-L/14@336px | 518×518 | 427.9 + 6.0e-1 | 57.5 |
| 0 | UniADet | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.5e-2 | 15.7 |
| 0 | UniADet | DINOv2 ViT-L/14 | 518×518 | 303.2 + 2.0e-2 | 41.9 |
| 1 | InCtrl | CLIP ViT-B-16+240 | 240×240 | 208.4 + 3.0e-1 | 59.0 |
| 1 | AnomalyCLIP+ | CLIP ViT-L/14@336px | 518×518 | 427.9 + 5.6e+0 | 76.2 |
| 1 | AdaptCLIP | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.8e+0 | 58.7 |
| 1 | UniADet | CLIP ViT-L/14@336px | 518×518 | 342.9 + 1.5e-2 | 22.4 |
| 1 | UniADet | DINOv2 ViT-L/14 | 518×518 | 303.2 + 2.0e-2 | 48.4 |
Note: The learnable-parameter counts of 1.5e-3 and 2.0e-3 are not correct for UniADet; the correct values are 1.5e-2 and 2.0e-2 (i.e., 0.015M and 0.02M), as listed in the table above.
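To put the table's numbers in perspective, here is a small arithmetic sketch. It reads each `# Params (M)` cell as "backbone + learnable" in millions, which is an interpretation of the table rather than something stated explicitly; the helper name `parse_params` is made up for this example.

```python
def parse_params(cell):
    """Split a '# Params (M)' cell such as '342.9 + 1.5e-2' into two floats (in millions)."""
    first, second = (float(part) for part in cell.split("+"))
    return first, second


# Two zero-shot rows copied from the table above: (params cell, inference time in ms).
rows = {
    "AdaptCLIP-Zero (CLIP)": ("427.9 + 6.0e-1", 57.5),
    "UniADet (CLIP)": ("342.9 + 1.5e-2", 15.7),
}

for name, (cell, latency_ms) in rows.items():
    backbone_m, learnable_m = parse_params(cell)
    share = learnable_m / (backbone_m + learnable_m)
    print(f"{name}: learnable share = {share:.4%}, latency = {latency_ms} ms")

# Ratio of the two reported inference times.
speedup = rows["AdaptCLIP-Zero (CLIP)"][1] / rows["UniADet (CLIP)"][1]
print(f"zero-shot speedup: {speedup:.2f}x")
```

On these two rows, the reported zero-shot inference time of UniADet with CLIP is roughly 3.7 times lower than that of AdaptCLIP-Zero, and its learnable weights are a tiny fraction of the total.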
Ablation studies on different components.
| No | DCS | DHF | CAA | Shot | MVTec | VisA |
|---|---|---|---|---|---|---|
| 0 | | | | 0 | 85.4 / 36.4 | 77.9 / 26.1 |
| 1 | | | | 0 | 91.8 / 38.3 | 85.9 / 27.2 |
| 2 | | | | 0 | 92.2 / 40.7 | 86.0 / 27.6 |
| 3 | | | | 0 | 92.4 / 42.8 | 88.0 / 28.0 |
| 4 | | | random | 0 | 91.3 / 41.5 | 87.5 / 26.6 |
| 5 | | | | 1 | 95.9 / 54.6 | 91.3 / 32.5 |
Note: The ablation studies are conducted with UniADet; the No. 3 results match the zero-shot CLIP (ViT-L/14@336px) results reported above.
- Essential Differences
| Feature | AnomalyCLIP | UniADet (Ours) |
|---|---|---|
| 🧠 Paradigm | 🔴 Language-Dependent | ✅ Language-Free |
| ⚡ Task Decoupling | Shared Cls/Seg Weight | ✅ Decoupled Cls/Seg Weights |
| 🏗️ Hierarchical Decoupling | Single / Last Layer | ✅ Layer-Wise Cls/Seg Weights |
| 🤖 Backbones | CLIP Only | ✅ CLIP, DINOv2-R, DINOv3 |
| 📉 Params | 5.6M | 🚀 0.015M or 0.020M (Efficient) |
Important
Note1: The official AnomalyCLIP team has clarified that their initial multi-layer claim was a code bug; the effective implementation relies only on the last layer.
Note2: A naive extension of AnomalyCLIP to multiple blocks leads to significant performance degradation.
- Commonality
| Feature | AnomalyCLIP & UniADet |
|---|---|
| Core Objective | Learning Normal / Anomaly Weights |
| Loss Function | CE for Cls and Focal + Dice for Seg |
| Training Data | Auxiliary Data (e.g., VisA or MVTec) |
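The loss combination named in the table can be illustrated with a short pure-Python sketch. It uses binary formulations over flat lists of anomaly probabilities; the focusing parameter `gamma=2.0`, the Dice smoothing term, and the unweighted sum of focal and Dice terms are assumptions for illustration, not settings confirmed by either method.

```python
import math


def _clamp(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so the logarithm stays finite."""
    return min(max(p, eps), 1.0 - eps)


def cross_entropy(p_anomaly, label):
    """Image-level classification loss for one image (label 1 = anomalous)."""
    p = _clamp(p_anomaly)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def focal_loss(probs, targets, gamma=2.0):
    """Binary focal loss averaged over pixels; down-weights easy pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = _clamp(p)
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the (smoothed) Dice overlap between prediction and mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def segmentation_loss(probs, targets):
    """Focal + Dice, summed with equal weight (an assumed weighting)."""
    return focal_loss(probs, targets) + dice_loss(probs, targets)


mask = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.2, 0.7, 0.3, 0.9]
print(segmentation_loss(good, mask) < segmentation_loss(bad, mask))  # True
print(cross_entropy(0.9, 1) < cross_entropy(0.1, 1))                 # True
```

The focal term handles the heavy imbalance between normal and anomalous pixels, while the Dice term rewards overlap with the anomaly mask as a whole.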
- Performance on VisA
| Shots | Backbones | AnomalyCLIP | UniADet (Ours) |
|---|---|---|---|
| 0-Shot | CLIP | 82.0 / 21.3 | 88.0 / 28.0 |
| 0-Shot | DINOv2-R | Not supported | 91.3 / 32.7 |
| 0-Shot | DINOv3 | Not supported | 91.9 / 32.5 |
| 4-Shot | CLIP | Not supported | 93.3 / 36.7 |
| 4-Shot | DINOv2-R | Not supported | 96.9 / 45.2 |
| 4-Shot | DINOv3 | Not supported | 97.1 / 45.5 |
- release pre-trained UniADet models
- deploy online UniADet demo on Hugging Face
- open training and testing code
If you find this work useful in your research, please consider citing:
@inproceedings{uniadet,
title={One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection},
author={Gao, Bin-Bin and Wang, Chengjie},
booktitle={arXiv:2601.05552},
year={2026}
}

