GitHub - Justlovesmile/SynCLIP: [CVPR 2026] Official Code for “SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception”

[CVPR 2026] SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

Abstract

Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inconsistency, where semantically equivalent expressions yield disparate spatial attention patterns. This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications.

To address this issue, we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP tasks. SynCLIP introduces a Semantic-consistent Spatial Attention alignment (SSA) module to enhance spatial attention consistency by minimizing discrepancies between attention maps of original and synonymous expressions. Furthermore, a Spatial Attention Refinement (SAR) module selectively strengthens the most semantically relevant spatial regions within aligned maps, resulting in more precise and stable grounding. To support synonym-coherent pretraining, we also construct a Synonym-Enriched Visual Corpus (SEViC), which augments each category with multiple synonyms and textual definitions. Extensive experiments on multiple benchmarks demonstrate that SynCLIP substantially improves grounding consistency under diverse linguistic variants and achieves state-of-the-art performance among CLIP-based OVDP methods.
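
As a rough illustration of the consistency idea behind SSA (not the released implementation), the sketch below builds a text-to-patch attention map for an expression and measures how far the maps of two synonyms drift apart. The cosine-similarity softmax, the temperature value, the mean absolute difference, and all function names are assumptions made for this example; the loss used in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def attention_map(patch_feats, text_emb, temperature=0.1):
    """Softmax over patches of the scaled text-to-patch cosine similarity.

    `temperature` is an assumed value, not taken from the paper.
    """
    sims = [cosine(p, text_emb) / temperature for p in patch_feats]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def attention_discrepancy(map_a, map_b):
    """Mean absolute difference between two attention maps (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(map_a, map_b)) / len(map_a)
```

A training objective in this spirit would push `attention_discrepancy` toward zero for the maps produced by an original expression and each of its synonyms on the same image.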

Installation

Clone this repository and install the required packages.

git clone https://github.com/Justlovesmile/SynCLIP.git
cd SynCLIP

conda create -n synclip python=3.10 -y
conda activate synclip
pip install -r requirements.txt
pip install -e . -v

# Below for F-ViT (see F-ViT/README.md)
cd ./F-ViT/mmcv-1.7.0
MMCV_WITH_OPS=1 pip install -e . -v
cd ../../

cd ./F-ViT/mmdetection-2.28.1
pip install -e . -v
cd ../../

The main environment used by this repository is based on PyTorch 2.7.1 and torchvision 0.22.1. Please install CUDA-compatible PyTorch wheels if your local CUDA version requires a different build.
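
To confirm that the installed packages match the versions quoted above, a small check along the following lines may help. It is a minimal sketch that uses only the standard library; the function name `find_version_mismatches` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in this README.
EXPECTED = {"torch": "2.7.1", "torchvision": "0.22.1"}


def find_version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None
        # CUDA wheels carry a local suffix such as "+cu126"; compare the public part only.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in find_version_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

An empty result means both packages match. A mismatch is not necessarily an error, since a different CUDA setup may require a different build.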

Data Preparation

The main experiments are conducted on COCO and LVIS. The SynCLIP distillation stage only requires images and expressions from SEViC. The downstream finetuning stage follows the OV-COCO and OV-LVIS settings used by CLIPSelf and DeCLIP.

Please organize the datasets as follows. The capitalization of Annotations and Images matches the provided training scripts.

SynCLIP/
|-- data/
|   |-- coco/
|   |   |-- Annotations/
|   |   |   |-- instances_train2017.json              # used to access training images
|   |   |   |-- lvis_coco_categories_train2017.json   # category labels for SEViC
|   |   |   |-- panoptic_val2017.json                 # COCO panoptic validation annotations
|   |   |   |-- panoptic_val2017/                     # COCO panoptic validation masks
|   |   |   `-- sevic.json                            # synonyms and definitions
|   |   `-- Images/
|   |       |-- train2017/
|   |       `-- val2017/
|   `-- lvis_v1/
|       |-- Annotations/
|       |   `-- lvis_v1_train.json                    # used to access LVIS training images
|       `-- Images/
|           |-- train2017/                            # can be linked to COCO train2017
|           `-- val2017/                              # can be linked to COCO val2017
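
Since the LVIS image folders can be linked to the COCO ones, the links can be created with a short helper such as the one below. This is a sketch under the layout shown above; `link_lvis_images` is a hypothetical name, and symbolic links are assumed to be supported by the file system.

```python
from pathlib import Path


def link_lvis_images(data_root):
    """Point data/lvis_v1/Images/{train2017,val2017} at the COCO image folders.

    Returns the list of links that were created.
    """
    root = Path(data_root)
    created = []
    for split in ("train2017", "val2017"):
        target = (root / "coco" / "Images" / split).resolve()
        link = root / "lvis_v1" / "Images" / split
        if link.exists() or link.is_symlink():
            continue  # leave existing folders or links untouched
        if not target.is_dir():
            raise FileNotFoundError(f"missing COCO images: {target}")
        link.parent.mkdir(parents=True, exist_ok=True)
        link.symlink_to(target, target_is_directory=True)
        created.append(link)
    return created
```

Calling `link_lvis_images("data")` from the repository root would create both links, and running it a second time changes nothing.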

Checkpoint Preparation

Download the pretrained weights from EVA-CLIP, DeCLIP, and DINOv2. Put them under ckpts/.

SynCLIP/
`-- ckpts/
    |-- EVA02_CLIP_B_psz16_s8B.pt
    |-- EVA02_CLIP_L_336_psz14_s6B.pt
    |-- DeCLIP_EVA-B_DINOv2-B_csa_0.05_2.0.pt
    |-- DeCLIP_EVA-L_DINOv2-L_csa_0.05_2.0.pt
    |-- dinov2_vitb14_reg4_pretrain.pth
    `-- dinov2_vitl14_reg4_pretrain.pth

Pretraining

We provide distributed training scripts under scripts/. Before launching a job, update the dataset path, checkpoint path, visible GPUs, and output name in the script if necessary.

For SynCLIP pretraining with EVA02-CLIP-B/16 on COCO:

bash scripts/dist_SynCLIP_eva_vitb16_coco.sh

The script expects:

COCO data at data/coco/
SEViC expressions at data/coco/Annotations/sevic.json
category names at data/coco/Annotations/lvis_coco_categories_train2017.json
EVA02-CLIP-B/16 weights at ckpts/EVA02_CLIP_B_psz16_s8B.pt
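
Before launching a long job, it can be worth checking that the four inputs listed above are in place. The following is a minimal pre-flight sketch; the helper `missing_inputs` is hypothetical and only tests that the paths exist, not that the files are valid.

```python
from pathlib import Path

# Paths listed in this README, relative to the repository root.
EXPECTED_PATHS = (
    "data/coco",
    "data/coco/Annotations/sevic.json",
    "data/coco/Annotations/lvis_coco_categories_train2017.json",
    "ckpts/EVA02_CLIP_B_psz16_s8B.pt",
)


def missing_inputs(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_inputs("."):
        print(f"missing: {rel}")
```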

Consider enabling --swanlab to monitor training progress in real time with SwanLab.

Checkpoints are saved under the experiment directory configured by exp_name.

Downstream Validation

Open-Vocabulary Detection with F-ViT

F-ViT finetuning is provided in F-ViT/. Edit F-ViT/run_train.sh to set synclip_workdir, exp_name, and the number of GPUs for your machine, then run:

cd F-ViT
bash run_train.sh

The script loads the SynCLIP checkpoint from:

${synclip_workdir}/${exp_name}/checkpoints/epoch_6.pt
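
To see which pretraining checkpoints are available before finetuning, a helper like the one below can resolve the path above and list the saved epochs. It is a sketch only: the function names are hypothetical, and the `epoch_<N>.pt` naming for other epochs is an assumption extrapolated from `epoch_6.pt`.

```python
import re
from pathlib import Path


def checkpoint_path(synclip_workdir, exp_name, epoch=6):
    """Build <synclip_workdir>/<exp_name>/checkpoints/epoch_<epoch>.pt."""
    return Path(synclip_workdir) / exp_name / "checkpoints" / f"epoch_{epoch}.pt"


def available_epochs(synclip_workdir, exp_name):
    """Return the sorted epoch numbers found in the checkpoints directory."""
    ckpt_dir = Path(synclip_workdir) / exp_name / "checkpoints"
    epochs = []
    for path in ckpt_dir.glob("epoch_*.pt"):
        match = re.fullmatch(r"epoch_(\d+)\.pt", path.name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)
```

If `6` is absent from `available_epochs(...)`, pretraining has probably not finished, or the script points at a different experiment directory.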

Open-Vocabulary Segmentation with CAT-Seg

SynCLIP can also be transferred to CAT-Seg-style open-vocabulary segmentation pipelines. Please set the SynCLIP checkpoint in the corresponding CAT-Seg configuration before launching validation or finetuning.

Repository Structure

SynCLIP/
|-- assets/                 # figures and visual assets
|-- ckpts/                  # pretrained and released checkpoints
|-- data/                   # COCO, LVIS, and SEViC annotations
|-- scripts/                # SynCLIP pretraining scripts
|-- src/                    # pretraining entry points and training logic
|-- F-ViT/                  # open-vocabulary detection validation
|-- CAT-Seg/                # open-vocabulary segmentation validation
|-- README.md
|-- requirements.txt
`-- setup.py

Acknowledgement

This work is built on many excellent research works and open-source projects. We thank the authors for sharing their code and models.

Citation

If you find this project useful in your research, please consider starring the repository and citing our paper.

@inproceedings{xie2026synclip,
  title={SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception},
  author={Xie, Mingjie and He, Guangjun and Xu, Dongli and Lin, Youtian and Li, Hongjue and Feng, Pengming and Guan, Jian and Deng, Yue},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
