CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

Sangin Lee$^1$ and Yukyung Choi$^{\ddagger}$

📜 News

🔥 [2026/05/07] Our code is now open source!

🔥 [2026/05/01] LiteLVLM has been accepted to ICML 2026!

📢 Outline

LiteLVLM
Installation
Datasets
Model Zoo
Evaluation
License
Acknowledgement

LiteLVLM

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens within referent regions often exhibit low similarity to their textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3× memory reduction.
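The reversed-ranking idea described above can be illustrated with a minimal sketch. This is not the code from this repository: the function names `cosine` and `select_tokens_reversed`, the toy 2-D embeddings, and the use of cosine similarity are assumptions made for illustration, and the context-token recovery step is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens_reversed(visual_tokens, text_embedding, num_retain):
    """Return indices of the num_retain visual tokens LEAST similar to the text.

    Hypothetical helper: a conventional text-guided pruner would keep the
    most similar tokens; here the ranking is reversed.
    """
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]
    # Ascending sort puts the lowest-similarity tokens first (the reversed ranking).
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    # Re-sort the kept indices so the retained tokens stay in their original order.
    return sorted(order[:num_retain])


# Toy example: four visual tokens and one text embedding, all made up.
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
text_embedding = [1.0, 0.0]
print(select_tokens_reversed(visual_tokens, text_embedding, num_retain=2))  # prints [2, 3]
```

In this toy input, tokens 0 and 1 are the most similar to the text and are pruned, while tokens 2 and 3 are retained.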

🚀 Installation

Clone this repository.

git clone https://github.com/sejong-rcv/LiteLVLM.git
cd LiteLVLM

Set up a conda environment and install packages.

conda create -n LiteLVLM python=3.10 -y
conda activate LiteLVLM
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

Install mmcv

git clone https://github.com/open-mmlab/mmcv
cd mmcv
git checkout v1.4.7
MMCV_WITH_OPS=1 pip install -e .
cd ..

📌 Datasets

Please see docs/datasets.md for dataset preparation guidelines.

🔍 Model Zoo

We use the official pretrained checkpoints released by GLaMM. Download the GLaMM-RefSeg checkpoint from Hugging Face and place it in checkpoints/. If you plan to fine-tune LiteLVLM, please also download the GLaMM-GranD-Pretrained checkpoint.

⚡ Evaluation

Run the following example to evaluate our LiteLVLM on Referring Expression Segmentation benchmarks.

1. Prepare the pretrained checkpoints and datasets.

- See Model Zoo to download the pretrained pixel grounding model checkpoints into ./checkpoints/.
- See Datasets to set up the datasets.

2. After preparation, run the following script to evaluate LiteLVLM.

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="./:$PYTHONPATH"
MASTER_PORT=22999

CKPT_PATH=$1
REF_SEG_DATASET=$2
RESULT_PATH=$3
RETAIN_TOKENS=$4

deepspeed --master_port="$MASTER_PORT" eval/referring_seg/infer_and_evaluate.py \
    --version "$CKPT_PATH" \
    --refer_seg_data "$REF_SEG_DATASET" \
    --results_path "$RESULT_PATH" \
    --num_retain_tokens "$RETAIN_TOKENS" \
    --pretrained

To evaluate the RefCOCO benchmark with 192 retained tokens, run the following command:

bash eval/referring_seg/single_evaluation.sh 'checkpoints/GLaMM-RefSeg' 'refcoco|val' 'run/LiteLVLM/192' 192

3. One-Click evaluation

If you want to evaluate all benchmarks, run the following command:

bash eval/referring_seg/run_evaluation.sh 'checkpoints/GLaMM-RefSeg' 'run/LiteLVLM/192' 192
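To compare several token budgets, the one-click command above can be generated in a loop. The following is a minimal sketch: the helper `build_sweep_commands` is hypothetical and not part of this repository, and the budgets 64 and 128 are illustrative assumptions (only 192 appears in this README). The result path follows the `run/LiteLVLM/<budget>` pattern used in the examples above.

```python
import shlex


def build_sweep_commands(ckpt_path, budgets, result_root="run/LiteLVLM"):
    """Build one run_evaluation.sh invocation per token budget.

    Hypothetical helper: the argument order (checkpoint, result path,
    retained tokens) mirrors the one-click command shown in this README.
    """
    commands = []
    for budget in budgets:
        result_path = f"{result_root}/{budget}"
        commands.append(
            ["bash", "eval/referring_seg/run_evaluation.sh",
             ckpt_path, result_path, str(budget)]
        )
    return commands


# Dry run: only print the commands. Budgets other than 192 are assumptions.
for cmd in build_sweep_commands("checkpoints/GLaMM-RefSeg", [64, 128, 192]):
    print(shlex.join(cmd))
```

To actually launch each evaluation, pass each command list to `subprocess.run(cmd, check=True)` from the repository root.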

📝 License

This project is released under the Apache 2.0 license.

Citation

If you use LiteLVLM in your research, please cite our work by using the following BibTeX entry:

🙏 Acknowledgement

We thank GLaMM and VideoGLaMM for releasing their code as open source.
