PLAN-Lab/PyraTok

[CVPR 2026] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

PyraTok is a language-aligned pyramidal video tokenizer designed for both video understanding and generation.
This repository includes model code, inference scripts, and finetuning scripts with Accelerate.

Figure: PyraTok overview.

Highlights

  • Language-aligned pyramidal quantization (LaPQ) for semantically meaningful video tokens.
  • End-to-end VAE-style reconstruction with optional text conditioning.
  • Sliding-window inference over full videos at frame level.
  • Finetuning pipeline with multi-GPU support via accelerate.

Model Weights

  • Hugging Face model page: https://huggingface.co/onkarsus13/PyraTok

Download with the provided script:

python download.py

Or directly:

python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='onkarsus13/PyraTok', local_dir='./checkpoints', local_dir_use_symlinks=False)"

Installation

1) Clone and enter the repo

git clone <your-repo-url>
cd PyraTok

2) Create environment

python -m venv .venv
source .venv/bin/activate

3) Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Data Format

The training/inference manifest can be a .json or a .jsonl file.

Each entry must contain:

  • path: relative or absolute video path
  • caption: text instruction/caption

Example .json:

[
  {"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"},
  {"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}
]

Example .jsonl:

{"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"}
{"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}
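
Before launching a long run it can help to check a manifest against the two required keys. The following is a minimal sketch, not the repository's loader: `load_manifest` is a hypothetical helper name, and it assumes the file extension (.json or .jsonl) decides how the file is parsed.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Read a .json (list of rows) or .jsonl (one row per line) manifest."""
    path = Path(manifest_path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    elif path.suffix == ".json":
        rows = json.loads(text)
        if not isinstance(rows, list):
            raise ValueError("a .json manifest must be a list of rows")
    else:
        raise ValueError(f"unsupported manifest extension: {path.suffix}")
    # Every row needs both required keys.
    for i, row in enumerate(rows):
        missing = {"path", "caption"} - set(row)
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
    return rows
```

Calling `load_manifest("train.jsonl")` returns a list of dicts, or raises `ValueError` naming the first row that lacks `path` or `caption`.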

Inference

Run with explicit inputs:

python infer.py /abs/path/to/video.mp4 "your prompt here"

Outputs are written under ./reconstructions/:

  • *_input.mp4
  • *_recon.mp4
  • *_sbs.mp4
  • reconstruction_metadata.json
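
To reconstruct several clips in one go, the documented command can be wrapped in a small driver. This is a sketch under assumptions: `build_infer_command` and `run_batch` are hypothetical helpers, the rows follow the manifest format above, and it is not stated here whether successive runs overwrite each other's files in ./reconstructions/, so check the output folder after the first clip.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(video_path, prompt, script="infer.py"):
    """Return the argv list for one `python infer.py <video> "<prompt>"` call."""
    # The documented example passes an absolute path, so resolve relative ones.
    return [sys.executable, script, str(Path(video_path).resolve()), prompt]


def run_batch(rows, dry_run=True):
    """Build (and optionally run) one infer.py command per manifest row."""
    commands = [build_infer_command(r["path"], r["caption"]) for r in rows]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch at the first failing clip.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default) nothing is executed; the function only returns the commands so they can be inspected first.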

Inference settings (edit in infer.py)

  • vae_model_path: path to downloaded/finetuned VAE checkpoint.
  • qwen_model_path: path to text encoder checkpoint.
  • use_text_condition: enable/disable language conditioning.
  • num_frames: sliding window size.
  • window_stride: sliding step (None means no overlap, stride = num_frames).
  • height, width, fps, device, dtype.
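
The interaction between num_frames and window_stride can be illustrated with a short sketch. `window_starts` is a hypothetical function, not code from infer.py. How infer.py treats trailing frames that do not fill a whole window is not documented here, so the final end-aligned window below is an assumption.

```python
def window_starts(total_frames, num_frames, window_stride=None):
    """Return the start frame index of each sliding window."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # None means no overlap: the stride equals the window size.
    stride = num_frames if window_stride is None else window_stride
    if stride <= 0:
        raise ValueError("window_stride must be positive")
    if total_frames <= 0:
        return []
    if total_frames <= num_frames:
        # A short video fits in a single (possibly partial) window.
        return [0]
    starts = list(range(0, total_frames - num_frames + 1, stride))
    # Assumption: add one end-aligned window so trailing frames are covered.
    last = total_frames - num_frames
    if starts[-1] != last:
        starts.append(last)
    return starts


print(window_starts(10, 4))     # [0, 4, 6]
print(window_starts(10, 4, 2))  # [0, 2, 4, 6]
```

A stride smaller than num_frames makes consecutive windows overlap, which costs more compute per video.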

Finetuning (Complete Workflow)

finetune.py is the Accelerate entrypoint; it calls fine_tune.py, where TrainConfig is defined.

1) Prepare data

Update these paths in fine_tune.py (TrainConfig):

  • video_base_path
  • fallback_video_base_paths
  • train_manifest_path
  • pyratok_pretrained_path
  • qwen_model_path
  • output_dir
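
The names video_base_path and fallback_video_base_paths suggest that relative manifest paths are tried against several roots in order. The sketch below shows that idea only; `resolve_video_path` is a hypothetical helper, and the lookup order actually used in fine_tune.py is an assumption.

```python
from pathlib import Path


def resolve_video_path(manifest_path, video_base_path, fallback_video_base_paths=()):
    """Return the first existing location for a manifest path, or None."""
    p = Path(manifest_path)
    if p.is_absolute():
        # Absolute paths are used as given.
        return p if p.exists() else None
    # Assumed order: the primary base first, then each fallback in turn.
    for base in (video_base_path, *fallback_video_base_paths):
        candidate = Path(base) / p
        if candidate.exists():
            return candidate
    return None
```

Running this over every manifest row before training gives a quick count of clips that cannot be found under any base path.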

2) Configure training hyperparameters

In fine_tune.py, set:

  • Core training: num_epochs, max_steps, learning_rate, weight_decay, grad_clip_norm
  • Batch/layout: batch_size, gradient_accumulation_steps, num_workers, num_frames
  • Precision/distributed: mixed_precision, offload_text_encoder_to_cpu
  • LaPQ params: lapq_num_codes, lapq_num_quantizers, lapq_codebook_dim, etc.
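
When choosing batch_size and gradient_accumulation_steps, it helps to work out the effective batch size per optimizer step. The arithmetic below is a sketch that assumes batch_size is per process (per GPU), which is the usual Accelerate behavior, and that the last partial batch is kept. Neither is confirmed for TrainConfig, and both function names are hypothetical.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation_steps, num_processes):
    """Samples contributing to one optimizer step across all processes."""
    return batch_size * gradient_accumulation_steps * num_processes


def optimizer_steps_per_epoch(num_samples, batch_size,
                              gradient_accumulation_steps, num_processes):
    """Optimizer steps in one pass over the data (partial batches kept)."""
    batches_per_process = math.ceil(num_samples / (batch_size * num_processes))
    return math.ceil(batches_per_process / gradient_accumulation_steps)


# Example: 8 GPUs, 2 clips per GPU, 4 accumulation steps.
print(effective_batch_size(2, 4, 8))             # 64
print(optimizer_steps_per_epoch(1000, 2, 4, 8))  # 16
```

If max_steps counts optimizer steps, the second function gives a rough way to relate it to num_epochs.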

3) Configure Accelerate (first run)

accelerate config

4) Launch training

Single command:

accelerate launch --mixed_precision bf16 finetune.py

Or use the provided script:

bash train.sh

train.sh example:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --mixed_precision bf16 finetune.py

5) Training outputs

Saved in output_dir:

  • periodic checkpoints: checkpoint_step_XXXXXXX.pt
  • diffusers-format VAE folders: vae_checkpoint_step_XXXXXXX/
  • dumped config: hardcoded_train_config.json
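
To pick up the most recent checkpoint from output_dir, the step number can be parsed from the file name. This sketch assumes the XXXXXXX placeholder is a decimal step count; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches checkpoint_step_<digits>.pt but not the vae_checkpoint_step_* folders.
_STEP_RE = re.compile(r"^checkpoint_step_(\d+)\.pt$")


def latest_checkpoint(output_dir):
    """Return (step, path) for the highest-step .pt checkpoint, or None."""
    best = None
    for path in Path(output_dir).iterdir():
        match = _STEP_RE.match(path.name)
        if match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best
```

Comparing the parsed integers rather than the raw file names keeps the result correct even if the zero padding varies between runs.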

Notes and Troubleshooting

  • Video decoding tries torchvision, then imageio, then opencv.
  • Unreadable videos are skipped during dataset loading instead of crashing training.
  • If you see missing decoder/backend errors, install both torchvision and opencv-python (or imageio-ffmpeg).
  • fine_tune.py currently uses hardcoded config via TrainConfig, so editing that dataclass is the primary way to change runs.
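
A quick way to see which decoding backends are installed, without importing the heavy packages, is to ask the import system. This is a sketch: `available_backends` is a hypothetical helper, and it only reports whether a module can be found, not whether it can decode a given file. Note that opencv-python is imported as cv2.

```python
import importlib.util

# Import names of the three decoders, in the order they are tried.
BACKENDS = ("torchvision", "imageio", "cv2")


def available_backends(candidates=BACKENDS):
    """Return the candidate module names that are importable here."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


print(available_backends())
```

An empty list means none of the three backends is installed in the active environment.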

Contact

If you run into any issues while installing or finetuning, please contact [email protected].

Citation

⭐ If you find this work useful, please cite our paper:

@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
