TimesCLIP: Teaching Time Series to See and Speak

arXiv · Project Page · Blog · 知乎

🚀 Code Coming Soon! Star ⭐ to get notified when released!


📖 Abstract

We present TimesCLIP, the first work to apply vision-language contrastive learning (à la CLIP) to time series forecasting. Our results show that CLIP is all you need: simply replacing complex, task-specific transformer backbones with CLIP's pre-trained text encoder achieves state-of-the-art performance on 16 short-term forecasting datasets without any hyperparameter tuning.

🎯 Key Contributions

  • 🔥 First CLIP-based Time Series Forecasting: Pioneering work applying vision-language contrastive learning to time series
  • ⚡ Zero Hyperparameter Tuning: Direct SOTA performance without architectural modifications
  • 🎨 Multimodal Framework: Combines numerical and visual understanding of time series patterns
  • 📊 Comprehensive Evaluation: SOTA on 16/22 benchmarks with significant improvements

🏗️ Method Overview

Core Innovation: CLIP Is All You Need

Instead of designing complex transformer architectures that require extensive tuning, we:

  1. Replace backbone with CLIP-Text encoder - leverages pre-trained multimodal representations
  2. Add visual understanding - convert time series to line plots for visual pattern recognition (see the rendering sketch after this list)
  3. Multimodal contrastive learning - align numerical and visual representations following the CoCa framework
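
To make step 2 concrete, here is a minimal sketch of rendering one input window as a line-plot image that a visual encoder can consume. The figure size, styling, and 224×224 resolution are illustrative assumptions, not the exact rendering used in the paper.

# Sketch: render a (seq_len, n_vars) window as an RGB line-plot image.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
import matplotlib.pyplot as plt

def window_to_image(window: np.ndarray, size_px: int = 224) -> np.ndarray:
    """Return an RGB array of shape (size_px, size_px, 3) plotting each variate."""
    dpi = 100
    fig, ax = plt.subplots(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    for v in range(window.shape[1]):
        ax.plot(window[:, v], linewidth=1.0)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]  # drop the alpha channel
    plt.close(fig)
    return img

image = window_to_image(np.random.randn(96, 7))  # e.g. a 96-step window with 7 variates
print(image.shape)  # (224, 224, 3)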

Architecture Components

  • 🧠 CLIP-Text Backbone: Pre-trained text encoder as the core forecasting engine
  • 👁️ Visual Encoder: Processes time series line plots for pattern recognition
  • 🔗 Variate Selection Module: Handles inter-variable dependencies with contrastive learning
  • 🎨 Multimodal Alignment: Bridges numerical data and visual patterns
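
The alignment component can be pictured as a symmetric, InfoNCE-style contrastive loss between the numerical and visual embeddings of the same window. The sketch below illustrates that idea in PyTorch; the embedding dimension and temperature are assumptions, not the paper's exact settings.

# Sketch: symmetric contrastive loss pulling paired numerical/visual embeddings together.
import torch
import torch.nn.functional as F

def alignment_loss(num_emb: torch.Tensor, vis_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """num_emb, vis_emb: (batch, dim) embeddings of the same batch of windows."""
    num_emb = F.normalize(num_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = num_emb @ vis_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(32, 512), torch.randn(32, 512))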

📊 Experimental Results

Short-term Forecasting (16 datasets)

  • 🏆 SOTA Performance: Outperforms all existing methods
  • 📈 Average 15% Improvement: Significant gains across all benchmarks
  • ⚡ Zero Tuning Required: Direct application without hyperparameter search

Long-term Forecasting (6 datasets)

  • ✅ Competitive Results: Strong performance on Exchange, Traffic, Weather, ETTm1, ETTm2, Solar Energy
  • 🎯 Focused Evaluation: Excludes unrealistic 96→720/2160 prediction horizons

Key Findings

  • CLIP-Text >> Custom Transformers: Pre-trained representations outperform task-specific architectures
  • Visual Patterns Matter: Line plot understanding significantly improves forecasting
  • Multimodal > Unimodal: Combined numerical+visual approach beats single modality

🚀 Getting Started

Installation

# Coming Soon! 
git clone https://github.com/Ironieser/TimesCLIP.git
cd TimesCLIP
pip install -r requirements.txt

Quick Start

# Coming Soon!
from timesclip import TimesCLIP

# Initialize model
model = TimesCLIP(
    input_len=96,
    pred_len=24,
)
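
Because the package has not been released yet, the forecasting call below only continues the Quick Start example with a guess at the eventual interface; the forward signature and tensor shapes are assumptions, not the released API.

# Assumed usage (not the released API): forecast from a dummy batch.
import torch

x = torch.randn(8, 96, 7)  # (batch, input_len, n_variates) dummy input
y_hat = model(x)           # expected output shape: (batch, pred_len, n_variates)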

Dataset Preparation

# Coming Soon!
# Download and prepare datasets
python scripts/prepare_datasets.py --dataset ETTh1
python scripts/prepare_datasets.py --dataset Weather
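
Until the preparation script is released, a standard sliding-window split over an ETT-style CSV looks roughly like the sketch below. The file path and the 96 → 24 window lengths are illustrative assumptions.

# Sketch: build (input, target) windows from an ETT-style CSV.
import numpy as np
import pandas as pd

df = pd.read_csv("dataset/ETTh1.csv")           # hypothetical local path
values = df.drop(columns=["date"]).to_numpy()   # (timesteps, n_variates)

input_len, pred_len = 96, 24
windows = [
    (values[i:i + input_len], values[i + input_len:i + input_len + pred_len])
    for i in range(len(values) - input_len - pred_len + 1)
]
print(len(windows), windows[0][0].shape, windows[0][1].shape)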

📁 Project Structure

TimesCLIP/
├── timesclip/          # Core model implementation
│   ├── models/         # TimesCLIP architecture
│   ├── data/           # Data loading and preprocessing
│   └── utils/          # Utility functions
├── experiments/        # Experiment scripts
├── configs/            # Configuration files
├── scripts/            # Data preparation scripts
└── results/            # Experimental results

🎯 Reproducibility

We provide complete experimental setups for all benchmarks:

  • 📋 Detailed Configs: All hyperparameters and settings
  • 🔄 Exact Preprocessing: Data preparation scripts
  • 📊 Result Analysis: Comprehensive evaluation metrics
  • 🎨 Visualization: Time series plotting and analysis tools

🔬 Ablation Studies

Our comprehensive ablations investigate:

  • 🎨 Visual Encoding: Line plots vs. spectrograms vs. raw values
  • 🧠 Backbone Comparison: CLIP-Text vs. BERT vs. GPT-2 vs. T5 (see the sketch after this list)
  • 🎯 Multimodal Learning: Impact of vision-language contrastive training
  • 🔗 Architecture Components: Variate selection and attention mechanisms
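
For the backbone ablation, swapping the text encoder amounts to loading a different pre-trained checkpoint and reading off its hidden size for the forecasting head. The Hugging Face identifiers below are standard public checkpoints used for illustration; the exact variants used in the paper may differ.

# Sketch: select a pre-trained text backbone by name (illustrative checkpoints).
from transformers import BertModel, CLIPTextModel, GPT2Model, T5EncoderModel

BACKBONES = {
    "clip-text": lambda: CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32"),
    "bert":      lambda: BertModel.from_pretrained("bert-base-uncased"),
    "gpt2":      lambda: GPT2Model.from_pretrained("gpt2"),
    "t5":        lambda: T5EncoderModel.from_pretrained("t5-base"),
}

encoder = BACKBONES["clip-text"]()
print(encoder.config.hidden_size)  # embedding width fed to the forecasting head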

📚 Citation

If you find our work helpful, please consider citing:

@article{sixun2025teaching,
  title={Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives},
  author={Dong, Sixun and Fan, Wei and Wu, Teresa and Fu, Yanjie},
  journal={arXiv preprint arXiv:2506.24124},
  year={2025}
}

🤝 Contributing

We welcome contributions! Please feel free to:

  • 🐛 Report bugs and issues
  • 💡 Suggest new features
  • 🔧 Submit pull requests
  • 📖 Improve documentation

📞 Contact

🙏 Acknowledgments

  • Thanks to the CLIP team for the foundational vision-language model
  • Appreciation to the time series forecasting community for benchmarks and baselines
  • Special thanks to reviewers and collaborators for valuable feedback

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ Star this repo if you find it helpful! ⭐

🔔 Watch for updates when code is released! 🔔
