We present TimesCLIP, the first work to apply vision-language contrastive learning (à la CLIP) to time series forecasting. Our approach demonstrates that CLIP is all you need: simply replacing complex transformer architectures with CLIP's pre-trained text encoder achieves state-of-the-art performance on 16 of 22 short-term forecasting benchmarks without any hyperparameter tuning.
- 🔥 First CLIP-based Time Series Forecasting: Pioneering work applying vision-language contrastive learning to time series
- ⚡ Zero Hyperparameter Tuning: SOTA performance out of the box, without hyperparameter search or architectural modifications
- 🎨 Multimodal Framework: Combines numerical and visual understanding of time series patterns
- 📊 Comprehensive Evaluation: SOTA on 16/22 benchmarks with significant improvements
Instead of building complex transformer architectures that require extensive tuning, we:
- Replace the backbone with the CLIP text encoder: this leverages pre-trained multimodal representations
- Add visual understanding: convert time series into line plots for visual pattern recognition (see the sketch after this list)
- Apply multimodal contrastive learning: align numerical and visual representations, following the CoCa framework
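To make the visual branch concrete, here is a minimal sketch of rendering a time-series window as a line-plot image that a visual encoder can consume. The helper name `series_to_image` and all rendering choices (figure size, DPI, line width) are our own assumptions for illustration, not the repository's actual preprocessing.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image


def series_to_image(window: np.ndarray, size: int = 224) -> np.ndarray:
    """Render a 1-D time-series window as an RGB line-plot image.

    `window` has shape (input_len,); returns a (size, size, 3) uint8 array.
    Hypothetical helper: the released TimesCLIP preprocessing may differ.
    """
    fig, ax = plt.subplots(figsize=(size / 100, size / 100), dpi=100)
    ax.plot(window, linewidth=1.5)
    ax.axis("off")  # the encoder should see the curve's shape, not the axes
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    buf.seek(0)
    img = Image.open(buf).convert("RGB").resize((size, size))
    return np.asarray(img)


# Example: a 96-step window, matching input_len in the quick start below
image = series_to_image(np.sin(np.linspace(0, 8 * np.pi, 96)))
print(image.shape)  # (224, 224, 3)
```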
- 🧠 CLIP-Text Backbone: Pre-trained text encoder as the core forecasting engine
- 👁️ Visual Encoder: Processes time series line plots for pattern recognition
- 🔗 Variate Selection Module: Handles inter-variable dependencies with contrastive learning
- 🎨 Multimodal Alignment: Bridges numerical data and visual patterns (see the loss sketch below)
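As a concrete illustration of the multimodal alignment component, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss between paired numerical and visual embeddings. This is our illustrative implementation of the general technique, assuming one embedding per modality per window; the exact loss used in TimesCLIP may differ.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(num_emb: torch.Tensor,
                          vis_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning paired numerical/visual embeddings.

    Both inputs have shape (batch, dim); row i of each tensor comes from
    the same time-series window. Illustrative sketch, not the repo's code.
    """
    num_emb = F.normalize(num_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = num_emb @ vis_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_n2v = F.cross_entropy(logits, targets)       # numerical -> visual
    loss_v2n = F.cross_entropy(logits.t(), targets)   # visual -> numerical
    return 0.5 * (loss_n2v + loss_v2n)


# Example with random embeddings for a batch of 32 windows
loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```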
- 🏆 SOTA Performance: Outperforms existing methods on 16 of 22 benchmarks
- 📈 Average 15% Improvement: Significant gains across the evaluated benchmarks
- ⚡ Zero Tuning Required: Direct application without hyperparameter search
- ✅ Competitive Results: Strong performance on Exchange, Traffic, Weather, ETTm1, ETTm2, Solar Energy
- 🎯 Focused Evaluation: Excludes unrealistic 96→720/2160 prediction horizons
- CLIP-Text >> Custom Transformers: Pre-trained representations outperform task-specific architectures
- Visual Patterns Matter: Line plot understanding significantly improves forecasting
- Multimodal > Unimodal: Combined numerical+visual approach beats single modality
```bash
# Coming Soon!
git clone https://github.com/Ironieser/TimesCLIP.git
cd TimesCLIP
pip install -r requirements.txt
```

```python
# Coming Soon!
from timesclip import TimesCLIP

# Initialize model
model = TimesCLIP(
    input_len=96,  # length of the input (lookback) window
    pred_len=24,   # forecasting horizon
)
```

```bash
# Coming Soon!
# Download and prepare datasets
python scripts/prepare_datasets.py --dataset ETTh1
python scripts/prepare_datasets.py --dataset Weather
```

```
TimesCLIP/
├── timesclip/      # Core model implementation
│   ├── models/     # TimesCLIP architecture
│   ├── data/       # Data loading and preprocessing
│   └── utils/      # Utility functions
├── experiments/    # Experiment scripts
├── configs/        # Configuration files
├── scripts/        # Data preparation scripts
└── results/        # Experimental results
```
We provide complete experimental setups for all benchmarks:
- 📋 Detailed Configs: All hyperparameters and settings
- 🔄 Exact Preprocessing: Data preparation scripts
- 📊 Result Analysis: Comprehensive evaluation metrics
- 🎨 Visualization: Time series plotting and analysis tools
Our comprehensive ablations investigate:
- 🎨 Visual Encoding: Line plots vs. spectrograms vs. raw values
- 🧠 Backbone Comparison: CLIP-Text vs. BERT vs. GPT-2 vs. T5
- 🎯 Multimodal Learning: Impact of vision-language contrastive training
- 🔗 Architecture Components: Variate selection and attention mechanisms
If you find our work helpful, please consider citing:
```bibtex
@article{sixun2025teaching,
  title={Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives},
  author={Dong, Sixun and Fan, Wei and Wu, Teresa and Fu, Yanjie},
  journal={arXiv preprint arXiv:2506.24124},
  year={2025}
}
```

We welcome contributions! Please feel free to:
- 🐛 Report bugs and issues
- 💡 Suggest new features
- 🔧 Submit pull requests
- 📖 Improve documentation
- Sixun Dong - [email protected]
- Project Page: https://cv.ironieser.cc/projects/timesclip.html
- Personal Homepage: https://cv.ironieser.cc/
- Thanks to the CLIP team for the foundational vision-language model
- Appreciation to the time series forecasting community for benchmarks and baselines
- Special thanks to reviewers and collaborators for valuable feedback
This project is licensed under the MIT License - see the LICENSE file for details.
⭐ Star this repo if you find it helpful! ⭐
🔔 Watch for updates when code is released! 🔔