TimesCLIP: Teaching Time Series to See and Speak

arXiv · Project Page · Blog · 知乎

🚀 Code Coming Soon! Star ⭐ to get notified when released!


📖 Abstract

We present TimesCLIP, the first work to apply vision-language contrastive learning (à la CLIP) to time series forecasting. Our results show that CLIP is all you need: simply replacing complex, task-specific transformer backbones with CLIP's pre-trained text encoder achieves state-of-the-art performance on 16 short-term forecasting datasets without any hyperparameter tuning.

🎯 Key Contributions

  • 🔥 First CLIP-based Time Series Forecasting: Pioneering work applying vision-language contrastive learning to time series
  • ⚡ Zero Hyperparameter Tuning: Direct SOTA performance without architectural modifications
  • 🎨 Multimodal Framework: Combines numerical and visual understanding of time series patterns
  • 📊 Comprehensive Evaluation: SOTA on 16/22 benchmarks with significant improvements

🏗️ Method Overview

Core Innovation: CLIP Is All You Need

Instead of designing complex transformer architectures that require extensive tuning, we:

  1. Replace backbone with CLIP-Text encoder - leverages pre-trained multimodal representations
  2. Add visual understanding - convert time series to line plots for visual pattern recognition (see the rendering sketch after this list)
  3. Multimodal contrastive learning - align numerical and visual representations following the CoCa framework
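
To make step 2 concrete, here is a minimal sketch of rendering one input window as a line-plot image that a visual encoder can consume. The figure size, styling, and 224×224 resolution are illustrative assumptions, not the exact rendering used in the paper.

# Sketch: render a (seq_len, n_vars) window as an RGB line-plot image.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
import matplotlib.pyplot as plt

def window_to_image(window: np.ndarray, size_px: int = 224) -> np.ndarray:
    """Return an RGB array of shape (size_px, size_px, 3) plotting each variate."""
    dpi = 100
    fig, ax = plt.subplots(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    for v in range(window.shape[1]):
        ax.plot(window[:, v], linewidth=1.0)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]  # drop the alpha channel
    plt.close(fig)
    return img

image = window_to_image(np.random.randn(96, 7))  # e.g. a 96-step window with 7 variates
print(image.shape)  # (224, 224, 3)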

Architecture Components

  • 🧠 CLIP-Text Backbone: Pre-trained text encoder as the core forecasting engine
  • 👁️ Visual Encoder: Processes time series line plots for pattern recognition
  • 🔗 Variate Selection Module: Handles inter-variable dependencies with contrastive learning
  • 🎨 Multimodal Alignment: Bridges numerical data and visual patterns
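
The alignment component can be pictured as a symmetric, InfoNCE-style contrastive loss between the numerical and visual embeddings of the same window. The sketch below illustrates that idea in PyTorch; the embedding dimension and temperature are assumptions, not the paper's exact settings.

# Sketch: symmetric contrastive loss pulling paired numerical/visual embeddings together.
import torch
import torch.nn.functional as F

def alignment_loss(num_emb: torch.Tensor, vis_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """num_emb, vis_emb: (batch, dim) embeddings of the same batch of windows."""
    num_emb = F.normalize(num_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = num_emb @ vis_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(32, 512), torch.randn(32, 512))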

📊 Experimental Results

Short-term Forecasting (16 datasets)

  • 🏆 SOTA Performance: Outperforms all existing methods
  • 📈 Average 15% Improvement: Significant gains across all benchmarks
  • ⚡ Zero Tuning Required: Direct application without hyperparameter search

Long-term Forecasting (6 datasets)

  • ✅ Competitive Results: Strong performance on Exchange, Traffic, Weather, ETTm1, ETTm2, Solar Energy
  • 🎯 Focused Evaluation: Excludes unrealistic 96→720/2160 prediction horizons

Key Findings

  • CLIP-Text >> Custom Transformers: Pre-trained representations outperform task-specific architectures
  • Visual Patterns Matter: Line plot understanding significantly improves forecasting
  • Multimodal > Unimodal: Combined numerical+visual approach beats single modality

🚀 Getting Started

Installation

# Coming Soon! 
git clone https://github.com/Ironieser/TimesCLIP.git
cd TimesCLIP
pip install -r requirements.txt

Quick Start

# Coming Soon!
from timesclip import TimesCLIP

# Initialize model
model = TimesCLIP(
    input_len=96,
    pred_len=24,
)
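
Because the package has not been released yet, the forecasting call below only continues the Quick Start example with a guess at the eventual interface; the forward signature and tensor shapes are assumptions, not the released API.

# Assumed usage (not the released API): forecast from a dummy batch.
import torch

x = torch.randn(8, 96, 7)  # (batch, input_len, n_variates) dummy input
y_hat = model(x)           # expected output shape: (batch, pred_len, n_variates)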

Dataset Preparation

# Coming Soon!
# Download and prepare datasets
python scripts/prepare_datasets.py --dataset ETTh1
python scripts/prepare_datasets.py --dataset Weather
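
Until the preparation script is released, a standard sliding-window split over an ETT-style CSV looks roughly like the sketch below. The file path and the 96 → 24 window lengths are illustrative assumptions.

# Sketch: build (input, target) windows from an ETT-style CSV.
import numpy as np
import pandas as pd

df = pd.read_csv("dataset/ETTh1.csv")           # hypothetical local path
values = df.drop(columns=["date"]).to_numpy()   # (timesteps, n_variates)

input_len, pred_len = 96, 24
windows = [
    (values[i:i + input_len], values[i + input_len:i + input_len + pred_len])
    for i in range(len(values) - input_len - pred_len + 1)
]
print(len(windows), windows[0][0].shape, windows[0][1].shape)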

📁 Project Structure

TimesCLIP/
├── timesclip/          # Core model implementation
│   ├── models/         # TimesCLIP architecture
│   ├── data/           # Data loading and preprocessing
│   └── utils/          # Utility functions
├── experiments/        # Experiment scripts
├── configs/            # Configuration files
├── scripts/            # Data preparation scripts
└── results/            # Experimental results

🎯 Reproducibility

We provide complete experimental setups for all benchmarks:

  • 📋 Detailed Configs: All hyperparameters and settings
  • 🔄 Exact Preprocessing: Data preparation scripts
  • 📊 Result Analysis: Comprehensive evaluation metrics
  • 🎨 Visualization: Time series plotting and analysis tools

🔬 Ablation Studies

Our comprehensive ablations investigate:

  • 🎨 Visual Encoding: Line plots vs. spectrograms vs. raw values
  • 🧠 Backbone Comparison: CLIP-Text vs. BERT vs. GPT-2 vs. T5 (see the sketch after this list)
  • 🎯 Multimodal Learning: Impact of vision-language contrastive training
  • 🔗 Architecture Components: Variate selection and attention mechanisms
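
For the backbone ablation, swapping the text encoder amounts to loading a different pre-trained checkpoint and reading off its hidden size for the forecasting head. The Hugging Face identifiers below are standard public checkpoints used for illustration; the exact variants used in the paper may differ.

# Sketch: select a pre-trained text backbone by name (illustrative checkpoints).
from transformers import BertModel, CLIPTextModel, GPT2Model, T5EncoderModel

BACKBONES = {
    "clip-text": lambda: CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32"),
    "bert":      lambda: BertModel.from_pretrained("bert-base-uncased"),
    "gpt2":      lambda: GPT2Model.from_pretrained("gpt2"),
    "t5":        lambda: T5EncoderModel.from_pretrained("t5-base"),
}

encoder = BACKBONES["clip-text"]()
print(encoder.config.hidden_size)  # embedding width fed to the forecasting head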

📚 Citation

If you find our work helpful, please consider citing:

@article{sixun2025teaching,
  title={Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives},
  author={Dong, Sixun and Fan, Wei and Wu, Teresa and Fu, Yanjie},
  journal={arXiv preprint arXiv:2506.24124},
  year={2025}
}

🤝 Contributing

We welcome contributions! Please feel free to:

  • 🐛 Report bugs and issues
  • 💡 Suggest new features
  • 🔧 Submit pull requests
  • 📖 Improve documentation

📞 Contact

🙏 Acknowledgments

  • Thanks to the CLIP team for the foundational vision-language model
  • Appreciation to the time series forecasting community for benchmarks and baselines
  • Special thanks to reviewers and collaborators for valuable feedback

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ Star this repo if you find it helpful! ⭐

🔔 Watch for updates when code is released! 🔔
