⚡ sped — Universal Speculative Decoding

sped pairs any small draft model with any large target model — even with completely different tokenizers — and accelerates inference 2–5× with zero loss in output quality.

Install

uv pip install sped

Or from source:

git clone https://github.com/egesabanci/sped
cd sped
uv venv && uv pip install -e .

Quick Start

# List recommended draft-target pairs
sped list pairings

# Run inference with speculation (same-vocab)
sped serve run \
  --target meta-llama/Llama-3.1-70B \
  --draft meta-llama/Llama-3.2-1B

# Cross-vocab (different tokenizers)
sped serve run \
  --target meta-llama/Llama-3.1-70B \
  --draft Qwen/Qwen2.5-0.5B \
  --align hybrid

# Benchmark
sped serve run --target meta-llama/Llama-3.1-70B --draft Qwen/Qwen2.5-0.5B --benchmark

# Distill a draft model via PEFT
sped distil run \
  --draft Qwen/Qwen2.5-0.5B \
  --target meta-llama/Llama-3.1-70B \
  --dataset HuggingFaceH4/ultrachat_200k

# Auto-tune draft K
sped experiment auto-tune --target meta-llama/Llama-3.1-70B --draft Qwen/Qwen2.5-0.5B

Features

Feature	Description
🧠 Any model pair	Works across tokenizers via Intel/Weizmann vocabulary-agnostic alignment
⚡ 2–5× speedup	Lossless — output matches target model distribution exactly
🎯 PEFT distillation	Align tiny draft models with LoRA in ~1 GPU-hour
🔄 Online adaptation	Draft improves during inference via OSD
🍎 MLX backend	Optimized for Apple Silicon (M1–M4)
🏭 vLLM support	Production-grade serving (optional)
🧪 Experiment runner	Grid-search, HTML reports, auto-tune

How It Works

Standard:    tok1 → tok2 → tok3 → tok4 → ...  (1 pass/token, slow)
sped:        draft: [tok1 tok2 tok3 tok4]      (4 cheap passes)
             target: ──────verify──────→ [✓✓✓✗]  (1 expensive pass)
             → same output, 2–5× faster

Documentation

CLI Reference — Full command reference
User Guide — Tutorials and best practices
Architecture — System design and data flow
Contributing — Development guide

Backends

# HF Transformers (default, works everywhere)
sped serve run --target meta-llama/Llama-3.1-70B --backend hf

# MLX (Apple Silicon — 2–3× faster on Mac)
uv pip install mlx-lm
sped serve run --target mlx-community/Llama-3.2-3B --backend mlx

# vLLM (production serving)
uv pip install vllm
sped serve run --target meta-llama/Llama-3.1-70B --backend vllm

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
sped		sped
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md
pyproject.toml		pyproject.toml
sped.service		sped.service
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚡ sped — Universal Speculative Decoding

Install

Quick Start

Features

How It Works

Documentation

Backends

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

⚡ sped — Universal Speculative Decoding

Install

Quick Start

Features

How It Works

Documentation

Backends

License

About

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages