Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model
- OpenAI-compatible Speech endpoint, multi-language support
- English (US/GB), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, Mandarin Chinese
- Per-word timestamped caption generation, voice mixing with weighted combinations
- Phoneme endpoints: generate phonemes from text, or generate audio from phonemes
- Prebuilt multiplatform images
- CPU and NVIDIA GPU (CUDA): linux/amd64 + linux/arm64
- AMD GPU (ROCm, experimental): linux/amd64 only
- Apple Silicon (MPS) supported when running directly via UV (no image)
Quickest Start (docker run)
Pre-built multi-arch images with models baked in.
:latest is available, but please pin to a release tag for stable usage.
| Your hardware | Image |
|---|---|
| No GPU (any laptop, VPS, CPU-only server) | kokoro-fastapi-cpu:latest |
| Apple Silicon (M1/M2/M3) | kokoro-fastapi-cpu:latest in Docker, or ./start-gpu_mac.sh natively for MPS |
| NVIDIA GTX 9xx, 10xx, 20xx, 30xx, 40xx (x86_64) | kokoro-fastapi-gpu:latest-cu126 or kokoro-fastapi-gpu:latest |
| NVIDIA RTX 50-series / Blackwell (x86_64) | kokoro-fastapi-gpu:latest-cu128 |
| NVIDIA on arm64 (Jetson, GH200) | kokoro-fastapi-gpu:latest (ships cu129, no cu126 arm64 wheels upstream) |
| AMD GPU | kokoro-fastapi-rocm:latest (experimental, x86_64 only) |
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest # CPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest # NVIDIA (x86_64 or arm64)
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest-cu128 # NVIDIA Blackwell / RTX 50-series
docker run --device=/dev/kfd --device=/dev/dri -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-rocm:latest # AMDConfiguration via environment variables, see core/config.py. The :latest and :latest-cu126 tags resolve to the same multi-arch image.
Quick Start (docker compose)
- Install prerequisites, and start the service using Docker Compose (Full setup including UI):
- Install Docker
- Clone the repository:
git clone https://github.com/remsky/Kokoro-FastAPI.git cd Kokoro-FastAPI cd docker/gpu # For NVIDIA GPU support # or cd docker/cpu # For CPU support # or cd docker/rocm # For AMD GPU (ROCm, experimental, amd64 only) docker compose up --build # *Note for Apple Silicon (M1/M2/M3) users: # The Docker GPU image is CUDA-only and won't run on Apple Silicon. With Docker, use `docker/cpu`. # For native MPS (Apple GPU) acceleration, run directly via UV with `./start-gpu_mac.sh`. # Models will auto-download, but if needed you can manually download: python docker/scripts/download_model.py --output api/src/models/v1_0 # Or run directly via UV: ./start-gpu.sh # For GPU support ./start-cpu.sh # For CPU support
Direct Run (via uv)
- Install prerequisites ():
-
Install astral-uv
-
Install espeak-ng in your system if you want it available as a fallback for unknown words/sounds. The upstream libraries may attempt to handle this, but results have varied.
-
Clone the repository:
git clone https://github.com/remsky/Kokoro-FastAPI.git cd Kokoro-FastAPIRun the model download script if you haven't already
Start directly via UV (with hot-reload)
Linux and macOS
./start-cpu.sh OR ./start-gpu.sh
Windows
.\start-cpu.ps1 OR .\start-gpu.ps1
-
Up and Running?
Run locally as an OpenAI-Compatible Speech Endpoint
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8880/v1", api_key="not-needed"
)
with client.audio.speech.with_streaming_response.create(
model="kokoro",
voice="af_sky+af_bella", #single or multiple voicepack combo
input="Hello world!"
) as response:
response.stream_to_file("output.mp3")-
The API will be available at http://localhost:8880
-
API Documentation: http://localhost:8880/docs
-
Web Interface: http://localhost:8880/web
OpenAI-Compatible Speech Endpoint
# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
model="kokoro",
voice="af_bella+af_sky", # see /api/src/core/openai_mappings.json to customize
input="Hello world!",
response_format="mp3"
)
response.stream_to_file("output.mp3")Or Via Requests:
import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]
# Generate audio
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"model": "kokoro",
"input": "Hello world!",
"voice": "af_bella",
"response_format": "mp3", # Supported: mp3, wav, opus, flac
"speed": 1.0
}
)
# Save audio
with open("output.mp3", "wb") as f:
f.write(response.content)Quick tests (run from another terminal):
python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI Compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voicesVoice Combination
- Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for 67%/33% mix)
- Ratios are automatically normalized to sum to 100%
- Available through any endpoint by adding weights in parentheses
- Saves generated voicepacks for future use
Combine voices and generate audio:
import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]
# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"input": "Hello world!",
"voice": "af_bella+af_sky", # Equal weights
"response_format": "mp3"
}
)
# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"input": "Hello world!",
"voice": "af_bella(2)+af_sky(1)", # 2:1 ratio = 67%/33%
"response_format": "mp3"
}
)
# Example 3: Download combined voice as .pt file
response = requests.post(
"http://localhost:8880/v1/audio/voices/combine",
json="af_bella(2)+af_sky(1)" # 2:1 ratio = 67%/33%
)
# Save the .pt file
with open("combined_voice.pt", "wb") as f:
f.write(response.content)
# Use the downloaded voice file
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"input": "Hello world!",
"voice": "combined_voice", # Use the saved voice file
"response_format": "mp3"
}
)Streaming Support
# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8880/v1", api_key="not-needed")
# Stream to file
with client.audio.speech.with_streaming_response.create(
model="kokoro",
voice="af_bella",
input="Hello world!"
) as response:
response.stream_to_file("output.mp3")
# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
format=pyaudio.paInt16,
channels=1,
rate=24000,
output=True
)
with client.audio.speech.with_streaming_response.create(
model="kokoro",
voice="af_bella",
response_format="pcm",
input="Hello world!"
) as response:
for chunk in response.iter_bytes(chunk_size=1024):
player.write(chunk)Or via requests:
import requests
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"input": "Hello world!",
"voice": "af_bella",
"response_format": "pcm"
},
stream=True
)
for chunk in response.iter_content(chunk_size=1024):
if chunk:
# Process streaming chunks
passKey Streaming Metrics:
- First token latency @ chunksize
- ~300ms (GPU) @ 400
- ~3500ms (CPU) @ 200 (older i7)
- ~<1s (CPU) @ 200 (M3 Pro)
- Adjustable chunking settings for real-time playback
Note: Artifacts in intonation can increase with smaller chunks
Performance & Benchmarks
# GPU: Requires NVIDIA driver with CUDA 12.6+ support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build
# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build
# AMD GPU: ROCm 6.4 (experimental, amd64 only)
cd docker/rocm
docker compose up --buildBenchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:
- Windows 11 Home w/ WSL2
- NVIDIA 4060Ti 16gb GPU @ CUDA 12.1
- 11th Gen i7-11700 @ 2.5GHz
- 64gb RAM
- WAV native output
- H.G. Wells - The Time Machine (full text)
Key Performance Metrics:
- Realtime Speed: Ranges between 35x-100x (generation time to output audio length)
- Average Processing Rate: 137.67 tokens/second (cl100k_base)
End-to-end roundtrip: synthesize with Kokoro, transcribe the result back with faster-whisper, compare to the source text. Scripts and data live under examples/assorted_checks/test_transcription/.
Long-form English (full book, A Journey to the Centre of the Earth, Project Gutenberg, voice af_heart, base.en Whisper on CUDA float16, baseline captured on cu126 GPU build):
| Run | Input chars | Audio length | Synth speedup | Transcribe speedup | WER |
|---|---|---|---|---|---|
| Short (~ch.7) | 64,996 | 66m 06s | 36.4x rt | 62.4x rt | 0.047 |
| Full book | 502,766 | 507m 52s | 45.7x rt | 65.1x rt | 0.033 |
See examples/assorted_checks/test_transcription/BASELINE.md for the full regression bands.
Per-language check (single-sentence per voice, multilingual Whisper small. WER for Latin scripts, CER for ja/zh/hi):
| Language | Voice | Metric | Score |
|---|---|---|---|
| English | af_heart |
WER | 0.000 |
| English (UK) | bf_emma |
WER | 0.111 |
| Spanish | ef_dora |
WER | 0.000 |
| French | ff_siwis |
WER | 0.000 |
| Italian | if_sara |
WER | 0.000 |
| Portuguese | pf_dora |
WER | 0.000 |
| Hindi | hf_alpha |
CER | 0.059 |
| Japanese | jf_alpha |
CER | 0.000 |
| Chinese | zf_xiaobei |
CER | 0.143 |
Caveat: these are single short sentences, not a comprehensive per-language quality benchmark. They confirm each voice produces transcribable audio in its target language; deeper quality evaluation per language is open work.
To reproduce, see examples/assorted_checks/test_transcription/README.md.
Natural Boundary Detection
- Automatically splits and stitches at sentence boundaries
- Helps to reduce artifacts and allow long form processing as the base model is only currently configured for approximately 30s output
The model is capable of processing up to a 510 phonemized token chunk at a time, however, this can often lead to 'rushed' speech or other artifacts. An additional layer of chunking is applied in the server, that creates flexible chunks with a TARGET_MIN_TOKENS , TARGET_MAX_TOKENS, and ABSOLUTE_MAX_TOKENS which are configurable via environment variables, and set to 175, 250, 450 by default
Timestamped Captions & Phonemes
Generate audio with word-level timestamps without streaming:
import requests
import base64
import json
response = requests.post(
"http://localhost:8880/dev/captioned_speech",
json={
"model": "kokoro",
"input": "Hello world!",
"voice": "af_bella",
"speed": 1.0,
"response_format": "mp3",
"stream": False,
},
stream=False
)
with open("output.mp3","wb") as f:
audio_json=json.loads(response.content)
# Decode base 64 stream to bytes
chunk_audio=base64.b64decode(audio_json["audio"].encode("utf-8"))
# Process streaming chunks
f.write(chunk_audio)
# Print word level timestamps
print(audio_json["timestamps"])Generate audio with word-level timestamps with streaming:
import requests
import base64
import json
response = requests.post(
"http://localhost:8880/dev/captioned_speech",
json={
"model": "kokoro",
"input": "Hello world!",
"voice": "af_bella",
"speed": 1.0,
"response_format": "mp3",
"stream": True,
},
stream=True
)
f=open("output.mp3","wb")
for chunk in response.iter_lines(decode_unicode=True):
if chunk:
chunk_json=json.loads(chunk)
# Decode base 64 stream to bytes
chunk_audio=base64.b64decode(chunk_json["audio"].encode("utf-8"))
# Process streaming chunks
f.write(chunk_audio)
# Print word level timestamps
print(chunk_json["timestamps"])Phoneme & Token Routes
Convert text to phonemes and/or generate audio directly from phonemes:
import requests
def get_phonemes(text: str, language: str = "a"):
"""Get phonemes and tokens for input text"""
response = requests.post(
"http://localhost:8880/dev/phonemize",
json={"text": text, "language": language} # "a" for American English
)
response.raise_for_status()
result = response.json()
return result["phonemes"], result["tokens"]
def generate_audio_from_phonemes(phonemes: str, voice: str = "af_bella"):
"""Generate audio from phonemes"""
response = requests.post(
"http://localhost:8880/dev/generate_from_phonemes",
json={"phonemes": phonemes, "voice": voice},
headers={"Accept": "audio/wav"}
)
if response.status_code != 200:
print(f"Error: {response.text}")
return None
return response.content
# Example usage
text = "Hello world!"
try:
# Convert text to phonemes
phonemes, tokens = get_phonemes(text)
print(f"Phonemes: {phonemes}") # e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst
print(f"Tokens: {tokens}") # Token IDs including start/end tokens
# Generate and save audio
if audio_bytes := generate_audio_from_phonemes(phonemes):
with open("speech.wav", "wb") as f:
f.write(audio_bytes)
print(f"Generated {len(audio_bytes)} bytes of audio")
except Exception as e:
print(f"Error: {e}")See examples/phoneme_examples/generate_phonemes.py for a sample script.
Debug Endpoints
Monitor system state and resource usage with these endpoints:
/debug/threads- Get thread information and stack traces/debug/storage- Monitor temp file and output directory usage/debug/system- Get system information (CPU, memory, GPU)
Useful for debugging resource exhaustion or performance issues.
Logging
Global API loguru logging level can be set using the API_LOG_LEVEL environment variable. Defaults to DEBUG.
Docker
Modify the appropriate compose yml or append to command line.
docker run --env 'API_LOG_LEVEL=WARNING' ...Direct via UV
Linux and macOS
export API_LOG_LEVEL=WARNING
./start-cpu.sh OR
./start-gpu.shWindows
$env:API_LOG_LEVEL = 'WARNING'
.\start-cpu.ps1 OR
.\start-gpu.ps1Missing words & Missing some timestamps
The api will automatically do text normalization on input text which may incorrectly remove or change some phrases. This can be disabled by adding "normalization_options":{"normalize": false} to your request json:
import requests
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"input": "Hello world!",
"voice": "af_heart",
"response_format": "pcm",
"normalization_options":
{
"normalize": False
}
},
stream=True
)
for chunk in response.iter_content(chunk_size=1024):
if chunk:
# Process streaming chunks
passLinux GPU Permissions
Some Linux users may encounter GPU permission issues when running as non-root. Can't guarantee anything, but here are some common solutions, consider your security requirements carefully
services:
kokoro-tts:
# ... existing config ...
group_add:
- "video"
- "render"services:
kokoro-tts:
# ... existing config ...
user: "${UID}:${GID}"
group_add:
- "video"Note: May require adding host user to groups: sudo usermod -aG docker,video $USER and system restart.
services:
kokoro-tts:
# ... existing config ...
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-uvm:/dev/nvidia-uvmPrerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.
Visit NVIDIA Container Toolkit installation for more detailed information
WAV duration reported as nonsense in some readers
WAV responses ship with streaming-sentinel (0xFFFFFFFF) size fields in the header. Most readers (soundfile, pydub/ffmpeg, browsers, OS players) handle this fine. Python's stdlib wave does not, and reports a bogus duration. Use soundfile.info(path).duration or ffprobe for exact length.
Versioning & Development
Branching Strategy:
releasebranch: Contains the latest stable build, recommended for production use. Docker images tagged with specific versions are built from this branch.masterbranch: Used for active development. It may contain experimental features, ongoing changes, or fixes not yet in a stable release. Use this branch if you want the absolute latest code, but be aware it might be less stable. ThelatestDocker tag often points to builds from this branch.
Note: This is a development focused project at its core.
If you run into trouble, you may have to roll back a version on the release tags if something comes up, or build up from source and/or troubleshoot + submit a PR.
Free and open source is a community effort, and there's only really so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs/features/etc you find during use.
Model
This API uses the Kokoro-82M model from HuggingFace.
Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.
License
This project is licensed under the Apache License 2.0 - see below for details:- The Kokoro model weights are licensed under Apache 2.0 (see model page)
- The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
- The inference code adapted from StyleTTS2 is MIT licensed
The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
Made with contrib.rocks.








