_{_FastKoko}

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model

OpenAI-compatible Speech endpoint, multi-language support
- English (US/GB), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, Mandarin Chinese
Per-word timestamped caption generation, voice mixing with weighted combinations
Phoneme endpoints: generate phonemes from text, or generate audio from phonemes
Prebuilt multiplatform images
- CPU and NVIDIA GPU (CUDA): linux/amd64 + linux/arm64
- AMD GPU (ROCm, experimental): linux/amd64 only
Apple Silicon (MPS) supported when running directly via UV (no image)

Integration Guides

Get Started

Quickest Start (docker run)

Pre-built multi-arch images with models baked in.

:latest is available, but please pin to a release tag for stable usage.

Your hardware	Image
No GPU (any laptop, VPS, CPU-only server)	`kokoro-fastapi-cpu:latest`
Apple Silicon (M1/M2/M3)	`kokoro-fastapi-cpu:latest` in Docker, or `./start-gpu_mac.sh` natively for MPS
NVIDIA GTX 9xx, 10xx, 20xx, 30xx, 40xx (x86_64)	`kokoro-fastapi-gpu:latest-cu126` or `kokoro-fastapi-gpu:latest`
NVIDIA RTX 50-series / Blackwell (x86_64)	`kokoro-fastapi-gpu:latest-cu128`
NVIDIA on arm64 (Jetson, GH200)	`kokoro-fastapi-gpu:latest` (ships cu129, no cu126 arm64 wheels upstream)
AMD GPU	`kokoro-fastapi-rocm:latest` (experimental, x86_64 only)

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest                                       # CPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest                            # NVIDIA (x86_64 or arm64)
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest-cu128                      # NVIDIA Blackwell / RTX 50-series
docker run --device=/dev/kfd --device=/dev/dri -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-rocm:latest  # AMD

Configuration via environment variables, see core/config.py. The :latest and :latest-cu126 tags resolve to the same multi-arch image.

Quick Start (docker compose)

Install prerequisites, and start the service using Docker Compose (Full setup including UI):

Install Docker

Clone the repository:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

cd docker/gpu   # For NVIDIA GPU support
# or cd docker/cpu   # For CPU support
# or cd docker/rocm  # For AMD GPU (ROCm, experimental, amd64 only)
docker compose up --build

# *Note for Apple Silicon (M1/M2/M3) users:
# The Docker GPU image is CUDA-only and won't run on Apple Silicon. With Docker, use `docker/cpu`.
# For native MPS (Apple GPU) acceleration, run directly via UV with `./start-gpu_mac.sh`.

# Models will auto-download, but if needed you can manually download:
python docker/scripts/download_model.py --output api/src/models/v1_0

# Or run directly via UV:
./start-gpu.sh  # For GPU support
./start-cpu.sh  # For CPU support

Direct Run (via uv)

Install prerequisites ():
- Install astral-uv
- Install espeak-ng in your system if you want it available as a fallback for unknown words/sounds. The upstream libraries may attempt to handle this, but results have varied.
- Clone the repository:
```
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
```
  Run the model download script if you haven't already
  
  Start directly via UV (with hot-reload)
  
  Linux and macOS
```
./start-cpu.sh OR
./start-gpu.sh 
```
  Windows
```
.\start-cpu.ps1 OR
.\start-gpu.ps1 
```

Up and Running?

Run locally as an OpenAI-Compatible Speech Endpoint

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed"
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_sky+af_bella", #single or multiple voicepack combo
    input="Hello world!"
  ) as response:
      response.stream_to_file("output.mp3")

The API will be available at http://localhost:8880
API Documentation: http://localhost:8880/docs
Web Interface: http://localhost:8880/web

Features

OpenAI-Compatible Speech Endpoint

# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  
    voice="af_bella+af_sky", # see /api/src/core/openai_mappings.json to customize
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Or Via Requests:

import requests


response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",  
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

Quick tests (run from another terminal):

python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI Compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voices

Voice Combination

Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for 67%/33% mix)
Ratios are automatically normalized to sum to 100%
Available through any endpoint by adding weights in parentheses
Saves generated voicepacks for future use

Combine voices and generate audio:

import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = [v["id"] for v in response.json()["voices"]]

# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)

Multiple Output Audio Formats

mp3
wav
opus
flac
m4a
pcm

Streaming Support

# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, 
    channels=1, 
    rate=24000, 
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)

Or via requests:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass

Key Streaming Metrics:

First token latency @ chunksize
- ~300ms (GPU) @ 400
- ~3500ms (CPU) @ 200 (older i7)
- ~<1s (CPU) @ 200 (M3 Pro)
Adjustable chunking settings for real-time playback

Note: Artifacts in intonation can increase with smaller chunks

Processing Details

Performance & Benchmarks

Hardware variants

# GPU: Requires NVIDIA driver with CUDA 12.6+ support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build

# AMD GPU: ROCm 6.4 (experimental, amd64 only)
cd docker/rocm
docker compose up --build

Throughput

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

Windows 11 Home w/ WSL2
NVIDIA 4060Ti 16gb GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5GHz
64gb RAM
WAV native output
H.G. Wells - The Time Machine (full text)

Key Performance Metrics:

Realtime Speed: Ranges between 35x-100x (generation time to output audio length)
Average Processing Rate: 137.67 tokens/second (cl100k_base)

Transcription roundtrip (WER/CER)

End-to-end roundtrip: synthesize with Kokoro, transcribe the result back with faster-whisper, compare to the source text. Scripts and data live under examples/assorted_checks/test_transcription/.

Long-form English (full book, A Journey to the Centre of the Earth, Project Gutenberg, voice af_heart, base.en Whisper on CUDA float16, baseline captured on cu126 GPU build):

Run	Input chars	Audio length	Synth speedup	Transcribe speedup	WER
Short (~ch.7)	64,996	66m 06s	36.4x rt	62.4x rt	0.047
Full book	502,766	507m 52s	45.7x rt	65.1x rt	0.033

See examples/assorted_checks/test_transcription/BASELINE.md for the full regression bands.

Per-language check (single-sentence per voice, multilingual Whisper small. WER for Latin scripts, CER for ja/zh/hi):

Language	Voice	Metric	Score
English	`af_heart`	WER	0.000
English (UK)	`bf_emma`	WER	0.111
Spanish	`ef_dora`	WER	0.000
French	`ff_siwis`	WER	0.000
Italian	`if_sara`	WER	0.000
Portuguese	`pf_dora`	WER	0.000
Hindi	`hf_alpha`	CER	0.059
Japanese	`jf_alpha`	CER	0.000
Chinese	`zf_xiaobei`	CER	0.143

Caveat: these are single short sentences, not a comprehensive per-language quality benchmark. They confirm each voice produces transcribable audio in its target language; deeper quality evaluation per language is open work.

To reproduce, see examples/assorted_checks/test_transcription/README.md.

Natural Boundary Detection

Automatically splits and stitches at sentence boundaries
Helps to reduce artifacts and allow long form processing as the base model is only currently configured for approximately 30s output

The model is capable of processing up to a 510 phonemized token chunk at a time, however, this can often lead to 'rushed' speech or other artifacts. An additional layer of chunking is applied in the server, that creates flexible chunks with a TARGET_MIN_TOKENS , TARGET_MAX_TOKENS, and ABSOLUTE_MAX_TOKENS which are configurable via environment variables, and set to 175, 250, 450 by default

Timestamped Captions & Phonemes

Generate audio with word-level timestamps without streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False
)

with open("output.mp3","wb") as f:

    audio_json=json.loads(response.content)
    
    # Decode base 64 stream to bytes
    chunk_audio=base64.b64decode(audio_json["audio"].encode("utf-8"))
    
    # Process streaming chunks
    f.write(chunk_audio)
    
    # Print word level timestamps
    print(audio_json["timestamps"])

Generate audio with word-level timestamps with streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True
)

f=open("output.mp3","wb")
for chunk in response.iter_lines(decode_unicode=True):
    if chunk:
        chunk_json=json.loads(chunk)
        
        # Decode base 64 stream to bytes
        chunk_audio=base64.b64decode(chunk_json["audio"].encode("utf-8"))
        
        # Process streaming chunks
        f.write(chunk_audio)
        
        # Print word level timestamps
        print(chunk_json["timestamps"])

Phoneme & Token Routes

Convert text to phonemes and/or generate audio directly from phonemes:

import requests

def get_phonemes(text: str, language: str = "a"):
    """Get phonemes and tokens for input text"""
    response = requests.post(
        "http://localhost:8880/dev/phonemize",
        json={"text": text, "language": language}  # "a" for American English
    )
    response.raise_for_status()
    result = response.json()
    return result["phonemes"], result["tokens"]

def generate_audio_from_phonemes(phonemes: str, voice: str = "af_bella"):
    """Generate audio from phonemes"""
    response = requests.post(
        "http://localhost:8880/dev/generate_from_phonemes",
        json={"phonemes": phonemes, "voice": voice},
        headers={"Accept": "audio/wav"}
    )
    if response.status_code != 200:
        print(f"Error: {response.text}")
        return None
    return response.content

# Example usage
text = "Hello world!"
try:
    # Convert text to phonemes
    phonemes, tokens = get_phonemes(text)
    print(f"Phonemes: {phonemes}")  # e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst
    print(f"Tokens: {tokens}")      # Token IDs including start/end tokens

    # Generate and save audio
    if audio_bytes := generate_audio_from_phonemes(phonemes):
        with open("speech.wav", "wb") as f:
            f.write(audio_bytes)
        print(f"Generated {len(audio_bytes)} bytes of audio")
except Exception as e:
    print(f"Error: {e}")

See examples/phoneme_examples/generate_phonemes.py for a sample script.

Debug Endpoints

Monitor system state and resource usage with these endpoints:

/debug/threads - Get thread information and stack traces
/debug/storage - Monitor temp file and output directory usage
/debug/system - Get system information (CPU, memory, GPU)

Useful for debugging resource exhaustion or performance issues.

Logging

Global API loguru logging level can be set using the API_LOG_LEVEL environment variable. Defaults to DEBUG.

Docker

Modify the appropriate compose yml or append to command line.

docker run --env 'API_LOG_LEVEL=WARNING' ...

Direct via UV

Linux and macOS

export API_LOG_LEVEL=WARNING
./start-cpu.sh OR
./start-gpu.sh

Windows

$env:API_LOG_LEVEL = 'WARNING'
.\start-cpu.ps1 OR
.\start-gpu.ps1

Known Issues & Troubleshooting

Missing words & Missing some timestamps

The api will automatically do text normalization on input text which may incorrectly remove or change some phrases. This can be disabled by adding "normalization_options":{"normalize": false} to your request json:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_heart",
        "response_format": "pcm",
        "normalization_options":
        {
            "normalize": False
        }
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass

Linux GPU Permissions

Some Linux users may encounter GPU permission issues when running as non-root. Can't guarantee anything, but here are some common solutions, consider your security requirements carefully

Option 1: Container Groups (Likely the best option)

services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"

Option 2: Host System Groups

services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"

Note: May require adding host user to groups: sudo usermod -aG docker,video $USER and system restart.

Option 3: Device Permissions (Use with caution)

services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm

⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.

Visit NVIDIA Container Toolkit installation for more detailed information

WAV duration reported as nonsense in some readers

WAV responses ship with streaming-sentinel (0xFFFFFFFF) size fields in the header. Most readers (soundfile, pydub/ffmpeg, browsers, OS players) handle this fine. Python's stdlib wave does not, and reports a bogus duration. Use soundfile.info(path).duration or ffprobe for exact length.

Project

Versioning & Development

Branching Strategy:

release branch: Contains the latest stable build, recommended for production use. Docker images tagged with specific versions are built from this branch.
master branch: Used for active development. It may contain experimental features, ongoing changes, or fixes not yet in a stable release. Use this branch if you want the absolute latest code, but be aware it might be less stable. The latest Docker tag often points to builds from this branch.

Note: This is a development focused project at its core.

If you run into trouble, you may have to roll back a version on the release tags if something comes up, or build up from source and/or troubleshoot + submit a PR.

Free and open source is a community effort, and there's only really so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs/features/etc you find during use.

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0

Contributor Stats

Made with contrib.rocks.

Name		Name	Last commit message	Last commit date
Latest commit History 509 Commits
.github		.github
api		api
assets		assets
charts/kokoro-fastapi		charts/kokoro-fastapi
dev		dev
docker		docker
docs		docs
examples		examples
scripts		scripts
ui		ui
web		web
.codeql-config.yml		.codeql-config.yml
.coveragerc		.coveragerc
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
.ruff.toml		.ruff.toml
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
VERSION		VERSION
debug.http		debug.http
docker-bake.hcl		docker-bake.hcl
githubbanner.png		githubbanner.png
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
start-cpu.ps1		start-cpu.ps1
start-cpu.sh		start-cpu.sh
start-gpu.ps1		start-gpu.ps1
start-gpu.sh		start-gpu.sh
start-gpu_mac.sh		start-gpu_mac.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

_{_FastKoko}

Integration Guides

Get Started

Features

Processing Details

Hardware variants

Throughput

Transcription roundtrip (WER/CER)

Known Issues & Troubleshooting

Option 1: Container Groups (Likely the best option)

Option 2: Host System Groups

Option 3: Device Permissions (Use with caution)

Project

Contributor Stats

About

Uh oh!

Releases 17

Sponsor this project

Uh oh!

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

FastKoko

Integration Guides

Get Started

Features

Processing Details

Hardware variants

Throughput

Transcription roundtrip (WER/CER)

Known Issues & Troubleshooting

Option 1: Container Groups (Likely the best option)

Option 2: Host System Groups

Option 3: Device Permissions (Use with caution)

Project

Contributor Stats

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 17

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

_{_FastKoko}

Packages