csm.rs

csm.rs is a high-performance Rust implementation of Sesame's Conversational Speech Model (CSM), designed for fast, efficient, and real-time streaming text-to-speech (TTS) inference. It is built on the candle machine learning framework.

This implementation is simple, straightforward, and aims for raw performance.

✨ Features

  • ⚡️ Blazing-Fast: High-performance inference powered by Rust and candle.
  • 🤗 Broad Model Support: Natively supports both the original sesame/csm-1b weights and weights from Hugging Face transformers-compatible fine-tunes.
  • 🤏 Quantization: Supports GGUF-based q8_0 and q4_k quantization for reduced memory footprint and faster inference on CPU.
  • ⚙️ Multiple Backends: Leverages candle to support multiple hardware targets, including MKL, Accelerate (macOS), CUDA, cuDNN, and Metal (Apple Silicon).
  • 🔌 OpenAI Compatible: Includes an OpenAI-compatible API web server for seamless integration with existing tools.

🚀 Getting Started

Compilation

To build the project, select the appropriate feature flag for your target hardware. The project provides three main binaries: main (for command-line interface usage), benchmark (for throughput measurement), and server (for the OpenAI-compatible API).

CPU (MKL - Linux/Windows): For optimal performance on Intel CPUs.

RUSTFLAGS="-C target-cpu=native" cargo build --release --features mkl

CPU (Accelerate - macOS): For optimal performance on Apple CPUs.

RUSTFLAGS="-C target-cpu=native" cargo build --release --features accelerate

NVIDIA GPU (CUDA): Requires the CUDA Toolkit to be installed.

cargo build --release --features cuda

NVIDIA GPU (cuDNN): For faster CUDA performance with cuDNN.

cargo build --release --features cudnn

Apple Silicon GPU (Metal): For running on M-series Macs.

cargo build --release --features metal

The compiled binaries will be available in the ./target/release/ directory.

💻 Usage

Command-Line Interface (CLI)

The CLI allows you to generate audio directly from your terminal. Models are downloaded automatically from the Hugging Face Hub on first use.

Generate audio with a full-precision model:

./target/release/main \
    --text "Hello there from the full precision model" \
    --model-id "sesame/csm-1b" \
    --output "output_fp16.wav"

Generate audio with a quantized model:

./target/release/main \
    --text "Hello there from the quantized model" \
    --model-id cartesia/sesame-csm-1b-gguf \
    --model-file q8.gguf \
    --output "output_q8.wav"

To quantize your own models, see the Quantization section.

OpenAI-Compatible Server

csm.rs includes a server that is compatible with the OpenAI Speech API, allowing you to use it as a drop-in replacement.

Start the server with a full-precision model:

./target/release/server --port 8080 --model-id "sesame/csm-1b"

Start the server with a quantized model:

./target/release/server \
    --port 8080 \
    --model-id cartesia/sesame-csm-1b-gguf \
    --model-file q8.gguf

Python Client Example: You can use the official OpenAI Python client to interact with the server.

# pip install openai
from openai import OpenAI
from pathlib import Path

# Point the client to your local server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Request speech synthesis
response = client.audio.speech.create(
    model="csm-1b", # Model name is ignored by the server but required by the API
    input="Hello! This audio was generated by the server.",
    voice="alloy", # Voice is ignored, use speaker_id instead
    # You can pass custom parameters in extra_body
    extra_body={
        "speaker_id": 0,
        "temperature": 0.7,
    }
)

# Save the output to a file (write_to_file replaces the deprecated
# stream_to_file on non-streaming responses)
speech_file_path = Path("server_output.wav")
response.write_to_file(speech_file_path)

# Or use the streaming endpoint
with client.audio.speech.with_streaming_response.create(
    model="csm-1b",
    voice="alloy",
    input="Hello from the streaming endpoint",
    response_format="wav",
    extra_body=dict(
        speaker_id=0,
    )
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        print(chunk)

Command-Line Arguments

All binaries share a common set of arguments for model loading and hardware selection.

Common Arguments (for main, benchmark, server)

| Argument | Description | Default Value |
| --- | --- | --- |
| --weights-path | Absolute path to a weight file (.safetensors or .gguf). Overrides all other model loading options. | None |
| --model-id | The model ID from the Hugging Face Hub (e.g., 'sesame/csm-1b'). | None |
| --model-path | Path to a local directory containing the model files. | None |
| --model-file | The name of a single model file to use within a --model-id or --model-path. | None |
| --index-file | The name of the index file for sharded models. | None |
| --tokenizer-id | The tokenizer ID from the Hugging Face Hub. Defaults to the --model-id if not set. | None |
| --cpu | If set, forces the computation to run on the CPU. | false |

Specific Arguments for main (CLI)

| Argument | Description | Default Value |
| --- | --- | --- |
| --text | The text to generate audio from. | "Hello there, this is a test" |
| --output | The path to save the output .wav file. | "csm_output.wav" |
| --speaker-id | The speaker ID to use for generation. | 0 |
| --temperature | Sampling temperature. | 0.7 |
| --top-k | The number of highest probability tokens to consider for sampling (Top-K). | 100 |
| --max-audio-len-ms | The maximum length of the generated audio in milliseconds. | 30000.0 |
| --buffer-size | The number of audio frames to buffer before decoding to audio. | 20 |
| --tokenizer-template | A custom tokenizer template (example below the table). | None |

Example --tokenizer-template value: "<|begin_of_text|>[{speaker_id}]{text}<|end_of_text|>"
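
The --tokenizer-template example above uses two placeholders, {speaker_id} and {text}. As a rough illustration of how such a template could expand into a prompt string, here is a minimal Python sketch. The helper name render_prompt is hypothetical, and plain string replacement is an assumption, not necessarily how csm.rs substitutes the values internally.

```python
def render_prompt(template: str, speaker_id: int, text: str) -> str:
    """Fill the {speaker_id} and {text} placeholders of a tokenizer template.

    Hypothetical helper for illustration only. Plain string replacement is
    used so that any other characters in the template are left untouched.
    """
    return template.replace("{speaker_id}", str(speaker_id)).replace("{text}", text)


# Template taken from the argument description above.
template = "<|begin_of_text|>[{speaker_id}]{text}<|end_of_text|>"
prompt = render_prompt(template, speaker_id=0, text="Hello there")
print(prompt)
```

With speaker ID 0, the speaker tag in the rendered prompt becomes [0], followed directly by the input text.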

Specific Arguments for benchmark

| Argument | Short | Description | Default Value |
| --- | --- | --- | --- |
| --text | -t | The text to use for benchmarking. | "Hi there, this is a test" |
| --warmup-runs | -w | The number of warm-up runs to perform before timing. | 1 |
| --num-runs | -n | The number of timed runs to perform for the benchmark. | 5 |
| --speaker-id | | The speaker ID to use for generation. | 0 |
| --temperature | | Sampling temperature. | 0.7 |
| --top-k | | The number of highest probability tokens to consider for sampling (Top-K). | 100 |
| --buffer-size | | The number of audio frames to buffer before decoding to audio. | 20 |
| --tokenizer-template | | A custom tokenizer template (example below the table). | None |

Example --tokenizer-template value: "<|begin_of_text|>[{speaker_id}]{text}<|end_of_text|>"

Specific Arguments for server

| Argument | Description | Default Value |
| --- | --- | --- |
| --host | The host address to bind the server to. | "0.0.0.0" |
| --port | The port to run the server on. | 8080 |
| --api-key | If set, requires clients to provide this key in the Authorization: Bearer <key> header. | None |
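
When --api-key is set, every request has to carry the key in the Authorization header. With the OpenAI Python client this happens automatically, because the api_key argument is sent as a Bearer token. For clients that build requests by hand, the following minimal sketch shows what such a request could look like. It only constructs the request and does not send it. The key "my-secret-key" is a placeholder, and the /audio/speech route under the /v1 base URL is assumed from the OpenAI Speech API that the server is described as compatible with.

```python
import json
import urllib.request


def build_speech_request(base_url: str, api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint.

    Hypothetical helper: the route and payload fields mirror the OpenAI
    client example in this README, not a verified server specification.
    """
    payload = {
        "model": "csm-1b",  # ignored by the server, per the client example
        "voice": "alloy",   # ignored by the server, per the client example
        "input": text,
        "speaker_id": 0,    # custom parameter, sent at the top level of the body
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_speech_request("http://localhost:8080/v1", "my-secret-key", "Hello")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the resulting object to urllib.request.urlopen would send it to a running server. The key used here has to match the value given to --api-key.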

🤏 Quantization

You can significantly reduce the model size and improve CPU inference speed by quantizing the weights to 8-bit (q8_0) or 4-bit (q4_k). We use the GGUF file format for quantized models.

A Python script is provided to handle downloading, loading, and converting .safetensors weights into a quantized GGUF file. The script can work directly with both single-file and sharded models from local paths or the Hugging Face Hub.

  1. Install dependencies:

    pip install -r scripts/requirements.txt

  2. Run the quantization script:

    The script can quantize a model directly from the Hugging Face Hub or from a local directory.

    To quantize a model from the Hub (e.g., sesame/csm-1b) to Q8_0:

    python scripts/quantize.py \
        --model-id "sesame/csm-1b" \
        --index-file "transformers.safetensors.index.json" \
        --output-path ./csm-1b-q8_0.gguf \
        --qtype q8_0

    To quantize a local model to Q4_K:

    python scripts/quantize.py \
        --model-path /path/to/your/local/model/directory \
        --output-path ./csm-1b-q4_k.gguf \
        --qtype q4_k
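
To get a feel for the size reduction, the sketch below estimates bits per weight and file size for the two quantization types. The block layouts are those of the GGML/GGUF q8_0 and q4_k formats. The parameter count of 1e9 is a round-number assumption for a "1b" model, and the estimate ignores file metadata and any tensors the script leaves unquantized, so real files will differ somewhat.

```python
# Block layouts of the GGML/GGUF quantization formats:
#   q8_0: 32 weights per block  -> 32 int8 values + one fp16 scale = 34 bytes
#   q4_k: 256 weights per block -> 144 bytes (4-bit values plus scales and mins)
BLOCK_LAYOUT = {
    "q8_0": (32, 34),
    "q4_k": (256, 144),
}


def bits_per_weight(qtype: str) -> float:
    """Average storage cost of one weight, including per-block overhead."""
    weights_per_block, bytes_per_block = BLOCK_LAYOUT[qtype]
    return bytes_per_block * 8 / weights_per_block


def estimated_size_gb(num_params: float, qtype: str) -> float:
    """Rough size in GB if every parameter were stored in the given format."""
    return num_params * bits_per_weight(qtype) / 8 / 1e9


# Assumption: roughly one billion parameters, all of them quantized.
for qtype in BLOCK_LAYOUT:
    bpw = bits_per_weight(qtype)
    size = estimated_size_gb(1e9, qtype)
    print(f"{qtype}: {bpw} bits/weight, about {size:.2f} GB")
```

For comparison, the same parameter count stored as 16-bit floats would take about 2 GB.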

📊 Benchmarks

You can run the built-in benchmark tool to measure performance on your hardware. The tool reports the Real-Time Factor (RTF), which is the time taken to generate 1 second of audio (lower is better), and Throughput (xRealTime), the reciprocal of RTF, which is how many seconds of audio are generated in 1 second of compute (higher is better).

Example benchmark commands:

# For a full-precision model with CUDA
cargo run --release --features cuda --bin benchmark

# For a quantized model on CPU
RUSTFLAGS="-C target-cpu=native" cargo run --release --features mkl --bin benchmark -- --weights-path ./csm-1b-q8_0.gguf
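
The two reported metrics follow directly from the definitions above. Here is a minimal Python sketch of the arithmetic. The function names and the timing values are hypothetical and are not taken from the benchmark tool's code or output.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio (lower is better)."""
    return generation_seconds / audio_seconds


def throughput_x_realtime(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of audio generated per second of compute (higher is better)."""
    return audio_seconds / generation_seconds


# Hypothetical run: 2.0 s of compute produced 5.0 s of audio.
rtf = real_time_factor(2.0, 5.0)
xrt = throughput_x_realtime(2.0, 5.0)
print(f"RTF = {rtf:.2f}, throughput = {xrt:.2f}x real time")
```

An RTF below 1.0 (equivalently, a throughput above 1.0x) means audio is generated faster than it plays back, which is the threshold for real-time streaming.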

📜 License

This project is licensed under the GNU Affero General Public License Version 3. See the LICENSE file for details.

🤝 Contributing

Contributions are welcome!

If you have suggestions for improvements, find a bug, or want to add a new feature, please feel free to open an issue or submit a pull request.
