csm.rs is a high-performance Rust implementation of Sesame's Conversational Speech Model (CSM), designed for fast, efficient, and real-time streaming text-to-speech (TTS) inference. It is built on the candle machine learning framework.
This implementation is simple, straightforward, and aims for raw performance.
- ⚡️ Blazing-Fast: High-performance inference powered by Rust and `candle`.
- 🤗 Broad Model Support: Natively supports both the original `sesame/csm-1b` weights and weights from Hugging Face `transformers`-compatible fine-tunes.
- 🤏 Quantization: Supports GGUF-based `q8_0` and `q4_k` quantization for reduced memory footprint and faster inference on CPU.
- ⚙️ Multiple Backends: Leverages `candle` to support multiple hardware targets, including MKL, Accelerate (macOS), CUDA, cuDNN, and Metal (Apple Silicon).
- 🔌 OpenAI Compatible: Includes an OpenAI-compatible API web server for seamless integration with existing tools.
To build the project, select the appropriate feature flag for your target hardware. The project provides three main binaries: main (for command-line interface usage), benchmark (for throughput measurement), and server (for the OpenAI-compatible API).
CPU (MKL - Linux/Windows): for optimal performance on Intel CPUs.

```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release --features mkl
```

CPU (Accelerate - macOS): for optimal performance on Apple CPUs.

```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release --features accelerate
```

NVIDIA GPU (CUDA): requires the CUDA Toolkit to be installed.

```bash
cargo build --release --features cuda
```

NVIDIA GPU (cuDNN): for faster CUDA performance with cuDNN.

```bash
cargo build --release --features cudnn
```

Apple Silicon GPU (Metal): for running on M-series Macs.

```bash
cargo build --release --features metal
```

The compiled binaries will be available in the `./target/release/` directory.
The CLI allows you to generate audio directly from your terminal. Models are downloaded automatically from the Hugging Face Hub on first use.
Generate audio with a full-precision model:
```bash
./target/release/main \
  --text "Hello there from the full precision model" \
  --model-id "sesame/csm-1b" \
  --output "output_fp16.wav"
```

Generate audio with a quantized model:

```bash
./target/release/main \
  --text "Hello there from the quantized model" \
  --model-id cartesia/sesame-csm-1b-gguf \
  --model-file q8.gguf \
  --output "output_q8.wav"
```

To quantize your own models, see the Quantization section.
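To sanity-check a generated file, a short script can report its sample rate and duration. This is a minimal sketch using only Python's standard `wave` module; it assumes the output is an integer PCM WAV file (the `wave` module cannot read floating-point WAV data), and the helper name `wav_summary` is hypothetical, not part of csm.rs.

```python
import wave


def wav_summary(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        # Duration is the number of sample frames divided by the sample rate.
        duration = wav.getnframes() / rate
    return rate, channels, duration
```

For example, `wav_summary("output_fp16.wav")` would return the properties of the file produced by the first command above.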
csm.rs includes a server that is compatible with the OpenAI Speech API, allowing you to use it as a drop-in replacement.
Start the server with a full-precision model:
```bash
./target/release/server --port 8080 --model-id "sesame/csm-1b"
```

Start the server with a quantized model:

```bash
./target/release/server \
  --port 8080 \
  --model-id cartesia/sesame-csm-1b-gguf \
  --model-file q8.gguf
```

Python client example: you can use the official OpenAI Python client to interact with the server.
```python
# pip install openai
from openai import OpenAI
from pathlib import Path

# Point the client to your local server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Request speech synthesis
response = client.audio.speech.create(
    model="csm-1b",  # Model name is ignored by the server but required by the API
    input="Hello! This audio was generated by the server.",
    voice="alloy",  # Voice is ignored, use speaker_id instead
    # You can pass custom parameters in extra_body
    extra_body={
        "speaker_id": 0,
        "temperature": 0.7,
    },
)

# Save the output to a file
speech_file_path = Path("server_output.wav")
response.stream_to_file(speech_file_path)

# Or use the streaming endpoint
with client.audio.speech.with_streaming_response.create(
    model="csm-1b",
    voice="alloy",
    input="Hello from the streaming endpoint",
    response_format="wav",
    extra_body=dict(
        speaker_id=0,
    ),
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        print(chunk)
```

All binaries share a common set of arguments for model loading and hardware selection.
| Argument | Description | Default Value |
|---|---|---|
| `--weights-path` | Absolute path to a weight file (`.safetensors` or `.gguf`). Overrides all other model loading options. | None |
| `--model-id` | The model ID from the Hugging Face Hub (e.g., `sesame/csm-1b`). | None |
| `--model-path` | Path to a local directory containing the model files. | None |
| `--model-file` | The name of a single model file to use within a `--model-id` or `--model-path`. | None |
| `--index-file` | The name of the index file for sharded models. | None |
| `--tokenizer-id` | The tokenizer ID from the Hugging Face Hub. Defaults to the `--model-id` if not set. | None |
| `--cpu` | If set, forces the computation to run on the CPU. | false |
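The precedence described in the table can be pictured as a small resolution function. This is only an illustrative sketch, not the project's actual loading code: the table states that `--weights-path` overrides the other options and that the tokenizer defaults to `--model-id`, but the relative order of `--model-path` and `--model-id`, and the error raised when nothing is set, are assumptions made here. The function name `resolve_model_source` is hypothetical.

```python
def resolve_model_source(weights_path=None, model_id=None,
                         model_path=None, tokenizer_id=None):
    """Pick a model source and tokenizer ID from the common arguments."""
    if weights_path is not None:
        # Stated in the table: overrides all other model loading options.
        source = ("weights-file", weights_path)
    elif model_path is not None:
        # Assumption: a local directory is preferred over a Hub download.
        source = ("local-dir", model_path)
    elif model_id is not None:
        source = ("hub", model_id)
    else:
        # Assumption: at least one source must be given.
        raise ValueError("no model source specified")

    # Stated in the table: the tokenizer defaults to the model ID.
    tokenizer = tokenizer_id if tokenizer_id is not None else model_id
    return source, tokenizer
```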
Arguments specific to the `main` binary:

| Argument | Description | Default Value |
|---|---|---|
| `--text` | The text to generate audio from. | "Hello there, this is a test" |
| `--output` | The path to save the output `.wav` file. | "csm_output.wav" |
| `--speaker-id` | The speaker ID to use for generation. | 0 |
| `--temperature` | Sampling temperature. | 0.7 |
| `--top-k` | The number of highest probability tokens to consider for sampling (Top-K). | 100 |
| `--max-audio-len-ms` | The maximum length of the generated audio in milliseconds. | 30000.0 |
| `--buffer-size` | The number of audio frames to buffer before decoding to audio. | 20 |
| `--tokenizer-template` | A custom tokenizer template. E.g., `"<\|begin_of_text\|>[{speaker_id}]{text}<\|end_of_text\|>"`. | None |
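To make the `--temperature` and `--top-k` parameters concrete, the following is a generic sketch of temperature-scaled top-k sampling over a list of logits. It illustrates the standard technique only; it is not taken from csm.rs, and the function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=100, rng=random):
    """Sample one token index using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(top_k, len(logits)))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `top_k=1` the function always returns the index of the largest logit; larger values of `top_k` and `temperature` allow more varied output.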
Arguments specific to the `benchmark` binary:

| Argument | Short | Description | Default Value |
|---|---|---|---|
| `--text` | `-t` | The text to use for benchmarking. | "Hi there, this is a test" |
| `--warmup-runs` | `-w` | The number of warm-up runs to perform before timing. | 1 |
| `--num-runs` | `-n` | The number of timed runs to perform for the benchmark. | 5 |
| `--speaker-id` | | The speaker ID to use for generation. | 0 |
| `--temperature` | | Sampling temperature. | 0.7 |
| `--top-k` | | The number of highest probability tokens to consider for sampling (Top-K). | 100 |
| `--buffer-size` | | The number of audio frames to buffer before decoding to audio. | 20 |
| `--tokenizer-template` | | A custom tokenizer template. E.g., `"<\|begin_of_text\|>[{speaker_id}]{text}<\|end_of_text\|>"`. | None |
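The `--tokenizer-template` example contains two placeholders, `{speaker_id}` and `{text}`. As a rough illustration of what such a template expands to, the sketch below substitutes them with plain string replacement. How csm.rs actually performs the substitution is not documented here, so treat this as an assumption; `render_template` is a hypothetical helper.

```python
EXAMPLE_TEMPLATE = "<|begin_of_text|>[{speaker_id}]{text}<|end_of_text|>"


def render_template(template, speaker_id, text):
    """Fill the {speaker_id} and {text} placeholders of a tokenizer template."""
    # Plain replacement is used instead of str.format so that any other
    # braces in the template are left untouched.
    rendered = template.replace("{speaker_id}", str(speaker_id))
    return rendered.replace("{text}", text)


print(render_template(EXAMPLE_TEMPLATE, 0, "Hello"))
# prints: <|begin_of_text|>[0]Hello<|end_of_text|>
```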
Arguments specific to the `server` binary:

| Argument | Description | Default Value |
|---|---|---|
| `--host` | The host address to bind the server to. | "0.0.0.0" |
| `--port` | The port to run the server on. | 8080 |
| `--api-key` | If set, requires clients to provide this key in the `Authorization: Bearer <key>` header. | None |
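When the server is started with `--api-key`, each request needs the matching `Authorization` header. With the OpenAI Python client this is the `api_key` argument; without it, the header can be set by hand. The sketch below builds such a request with the standard library only. It assumes the speech endpoint lives at `/audio/speech` under the base URL (the OpenAI Speech API path, consistent with the `base_url` used in the client example) and that `speaker_id` is accepted as a top-level JSON field, which is where the client's `extra_body` places it. The helper name `build_speech_request` is hypothetical.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text, speaker_id=0):
    """Build (but do not send) an authenticated speech synthesis request."""
    payload = {
        "model": "csm-1b",
        "input": text,
        "voice": "alloy",
        "speaker_id": speaker_id,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Must match the value passed to the server's --api-key flag.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; the response body would then contain the audio bytes.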
You can significantly reduce the model size and improve CPU inference speed by quantizing the weights to 8-bit (q8_0) or 4-bit (q4_k). We use the GGUF file format for quantized models.
A Python script is provided to handle downloading, loading, and converting .safetensors weights into a quantized GGUF file. The script can work directly with both single-file and sharded models from local paths or the Hugging Face Hub.
- Install dependencies:

  ```bash
  pip install -r scripts/requirements.txt
  ```

- Run the quantization script. The script can quantize a model directly from the Hugging Face Hub or from a local directory.

  To quantize a model from the Hub (e.g., `sesame/csm-1b`) to Q8_0:

  ```bash
  python scripts/quantize.py \
    --model-id "sesame/csm-1b" \
    --index-file "transformers.safetensors.index.json" \
    --output-path ./csm-1b-q8_0.gguf \
    --qtype q8_0
  ```

  To quantize a local model to Q4_K:

  ```bash
  python scripts/quantize.py \
    --model-path /path/to/your/local/model/directory \
    --output-path ./csm-1b-q4_k.gguf \
    --qtype q4_k
  ```
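For intuition about what `q8_0` does to the weights, the sketch below quantizes one block the way the GGML `Q8_0` format is generally described: weights are grouped into blocks of 32, each block stores one scale plus 32 signed 8-bit integers, and a weight is recovered as `scale * q`. This is a simplified pure-Python illustration, not the code in `scripts/quantize.py`; in particular it keeps the scale as a Python float, whereas the file format stores it in half precision.

```python
BLOCK_SIZE = 32  # number of weights per Q8_0 block


def quantize_q8_0_block(values):
    """Quantize 32 floats to (scale, list of int8 values)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("a Q8_0 block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Map the largest magnitude in the block to +/-127.
    scale = amax / 127.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]
```

Each recovered weight differs from the original by at most half a quantization step (`scale / 2`), which is why 8-bit quantization usually costs little quality while shrinking storage to roughly one byte per weight plus the per-block scale.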
You can run the built-in benchmark tool to measure the performance on your hardware. The tool reports the Real-Time Factor (RTF), which is the time taken to generate 1 second of audio (lower is better), and Throughput (xRealTime), which is how many seconds of audio are generated in 1 second (higher is better).
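The two metrics are reciprocals of each other. The sketch below computes them from a measured generation time and the duration of the audio produced; it follows the definitions above and is not the benchmark tool's own code (the function names are hypothetical).

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute spent per second of audio (lower is better)."""
    return generation_seconds / audio_seconds


def throughput_x_realtime(generation_seconds, audio_seconds):
    """Seconds of audio produced per second of compute (higher is better)."""
    return audio_seconds / generation_seconds


# Example: 8 seconds of audio generated in 2 seconds of wall-clock time.
print(real_time_factor(2.0, 8.0))       # prints 0.25
print(throughput_x_realtime(2.0, 8.0))  # prints 4.0
```

An RTF below 1.0 (equivalently, a throughput above 1.0x) means audio is generated faster than it plays back, which is the requirement for real-time streaming.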
Example benchmark command:
```bash
# For a full-precision model with CUDA
cargo run --release --features cuda --bin benchmark

# For a quantized model on CPU
RUSTFLAGS="-C target-cpu=native" cargo run --release --features mkl --bin benchmark -- --weights-path ./csm-1b-q8_0.gguf
```

This project is licensed under the GNU Affero General Public License Version 3. See the LICENSE file for details.
Contributions are welcome!
If you have suggestions for improvements, find a bug, or want to add a new feature, please feel free to open an issue or submit a pull request.