OpenWakeWord Trainer

Train custom wake word models for OpenWakeWord using synthetic voices from Kokoro TTS combined with your real voice recordings.

Why this exists: The official OpenWakeWord training process relies on Google Colab notebooks that frequently break. This repo provides a working local training pipeline that produces quality models.

What You Get

  • A trained .onnx wake word model (~400KB)
  • Works with OpenWakeWord, Home Assistant, or any system that supports ONNX models
  • Typical results: 70%+ accuracy, <2 false positives per hour

Requirements

  • NVIDIA GPU with CUDA (RTX 3060 12GB or better recommended)
  • Docker with NVIDIA Container Toolkit
  • ~20GB disk space for training data

Quick Start (Docker)

Docker is the recommended approach: it isolates the pinned dependencies so they cannot conflict with other Python packages on your system.

1. Clone

git clone https://github.com/CoreWorxLab/openwakeword-training.git
cd openwakeword-training

2. Download Training Data (~17GB, one-time)

docker compose build trainer
docker compose run --rm trainer ./setup-data.sh

3. Record Your Voice (Optional but Recommended)

Recording 20-50 samples of your actual voice significantly improves detection. This runs on your host machine (needs microphone access):

pip install pyaudio numpy scipy
python record_samples.py --wake-word "hey cal"

  • Press ENTER to start each 2-second recording
  • Say your wake word naturally
  • Vary your tone, speed, and distance from the mic
  • Press 'q' to quit

4. Train Your Model

docker compose run --rm trainer python train.py --wake-word "hey cal" --data-dir /app/data

Training takes 4-8 hours depending on GPU.

5. Test Your Model

Test on your host machine (needs microphone access):

pip install openwakeword pyaudio numpy
python test_model.py --model my_custom_model/hey_cal.onnx

Speak your wake word into the microphone and watch for detections.

Configuration

Parameter            Default                 Description
--wake-word          "hey cal"               The wake word/phrase to detect
--samples-per-voice  200                     Samples generated per Kokoro voice
--training-steps     50000                   More steps = better but slower
--layer-size         64                      Network size (32, 64, or 128)
--kokoro-url         http://localhost:8880   Kokoro TTS endpoint
--data-dir           .                       Training data directory (/app/data for Docker)
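To make the flags and defaults concrete, here is a minimal argparse sketch that mirrors the table above. It is not the actual parser in train.py: the function name build_parser is hypothetical, and enforcing the three layer sizes through choices is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Wake word training options (sketch)")
    p.add_argument("--wake-word", default="hey cal",
                   help="The wake word/phrase to detect")
    p.add_argument("--samples-per-voice", type=int, default=200,
                   help="Samples generated per Kokoro voice")
    p.add_argument("--training-steps", type=int, default=50000,
                   help="More steps = better but slower")
    # Assumption: only the three documented sizes are accepted.
    p.add_argument("--layer-size", type=int, default=64, choices=[32, 64, 128],
                   help="Network size")
    p.add_argument("--kokoro-url", default="http://localhost:8880",
                   help="Kokoro TTS endpoint")
    p.add_argument("--data-dir", default=".",
                   help="Training data directory (/app/data for Docker)")
    return p


# Override only what differs from the defaults, as in the Docker command.
args = build_parser().parse_args(["--layer-size", "128", "--data-dir", "/app/data"])
print(args.wake_word, args.layer_size, args.training_steps, args.data_dir)
```

Flags that are not passed keep the defaults from the table, so the Docker invocation only needs --wake-word and --data-dir.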

How It Works

  1. Sample Generation - Creates ~13K positive samples using 67 Kokoro voices with speed variation (0.7-1.3x), plus your real recordings (weighted 3x)

  2. Negative Samples - Generates samples of clearly different phrases ("hello", "hey siri", "alexa") to teach the model what NOT to detect

  3. Augmentation - OpenWakeWord adds noise, reverb, and mixing to simulate real-world conditions

  4. Training - Neural network learns to distinguish your wake word from everything else
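The numbers in step 1 can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, drawing speeds uniformly is an assumption, and repeating each real recording three times is just one simple way to realize a 3x weight (the repo may weight samples differently).

```python
import random

NUM_VOICES = 67          # Kokoro voices, from step 1
SAMPLES_PER_VOICE = 200  # default of --samples-per-voice
REAL_WEIGHT = 3          # real recordings are weighted 3x


def synthetic_count(num_voices=NUM_VOICES, per_voice=SAMPLES_PER_VOICE):
    """Number of synthetic positives: one batch of samples per voice."""
    return num_voices * per_voice


def sample_speeds(n, rng, low=0.7, high=1.3):
    """Draw n playback speeds in the documented 0.7-1.3x range.

    Assumption: speeds are drawn uniformly; the real distribution is not stated.
    """
    return [rng.uniform(low, high) for _ in range(n)]


def weighted_positives(synthetic, real, real_weight=REAL_WEIGHT):
    """Assumption: weight real recordings by repeating them real_weight times."""
    return list(synthetic) + list(real) * real_weight


print(synthetic_count())  # 67 * 200 = 13400, i.e. the "~13K" in step 1
print(weighted_positives(["tts_1", "tts_2"], ["mic_1"]))
```

With the defaults, 30 real recordings would count as 90 samples next to the 13,400 synthetic ones.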

Key Insight

Don't use similar-sounding negatives. Training on phrases like "hey call" or "hey carl" actually hurts performance. Use only clearly different phrases like "hello", "hey siri", "alexa".

Output

my_custom_model/
├── hey_cal.onnx          # Your trained model - use this!
└── hey_cal/
    ├── positive_train/   # Generated training samples
    ├── positive_test/    # Test samples
    ├── negative_train/   # Negative training samples
    └── negative_test/    # Negative test samples

Using Your Model

from openwakeword.model import Model

model = Model(wakeword_models=["my_custom_model/hey_cal.onnx"])

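# audio_frame: 16 kHz mono, 16-bit PCM samples as a NumPy int16 array
# (openWakeWord works best with frames that are multiples of 80 ms = 1280 samples)
prediction = model.predict(audio_frame)

# Scores are keyed by model name, here taken from the file name hey_cal.onnx
if prediction["hey_cal"] > 0.5:
    print("Wake word detected!")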
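The snippet above leaves open where audio_frame comes from. As a rough sketch, the code below feeds a WAV file to the model in 80 ms frames (1280 samples at 16 kHz). The function names iter_pcm_frames and detect_in_wav are hypothetical, and the score key is assumed to be the .onnx file name without its extension, as in the snippet above.

```python
import os
import wave

FRAME_SAMPLES = 1280  # 80 ms of audio at 16 kHz


def iter_pcm_frames(wav_path, frame_samples=FRAME_SAMPLES):
    """Yield raw 16-bit PCM chunks of frame_samples samples each."""
    with wave.open(wav_path, "rb") as wf:
        fmt = (wf.getframerate(), wf.getnchannels(), wf.getsampwidth())
        if fmt != (16000, 1, 2):
            raise ValueError("expected 16 kHz, mono, 16-bit PCM audio")
        while True:
            chunk = wf.readframes(frame_samples)
            if len(chunk) < frame_samples * 2:  # 2 bytes per sample
                break  # drop the trailing partial frame
            yield chunk


def detect_in_wav(model_path, wav_path, threshold=0.5):
    """Return the indices of frames whose score exceeds threshold."""
    # Imported here so the framing helper works without these packages.
    import numpy as np
    from openwakeword.model import Model

    model = Model(wakeword_models=[model_path])
    # Assumption: the score key is the model file name without extension.
    name = os.path.splitext(os.path.basename(model_path))[0]
    hits = []
    for i, chunk in enumerate(iter_pcm_frames(wav_path)):
        scores = model.predict(np.frombuffer(chunk, dtype=np.int16))
        if scores[name] > threshold:
            hits.append(i)
    return hits
```

Calling detect_in_wav("my_custom_model/hey_cal.onnx", "clip.wav") on one of your own recordings (clip.wav is a placeholder name) gives a quick offline check before wiring the model to a microphone.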

Manual Setup (No Docker)

If you prefer not to use Docker, you can set up the environment directly:

./setup.sh
source venv/bin/activate

# Start Kokoro TTS separately
docker run -d --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

python train.py --wake-word "hey cal"

Note: This requires Python 3.10+ and working CUDA. The pinned dependency versions in requirements.txt can conflict with other Python packages on your system, which is why Docker is recommended.

Troubleshooting

"Reached EOF prematurely" warnings

This is normal. Kokoro's WAV headers have a quirk, but the audio data is fine.

Low recall in training metrics

Training metrics use synthetic test samples. Real-world performance is usually better.

Model not detecting wake word

  • Ensure audio is 16kHz mono
  • Model needs ~2 seconds of audio buffer to warm up
  • Try lowering detection threshold (default 0.5)
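The checks in this list can be scripted. The sketch below uses only the standard library; wav_problems and detections_by_threshold are hypothetical helpers, not part of this repo, and the 16-bit sample width is an assumption based on the PCM format openWakeWord expects.

```python
import wave


def wav_problems(path, warmup_seconds=2.0):
    """Return a list of format mismatches; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected mono")
        if wf.getsampwidth() != 2:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
        seconds = wf.getnframes() / rate
    if seconds < warmup_seconds:
        problems.append(f"only {seconds:.2f} s of audio, model needs ~2 s to warm up")
    return problems


def detections_by_threshold(scores, thresholds=(0.5, 0.4, 0.3)):
    """Count how many scores exceed each candidate threshold."""
    return {t: sum(s > t for s in scores) for t in thresholds}


# Scores logged from a test run (made-up values for illustration).
print(detections_by_threshold([0.1, 0.35, 0.45, 0.6]))
```

If lowering the threshold is what makes detections appear, keep an eye on false positives at the new value as well.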

TFLite conversion error at end

Safe to ignore: the ONNX model is saved successfully before this error occurs.

Credits

License

MIT
