
Speechmatics Speech-to-Text (STT) Transcription Tool

Transcribe audio files using the Speechmatics API with support for speaker diarisation, 55+ languages, and batch processing.

Features

  • Speaker Diarisation: Identify and label multiple speakers (S1, S2, S3, etc.)
  • Language Support: 55+ languages including auto-detection
  • Multiple Output Formats: JSON (full API response) + TXT (human-readable transcript)
  • Idempotent: Skip re-transcription if output already exists
  • Progress Monitoring: Real-time status updates with verbose logging
  • Region Selection: EU, US, and AU endpoints for data residency compliance
  • Operating Points: Standard (faster) or Enhanced (more accurate) models
  • META Warning Messages: Automatic disclaimer about potential transcription errors

Prerequisites

# Install dependencies (handled automatically by uv)
# - requests>=2.31

# Set your Speechmatics API key
export SPEECHMATICS_API_KEY="your_api_key_here"

Get your API key at: https://portal.speechmatics.com/

Quick Start

# Basic transcription
./stt_speechmatics.py audio.mp3

# With speaker diarisation
./stt_speechmatics.py -d audio.mp3

# Specify language (German)
./stt_speechmatics.py -l de audio.mp3

# Enhanced accuracy mode
./stt_speechmatics.py --operating-point enhanced audio.mp3

# US region with diarisation and max 3 speakers
./stt_speechmatics.py -R us -d --max-speakers 3 audio.mp3

META Transcript Warning Message

By default, all transcript outputs include a META warning message to remind readers that STT transcripts may contain errors.

What It Does

Automatically prepends a disclaimer to transcript files:

  • TXT files: YAML front matter format at top of file
  • JSON files: First key: {"_meta_note": "{message}", ...}
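The prepend step for both output formats can be sketched as follows (a minimal illustration of the documented layouts; the function names are hypothetical, not the tool's actual internals):

```python
def prepend_meta_txt(transcript: str, meta: str) -> str:
    """Wrap the META message in YAML front matter above the transcript."""
    return f"---\nmeta: {meta}\n---\n{transcript}"

def prepend_meta_json(response: dict, meta: str) -> dict:
    """Return a copy of the API response with _meta_note as the first key."""
    return {"_meta_note": meta, **response}

txt = prepend_meta_txt("Hello everyone.", "DRAFT - UNVERIFIED TRANSCRIPT")
doc = prepend_meta_json({"job": {"id": "abc123"}}, "DRAFT - UNVERIFIED TRANSCRIPT")
print(txt.splitlines()[0])   # the opening "---" of the front matter
print(next(iter(doc)))       # "_meta_note"
```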

Default Message

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN TRANSCRIPTION ERRORS. This transcript was generated by automated speech recognition technology and should be treated as a rough transcription for reference purposes. Common types of errors include: incorrect word recognition (especially homophones, proper nouns, technical terminology, or words in noisy audio conditions), missing or incorrect punctuation, speaker misidentification in multi-speaker scenarios, and timing inaccuracies. For best comprehension and to mentally correct potential errors, please consider: the broader conversational context, relevant domain knowledge, technical background of the subject matter, and any supplementary information about the speakers or topic. This transcript is intended to convey the general content and flow of the conversation rather than serving as a verbatim, word-perfect record. When critical accuracy is required, please verify important details against the original audio source.
---

Disabling the META Message

Via command-line flag:

./stt_speechmatics.py --no-meta-message audio.mp3
# or
./stt_speechmatics.py --disable-meta-message audio.mp3

Via environment variable (system-wide):

export STT_META_MESSAGE_DISABLE=1
./stt_speechmatics.py audio.mp3

Custom META Message

export STT_META_MESSAGE="DRAFT - UNVERIFIED TRANSCRIPT"
./stt_speechmatics.py audio.mp3

Usage Examples

1. Basic Transcription (Single Speaker)

./stt_speechmatics.py audio.mp3

Creates:

  • audio.mp3.speechmatics.json - Full Speechmatics API response
  • audio.mp3.txt - Plain text transcript

2. Speaker Diarisation (Multiple Speakers)

# Auto-detect number of speakers
./stt_speechmatics.py -d audio.mp3

# Limit to maximum 3 speakers
./stt_speechmatics.py -d --max-speakers 3 audio.mp3

# Adjust speaker detection sensitivity (higher = more speakers)
./stt_speechmatics.py -d --speaker-sensitivity 0.7 audio.mp3

Creates:

  • audio.mp3.speechmatics.json - Full response with speaker labels
  • audio.mp3.txt - Formatted transcript: Speaker S1:\t text

Note: Speechmatics uses S1, S2, S3, etc. for speaker labels (not A, B, C like AssemblyAI).

3. Language Selection

# English (default)
./stt_speechmatics.py audio.mp3

# German
./stt_speechmatics.py -l de audio.mp3

# French
./stt_speechmatics.py -l fr audio.mp3

# Japanese
./stt_speechmatics.py -l ja audio.mp3

# Mandarin Chinese
./stt_speechmatics.py -l cmn audio.mp3

Supported language codes: en, de, fr, es, it, pt, nl, pl, ru, ja, ko, zh, cmn, ar, hi, and 40+ more (ISO 639-1/639-3).

4. Operating Points (Accuracy vs Speed)

# Standard (faster processing)
./stt_speechmatics.py --operating-point standard audio.mp3

# Enhanced (higher accuracy)
./stt_speechmatics.py --operating-point enhanced audio.mp3

5. Region Selection

# EU region (default)
./stt_speechmatics.py -R eu audio.mp3

# US region
./stt_speechmatics.py -R us audio.mp3

# Australia region
./stt_speechmatics.py -R au audio.mp3

Regions:

  • eu, eu1 - EU endpoint (default)
  • us, us1 - US endpoint
  • au, au1 - AU endpoint

6. Custom Output Path

# Specify output file
./stt_speechmatics.py -o transcript.txt audio.mp3

# Creates: transcript.txt, audio.mp3.speechmatics.json

7. Output to Stdout Only

# Print transcript to stdout, no file creation
./stt_speechmatics.py -o - audio.mp3

8. Verbose Logging

# Basic info (-v)
./stt_speechmatics.py -v audio.mp3

# Detailed debug output (-vvvvv)
./stt_speechmatics.py -vvvvv audio.mp3

Log levels:

  • -v (1+): INFO - Progress updates
  • -vvvvv (5+): DEBUG - Full API request/response details

9. Remote Audio URLs

# Transcribe from URL (skips upload step)
./stt_speechmatics.py https://example.com/audio.mp3

Command-Line Options

Positional Arguments

  • audio_input - Path to audio file or URL to transcribe

Supported formats: MP3, MP4, WAV, FLAC, M4A, OGG, and more

Speaker Diarisation

  • -d, --diarisation - Enable speaker diarisation (S1, S2, S3, etc.)
  • --max-speakers N - Maximum number of speakers (minimum: 2, default: unlimited)
  • --speaker-sensitivity FLOAT - Detection sensitivity (0-1, default: 0.5)

Output Control

  • -o, --output PATH - Output file path (default: {audio_input}.txt)
    • Use - for stdout only (no files created)
  • -q, --quiet - Suppress status messages (output only transcript)

Language & Model

  • -l, --language CODE - Language code (default: en)
  • --operating-point {standard,enhanced} - Model accuracy (default: server decides)
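The diarisation, language, and operating-point options above map onto a transcription_config object in the Speechmatics batch job request. A sketch of how such a config might be assembled is below; the key names (diarization, speaker_diarization_config, and its sub-keys) are assumptions based on the public Speechmatics API and may not match the script's internals exactly:

```python
def build_transcription_config(language="en", diarisation=False,
                               max_speakers=None, speaker_sensitivity=None,
                               operating_point=None):
    """Assemble a transcription_config dict for a Speechmatics batch job."""
    config = {"language": language}
    if operating_point:
        # "standard" (faster) or "enhanced" (more accurate)
        config["operating_point"] = operating_point
    if diarisation:
        config["diarization"] = "speaker"
        sd = {}
        if max_speakers is not None:
            sd["max_speakers"] = max_speakers
        if speaker_sensitivity is not None:
            sd["speaker_sensitivity"] = speaker_sensitivity
        if sd:
            config["speaker_diarization_config"] = sd
    return config

print(build_transcription_config("de", diarisation=True,
                                 max_speakers=3, operating_point="enhanced"))
```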

Region

  • -R, --region {eu,eu1,us,us1,au,au1} - API endpoint region (default: eu)

Logging

  • -v, --verbose - Increase verbosity (use multiple times: -v, -vv, -vvvvv)

META Message Control

  • --no-meta-message, --disable-meta-message - Disable META warning message

Environment Variables:

  • STT_META_MESSAGE_DISABLE=1 - Disable system-wide
  • STT_META_MESSAGE="text" - Custom message
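The precedence between the flag and the two environment variables can be sketched like this (the assumption that the disable switch wins over a custom message is illustrative; verify against the script):

```python
import os

# Shortened stand-in for the full built-in disclaimer.
DEFAULT_META = "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT..."

def resolve_meta_message(cli_disabled=False):
    """Pick the META message: disable flag/env first, then custom env, then default."""
    if cli_disabled or os.environ.get("STT_META_MESSAGE_DISABLE") == "1":
        return None
    return os.environ.get("STT_META_MESSAGE") or DEFAULT_META
```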

Output Files

Default File Names

Input: audio.mp3

Output:

  • audio.mp3.speechmatics.json - Full Speechmatics API response (always created)
  • audio.mp3.txt - Human-readable transcript (default output)

JSON Format

Complete API response including:

  • job - Job metadata (id, created_at, duration)
  • metadata - Transcription config used
  • results - Word-level results with timing and confidence
  • _meta_note - META warning message (if enabled)

Example:

{
  "_meta_note": "THIS IS AN AUTOMATED SPEECH-TO-TEXT...",
  "job": {
    "id": "abc123",
    "created_at": "2025-01-06T10:00:00Z",
    "duration": 180
  },
  "results": [
    {
      "type": "word",
      "alternatives": [{"content": "Hello", "confidence": 0.98}],
      "start_time": 0.5,
      "end_time": 0.9,
      "speaker": "S1"
    }
  ]
}

TXT Format (Without Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT...
---
Hello everyone, welcome to the show. Today we're discussing...

TXT Format (With Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT...
---
Speaker S1:	Hello everyone, welcome to the show.
Speaker S2:	Thanks for having me.
Speaker S1:	Today we're discussing artificial intelligence.

Format: Speaker {label}:\t{text}\n (tab after colon)
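Based on the JSON example above and the Format line, the conversion from word-level results to the diarised TXT layout can be sketched as follows (a simplification: the real tool also has to handle punctuation entries, timing, and the UU unknown-speaker label):

```python
def format_diarised(results):
    """Group consecutive words by speaker into 'Speaker {label}:\\t{text}' lines."""
    lines, current_speaker, words = [], None, []
    for item in results:
        if item.get("type") != "word":
            continue  # skip punctuation and other entry types
        word = item["alternatives"][0]["content"]
        speaker = item.get("speaker", "S1")
        if speaker != current_speaker and words:
            lines.append(f"Speaker {current_speaker}:\t{' '.join(words)}")
            words = []
        current_speaker = speaker
        words.append(word)
    if words:
        lines.append(f"Speaker {current_speaker}:\t{' '.join(words)}")
    return "\n".join(lines) + "\n"
```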

Idempotent Behavior

The tool checks if output files already exist before transcribing:

$ ./stt_speechmatics.py audio.mp3
# Transcribes audio, creates files

$ ./stt_speechmatics.py audio.mp3
# Skips transcription, displays existing transcript
SKIPPING: transcription of audio.mp3 as audio.mp3.txt already exists

To force re-transcription: Delete existing .txt file
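The existence check amounts to something like the sketch below (a hypothetical helper, not the script's actual code):

```python
from pathlib import Path

def should_skip(output_path: str) -> bool:
    """Skip transcription when the output file already exists; delete it to force a re-run."""
    out = Path(output_path)
    if out.exists():
        print(f"SKIPPING: transcription as {out} already exists")
        return True
    return False
```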

Workflow Integration

With Speaker Mapping

Process with stt_speechmatics_speaker_mapper.py:

# Step 1: Transcribe with diarisation
./stt_speechmatics.py -d audio.mp3

# Step 2: Map speaker labels to names
./stt_speechmatics_speaker_mapper.py -m "Alice Anderson,Bob Martinez" audio.mp3.speechmatics.json
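The label-to-name substitution performed in step 2 can be sketched as below (illustrative only; the mapper's actual output format may differ, e.g. in whether it keeps the "Speaker" prefix):

```python
import re

def map_speakers(transcript: str, names: list) -> str:
    """Replace 'Speaker S1', 'Speaker S2', ... with the given names, in order."""
    mapping = {f"Speaker S{i + 1}": name for i, name in enumerate(names)}
    # Match longer labels first so 'Speaker S1' does not clobber 'Speaker S10'.
    pattern = re.compile("|".join(sorted(map(re.escape, mapping), key=len, reverse=True)))
    return pattern.sub(lambda m: mapping[m.group(0)], transcript)

text = "Speaker S1:\tHello everyone.\nSpeaker S2:\tThanks for having me.\n"
print(map_speakers(text, ["Alice Anderson", "Bob Martinez"]))
```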

Error Handling

Missing API Key

Error: SPEECHMATICS_API_KEY environment variable not set.
Get your API key at: https://portal.speechmatics.com/

Solution: export SPEECHMATICS_API_KEY="your_key"

File Not Found

ERROR: Error creating job: [Errno 2] No such file or directory: 'audio.mp3'

Solution: Check file path

Authentication Failed

ERROR: Error creating job: 401 Unauthorized
REST RESPONSE: {"error": "Invalid API key"}

Solution: Verify API key is correct

Job Rejected

ERROR: Error waiting for job: Job rejected: Unsupported audio format

Solution: Convert audio to supported format (MP3, WAV, etc.)

Advanced Usage

Processing Time

Jobs typically take less than half the audio duration. A 40-minute file should complete within 20 minutes.

Long Audio Files

For long audio (>1 hour), use verbose mode to monitor progress:

./stt_speechmatics.py -v audio.mp3

Output:

INFO: Processing audio input...
INFO: output filename: audio.mp3.txt
INFO: Submitting transcription job...
INFO: Job created: abc123...
INFO: Waiting for job to complete...
INFO: Job status: running
INFO: Job status: running
INFO: Job status: done
INFO: Retrieving transcript...
INFO: Writing output files...
INFO: Server response written to audio.mp3.speechmatics.json
INFO: Output written to audio.mp3.txt
INFO: Done.

Supported Languages (55+)

Major languages include:

  • European: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Swedish, Norwegian, Danish, Finnish, Czech, Hungarian, Romanian, Greek, Bulgarian, Croatian
  • Asian: Japanese, Korean, Mandarin, Cantonese, Hindi, Thai, Vietnamese, Indonesian, Malay, Tamil, Tagalog
  • Other: Arabic, Hebrew, Turkish, Ukrainian, Swahili, Welsh

See full list at: https://docs.speechmatics.com/introduction/supported-languages

Related Tools

  • stt_speechmatics_speaker_mapper.py - Map speaker labels (S1, S2) to actual names
  • stt_assemblyai.py - Alternative tool using AssemblyAI API
  • stt_openai_OR_local_whisper_cli.py - Alternative tool using OpenAI Whisper

Comparison: Speechmatics vs AssemblyAI

Feature          Speechmatics    AssemblyAI
Speaker labels   S1, S2, S3...   A, B, C...
Unknown speaker  UU              -
Languages        55+             99+
Word error rate  6.8%            ~5%
Latency          150ms p95       Similar
Regions          EU, US, AU      EU, US

Troubleshooting

Problem: Output file truncated

Cause: Filename too long (>255 characters)

Solution: Tool automatically truncates long filenames while preserving extensions
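A truncation that preserves the extension can be sketched like this (an illustrative helper; the tool's own logic, e.g. how it treats compound suffixes like .speechmatics.json, may differ):

```python
import os

MAX_NAME = 255  # typical filesystem limit for a single path component

def truncate_filename(name: str, limit: int = MAX_NAME) -> str:
    """Shorten an over-long filename while keeping its extension intact."""
    if len(name) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    return stem[: limit - len(ext)] + ext
```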

Problem: Speaker labels not appearing

Cause: Forgot -d flag

Solution: Re-run with -d flag

Problem: Wrong language detected

Cause: Language not specified

Solution: Specify language explicitly with -l flag

Problem: Low accuracy

Possible causes:

  • Poor audio quality
  • Background noise
  • Non-standard accents

Solutions:

  • Use --operating-point enhanced for better accuracy
  • Specify language with -l flag
  • Use higher quality audio source

License

Part of the CLIAI handy_scripts collection.