
AssemblyAI Speech-to-Text (STT) Transcription Tool

Transcribe audio files using the AssemblyAI API with support for speaker diarisation, multiple languages, and automatic error detection.

Features

  • Speaker Diarisation: Automatically detect and label multiple speakers (A, B, C, etc.)
  • Language Support: 99+ languages including auto-detection
  • Multiple Output Formats: JSON (full API response) + TXT (human-readable transcript)
  • Idempotent: Skip re-transcription if output already exists
  • Progress Monitoring: Real-time status updates with verbose logging
  • EU/US Region Selection: Choose API endpoint region for data residency compliance
  • META Warning Messages: Automatic disclaimer about potential transcription errors

Prerequisites

# Install dependencies (handled automatically by uv)
# - requests>=2.31

# Set your AssemblyAI API key
export ASSEMBLYAI_API_KEY="your_api_key_here"

Get your API key at: https://www.assemblyai.com/

Quick Start

# Basic transcription
./stt_assemblyai.py audio.mp3

# With speaker diarisation (2 speakers)
./stt_assemblyai.py -d -e 2 audio.mp3

# Specify language
./stt_assemblyai.py -l en_us audio.mp3

# Verbose output with EU region
./stt_assemblyai.py -v -R eu -d audio.mp3

META Transcript Warning Message

By default, all transcript outputs include a META warning message to remind readers that STT transcripts may contain errors.

What It Does

Automatically prepends a disclaimer to transcript files:

  • TXT files: YAML front matter format at top of file
  • JSON files: First key: {"_meta_note": "{message}", ...}
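The prepending behaviour can be sketched in Python (helper names here are illustrative, not the script's actual internals; DEFAULT_META stands in for the full message shown below):

```python
import json

# Shortened stand-in for the full default META warning text.
DEFAULT_META = "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT..."

def prepend_meta_txt(transcript: str, message: str = DEFAULT_META) -> str:
    """Wrap the META message in YAML front matter above the transcript."""
    return f"---\nmeta: {message}\n---\n{transcript}"

def prepend_meta_json(response: dict, message: str = DEFAULT_META) -> str:
    """Serialise the API response with _meta_note as the first key."""
    return json.dumps({"_meta_note": message, **response}, indent=2)
```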

Default Message

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN TRANSCRIPTION ERRORS. This transcript was generated by automated speech recognition technology and should be treated as a rough transcription for reference purposes. Common types of errors include: incorrect word recognition (especially homophones, proper nouns, technical terminology, or words in noisy audio conditions), missing or incorrect punctuation, speaker misidentification in multi-speaker scenarios, and timing inaccuracies. For best comprehension and to mentally correct potential errors, please consider: the broader conversational context, relevant domain knowledge, technical background of the subject matter, and any supplementary information about the speakers or topic. This transcript is intended to convey the general content and flow of the conversation rather than serving as a verbatim, word-perfect record. When critical accuracy is required, please verify important details against the original audio source.
---

Disabling the META Message

Via command-line flag:

./stt_assemblyai.py --no-meta-message audio.mp3
# or
./stt_assemblyai.py --disable-meta-message audio.mp3

Via environment variable (system-wide):

export STT_META_MESSAGE_DISABLE=1
./stt_assemblyai.py audio.mp3

Custom META Message

export STT_META_MESSAGE="DRAFT - UNVERIFIED TRANSCRIPT"
./stt_assemblyai.py audio.mp3

Example Output

Without diarisation (audio.mp3.txt):

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...

With diarisation (audio.mp3.txt):

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Let's dive in.

Usage Examples

1. Basic Transcription (Single Speaker)

./stt_assemblyai.py audio.mp3

Creates:

  • audio.mp3.assemblyai.json - Full AssemblyAI API response
  • audio.mp3.txt - Plain text transcript

2. Speaker Diarisation (Multiple Speakers)

# Auto-detect number of speakers
./stt_assemblyai.py -d audio.mp3

# Specify expected number of speakers (more accurate)
./stt_assemblyai.py -d -e 2 audio.mp3

Creates:

  • audio.mp3.assemblyai.json - Full response with utterance-level speaker labels
  • audio.mp3.txt - Formatted transcript: Speaker A: text

3. Language Selection

# Auto-detect language (default)
./stt_assemblyai.py audio.mp3

# Specify English (US)
./stt_assemblyai.py -l en_us audio.mp3

# Spanish
./stt_assemblyai.py -l es audio.mp3

# French
./stt_assemblyai.py -l fr audio.mp3

Supported language codes: en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, and 80+ more

4. Custom Output Path

# Specify output file
./stt_assemblyai.py -o transcript.txt audio.mp3

# Creates: transcript.txt, audio.mp3.assemblyai.json

5. Output to Stdout Only

# Print transcript to stdout, no file creation
./stt_assemblyai.py -o - audio.mp3

6. Verbose Logging

# Basic info (-v)
./stt_assemblyai.py -v audio.mp3

# Detailed debug output (-vvvvv)
./stt_assemblyai.py -vvvvv audio.mp3

Log levels:

  • -v (1+): INFO - Progress updates
  • -vvvvv (5+): DEBUG - Full API request/response details

7. Quiet Mode

# Suppress all output except the transcript
./stt_assemblyai.py -q audio.mp3

8. Remote Audio URLs

# Transcribe from URL (skips upload step)
./stt_assemblyai.py https://example.com/audio.mp3

9. EU Region (GDPR Compliance)

# Use EU endpoint (default)
./stt_assemblyai.py -R eu audio.mp3

# Use US endpoint
./stt_assemblyai.py -R us audio.mp3

EU endpoint: https://api.eu.assemblyai.com
US endpoint: https://api.assemblyai.com

Command-Line Options

Positional Arguments

  • audio_input - Path to audio file or URL to transcribe

Supported formats: MP3, MP4, WAV, FLAC, M4A, OGG, and more

Speaker Diarisation

  • -d, --diarisation - Enable speaker diarisation (labels each speaker)
  • -e, --expected-speakers N - Expected number of speakers (improves accuracy)

Note: -e automatically enables -d if not already specified

Output Control

  • -o, --output PATH - Output file path (default: {audio_input}.txt)
    • Use - for stdout only (no files created)
  • -q, --quiet - Suppress status messages (output only transcript)

Language

  • -l, --language CODE - Language code (default: auto for auto-detection)

Common codes:

  • en - English (auto-detect variant)
  • en_us - English (US)
  • en_uk - English (UK)
  • en_au - English (Australia)
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • pt - Portuguese
  • zh - Chinese
  • ja - Japanese
  • ko - Korean
  • hi - Hindi
  • ru - Russian

Region & API

  • -R, --region {eu|us} - API endpoint region (default: eu)

Logging

  • -v, --verbose - Increase verbosity (use multiple times: -v, -vv, -vvvvv)

META Message Control

  • --no-meta-message, --disable-meta-message - Disable META warning message

Environment Variables:

  • STT_META_MESSAGE_DISABLE=1 - Disable system-wide
  • STT_META_MESSAGE="text" - Custom message

Output Files

Default File Names

Input: audio.mp3

Output:

  • audio.mp3.assemblyai.json - Full AssemblyAI API response (always created)
  • audio.mp3.txt - Human-readable transcript (default output)

JSON Format

Complete API response including:

  • text - Full transcript text
  • utterances - Array of speaker segments (with diarisation)
  • words - Word-level timing and confidence
  • confidence - Overall transcript confidence
  • audio_duration - Audio length in milliseconds
  • _meta_note - META warning message (if enabled)

Example:

{
  "_meta_note": "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...",
  "text": "Hello everyone, welcome to the show.",
  "confidence": 0.95,
  "audio_duration": 180000,
  "utterances": [
    {
      "speaker": "A",
      "text": "Hello everyone, welcome to the show.",
      "confidence": 0.96,
      "start": 100,
      "end": 3200
    }
  ]
}
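Downstream scripts can render the utterances array back into the diarised line layout used by the .txt output. A minimal sketch, using only the fields shown in the example above:

```python
def format_utterances(response: dict) -> str:
    """Render utterances as 'Speaker {label}:{text}' lines (no space after colon)."""
    return "\n".join(
        f"Speaker {u['speaker']}:{u['text']}"
        for u in response.get("utterances", [])
    )
```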

TXT Format (Without Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...

TXT Format (With Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Today we're discussing artificial intelligence.

Format: Speaker {label}:{text}\n (no space after colon)
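Because the layout is strict, the .txt output can be parsed back with a simple regex. A sketch, assuming speaker labels are single tokens (A, B, C, ...):

```python
import re

SPEAKER_LINE = re.compile(r"^Speaker (\S+):(.*)$")

def parse_diarised_txt(text: str) -> list[tuple[str, str]]:
    """Return (speaker, text) pairs, skipping the YAML front matter."""
    pairs, in_meta = [], False
    for line in text.splitlines():
        if line.strip() == "---":
            in_meta = not in_meta
            continue
        if in_meta:
            continue
        m = SPEAKER_LINE.match(line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs
```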

Idempotent Behavior

The tool checks if output files already exist before transcribing:

$ ./stt_assemblyai.py audio.mp3
# Transcribes audio, creates files

$ ./stt_assemblyai.py audio.mp3
# Skips transcription, displays existing transcript
SKIPPING: transcription of audio.mp3 as audio.mp3.txt already exists

To force re-transcription: Delete existing .txt file

Workflow Integration

With Video Files

Use stt_video_using_assemblyai.sh wrapper:

./stt_video_using_assemblyai.sh video.mp4 2
# Extracts audio, transcribes with 2 speakers

With Speaker Mapping

Process with stt_assemblyai_speaker_mapper.py:

# Step 1: Transcribe with diarisation
./stt_assemblyai.py -d audio.mp3

# Step 2: Map speaker labels to names
./stt_assemblyai_speaker_mapper.py -m "Alice Anderson,Bob Martinez" audio.mp3.assemblyai.json

Error Handling

Missing API Key

Error: ASSEMBLYAI_API_KEY environment variable not set.

Solution: export ASSEMBLYAI_API_KEY="your_key"

File Not Found

ERROR: Error in upload_file: [Errno 2] No such file or directory: 'audio.mp3'

Solution: Check file path

Upload Failed

ERROR: Error in upload_file: 401 Unauthorized
REST RESPONSE: {"error":"Invalid API key"}

Solution: Verify API key is correct

Transcription Failed

ERROR: Error in create_transcript: Transcription failed: Unsupported audio format

Solution: Convert audio to supported format (MP3, WAV, etc.)

Network Timeout

ERROR: Error in create_transcript: Connection timeout

Solution: Check internet connection, try again

Advanced Usage

Automatic Diarisation Enablement

If you specify -e/--expected-speakers without -d, diarisation is automatically enabled:

# These are equivalent:
./stt_assemblyai.py -e 2 audio.mp3
./stt_assemblyai.py -d -e 2 audio.mp3

Output:

WARNING: -e/--expected-speakers specified without -d/--diarisation; enabling diarisation to satisfy AssemblyAI requirements.
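The implication can be applied as a post-parse fix-up with argparse. A simplified sketch of the behaviour (the positional audio argument is omitted):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--diarisation", action="store_true")
parser.add_argument("-e", "--expected-speakers", type=int)

# Simulate running: stt_assemblyai.py -e 2 audio.mp3
args = parser.parse_args(["-e", "2"])

# AssemblyAI needs diarisation enabled when a speaker count is supplied,
# so -e implies -d.
if args.expected_speakers is not None and not args.diarisation:
    print("WARNING: -e/--expected-speakers specified without -d/--diarisation; "
          "enabling diarisation to satisfy AssemblyAI requirements.")
    args.diarisation = True
```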

Long Audio Files

For long audio (>2 hours), use verbose mode to monitor progress:

./stt_assemblyai.py -v audio.mp3

Output:

INFO: Processing audio input...
INFO: output filename: audio.mp3.txt
INFO: Uploading audio file...
INFO: File uploaded. URL: https://...
INFO: Creating transcript...
INFO: Transcript ID: abc123...
INFO: Current status: queued
INFO: Current status: processing
INFO: Current status: processing
INFO: Current status: completed
INFO: Transcript created. Writing output...
INFO: Server response written to audio.mp3.assemblyai.json
INFO: Output written to audio.mp3.txt
INFO: Done.

Polling Interval

The tool polls AssemblyAI every 5 seconds for status updates. For very short audio this polling interval dominates, so the transcript may appear a few seconds after the server has actually finished processing.
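The polling loop can be sketched against AssemblyAI's public v2 API (GET /v2/transcript/{id}); error handling is trimmed and the 5-second interval matches the behaviour described above:

```python
import time
import requests

def wait_for_transcript(transcript_id: str, api_key: str,
                        base_url: str = "https://api.eu.assemblyai.com",
                        interval: float = 5.0) -> dict:
    """Poll the transcript endpoint until it completes or fails."""
    headers = {"authorization": api_key}
    while True:
        resp = requests.get(f"{base_url}/v2/transcript/{transcript_id}",
                            headers=headers)
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "completed":
            return data
        if data["status"] == "error":
            raise RuntimeError(f"Transcription failed: {data.get('error')}")
        time.sleep(interval)
```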

Integration with stt-in-batch

When using the stt-in-batch pipeline, language is automatically detected from filenames:

Filename Language Detection

Include case-sensitive language codes with separators (space, dot, dash, underscore):

Pattern   Language     Example             Resulting flag
EN        English      meeting_EN.mp3      -l en
PL        Polish       call-PL.wav         -l pl
DE        German       interview.DE.mp3    -l de
AUTO      Auto-detect  podcast_AUTO.mp3    -l auto

Examples:

meeting_EN.mp3      # Uses English
podcast-PL.mp3      # Uses Polish
call.DE.wav         # Uses German
recording_AUTO.mp3  # Auto-detects language
untitled.mp3        # Defaults to English

Pattern requirements: The code must be bounded by separators (space, dot, dash, or underscore) on each side, or sit at the start or end of the base name: _EN_, -EN-, .EN., " EN ", etc.
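A regex satisfying these requirements could look like the following sketch (the actual stt-in-batch implementation may differ, and only the four codes from the table above are shown):

```python
import re

# Case-sensitive codes, bounded by a separator or the name's start/end.
LANG_CODE = re.compile(r"(?:^|[ ._-])(EN|PL|DE|AUTO)(?=[ ._-]|$)")

def detect_language(filename: str) -> str:
    """Map a filename marker to a -l value; English is the fallback."""
    stem = filename.rsplit(".", 1)[0]  # drop the audio extension
    m = LANG_CODE.search(stem)
    return m.group(1).lower() if m else "en"
```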

See stt-in-batch.README.md for full documentation.

Related Tools

  • stt_assemblyai_speaker_mapper.py - Map speaker labels (A, B, C) to actual names
  • stt-in-batch - Batch processing pipeline with language detection
  • stt_video_using_assemblyai.sh - Wrapper script for video transcription
  • stt_openai_OR_local_whisper_cli.py - Alternative tool using OpenAI Whisper

Troubleshooting

Problem: Output file truncated

Cause: Filename too long (>255 characters)

Solution: Tool automatically truncates long filenames while preserving extensions
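A simplified sketch of such truncation (this version only guarantees the output suffix survives; the tool itself also preserves the audio extension):

```python
MAX_COMPONENT = 255  # common single-filename limit on Linux filesystems

def safe_output_name(audio_name: str, suffix: str = ".txt") -> str:
    """Trim the base name so that name + suffix fits the filesystem limit."""
    full = audio_name + suffix
    if len(full) <= MAX_COMPONENT:
        return full
    return audio_name[:MAX_COMPONENT - len(suffix)] + suffix
```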

Problem: Speaker labels not appearing

Cause: Forgot -d flag

Solution: Re-run with -d or -e N flags

Problem: Wrong language detected

Cause: Auto-detection uncertain

Solution: Specify language explicitly with -l flag

Problem: Low confidence scores

Possible causes:

  • Poor audio quality
  • Background noise
  • Multiple overlapping speakers
  • Strong accents or dialects

Solutions:

  • Use noise reduction on audio
  • Specify expected speakers with -e
  • Specify language with -l
  • Use higher quality audio source

License

Part of the CLIAI handy_scripts collection.