
AssemblyAI Speech-to-Text (STT) Transcription Tool

Transcribe audio files using the AssemblyAI API with support for speaker diarisation, multiple languages, and automatic error detection.

Features

  • Speaker Diarisation: Automatically detect and label multiple speakers (A, B, C, etc.)
  • Language Support: 99+ languages including auto-detection
  • Multiple Output Formats: JSON (full API response) + TXT (human-readable transcript)
  • Idempotent: Skip re-transcription if output already exists
  • Progress Monitoring: Real-time status updates with verbose logging
  • EU/US Region Selection: Choose API endpoint region for data residency compliance
  • META Warning Messages: Automatic disclaimer about potential transcription errors

Prerequisites

# Install dependencies (handled automatically by uv)
# - requests>=2.31

# Set your AssemblyAI API key
export ASSEMBLYAI_API_KEY="your_api_key_here"

Get your API key at: https://www.assemblyai.com/

Quick Start

# Basic transcription
./stt_assemblyai.py audio.mp3

# With speaker diarisation (2 speakers)
./stt_assemblyai.py -d -e 2 audio.mp3

# Specify language
./stt_assemblyai.py -l en_us audio.mp3

# Verbose output with EU region
./stt_assemblyai.py -v -R eu -d audio.mp3

META Transcript Warning Message

By default, all transcript outputs include a META warning message to remind readers that STT transcripts may contain errors.

What It Does

Automatically prepends a disclaimer to transcript files:

  • TXT files: YAML front matter format at top of file
  • JSON files: First key: {"_meta_note": "{message}", ...}
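The prepending behaviour can be sketched in Python (helper names here are illustrative, not the script's actual internals; DEFAULT_META stands in for the full message shown below):

```python
import json

# Shortened stand-in for the full default META warning text.
DEFAULT_META = "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT..."

def prepend_meta_txt(transcript: str, message: str = DEFAULT_META) -> str:
    """Wrap the META message in YAML front matter above the transcript."""
    return f"---\nmeta: {message}\n---\n{transcript}"

def prepend_meta_json(response: dict, message: str = DEFAULT_META) -> str:
    """Serialise the API response with _meta_note as the first key."""
    return json.dumps({"_meta_note": message, **response}, indent=2)
```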

Default Message

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN TRANSCRIPTION ERRORS. This transcript was generated by automated speech recognition technology and should be treated as a rough transcription for reference purposes. Common types of errors include: incorrect word recognition (especially homophones, proper nouns, technical terminology, or words in noisy audio conditions), missing or incorrect punctuation, speaker misidentification in multi-speaker scenarios, and timing inaccuracies. For best comprehension and to mentally correct potential errors, please consider: the broader conversational context, relevant domain knowledge, technical background of the subject matter, and any supplementary information about the speakers or topic. This transcript is intended to convey the general content and flow of the conversation rather than serving as a verbatim, word-perfect record. When critical accuracy is required, please verify important details against the original audio source.
---

Disabling the META Message

Via command-line flag:

./stt_assemblyai.py --no-meta-message audio.mp3
# or
./stt_assemblyai.py --disable-meta-message audio.mp3

Via environment variable (system-wide):

export STT_META_MESSAGE_DISABLE=1
./stt_assemblyai.py audio.mp3

Custom META Message

export STT_META_MESSAGE="DRAFT - UNVERIFIED TRANSCRIPT"
./stt_assemblyai.py audio.mp3

Example Output

Without diarisation (audio.mp3.txt):

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...

With diarisation (audio.mp3.txt):

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Let's dive in.

Usage Examples

1. Basic Transcription (Single Speaker)

./stt_assemblyai.py audio.mp3

Creates:

  • audio.mp3.assemblyai.json - Full AssemblyAI API response
  • audio.mp3.txt - Plain text transcript

2. Speaker Diarisation (Multiple Speakers)

# Auto-detect number of speakers
./stt_assemblyai.py -d audio.mp3

# Specify expected number of speakers (more accurate)
./stt_assemblyai.py -d -e 2 audio.mp3

Creates:

  • audio.mp3.assemblyai.json - Full response with utterance-level speaker labels
  • audio.mp3.txt - Formatted transcript: Speaker A: text

3. Language Selection

# Auto-detect language (default)
./stt_assemblyai.py audio.mp3

# Specify English (US)
./stt_assemblyai.py -l en_us audio.mp3

# Spanish
./stt_assemblyai.py -l es audio.mp3

# French
./stt_assemblyai.py -l fr audio.mp3

Supported language codes: en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, and 80+ more

4. Custom Output Path

# Specify output file
./stt_assemblyai.py -o transcript.txt audio.mp3

# Creates: transcript.txt, audio.mp3.assemblyai.json

5. Output to Stdout Only

# Print transcript to stdout, no file creation
./stt_assemblyai.py -o - audio.mp3

6. Verbose Logging

# Basic info (-v)
./stt_assemblyai.py -v audio.mp3

# Detailed debug output (-vvvvv)
./stt_assemblyai.py -vvvvv audio.mp3

Log levels:

  • -v (1+): INFO - Progress updates
  • -vvvvv (5+): DEBUG - Full API request/response details

7. Quiet Mode

# Suppress all output except the transcript
./stt_assemblyai.py -q audio.mp3

8. Remote Audio URLs

# Transcribe from URL (skips upload step)
./stt_assemblyai.py https://example.com/audio.mp3

9. EU Region (GDPR Compliance)

# Use EU endpoint (default)
./stt_assemblyai.py -R eu audio.mp3

# Use US endpoint
./stt_assemblyai.py -R us audio.mp3

EU endpoint: https://api.eu.assemblyai.com
US endpoint: https://api.assemblyai.com

Command-Line Options

Positional Arguments

  • audio_input - Path to audio file or URL to transcribe

Supported formats: MP3, MP4, WAV, FLAC, M4A, OGG, and more

Speaker Diarisation

  • -d, --diarisation - Enable speaker diarisation (labels each speaker)
  • -e, --expected-speakers N - Expected number of speakers (improves accuracy)

Note: -e automatically enables -d if not already specified

Output Control

  • -o, --output PATH - Output file path (default: {audio_input}.txt)
    • Use - for stdout only (no files created)
  • -q, --quiet - Suppress status messages (output only transcript)

Language

  • -l, --language CODE - Language code (default: auto for auto-detection)

Common codes:

  • en - English (auto-detect variant)
  • en_us - English (US)
  • en_uk - English (UK)
  • en_au - English (Australia)
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • pt - Portuguese
  • zh - Chinese
  • ja - Japanese
  • ko - Korean
  • hi - Hindi
  • ru - Russian

Region & API

  • -R, --region {eu|us} - API endpoint region (default: eu)

Logging

  • -v, --verbose - Increase verbosity (use multiple times: -v, -vv, -vvvvv)

META Message Control

  • --no-meta-message, --disable-meta-message - Disable META warning message

Environment Variables:

  • STT_META_MESSAGE_DISABLE=1 - Disable system-wide
  • STT_META_MESSAGE="text" - Custom message

Output Files

Default File Names

Input: audio.mp3

Output:

  • audio.mp3.assemblyai.json - Full AssemblyAI API response (always created)
  • audio.mp3.txt - Human-readable transcript (default output)

JSON Format

Complete API response including:

  • text - Full transcript text
  • utterances - Array of speaker segments (with diarisation)
  • words - Word-level timing and confidence
  • confidence - Overall transcript confidence
  • audio_duration - Audio length in milliseconds
  • _meta_note - META warning message (if enabled)

Example:

{
  "_meta_note": "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...",
  "text": "Hello everyone, welcome to the show.",
  "confidence": 0.95,
  "audio_duration": 180000,
  "utterances": [
    {
      "speaker": "A",
      "text": "Hello everyone, welcome to the show.",
      "confidence": 0.96,
      "start": 100,
      "end": 3200
    }
  ]
}
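Downstream scripts can render the utterances array back into the diarised line layout used by the .txt output. A minimal sketch, using only the fields shown in the example above:

```python
def format_utterances(response: dict) -> str:
    """Render utterances as 'Speaker {label}:{text}' lines (no space after colon)."""
    return "\n".join(
        f"Speaker {u['speaker']}:{u['text']}"
        for u in response.get("utterances", [])
    )
```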

TXT Format (Without Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...

TXT Format (With Diarisation)

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Today we're discussing artificial intelligence.

Format: Speaker {label}:{text}\n (no space after colon)
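Because the layout is strict, the .txt output can be parsed back with a simple regex. A sketch, assuming speaker labels are single tokens (A, B, C, ...):

```python
import re

SPEAKER_LINE = re.compile(r"^Speaker (\S+):(.*)$")

def parse_diarised_txt(text: str) -> list[tuple[str, str]]:
    """Return (speaker, text) pairs, skipping the YAML front matter."""
    pairs, in_meta = [], False
    for line in text.splitlines():
        if line.strip() == "---":
            in_meta = not in_meta
            continue
        if in_meta:
            continue
        m = SPEAKER_LINE.match(line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs
```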

Idempotent Behavior

The tool checks if output files already exist before transcribing:

$ ./stt_assemblyai.py audio.mp3
# Transcribes audio, creates files

$ ./stt_assemblyai.py audio.mp3
# Skips transcription, displays existing transcript
SKIPPING: transcription of audio.mp3 as audio.mp3.txt already exists

To force re-transcription: Delete existing .txt file

Workflow Integration

With Video Files

Use stt_video_using_assemblyai.sh wrapper:

./stt_video_using_assemblyai.sh video.mp4 2
# Extracts audio, transcribes with 2 speakers

With Speaker Mapping

Process with stt_assemblyai_speaker_mapper.py:

# Step 1: Transcribe with diarisation
./stt_assemblyai.py -d audio.mp3

# Step 2: Map speaker labels to names
./stt_assemblyai_speaker_mapper.py -m "Alice Anderson,Bob Martinez" audio.mp3.assemblyai.json

Error Handling

Missing API Key

Error: ASSEMBLYAI_API_KEY environment variable not set.

Solution: export ASSEMBLYAI_API_KEY="your_key"

File Not Found

ERROR: Error in upload_file: [Errno 2] No such file or directory: 'audio.mp3'

Solution: Check file path

Upload Failed

ERROR: Error in upload_file: 401 Unauthorized
REST RESPONSE: {"error":"Invalid API key"}

Solution: Verify API key is correct

Transcription Failed

ERROR: Error in create_transcript: Transcription failed: Unsupported audio format

Solution: Convert audio to supported format (MP3, WAV, etc.)

Network Timeout

ERROR: Error in create_transcript: Connection timeout

Solution: Check internet connection, try again

Advanced Usage

Automatic Diarisation Enablement

If you specify -e/--expected-speakers without -d, diarisation is automatically enabled:

# These are equivalent:
./stt_assemblyai.py -e 2 audio.mp3
./stt_assemblyai.py -d -e 2 audio.mp3

Output:

WARNING: -e/--expected-speakers specified without -d/--diarisation; enabling diarisation to satisfy AssemblyAI requirements.
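The implication can be applied as a post-parse fix-up with argparse. A simplified sketch of the behaviour (the positional audio argument is omitted):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--diarisation", action="store_true")
parser.add_argument("-e", "--expected-speakers", type=int)

# Simulate running: stt_assemblyai.py -e 2 audio.mp3
args = parser.parse_args(["-e", "2"])

# AssemblyAI needs diarisation enabled when a speaker count is supplied,
# so -e implies -d.
if args.expected_speakers is not None and not args.diarisation:
    print("WARNING: -e/--expected-speakers specified without -d/--diarisation; "
          "enabling diarisation to satisfy AssemblyAI requirements.")
    args.diarisation = True
```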

Long Audio Files

For long audio (>2 hours), use verbose mode to monitor progress:

./stt_assemblyai.py -v audio.mp3

Output:

INFO: Processing audio input...
INFO: output filename: audio.mp3.txt
INFO: Uploading audio file...
INFO: File uploaded. URL: https://...
INFO: Creating transcript...
INFO: Transcript ID: abc123...
INFO: Current status: queued
INFO: Current status: processing
INFO: Current status: processing
INFO: Current status: completed
INFO: Transcript created. Writing output...
INFO: Server response written to audio.mp3.assemblyai.json
INFO: Output written to audio.mp3.txt
INFO: Done.

Polling Interval

The tool polls AssemblyAI every 5 seconds for status updates. For very short audio this polling interval dominates, so the transcript may appear a few seconds after the server has actually finished processing.
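The polling loop can be sketched against AssemblyAI's public v2 API (GET /v2/transcript/{id}); error handling is trimmed and the 5-second interval matches the behaviour described above:

```python
import time
import requests

def wait_for_transcript(transcript_id: str, api_key: str,
                        base_url: str = "https://api.eu.assemblyai.com",
                        interval: float = 5.0) -> dict:
    """Poll the transcript endpoint until it completes or fails."""
    headers = {"authorization": api_key}
    while True:
        resp = requests.get(f"{base_url}/v2/transcript/{transcript_id}",
                            headers=headers)
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "completed":
            return data
        if data["status"] == "error":
            raise RuntimeError(f"Transcription failed: {data.get('error')}")
        time.sleep(interval)
```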

Integration with stt-in-batch

When using the stt-in-batch pipeline, language is automatically detected from filenames:

Filename Language Detection

Include case-sensitive language codes with separators (space, dot, dash, underscore):

Pattern   Language     Example             Resulting flag
EN        English      meeting_EN.mp3      -l en
PL        Polish       call-PL.wav         -l pl
DE        German       interview.DE.mp3    -l de
AUTO      Auto-detect  podcast_AUTO.mp3    -l auto

Examples:

meeting_EN.mp3      # Uses English
podcast-PL.mp3      # Uses Polish
call.DE.wav         # Uses German
recording_AUTO.mp3  # Auto-detects language
untitled.mp3        # Defaults to English

Pattern requirements: The code must be bounded by separators (space, dot, dash, or underscore) on each side, or sit at the start or end of the base name: _EN_, -EN-, .EN., " EN ", etc.
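A regex satisfying these requirements could look like the following sketch (the actual stt-in-batch implementation may differ, and only the four codes from the table above are shown):

```python
import re

# Case-sensitive codes, bounded by a separator or the name's start/end.
LANG_CODE = re.compile(r"(?:^|[ ._-])(EN|PL|DE|AUTO)(?=[ ._-]|$)")

def detect_language(filename: str) -> str:
    """Map a filename marker to a -l value; English is the fallback."""
    stem = filename.rsplit(".", 1)[0]  # drop the audio extension
    m = LANG_CODE.search(stem)
    return m.group(1).lower() if m else "en"
```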

See stt-in-batch.README.md for full documentation.

Related Tools

  • stt_assemblyai_speaker_mapper.py - Map speaker labels (A, B, C) to actual names
  • stt-in-batch - Batch processing pipeline with language detection
  • stt_video_using_assemblyai.sh - Wrapper script for video transcription
  • stt_openai_OR_local_whisper_cli.py - Alternative tool using OpenAI Whisper

Troubleshooting

Problem: Output file truncated

Cause: Filename too long (>255 characters)

Solution: Tool automatically truncates long filenames while preserving extensions
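A simplified sketch of such truncation (this version only guarantees the output suffix survives; the tool itself also preserves the audio extension):

```python
MAX_COMPONENT = 255  # common single-filename limit on Linux filesystems

def safe_output_name(audio_name: str, suffix: str = ".txt") -> str:
    """Trim the base name so that name + suffix fits the filesystem limit."""
    full = audio_name + suffix
    if len(full) <= MAX_COMPONENT:
        return full
    return audio_name[:MAX_COMPONENT - len(suffix)] + suffix
```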

Problem: Speaker labels not appearing

Cause: Forgot -d flag

Solution: Re-run with -d or -e N flags

Problem: Wrong language detected

Cause: Auto-detection uncertain

Solution: Specify language explicitly with -l flag

Problem: Low confidence scores

Possible causes:

  • Poor audio quality
  • Background noise
  • Multiple overlapping speakers
  • Strong accents or dialects

Solutions:

  • Use noise reduction on audio
  • Specify expected speakers with -e
  • Specify language with -l
  • Use higher quality audio source

License

Part of the CLIAI handy_scripts collection.