Transcribe audio files using the AssemblyAI API with support for speaker diarisation, multiple languages, and automatic error detection.
- Speaker Diarisation: Automatically detect and label multiple speakers (A, B, C, etc.)
- Language Support: 99+ languages including auto-detection
- Multiple Output Formats: JSON (full API response) + TXT (human-readable transcript)
- Idempotent: Skip re-transcription if output already exists
- Progress Monitoring: Real-time status updates with verbose logging
- EU/US Region Selection: Choose API endpoint region for data residency compliance
- META Warning Messages: Automatic disclaimer about potential transcription errors
# Install dependencies (handled automatically by uv)
# - requests>=2.31
# Set your AssemblyAI API key
export ASSEMBLYAI_API_KEY="your_api_key_here"

Get your API key at: https://www.assemblyai.com/
# Basic transcription
./stt_assemblyai.py audio.mp3
# With speaker diarisation (2 speakers)
./stt_assemblyai.py -d -e 2 audio.mp3
# Specify language
./stt_assemblyai.py -l en_us audio.mp3
# Verbose output with EU region
./stt_assemblyai.py -v -R eu -d audio.mp3

By default, all transcript outputs include a META warning message to remind readers that STT transcripts may contain errors.
Automatically prepends a disclaimer to transcript files:
- TXT files: YAML front matter format at top of file
- JSON files: First key:
{"_meta_note": "{message}", ...}
---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN TRANSCRIPTION ERRORS. This transcript was generated by automated speech recognition technology and should be treated as a rough transcription for reference purposes. Common types of errors include: incorrect word recognition (especially homophones, proper nouns, technical terminology, or words in noisy audio conditions), missing or incorrect punctuation, speaker misidentification in multi-speaker scenarios, and timing inaccuracies. For best comprehension and to mentally correct potential errors, please consider: the broader conversational context, relevant domain knowledge, technical background of the subject matter, and any supplementary information about the speakers or topic. This transcript is intended to convey the general content and flow of the conversation rather than serving as a verbatim, word-perfect record. When critical accuracy is required, please verify important details against the original audio source.
---
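The prepending logic can be sketched roughly as follows (a minimal illustration, not the tool's actual code; `prepend_meta_txt` and `prepend_meta_json` are hypothetical helper names):

```python
import json

def prepend_meta_txt(transcript: str, note: str) -> str:
    # Wrap the warning in YAML front matter at the top of the TXT output
    return f"---\nmeta: {note}\n---\n{transcript}"

def prepend_meta_json(response: dict, note: str) -> str:
    # Python dicts preserve insertion order, so building a new dict with
    # _meta_note first makes it the first key in the serialised JSON
    return json.dumps({"_meta_note": note, **response}, indent=2)
```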
Via command-line flag:
./stt_assemblyai.py --no-meta-message audio.mp3
# or
./stt_assemblyai.py --disable-meta-message audio.mp3

Via environment variable (system-wide):
export STT_META_MESSAGE_DISABLE=1
./stt_assemblyai.py audio.mp3

Via a custom message:

export STT_META_MESSAGE="DRAFT - UNVERIFIED TRANSCRIPT"
./stt_assemblyai.py audio.mp3

Without diarisation (audio.mp3.txt):
---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...
With diarisation (audio.mp3.txt):
---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Let's dive in.
./stt_assemblyai.py audio.mp3

Creates:
- audio.mp3.assemblyai.json - Full AssemblyAI API response
- audio.mp3.txt - Plain text transcript
# Auto-detect number of speakers
./stt_assemblyai.py -d audio.mp3
# Specify expected number of speakers (more accurate)
./stt_assemblyai.py -d -e 2 audio.mp3

Creates:
- audio.mp3.assemblyai.json - Full response with utterance-level speaker labels
- audio.mp3.txt - Formatted transcript: Speaker A:text
# Auto-detect language (default)
./stt_assemblyai.py audio.mp3
# Specify English (US)
./stt_assemblyai.py -l en_us audio.mp3
# Spanish
./stt_assemblyai.py -l es audio.mp3
# French
./stt_assemblyai.py -l fr audio.mp3

Supported language codes: en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, and 80+ more
# Specify output file
./stt_assemblyai.py -o transcript.txt audio.mp3
# Creates: transcript.txt, audio.mp3.assemblyai.json

# Print transcript to stdout, no file creation
./stt_assemblyai.py -o - audio.mp3

# Basic info (-v)
./stt_assemblyai.py -v audio.mp3
# Detailed debug output (-vvvvv)
./stt_assemblyai.py -vvvvv audio.mp3

Log levels:
- -v (1+): INFO - Progress updates
- -vvvvv (5+): DEBUG - Full API request/response details
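Assuming the flag counts map onto Python's logging module levels, the verbosity handling might look like this (a sketch, not the tool's actual implementation):

```python
import logging

def log_level(verbosity: int) -> int:
    # 0 -> WARNING (default), 1-4 -> INFO, 5+ -> DEBUG (assumed mapping)
    if verbosity >= 5:
        return logging.DEBUG
    if verbosity >= 1:
        return logging.INFO
    return logging.WARNING
```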
# Suppress all output except the transcript
./stt_assemblyai.py -q audio.mp3

# Transcribe from URL (skips upload step)
./stt_assemblyai.py https://example.com/audio.mp3

# Use EU endpoint (default)
./stt_assemblyai.py -R eu audio.mp3
# Use US endpoint
./stt_assemblyai.py -R us audio.mp3

EU endpoint: https://api.eu.assemblyai.com
US endpoint: https://api.assemblyai.com
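The -R flag presumably just selects the API base URL; a minimal sketch using the two endpoints listed above:

```python
ENDPOINTS = {
    "eu": "https://api.eu.assemblyai.com",
    "us": "https://api.assemblyai.com",
}

def base_url(region: str = "eu") -> str:
    # Raises KeyError for anything other than the two supported regions
    return ENDPOINTS[region]
```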
audio_input - Path to audio file or URL to transcribe
Supported formats: MP3, MP4, WAV, FLAC, M4A, OGG, and more
- -d, --diarisation - Enable speaker diarisation (labels each speaker)
- -e, --expected-speakers N - Expected number of speakers (improves accuracy)
Note: -e automatically enables -d if not already specified
-o, --output PATH - Output file path (default: {audio_input}.txt)
- Use - for stdout only (no files created)
-q, --quiet- Suppress status messages (output only transcript)
-l, --language CODE - Language code (default: auto for auto-detection)
Common codes:
- en - English (auto-detect variant)
- en_us - English (US)
- en_uk - English (UK)
- en_au - English (Australia)
- es - Spanish
- fr - French
- de - German
- it - Italian
- pt - Portuguese
- zh - Chinese
- ja - Japanese
- ko - Korean
- hi - Hindi
- ru - Russian
-R, --region {eu|us} - API endpoint region (default: eu)
-v, --verbose - Increase verbosity (use multiple times: -v, -vv, -vvvvv)
--no-meta-message, --disable-meta-message - Disable META warning message
Environment Variables:
- STT_META_MESSAGE_DISABLE=1 - Disable system-wide
- STT_META_MESSAGE="text" - Custom message
Input: audio.mp3
Output:
- audio.mp3.assemblyai.json - Full AssemblyAI API response (always created)
- audio.mp3.txt - Human-readable transcript (default output)
Complete API response including:
- text - Full transcript text
- utterances - Array of speaker segments (with diarisation)
- words - Word-level timing and confidence
- confidence - Overall transcript confidence
- audio_duration - Audio length in milliseconds
- _meta_note - META warning message (if enabled)
Example:
{
"_meta_note": "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...",
"text": "Hello everyone, welcome to the show.",
"confidence": 0.95,
"audio_duration": 180000,
"utterances": [
{
"speaker": "A",
"text": "Hello everyone, welcome to the show.",
"confidence": 0.96,
"start": 100,
"end": 3200
}
]
}

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Hello everyone, welcome to the show. Today we're discussing...
---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT...
---
Speaker A:Hello everyone, welcome to the show.
Speaker B:Thanks for having me.
Speaker A:Today we're discussing artificial intelligence.
Format: Speaker {label}:{text}\n (note: no space after the colon)
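Given the utterances array from the JSON response, producing this format is a one-liner per segment (an illustrative sketch, not the tool's actual code):

```python
def format_utterances(utterances: list[dict]) -> str:
    # Each utterance carries a "speaker" label and its "text";
    # no space after the colon, matching the documented format
    return "".join(f"Speaker {u['speaker']}:{u['text']}\n" for u in utterances)
```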
The tool checks if output files already exist before transcribing:
$ ./stt_assemblyai.py audio.mp3
# Transcribes audio, creates files
$ ./stt_assemblyai.py audio.mp3
# Skips transcription, displays existing transcript
SKIPPING: transcription of audio.mp3 as audio.mp3.txt already exists

To force re-transcription: Delete the existing .txt file
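The idempotency check is just an existence test on the default output path; a sketch (`should_skip` is a hypothetical helper name):

```python
from pathlib import Path

def should_skip(audio_input: str) -> bool:
    # The default output path is the input path plus a .txt suffix
    out = Path(audio_input + ".txt")
    if out.exists():
        print(f"SKIPPING: transcription of {audio_input} as {out} already exists")
        return True
    return False
```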
Use stt_video_using_assemblyai.sh wrapper:
./stt_video_using_assemblyai.sh video.mp4 2
# Extracts audio, transcribes with 2 speakers

Process with stt_assemblyai_speaker_mapper.py:
# Step 1: Transcribe with diarisation
./stt_assemblyai.py -d audio.mp3
# Step 2: Map speaker labels to names
./stt_assemblyai_speaker_mapper.py -m "Alice Anderson,Bob Martinez" audio.mp3.assemblyai.json

Error: ASSEMBLYAI_API_KEY environment variable not set.
Solution: export ASSEMBLYAI_API_KEY="your_key"
ERROR: Error in upload_file: [Errno 2] No such file or directory: 'audio.mp3'
Solution: Check the file path
ERROR: Error in upload_file: 401 Unauthorized
REST RESPONSE: {"error":"Invalid API key"}
Solution: Verify the API key is correct
ERROR: Error in create_transcript: Transcription failed: Unsupported audio format
Solution: Convert the audio to a supported format (MP3, WAV, etc.)
ERROR: Error in create_transcript: Connection timeout
Solution: Check your internet connection and try again
If you specify -e/--expected-speakers without -d, diarisation is automatically enabled:
# These are equivalent:
./stt_assemblyai.py -e 2 audio.mp3
./stt_assemblyai.py -d -e 2 audio.mp3

Output:
WARNING: -e/--expected-speakers specified without -d/--diarisation; enabling diarisation to satisfy AssemblyAI requirements.
For long audio (>2 hours), use verbose mode to monitor progress:
./stt_assemblyai.py -v audio.mp3

Output:
INFO: Processing audio input...
INFO: output filename: audio.mp3.txt
INFO: Uploading audio file...
INFO: File uploaded. URL: https://...
INFO: Creating transcript...
INFO: Transcript ID: abc123...
INFO: Current status: queued
INFO: Current status: processing
INFO: Current status: processing
INFO: Current status: completed
INFO: Transcript created. Writing output...
INFO: Server response written to audio.mp3.assemblyai.json
INFO: Output written to audio.mp3.txt
INFO: Done.
The tool polls AssemblyAI every 5 seconds for status updates. For very short audio, this polling interval, rather than the transcription itself, may dominate the total wait time.
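The status loop can be sketched as follows (illustrative; `get_status` stands in for the real GET request against the transcript endpoint):

```python
import time

def wait_for_transcript(get_status, interval: float = 5.0) -> str:
    # Poll until the job reaches a terminal state
    while True:
        status = get_status()  # e.g. "queued", "processing", "completed", "error"
        if status in ("completed", "error"):
            return status
        time.sleep(interval)
```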
When using the stt-in-batch pipeline, language is automatically detected from filenames:
Include case-sensitive language codes with separators (space, dot, dash, underscore):
| Pattern | Language | Example |
|---|---|---|
| EN | English | meeting_EN.mp3 → -l en |
| PL | Polish | call-PL.wav → -l pl |
| DE | German | interview.DE.mp3 → -l de |
| AUTO | Auto-detect | podcast_AUTO.mp3 → -l auto |
Examples:
meeting_EN.mp3 # Uses English
podcast-PL.mp3 # Uses Polish
call.DE.wav # Uses German
recording_AUTO.mp3 # Auto-detects language
untitled.mp3         # Defaults to English

Pattern requirements: The code must be surrounded by separators (_EN_, -EN-, .EN., EN, etc.)
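The filename convention can be matched with a small regex; a sketch covering only the four codes tabled above (the real pipeline supports more, and this is not its actual code):

```python
import re

# Illustrative subset of the filename-code -> -l flag mapping
LANG_CODES = {"EN": "en", "PL": "pl", "DE": "de", "AUTO": "auto"}

def detect_language(filename: str, default: str = "en") -> str:
    # Case-sensitive code surrounded by separators (space, dot, dash, underscore)
    m = re.search(r"(?:^|[ ._-])(EN|PL|DE|AUTO)(?=[ ._-]|$)", filename)
    return LANG_CODES[m.group(1)] if m else default
```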
See stt-in-batch.README.md for full documentation.
- stt_assemblyai_speaker_mapper.py - Map speaker labels (A, B, C) to actual names
- stt-in-batch - Batch processing pipeline with language detection
- stt_video_using_assemblyai.sh - Wrapper script for video transcription
- stt_openai_OR_local_whisper_cli.py - Alternative tool using OpenAI Whisper
Cause: Filename too long (>255 characters)
Solution: Tool automatically truncates long filenames while preserving extensions
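The truncation behaviour described above might look like this (a sketch under the assumption that only the basename is shortened and the final extension is kept):

```python
import os

def truncate_filename(path: str, max_len: int = 255) -> str:
    # Shorten the basename to fit the filesystem limit, preserving the extension
    dirname, name = os.path.split(path)
    if len(name) <= max_len:
        return path
    root, ext = os.path.splitext(name)
    root = root[: max_len - len(ext)]
    return os.path.join(dirname, root + ext)
```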
Cause: Forgot -d flag
Solution: Re-run with -d or -e N flags
Cause: Auto-detection uncertain
Solution: Specify language explicitly with -l flag
Possible causes:
- Poor audio quality
- Background noise
- Multiple overlapping speakers
- Strong accents or dialects
Solutions:
- Use noise reduction on the audio
- Specify expected speakers with -e
- Specify language with -l
- Use a higher quality audio source
Part of the CLIAI handy_scripts collection.