AssemblyAI Speaker Name Mapper

Post-processing tool to replace speaker labels (A, B, C) with actual speaker names in AssemblyAI transcription JSON files.

Purpose

When stt_assemblyai.py generates transcriptions with speaker diarisation (-d flag), it produces speaker labels like "Speaker A", "Speaker B", etc. This tool allows you to replace those generic labels with actual names after you've reviewed the transcript and identified who each speaker is.

Key Features

  • Format-agnostic: Uses recursive JSON traversal to find and replace ALL "speaker" keys, regardless of JSON structure
  • Future-proof: Works even if AssemblyAI changes their JSON format
  • Multiple input methods: Comma-separated, file-based (4 formats), interactive prompts, or LLM-assisted detection
  • Non-destructive: Creates new .mapped.json and .mapped.txt files, preserves originals
  • Repeatable: Remap the same transcript as many times as needed with different mappings
  • LLM-powered (optional): Automatically detect speaker names from conversation context using AI

LLM-Assisted Speaker Detection (Optional)

NEW: Use AI to automatically identify speaker names from transcript context!

Prerequisites

Install Instructor library for LLM integration:

# Core dependencies
pip install instructor pydantic

# Provider-specific (install as needed)
pip install openai          # For OpenAI
pip install anthropic       # For Anthropic/Claude
pip install google-generativeai  # For Google Gemini

# For local Ollama (no API key needed)
# Install Ollama from ollama.com
ollama pull llama3.2

Quick Start

# Automatic detection with OpenAI
./stt_assemblyai_speaker_mapper.py --llm-detect openai/gpt-4o-mini audio.json

# Local/offline with Ollama (free, no API key)
./stt_assemblyai_speaker_mapper.py --llm-detect ollama/llama3.2 audio.json

# Interactive mode with AI suggestions
./stt_assemblyai_speaker_mapper.py --llm-interactive anthropic/claude-3-5-haiku audio.json

Supported LLM Providers

| Provider | Format | Example | Requirements |
|----------|--------|---------|--------------|
| OpenAI | openai/MODEL | openai/gpt-4o-mini | API key |
| Anthropic | anthropic/MODEL | anthropic/claude-3-5-haiku | API key |
| Google | google/MODEL | google/gemini-2.0-flash-exp | API key |
| Groq | groq/MODEL | groq/llama-3.1-70b-versatile | API key (ultra-fast) |
| Ollama | ollama/MODEL | ollama/llama3.2 | Local (no API key) |
| 100+ more via LiteLLM | litellm/... | | Varies |

Set API keys:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."

LLM Detection Modes

1. Automatic Detection

AI analyzes transcript and applies best-guess speaker names:

./stt_assemblyai_speaker_mapper.py --llm-detect openai/gpt-4o-mini audio.json

Output:

INFO: Analyzing transcript with LLM...
INFO: LLM confidence: high
INFO: LLM reasoning: Names explicitly mentioned in conversation
INFO: Detected 2 speaker(s): A, B
INFO: Applied mappings:
INFO:   A → Alice Anderson
INFO:   B → Bob Smith
Created: audio.assemblyai.mapped.json, audio.mapped.txt

2. Interactive with AI Suggestions

AI suggests names, you confirm or override:

./stt_assemblyai_speaker_mapper.py --llm-interactive openai/gpt-4o-mini audio.json

Interaction:

=== AI-Detected Speaker Mappings ===
A => Alice Anderson
B => Bob Smith
C => Unknown

=== Review and Confirm ===
  Enter=accept | name=override | skip=abort | help=commands | play=audio
  about=edit context file | !cmd: run shell commands

A => [Alice Anderson]: _               ← Press Enter to accept
B => [Bob Smith]: Robert               ← Type to override
C => [Unknown]: about                  ← Opens editor to add context
→ Opening audio.about.md in nano...
C => [Unknown]: Charlie Chaplin        ← Now provide name

3. Fallback Mode

Try AI, fall back to manual if it fails:

./stt_assemblyai_speaker_mapper.py --llm-detect-fallback ollama/llama3.2 audio.json

If LLM fails (API error, timeout, etc.), automatically switches to manual interactive mode.
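
The fallback logic amounts to a try/except around the LLM call. A minimal sketch with hypothetical names (the script's internals may differ):

```python
def map_with_fallback(transcript, llm_detect, manual_interactive):
    """Sketch of the --llm-detect-fallback control flow.

    map_with_fallback and both callbacks are illustrative names,
    not the script's actual functions."""
    try:
        return llm_detect(transcript)
    except Exception as exc:  # API error, timeout, malformed response, ...
        print(f"LLM detection failed ({exc}); switching to manual mode")
        return manual_interactive(transcript)

# Demo with stand-in callbacks:
def failing_llm(transcript):
    raise TimeoutError("request timed out")

def manual_prompt(transcript):
    return {"A": "Alice Anderson"}  # would normally prompt the user

print(map_with_fallback({}, failing_llm, manual_prompt))
```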

Advanced LLM Options

Custom Endpoint (Remote Ollama)

./stt_assemblyai_speaker_mapper.py \
  --llm-detect ollama/llama3.2 \
  --llm-endpoint http://gpu-server:11434 \
  audio.json

Sample Size Control

# Send more utterances for better context (default: 20)
./stt_assemblyai_speaker_mapper.py \
  --llm-detect openai/gpt-4o-mini \
  --llm-sample-size 30 \
  audio.json

Verbose LLM Output

./stt_assemblyai_speaker_mapper.py -vv --llm-detect openai/gpt-4o-mini audio.json

Shows detailed LLM reasoning and confidence scores.

Cost & Performance

| Provider | Speed | Cost/transcript | Quality | Offline |
|----------|-------|-----------------|---------|---------|
| Groq | ⚡⚡⚡ | ~$0.001 | ⭐⭐⭐⭐ | No |
| OpenAI gpt-4o-mini | ⚡⚡ | ~$0.005 | ⭐⭐⭐⭐⭐ | No |
| Anthropic Haiku | ⚡⚡⚡ | ~$0.002 | ⭐⭐⭐⭐ | No |
| Anthropic Sonnet | ⚡⚡ | ~$0.020 | ⭐⭐⭐⭐⭐ | No |
| Ollama (local) | | Free | ⭐⭐⭐ | Yes |

Recommended: Start with groq/llama-3.1-70b-versatile (fast + cheap) or ollama/llama3.2 (free + offline).

META Transcript Warning Message

NEW: All transcript outputs now include a META warning message by default to remind readers that STT transcripts may contain errors.

What It Does

Automatically prepends a disclaimer to transcript files warning about potential transcription errors:

  • TXT files: YAML front matter format at top of file
  • JSON files: Appears as first key: {"_meta_note": "{message}", ...}
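
For the JSON case, the note only needs to be inserted as the first key. A minimal sketch (prepend_meta_note is an illustrative name, not necessarily the script's actual function):

```python
import json

def prepend_meta_note(data: dict, message: str) -> dict:
    """Return a new dict with _meta_note as the first key.

    Python dicts preserve insertion order, so building a new dict
    with the note first guarantees it serialises first."""
    return {"_meta_note": message, **data}

transcript = {"utterances": [{"speaker": "A", "text": "Hello"}]}
print(json.dumps(prepend_meta_note(transcript, "DRAFT TRANSCRIPT"), indent=2))
```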

Default Message

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN TRANSCRIPTION ERRORS. This transcript was generated by automated speech recognition technology and should be treated as a rough transcription for reference purposes. Common types of errors include: incorrect word recognition (especially homophones, proper nouns, technical terminology, or words in noisy audio conditions), missing or incorrect punctuation, speaker misidentification in multi-speaker scenarios, and timing inaccuracies. For best comprehension and to mentally correct potential errors, please consider: the broader conversational context, relevant domain knowledge, technical background of the subject matter, and any supplementary information about the speakers or topic. This transcript is intended to convey the general content and flow of the conversation rather than serving as a verbatim, word-perfect record. When critical accuracy is required, please verify important details against the original audio source.
---

Disabling the META Message

Via command-line flag:

./stt_assemblyai_speaker_mapper.py --no-meta-message -m "Alice,Bob" audio.json
# or
./stt_assemblyai_speaker_mapper.py --disable-meta-message -m "Alice,Bob" audio.json

Via environment variable (system-wide):

export STT_META_MESSAGE_DISABLE=1
./stt_assemblyai_speaker_mapper.py -m "Alice,Bob" audio.json

Custom META Message

You can provide your own custom warning message:

export STT_META_MESSAGE="DRAFT TRANSCRIPT - NOT VERIFIED - FOR INTERNAL USE ONLY"
./stt_assemblyai_speaker_mapper.py -m "Alice,Bob" audio.json

Example Output

TXT file (audio.mapped.txt):

---
meta: THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN...
---
Alice Anderson:	Hello everyone, welcome to the show.
Bob Martinez:	Thanks for having me.

JSON file (audio.assemblyai.mapped.json):

{
  "_meta_note": "THIS IS AN AUTOMATED SPEECH-TO-TEXT (STT) TRANSCRIPT AND MAY CONTAIN...",
  "utterances": [
    {
      "speaker": "Alice Anderson",
      "text": "Hello everyone, welcome to the show."
    }
  ]
}

How It Works

The LLM analyzes a strategic sample of the transcript looking for:

  • Direct name mentions: "Hi Alice", "Thanks Bob"
  • Introductions: "I'm...", "My name is..."
  • Context clues: Professional roles, relationships, topics
  • Speaking patterns: Formality, expertise signals

It returns structured suggestions with confidence levels:

  • High: Names explicitly mentioned
  • Medium: Strong contextual clues
  • Low: Weak inference (often returns "Unknown")
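
With Instructor, these suggestions come back as a validated Pydantic object rather than free text. A sketch of what such a response model could look like (the field names here are illustrative, not the script's actual schema; assumes Pydantic v2):

```python
from typing import Dict, Literal

from pydantic import BaseModel, Field

class SpeakerSuggestions(BaseModel):
    """Illustrative response model; the script's actual schema may differ."""
    mappings: Dict[str, str] = Field(
        description="Speaker label -> suggested name ('Unknown' if unclear)"
    )
    confidence: Literal["high", "medium", "low"]
    reasoning: str

# Instructor patches the provider client so a model like this can be
# passed as response_model=..., and the LLM reply is parsed and
# validated into it automatically.
sample = SpeakerSuggestions.model_validate({
    "mappings": {"A": "Alice Anderson", "B": "Bob Smith"},
    "confidence": "high",
    "reasoning": "Names explicitly mentioned in conversation",
})
print(sample.mappings["A"])
```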

Context Files for Speaker Detection

Two types of context files can improve LLM speaker detection:

  1. Directory context (SPEAKER.CONTEXT.md) - Applies to all audio files in a directory tree
  2. File-specific context ({audiofile}.about.md) - Applies to a single audio file

Directory Context File (SPEAKER.CONTEXT.md)

Create a SPEAKER.CONTEXT.md file in any directory. It applies to all audio files in that directory and subdirectories (similar to .gitignore).

Search behavior:

  • Searches in the audio file's directory first
  • Walks up parent directories until found
  • Searches both original path AND resolved symlink path

Example structure:

project/
├── SPEAKER.CONTEXT.md    ← Applies to all files below
├── meetings/
│   ├── meeting1.mp3
│   └── meeting2.mp3
└── interviews/
    ├── SPEAKER.CONTEXT.md  ← Overrides for this subdir
    └── interview1.mp3

Example content:

# Project Context

This project contains recordings from Company X.

Common speakers:
* Greg Williams - CEO, leads most meetings
* Alice Chen - CTO, discusses technical topics
* Bob Smith - Sales Director, client-facing calls

Topics: product roadmap, engineering, sales pipeline

About Files for Speaker Context

Provide file-specific context with .about.md files.

About File Specification

  • Path: {audiofile}.about.md (e.g., meeting.mp3.about.md)
  • Purpose: Provide context about speakers, roles, and topics
  • Format: Free-form markdown

Creating About Files

Via Interactive Command

During interactive mode, type about to open the file in your editor:

=== Review and Confirm ===
A => [Unknown]: about

→ Opening meeting.mp3.about.md in nano...
✓ About file saved: meeting.mp3.about.md

A => [Unknown]:

The file is opened in $EDITOR (or $VISUAL, or nano as fallback).

Manually

Create the file before running speaker detection:

cat > meeting.mp3.about.md << 'EOF'
## Meeting Context

This is a product planning meeting between:

* Alice Chen - Product Manager, leads the discussion
* Bob Smith - Engineering Lead, discusses technical feasibility
* Carol Davis - Designer, presents UI mockups

Topics covered: Q1 roadmap, feature prioritization, design review
EOF

How About Files Enhance Detection

When an .about.md file exists:

  1. Content is automatically loaded and passed to the LLM prompt
  2. LLM uses this context alongside transcript analysis
  3. Significantly improves accuracy when names aren't mentioned in audio

Example prompt addition:

CONTEXT PROVIDED BY USER:
## Meeting Context

This is a product planning meeting between:
* Alice Chen - Product Manager
* Bob Smith - Engineering Lead

Use the above context to help identify speakers...

{about} Placeholder

Use in interactive !commands:

!cat {about}        # View about file content
!less {about}       # Page through about file
!$EDITOR {about}    # Edit about file

Placeholders:

  • {about} - Full path to about file
  • {ab} - Short alias
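
Placeholder expansion is a straightforward substitution before the shell command runs. A sketch (expand_placeholders is an illustrative name):

```python
def expand_placeholders(cmd: str, about_path: str) -> str:
    """Replace {about}/{ab} tokens with the about-file path.

    Illustrative helper; the script may implement this differently."""
    for token in ("{about}", "{ab}"):
        cmd = cmd.replace(token, about_path)
    return cmd

print(expand_placeholders("cat {about}", "meeting.mp3.about.md"))
```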

Workflow Integration

Option 1: LLM-Powered (Automated)

NEW: Let AI identify speakers automatically!

# Step 1: Transcribe audio with speaker diarisation
./stt_assemblyai.py -d audio.mp3
# Creates: audio.mp3.assemblyai.json, audio.mp3.txt

# Step 2: Let AI identify speakers (single command!)
./stt_assemblyai_speaker_mapper.py --llm-detect openai/gpt-4o-mini audio.mp3.assemblyai.json
# Creates: audio.mp3.assemblyai.mapped.json, audio.mp3.mapped.txt

# Step 3: Review results
cat audio.mp3.mapped.txt
# Output:
# Alice Anderson:	Hello there
# Bob Smith:	Hi, how are you?
# Alice Anderson:	I'm doing well

Even better - Interactive with AI suggestions:

# Step 2 alternative: AI suggests, you confirm/override
./stt_assemblyai_speaker_mapper.py --llm-interactive openai/gpt-4o-mini audio.mp3.assemblyai.json

# Shows:
# === AI-Detected Speaker Mappings ===
# A => Alice Anderson
# B => Bob Smith
#
# === Review and Confirm (press Enter to accept, or type to override) ===
# A => [Alice Anderson]: ← Press Enter to accept
# B => [Bob Smith]: ← Press Enter to accept

Option 2: Manual Mapping (Traditional)

# Step 1: Transcribe audio with speaker diarisation
./stt_assemblyai.py -d audio.mp3
# Creates: audio.mp3.assemblyai.json, audio.mp3.txt

# Step 2: Review transcript to identify speakers
cat audio.mp3.txt
# Output:
# Speaker A: Hello there
# Speaker B: Hi, how are you?
# Speaker A: I'm doing well

# Step 3: Detect speakers in JSON (optional)
./stt_assemblyai_speaker_mapper.py --detect audio.mp3.assemblyai.json
# Output: Detected speakers: A, B

# Step 4: Apply speaker name mapping manually
./stt_assemblyai_speaker_mapper.py -m "Alice Anderson,Beat Barrinson" audio.mp3.assemblyai.json
# Creates: audio.mp3.assemblyai.mapped.json, audio.mp3.mapped.txt

# Step 5: Review mapped transcript
cat audio.mp3.mapped.txt
# Output:
# Alice Anderson:	Hello there
# Beat Barrinson:	Hi, how are you?
# Alice Anderson:	I'm doing well

Usage Examples

1. Detect Speakers (Dry-run)

./stt_assemblyai_speaker_mapper.py --detect audio.assemblyai.json

Output:

Detected speakers: A, B, C

2. Inline Comma-Separated Mapping

./stt_assemblyai_speaker_mapper.py -m "Alice Anderson,Beat Barrinson,Charlie Chaplin" audio.assemblyai.json

Maps speakers in sorted order:

  • A → Alice Anderson
  • B → Beat Barrinson
  • C → Charlie Chaplin

3. File-Based Mapping (Auto-Detects Format)

Format 1: Sequential (Simple)

File: speakers.txt

Alice Anderson
Beat Barrinson
Charlie Chaplin

Usage:

./stt_assemblyai_speaker_mapper.py -M speakers.txt audio.assemblyai.json

Mapping: Sorted speakers → Sequential names

  • A → Alice Anderson
  • B → Beat Barrinson
  • C → Charlie Chaplin

Format 2: Explicit Key:Value

File: speakers.txt

A: Alice Anderson
B: Beat Barrinson
C: Charlie Chaplin

Usage:

./stt_assemblyai_speaker_mapper.py -M speakers.txt audio.assemblyai.json

Mapping: Direct key-to-value

Format 3: Full Speaker Labels

File: speakers.txt

Speaker A: Alice Anderson
Speaker B: Beat Barrinson

Usage:

./stt_assemblyai_speaker_mapper.py -M speakers.txt audio.assemblyai.json

Mapping: Full label as key

Format 4: Mixed (Flexible)

File: speakers.txt

A: Alice Anderson
Speaker B: Beat Barrinson
C: Charlie Chaplin

Usage:

./stt_assemblyai_speaker_mapper.py -M speakers.txt audio.assemblyai.json

Mapping: Handles both formats in the same file
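
Auto-detection of the four formats can be sketched roughly as follows (parse_mapping_file is a hypothetical helper; the script's real parser may differ in details):

```python
def parse_mapping_file(text: str, detected: list[str]) -> dict[str, str]:
    """Sketch of mapping-file format auto-detection.

    Lines containing ':' are explicit key:value pairs ('A: Alice' or
    'Speaker A: Alice'); lines without ':' are sequential names
    assigned to sorted speaker labels. Lines starting with '#' are
    comments."""
    mapping: dict[str, str] = {}
    sequential: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            if key.lower().startswith("speaker "):
                key = key[len("speaker "):].strip()  # 'Speaker B' -> 'B'
            mapping[key] = value.strip()
        else:
            sequential.append(line)
    # Sequential names fill in sorted labels not already mapped explicitly
    for label, name in zip(sorted(detected), sequential):
        mapping.setdefault(label, name)
    return mapping

print(parse_mapping_file("A: Alice\nSpeaker B: Bob\nC: Carol", ["A", "B", "C"]))
```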

4. Interactive Mapping

./stt_assemblyai_speaker_mapper.py --interactive audio.assemblyai.json

Interaction:

=== Detected Speakers ===
Name for 'A' (press Enter to keep): Alice Anderson
Name for 'B' (press Enter to keep): Beat Barrinson
Name for 'C' (press Enter to keep):
INFO: Detected 3 speaker(s): A, B, C
INFO: Applied mappings:
INFO:   A → Alice Anderson
INFO:   B → Beat Barrinson
INFO: Wrote JSON: audio.assemblyai.mapped.json
INFO: Wrote TXT: audio.mapped.txt
Created: audio.assemblyai.mapped.json, audio.mapped.txt

5. Advanced Options

Verbose Output + Force Overwrite

./stt_assemblyai_speaker_mapper.py -vv -f -m "Host,Guest" interview.json

Custom Output Path

./stt_assemblyai_speaker_mapper.py -o final_transcript -m "Alice,Bob" audio.json
# Creates: final_transcript.json, final_transcript.txt

Generate Only TXT (Quick Preview)

./stt_assemblyai_speaker_mapper.py --txt-only -m "Alice,Bob" audio.json
# Creates only: audio.mapped.txt

Generate Only JSON

./stt_assemblyai_speaker_mapper.py --json-only -m "Alice,Bob" audio.json
# Creates only: audio.assemblyai.mapped.json

Command-Line Options

Positional Arguments

  • input_json - Path to AssemblyAI JSON file (e.g., audio.assemblyai.json)

Mapping Sources (Mutually Exclusive)

  • -m, --speaker-map STR - Comma-separated speaker names (e.g., "Alice,Bob,Charlie")
  • -M, --speaker-map-file PATH - File with speaker mappings (auto-detects format)
  • --interactive - Interactively prompt for speaker names

Output Control

  • -o, --output BASE - Output base name (default: auto-generate with .mapped)
  • -f, --force - Overwrite existing output files
  • --txt-only - Generate only .txt file (skip .json)
  • --json-only - Generate only .json file (skip .txt)
  • --detect - Only show detected speakers and exit (no processing)

Logging

  • -v, --verbose - Increase verbosity (count-based: -v = INFO, -vvvvv = DEBUG)
  • -q, --quiet - Suppress all non-error output

META Message Control

  • --no-meta-message, --disable-meta-message - Disable the META warning message about transcription errors

Environment Variables:

  • STT_META_MESSAGE_DISABLE=1 - Disable META message system-wide
  • STT_META_MESSAGE="text" - Use custom META message

Output Files

Default Naming

  • Input: audio.mp3.assemblyai.json
  • JSON output: audio.mp3.assemblyai.mapped.json (full JSON with speaker fields replaced)
  • TXT output: audio.mp3.mapped.txt (formatted transcript with tab after speaker name)

TXT Format

Alice Anderson:	Hello there
Beat Barrinson:	Hi, how are you?
Alice Anderson:	I'm doing well

Note: Tab character (\t) after colon for easy parsing/alignment
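
The tab makes round-tripping straightforward. A quick sketch of writing and parsing this format:

```python
utterances = [
    {"speaker": "Alice Anderson", "text": "Hello there"},
    {"speaker": "Beat Barrinson", "text": "Hi, how are you?"},
]

# Writing: one line per utterance, tab after the colon
lines = [f"{u['speaker']}:\t{u['text']}" for u in utterances]
print("\n".join(lines))

# Reading back: split on the first ':\t' only, so colons inside the
# utterance text don't break parsing
speaker, text = lines[1].split(":\t", 1)
print(speaker)  # Beat Barrinson
```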

JSON Format

All "speaker" key values are replaced throughout the JSON structure:

{
  "utterances": [
    {
      "speaker": "Alice Anderson",
      "text": "Hello there",
      "confidence": 0.95,
      "start": 100,
      "end": 1500,
      "words": [
        {
          "text": "Hello",
          "start": 100,
          "end": 500,
          "confidence": 0.98,
          "speaker": "Alice Anderson"
        }
      ]
    }
  ]
}

How It Works: Recursive Traversal

The tool uses recursive JSON traversal to find and replace speaker values, making it robust against JSON structure changes:

def replace_speakers_recursive(obj, speaker_map):
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "speaker" and isinstance(value, str):
                # Replace speaker value
                obj[key] = speaker_map.get(value, value)
            else:
                # Recurse into nested structures
                replace_speakers_recursive(value, speaker_map)
    elif isinstance(obj, list):
        for item in obj:
            replace_speakers_recursive(item, speaker_map)

Benefits:

  • Works with ANY JSON structure containing "speaker" keys
  • Future-proof: handles AssemblyAI API changes
  • Comprehensive: catches speaker references in unexpected locations
  • Portable: could work with other STT providers' JSON formats
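
For example, running the traversal on a nested structure replaces the speaker keys inside the words array as well as the top-level ones (the function is repeated here so the snippet runs standalone):

```python
def replace_speakers_recursive(obj, speaker_map):
    # Same logic as the function shown above
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "speaker" and isinstance(value, str):
                obj[key] = speaker_map.get(value, value)
            else:
                replace_speakers_recursive(value, speaker_map)
    elif isinstance(obj, list):
        for item in obj:
            replace_speakers_recursive(item, speaker_map)

data = {"utterances": [{"speaker": "A", "words": [{"speaker": "A", "text": "Hi"}]}]}
replace_speakers_recursive(data, {"A": "Alice Anderson"})
print(data["utterances"][0]["speaker"])               # Alice Anderson
print(data["utterances"][0]["words"][0]["speaker"])   # Alice Anderson
```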

Validation & Warnings

The tool validates your mapping and provides helpful warnings:

Unmapped Speakers

WARNING: Unmapped speakers (keeping original): C

You provided mapping for A and B, but speaker C exists in the transcript. C will remain as "Speaker C".

Extra Mappings

WARNING: Extra mappings for non-existent speakers: D

You provided a mapping for speaker D, but no speaker D exists in the JSON.

Empty Mapping

WARNING: Empty speaker mapping - no changes will be made

No valid mappings were found (e.g., empty file or all speakers skipped in interactive mode).
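
The three checks boil down to set differences between the detected labels and the mapping keys. A sketch (validate_mapping is an illustrative name):

```python
def validate_mapping(detected: set[str], mapping: dict[str, str]) -> list[str]:
    """Sketch of the mapping validation; returns warning strings."""
    warnings = []
    unmapped = sorted(detected - mapping.keys())   # in transcript, no mapping
    extra = sorted(mapping.keys() - detected)      # mapped, not in transcript
    if unmapped:
        warnings.append(f"Unmapped speakers (keeping original): {', '.join(unmapped)}")
    if extra:
        warnings.append(f"Extra mappings for non-existent speakers: {', '.join(extra)}")
    if not mapping:
        warnings.append("Empty speaker mapping - no changes will be made")
    return warnings

print(validate_mapping({"A", "B", "C"}, {"A": "Alice", "B": "Bob", "D": "Dave"}))
```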

Error Handling

File Not Found

ERROR: File not found: audio.assemblyai.json

Invalid JSON

ERROR: Invalid JSON: Expecting value: line 1 column 1 (char 0)

No Speakers Detected

ERROR: No speakers detected in JSON (no 'speaker' keys found)

This means the JSON doesn't contain speaker diarisation data. Run stt_assemblyai.py with -d flag to enable diarisation.

No Mapping Source

ERROR: No mapping source provided (use -m, -M, or --interactive)

Output Files Exist

ERROR: Output file(s) already exist: audio.mapped.txt, audio.assemblyai.mapped.json
ERROR: Use -f/--force to overwrite

Edge Cases & Tips

Partial Mapping

You can map only some speakers:

# Only map speaker A, keep B and C as-is
./stt_assemblyai_speaker_mapper.py -m "Alice" audio.json

Comment Lines in Mapping Files

Mapping files support comment lines (lines starting with #):

# Project interview speakers
A: Alice Anderson
B: Beat Barrinson
# C was not identified yet

Remapping Multiple Times

The tool always reads from the original JSON, so you can remap the same file as many times as needed:

# First attempt (wrong names)
./stt_assemblyai_speaker_mapper.py -m "John,Jane" audio.json

# Correct attempt
./stt_assemblyai_speaker_mapper.py -f -m "Alice,Bob" audio.json

Integration with Wrapper Scripts

Works seamlessly with stt_video_using_assemblyai.sh:

# Extract and transcribe
./stt_video_using_assemblyai.sh -d video.mp4

# Review transcript
cat video.mp4.txt

# Map speakers
./stt_assemblyai_speaker_mapper.py -m "Host,Guest" video.mp4.assemblyai.json

Troubleshooting

Problem: TXT file not created

Cause: No transcript segments found in JSON

Solution: Check that JSON contains diarisation data with --detect flag

Problem: Wrong speaker order

Cause: Speakers are mapped in sorted order (A, B, C)

Solution: Use explicit key:value format in mapping file:

# B speaks first chronologically, A second
B: Bob
A: Alice

Problem: JSON structure changed

Cause: AssemblyAI updated their API response format

Solution: The recursive traversal should handle this automatically. If not, file an issue with sample JSON.

Development & Testing

Run Unit Tests

python3 test_stt_assemblyai_speaker_mapper.py

Test with Sample Data

# Create sample JSON
cat > sample.json << 'EOF'
{
  "utterances": [
    {"speaker": "A", "text": "Hello"},
    {"speaker": "B", "text": "Hi there"}
  ]
}
EOF

# Test detection
./stt_assemblyai_speaker_mapper.py --detect sample.json

# Test mapping
./stt_assemblyai_speaker_mapper.py -m "Alice,Bob" sample.json

Related Tools

  • stt_assemblyai.py - Main transcription tool (creates the JSON files this tool processes)
  • stt_video_using_assemblyai.sh - Wrapper script for video transcription
  • google_cloud_ai/multi-speaker_markup_from_dialog_transcript.py - Similar tool for Google Cloud AI TTS

License

Part of the CLIAI handy_scripts collection.