PCS

PCS is a data engineering and AI pipeline that transforms long French parliamentary audio sessions into a relational database enriched with speaker identities, semantic search capabilities, and audit metadata.

The pipeline combines voice activity detection, speaker diarization, speech recognition, entity extraction, semantic embeddings, audio-quality audits, and SQLite persistence.

Rather than producing only a transcript, PCS tracks uncertainty throughout the processing chain, highlighting where the audio, transcript, speaker attribution, and official parliamentary record agree, disagree, or require review.

A typical 5-hour parliamentary session (~580 MB WAV) produces a self-contained SQLite database of roughly 5 MB containing transcripts, speaker identities, named entities, semantic chunks, embeddings, and audit metadata.

Across multiple test runs, roughly 13 hours of parliamentary audio were distilled into about 13 MB of SQLite databases.

The resulting database can be explored with SQL, searched semantically, or inspected through generated audit artifacts.
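As a rough illustration of what semantic search over stored embeddings involves, the sketch below ranks chunks by cosine similarity to a query vector. It is not taken from the PCS code: the function names, the `chunk_id -> vector` mapping, and the toy two-dimensional vectors are assumptions made for the example, and a real query vector would come from the same embedding model used by the pipeline.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, chunk_id) pairs most similar to the query, best first."""
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in chunks.items()]
    return sorted(scored, reverse=True)[:k]


# Toy vectors standing in for real chunk embeddings (hypothetical data).
chunks = {"budget": [1.0, 0.0], "health": [0.0, 1.0], "mixed": [1.0, 1.0]}
results = top_k([1.0, 0.0], chunks, k=2)
```

A brute-force scan like this is usually adequate for a few hundred chunks per session; larger collections would likely want an index.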

Example Output

5h 20m parliamentary audio
          │
          ▼
+-----------------------+
|  assemblybot.sqlite   |
|        5.4 MB         |
+-----------------------+

13 relational tables
214 speaking turns
45 identified speakers
720 semantic chunks
934 embeddings

A standalone, dependency-free search demo is included in demo/. It lets you explore the generated SQLite database immediately with nothing more than a Python installation.
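Independently of the bundled demo, a first look at a generated database needs only the standard library. The sketch below lists every table with its row count by reading `sqlite_master`, so it makes no assumption about the actual PCS schema. The function name is hypothetical, and the path in the usage comment is only a placeholder for the database a run produces.

```python
import sqlite3


def summarize_tables(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Names are read from sqlite_master itself, so double-quoting them is enough here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Usage (placeholder path, adjust to the database written by your run):
#   conn = sqlite3.connect("assemblybot.sqlite")
#   for table, rows in summarize_tables(conn).items():
#       print(f"{table}: {rows} rows")
```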

Why This Exists

Long-form audio processing involves multiple imperfect sources of information.

Speech recognition can hallucinate or drift. Speaker diarization estimates who spoke when but may split or merge speakers incorrectly. Voice activity detection identifies probable speech regions without understanding their content. Official parliamentary records provide valuable context but are edited documents rather than direct acoustic observations.

Instead of hiding uncertainty, the pipeline preserves it. Each stage contributes evidence, confidence signals, provenance, audit metrics, and intermediate artifacts. Disagreements between sources are treated as useful information rather than failures.

The goal is not to produce a perfectly clean transcript. The goal is to produce a structured dataset where uncertain regions can be inspected, filtered, reviewed, or reprocessed with full traceability.

Pipeline Overview

.wav
  │
  ▼
Voice Activity Detection
  │
  ▼
Speaker Diarization
  │
  ▼
Transcription
  │
  ▼
Audio & Transcript Audits
  │
  ▼
Speaker / Transcript Alignment
  │
  ▼
Turn Consolidation
  │
  ▼
Identity Resolution
  │
  ▼
Semantic Chunking & Embeddings
  │
  ▼
SQLite Export

Current technologies

Stage	Technology
Voice Activity Detection	Silero VAD
Speaker Diarization	pyannote
Transcription	Faster-Whisper (large-v3)
Audio Audit	librosa
Named Entity Recognition	CamemBERT NER
Semantic Embeddings	STS CamemBERT
Persistence	SQLite + SQLAlchemy

What It Does

Given one or more .wav files in data/audio/unprocessed/, main.py discovers candidate audio files, creates a reproducible run folder, executes the pipeline stage by stage, writes a manifest after every stage, and moves the source file to processed or failed depending on the result.
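The control flow described above can be sketched as follows. This is a minimal illustration, not the actual orchestration code: the `run_file` function, the stage signature, the `manifest.json` file name, and the manifest fields are all assumptions chosen for the example.

```python
import json
import shutil
import time
from pathlib import Path
from typing import Callable

# Hypothetical stage signature: (audio_path, run_dir) -> None, raising on failure.
Stage = Callable[[Path, Path], None]


def run_file(audio: Path, run_dir: Path, stages: dict[str, Stage],
             processed: Path, failed: Path) -> bool:
    """Run stages in order, rewriting the manifest after every stage."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"source": audio.name, "stages": []}
    ok = True
    for name, stage in stages.items():
        started = time.perf_counter()
        try:
            stage(audio, run_dir)
            status = "ok"
        except Exception as exc:  # record the failure instead of losing it
            status = f"failed: {exc}"
            ok = False
        manifest["stages"].append(
            {"name": name, "status": status,
             "seconds": round(time.perf_counter() - started, 3)}
        )
        # Written after each stage so a crash still leaves a usable record.
        (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
        if not ok:
            break  # later stages depend on earlier artifacts
    target = processed if ok else failed
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(audio), str(target / audio.name))
    return ok
```

Writing the manifest inside the loop, rather than once at the end, means an interrupted run can still be inspected up to the last completed stage.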

The current pipeline runs:

Silero VAD to detect speech regions.
pyannote diarization to estimate who spoke when and generate speaker embeddings for later identity resolution.
Faster-Whisper transcription to generate timestamped French transcript segments.
librosa audio audit to compute acoustic metrics such as energy, level in dB, spectral flatness, and related quality signals.
Transcript audit to flag suspicious transcript regions using VAD, diarization, audio evidence, a Whisper confidence proxy, and heuristics.
Transcript/diarization alignment to connect text segments with speaker activity.
Turn consolidation to produce speaker turns.
Named-entity extraction and speaker identity resolution using French NER models and ground-truth lists of deputies and ministers.
Semantic chunking and embeddings for semantic search and retrieval.
SQLite export for queryable downstream analysis.

Each run is stored in its own reproducible run directory containing configuration snapshots, logs, intermediate artifacts, audit outputs, embeddings, and the final SQLite database.

Repository Map

main.py — pipeline entrypoint.
src/assemblybot/stages/ — processing stages (VAD, diarization, transcription, audits, enrichment, export).
src/assemblybot/orchestration/ — run management, manifests, provenance, and pipeline execution.
src/assemblybot/models/ — shared domain models and artifacts.
src/assemblybot/db/ — SQLite schema and export logic.
tests/ — unit and integration tests.
docs/ — technical documentation and diagrams.
notes/ — experiment journals and design decisions.
ground_truth_PER/ — deputy and minister reference datasets.

Requirements

Python >= 3.12
uv
NVIDIA GPU recommended
Hugging Face access token required for model downloads

Tested on:

RTX 4080 Laptop GPU (12 GB VRAM)
CUDA 13
Linux

Running The Pipeline

Place one or more .wav files in:

data/audio/unprocessed/

Run the default batch:

uv run python main.py

Performance

Tested on:

RTX 4080 Laptop GPU (12 GB VRAM)
CUDA 13

Typical processing time:

Input:   5h 20m parliamentary session
Runtime: 26 minutes

≈ 12× faster than real time
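The speed-up figure is simply audio duration divided by wall-clock runtime. The helper below is a hypothetical convenience for recomputing it on other runs.

```python
def real_time_factor(audio_minutes: float, runtime_minutes: float) -> float:
    """How many minutes of audio are processed per minute of runtime."""
    return audio_minutes / runtime_minutes


audio_minutes = 5 * 60 + 20  # the 5h 20m session above
speedup = real_time_factor(audio_minutes, 26)
print(round(speedup, 1))  # 12.3
```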

Reproducibility and Traceability

This project is deliberately designed around reproducibility, provenance, and inspection. It provides:

Run-level provenance, including source file hashes, ingestion metadata, and processing timestamps.
Stage-by-stage manifests recording status, runtime, configuration, and generated artifacts.
Persistent intermediate artifacts to allow inspection and reprocessing without rerunning the full pipeline.
Audit signals and quality flags for suspicious transcript, diarization, or audio regions.
Relational persistence through SQLite for downstream analysis and retrieval.
Automated tests covering alignment, enrichment, chunking, and database export workflows.

The goal is to make every output traceable to the evidence that produced it. Model outputs are not treated as truth by default; they are treated as observations that can be validated, audited, and revisited.
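As an illustration of run-level provenance, the sketch below hashes a source file in chunks and records basic ingestion metadata. The function name and record fields are assumptions for the example, and SHA-256 is picked here only as a reasonable default; the source does not say which hash PCS uses.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(path: Path, chunk_size: int = 1 << 20) -> dict[str, object]:
    """Hash a file in 1 MiB chunks so large WAV files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the hash alongside each run makes it possible to confirm later that a database was produced from a specific, unmodified audio file.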

Roadmap

PCS runs end to end on French parliamentary audio and produces auditable intermediate artifacts, speaker and person enrichment, semantic embeddings, and a searchable SQLite database.

Current areas of work include:

improving speaker identity recall
refining speaker attribution and confidence scoring
expanding evaluation against manually reviewed samples
improving retrieval quality through chunk and segment filtering
simplifying the search and query experience for downstream users
