PhillMckinnon/LocalVoxScribe

LocalVoxScribe

An asynchronous, microservice-based application for private, offline multimedia analysis.

Quick Start · System Architecture · Low-Resource Optimization · Data Contracts

Turn your local terminal into an autonomous multimedia intelligence hub. Drop long-form video, audio recordings, or text-heavy transcripts into your workspace, and the local pipeline handles demultiplexing, speech-to-text, diarization, structural aggregation, and summarization entirely on your host machine.

The Crucial Distinction: This is not a wrapper relying on external cloud APIs or high-end enterprise hardware. This system orchestrates open-source models within tight infrastructure boundaries, maintaining complete data isolation and minimal external telemetry.

🛠️ Ecosystem & Token Reference

LocalVoxScribe is built on open-source components. The table below lists each dependency, where to find its documentation, and what it needs from you.

| Component / Service | Resource Gateway | Setup & Credentials |
| --- | --- | --- |
| Python 3.10+ | Python.org Documentation | Execution Runtime |
| RabbitMQ | RabbitMQ Tutorials | Local Broker Container |
| Ollama | Ollama Model Library | Pulls qwen2.5:1.5b-instruct-q4_K_M |
| Docker | Docker Specification Engine | Containerization Runtime |
| Hugging Face | Hugging Face Portal | Requires User Access Token |
| Telegram Bot API | Telegram BotFather Core | Requires HTTP Bot Token |

🔑 Token Retrieval Guide

Before starting the containers, obtain the following tokens and add them to your .env configuration file:

1. PyAnnote Speaker Diarization Token (HUGGINGFACE_TOKEN)

The PyAnnote Audio 3.1 models are gated: you must accept their license conditions with a Hugging Face account before the weights can be downloaded:

  1. Create an account or log in to the Hugging Face Portal.
  2. Accept the user conditions on both model repositories: pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1.
  3. Navigate to Settings > Access Tokens or go straight to huggingface.co/settings/tokens.
  4. Click New Token, set the permission level to Read, and paste the generated string into your .env file under HUGGINGFACE_TOKEN.

2. Telegram Bot Token (TELEGRAM_BOT_TOKEN)

To enable the Telegram interface (built on Aiogram 3):

  1. Open your Telegram client and search for the verified account @BotFather.
  2. Send the command /newbot.
  3. Follow the prompts to choose a display Name and a Username for your bot.
  4. Copy the HTTP API token that BotFather returns into your .env file under TELEGRAM_BOT_TOKEN.
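
As a quick sanity check before starting the containers, the two token entries can be verified with a few lines of Python. This is a minimal sketch, assuming the .env file uses plain KEY=VALUE lines; the variable names come from the guide above, while `parse_env_file` and `missing_tokens` are hypothetical helpers and not part of the repository.

```python
from typing import Dict, List, Mapping

# Variable names taken from the Token Retrieval Guide above.
REQUIRED_TOKENS = ("HUGGINGFACE_TOKEN", "TELEGRAM_BOT_TOKEN")


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_tokens(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_TOKENS if not env.get(name)]
```

Calling `missing_tokens(parse_env_file(open(".env").read()))` should return an empty list once both tokens are in place.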

🚀 Key Features

  • Local Data Isolation: Source assets, database indexes, and model weights stay on your host machine. Network access is needed only for the initial model downloads and, if enabled, the Telegram bot gateway.
  • Decoupled Asynchronous Processing: A RabbitMQ message layer separates UI event threads from long-running, compute-heavy deep learning workloads.
  • Unified Speech-to-Text Pipeline: Combines Faster-Whisper transcription with PyAnnote Audio 3.1 speaker diarization to attribute each segment to a distinct speaker.
  • Tightly Budgeted Inference: Configured for constrained execution (≤ 6 GB RAM across the entire container stack), running exclusively on CPU.
  • Dynamic Analytics Engines: Backed by SQLite3, so operators can store tailored instructions and map them to custom output templates.

🏗️ System Architecture

The system splits data processing into container-isolated, decoupled microservices. Rather than blocking the interactive layers during long operations, components pass lightweight JSON messages to one another over distinct RabbitMQ queues.

  ┌────────────────┐      ┌─────────────┐
  │   Desktop UI   │      │ Telegram Bot│
  └───────┬────────┘      └──────┬──────┘
          │                      │
          └───────────┬──────────┘
                      ▼
             ┌─────────────────┐
             │    RabbitMQ     │ (Message Broker)
             └────────┬────────┘
                      │
     ┌────────────────┼────────────────┐
     ▼                ▼                ▼
┌───────────┐  ┌────────────┐   ┌────────────┐   ┌────────────┐
│   Media   │  │ Speech-to- │   │ Summarizer │   │ Local LLM  │
│ Processor │  │ Text (STT) │   │  Service   │   │  (Ollama)  │
└─────┬─────┘  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘
      │              │                │                │
      └──────────────┴───────┬────────┴────────────────┘
                             ▼
                    ┌─────────────────┐
                    │  Shared Volume  │ (SQLite3 DB & Media Storage)
                    └─────────────────┘

Microservice Queue Mapping

| Queue Name | Source Component | Target Worker Component | Responsibility |
| --- | --- | --- | --- |
| media_tasks | Desktop UI / Telegram Bot | Media Processor | Extracts and resamples incoming streams to 16 kHz mono PCM WAV via FFmpeg. |
| transcription_tasks | Media Processor | Speech-to-Text | Runs automated speech recognition and builds speaker diarization timelines. |
| summarization_tasks | Speech-to-Text | Summarizer | Composes structured summaries from the transcript and the user's instruction prompt. |
| ui_results / bot_results | Summarizer | Desktop UI / Telegram Bot | Returns completed structured summaries to the interface layers. |
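
The hand-off order in the table can be expressed as a small routing helper. The sketch below is illustrative only: the queue names are taken from the table, but the `origin` labels ("desktop", "telegram") and the `next_queue` function are assumptions, and the real workers may route messages differently.

```python
from typing import Dict, List

# Processing stages in the order given by the queue mapping table.
PIPELINE: List[str] = ["media_tasks", "transcription_tasks", "summarization_tasks"]

# Hypothetical origin labels mapped to the result queues from the table.
RESULT_QUEUES: Dict[str, str] = {"desktop": "ui_results", "telegram": "bot_results"}


def next_queue(current: str, origin: str) -> str:
    """Return the queue a worker would publish to after finishing `current`."""
    index = PIPELINE.index(current)  # raises ValueError for an unknown queue
    if index + 1 < len(PIPELINE):
        return PIPELINE[index + 1]
    # The last stage sends the result back to the interface that submitted the task.
    return RESULT_QUEUES[origin]
```

Carrying the originating interface inside each message is one way the Summarizer could choose between ui_results and bot_results without any shared state.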

📂 Repository Structure

.
├── .env.example                 # Environment parameters deployment template
├── docker-compose.yml           # Core multi-container system deployment layout
├── storage/                     # Persistent local storage directories
│   ├── db/                      # Houses central app_data.db (SQLite3)
│   ├── model_cache/             # Shared local cache directory for model checkpoints
│   └── raw_media/               # Shared transactional processing volume context
└── services/
    ├── desktop-ui/              # CustomTkinter client environment (app.py)
    ├── telegram-bot/            # Fully independent Aiogram 3 gateway application
    ├── media-processor/         # FFmpeg automated data ingest workflows
    ├── speech-to-text/          # Faster-Whisper and PyAnnote model layers
    ├── summarizer/              # LangChain execution logic and runtime routing
    └── shared/                  # Common internal library modules
        ├── db_manager.py        # Centralized database management interfaces
        ├── rabbitmq_utils.py    # Resilient message broker link wrappers
        └── schema.py            # Data contract structures (MediaTask, etc.)
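
To give a feel for what a data contract such as MediaTask might look like, here is a dependency-free sketch. Every field name below is hypothetical, and the standard-library `dataclasses` module stands in for Pydantic purely to keep the example self-contained; the actual definitions live in services/shared/schema.py.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class MediaTask:
    # Hypothetical fields; consult shared/schema.py for the real contract.
    source_path: str                      # file inside the shared volume
    origin: str                           # interface that submitted the task
    prompt_template: str = "Executive Summary"
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize the task into a JSON message body."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "MediaTask":
        """Rebuild a task from a JSON message body."""
        return cls(**json.loads(payload))
```

A worker would call `MediaTask.from_json(body)` on each incoming message and `to_json()` before publishing to the next queue.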


🛠️ Tech Stack

  • Core Runtime environment: Python 3.10+
  • Message Orchestration: RabbitMQ (pika)
  • Neural Networks & Core ML: LangChain, Pydantic, Faster-Whisper, PyAnnote Audio, PyTorch
  • Data Storage Engines: SQLite3
  • Interface Systems: CustomTkinter (Desktop UI), Aiogram 3 (Telegram API Interface)
  • Infrastructure Layer: Docker, Docker Compose, FFmpeg

⚑ Quick Start

Prerequisites

  • Operating System: Linux (Ubuntu/Debian recommended) or Windows 10/11 running Docker Desktop.
  • Hardware: At minimum a 4-core CPU with AVX2 support and 8 GB of system RAM (6 GB of which is allocated to the container cluster).
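
On Linux, the CPU requirement can be checked from /proc/cpuinfo. The following is a rough sketch under that assumption (x86 Linux only); `cpu_flags` and `meets_cpu_minimum` are hypothetical helper names, not part of the project.

```python
from typing import Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def meets_cpu_minimum(cpuinfo_text: str, cores: int) -> bool:
    """True when at least 4 cores are available and AVX2 is advertised."""
    return cores >= 4 and "avx2" in cpu_flags(cpuinfo_text)
```

A typical call is `meets_cpu_minimum(open("/proc/cpuinfo").read(), os.cpu_count() or 0)` after `import os`.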

Installation & Run

  1. Clone the repository:

     ```bash
     git clone https://github.com/PhillMckinnon/LocalVoxScribe.git
     cd LocalVoxScribe
     ```

  2. Create your environment configuration:

     ```bash
     cp .env.example .env
     ```

     Open the new .env file and set your credentials, including your Hugging Face token and Telegram bot token.

  3. Start Ollama and pull the language model. Before starting the full microservice cluster, bring up the Ollama container on its own and pull the required 4-bit quantized model:

     ```bash
     # Start the Ollama service in the background
     docker compose up -d ollama

     # Download the model into the Ollama container
     docker compose exec ollama ollama pull qwen2.5:1.5b-instruct-q4_K_M
     ```

  4. Start the remaining services. Once the model download completes, bring up the rest of the stack (RabbitMQ, Speech-to-Text pipeline, UI/Bot gateways, and Media Processor):

     ```bash
     docker compose up --build -d
     ```

Note: On the first run, allow a few extra minutes for Docker to download base image layers and for the PyAnnote speaker diarization weights to be fetched into your local model cache directory.


🧠 Swapping to a Different Language Model (Optional)

If your machine has more than 8 GB of RAM and you want a more capable model (such as llama3.2:3b or a larger qwen2.5 variant), change the target model in two steps:

Step 1: Pull your preferred model. Tell the Ollama container to fetch it from the Ollama registry:

docker compose exec ollama ollama pull <your-desired-model-string>

Step 2: Update the Summarizer configuration. Open the worker file at services/summarizer/src/processor.py (or your local equivalent source path) and change the model parameter to match the exact string you just pulled:

# services/summarizer/src/processor.py

# Locate the Ollama initialization block and update the model parameter:
self.llm = Ollama(
    base_url=f"http://{OLLAMA_HOST}:11434",
    model="your-desired-model-string"  # Change this from qwen2.5:1.5b-instruct-q4_K_M
)

Once edited, restart the summarizer worker to apply your changes: docker compose up -d --build summarizer-service.
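
If you would rather not edit source code each time, one option is to read the model name from an environment variable and fall back to the default. This is only a sketch of that idea: `OLLAMA_MODEL` is a hypothetical variable name that the project does not define, and `resolve_model` is an illustrative helper.

```python
import os
from typing import Mapping, Optional

# Default model tag used throughout this README.
DEFAULT_MODEL = "qwen2.5:1.5b-instruct-q4_K_M"


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model tag from OLLAMA_MODEL (hypothetical), else the default."""
    source = os.environ if env is None else env
    # Blank or whitespace-only values are treated as unset.
    return source.get("OLLAMA_MODEL", "").strip() or DEFAULT_MODEL
```

The result could then be passed as the `model` argument in the initialization block shown above, provided the same tag has already been pulled into the Ollama container.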

🌐 Handling Large Files via Local Telegram API Server (Optional)

By default, the cloud-hosted Telegram Bot API limits bots to downloading files of up to 20 MB and sending files of up to 50 MB. Because this system is designed to process multimedia recordings of up to 150 MB, running a local Telegram Bot API server container alongside the application cluster is strongly recommended.

Once a self-hosted API server instance is running on your network (e.g., listening on port 8081), update your active .env configuration file to redirect the network worker layers:

TELEGRAM_API_URL=http://telegram-api-server:8081

This lifts the file size constraint to 2 GB, speeds up internal network transits, and ensures that raw audio/video files are kept strictly within your private localized network.
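
The effect of the two modes on accepted file sizes can be sketched as a small guard. The numbers follow the text above (20 MB for cloud downloads, roughly 2 GB with a local server, 150 MB as the pipeline's design cap); the function names are hypothetical and the bot's real checks may differ.

```python
from typing import Optional

MB = 1024 * 1024
CLOUD_DOWNLOAD_LIMIT = 20 * MB   # cloud-hosted Bot API download limit for bots
LOCAL_SERVER_LIMIT = 2000 * MB   # about 2 GB with a self-hosted API server
PIPELINE_MAX = 150 * MB          # design cap stated in this README


def max_accepted_bytes(api_url: Optional[str]) -> int:
    """Largest incoming file the bot could accept for the given API endpoint."""
    transport_limit = LOCAL_SERVER_LIMIT if api_url else CLOUD_DOWNLOAD_LIMIT
    return min(transport_limit, PIPELINE_MAX)


def can_accept(file_size: int, api_url: Optional[str]) -> bool:
    """True when a file of `file_size` bytes fits within the current limits."""
    return file_size <= max_accepted_bytes(api_url)
```

With `TELEGRAM_API_URL` unset, a 100 MB recording would be rejected; with the local server configured, it fits within the 150 MB cap.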


📖 Operational Guidance

Custom Prompt Configurations

When the client applications are first deployed, the underlying SQLite database is initialized with a set of default summary configurations. You can extend the template set with prompts such as:

  • Executive Summary: "Analyze the transcribed meeting dialogue text. Extract a high-level summary overview detailing core operational goals..."
  • Action Items Mapping: "Isolate all explicit action steps specified across the timeline. Group individual assignments by speaker profile..."
  • Minutes of Meeting (MoM): "Generate a structured, formal timeline tracking core talking points, key design blockers, and organizational decisions..."
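
Adding a template by hand might look like the sketch below. The table name `prompt_templates` and its columns are assumptions made for illustration; the real schema is defined by services/shared/db_manager.py and may differ.

```python
import sqlite3


def add_template(conn: sqlite3.Connection, name: str, prompt: str) -> None:
    """Insert or overwrite a named prompt template (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prompt_templates ("
        "name TEXT PRIMARY KEY, prompt TEXT NOT NULL)"
    )
    # INSERT OR REPLACE overwrites an existing row with the same name.
    conn.execute(
        "INSERT OR REPLACE INTO prompt_templates (name, prompt) VALUES (?, ?)",
        (name, prompt),
    )
    conn.commit()
```

Parameter binding with `?` placeholders keeps free-form prompt text from being interpreted as SQL.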

⚙️ Low-Resource Optimization

This workspace operates under strict memory limits to run on standard home computers without an external GPU.

Zero-Key Offline Baseline Setup

Out of the box, building the system establishes a completely self-contained local operations suite:

| Capability | Local Tool / Mechanism | Operational Parameter |
| --- | --- | --- |
| Inference Framework | Ollama Engine Deployment | Constrained execution loop mapping strictly to CPU threads. |
| Language Base Weights | qwen2.5:1.5b-instruct-q4_K_M | 4-bit quantized model layer featuring low active RAM footprints. |
| Context Window Vector | Native Context Optimization | Managed token boundaries designed to run smoothly on standard CPUs. |
| Audio Processing Engine | FFmpeg Extraction Utilities | Converts files into lightweight mono streams before processing. |
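
The audio conversion step corresponds to a short FFmpeg invocation. The sketch below builds such a command line for the 16 kHz mono PCM WAV target named in the queue mapping; the flags are standard FFmpeg options, but the exact arguments used by the Media Processor are not shown in this README, so treat this as an approximation.

```python
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path) -> List[str]:
    """Build an FFmpeg argument list producing 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it exists
        "-i", str(src),        # input media (audio or video)
        "-vn",                 # drop any video stream
        "-ac", "1",            # downmix to a single (mono) channel
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # encode as signed 16-bit little-endian PCM
        str(dst),
    ]
```

The list can be passed directly to `subprocess.run(cmd, check=True)` on a host where FFmpeg is installed.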
