Quick Start · System Architecture · Low-Resource Optimization · Data Contracts
Turn your local terminal into an autonomous multimedia intelligence hub. Drop long-form video, audio recordings, or text-heavy transcripts into your workspace, and the local pipeline handles demultiplexing, speech-to-text, diarization, structural aggregation, and abstractive summarization entirely on your host machine.
The Crucial Distinction: This is not a wrapper relying on external cloud APIs or high-end enterprise hardware. This system orchestrates open-source models within tight infrastructure boundaries, maintaining complete data isolation and minimal external telemetry.
LocalVoxScribe is built on a distributed network of open-source frameworks. It relies on the following core environments to handle localized pipelines.
| Component / Service | Resource Gateway | Setup & Credentials |
|---|---|---|
| Python 3.10+ | Python.org Documentation | Execution Runtime |
| RabbitMQ | RabbitMQ Tutorials | Local Broker Container |
| Ollama | Ollama Model Library | Pulls `qwen2.5:1.5b` |
| Docker | Docker Specification Engine | Containerization Runtime |
| Hugging Face | Hugging Face Portal | Requires User Access Token |
| Telegram Bot API | Telegram BotFather Core | Requires HTTP Bot Token |
To initialize the private offline workspace containers successfully, you must capture authorization tokens from the following endpoints and append them to your .env configuration file:
Because PyAnnote Audio 3.1 requires explicit acceptance of its open-source license agreements, you must link a Hugging Face user profile:
- Create an account or log in to the Hugging Face Portal.
- Accept the terms of service on the weight model repositories: pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1.
- Navigate to Settings > Access Tokens or jump straight to huggingface.co/settings/tokens.
- Click New Token, set the permission level to `Read`, and paste the generated string into your `.env` file under `HUGGINGFACE_TOKEN`.
To establish remote interface communication lanes over the Aiogram 3 system:
- Open your Telegram client app and search for the verified account @BotFather.
- Issue the initialization command: `/newbot`.
- Follow the guided interactive responses to assign a display Name and an alphanumeric Username for your bot.
- Copy the unique HTTP API token provided by BotFather into your `.env` file under `TELEGRAM_BOT_TOKEN`.
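Before starting the containers, it can help to confirm that both tokens actually landed in `.env`. The following is a minimal sketch using only the standard library. The helper names `parse_env` and `missing_tokens` are illustrative and not part of the project, and the parser only handles simple `KEY=VALUE` lines.

```python
from pathlib import Path

# Variable names taken from the token setup steps above.
REQUIRED_KEYS = ("HUGGINGFACE_TOKEN", "TELEGRAM_BOT_TOKEN")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_tokens(text: str) -> list:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_tokens(text)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Running the script from the repository root prints which of the two variables still need a value.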
- Complete Air-Gapped Isolation: Zero third-party web requests. All source assets, database indexes, and model weights reside strictly inside your local host context.
- Decoupled Asynchronous Processing: Powered by a RabbitMQ transaction layer to separate UI event threads from long-running, compute-heavy deep learning workloads.
- Unified Speech-to-Text Pipeline: Combines Faster-Whisper parsing with PyAnnote Audio 3.1 multi-speaker cluster analysis to isolate distinct speaker timelines.
- Deeply Budgeted Inference Matrix: Configured for extreme constraint execution (≤ 6 GB RAM footprint across the entire container stack) running exclusively over CPU runtimes.
- Dynamic Analytics Engines: Driven by an SQLite3 backend, empowering operators to map tailored user instructions directly into custom template outputs.
The layout maps complex data processing into container-isolated, decoupled microservices. Instead of blocking the interactive layers during long-running operations, components pass lightweight JSON messages to one another through dedicated RabbitMQ queues.

     ┌────────────────┐     ┌──────────────┐
     │   Desktop UI   │     │ Telegram Bot │
     └───────┬────────┘     └──────┬───────┘
             │                     │
             └──────────┬──────────┘
                        ▼
               ┌─────────────────┐
               │    RabbitMQ     │ (Message Broker)
               └────────┬────────┘
                        │
          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
    ┌───────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
    │   Media   │ │ Speech-to- │ │ Summarizer │ │ Local LLM  │
    │ Processor │ │ Text (STT) │ │  Service   │ │  (Ollama)  │
    └─────┬─────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
          │             │              │              │
          └─────────────┴───────┬──────┴──────────────┘
                                ▼
                       ┌─────────────────┐
                       │  Shared Volume  │ (SQLite3 DB & Media Storage)
                       └─────────────────┘

| Queue Name | Source Component | Target Worker Component | Responsibility |
|---|---|---|---|
| `media_tasks` | Desktop UI / Telegram Bot | Media Processor | Extracts and resamples incoming streams to 16 kHz mono PCM WAV via FFmpeg. |
| `transcription_tasks` | Media Processor | Speech-to-Text | Evaluates automated speech recognition and speaker diarization timelines. |
| `summarization_tasks` | Speech-to-Text | Summarizer | Composes structured text layouts alongside user instruction prompt vectors. |
| `ui_results` / `bot_results` | Summarizer | Desktop UI / Telegram Bot | Returns completed structured analytical summaries back to the interface layers. |
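
The hand-off between queues can be pictured as a small routing table plus a JSON payload. The sketch below uses only the standard library. The `MediaTask` name appears in `schema.py`, but the fields shown here (`task_id`, `source`, `file_path`, `prompt`) and the `source` values are assumptions for illustration, and the project lists Pydantic rather than dataclasses for its real contracts.

```python
import json
from dataclasses import asdict, dataclass

# Stage order taken from the queue table: each worker publishes to the next queue.
NEXT_QUEUE = {
    "media_tasks": "transcription_tasks",
    "transcription_tasks": "summarization_tasks",
}
# The final hop depends on which interface submitted the job (hypothetical keys).
RESULT_QUEUE = {"desktop": "ui_results", "telegram": "bot_results"}


@dataclass
class MediaTask:
    """Illustrative payload; the field names are assumptions, not the real schema."""
    task_id: str
    source: str      # "desktop" or "telegram"
    file_path: str
    prompt: str


def next_queue(current: str, task: MediaTask) -> str:
    """Return the queue a worker should publish to after finishing `current`."""
    if current == "summarization_tasks":
        return RESULT_QUEUE[task.source]
    return NEXT_QUEUE[current]


def encode(task: MediaTask) -> bytes:
    """Serialize a task into the JSON body carried by the broker message."""
    return json.dumps(asdict(task)).encode("utf-8")


def decode(body: bytes) -> MediaTask:
    """Rebuild a task from a JSON message body."""
    return MediaTask(**json.loads(body))
```

Keeping the payload small and JSON-encoded is what lets each worker stay independent: a consumer only needs the queue name and the message body, not a reference to the producing service.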

    .
    ├── .env.example              # Environment parameters deployment template
    ├── docker-compose.yml        # Core multi-container system deployment layout
    ├── storage/                  # Persistent local storage directories
    │   ├── db/                   # Houses central app_data.db (SQLite3)
    │   ├── model_cache/          # Shared local cache directory for model checkpoints
    │   └── raw_media/            # Shared transactional processing volume context
    └── services/
        ├── desktop-ui/           # CustomTkinter client environment (app.py)
        ├── telegram-bot/         # Fully independent Aiogram 3 gateway application
        ├── media-processor/      # FFmpeg automated data ingest workflows
        ├── speech-to-text/       # Faster-Whisper and PyAnnote model layers
        ├── summarizer/           # LangChain execution logic and runtime routing
        └── shared/               # Common internal library modules
            ├── db_manager.py     # Centralized database management interfaces
            ├── rabbitmq_utils.py # Resilient message broker link wrappers
            └── schema.py         # Data contract structures (MediaTask, etc.)

- Core Runtime environment: Python 3.10+
- Message Orchestration: RabbitMQ (pika)
- Neural Networks & Core ML: LangChain, Pydantic, Faster-Whisper, PyAnnote Audio, PyTorch
- Data Storage Engines: SQLite3
- Interface Systems: CustomTkinter (Desktop UI), Aiogram 3 (Telegram API Interface)
- Infrastructure Layer: Docker, Docker Compose, FFmpeg
- Operating System: Linux (Ubuntu/Debian recommended) or Windows 10/11 running Docker Desktop.
- Hardware Matrix: Minimum 4-core CPU with AVX2 instruction support and 8 GB of system RAM (6 GB of which is allocated to the container cluster).
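
A quick preflight check against the CPU requirements might look like the sketch below. It assumes an x86 Linux host, where `/proc/cpuinfo` exposes a `flags` line; the function names are illustrative, and the check does not cover RAM or Docker Desktop on Windows.

```python
import os
from pathlib import Path

MIN_CORES = 4  # from the hardware requirements above


def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def preflight(cpuinfo_text: str, cores: int) -> list:
    """Collect human-readable problems; an empty list means the CPU checks passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU cores detected, {MIN_CORES} required")
    if not has_avx2(cpuinfo_text):
        problems.append("AVX2 flag not found in CPU info")
    return problems


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    for problem in preflight(text, os.cpu_count() or 0):
        print("WARNING:", problem)
```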
The installation and run sequence follows, including how to swap the default Ollama model for a larger or different alternative.
1. **Clone the repository:**
   ```bash
   git clone https://github.com/PhillMckinnon/LocalVoxScribe.git
   cd LocalVoxScribe
   ```
2. **Configure your local environment:**
   ```bash
   cp .env.example .env
   ```
   Open the newly created `.env` file and set your credentials, including your Hugging Face token and Telegram bot configuration.
3. **Initialize Ollama and pull the language model:**
   Before starting the full microservice cluster, bring up the Ollama container on its own and pull the required 4-bit quantized model:
   ```bash
   # Start the Ollama background service
   docker compose up -d ollama
   # Download the model into the Ollama container
   docker compose exec ollama ollama pull qwen2.5:1.5b-instruct-q4_K_M
   ```
4. **Deploy the remaining microservices:**
   Once the model download completes, bring up the rest of the application (RabbitMQ, the Speech-to-Text pipeline, the UI/Bot gateways, and the Media Processor):
   ```bash
   docker compose up --build -d
   ```

Note: On the first run, allow a few extra minutes for Docker to download base image layers and fetch the PyAnnote speaker diarization weights into your local model cache directory.

If your machine has more than 8 GB of RAM and you want a more capable model (such as llama3.2:3b or a larger qwen2.5 variant), change the target model in two steps:

Step 1: Pull your preferred model variant

Tell the Ollama container to fetch your target model from the Ollama registry:

```bash
docker compose exec ollama ollama pull <your-desired-model-string>
```

Step 2: Update the Summarizer runtime configuration

Open the worker file at services/summarizer/src/processor.py (or your local equivalent source path) and change the model parameter to match the exact string you just pulled:

```python
# services/summarizer/src/processor.py
# Locate the Ollama initialization block and update the model parameter:
self.llm = Ollama(
    base_url=f"http://{OLLAMA_HOST}:11434",
    model="your-desired-model-string"  # Change this from qwen2.5:1.5b-instruct-q4_K_M
)
```

Once edited, restart the summarizer worker to apply your changes: `docker compose up -d --build summarizer-service`.
By default, the cloud-hosted Telegram Bot API limits bots to downloading files of up to 20 MB and sending files of up to 50 MB. Because this system is designed to process multimedia recordings of up to 150 MB, running a local Telegram Bot API server container alongside the application cluster is highly recommended.
Once a self-hosted API server instance is running on your network (e.g., listening on port 8081), update your active .env configuration file to redirect the network worker layers:
TELEGRAM_API_URL=http://telegram-api-server:8081
This lifts the file size constraint to 2 GB, speeds up internal network transits, and ensures that raw audio/video files are kept strictly within your private localized network.
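
To make the size limits concrete, the sketch below shows one way a gateway could decide what to do with an incoming file. It is an illustration only: the function names and return labels are hypothetical, treating "MB" as 1024 × 1024 bytes is an assumption, and how the project itself reads `TELEGRAM_API_URL` is not shown in this document.

```python
MB = 1024 * 1024  # assumption: binary megabytes

CLOUD_DOWNLOAD_LIMIT = 20 * MB  # largest file a bot can fetch via the cloud Bot API
PIPELINE_LIMIT = 150 * MB       # largest recording this system is designed to process


def local_api_enabled(env: dict) -> bool:
    """Treat a non-empty TELEGRAM_API_URL as 'a local Bot API server is configured'."""
    return bool(env.get("TELEGRAM_API_URL", "").strip())


def check_upload(size_bytes: int, local_api: bool) -> str:
    """Classify an incoming file as 'accept', 'needs_local_api', or 'reject'."""
    if size_bytes > PIPELINE_LIMIT:
        return "reject"
    if size_bytes > CLOUD_DOWNLOAD_LIMIT and not local_api:
        return "needs_local_api"
    return "accept"
```

For example, a 90 MB recording is classified as `needs_local_api` without the local server and as `accept` once `TELEGRAM_API_URL` is set.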
When the client GUI application is deployed, the underlying SQLite database is initialized with a set of default summary templates. You can extend this set with your own prompts, following the structure of the defaults:
- Executive Summary: "Analyze the transcribed meeting dialogue text. Extract a high-level summary overview detailing core operational goals..."
- Action Items Mapping: "Isolate all explicit action steps specified across the timeline. Group individual assignments by speaker profile..."
- Minutes of Meeting (MoM): "Generate a structured, formal timeline tracking core talking points, key design blockers, and organizational decisions..."
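
As a rough picture of how such templates could be stored and looked up, the sketch below uses the standard `sqlite3` module with an in-memory-friendly schema. The table name `templates` and its columns are assumptions, not the layout of the project's `app_data.db`, and the prompt texts are kept truncated exactly as quoted above.

```python
import sqlite3
from typing import Optional

# Prompt texts copied from the list above, truncation included.
TEMPLATES = [
    ("Executive Summary",
     "Analyze the transcribed meeting dialogue text. Extract a high-level summary overview detailing core operational goals..."),
    ("Action Items Mapping",
     "Isolate all explicit action steps specified across the timeline. Group individual assignments by speaker profile..."),
    ("Minutes of Meeting (MoM)",
     "Generate a structured, formal timeline tracking core talking points, key design blockers, and organizational decisions..."),
]


def init_templates(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) templates table and seed the defaults once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        "name TEXT PRIMARY KEY, prompt TEXT NOT NULL)"
    )
    # INSERT OR IGNORE keeps the call idempotent and preserves user edits.
    conn.executemany(
        "INSERT OR IGNORE INTO templates (name, prompt) VALUES (?, ?)", TEMPLATES
    )
    conn.commit()


def get_prompt(conn: sqlite3.Connection, name: str) -> Optional[str]:
    """Return the prompt for a template name, or None if it does not exist."""
    row = conn.execute(
        "SELECT prompt FROM templates WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Adding a custom template is then a single `INSERT` with a new name and prompt.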
This workspace operates under strict memory limits to run on standard home computers without an external GPU.
Out of the box, building the system establishes a completely self-contained local operations suite:
| Capability | Local Tool / Mechanism | Operational Parameter |
|---|---|---|
| Inference Framework | Ollama Engine Deployment | Constrained execution loop mapping strictly to CPU threads. |
| Language Base Weights | `qwen2.5:1.5b-instruct-q4_K_M` | 4-bit quantized model layer featuring low active RAM footprints. |
| Context Window Vector | Native Context Optimization | Managed token boundaries designed to run smoothly on standard CPUs. |
| Audio Processing Engine | FFmpeg Extraction Utilities | Converts files into lightweight mono streams before processing. |
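
The audio extraction step can be sketched as a thin wrapper around the FFmpeg CLI. This is an illustration under stated assumptions, not the Media Processor's actual code: the function names are hypothetical, and the 16-bit little-endian sample format (`pcm_s16le`) is an assumption, since the document only specifies 16 kHz mono PCM WAV.

```python
import subprocess


def build_ffmpeg_cmd(src: str, dst: str) -> list:
    """Build an FFmpeg command that extracts 16 kHz mono PCM audio into a WAV file."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", src,            # input media (video or audio)
        "-vn",                # drop any video stream
        "-ac", "1",           # downmix to a single (mono) channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # assumed sample format: 16-bit little-endian PCM
        dst,
    ]


def convert(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.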