A production-ready Retrieval-Augmented Generation (RAG) application built with FastAPI, Ollama, PostgreSQL (pgvector), and Redis. Features comprehensive document processing, PDF table extraction, intelligent chunking, streaming responses, and accurate context-based answers. Supports documents up to 15 MB with tunable RAG parameters configurable at runtime from the Settings UI.
See the Architecture and Project Structure sections below for a full overview.
- Document Processing: PDF, DOCX, TXT, Markdown with advanced table extraction; supports files up to 15 MB
- RAG Pipeline: High-quality retrieval — 30-candidate hybrid search, 12-chunk reranking, 0.70 diversity filter
- Chat Interface: Real-time streaming responses with document context
- Enhanced Web Search: Optional live DuckDuckGo integration for up-to-date answers
- Persistent Memory: Conversation history stored in PostgreSQL
- Long-term Memory: Cross-session fact extraction; top-K memories injected into every prompt
- Workspaces: Isolated document + conversation namespaces per workspace
- GraphRAG: spaCy entity extraction + 1-hop graph expansion of BM25 query terms
- Document Connectors: Local folder, S3/MinIO/R2, SharePoint, OneDrive — daemon-synced
- Multi-model Agent Routing: Rule-based classifier routes queries to VISION/CODE/LARGE/FAST/BASE models
- Function Calling: Built-in tools (document search, calculator, datetime) + drop-in plugin system
- MCP Integration: Three MCP servers (local-docs, web-search, cloud-connectors) over JSON-RPC 2.0
- Vector Search: Lightning-fast similarity search using pgvector HNSW
- Table Extraction: Advanced PDF table detection and preservation
- Duplicate Prevention: Smart document fingerprinting
- Input Validation: Pydantic models with comprehensive sanitization
- Caching Layer: Redis/Memory cache for embeddings and queries
- Streaming Responses: Server-Sent Events for real-time feedback
- Security: Rate limiting, CORS support, JWT authentication, XSS-safe frontend
- GPU Acceleration: Automatic NVIDIA/AMD GPU detection; configurable multi-GPU layer offload via
OLLAMA_NUM_GPU - Observability: Prometheus metrics endpoint, request timing middleware, detailed health checks, admin dashboard
- Runtime RAG Tuning:
TOP_K_RESULTS,RERANK_TOP_K,DIVERSITY_THRESHOLD,SEMANTIC_WEIGHTadjustable live from Settings UI without restart
- Comprehensive Tests: Unit, integration, and comprehensive test suites
- Type Safety: Full type hints across codebase
- Modular Architecture: Clean separation of concerns
- CI/CD Ready: GitHub Actions configuration
- Error Handling: Professional exception system with context preservation
- XSS Prevention: DOM-based rendering;
escapeHtml()wraps all server-controlled values injected intoinnerHTML - Path Traversal Prevention:
sanitize_filename()+validate_path()belt-and-suspenders on every upload - AST-Safe Calculator:
eval()replaced with a recursive AST evaluator; only arithmetic is permitted - Rate Limiting: Configurable per-endpoint via slowapi
- CORS Support: Configurable allowed origins
- JWT Authentication: Token-based auth for admin endpoints
- Input Sanitization: Pydantic validation + server-side sanitization on all inputs
- Supply Chain: Pinned Docker image SHA256 digest;
litellm>=1.83.7,h11>=0.16.0 - Container Hardening: Non-root user (UID 1000),
allowPrivilegeEscalation: false,drop: ALLcapabilities in Helm charts - Secret Scanning: No credentials in source; placeholder examples only
- Hybrid Search: Combines semantic similarity with BM25 keyword matching
- Multi-level Caching:
- Embedding cache (5000 capacity)
- Query cache (1000 capacity)
- Configurable TTL
- Efficient Indexing: HNSW for fast approximate nearest neighbor search
- Smart Chunking: Context-aware with table preservation
- Reranking: Multi-signal fusion for improved relevance
- GPU Acceleration: Multi-GPU support via
OLLAMA_NUM_GPU; NVIDIA/AMD auto-detection - Request Timing:
X-Request-Durationheader + Prometheus histogram on every response - TTL-Cached Subprocess Calls:
nvidia-smi/rocm-smiresults cached 30 s; Ollama/api/pscached 5 s
- Quick Start
- Architecture
- Usage
- Project Structure
- Documentation
- Testing
- Configuration
- Monitoring & Observability
- CI/CD & Code Quality
- Development
- License
# 1. Clone repository
git clone https://github.com/jwvanderstam/LocalChat
cd LocalChat
# 2. Install dependencies
pip install -r requirements.txt
# 3. Configure environment
cp .env.example .env # edit with your DB / Ollama settings
# 4. Start backing services (PostgreSQL + Redis + Ollama)
docker compose up -d db redis ollama
# 5. Run application
python app.py
# 6. Open browser
# http://localhost:5000Once running, open your browser at http://localhost:5000.
- Chat tab — ask questions; toggle RAG Mode to ground answers in uploaded documents, Enhanced to additionally query the web via DuckDuckGo.
- Documents tab — upload PDF, DOCX, TXT, or Markdown files and test retrieval.
- Models tab — select the active Ollama model.
- API — all endpoints are documented in the interactive Swagger UI at
/api/docs/.
+---------------------------------------------------------------+
| LocalChat RAG System |
+---------------------------------------------------------------+
| |
| +------------+ +------------+ +------------+ |
| | Web UI |--->| FastAPI |--->| Services | |
| | (Browser) |--->| (Routes) |--->| Layer | |
| +------------+ +------------+ +------------+ |
| | | |
| | | |
| +----------------------------------------------------+ |
| | Application Core | |
| +----------------------------------------------------+ |
| | | |
| | +------------+ +------------+ +------------+ | |
| | | RAG Engine | | Cache | | Security | | |
| | | - Hybrid | | - Redis | | - Rate | | |
| | | Search | | - Memory | | Limit | | |
| | | - Rerank | | - TTL | | - CORS | | |
| | +------------+ +------------+ +------------+ | |
| | | |
| | +------------+ +------------+ +------------+ | |
| | | Document | | Ollama | | Monitoring | | |
| | | Processor | | Client | | - Metrics | | |
| | | - Extract | | - LLM | | - Health | | |
| | | - Chunk | | - Embed | | - Logs | | |
| | +------------+ +------------+ +------------+ | |
| | | |
| +----------------------------------------------------+ |
| | | |
| | | |
| +------------+ +------------+ +------------+ |
| | PostgreSQL | | Ollama | | Redis | |
| | + pgvector | | (LLM API) | | (Optional) | |
| | - Documents| | - Embeddings| | - Caching | |
| | - Chunks | | - Generation| | - Sessions | |
| | - Vectors | +------------+ +------------+ |
| +------------+ |
| |
+---------------------------------------------------------------+
Document Upload:
Upload -> Validate -> Extract Text -> Detect Tables ->
Smart Chunk -> Generate Embeddings -> Store in DB ->
Update Cache
RAG Query:
Query -> Cache Check -> Generate Query Embedding ->
Hybrid Search (Semantic + BM25) -> Retrieve Chunks ->
Rerank Results -> Format Context -> LLM Generation ->
Stream Response -> Cache Result
Cache Strategy:
- Embedding Cache: 7 days TTL, 5000 capacity
- Query Cache: 1 hour TTL, 1000 capacity
- LRU eviction for memory cache
- Redis fallback to memory cache
flowchart TD
Browser["Browser / API Client"]
subgraph FastAPI["FastAPI Application"]
Routes["Routes\n(APIRouters)"]
Auth["Security\n(JWT · Rate Limit · CORS)"]
Pydantic["Pydantic Validation\n+ Sanitization"]
RAG["RAG Pipeline\n(Retrieval · Reranking)"]
Tools["Tool Executor\n(Function Calling)"]
SSE["SSE Stream"]
end
subgraph Services["External Services"]
PG["PostgreSQL + pgvector\n(documents · chunks · vectors)"]
Ollama["Ollama\n(LLM · Embeddings)"]
Redis["Redis\n(Cache · Rate Limiting)"]
end
Browser -->|HTTP request| Routes
Routes --> Auth
Auth --> Pydantic
Pydantic --> RAG
RAG -->|vector search| PG
RAG -->|embed query| Ollama
Pydantic --> Tools
Tools -->|tool-call loop| Ollama
Tools --> RAG
Ollama -->|stream tokens| SSE
SSE -->|text/event-stream| Browser
Routes -.->|cache r/w| Redis
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | HTML, CSS, JavaScript | Web interface |
| Backend | FastAPI + Uvicorn | Web framework |
| Database | PostgreSQL 15+ | Document storage |
| Vector DB | pgvector | Similarity search |
| Cache | Redis / Memory | Performance optimization |
| LLM | Ollama | Local inference |
| Embeddings | nomic-embed-text | Vector generation |
| GPU | NVIDIA (nvidia-smi) / AMD (rocm-smi) | Hardware acceleration |
| Metrics | Prometheus text format v0.0.4 | Observability |
| Validation | Pydantic 2.12 | Input validation |
| Testing | pytest | Test framework |
All documentation lives in-code with comprehensive docstrings and type hints.
| Document | Purpose |
|---|---|
| docs/SCHEMA.md | Database schema, ER diagram, index rationale |
| docs/TROUBLESHOOTING.md | Common issues and fixes |
| docs/OPERATIONS.md | Backup, restore, and maintenance procedures |
| ROADMAP.md | Evolution roadmap and completion status |
app.py— Application entry point;create_uvicorn_app()for productionsrc/app_fastapi.py— FastAPI application factory with router registrationsrc/app_bootstrap.py— All startup I/O (DB, Ollama, connectors, plugins, reranker)src/monitoring.py— Prometheus metrics, request timing, health checks (/api/metrics,/api/health)src/ollama_client.py— Ollama LLM/embedding client with GPU detection and TTL cachingsrc/routes_fastapi/settings_routes.py— Settings UI + admin ops dashboard (/api/settings,/api/admin/stats)src/rag/web_search.py— DuckDuckGo web search provider (Enhanced mode)src/security_fastapi.py— Rate limiting, CORS, JWT authentication (FastAPI)src/config.py— All configuration (env vars, RAG tuning, GPU settings, cache settings).env.example— Environment variable template
- Interactive Swagger UI available at
/api/docs/when the app is running
LocalChat/
├── app.py # Entry point; create_uvicorn_app() for prod
├── requirements.txt
├── .env.example # Environment variable template
├── src/
│ ├── app_fastapi.py # FastAPI application factory
│ ├── app_bootstrap.py # All startup I/O (DB, Ollama, connectors)
│ ├── config.py # All configuration, loads .env
│ ├── models.py # Pydantic request/response models
│ ├── security_fastapi.py # JWT, rate limiting, CORS (FastAPI)
│ ├── monitoring.py # Prometheus metrics, health checks
│ ├── ollama_client.py # Ollama LLM/embedding client
│ ├── llm_client.py # LiteLLM cloud-fallback adapter
│ ├── mcp_client.py # MCP HTTP client + circuit breaker
│ ├── gpu_monitor.py # NVIDIA/AMD GPU detection, TTL-cached
│ ├── exceptions.py # Custom exception hierarchy
│ ├── agent/ # Multi-model agent dispatch
│ │ ├── aggregator.py # Parallel tool dispatch + retry
│ │ ├── models.py # ModelRegistry (env-driven model mapping)
│ │ ├── result.py # AgentResult, ToolCall dataclasses
│ │ ├── router.py # Rule-based model classifier (<1 ms)
│ │ └── tool_router.py # MCP + direct handler mapping
│ ├── cache/
│ │ ├── managers.py # Embedding + query cache managers
│ │ └── backends/
│ │ ├── base.py # CacheBackend ABC
│ │ ├── memory.py # In-memory LRU (default)
│ │ ├── redis_cache.py # Redis-backed distributed cache
│ │ └── database_cache.py # PostgreSQL-backed cache
│ ├── connectors/ # Document source connectors
│ │ ├── base.py # BaseConnector ABC, DocumentSource
│ │ ├── local_folder.py # Folder watcher (stat-based poll)
│ │ ├── s3_connector.py # S3/MinIO/R2 via boto3
│ │ ├── webhook.py # HTTP push connector
│ │ ├── sharepoint_connector.py # SharePoint Graph API delta
│ │ ├── onedrive_connector.py # OneDrive Graph API delta
│ │ ├── google_drive_connector.py # Google Drive API v3 changes feed
│ │ ├── confluence_connector.py # Confluence Cloud CQL polling
│ │ ├── microsoft_auth.py # Microsoft OAuth2 token refresh
│ │ ├── google_auth.py # Google OAuth2 token refresh
│ │ ├── registry.py # ConnectorRegistry singleton
│ │ └── worker.py # SyncWorker daemon thread
│ ├── db/ # PostgreSQL + pgvector layer
│ │ ├── connection.py # Connection pool, schema init, migrations
│ │ ├── conversations.py # Conversation + message CRUD
│ │ ├── documents.py # Document/chunk CRUD + vector search
│ │ ├── entities.py # GraphRAG entity/relation CRUD
│ │ ├── feedback.py # Answer feedback + chunk stats
│ │ ├── memories.py # Long-term memory CRUD + vector search
│ │ ├── annotations.py # Annotation CRUD
│ │ ├── oauth_tokens.py # Fernet-encrypted OAuth token storage
│ │ ├── tokens.py # JWT revocation deny-list
│ │ ├── users.py # User CRUD + PBKDF2 password hashing
│ │ ├── workspaces.py # Workspace CRUD
│ │ └── connectors.py # Connector config + sync log CRUD
│ ├── graph/
│ │ ├── store.py # GraphStore ABC + Postgres/Kuzu backends
│ │ ├── extractor.py # spaCy entity extraction
│ │ └── expander.py # 1-hop term expansion for BM25
│ ├── memory/
│ │ ├── extractor.py # Extract memorable facts from turns
│ │ └── retriever.py # Vector-search memories into LLM prompt
│ ├── performance/
│ │ └── batch_processor.py # Parallel batch embedding processor
│ ├── rag/
│ │ ├── processor.py # Ingest orchestrator
│ │ ├── retrieval.py # Hybrid search (semantic + BM25)
│ │ ├── chunking.py # Intelligent overlapping chunking
│ │ ├── loaders.py # PDF/DOCX/TXT/MD/Excel loaders
│ │ ├── active_learning.py # Knowledge-gap document suggestions
│ │ ├── scoring.py # BM25 implementation
│ │ ├── reranker.py # Cross-encoder reranking
│ │ ├── planner.py # QueryPlanner — intent decomposition
│ │ ├── doc_type.py # DocType enum, ChunkerRegistry
│ │ ├── feedback_pipeline.py # Fine-tune pipeline on feedback data
│ │ ├── cache.py # Embedding/query cache wrapper
│ │ └── web_search.py # DuckDuckGo web search provider
│ ├── routes_fastapi/ # FastAPI route handlers (production)
│ │ ├── api_routes.py # Chat SSE (/api/chat)
│ │ ├── document_routes.py # Document upload/delete/list
│ │ ├── memory_routes.py # Conversation CRUD + export
│ │ ├── model_routes.py # Ollama model management
│ │ ├── settings_routes.py # Settings UI + admin ops dashboard
│ │ ├── auth_routes.py # User management + password change
│ │ ├── connector_routes.py # Connector REST API + webhook receiver
│ │ ├── feedback_routes.py # Answer feedback submission + stats
│ │ ├── longterm_memory_routes.py # Long-term memory CRUD + trigger
│ │ ├── oauth_routes.py # Microsoft + Google OAuth2 flows
│ │ ├── workspace_routes.py # Workspace management
│ │ ├── annotation_routes.py # Annotation CRUD
│ │ ├── web_routes.py # Frontend SPA + static assets
│ │ └── _request_state.py # Per-request state helpers
│ ├── tools/
│ │ ├── registry.py # Tool registration + JSON schemas
│ │ ├── executor.py # Tool-call loop (multi-turn)
│ │ ├── builtin.py # Built-in tools (search, calc, datetime)
│ │ └── plugin_loader.py # Plugin discovery + dynamic loading
│ └── utils/
│ ├── logging_config.py # JSON structured logging
│ ├── sanitization.py # Input sanitization
│ ├── encryption.py # Fernet encrypt/decrypt for text columns
│ ├── export.py # Conversation export (DOCX + PDF)
│ ├── file_validation.py # Magic-byte + ZIP content validation for uploads
│ ├── workspace.py # get_workspace_id() — X-Workspace-ID header helper
│ └── request_id.py # X-Request-ID middleware
├── mcp_servers/
│ ├── base.py # JSON-RPC 2.0 dispatcher base
│ ├── local_docs/server.py # Local docs MCP (port 5001)
│ ├── web_search/server.py # Web search MCP (port 5002)
│ └── cloud_connectors/server.py # Cloud connectors MCP (port 5003)
├── plugins/ # Drop-in tool plugins (auto-loaded at startup)
│ └── example_plugin.py
├── tests/
│ ├── conftest.py # Shared fixtures
│ ├── unit/ # 70+ modules, ~2,000 tests
│ └── integration/ # 12 modules, requires running services
└── helm/localchat/ # Helm chart (app + PostgreSQL + Redis + MCP)
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific category
pytest tests/unit/
pytest tests/integration/
# Run specific test file
pytest tests/unit/test_rag.py
# Run with verbose output
pytest -v
# Run tests in parallel (if pytest-xdist installed)
pytest -n auto# Generate coverage report
pytest --cov=src --cov-report=html
# View report
open htmlcov/index.html
# Or view in terminal
pytest --cov=src --cov-report=term- Unit Tests:
tests/unit/— 70+ modules covering all core components (~2,000 tests) - Integration Tests:
tests/integration/— 12 modules; require a live PostgreSQL + Ollama instance - Quality Gate: SonarCloud enforces ≥ 80% coverage on new code, 0 unreviewed hotspots
Create a .env file in the root directory (copy from .env.example):
# Database Configuration
export PG_HOST=localhost
export PG_PORT=5432
export PG_USER=postgres
export PG_PASSWORD=your_password
export PG_DB=rag_db
# Ollama Configuration
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_DEFAULT_MODEL=llama3.2
export OLLAMA_EMBEDDING_MODEL=nomic-embed-text:latest
# GPU layer offload: -1 = all layers on GPU (default), 0 = CPU only
export OLLAMA_NUM_GPU=-1
# Redis Configuration (Optional)
export REDIS_ENABLED=False # Set to True to enable Redis
export REDIS_HOST=localhost
export REDIS_PORT=6379
export REDIS_DB=0
export REDIS_PASSWORD= # Leave empty if no password
# Application Configuration
export SECRET_KEY=your_secret_key_here
export JWT_SECRET_KEY=your_jwt_secret_here
export ADMIN_PASSWORD=your_admin_password_here # Required for /api/auth/login
export APP_ENV=production
export DEBUG=False
# Security Configuration
export RATELIMIT_ENABLED=True
export RATELIMIT_CHAT=10 per minute
export RATELIMIT_UPLOAD=5 per hour
export CORS_ENABLED=False
export CORS_ORIGINS=http://localhost:3000
# Observability
# Leave METRICS_TOKEN empty to allow unauthenticated Prometheus scraping
# (acceptable on a private network). Set a strong token in production.
export METRICS_TOKEN=LocalChat supports two caching backends:
- Pros: No external dependencies, fast, simple setup
- Cons: Lost on restart, limited capacity, single-process only
- Best for: Development, testing, light loads
# Enable memory cache (default)
export REDIS_ENABLED=False- Pros: Persistent, distributed, large capacity
- Cons: Requires Redis server
- Best for: Production, high load, multi-process deployments
# Enable Redis cache
export REDIS_ENABLED=True
export REDIS_HOST=localhost
export REDIS_PORT=6379
export REDIS_PASSWORD=your_password # Optional
# Start Redis
redis-server
# Or with Docker
docker run -d -p 6379:6379 redis:alpineCore RAG parameters can be tuned at runtime in the Settings → RAG Parameters tab, or set via environment variables. Changes from the UI take effect immediately for all subsequent queries — no restart required.
| Parameter | Default | Env var | Range | Description |
|---|---|---|---|---|
TOP_K_RESULTS |
30 | TOP_K_RESULTS |
10–50 | Initial retrieval candidate pool |
RERANK_TOP_K |
12 | RERANK_TOP_K |
4–20 | Chunks passed to LLM after reranking |
DIVERSITY_THRESHOLD |
0.70 | (UI only) | 0.50–0.90 | Jaccard threshold for near-duplicate filtering |
SEMANTIC_WEIGHT |
0.70 | SEMANTIC_WEIGHT |
0.30–0.90 | Semantic vs. BM25 blend in hybrid search |
RERANKER_ENABLED |
true | RERANKER_ENABLED |
true/false | Neural cross-encoder re-ranking (see below) |
Parameters that require re-ingesting documents (chunk size, overlap) are set via environment variables only:
# Chunking — changing these requires re-uploading all documents
CHUNK_SIZE=1200 # Characters per chunk
CHUNK_OVERLAP=150 # Overlap between chunks (12.5%)
# Retrieval
TOP_K_RESULTS=30 # Initial candidates
RERANK_TOP_K=12 # Chunks sent to LLM
# Context window
OLLAMA_NUM_CTX=8192 # Token context window sent to Ollama
# MAX_CONTEXT_LENGTH defaults to OLLAMA_NUM_CTX × 3 chars
# Ingestion timeouts (supports files up to 15 MB)
OLLAMA_EMBED_TIMEOUT=600 # Seconds — worst-case 15 MB TXT ~280 s
UVICORN_TIMEOUT=600 # Must be >= OLLAMA_EMBED_TIMEOUT
# Cross-encoder reranker (enabled by default)
# RERANKER_ENABLED=false # Disable on very slow / embedded hardwareReranker: LocalChat ships with the neural cross-encoder reranker enabled by default (
RERANKER_ENABLED=true). It usescross-encoder/ms-marco-MiniLM-L-6-v2(downloaded automatically, ~80 MB) to re-score each retrieved chunk against the query, substantially improving answer precision. The overhead is negligible on modern hardware — typically < 200 ms per query. SetRERANKER_ENABLED=falseonly if you are running on very constrained CPU hardware where the extra inference latency is unacceptable.
LocalChat supports documents up to 15 MB on CPU-only hardware:
| Format | Chunks @ 15 MB | DB size | Ingest time |
|---|---|---|---|
| TXT | ~14,000 | ~160 MB | ~280 s |
| DOCX | ~8,000 | ~95 MB | ~160 s |
| ~3,500 | ~40 MB | ~70 s |
Each chunk stores a 768-dim float32 embedding vector (~3 KB). The HNSW index scales to millions of chunks with sub-second query latency.
# Connection Pool
DB_POOL_MIN_CONN = 2
DB_POOL_MAX_CONN = 10
# HNSW Index Parameters
# ef_search is computed dynamically as max(TOP_K_RESULTS * 2, 40)
DB_INDEX_TYPE = 'hnsw' # Use HNSW for fast ANN search# Parallel Processing
MAX_WORKERS = 8 # Concurrent threads
BATCH_SIZE = 512 # Embeddings batch size (512 chunks per call)
# Table Extraction
KEEP_TABLES_INTACT = True # Don't split tables across chunks
MIN_TABLE_ROWS = 3 # Minimum rows to detect as tableSee src/config.py for all configuration options.
The application exposes a Prometheus-compatible scrape endpoint:
GET /api/metrics — Prometheus text format v0.0.4
GET /api/metrics.json — JSON metrics snapshot (used by admin dashboard)
GET /api/health — Detailed component health check
Sample output from /api/metrics:
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="health_check",status="200"} 42
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_count 42
http_request_duration_seconds_sum 1.234
http_request_duration_seconds_bucket{le="0.1"} 38
http_request_duration_seconds_bucket{le="+Inf"} 42
# TYPE app_uptime_seconds gauge
app_uptime_seconds 3600.5
Every response also carries an X-Request-Duration header (e.g. 0.042s).
Set METRICS_TOKEN in .env to require a Bearer token:
export METRICS_TOKEN=your_strong_token_herePrometheus scrape config:
scrape_configs:
- job_name: localchat
static_configs:
- targets: ['localhost:5000']
bearer_token: your_strong_token_hereLeave METRICS_TOKEN empty for unauthenticated access (safe on a private network).
GET /api/health
Returns 200 healthy, 200 degraded (Ollama down), or 503 unhealthy (database down):
{
"status": "healthy",
"timestamp": "2026-03-19T10:00:00.000000",
"checks": {
"database": { "status": "up", "healthy": true },
"ollama": { "status": "up", "healthy": true },
"cache": { "status": "up", "healthy": true, "stats": { "hits": 120, "misses": 5 } }
}
}Navigate to /admin (JWT required in production; open in demo mode).
The dashboard surfaces:
- GPU Hardware — per-physical-GPU cards: VRAM usage bar, utilisation %, temperature (refreshed every 30 s)
- Loaded Models — per-model VRAM breakdown with GPU offload % (refreshed every 5 s)
- Cache Stats — embedding cache and query cache hit rates
- System Info — app version, active model, uptime, request count
LocalChat automatically detects available GPUs:
| Vendor | Tool | Detection |
|---|---|---|
| NVIDIA | nvidia-smi |
Auto-detected if on PATH |
| AMD | rocm-smi |
Auto-detected if on PATH |
Control GPU layer offload in .env:
# -1 = all transformer layers on GPU (recommended when VRAM is sufficient)
# 0 = CPU-only inference
# N = offload N layers to GPU
export OLLAMA_NUM_GPU=-1The value is forwarded in options.num_gpu on every /api/chat and /api/embed request,
so Ollama distributes work across all detected GPUs automatically when multiple GPUs are present.
# Install dependencies
pip install -r requirements.txt
# Lint (must be clean before every commit)
ruff check src/ tests/
# Run fast tests with coverage
pytest -m "not (slow or ollama or db)"
# Run all unit tests with coverage report
pytest tests/unit/ --cov=src --cov-report=term-missing- Test Coverage: ≥ 80% on new code (SonarCloud gate)
- Linting:
ruff checkmust pass (CI blocks on failure) - Static Analysis: SonarCloud Quality Gate must pass before merge
- Documentation: Update
README.mdandCLAUDE.mdKey Files table in the same PR
Two GitHub Actions workflows run on every push and pull request to main, plus automated dependency updates via Dependabot:
| Workflow / Config | File | Purpose |
|---|---|---|
| Tests | .github/workflows/tests.yml |
Runs all unit tests on Python 3.11 |
| SonarCloud | .github/workflows/sonarcloud.yml |
Runs unit tests with coverage, then uploads results to SonarCloud |
| CodeQL | .github/workflows/codeql.yml |
Python security-extended static analysis on push/PR to main + weekly Monday scan |
| Docker publish | .github/workflows/docker-publish.yml |
Builds and pushes image to ghcr.io/jwvanderstam/localchat on merge to main and version tags |
| Dependabot | .github/dependabot.yml |
Weekly PRs for pip and GitHub Actions version bumps; auto-assigned to jwvanderstam with labels dependencies / ci |
Static analysis and coverage tracking are handled by SonarCloud.
- Project key:
jwvanderstam_LocalChat - Organisation:
jwvanderstam - Configuration:
sonar-project.properties - Coverage source:
coverage.xmlproduced bypytest --cov=src --cov-report=xml
Vendored third-party assets (static/css/bootstrap*.css, static/js/bootstrap*.js, static/css/fonts/) are excluded from analysis so they don't skew metrics.
To run the same coverage report locally that the SonarCloud workflow uses:
pytest tests/unit/ -v --tb=short --cov=src --cov-report=xml --cov-report=term-missingThe coverage.xml file is produced in the project root and is picked up automatically by the sonarcloud-github-action.
-
Create feature branch
git checkout -b feature/your-feature
-
Write code and tests
# Add tests first (TDD) pytest tests/unit/test_your_feature.py -
Check code quality
ruff check src/ tests/ pytest -m "not (slow or ollama or db)" -
Commit and push
git add . git commit -m "feat: your feature" git push origin feature/your-feature
-
Create pull request
See GitHub Releases for version history.
Issue: RAG not retrieving documents
# Check if documents are uploaded
curl http://localhost:5000/api/documents/stats
# Test retrieval
curl -X POST http://localhost:5000/api/documents/test \
-H "Content-Type: application/json" \
-d '{"query": "test"}'Issue: Ollama connection failed
# Check Ollama is running
curl http://localhost:11434/api/tags
# Restart Ollama
ollama serveIssue: Database connection error
# Check PostgreSQL is running
pg_isready
# Check pgvector extension
psql rag_db -c "SELECT * FROM pg_extension WHERE extname='vector';"See src/config.py for database and connection pool settings.
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama for local LLM inference
- pgvector for vector similarity search
- FastAPI for web framework
- Pydantic for data validation
- pytest for testing framework
- Source Code:
src/ - Configuration:
src/config.py - Issues: GitHub Issues
- Docker deployment & Kubernetes configs
- Monitoring dashboard
- Advanced RAG techniques (query expansion, multi-hop)
- Multi-language support
- Plugin system
- Admin dashboard
If you find this project useful, please consider giving it a star!
Made with care by the LocalChat Team
Professional RAG application for document-based question answering