kayceejenz/ai-docs-processor

AI Document Processor

Upload a business document, such as an invoice, receipt, contract, or purchase order, and the system handles the rest. It classifies the document, extracts the key fields, validates the data, and routes it to one of three outcomes: auto-approved, flagged for human review, or escalated. Every decision is logged. Every human correction feeds back into an accuracy layer so you can measure and improve extraction quality over time.

Stack

| Layer | Technology |
| --- | --- |
| Web app | Next.js 15, App Router |
| Auth | NextAuth.js v4, Google OAuth, JWT |
| Database | PostgreSQL, pgvector |
| Background worker | Python 3.12, psycopg3 |
| AI providers | Anthropic Claude, Ollama, HuggingFace |
| Embeddings | Ollama nomic-embed-text via pgvector |
| Schema validation | Pydantic v2 (worker), Zod (web) |

Getting Started

# 1. Install dependencies
git clone <repo> && cd ai-docs-processor
npm install
(cd worker && pip install -e ".[dev]")  # subshell, so later steps still run from the repo root

# 2. Configure environment
cp apps/web/.env.example apps/web/.env.local
cp worker/.env.example worker/.env 

# 3. Start Postgres
docker compose up -d postgres

# 4. Run the web app
npm run dev --workspace=apps/web

# 5. Run the worker (in a second terminal, from the repo root)
cd worker && python -m ai_docs_worker

# Optional: enable local embeddings and a local LLM through Ollama
ollama pull nomic-embed-text
ollama pull llama3.2

System Architecture

The browser talks to a Next.js application that handles auth, serves pages, and exposes API routes. Every upload is written to PostgreSQL and queued for processing. A separate Python worker polls the queue, claims jobs atomically with SELECT ... FOR UPDATE SKIP LOCKED, and runs each document through the AI pipeline. Results, audit events, and embeddings all go back into Postgres. The AI provider (Claude, Ollama, or HuggingFace) is swappable via a single environment variable.
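
As a rough illustration of the atomic claim step, the sketch below shows one way a worker could claim and process a job inside a single transaction. It assumes a psycopg 3 connection; the table, column, and status names, and the `process_next_job` helper itself, are hypothetical and may not match the project's schema.

```python
from typing import Any, Callable, Optional

# Hypothetical table, column, and status names; the real schema may differ.
CLAIM_SQL = """
    SELECT id
    FROM documents
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
"""
DONE_SQL = "UPDATE documents SET status = 'PROCESSED' WHERE id = %s"


def process_next_job(conn: Any, handle: Callable[[Any], None]) -> Optional[Any]:
    """Claim one queued document, process it, and mark it done.

    `conn` is assumed to be a psycopg 3 connection. The row lock taken by
    FOR UPDATE is held until the transaction block exits, so other workers
    running the same query skip this row instead of waiting on it.
    """
    with conn.transaction():
        row = conn.execute(CLAIM_SQL).fetchone()
        if row is None:
            return None  # queue is empty, or every queued row is locked
        document_id = row[0]
        handle(document_id)  # if this raises, the transaction rolls back
        conn.execute(DONE_SQL, (document_id,))
    return document_id
```

Because the claim and the status update share one transaction, a worker that crashes mid-job releases its lock and the row becomes claimable again without any extra cleanup.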

Document Processing Pipeline

Each document moves through five stages in the Python worker:

| Stage | What happens |
| --- | --- |
| Text extraction | Detects file type and extracts text via PyMuPDF (PDF), python-docx (DOCX), or plain read |
| Classification | LLM call returns document type (INVOICE, CONTRACT, RECEIPT…) + confidence score |
| Field extraction | LLM call extracts structured fields; dynamic fields handled via admin-configurable schema |
| Validation | Rule-based checks: required fields present, amounts numeric, dates logical, currency ISO 4217 |
| Decision | Deterministic routing: INVALID_VALUE → escalate, MISSING_FIELD → review, confidence thresholds → approve or review |
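
The validation stage could look roughly like the sketch below. It uses only the standard library rather than the project's Pydantic models, and the field names (`total_amount`, `currency`, `issue_date`, `due_date`), the currency subset, and the "due date must not precede issue date" rule are all assumptions made for illustration.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Small illustrative subset of ISO 4217 codes; a real check would use the full list.
KNOWN_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "NGN"}


def validate_fields(fields: dict, required: list[str]) -> list[tuple[str, str]]:
    """Return (issue_code, field_name) pairs; an empty list means the data passed."""
    issues: list[tuple[str, str]] = []

    # Required fields present
    for name in required:
        if fields.get(name) in (None, ""):
            issues.append(("MISSING_FIELD", name))

    # Amounts numeric (hypothetical field name)
    amount = fields.get("total_amount")
    if amount not in (None, ""):
        try:
            Decimal(str(amount))
        except InvalidOperation:
            issues.append(("INVALID_VALUE", "total_amount"))

    # Currency is a known ISO 4217 code
    currency = fields.get("currency")
    if currency not in (None, "") and currency not in KNOWN_CURRENCIES:
        issues.append(("INVALID_VALUE", "currency"))

    # Dates logical: assumed rule that a due date should not precede the issue date
    issued, due = fields.get("issue_date"), fields.get("due_date")
    if issued and due:
        try:
            if date.fromisoformat(due) < date.fromisoformat(issued):
                issues.append(("INVALID_VALUE", "due_date"))
        except ValueError:
            issues.append(("INVALID_VALUE", "due_date"))

    return issues
```

Returning typed issue codes instead of raising on the first failure gives the decision stage the complete list to route on.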

The decision engine applies rules in strict priority order so data integrity problems are never auto-approved regardless of model confidence.
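
A minimal sketch of that priority order, assuming a single approval threshold. The `Outcome` names, the `decide` function, and the 0.85 threshold are hypothetical; only the rule order comes from the pipeline description.

```python
from enum import Enum


class Outcome(str, Enum):
    AUTO_APPROVED = "auto_approved"
    NEEDS_REVIEW = "needs_review"
    ESCALATED = "escalated"


APPROVE_THRESHOLD = 0.85  # hypothetical value; the real threshold is configurable


def decide(confidence: float, issue_codes: list[str],
           approve_threshold: float = APPROVE_THRESHOLD) -> Outcome:
    """Apply routing rules in strict priority order."""
    # 1. Invalid data always escalates, whatever the model confidence.
    if "INVALID_VALUE" in issue_codes:
        return Outcome.ESCALATED
    # 2. Missing data always goes to a human reviewer.
    if "MISSING_FIELD" in issue_codes:
        return Outcome.NEEDS_REVIEW
    # 3. Only clean data is eligible for auto-approval.
    if confidence >= approve_threshold:
        return Outcome.AUTO_APPROVED
    return Outcome.NEEDS_REVIEW
```

For example, `decide(0.99, ["INVALID_VALUE"])` returns `Outcome.ESCALATED` even though the confidence is well above the threshold.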

Database Schema

Thirteen tables organised into four domains. documents is the central entity — all processing tables cascade from it. audit_events is append-only so the full history of any document is always reconstructable. document_chunks stores vector embeddings for RAG search alongside a GIN index for keyword fallback. field_evaluation_results captures predicted vs corrected values per field so extraction accuracy is measurable.

API Reference

All routes require an authenticated session. Admin routes require role === "admin".

| Method | Route | Access | Description |
| --- | --- | --- | --- |
| GET | /api/documents | Any | List documents (admin sees all, beta sees own) |
| POST | /api/documents | Any | Upload a document |
| DELETE | /api/documents?id | Any | Delete a document |
| GET | /api/documents/:id | Any | Document detail and audit trail |
| POST | /api/documents/:id/actions | Any | Submit review action |
| POST | /api/documents/:id/rating | Any | Rate extraction quality (1–5) |
| POST | /api/documents/:id/ingest | Any | Manually trigger RAG ingestion |
| POST | /api/documents/reindex | Admin | Bulk re-index into vector store |
| GET | /api/admin/users | Admin | List users with usage |
| POST | /api/admin/users | Admin | Add beta user |
| PATCH | /api/admin/users | Admin | Toggle active status or upload limit |
| DELETE | /api/admin/users?id | Admin | Remove beta user |
| POST | /api/chat | Any | Streaming RAG question answering |

Key Design Decisions

PostgreSQL as the job queue. All document state lives in one database. Using FOR UPDATE SKIP LOCKED keeps the queue inside the same ACID transaction as the work, avoiding coordination across two systems. The tradeoff is that this design does not scale to thousands of concurrent workers, which is acceptable for now, and there is a clear migration path.

Rules-based routing, not a second model. A model cannot reliably approve or reject its own output. Deterministic rules on typed outputs (a confidence float and a typed list of issues) make routing auditable and configurable without retraining anything.

JSONB for extracted fields. Document types and their required fields are admin-configurable. JSONB lets the field schema be data-driven without a migration every time a new document type is added.

Three-tier RAG retrieval. Vector search → keyword search (GIN index) → recency. The chat widget stays functional even when pgvector or Ollama is unavailable. Quality degrades; the feature does not break.
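
The fallback chain can be sketched as a loop over retrievers, tried in order. The project's RAG code lives in the web app, so this Python version only illustrates the control flow; `retrieve_with_fallback` and the retriever signature are assumptions.

```python
from typing import Callable, Sequence

# A retriever takes (query, k) and returns up to k text chunks.
Retriever = Callable[[str, int], Sequence[str]]


def retrieve_with_fallback(query: str,
                           tiers: Sequence[tuple[str, Retriever]],
                           k: int = 5) -> tuple[str, list[str]]:
    """Try each tier in order; return the first non-empty result and its tier name."""
    for name, retriever in tiers:
        try:
            chunks = list(retriever(query, k))
        except Exception:
            # Tier unavailable (for example, the embedding service is down):
            # fall through to the next, lower-quality tier.
            continue
        if chunks:
            return name, chunks
    return "none", []
```

A caller would pass the tiers in quality order, for example `[("vector", vector_search), ("keyword", keyword_search), ("recency", recent_chunks)]`, where the three functions are placeholders for the real search implementations.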

Append-only audit log. audit_events is insert-only. Every state change creates a new row. Reconstructing the full history of any document is a simple query.
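
To make the "simple query" concrete, here is a small runnable sketch. It uses an in-memory SQLite database in place of PostgreSQL so it runs anywhere, and the column names and event types are hypothetical.

```python
import sqlite3

# Hypothetical columns; SQLite stands in for PostgreSQL in this sketch.
SCHEMA = """
CREATE TABLE audit_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    detail TEXT NOT NULL DEFAULT ''
)
"""


def record_event(conn, document_id, event_type, detail=""):
    """Append one event; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO audit_events (document_id, event_type, detail) VALUES (?, ?, ?)",
        (document_id, event_type, detail),
    )


def document_history(conn, document_id):
    """Return every event for one document in insertion order."""
    cursor = conn.execute(
        "SELECT event_type, detail FROM audit_events WHERE document_id = ? ORDER BY id",
        (document_id,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_event(conn, "doc-1", "UPLOADED")
record_event(conn, "doc-1", "CLASSIFIED", "INVOICE")
record_event(conn, "doc-2", "UPLOADED")
record_event(conn, "doc-1", "FLAGGED_FOR_REVIEW", "MISSING_FIELD")
print(document_history(conn, "doc-1"))
```

Since state changes only ever add rows, ordering by the insert sequence is enough to replay a document's history.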

Project Structure

ai-docs-processor/
├── apps/web/src/
│   ├── app/
│   │   ├── page.tsx              # Document queue
│   │   ├── admin/                # User management
│   │   ├── evaluation/           # Accuracy metrics
│   │   ├── documents/[id]/       # Document detail and review
│   │   └── api/                  # All API routes
│   ├── lib/
│   │   ├── db.ts                 # Connection pool
│   │   ├── auth.ts               # NextAuth config
│   │   ├── documents/repository.ts
│   │   ├── users/repository.ts
│   │   └── rag/                  # chunker, embedder, store, ingest
│   └── components/
│       ├── layout/               # Shell, nav, profile
│       ├── documents/            # Table, upload modal, detail
│       └── chat/                 # Streaming chat widget
└── worker/src/ai_docs_worker/
    ├── ai/                       # BaseAIProvider, factory, providers
    └── pipeline/                 # runner, decision, validation, models, ocr
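
The ai/ package in the tree above lists a BaseAIProvider and a factory. One plausible shape for that pairing is sketched below. The `AI_PROVIDER` variable name, the `complete` method, the concrete class names, and the `claude` default are all assumptions, and the provider bodies are stubs.

```python
import os
from abc import ABC, abstractmethod


class BaseAIProvider(ABC):
    """Common interface for all providers; `complete` is a hypothetical method name."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class ClaudeProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("stub: call the Anthropic API here")


class OllamaProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("stub: call the local Ollama server here")


class HuggingFaceProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("stub: call the HuggingFace API here")


PROVIDERS = {
    "claude": ClaudeProvider,
    "ollama": OllamaProvider,
    "huggingface": HuggingFaceProvider,
}


def create_provider(env=None) -> BaseAIProvider:
    """Pick a provider from one environment variable (assumed name: AI_PROVIDER)."""
    env = os.environ if env is None else env
    name = env.get("AI_PROVIDER", "claude").lower()  # default is an assumption
    if name not in PROVIDERS:
        raise ValueError(f"Unknown AI_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}")
    return PROVIDERS[name]()
```

Keeping the pipeline dependent only on the base class means switching providers touches configuration, not pipeline code.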

Evaluation Layer

Every human correction is recorded in field_evaluation_results alongside the model's original prediction. The evaluation dashboard surfaces overall accuracy, per-field accuracy breakdown, most frequently corrected fields, and 1–5 quality ratings submitted by reviewers. This makes prompt iteration data-driven: inspect what's failing, adjust the prompt, measure improvement on the next batch.
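
The per-field metrics can be computed from (field, predicted, corrected) triples. This is a minimal sketch, assuming a row whose corrected value equals the prediction counts as correct; the function names and the row shape are illustrative rather than the project's actual query.

```python
from collections import Counter


def per_field_accuracy(rows):
    """Map each field name to the share of rows where the prediction was kept."""
    totals, correct = Counter(), Counter()
    for field, predicted, corrected in rows:
        totals[field] += 1
        if predicted == corrected:
            correct[field] += 1
    return {field: correct[field] / totals[field] for field in totals}


def most_corrected(rows, n=3):
    """Return the n fields that reviewers changed most often, with counts."""
    corrections = Counter(
        field for field, predicted, corrected in rows if predicted != corrected
    )
    return corrections.most_common(n)


rows = [
    ("total_amount", "100.00", "100.00"),
    ("total_amount", "100.00", "110.00"),
    ("vendor", "Acme", "Acme"),
    ("due_date", "2024-01-01", "2024-02-01"),
    ("due_date", "2024-03-01", "2024-03-02"),
]
print(per_field_accuracy(rows))
print(most_corrected(rows, n=1))
```

Fields that rank high in `most_corrected` are the natural first targets when adjusting the extraction prompt.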