Upload a business document such as an invoice, receipt, contract, or purchase order, and the system handles the rest. It classifies the document, extracts the key fields, validates the data, and routes it to one of three outcomes: auto-approved, flagged for human review, or escalated. Every decision is logged. Every human correction feeds back into an accuracy layer so you can measure and improve extraction quality over time.
| Layer | Technology |
|---|---|
| Web app | Next.js 15, App Router |
| Auth | NextAuth.js v4, Google OAuth, JWT |
| Database | PostgreSQL, pgvector |
| Background worker | Python 3.12, psycopg3 |
| AI providers | Anthropic Claude, Ollama, HuggingFace |
| Embeddings | Ollama nomic-embed-text via pgvector |
| Schema validation | Pydantic v2 (worker), Zod (web) |
```bash
# 1. Install dependencies
git clone <repo> && cd ai-docs-processor
npm install
# Subshell, so the following steps still run from the repo root
(cd worker && pip install -e ".[dev]")

# 2. Configure environment
cp apps/web/.env.example apps/web/.env.local
cp worker/.env.example worker/.env

# 3. Start Postgres
docker compose up -d postgres

# 4. Run the web app
npm run dev --workspace=apps/web

# 5. Run the worker (in a second terminal, from the repo root)
cd worker && python -m ai_docs_worker

# Optional: enable local embeddings
ollama pull nomic-embed-text
ollama pull llama3.2
```

The browser talks to a Next.js application that handles auth, serves pages, and exposes API routes. Every upload is written to PostgreSQL and queued for processing. A separate Python worker polls the queue, claims jobs atomically with SELECT ... FOR UPDATE SKIP LOCKED, and runs each document through the AI pipeline. Results, audit events, and embeddings all go back into Postgres. The AI provider (Claude, Ollama, or HuggingFace) is swappable via a single environment variable.
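The provider swap could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual factory: only the name BaseAIProvider comes from the project tree, while the variable name `AI_PROVIDER`, the stub classes, and the `classify` signature are hypothetical.

```python
import os
from abc import ABC, abstractmethod


class BaseAIProvider(ABC):
    """Common interface for providers (the method name is an assumption)."""

    @abstractmethod
    def classify(self, text: str) -> tuple[str, float]:
        """Return (document_type, confidence)."""


class StubClaudeProvider(BaseAIProvider):
    # Placeholder: a real provider would call the Anthropic API here.
    def classify(self, text: str) -> tuple[str, float]:
        return ("INVOICE", 0.9)


class StubOllamaProvider(BaseAIProvider):
    # Placeholder: a real provider would call a local Ollama server here.
    def classify(self, text: str) -> tuple[str, float]:
        return ("INVOICE", 0.8)


# Hypothetical registry keyed by the value of the environment variable.
PROVIDERS: dict[str, type[BaseAIProvider]] = {
    "claude": StubClaudeProvider,
    "ollama": StubOllamaProvider,
}


def make_provider(env=None) -> BaseAIProvider:
    """Pick a provider from AI_PROVIDER (assumed name), defaulting to claude."""
    env = os.environ if env is None else env
    name = env.get("AI_PROVIDER", "claude").lower()
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown AI provider: {name!r}") from None
```

Because the pipeline only depends on the base interface, adding a provider in this sketch means adding one class and one registry entry.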
Each document moves through five stages in the Python worker:
| Stage | What happens |
|---|---|
| Text extraction | Detects file type and extracts text via PyMuPDF (PDF), python-docx (DOCX), or plain read |
| Classification | LLM call returns document type (INVOICE, CONTRACT, RECEIPT…) + confidence score |
| Field extraction | LLM call extracts structured fields; dynamic fields handled via admin-configurable schema |
| Validation | Rule-based checks: required fields present, amounts numeric, dates logical, currency ISO 4217 |
| Decision | Deterministic routing — INVALID_VALUE → escalate, MISSING_FIELD → review, confidence thresholds → approve or review |
The decision engine applies rules in strict priority order so data integrity problems are never auto-approved regardless of model confidence.
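A minimal sketch of that priority ordering follows. The issue names come from the stage table; the enum and class names, the outcome labels, and the 0.85 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class IssueType(Enum):
    INVALID_VALUE = "INVALID_VALUE"
    MISSING_FIELD = "MISSING_FIELD"


class Decision(Enum):
    AUTO_APPROVED = "auto_approved"
    NEEDS_REVIEW = "needs_review"
    ESCALATED = "escalated"


@dataclass(frozen=True)
class Issue:
    type: IssueType
    field: str


# Assumed threshold; the real value is not stated in this document.
APPROVE_THRESHOLD = 0.85


def decide(confidence: float, issues: list[Issue]) -> Decision:
    """Apply routing rules in priority order; the first match wins."""
    kinds = {issue.type for issue in issues}
    if IssueType.INVALID_VALUE in kinds:
        return Decision.ESCALATED      # integrity problems beat confidence
    if IssueType.MISSING_FIELD in kinds:
        return Decision.NEEDS_REVIEW
    if confidence >= APPROVE_THRESHOLD:
        return Decision.AUTO_APPROVED
    return Decision.NEEDS_REVIEW
```

Since the confidence check comes last, a document with an invalid value is escalated even at 0.99 confidence.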
The schema has thirteen tables organised into four domains. documents is the central entity: all processing tables cascade from it. audit_events is append-only, so the full history of any document is always reconstructable. document_chunks stores vector embeddings for RAG search alongside a GIN index for keyword fallback. field_evaluation_results captures predicted vs corrected values per field so extraction accuracy is measurable.
All routes require an authenticated session. Admin routes require role === "admin".
| Method | Route | Access | Description |
|---|---|---|---|
| GET | /api/documents | Any | List documents (admin sees all, beta sees own) |
| POST | /api/documents | Any | Upload a document |
| DELETE | /api/documents?id | Any | Delete a document |
| GET | /api/documents/:id | Any | Document detail and audit trail |
| POST | /api/documents/:id/actions | Any | Submit review action |
| POST | /api/documents/:id/rating | Any | Rate extraction quality (1–5) |
| POST | /api/documents/:id/ingest | Any | Manually trigger RAG ingestion |
| POST | /api/documents/reindex | Admin | Bulk re-index into vector store |
| GET | /api/admin/users | Admin | List users with usage |
| POST | /api/admin/users | Admin | Add beta user |
| PATCH | /api/admin/users | Admin | Toggle active status or upload limit |
| DELETE | /api/admin/users?id | Admin | Remove beta user |
| POST | /api/chat | Any | Streaming RAG question answering |
PostgreSQL as the job queue. All document state lives in one database. Using FOR UPDATE SKIP LOCKED keeps the queue inside the same ACID transaction as the work, so there is no coordination across two systems. The tradeoff is that this approach does not scale to thousands of concurrent workers, which is acceptable for now, and there is a clear migration path.
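The claim-and-process pattern can be sketched as follows, assuming a psycopg 3 connection. The table name `processing_jobs`, its columns, and the status values are hypothetical and are not taken from the project's schema.

```python
# Table, column, and status names below are assumptions for illustration.
CLAIM_SQL = """
SELECT id, document_id
FROM processing_jobs
WHERE status = 'queued'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE processing_jobs SET status = 'done' WHERE id = %s"


def process_next_job(conn, handler) -> bool:
    """Claim one queued job and process it inside a single transaction.

    `conn` is expected to be a psycopg 3 connection with the default tuple
    rows. Rows locked by another worker are skipped rather than waited on,
    so two workers never pick up the same job. Returns False when no
    unlocked job is available.
    """
    with conn.transaction():
        row = conn.execute(CLAIM_SQL).fetchone()
        if row is None:
            return False
        job_id, document_id = row
        # If the handler raises, the transaction rolls back, the row lock is
        # released, and the job becomes claimable again.
        handler(document_id)
        conn.execute(DONE_SQL, (job_id,))
    return True
```

A worker loop would call `process_next_job` repeatedly and sleep briefly whenever it returns False.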
Rules-based routing, not a second model. A model cannot reliably approve or reject its own output. Deterministic rules on typed outputs (a confidence float and a typed list of issues) make routing auditable and configurable without retraining anything.
JSONB for extracted fields. Document types and their required fields are admin-configurable. JSONB lets the field schema be data-driven without a migration every time a new document type is added.
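To illustrate what a data-driven schema buys, here is a small sketch in which the per-type rules are plain JSON, as they might be stored in a JSONB column. The schema shape (`required`, `numeric`) and the field names are assumptions, not the project's actual format.

```python
import json

# Hypothetical per-type schema, written as the JSON an admin might configure.
FIELD_SCHEMAS = json.loads("""
{
  "INVOICE": {"required": ["invoice_number", "total", "currency"],
              "numeric": ["total"]},
  "RECEIPT": {"required": ["merchant", "total"],
              "numeric": ["total"]}
}
""")


def validate(doc_type: str, fields: dict) -> list[tuple[str, str]]:
    """Return (issue_type, field_name) pairs for a dict of extracted fields."""
    schema = FIELD_SCHEMAS[doc_type]
    issues = []
    for name in schema["required"]:
        if fields.get(name) in (None, ""):
            issues.append(("MISSING_FIELD", name))
    for name in schema["numeric"]:
        value = fields.get(name)
        if value in (None, ""):
            continue  # absent values are handled by the required check
        try:
            float(value)
        except (TypeError, ValueError):
            issues.append(("INVALID_VALUE", name))
    return issues
```

Supporting a new document type in this sketch means adding one JSON entry; the validation code does not change.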
Three-tier RAG retrieval. Vector search → keyword search (GIN index) → recency. The chat widget stays functional even when pgvector or Ollama is unavailable. Quality degrades; the feature does not break.
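The fallback chain can be sketched with plain callables standing in for the real searches. The function names and the stub retrievers are hypothetical, and treating an empty result as a reason to fall through is an assumption; the text above only specifies fallback when a tier is unavailable.

```python
from typing import Callable, Sequence

Retriever = Callable[[str, int], Sequence[str]]


def retrieve(query: str, tiers: Sequence[tuple[str, Retriever]], k: int = 5):
    """Try each tier in order; return (tier_name, chunks) from the first hit.

    A tier that raises (for example, because the embedding service is down)
    or returns nothing is skipped, so a lower-quality tier can still answer.
    """
    for name, search in tiers:
        try:
            chunks = list(search(query, k))
        except Exception:
            continue  # degrade to the next tier instead of failing the request
        if chunks:
            return name, chunks
    return "none", []


# Stub retrievers for demonstration only.
def vector_search(query: str, k: int) -> list[str]:
    raise ConnectionError("embedding service unavailable")  # simulated outage


def keyword_search(query: str, k: int) -> list[str]:
    return ["chunk matching 'invoice'"][:k]


def recent_chunks(query: str, k: int) -> list[str]:
    return ["most recent chunk"][:k]


TIERS = [
    ("vector", vector_search),
    ("keyword", keyword_search),
    ("recency", recent_chunks),
]
```

With the simulated outage, `retrieve("invoice total", TIERS)` is answered by the keyword tier. Returning the tier name lets the caller log how often quality degraded.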
Append-only audit log. audit_events is insert-only. Every state change creates a new row. Reconstructing the full history of any document is a simple query.
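As a rough illustration of why an insert-only log makes history cheap to rebuild, the sketch below replays events in memory. The `AuditEvent` fields are illustrative and are not the real audit_events columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    # Field names are assumptions, not the actual table columns.
    seq: int            # monotonically increasing insertion order
    document_id: str
    new_status: str


def status_history(events: list[AuditEvent], document_id: str) -> list[str]:
    """Replay one document's events in insertion order."""
    own = sorted(
        (e for e in events if e.document_id == document_id),
        key=lambda e: e.seq,
    )
    return [e.new_status for e in own]
```

Because rows are never updated or deleted, filtering by document and ordering by insertion is enough; no diffing or snapshotting is needed.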
```
ai-docs-processor/
├── apps/web/src/
│   ├── app/
│   │   ├── page.tsx              # Document queue
│   │   ├── admin/                # User management
│   │   ├── evaluation/           # Accuracy metrics
│   │   ├── documents/[id]/       # Document detail and review
│   │   └── api/                  # All API routes
│   ├── lib/
│   │   ├── db.ts                 # Connection pool
│   │   ├── auth.ts               # NextAuth config
│   │   ├── documents/repository.ts
│   │   ├── users/repository.ts
│   │   └── rag/                  # chunker, embedder, store, ingest
│   └── components/
│       ├── layout/               # Shell, nav, profile
│       ├── documents/            # Table, upload modal, detail
│       └── chat/                 # Streaming chat widget
└── worker/src/ai_docs_worker/
    ├── ai/                       # BaseAIProvider, factory, providers
    └── pipeline/                 # runner, decision, validation, models, ocr
```
Every human correction is recorded in field_evaluation_results alongside the model's original prediction. The evaluation dashboard surfaces overall accuracy, per-field accuracy breakdown, most frequently corrected fields, and 1–5 quality ratings submitted by reviewers. This makes prompt iteration data-driven: inspect what's failing, adjust the prompt, measure improvement on the next batch.
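A per-field accuracy figure of the kind the dashboard shows could be computed along these lines. This is a sketch under assumptions: the row keys (`field`, `predicted`, `corrected`) are hypothetical, and it assumes an untouched field is stored with its corrected value equal to the prediction.

```python
from collections import defaultdict


def field_accuracy(rows: list[dict]) -> dict[str, float]:
    """Share of predictions per field that the reviewer left unchanged."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["field"]] += 1
        if row["predicted"] == row["corrected"]:
            correct[row["field"]] += 1
    return {name: correct[name] / totals[name] for name in totals}
```

Sorting the result by ascending accuracy gives a list of the most frequently corrected fields, which is a natural place to start when adjusting a prompt.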

