AI Document Processor

Upload a business document, such as an invoice, receipt, contract, or purchase order, and the system handles the rest. It classifies the document, extracts the key fields, validates the data, and routes it to one of three outcomes: auto-approved, flagged for human review, or escalated. Every decision is logged, and every human correction feeds back into an accuracy layer so you can measure and improve extraction quality over time.

Stack

Layer	Technology
Web app	Next.js 15, App Router
Auth	NextAuth.js v4, Google OAuth, JWT
Database	PostgreSQL, pgvector
Background worker	Python 3.12, psycopg3
AI providers	Anthropic Claude, Ollama, HuggingFace
Embeddings	Ollama nomic-embed-text, stored in pgvector
Schema validation	Pydantic v2 (worker), Zod (web)

Getting Started

# 1. Install dependencies
git clone <repo> && cd ai-docs-processor
npm install
(cd worker && pip install -e ".[dev]")   # subshell, so later steps still run from the repo root

# 2. Configure environment
cp apps/web/.env.example apps/web/.env.local
cp worker/.env.example worker/.env 

# 3. Start Postgres
docker compose up -d postgres

# 4. Run the web app
npm run dev --workspace=apps/web

# 5. Run the worker (in a second terminal, from the repo root)
cd worker && python -m ai_docs_worker

# Optional: pull local Ollama models (embeddings and LLM)
ollama pull nomic-embed-text
ollama pull llama3.2

System Architecture

The browser talks to a Next.js application that handles auth, serves pages, and exposes API routes. Every upload is written to PostgreSQL and queued for processing. A separate Python worker polls the queue, claims jobs atomically with SELECT ... FOR UPDATE SKIP LOCKED, and runs each document through the AI pipeline. Results, audit events, and embeddings all go back into Postgres. The AI provider (Claude, Ollama, or HuggingFace) is swappable via a single environment variable.
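As a rough illustration of the provider switch described above, here is a minimal sketch of a factory keyed on one environment variable. The variable name `AI_PROVIDER`, the `complete` method, and the concrete class names are assumptions made for this example; only `BaseAIProvider` is named in the project layout, and the real interface may differ.

```python
import os
from abc import ABC, abstractmethod


class BaseAIProvider(ABC):
    """Common interface for all providers (method name is an assumption)."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...


class ClaudeProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the Anthropic API here")


class OllamaProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the local Ollama server here")


class HuggingFaceProvider(BaseAIProvider):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the HuggingFace endpoint here")


# Hypothetical registry: environment value -> provider class.
PROVIDERS = {
    "claude": ClaudeProvider,
    "ollama": OllamaProvider,
    "huggingface": HuggingFaceProvider,
}


def get_provider(env=None) -> BaseAIProvider:
    """Pick a provider from AI_PROVIDER (assumed name), defaulting to Claude."""
    env = os.environ if env is None else env
    name = env.get("AI_PROVIDER", "claude").strip().lower()
    if name not in PROVIDERS:
        raise ValueError(
            f"Unknown AI_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        )
    return PROVIDERS[name]()
```

Passing `env` explicitly keeps the factory easy to test; in the worker it would simply read `os.environ`.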

Document Processing Pipeline

Each document moves through five stages in the Python worker:

Stage	What happens
Text extraction	Detects file type and extracts text via PyMuPDF (PDF), python-docx (DOCX), or plain read
Classification	LLM call returns document type (INVOICE, CONTRACT, RECEIPT…) + confidence score
Field extraction	LLM call extracts structured fields; dynamic fields handled via admin-configurable schema
Validation	Rule-based checks: required fields present, amounts numeric, dates logical, currency ISO 4217
Decision	Deterministic routing — INVALID_VALUE → escalate, MISSING_FIELD → review, confidence thresholds → approve or review

The decision engine applies rules in strict priority order so data integrity problems are never auto-approved regardless of model confidence.
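The priority order can be sketched as a small pure function. This is an illustration only: the `Outcome` and `PipelineResult` names and the 0.90 approval threshold are assumptions, not values taken from the worker.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(str, Enum):
    AUTO_APPROVED = "auto_approved"
    NEEDS_REVIEW = "needs_review"
    ESCALATED = "escalated"


@dataclass
class PipelineResult:
    confidence: float                      # model confidence in [0, 1]
    issues: list[str] = field(default_factory=list)  # validation issue codes


APPROVE_THRESHOLD = 0.90  # assumed value; configurable in practice


def decide(result: PipelineResult,
           approve_threshold: float = APPROVE_THRESHOLD) -> Outcome:
    """Apply routing rules in strict priority order."""
    # 1. Data integrity problems always escalate, whatever the confidence.
    if "INVALID_VALUE" in result.issues:
        return Outcome.ESCALATED
    # 2. Missing required fields always go to a human reviewer.
    if "MISSING_FIELD" in result.issues:
        return Outcome.NEEDS_REVIEW
    # 3. Only clean documents are eligible for auto-approval.
    if result.confidence >= approve_threshold:
        return Outcome.AUTO_APPROVED
    return Outcome.NEEDS_REVIEW
```

Because the integrity checks run before the confidence check, a document with an invalid value is escalated even at confidence 0.99.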

Database Schema

The schema has thirteen tables organised into four domains. documents is the central entity, and all processing tables cascade from it. audit_events is append-only, so the full history of any document is always reconstructable. document_chunks stores vector embeddings for RAG search, alongside a GIN index for keyword fallback. field_evaluation_results captures predicted and corrected values per field, so extraction accuracy is measurable.

API Reference

All routes require an authenticated session. Admin routes require role === "admin".

Method	Route	Access	Description
GET	`/api/documents`	Any	List documents (admin sees all, beta sees own)
POST	`/api/documents`	Any	Upload a document
DELETE	`/api/documents?id`	Any	Delete a document
GET	`/api/documents/:id`	Any	Document detail and audit trail
POST	`/api/documents/:id/actions`	Any	Submit review action
POST	`/api/documents/:id/rating`	Any	Rate extraction quality (1–5)
POST	`/api/documents/:id/ingest`	Any	Manually trigger RAG ingestion
POST	`/api/documents/reindex`	Admin	Bulk re-index into vector store
GET	`/api/admin/users`	Admin	List users with usage
POST	`/api/admin/users`	Admin	Add beta user
PATCH	`/api/admin/users`	Admin	Toggle active status or upload limit
DELETE	`/api/admin/users?id`	Admin	Remove beta user
POST	`/api/chat`	Any	Streaming RAG question answering

Key Design Decisions

PostgreSQL as the job queue. All document state lives in one database. Using FOR UPDATE SKIP LOCKED keeps the queue inside the same ACID transaction as the work, avoiding coordination across two systems. The tradeoff is that this approach does not scale to thousands of concurrent workers, which is acceptable for now and leaves a clear migration path.
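A minimal sketch of the claim-and-process loop body, under stated assumptions: the `processing_jobs` table, its columns, and the `handler` callback are hypothetical names, and `conn` is expected to behave like a psycopg3 connection (`cursor()`, `commit()`, `rollback()`). The row lock taken by the SELECT is held until commit, so the claim and the work share one transaction.

```python
# Hypothetical table and column names, for illustration only.
CLAIM_SQL = """
SELECT id, document_id
FROM processing_jobs
WHERE status = 'queued'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE processing_jobs SET status = 'done' WHERE id = %s"


def process_next_job(conn, handler) -> bool:
    """Claim one queued job, run handler(document_id), and commit.

    Returns False when no unclaimed job is available. Rows locked by other
    workers are skipped rather than waited on, so workers never block each
    other on the same job.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing to do; end the transaction
            return False
        job_id, document_id = row
        try:
            handler(document_id)           # run the pipeline for this document
            cur.execute(DONE_SQL, (job_id,))
        except Exception:
            conn.rollback()                # releases the lock; job stays queued
            raise
    conn.commit()
    return True
```

If the handler fails, the rollback releases the row lock and the job becomes claimable again, which is the main appeal of keeping the queue in the same transaction.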

Rules-based routing, not a second model. A model cannot reliably approve or reject its own output. Deterministic rules on typed outputs (a confidence float and a typed list of issues) make routing auditable and configurable without retraining anything.

JSONB for extracted fields. Document types and their required fields are admin-configurable. JSONB lets the field schema be data-driven without a migration every time a new document type is added.
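To illustrate what a data-driven field schema might look like, here is a minimal sketch that validates extracted fields against a schema loaded from JSON. The schema shape, the field names, and the `validate` function are assumptions for this example. The currency check only tests the three-uppercase-letter shape of an ISO 4217 code, not membership in the official list.

```python
import json
import re

# Hypothetical admin-configured schema, as it might be stored in a JSONB column.
SCHEMA_JSON = """
{
  "INVOICE": {
    "required": ["invoice_number", "total_amount", "currency", "issue_date"]
  }
}
"""


def validate(doc_type: str, fields: dict, schemas: dict) -> list[tuple[str, str]]:
    """Return (issue_code, field_name) pairs for one extracted document."""
    issues = []
    # Required fields come from data, so a new document type needs no code change.
    for name in schemas.get(doc_type, {}).get("required", []):
        if fields.get(name) in (None, ""):
            issues.append(("MISSING_FIELD", name))

    amount = fields.get("total_amount")
    if amount not in (None, ""):
        try:
            float(amount)
        except (TypeError, ValueError):
            issues.append(("INVALID_VALUE", "total_amount"))

    currency = fields.get("currency")
    if currency not in (None, "") and not re.fullmatch(r"[A-Z]{3}", str(currency)):
        issues.append(("INVALID_VALUE", "currency"))
    return issues
```

Adding a new document type then means inserting a new schema entry rather than running a migration.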

Three-tier RAG retrieval. Vector search → keyword search (GIN index) → recency. The chat widget stays functional even when pgvector or Ollama is unavailable. Quality degrades; the feature does not break.
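The fallback chain can be sketched as a loop over retrievers tried in order. This is an illustration, not the web app's code: the `retrieve` function and the tier callables are hypothetical, and each tier is assumed to take a question and a limit and return a list of text chunks.

```python
from typing import Callable, Sequence

# A retriever takes (question, limit) and returns matching chunks.
Retriever = Callable[[str, int], Sequence[str]]


def retrieve(question: str,
             tiers: Sequence[tuple[str, Retriever]],
             limit: int = 5) -> tuple[str, list[str]]:
    """Try each tier in order; fall through on errors or empty results.

    Returns the name of the tier that answered and its chunks, or
    ("none", []) when every tier failed or found nothing.
    """
    for name, search in tiers:
        try:
            hits = list(search(question, limit))
        except Exception:
            # Tier unavailable (for example pgvector or Ollama is down):
            # degrade to the next tier instead of failing the request.
            continue
        if hits:
            return name, hits
    return "none", []
```

A caller would pass the tiers in the order vector, keyword, recency, so quality degrades step by step while the chat feature keeps answering.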

Append-only audit log. audit_events is insert-only. Every state change creates a new row. Reconstructing the full history of any document is a simple query.
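As a small in-memory illustration of why an insert-only log makes history easy to rebuild, the sketch below orders a document's events by time and reads the current status from the last one. The `AuditEvent` fields are assumptions; in the database this would roughly correspond to selecting from audit_events for one document, ordered by timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: events are never modified after creation
class AuditEvent:
    document_id: str
    event_type: str      # hypothetical column names throughout
    status: str
    created_at: datetime


def history(events: list[AuditEvent], document_id: str) -> list[AuditEvent]:
    """All events for one document, oldest first."""
    return sorted(
        (e for e in events if e.document_id == document_id),
        key=lambda e: e.created_at,
    )


def current_status(events: list[AuditEvent], document_id: str):
    """Current state is simply the status on the most recent event."""
    trail = history(events, document_id)
    return trail[-1].status if trail else None
```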

Project Structure

ai-docs-processor/
├── apps/web/src/
│   ├── app/
│   │   ├── page.tsx              # Document queue
│   │   ├── admin/                # User management
│   │   ├── evaluation/           # Accuracy metrics
│   │   ├── documents/[id]/       # Document detail and review
│   │   └── api/                  # All API routes
│   ├── lib/
│   │   ├── db.ts                 # Connection pool
│   │   ├── auth.ts               # NextAuth config
│   │   ├── documents/repository.ts
│   │   ├── users/repository.ts
│   │   └── rag/                  # chunker, embedder, store, ingest
│   └── components/
│       ├── layout/               # Shell, nav, profile
│       ├── documents/            # Table, upload modal, detail
│       └── chat/                 # Streaming chat widget
└── worker/src/ai_docs_worker/
    ├── ai/                       # BaseAIProvider, factory, providers
    └── pipeline/                 # runner, decision, validation, models, ocr

Evaluation Layer

Every human correction is recorded in field_evaluation_results alongside the model's original prediction. The evaluation dashboard surfaces overall accuracy, per-field accuracy breakdown, most frequently corrected fields, and 1–5 quality ratings submitted by reviewers. This makes prompt iteration data-driven: inspect what's failing, adjust the prompt, measure improvement on the next batch.
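The metrics described above can be sketched from rows of (field name, predicted value, corrected value). This is a minimal illustration that assumes exact string equality counts as correct; the function names and the row shape are hypothetical, and the real dashboard may normalise values before comparing.

```python
from collections import Counter, defaultdict

# Each row: (field_name, predicted_value, corrected_value)
Row = tuple[str, str, str]


def per_field_accuracy(rows: list[Row]) -> dict[str, float]:
    """Fraction of predictions per field that the reviewer left unchanged."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for field_name, predicted, corrected in rows:
        totals[field_name] += 1
        if predicted == corrected:
            correct[field_name] += 1
    return {name: correct[name] / totals[name] for name in totals}


def overall_accuracy(rows: list[Row]) -> float:
    """Fraction of all field predictions that needed no correction."""
    if not rows:
        return 0.0
    return sum(1 for _, p, c in rows if p == c) / len(rows)


def most_corrected(rows: list[Row], top: int = 3) -> list[tuple[str, int]]:
    """Fields ranked by how often reviewers changed the predicted value."""
    return Counter(name for name, p, c in rows if p != c).most_common(top)
```

Comparing these numbers before and after a prompt change, on comparable batches, is what makes the iteration loop measurable.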
