RAG over PDFs

FastAPI backend + Next.js frontend. Upload a PDF, ask questions, get cited answers. Mistral handles embeddings and chat.

Stack

Backend (backend/)

FastAPI 0.115
Mistral SDK 1.2 (mistral-embed 1024-d, mistral-small-latest)
pypdf 5.0
NumPy 1.26 (no third-party vector DB; brute-force cosine over a normalized matrix)

Frontend (frontend/)

Next.js 16.2 + Turbopack
React 19.2
Tailwind CSS 4
TypeScript

Run

MISTRAL_API_KEY must be set in .env.

# Backend
cd backend
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "MISTRAL_API_KEY=..." > .env
uvicorn app.main:app --reload --port 8000

# Frontend (other shell)
cd frontend
npm install
npm run dev   # http://localhost:3000

The frontend proxies /api/* to http://localhost:8000 via next.config.ts, so the backend needs no CORS setup.

API

Method	Path	Body / Query	Notes
GET	`/health`	none	liveness
POST	`/ingest`	`multipart/form-data`; field `files` (PDF, repeat for multiple)	idempotent by filename
POST	`/load_samples`	none	ingests bundled PDFs in `eval/datasets/corpus/`
GET	`/files`	none	lists indexed files with chunk and page counts
DELETE	`/files/{filename}`	none	drops every chunk of the file from both indices
POST	`/query`	`{ "question": str, "k": int=5, "history": [{role, content}]? }`	`k` between 1 and 20; `history` is the prior turns of the conversation, last 6 messages are used
GET	`/suggest_queries`	`?n=3&filename=...` (both optional)	LLM-generated example questions for the corpus

# Repeat the files field once per PDF (paths below are placeholders)
curl -F "files=@path/to/first.pdf" -F "files=@path/to/second.pdf" http://localhost:8000/ingest

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "what is RAG?", "k": 5}'

curl -X POST http://localhost:8000/load_samples
curl http://localhost:8000/files
curl -X DELETE http://localhost:8000/files/rag-paper.pdf
curl "http://localhost:8000/suggest_queries?n=3"

Follow-up questions

/query accepts history so pronouns and references in the next question resolve against earlier turns. The backend keeps the last 6 messages.

# Turn 1
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What retriever does the RAG model use?"}'
# -> "The RAG model uses a dense retriever called the Dense Passage Retriever (DPR) [4]."

# Turn 2 with history. "it" resolves to DPR / the RAG retriever.
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How is it trained?",
    "history": [
      {"role": "user", "content": "What retriever does the RAG model use?"},
      {"role": "assistant", "content": "The RAG model uses Dense Passage Retriever (DPR)."}
    ]
  }'
# -> "The RAG model is initialized using DPR... and the retriever and generator
#     components are jointly trained without direct supervision on what to retrieve [5]."

Without history, the same follow-up returns a vague "the context does not provide details" because the model cannot resolve "it".

/query returns { answer, sources, intent }. intent is chitchat, knowledge or refused. Chitchat and refused skip retrieval and the LLM entirely.

Architecture

Ingest path is linear: pypdf → headings → recursive chunks → (Mistral embed → NumPy matrix, tokenize → BM25).

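Query path: intent check → greeting strip → (vector search, BM25) → RRF → MMR → Mistral chat → citation guard → grounding check.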

Design notes

PDF extraction (`services/pdf.py`)

pypdf per page. Works on digital PDFs, not on scans (would need OCR).

Chunking (`services/chunking.py`)

Recursive split on ["\n\n", "\n", ". ", " ", ""], then a merge pass that packs pieces up to chunk_size=500 chars with overlap=75. The overlap is the tail of the previous chunk, prepended to the next, to avoid losing context at the seams.
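The split-then-merge pass described above can be sketched roughly as follows. This is an illustration, not the code in `services/chunking.py`: the function names are hypothetical, and splitting at `chunk_size - overlap` so that merged chunks never exceed `chunk_size` is an assumption of this sketch.

```python
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def split_recursive(text, max_len, separators=SEPARATORS):
    """Split text into pieces of at most max_len chars, coarsest separator first."""
    if len(text) <= max_len:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    parts = text.split(sep)
    pieces = []
    for i, part in enumerate(parts):
        # Re-attach the separator so no characters are lost.
        piece = part + sep if i < len(parts) - 1 else part
        if len(piece) > max_len:
            pieces.extend(split_recursive(piece, max_len, finer))
        elif piece:
            pieces.append(piece)
    return pieces


def merge_with_overlap(pieces, chunk_size=500, overlap=75):
    """Pack pieces into chunks; each new chunk starts with the previous tail."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = current[-overlap:] if overlap > 0 else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text, chunk_size=500, overlap=75):
    # Assumption: pieces are capped at chunk_size - overlap so that
    # tail + piece always fits inside chunk_size.
    pieces = split_recursive(text, chunk_size - overlap)
    return merge_with_overlap(pieces, chunk_size, overlap)
```

With these defaults, the second chunk of a long text begins with the last 75 characters of the first.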

Heading detection runs first. It recognizes numbered headings (e.g. "1.2 Title Case") and short ALL-CAPS lines. Each chunk gets the most recent heading as its section.
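A minimal sketch of what such a heading detector could look like. The regex, the 60-character cap, and the small-word list are assumptions made for illustration; the real rules in `services/chunking.py` may differ.

```python
import re

# Hypothetical patterns: "1", "1.2", "1.2.3." followed by a title.
NUMBERED = re.compile(r"^\d+(?:\.\d+)*\.?\s+(.+)$")
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def is_title_case(title):
    """True if every word is capitalized, allowing a few small connector words."""
    words = re.findall(r"[A-Za-z][\w-]*", title)
    return bool(words) and all(
        w[0].isupper() or w.lower() in SMALL_WORDS for w in words
    )


def is_heading(line, max_len=60):
    line = line.strip()
    if not line or len(line) > max_len:
        return False
    numbered = NUMBERED.match(line)
    if numbered and is_title_case(numbered.group(1)):
        return True
    # Short ALL-CAPS line: has letters and none of them are lowercase.
    return any(c.isalpha() for c in line) and line == line.upper()


def assign_sections(lines):
    """Tag each line with the most recent heading seen so far (None before the first)."""
    section, tagged = None, []
    for line in lines:
        if is_heading(line):
            section = line.strip()
        tagged.append((line, section))
    return tagged
```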

Embeddings (`services/embeddings.py`)

mistral-embed, 1024-d, batched at 32.

Vector store (`services/vector_store.py`)

In-memory (N, 1024) float32 matrix of L2-normalized embeddings. Search is vectors @ q (dot product equals cosine because both sides are unit-norm), then argpartition for top-k. Persisted to backend/data/index.npy + chunks.json. remove_source(filename) drops every chunk from that file and its matching rows. The loader asserts that matrix rows match len(chunks) to catch any persistence drift.
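The search step can be sketched in a few lines of NumPy. The function names are illustrative, and the toy vectors are 2-d rather than 1024-d; the point is that after L2 normalization a single matrix-vector product yields cosine scores, and `argpartition` avoids a full sort.

```python
import numpy as np


def normalize(m):
    """L2-normalize along the last axis (works for a matrix or a single vector)."""
    m = np.asarray(m, dtype=np.float32)
    norms = np.linalg.norm(m, axis=-1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)


def top_k(vectors, query, k):
    """Return (indices, scores) of the k rows most similar to query, best first."""
    scores = vectors @ query  # dot product == cosine for unit vectors
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=np.intp), np.array([], dtype=np.float32)
    # argpartition puts the k best in front in O(N); only those k get sorted.
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]
```

Usage: `idx, scores = top_k(normalize(matrix), normalize(q), 5)`.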

BM25 (`services/bm25.py`)

k1=1.5, b=0.75. Tokenizer is \w+ lowercased. No stemming, no stopwords. Rebuilt from chunks at startup, kept in sync on add/remove.
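For reference, a compact BM25 scorer with the same parameters and tokenizer. This is a sketch, not the contents of `services/bm25.py`; in particular the IDF variant (the non-negative `log(1 + ...)` form) is an assumption, since the README does not say which one is used.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def tokenize(text):
    return TOKEN.findall(text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lens) / len(self.lens) if self.lens else 0.0
        # Document frequency: number of docs containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        n, df = len(self.tfs), self.df.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, i):
        tf = self.tfs[i]
        length_ratio = self.lens[i] / self.avgdl if self.avgdl else 0.0
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, k):
        """Indices of the top-k documents with a non-zero score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:k]]
```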

Hybrid retrieval (`services/retriever.py`)

Each ranker over-fetches 4 * k candidates. They get merged with RRF (k=60, contribution 1 / (60 + rank)). RRF is rank-only, so no need to normalize cosine vs. BM25 scores. Then MMR (lambda=0.6) picks the final k greedily, maximizing lambda * relevance - (1 - lambda) * max_redundancy. Redundancy is Jaccard on token bags, so no extra embeddings.
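The fusion and diversification steps can be sketched as below. Two details are assumptions of this sketch rather than facts about `services/retriever.py`: Jaccard is computed on token sets, and RRF scores are rescaled to [0, 1] before MMR so that relevance and redundancy live on comparable scales.

```python
import re


def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr(scores, texts, k, lam=0.6):
    """Greedily pick k ids maximizing lam * relevance - (1 - lam) * max_redundancy."""
    tokens = {d: set(re.findall(r"\w+", texts[d].lower())) for d in scores}
    top = max(scores.values(), default=1.0)
    relevance = {d: s / top for d, s in scores.items()}  # assumption: rescale to [0, 1]
    selected, candidates = [], set(scores)
    while candidates and len(selected) < k:
        def value(d):
            redundancy = max(
                (jaccard(tokens[d], tokens[s]) for s in selected), default=0.0
            )
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=value)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected
```

Usage: `mmr(rrf([vector_ranking, bm25_ranking]), texts_by_id, k=5)`, where each ranking is a list of chunk ids, best first.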

Intent (`services/intent.py`)

Regex, deterministic, EN/ES. Whole-string match on greetings and politeness. Chitchat returns a canned reply with no retrieval and no LLM call. Anything else goes to the knowledge path. Misses ambiguous things like "hi how are you doing today". An LLM classifier would catch those but adds a roundtrip per query.
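A whole-string matcher of this kind might look like the following. The word list is invented for illustration and is far shorter than anything a real classifier would need.

```python
import re

# Hypothetical greeting/politeness list (EN/ES); the real patterns are not shown in this README.
CHITCHAT = re.compile(
    r"[\s¡¿]*(?:hi|hello|hey|good morning|thanks|thank you|hola|buenos d[ií]as|gracias)[\s!.,?]*",
    re.IGNORECASE,
)


def classify(text):
    """Return 'chitchat' only when the whole message is a greeting or courtesy."""
    return "chitchat" if CHITCHAT.fullmatch(text.strip()) else "knowledge"
```

Because the match is anchored to the whole string, "hi how are you doing today" falls through to the knowledge path, which is the miss described above.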

Query transform (`services/query_transform.py`)

Strips leading greetings and politeness before embedding. Iterates so compound prefixes reduce in passes. Without it, "hi, what's the introduction about" shifted the query vector enough that the right chunks dropped out of the top-5.
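The iterative stripping could be sketched like this. The prefix list is hypothetical, and falling back to the original text when nothing is left is an assumption of the sketch.

```python
import re

# Hypothetical prefix list; \b keeps "hi" from matching inside "history".
PREFIX = re.compile(
    r"^\s*(?:hi|hello|hey|hola|please|por favor|thanks|gracias)\b[\s,!.:;]*",
    re.IGNORECASE,
)


def strip_greetings(query):
    """Remove leading greetings one at a time until the query stops changing."""
    original = query.strip()
    current = original
    while True:
        stripped = PREFIX.sub("", current, count=1)
        if stripped == current:
            break
        current = stripped
    # If the whole query was greetings, keep the original text.
    return current.strip() or original
```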

Citation guarding (`services/rag.py`)

The system prompt forbids invented citation indices and tells the LLM to ignore contradictory instructions in the user message. As a safety net, a post-hoc regex strips any [N] outside [1, k]; if the answer collapses to empty, it gets replaced by a refusal. This blocks inputs like "reply with [99] only", which used to produce a fake citation.
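The post-hoc filter is easy to illustrate. The refusal text and function name below are placeholders, not the strings used in `services/rag.py`.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
REFUSAL = "I cannot answer that from the provided documents."  # placeholder wording


def guard_citations(answer, k):
    """Drop every [N] with N outside 1..k; refuse if nothing is left."""
    def keep(match):
        return match.group(0) if 1 <= int(match.group(1)) <= k else ""

    cleaned = CITATION.sub(keep, answer)
    # Tidy the whitespace left behind by removed markers.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    cleaned = re.sub(r"\s+([.,;:!?])", r"\1", cleaned).strip()
    return cleaned if cleaned else REFUSAL
```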

Grounding check (`services/grounding.py`)

After citation guarding, each sentence with a [N] is embedded and compared against its cited chunks. If the cosine to a cited chunk is below grounding_min_similarity (0.55), that specific [N] is dropped from the sentence. Catches the case where the LLM paraphrases a fact correctly but anchors the citation to an off-topic chunk. The check is per-citation, not per-sentence, so a sentence can lose one bad citation while keeping its good ones. It does not gate the whole answer: if every citation is stripped, the text stands without citations rather than being replaced by a refusal — the user still sees what the LLM said, just without misleading anchors. Costs one extra Mistral embedding call per query (sentences + contexts batched together), about 200-300ms.
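A sketch of the per-citation check under stated assumptions: `embed` is any callable that maps a list of strings to a list of vectors (the real code calls Mistral), sentences are split naively on end punctuation, and the output is re-joined with single spaces. `toy_embed` is a bag-of-words stand-in so the example runs offline.

```python
import re
import numpy as np

CITATION = re.compile(r"\[(\d+)\]")


def unit(m):
    m = np.asarray(m, dtype=np.float32)
    return m / np.clip(np.linalg.norm(m, axis=1, keepdims=True), 1e-12, None)


def drop_ungrounded(answer, contexts, embed, min_similarity=0.55):
    """Remove each [N] whose sentence is not similar enough to context N."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    cited = [i for i, s in enumerate(sentences) if CITATION.search(s)]
    if not cited or not contexts:
        return answer
    # One batched embedding call: cited sentences first, then the contexts.
    texts = [CITATION.sub("", sentences[i]) for i in cited] + list(contexts)
    vectors = unit(embed(texts))
    sims = vectors[:len(cited)] @ vectors[len(cited):].T
    for row, i in enumerate(cited):
        def keep(match, row=row):
            n = int(match.group(1))
            grounded = 1 <= n <= len(contexts) and sims[row, n - 1] >= min_similarity
            return match.group(0) if grounded else ""
        cleaned = CITATION.sub(keep, sentences[i])
        cleaned = re.sub(r"\s+([.,;:!?])", r"\1", cleaned)
        sentences[i] = re.sub(r" {2,}", " ", cleaned)
    return " ".join(sentences)


def toy_embed(texts):
    """Stand-in for the real embedding call: word counts over the batch vocabulary."""
    tokenized = [re.findall(r"\w+", t.lower()) for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    return [[float(toks.count(w)) for w in vocab] for toks in tokenized]
```

In the usage below, the sentence keeps [1] and loses [2], while the uncited sentence is untouched:

```python
contexts = ["the retriever is dense passage retriever dpr", "bart is the generator"]
print(drop_ungrounded("The retriever is DPR [1] [2]. The weather is nice.", contexts, toy_embed))
```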

Idempotent ingestion (`api/ingestion.py`)

Re-uploading the same filename replaces, not appends. Otherwise the index keeps growing on every retry and RRF gets biased toward duplicated chunks.
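Replace-by-filename can be sketched with a toy store that keeps chunks and matrix rows aligned. `TinyStore`, `replace_source`, and the `source` key are illustrative names; only `remove_source(filename)` is mentioned elsewhere in this README.

```python
import numpy as np


class TinyStore:
    """Toy in-memory store: row i of vectors belongs to chunks[i]."""

    def __init__(self, dim):
        self.vectors = np.zeros((0, dim), dtype=np.float32)
        self.chunks = []

    def remove_source(self, filename):
        keep = [i for i, c in enumerate(self.chunks) if c["source"] != filename]
        self.vectors = self.vectors[np.asarray(keep, dtype=np.intp)]
        self.chunks = [self.chunks[i] for i in keep]

    def replace_source(self, filename, chunks, vectors):
        # Drop the old version first so a re-upload never duplicates chunks.
        self.remove_source(filename)
        new_rows = np.asarray(vectors, dtype=np.float32).reshape(-1, self.vectors.shape[1])
        self.vectors = np.vstack([self.vectors, new_rows])
        self.chunks.extend(chunks)
        assert self.vectors.shape[0] == len(self.chunks)
```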

Async indexing (`api/ingestion.py`)

PDF parsing and chunking are synchronous; embedding and indexing run as a FastAPI BackgroundTasks job after the response is sent, so large PDFs do not block the request. num_indexed in the response reflects the store size before this batch is added. On re-ingest, the old chunks stay queryable until the background task swaps them out, avoiding a visibility gap. Failures are logged; the user already got a 200, so a follow-up health check or re-upload is the recovery path.

Evaluation

backend/eval/ runs a JSON golden set through the live retrieval and generation services (no HTTP) and reports five metrics:

intent — chitchat vs. knowledge classification accuracy
recall@k — fraction of knowledge questions whose top-k chunks contain at least one of the expected substrings
citation — fraction of [N] citations that point to a chunk containing the expected substring (grounding precision)
coverage — fraction of expected keywords present in the answer text
oos_refusal — fraction of out-of-scope questions the system refused to answer
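Two of these metrics are simple enough to sketch. The field names (`retrieved`, `expected_substrings`) and the case-insensitive matching are assumptions; the actual schema of the golden set is not shown here.

```python
def recall_at_k(examples):
    """Fraction of examples whose retrieved chunks contain any expected substring."""
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        chunks = [c.lower() for c in ex["retrieved"]]
        if any(sub.lower() in c for sub in ex["expected_substrings"] for c in chunks):
            hits += 1
    return hits / len(examples)


def coverage(answer, keywords):
    """Fraction of expected keywords that appear in the answer text."""
    if not keywords:
        return 0.0
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)
```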

The corpus is bundled at backend/eval/datasets/corpus/rag-paper.pdf (Lewis et al. 2020, from arXiv:2005.11401). Ingest it once, then run the harness:

# backend running on :8000
curl -F "files=@backend/eval/datasets/corpus/rag-paper.pdf" http://localhost:8000/ingest

cd backend
python -m eval.run_eval --dataset eval/datasets/rag_paper.json --out eval/last_run.json

Baseline on rag_paper.json (8 questions over Lewis et al. 2020):

metric	value
intent	1.00
recall@5	0.80
citation	0.40
coverage	0.60
oos_refusal	1.00
avg latency	~870ms

The dataset is small and curated by hand on purpose: it exists to catch regressions when tuning chunking, prompts or k, not to produce an absolute quality score.

Repo layout

backend/app/
├── api/                      health, ingestion, query
├── core/                     settings
├── schemas/                  Chunk, IngestResponse, QueryRequest/Response
└── services/
    ├── pdf.py                text extraction
    ├── chunking.py           recursive split + heading detection
    ├── embeddings.py         Mistral embed wrapper
    ├── llm.py                Mistral chat wrapper
    ├── vector_store.py       numpy matrix singleton + persistence
    ├── bm25.py               keyword index
    ├── retriever.py          RRF + MMR
    ├── intent.py             chitchat regex
    ├── query_transform.py    greeting stripper
    ├── rag.py                prompt + citation filter
    └── grounding.py          per-sentence citation grounding check

frontend/
├── app/                      Next.js app router
├── components/               Uploader, Chat, Message, Sources
└── lib/api/                  http, ingest, query, types

backend/eval/
├── datasets/
│   ├── corpus/               bundled PDFs (rag-paper.pdf)
│   └── rag_paper.json        golden q/a set
└── run_eval.py               in-process runner, prints + dumps report
