FastAPI backend + Next.js frontend. Upload a PDF, ask questions, get cited answers. Mistral handles embeddings and chat.
Backend (backend/)
- FastAPI 0.115
- Mistral SDK 1.2 (
mistral-embed1024-d,mistral-small-latest) - pypdf 5.0
- NumPy 1.26 (no third-party vector DB; brute-force cosine over a normalized matrix)
Frontend (frontend/)
- Next.js 16.2 + Turbopack
- React 19.2
- Tailwind CSS 4
- TypeScript
MISTRAL_API_KEY must be set in .env.
# Backend
cd backend
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "MISTRAL_API_KEY=..." > .env
uvicorn app.main:app --reload --port 8000
# Frontend (other shell)
cd frontend
npm install
npm run dev # http://localhost:3000The frontend proxies /api/* to http://localhost:8000 via next.config.ts. No CORS setup.
| Method | Path | Body / Query | Notes |
|---|---|---|---|
| GET | /health |
none | liveness |
| POST | /ingest |
multipart/form-data; field files (PDF, repeat for multiple) |
idempotent by filename |
| POST | /load_samples |
none | ingests bundled PDFs in eval/datasets/corpus/ |
| GET | /files |
none | lists indexed files with chunk and page counts |
| DELETE | /files/{filename} |
none | drops every chunk of the file from both indices |
| POST | /query |
{ "question": str, "k": int=5, "history": [{role, content}]? } |
k between 1 and 20; history is the prior turns of the conversation, last 6 messages are used |
| GET | /suggest_queries |
?n=3&filename=... (both optional) |
LLM-generated example questions for the corpus |
curl -F "[email protected]" -F "[email protected]" http://localhost:8000/ingest
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "what is RAG?", "k": 5}'
curl -X POST http://localhost:8000/load_samples
curl http://localhost:8000/files
curl -X DELETE http://localhost:8000/files/rag-paper.pdf
curl "http://localhost:8000/suggest_queries?n=3"/query accepts history so pronouns and references in the next question
resolve against earlier turns. The backend keeps the last 6 messages.
# Turn 1
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What retriever does the RAG model use?"}'
# -> "The RAG model uses a dense retriever called the Dense Passage Retriever (DPR) [4]."
# Turn 2 with history. "it" resolves to DPR / the RAG retriever.
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"question": "How is it trained?",
"history": [
{"role": "user", "content": "What retriever does the RAG model use?"},
{"role": "assistant", "content": "The RAG model uses Dense Passage Retriever (DPR)."}
]
}'
# -> "The RAG model is initialized using DPR... and the retriever and generator
# components are jointly trained without direct supervision on what to retrieve [5]."Without history the same follow-up returns a vague "the context does not
provide details" because it cannot resolve it.
/query returns { answer, sources, intent }. intent is chitchat, knowledge or refused. Chitchat and refused skip retrieval and the LLM entirely.
Ingest path is linear: pypdf → headings → recursive chunks → (Mistral embed → NumPy matrix, tokenize → BM25).
Query path:
pypdf per page. Works on digital PDFs, not on scans (would need OCR).
Recursive split on ["\n\n", "\n", ". ", " ", ""], then a merge pass that packs pieces up to chunk_size=500 chars with overlap=75. The overlap is the tail of the previous chunk, prepended to the next, to avoid losing context at the seams.
Heading detection runs first. It recognizes numbered (1.2 Title Case) and short ALL-CAPS lines. Each chunk gets the most recent heading as its section.
mistral-embed, 1024-d, batched at 32.
In-memory (N, 1024) float32 matrix of L2-normalized embeddings. Search is vectors @ q (dot product equals cosine because both sides are unit-norm), then argpartition for top-k. Persisted to backend/data/index.npy + chunks.json. remove_source(filename) drops every chunk from that file and its matching rows. The loader asserts that matrix rows match len(chunks) to catch any persistence drift.
k1=1.5, b=0.75. Tokenizer is \w+ lowercased. No stemming, no stopwords. Rebuilt from chunks at startup, kept in sync on add/remove.
Each ranker over-fetches 4 * k candidates. They get merged with RRF (k=60, contribution 1 / (60 + rank)). RRF is rank-only, so no need to normalize cosine vs. BM25 scores. Then MMR (lambda=0.6) picks the final k greedily, maximizing lambda * relevance - (1 - lambda) * max_redundancy. Redundancy is Jaccard on token bags, so no extra embeddings.
Regex, deterministic, EN/ES. Whole-string match on greetings and politeness. Chitchat returns a canned reply with no retrieval and no LLM call. Anything else goes to the knowledge path. Misses ambiguous things like "hi how are you doing today". An LLM classifier would catch those but adds a roundtrip per query.
Strips leading greetings and politeness before embedding. Iterates so compound prefixes reduce in passes. Without it, hi, what's the introduction about shifted the query vector enough that the right chunks dropped out of the top-5.
The system prompt forbids invented citation indices and tells the LLM to ignore contradictory instructions in the user message. As a safety net, a post-hoc regex strips any [N] outside [1, k]; if the answer collapses to empty, it gets replaced by a refusal. This blocks inputs like reply with [99] only, which used to produce a fake citation.
After citation guarding, each sentence with a [N] is embedded and compared against its cited chunks. If the cosine to a cited chunk is below grounding_min_similarity (0.55), that specific [N] is dropped from the sentence. Catches the case where the LLM paraphrases a fact correctly but anchors the citation to an off-topic chunk. The check is per-citation, not per-sentence, so a sentence can lose one bad citation while keeping its good ones. It does not gate the whole answer: if every citation is stripped, the text stands without citations rather than being replaced by a refusal — the user still sees what the LLM said, just without misleading anchors. Costs one extra Mistral embedding call per query (sentences + contexts batched together), about 200-300ms.
Re-uploading the same filename replaces, not appends. Otherwise the index keeps growing on every retry and RRF gets biased toward duplicated chunks.
PDF parsing and chunking are synchronous; embedding and indexing run in a BackgroundTasks after the response is sent, so large PDFs do not block the request. num_indexed in the response reflects the store size before this batch is added. On re-ingest, the old chunks stay queryable until the background task swaps them out, avoiding a visibility gap. Failures are logged; the user already got a 200, so a follow-up health check or re-upload is the recovery path.
backend/eval/ runs a JSON golden set through the live retrieval and generation services (no HTTP) and reports four metrics:
- intent — chitchat vs. knowledge classification accuracy
- recall@k — fraction of knowledge questions whose top-
kchunks contain at least one of the expected substrings - citation — fraction of
[N]citations that point to a chunk containing the expected substring (grounding precision) - coverage — fraction of expected keywords present in the answer text
- oos_refusal — fraction of out-of-scope questions the system refused to answer
The corpus is bundled at backend/eval/datasets/corpus/rag-paper.pdf (Lewis et al. 2020, from arXiv:2005.11401). Ingest it once, then run the harness:
# backend running on :8000
curl -F "files=@backend/eval/datasets/corpus/rag-paper.pdf" http://localhost:8000/ingest
cd backend
python -m eval.run_eval --dataset eval/datasets/rag_paper.json --out eval/last_run.jsonBaseline on rag_paper.json (8 questions over Lewis et al. 2020):
| metric | value |
|---|---|
| intent | 1.00 |
| recall@5 | 0.80 |
| citation | 0.40 |
| coverage | 0.60 |
| oos_refusal | 1.00 |
| avg latency | ~870ms |
The dataset is small and curated by hand on purpose: it exists to catch regressions when tuning chunking, prompts or k, not to produce an absolute quality score.
backend/app/
├── api/ health, ingestion, query
├── core/ settings
├── schemas/ Chunk, IngestResponse, QueryRequest/Response
└── services/
├── pdf.py text extraction
├── chunking.py recursive split + heading detection
├── embeddings.py Mistral embed wrapper
├── llm.py Mistral chat wrapper
├── vector_store.py numpy matrix singleton + persistence
├── bm25.py keyword index
├── retriever.py RRF + MMR
├── intent.py chitchat regex
├── query_transform.py greeting stripper
├── rag.py prompt + citation filter
└── grounding.py per-sentence citation grounding check
frontend/
├── app/ Next.js app router
├── components/ Uploader, Chat, Message, Sources
└── lib/api/ http, ingest, query, types
backend/eval/
├── datasets/
│ ├── corpus/ bundled PDFs (rag-paper.pdf)
│ └── rag_paper.json golden q/a set
└── run_eval.py in-process runner, prints + dumps report
