Text–Image Retrieval-Augmented Generation System
A production-oriented multimodal RAG pipeline enabling semantic search and question answering over documents and images by combining vector retrieval, distributed backend services, and multimodal LLMs.
Built with FastAPI, PostgreSQL + pgvector, Amazon S3, Dockerized services, and Angular + React micro-frontends.
-
Micro-frontend UI: Angular + React composed via
single-spa, routed through JWT-secured gateways -
Backend services: FastAPI REST APIs for ingestion, retrieval, and query orchestration
-
Vector retrieval:
- CLIP (HuggingFace) embeddings for text and images
- pgvector-backed similarity search with ANN indexes (IVF / HNSW)
- OCR-based enrichment for screenshots and diagrams
-
Object storage: Images stored in Amazon S3, referenced via metadata and embeddings
-
RAG orchestration: Retrieved image URLs, OCR text, and metadata composed into grounded LLM prompts
- Ingest – Upload documents/images → store binaries in S3, metadata in PostgreSQL
- Embed – Generate CLIP embeddings; optionally enrich with OCR/captions
- Index – Persist vectors in pgvector with ANN indexing
- Retrieve – Embed query → ANN search → hybrid retrieval over images + text
- Compose – Assemble multimodal or text-only context
- Generate – LLM responds with citations grounded in retrieved data
Frontend: Angular, React, micro-frontends (single-spa)
Backend: FastAPI, REST APIs, JWT auth
Data: PostgreSQL, pgvector, Amazon S3
Search: ANN (IVF / HNSW)
ML: CLIP embeddings, OCR pipelines, multimodal LLMs (GPT-4o / GPT-4o-mini)
Infra: Docker, docker-compose, Nginx
docker compose -f docker-compose.dev.yml up --build- Frontend root:
http://localhost:9000 - Backend API:
http://localhost:8000
cd python-backend
CREATE_DEMO_USER=false DATABASE_URL=... JWT_SECRET=... \
python -m app.scripts.create_admin_user --username <admin> --password <password>Passwords are hashed server-side and never logged.
- Scalable retrieval over mixed media
- Clean separation of storage, retrieval, and reasoning
- Extensible foundation for future RAG and multimodal systems