GnosisPages

title	GnosisPages
emoji	📚
colorFrom	blue
colorTo	green
sdk	docker
app_port	8501
pinned	false

GnosisPages

GnosisPages is a RAG + LLM chatbot for querying private document collections. Upload PDF files, build a semantic knowledge base, and ask questions in natural language — no keyword matching required.

▶ Try the live demo · Watch a walkthrough

Use Case: CV Discovery for Recruiters

Managing large volumes of CVs is difficult. Recruiters often don't know the exact technologies or skill names to search for, and keyword-based search misses semantically equivalent terms (e.g. "machine learning" vs "aprendizaje automático", or experience in a framework analogous to the one required).

GnosisPages solves this with semantic retrieval: a recruiter can ask "Who has experience with distributed systems and has worked in startups?" and the system finds relevant profiles even when the exact phrasing doesn't match.

Candidate data is sensitive (contact details, personal history). GnosisPages keeps it private by design: documents live in a local or private vector database, and the LLM never trains on them — it only reads the retrieved context at inference time.

The demo ships with a pre-loaded collection of synthetic CVs generated with Claude Sonnet 4.6 and vectorized with OpenAI's text-embedding-3-small.

Data flow

PDF documents
      │
      ▼
  Text extraction (PyMuPDF)
      │
      ▼
  Chunking (LangChain TextSplitter)
      │
      ▼
  Embedding (text-embedding-3-small · OpenAI)
      │
      ▼
  Vector storage (ChromaDB)
      │
      ▼
  User query (natural language)
      │
      ├─► Embed query ──► Semantic search (ChromaDB cosine similarity)
      │                         │
      │                   Top-k chunks
      │                         │
      └─────────────────► Prompt construction
                                │
                          GPT-4o-Mini (LangChain)
                                │
                          Answer → Streamlit UI
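
The chunking step in the flow above can be pictured as a sliding window with overlap, so that a sentence cut at a chunk boundary still appears whole in a neighboring chunk. The sketch below is a stdlib-only illustration, not the project's code: the app relies on a LangChain text splitter, and the function name `split_text` and the `chunk_size` and `overlap` defaults are assumptions chosen for the example.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `split_text("abcdefghij", chunk_size=4, overlap=1)` returns `['abcd', 'defg', 'ghij']`: each chunk repeats the last character of the previous one.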

Architecture

Components

Layer	Technology	Role
UI	Streamlit 1.58	Web interface and file upload
Orchestration	LangChain 0.3	RAG chain, prompt management
Vector store	ChromaDB 1.5	Semantic storage and retrieval
Embeddings	`text-embedding-3-small` (OpenAI)	Document and query vectorization
LLM	GPT-4o-Mini (OpenAI)	Answer generation
PDF parsing	PyMuPDF 1.24	Text extraction from PDF files

text-embedding-3-small replaces ChromaDB's default (all-MiniLM-L6-v2) for better semantic quality, especially across mixed-language content.
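
To make the roles of the embedding model and the vector store concrete, here is a small sketch of cosine-similarity ranking over already-computed vectors. In the app this search is performed by ChromaDB; the functions `cosine_similarity` and `top_k` and the two-dimensional toy vectors are assumptions for illustration only (real embedding vectors have far more dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          chunk_vecs: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: hypothetical chunk ids mapped to made-up 2-D vectors.
vectors = {"cv_a": [1.0, 0.0], "cv_b": [0.0, 1.0], "cv_c": [1.0, 1.0]}
best = top_k([1.0, 0.1], vectors, k=2)
```

With these toy vectors, `best` ranks `cv_a` first and `cv_c` second, because the query points almost entirely along the first axis.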

Why GPT-4o-Mini

Fast response times for conversational QA over retrieved context
Lower cost per token than GPT-4o or GPT-4 Turbo
Native LangChain integration
Stable OpenAI API with no additional infrastructure

Why RAG

The knowledge base is private and dynamic, so it cannot be baked into model weights. RAG gives the LLM on-demand access to documents it was never trained on, and it sends only the chunks retrieved for a query to the model, at query time.
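
The idea that the model sees only retrieved context can be sketched as a prompt builder. This is a minimal illustration under stated assumptions: the function name `build_prompt`, the instruction wording, and the `max_chars` budget are hypothetical and do not reproduce the project's actual prompt template.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Assemble a prompt from the retrieved chunks and the user question."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop adding context once the character budget would be exceeded.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each claim, which can help a recruiter trace an answer back to a specific CV.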

Features

Upload PDFs of up to 200 MB; files must contain a text layer (digitally generated or already OCR-processed)
Semantic search across your document collection — finds relevant content even without exact keyword matches
Conversational interface — ask follow-up questions in the same session
Pre-loaded dataset — the demo includes synthetic CVs so you can try it immediately without uploading anything
Private by design — documents stay in your vector store; the LLM only sees retrieved chunks
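
Follow-up questions depend on keeping earlier turns available within a session. How the app stores them is not shown here, so the following is only a sketch: the class `ChatHistory` and its `max_turns` parameter are hypothetical, and in a Streamlit app such an object would typically be kept in `st.session_state`.

```python
from collections import deque


class ChatHistory:
    """Keep the most recent chat turns so follow-ups have context."""

    def __init__(self, max_turns: int = 5):
        # One turn is a user message plus an assistant message.
        self._messages = deque(maxlen=2 * max_turns)

    def add(self, role: str, content: str) -> None:
        if role not in ("user", "assistant"):
            raise ValueError("role must be 'user' or 'assistant'")
        # When the deque is full, the oldest message is dropped automatically.
        self._messages.append({"role": role, "content": content})

    def as_messages(self) -> list[dict[str, str]]:
        """Return the retained messages, oldest first."""
        return list(self._messages)
```

Bounding the history keeps the prompt size predictable, at the cost of forgetting the oldest turns in a long session.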

Demo Usage

The live demo on HuggingFace requires only an OpenAI API Key.

Example questions to try with the pre-loaded CV dataset:

Who has experience with Python and machine learning?
Find candidates who have worked in startups or early-stage companies.
Who has the most experience in technical leadership roles?
Is there anyone with a background in both data engineering and backend development?
Which candidates mention experience with cloud infrastructure?

Local Setup

Requirements: Python 3.11 and an OpenAI API key

# 1. Clone
git clone https://github.com/maclenn77/pdf-explainer.git
cd pdf-explainer

# 2. Create environment file
touch .env

Add your key to .env:

OPENAI_API_KEY=your_key_here

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run
streamlit run GnosisPages.py
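
If the app starts but cannot find the key, it can help to check how `OPENAI_API_KEY` would be resolved from the `.env` file created in step 2. The sketch below uses only the standard library; the helper names `load_env_file` and `get_openai_key` are hypothetical, and the project itself may load the file differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY", "")
    if not key or key == "your_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return key
```

Rejecting the `your_key_here` placeholder catches the common case where the template line was copied but never edited.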

Deployment

The repo includes three GitHub Actions workflows. The checks run on every PR, and the deploy workflow runs automatically on merge to main:

Workflow	What it does
Check file size	Blocks merges with files above HuggingFace's size limit
Check lints	Runs `pylint` on the codebase
Deploy to HuggingFace	Pushes the latest `main` to the HuggingFace Space
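
The file size check can be reproduced locally before opening a PR. The workflow's own implementation is not shown here, so this is a sketch under assumptions: the function name `find_large_files` is hypothetical, and the limit is left as a parameter because the exact value enforced by the workflow is not stated.

```python
import os


def find_large_files(root: str, limit_bytes: int) -> list[tuple[str, int]]:
    """List (relative path, size) for files under root larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git internals; they are not part of the pushed working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)
```

An empty result means no file exceeds the chosen limit; otherwise each offending path is reported with its size in bytes.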

To deploy your own fork, add these secrets in your repository settings:

HF_TOKEN — your HuggingFace access token
HF_USERNAME — your HuggingFace username

Project Structure

pdf-explainer/
├── GnosisPages.py          # App entry point
├── gnosis/
│   ├── chroma_client.py    # ChromaDB wrapper
│   ├── settings.py         # Collection bootstrap (loads pre-built DB)
│   ├── gui_messages.py     # UI copy
│   └── components/
│       ├── sidebar.py      # File upload and DB controls
│       └── main.py         # Chat interface and RAG chain
├── pages/                  # Additional Streamlit pages
├── requirements.txt
├── Dockerfile
└── .github/workflows/      # CI/CD

License

MIT — see LICENSE for details.
