Agentic Graph RAG vs Traditional RAG

A demo comparing graph-based RAG (using AtomicRAG) against traditional vector-based RAG through a LangGraph-powered agentic system. The project shows that graph-structured knowledge retrieval outperforms flat vector search on multi-hop reasoning tasks (80.9% vs. 70.0% Answer Correctness on HotpotQA), while the two approaches perform comparably on single-hop lookups.

Key Results

HotpotQA Multi-Hop Reasoning (500 questions)

Metric	Traditional RAG	Graph RAG	Delta
Answer Correctness	70.0%	80.9%	+10.9%
Wins	27	103	—
Ties	370	370	—

Graph RAG wins 3.8x as often as Traditional RAG (103 vs. 27 wins, with 370 ties) on multi-hop questions that require connecting facts across multiple documents. The +10.9 percentage-point improvement in Answer Correctness is measured over 500 bridge-type questions.
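
The headline figures can be re-derived from the table. The short check below uses only the numbers reported above; the variable names are illustrative and not part of the project.

```python
# Figures copied from the HotpotQA results table above.
trad_wins, graph_wins, ties = 27, 103, 370
trad_acc, graph_acc = 70.0, 80.9

# Wins and ties should account for every evaluated question.
total_questions = trad_wins + graph_wins + ties

# How many times as often Graph RAG wins compared with Traditional RAG.
win_ratio = graph_wins / trad_wins

# Difference in Answer Correctness, in percentage points.
acc_delta = round(graph_acc - trad_acc, 1)

print(total_questions)      # 500
print(round(win_ratio, 1))  # 3.8
print(acc_delta)            # 10.9
```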

Medical Dataset (200 questions, 6 metrics)

Metric	Traditional RAG	Graph RAG	Notes
Answer Correctness	54.2%	52.8%	Roughly tied (Traditional RAG +1.4%), single-hop queries
Ctx Summarize ACC	54.6%	59.3%	+4.7% Graph RAG
Creative Gen ACC	36.3%	40.4%	+4.0% Graph RAG

The medical dataset (mostly single-hop factual lookups) shows comparable Answer Correctness for both systems, while Graph RAG edges ahead on contextual summarization and creative generation. The advantage of graph traversal is largest on multi-hop datasets like HotpotQA.

Architecture

┌─────────────────────────────────────────────────────────┐
│                   LangGraph Agent                         │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Query   │→ │  Traversal   │→ │    Retrieval     │  │
│  │ Analyzer │  │   Planner    │  │  (mode-based)    │  │
│  └──────────┘  └──────────────┘  └────────┬─────────┘  │
│                                            │             │
│                          ┌─────────────────┼──────────┐ │
│                          ▼                 ▼          │ │
│               ┌──────────────────┐ ┌──────────────┐  │ │
│               │  Traditional RAG │ │  Graph RAG   │  │ │
│               │  (Vector Store)  │ │  (AtomicRAG) │  │ │
│               └──────────────────┘ └──────────────┘  │ │
│                          │                 │          │ │
│                          ▼                 ▼          │ │
│               ┌──────────────────────────────────┐   │ │
│               │       Answer Generator           │   │ │
│               └──────────────────────────────────┘   │ │
└─────────────────────────────────────────────────────────┘
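
The flow in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the real agent is built with LangGraph and calls an LLM at each step, whereas the state fields, function names, and the keyword heuristic below are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, TypedDict


class AgentState(TypedDict, total=False):
    """Hypothetical agent state; the field names are assumptions."""
    question: str
    mode: str             # "traditional" or "graph"
    traversal_depth: int  # set by the planner
    contexts: List[str]   # filled by the retriever
    answer: str           # filled by the generator


# A retriever takes (question, traversal_depth) and returns passages.
Retriever = Callable[[str, int], List[str]]


def analyze_and_plan(state: AgentState) -> AgentState:
    # Stand-in for the Query Analyzer and Traversal Planner: a relative
    # pronoun is treated as a crude hint that the question is multi-hop.
    words = state["question"].lower().split()
    multi_hop = any(w in words for w in ("who", "that", "which"))
    return {**state, "traversal_depth": 2 if multi_hop else 1}


def retrieve(state: AgentState, retrievers: Dict[str, Retriever]) -> AgentState:
    # Mode-based retrieval: dispatch to the vector store or the graph.
    retriever = retrievers[state["mode"]]
    contexts = retriever(state["question"], state["traversal_depth"])
    return {**state, "contexts": contexts}


def generate_answer(state: AgentState) -> AgentState:
    # Stand-in for the LLM Answer Generator: just joins the contexts.
    return {**state, "answer": " | ".join(state["contexts"])}


def run_pipeline(question: str, mode: str,
                 retrievers: Dict[str, Retriever]) -> AgentState:
    state: AgentState = {"question": question, "mode": mode}
    state = analyze_and_plan(state)
    state = retrieve(state, retrievers)
    return generate_answer(state)


# Stub retrievers so the sketch runs without any external service.
retrievers: Dict[str, Retriever] = {
    "traditional": lambda q, depth: ["vector context"],
    "graph": lambda q, depth: [f"graph context (depth={depth})"],
}

state = run_pipeline(
    "What government position was held by the woman who portrayed "
    "Corliss Archer in Kiss and Tell?",
    "graph",
    retrievers,
)
print(state["answer"])
```

Keeping the retrievers behind a common callable signature is what lets the same agent run in either mode, which is the property the comparison depends on.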

How Graph RAG Works (AtomicRAG Q-Iter Algorithm)

Entity Anchoring — Extract entities from query, find matching graph nodes
Iterative Expansion — Hop through Entity → Knowledge Unit → Entity connections
Query Updating — Subtract retrieved embeddings to find diverse information
Beam Search Pruning — Keep only top-M most relevant paths at each depth
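
The four steps above can be sketched on a toy graph. This is an illustration of the steps as described here, not AtomicRAG's implementation: the data layout, the `q_iter` function, the substring-based anchoring, and the 3-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy knowledge units (KUs): text, a made-up embedding, and linked entities.
KNOWLEDGE_UNITS = {
    "ku1": {"text": "Kiss and Tell stars Shirley Temple as Corliss Archer.",
            "vec": [1.0, 0.2, 0.0],
            "entities": {"Kiss and Tell", "Corliss Archer", "Shirley Temple"}},
    "ku2": {"text": "Shirley Temple served as Chief of Protocol.",
            "vec": [0.3, 1.0, 0.0],
            "entities": {"Shirley Temple", "Chief of Protocol"}},
    "ku3": {"text": "An unrelated fact about another film.",
            "vec": [0.0, 0.0, 1.0],
            "entities": {"Another Film"}},
}


def q_iter(query_text, query_vec, traversal_depth=2, beam_size=5,
           query_update_weight=0.5):
    # 1. Entity anchoring: entity names that literally appear in the query.
    all_entities = set().union(*(ku["entities"] for ku in KNOWLEDGE_UNITS.values()))
    frontier = {e for e in all_entities if e.lower() in query_text.lower()}

    visited, q = [], list(query_vec)
    for _ in range(traversal_depth):
        # 2. Iterative expansion (entity -> KU): unvisited KUs touching the frontier.
        candidates = [k for k, ku in KNOWLEDGE_UNITS.items()
                      if k not in visited and ku["entities"] & frontier]
        # 4. Beam search pruning: keep the top-M candidates by similarity.
        ranked = sorted(candidates,
                        key=lambda k: cosine(q, KNOWLEDGE_UNITS[k]["vec"]),
                        reverse=True)
        kept = ranked[:beam_size]
        if not kept:
            break
        visited.extend(kept)
        # 2. Iterative expansion (KU -> entity): next frontier.
        frontier = set().union(*(KNOWLEDGE_UNITS[k]["entities"] for k in kept))
        # 3. Query updating: subtract the mean retrieved embedding so the
        #    next hop favours information not yet covered.
        mean = [sum(KNOWLEDGE_UNITS[k]["vec"][i] for k in kept) / len(kept)
                for i in range(len(q))]
        q = [qi - query_update_weight * mi for qi, mi in zip(q, mean)]
    return [KNOWLEDGE_UNITS[k]["text"] for k in visited]


evidence = q_iter(
    "What government position was held by the woman who portrayed "
    "Corliss Archer in Kiss and Tell?",
    [0.8, 0.6, 0.0],
)
for line in evidence:
    print(line)
```

In this toy run the second knowledge unit is only reachable after the first hop exposes "Shirley Temple" as a new frontier entity, which is the behaviour the expansion step is meant to provide.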

The agent dynamically controls graph traversal parameters:

traversal_depth (1–3): How many hops to take
beam_size (5–15): Width of beam search
query_update_weight (0–1): Diversity vs relevance tradeoff
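
One way to keep agent-chosen values inside the ranges listed above is a small validated container. The sketch below is an assumption about how this could be done; the `TraversalParams` class and its default values are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraversalParams:
    """Hypothetical container for the three traversal parameters."""
    traversal_depth: int = 2          # allowed range: 1 to 3 hops
    beam_size: int = 10               # allowed range: 5 to 15 paths
    query_update_weight: float = 0.5  # allowed range: 0 to 1

    def __post_init__(self):
        # Reject values outside the documented ranges.
        if not 1 <= self.traversal_depth <= 3:
            raise ValueError("traversal_depth must be between 1 and 3")
        if not 5 <= self.beam_size <= 15:
            raise ValueError("beam_size must be between 5 and 15")
        if not 0.0 <= self.query_update_weight <= 1.0:
            raise ValueError("query_update_weight must be between 0 and 1")


# A deeper, wider search with a strong push toward diverse results.
params = TraversalParams(traversal_depth=3, beam_size=15, query_update_weight=0.8)
print(params)
```

Validating at construction time means a malformed plan from the LLM fails early, before any graph traversal starts.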

Project Structure

├── src/
│   ├── config.py                  # Central configuration
│   ├── agent/                     # LangGraph agentic system
│   │   ├── state.py              # Agent state (TypedDict)
│   │   ├── nodes.py             # Agent nodes (analyze, plan, retrieve, generate)
│   │   ├── graph.py             # LangGraph graph definition
│   │   └── prompts.py           # LLM prompts
│   ├── traditional_rag/          # Vector similarity retrieval
│   │   ├── chunker.py           # Document chunking
│   │   ├── vector_store.py      # Local vector store (NumPy cosine sim)
│   │   └── retriever.py         # Retrieval interface
│   ├── graph_rag/                # AtomicRAG-based retrieval
│   │   ├── loader.py            # Load pre-built knowledge graph
│   │   └── retriever.py         # Graph traversal retrieval
│   └── evaluation/               # Scoring framework
│       ├── runner.py             # Batch evaluation runner
│       └── scorer.py            # ACC metric implementation
├── scripts/
│   ├── setup_vector_store.py     # Build traditional RAG vector store
│   ├── run_evaluation.py         # Run agent on dataset questions
│   ├── run_scoring.py            # Score predictions (6 metrics)
│   ├── hotpotqa_eval_100.py      # HotpotQA 100-question benchmark
│   ├── hotpotqa_demo.py          # Quick 10-question demo
│   ├── manual_comparison.py      # Side-by-side comparison tool
│   └── test_quick.py            # Smoke test both pipelines
├── results/
│   ├── medical/                  # Medical dataset evaluation report
│   └── hotpotqa/                 # HotpotQA multi-hop results
├── .env.example                  # Environment template
├── pyproject.toml                # Python dependencies
└── docker-compose.yml            # Optional PGVector setup

Setup

Prerequisites

Python 3.11+
Google Gemini API key
atomicrag library (pip install atomicrag)

Installation

git clone https://github.com/Rahuljangs/agentic-graph-rag.git
cd agentic-graph-rag

# Install dependencies
pip install -e .

# Configure environment
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY

Build the RAG Systems

# 1. Build traditional vector store (Medical dataset)
python -m scripts.setup_vector_store

# 2. The Graph RAG uses pre-built AtomicRAG knowledge graphs
#    (or build one for HotpotQA using scripts/hotpotqa_eval_100.py)

Usage

Quick Demo (10 questions)

python -m scripts.hotpotqa_demo

Runs 10 multi-hop questions through both RAG systems and prints a comparison table.

Full Evaluation (100 questions)

python -m scripts.hotpotqa_eval_100

Builds the knowledge graph, runs 100 HotpotQA bridge questions, and produces scored results.

Medical Dataset Evaluation

# Generate predictions
python -m scripts.run_evaluation

# Score with 6 metrics
python -m scripts.run_scoring

Test Both Pipelines

python -m scripts.test_quick

Evaluation Metrics

Metric	Description	LLM-based?
Answer Correctness (ACC)	0.75×Factuality_F1 + 0.25×SemanticSim	Yes
Faithfulness	Are answer claims supported by context?	Yes
Context Relevance	Is retrieved context relevant to question?	Yes
Context Recall	Does context cover gold answer facts?	Yes
Answer Relevance	Does answer address the question?	Yes
ROUGE-L	Lexical overlap (F1 of LCS)	No
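
Two of the metrics in the table are simple enough to write out. The sketch below implements the ACC weighting exactly as stated and a ROUGE-L F1 based on the longest common subsequence (LCS). The function names and the lowercase whitespace tokenization are assumptions; the project's scorer may tokenize differently.

```python
def answer_correctness(factuality_f1, semantic_sim):
    """ACC as defined in the table: 0.75 x Factuality_F1 + 0.25 x SemanticSim."""
    return 0.75 * factuality_f1 + 0.25 * semantic_sim


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


acc = answer_correctness(0.8, 0.9)
rouge = rouge_l_f1("Chief of Protocol",
                   "Chief of Protocol of the United States")
print(round(acc, 3))    # 0.825
print(round(rouge, 3))  # 0.6
```

In the ROUGE-L example the prediction is fully contained in the reference, so precision is 1 and recall is 3/7, giving an F1 of 0.6.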

Datasets

HotpotQA (Primary — Multi-Hop)

7,405 multi-hop questions requiring reasoning across 2+ Wikipedia paragraphs
Each question has 2 supporting + 8 distractor paragraphs
Source: HotpotQA via HuggingFace
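
The supporting/distractor split can be recovered from each record by comparing paragraph titles against the supporting-fact titles. The sketch below assumes the field layout of the HuggingFace `hotpot_qa` dataset (`context` and `supporting_facts` stored as dictionaries of parallel lists); treat that layout as an assumption to verify against the dataset card. The record itself is a hand-made, abridged placeholder, not real dataset content.

```python
def split_paragraphs(record):
    """Separate supporting paragraphs from distractors by title."""
    gold_titles = set(record["supporting_facts"]["title"])
    supporting, distractors = [], []
    titles = record["context"]["title"]
    paragraphs = record["context"]["sentences"]
    for title, sentences in zip(titles, paragraphs):
        text = " ".join(sentences)
        if title in gold_titles:
            supporting.append((title, text))
        else:
            distractors.append((title, text))
    return supporting, distractors


# Hypothetical record: 2 supporting paragraphs plus 8 placeholder distractors.
distractor_titles = [f"Distractor {i}" for i in range(1, 9)]
record = {
    "question": "What government position was held by the woman who "
                "portrayed Corliss Archer in Kiss and Tell?",
    "type": "bridge",
    "supporting_facts": {"title": ["Kiss and Tell", "Shirley Temple"],
                         "sent_id": [0, 0]},
    "context": {
        "title": ["Kiss and Tell", "Shirley Temple"] + distractor_titles,
        "sentences": [["Placeholder sentence."] for _ in range(10)],
    },
}

supporting, distractors = split_paragraphs(record)
print(len(supporting), len(distractors))  # 2 8
```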

Medical (GraphRAG-Bench)

NCCN clinical guidelines corpus
4 question types: Fact Retrieval, Complex Reasoning, Contextual Summarize, Creative Generation
Source: GraphRAG-Bench

Why Graph RAG Wins on Multi-Hop Questions

Traditional RAG retrieves chunks that are individually similar to the query. For multi-hop questions like:

"What government position was held by the woman who portrayed Corliss Archer in Kiss and Tell?"

Traditional RAG finds paragraphs about "Kiss and Tell" OR "government positions" but cannot connect them. Graph RAG:

Anchors on entities: "Corliss Archer", "Kiss and Tell"
Traverses: Kiss and Tell → Shirley Temple (actress) → Chief of Protocol (position)
Returns the complete evidence chain

Result: Traditional RAG scored 12% on this question. Graph RAG scored 90%.
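
The evidence chain above can be reproduced with a breadth-first expansion over a tiny entity graph. This is a sketch of the idea only: the `GRAPH` dictionary, the relation labels, and the `expand` function are illustrative assumptions, and the real system scores and prunes paths rather than following every edge.

```python
# Toy entity graph built from the example above; relation labels are illustrative.
GRAPH = {
    "Kiss and Tell": [("stars", "Shirley Temple")],
    "Corliss Archer": [("portrayed by", "Shirley Temple")],
    "Shirley Temple": [("held position", "Chief of Protocol")],
}


def expand(anchors, hops):
    """Collect every (source, relation, target) triple within `hops` steps."""
    frontier, seen, triples = set(anchors), set(anchors), []
    for _ in range(hops):
        next_frontier = set()
        for node in sorted(frontier):  # sorted for a deterministic order
            for relation, target in GRAPH.get(node, []):
                triples.append((node, relation, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return triples


anchors = ["Corliss Archer", "Kiss and Tell"]
one_hop = expand(anchors, hops=1)
two_hops = expand(anchors, hops=2)

# One hop only reaches the actress; the answer needs the second hop.
print([target for _, _, target in one_hop])
print(two_hops[-1])
```

With a single hop the retrieved evidence stops at the actress; the position only appears once the traversal continues from the newly discovered entity, which flat similarity search has no mechanism to do.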

Tech Stack

LLM: Google Gemini 2.5 Flash
Embeddings: Gemini Embedding 001
Agent Framework: LangGraph
Graph RAG: AtomicRAG (Q-Iter algorithm)
Vector Store: Local NumPy-based (cosine similarity)
Evaluation: LLM-as-judge (Gemini) + embedding similarity
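
For reference, cosine-similarity retrieval over a NumPy matrix, as used by the local vector store listed above, fits in a few lines. This is a generic sketch with made-up 2-dimensional vectors, not the code in vector_store.py; the `top_k_cosine` name is an assumption.

```python
import numpy as np


def top_k_cosine(query_vec, chunk_matrix, k=3):
    """Return (row index, cosine score) for the k most similar chunks."""
    # Normalize the query and each chunk row to unit length.
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    # For unit vectors, the dot product equals cosine similarity.
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Three toy chunk embeddings and one query embedding.
chunks = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([1.0, 0.1])

results = top_k_cosine(query, chunks, k=2)
print([index for index, _ in results])  # [0, 1]
```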

License

MIT
