Cybersecurity RAG Agent

A local Retrieval-Augmented Generation (RAG) system for generating actionable cybersecurity guidelines for practitioners. The system combines:

a curated knowledge base from trusted cybersecurity sources,
a local open-source LLM for response generation,
an optional image-to-text / vision-assisted input path,
and a RAGAS evaluation layer for reference-free assessment.

Quick setup and run

Clone this repo.
Install Ollama and pull the model by running ollama run gemma3:12b-it-qat
Install npm and the required dependencies (the backend's Python packages are listed in backend/requirements.txt).
Run the 3 notebooks in the notebooks folder; they set up the vector database and the scraped data.
Open 3 terminals in parallel at the base directory of the repo and run one of the following commands in each terminal.

ollama serve

python backend/app.py

cd frontend
npm run dev

To open the frontend, go to http://localhost:5173/. You can use it like any other chatbot.
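
Before opening the UI, it can help to confirm that all three processes are listening. The following is a minimal sketch, not part of the repo: it assumes Ollama is on its default port 11434 and the Vite dev server on 5173 (as in the URL above). The backend port 5000 is an assumption (Flask's default); adjust it to whatever backend/app.py actually uses.

```python
import socket

# Assumed ports: 11434 is Ollama's default and 5173 matches the frontend URL.
# 5000 is only Flask's default and may differ from what backend/app.py uses.
SERVICES = {
    "ollama": ("localhost", 11434),
    "backend": ("localhost", 5000),
    "frontend": ("localhost", 5173),
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}
```

Calling check_services() returns a dictionary of service names to booleans, so a False entry points at the terminal that still needs to be started.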

1. Repository Structure

.
├── backend/
│   ├── logs/
│   │   └── chat_logs.jsonl              # Local storage for raw conversational inputs/outputs
│   ├── app.py                           # Core backend API server (Flask)
│   ├── config.py                        # Environment variables, Qdrant URLs, and model settings
│   ├── generation.py                    # LLM orchestration and system prompting
│   ├── qdrant_manager.py                # Collection management, schema definitions, and vector upserts
│   ├── query_rewrite.py                 # Query expansion/de-contextualization logic
│   ├── requirements.txt                 # Backend Python dependencies
│   ├── retrieval.py                     # Vector similarity search and filtering logic
│   ├── test_embedding.py                # Unit test for checking vector embedding generation
│   ├── test_generate.py                 # Unit test for isolating LLM output quality
│   ├── test_qdrant.py                   # Integration test for Qdrant connection stability
│   ├── test_retrieve.py                 # Unit test verifying vector search retrieval quality
│   ├── utils.py                         # Shared helper functions (text sanitization, logging configurations)
│   └── vision.py                        # Logic for processing multimodal/image inputs (if applicable)
│
├── eval_outputs/                        # Output destination for evaluation frameworks (e.g., RAGAS)
│   ├── final_metrics.txt                # Aggregated summary scores (Faithfulness, Relevancy, etc.)
│   ├── rows.json                        # Raw JSON dump of individual test query results
│   └── scores.csv                       # Tabular matrix of evaluation runs for easy visualization
│
├── frontend/                            # React + Vite frontend application
│   ├── public/
│   │   ├── favicon.svg                  # Browser tab icon
│   │   └── icons.svg                    # Shared UI icons (SVG spritesheet)
│   ├── src/
│   │   ├── components/                  # Reusable UI elements
│   │   │   ├── ChatBox.jsx              # Main chat interface wrapper
│   │   │   ├── CollectionSelector.jsx   # UI component to toggle between vector datasets
│   │   │   └── Message.jsx              # Renders individual user/assistant chat bubbles
│   │   ├── api.js                       # Axios/Fetch wrappers to talk to the backend API
│   │   ├── App.jsx                      # Main application view layout
│   │   ├── index.css                    # Tailwind baseline configurations
│   │   └── main.jsx                     # React DOM entry point
│   ├── .gitignore                       # Frontend-specific git exclusions (node_modules, dist)
│   ├── eslint.config.js                 # Linter configuration for code quality
│   ├── index.html                       # Base HTML shell
│   ├── package.json                     # Frontend dependencies and run scripts
│   ├── postcss.config.js                # CSS post-processing configuration (Tailwind requirement)
│   ├── README.md                        # Frontend documentation
│   ├── repomix-output.xml               # Codebase packing output for LLM context ingestion
│   ├── tailwind.config.js               # Design system specifications (themes, colors, spacing)
│   └── vite.config.js                   # Vite bundler options and proxy server settings
│
├── notebooks/                           # Data engineering and corpus parsing phase
│   ├── convert_hf_cyberqaa_to_vector.ipynb   # Process HuggingFace CyberQA datasets into vectors
│   ├── convert_hf_secqa_to_vector.ipynb      # Process security QA datasets into vectors
│   └── convert_kaggle_cyberqa_to_vector.ipynb # Process Kaggle cyber infrastructure data into vectors
│
├── scripts/                             # Document ETL pipeline (Extraction, Transformation, Loading)
│   ├── chunk_cisa.py                    # Implements text splitting (e.g., recursive character or semantic chunking)
│   ├── discover_cisa_links.py           # Web crawler targeting specific cyber advisory URLs
│   ├── index_qdrant.py                  # Pushes finalized embeddings to production vector store
│   └── scrape_cisa.py                   # Extracts text content from identified CISA advisory pages
│
├── .gitignore                           # Root-level git rules (ignores virtual environments, sensitive keys)
├── genAI_logs.txt                       # Development logging or standard output tracking for model calls
├── ragas_eval_local_ollama.py           # Script using RAGAS and Ollama to evaluate the local pipeline offline
└── README.md                            # Main project layout, architecture blueprint, and setup guide

2. Assignment goal

build a real RAG pipeline and chat interface
use trustworthy cybersecurity sources
support retrieval, generation, and evaluation
analyse failure cases and trustworthiness issues
keep everything runnable locally for a live demo
provide RAGAS evaluation and code for everything mentioned above

3. What the system does and justification of design choices

Given a practitioner-style query such as:

“How should we respond to a spear-phishing attempt in a mid-sized enterprise?”

the system can:

accept the user query in a GPT-style chat interface
optionally accept an image upload
optionally rewrite the query for better retrieval
retrieve the top relevant chunks from selected data sources
generate concise cybersecurity guidance using a local LLM
return structured citations for the retrieved context
display source cards so the user can inspect evidence

The interface also supports source selection so the user can choose between:

static pre-indexed collections built from the QA datasets,
a web-scraped source (CISA),
or a combination of both.
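
When several sources are selected, the per-collection results have to be combined into one ranked list. The sketch below shows one simple way to do that, assuming each hit is a dictionary with a score field; the function name, the hit layout, and the collection name cisa_chunks are hypothetical and are not taken from backend/retrieval.py.

```python
import heapq
from typing import Dict, List


def merge_top_k(hits_by_collection: Dict[str, List[dict]], k: int = 5) -> List[dict]:
    """Merge per-collection hits and keep the k highest-scoring chunks.

    Each hit is assumed to carry a numeric "score"; the source collection
    is attached so the UI can show where a chunk came from.
    """
    merged = [
        {**hit, "collection": name}
        for name, hits in hits_by_collection.items()
        for hit in hits
    ]
    # nlargest returns the results sorted from highest to lowest score.
    return heapq.nlargest(k, merged, key=lambda h: h["score"])
```

Scores are only comparable across collections when all of them were embedded with the same model and searched with the same distance metric.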

Why Gemma 3 12B?

local deployment,
privacy-preserving,
avoids API dependency,
sufficient reasoning quality,
feasible on an RTX 3090.

Why BGE-large?

strong retrieval benchmark performance,
optimized for semantic similarity,
widely used in RAG literature.

Why Qdrant?

lightweight local vector DB,
reproducibility,
hybrid-ready,
low operational overhead.

Why cybersecurity QA datasets?

domain grounding,
benchmarkability,
structured practitioner language.

Results

Evaluation on the hf_qaa_chunks collection (100 samples) produced the following results:

Metric	Score
Faithfulness	0.968
Answer Relevancy	0.812
Context Precision	0.980

Empirical Evaluation of the RAGAS Eval Output

1. Metric Distribution & Performance Overview

Metric	Score Range	Performance Profile	Architectural Implication
Context Precision	$0.83 - 1.00$	Excellent / Highly Stable	Top-tier Retriever performance; minimal noise in top $K$ chunks.
Answer Relevancy	$0.53 - 0.96$ (with $0$ anomalies)	Moderate / Variable	Generator addresses user intent but introduces formatting/verbosity variances.
Faithfulness	$0.66 - 1.00$	Sub-optimal / Stepped	Systematic generation drift; minor hallucinations or ungrounded synthesis.

2. Key Insights & Discussion Points for Publication

Insight A: The Generation-Retrieval Gap (Retriever Efficiency vs. Generator Drift)

The evaluation exposes a classic RAG bottleneck: near-perfect retrieval paired with generation that is not always grounded in the retrieved context.

The Data: Context Precision stays between $0.83$ and $1.00$, indicating that the embedding model and vector search retrieve and rank the context required to answer the queries.
The Phenomenon: Despite receiving relevant context, the Generator (LLM) does not stay fully faithful to it, with Faithfulness dropping to $0.66$ on multiple samples (e.g., WannaCry, Cryptographic Implementations, CTAS Design Phase queries).
Paper Takeaway: This gap suggests that further optimizing the embedding strategy yields diminishing returns unless the LLM system prompt enforces strict grounding in the retrieved context. In this architecture, the bottleneck appears to sit in the generation phase rather than in retrieval.

Insight B: Quantized Faithfulness Scores ($0.66\overline{6}$) Signaling Structural Over-Synthesis

A striking pattern in the dataset is the recurrence of an identical Faithfulness score: $0.6666666667$.

Statistical Root Cause: RAGAS calculates faithfulness as the fraction of statements in the generated answer that can be inferred from the retrieved context ($F = \frac{|\text{Statements}_{\text{grounded}}|}{|\text{Statements}_{\text{total}}|}$). A repeating score of exactly $\frac{2}{3}$ across diverse queries implies a systematic structural behavior.
Architectural Insight: The LLM prompt template likely encourages a three-point or three-sentence output structure by default. If the retrieved context explicitly supports only two of those points, while the third is derived from the LLM’s parametric "common sense" or linguistic smoothing, the score drops by exactly one third ($33.3\%$) for over-synthesis.
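
The arithmetic behind the recurring score can be checked with a few lines. This is a worked example of the ratio only, not the RAGAS implementation: in RAGAS the per-statement verdicts come from an LLM judge, whereas here they are supplied by hand.

```python
from fractions import Fraction
from typing import Sequence


def faithfulness(supported: Sequence[bool]) -> Fraction:
    """Fraction of answer statements that are supported by the context."""
    if not supported:
        raise ValueError("at least one statement is required")
    return Fraction(sum(supported), len(supported))


# Three statements in the answer, two of them supported by the context.
score = faithfulness([True, True, False])
print(score, round(float(score), 10))
```

A three-statement answer can only score 0, 1/3, 2/3, or 1, which is why the metric looks stepped rather than continuous.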

Insight C: Syntactic and Encoding Artifacts Triggering False Negative Relevancy Drops

The dataset shows severe drops in Answer Relevancy down to $0.00$ for highly specific queries (e.g., Future Cryptography Research and Cyber Essentials Third-Party Devices), despite the generated response aligning closely with the reference text.

The Root Cause: Closer inspection of the logs reveals severe text encoding corruption (e.g., Cyber Essentialsâ\x80\x99, Ã¢â\x82¬â\x84¢s, Ã¢â\x82¬Ë\x9cappropriateÃ¢â\x82¬â\x84¢). RAGAS Answer Relevancy is embedding-based: it compares the original question with questions regenerated from the generated answer, so corrupted characters in the answer can distort the score.
Paper Takeaway: Encoding noise and the tokenization failures it causes can drastically skew automated evaluation metrics. In a production evaluation pipeline, a strict string-cleaning preprocessing layer (UTF-8 normalization) is needed to prevent evaluation artifacts from distorting LLM performance metrics.
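
A cleaning layer of this kind could look like the sketch below. It assumes the corruption comes from UTF-8 bytes that were decoded as Latin-1 one or more times, which matches the first example above (â\x80\x99 for a right single quote). Text damaged through Windows-1252 or other mixed encodings would need a more thorough tool, and the function name is hypothetical.

```python
import unicodedata


def repair_mojibake(text: str, max_rounds: int = 3) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1, then normalize to NFC.

    Each round re-encodes as Latin-1 and decodes as UTF-8. If either step
    fails, the text is assumed to be clean already and is left unchanged.
    """
    for _ in range(max_rounds):
        try:
            fixed = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break
        if fixed == text:
            break
        text = fixed
    return unicodedata.normalize("NFC", text)


corrupted = "Cyber Essentials\u00e2\u0080\u0099"  # logged form of a right single quote
print(repair_mojibake(corrupted))
```

Applying the same function to questions, answers, and retrieved contexts before scoring keeps all three inputs in a consistent encoding.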

3. Drawbacks and Limitations

While our retrieval architecture achieves highly precise document ranking (Context Precision $\ge 0.83$), the generative component exhibits a distinct vulnerability to descriptive elaboration. The systematic convergence at a 0.66 Faithfulness threshold indicates that the LLM injects non-contextual smoothing tokens to fulfill stylistic constraints. This highlights a lack of deterministic constraint satisfaction within the current prompt engineering framework.

How to Run Evaluation

python ragas_eval_local_ollama.py \
--dataset /home/nam/projects/sid/RAG-Assignment3/data/hf_cyberqaa/processed/hf_cyberqaa_chunks.json \
--collection hf_qaa_chunks

Notes

Ensure the Flask backend is running before executing evaluation.
Ensure Qdrant contains the correct collection (hf_qaa_chunks).
Evaluation depends entirely on retrieval quality; if no contexts are returned, the sample is skipped.
All evaluation is performed locally without external API calls.

4. Related Work and Literature Review

Retrieval-Augmented Generation (RAG) systems combine information retrieval techniques with Large Language Models (LLMs) to generate responses grounded in external knowledge sources rather than relying solely on the model’s internal parameters. RAG was formally introduced by Lewis et al. (2020), who incorporated retrieved documents into the generation process to improve factual accuracy and reduce hallucination. Since then, RAG architectures have been widely adopted in domains requiring trustworthy and explainable outputs, including healthcare, legal systems, and cybersecurity.

In cybersecurity, trustworthy and evidence-grounded responses are especially important because inaccurate or hallucinated recommendations can introduce operational risks. Several studies have shown that general-purpose LLMs may produce technically plausible but incorrect security advice when unsupported by reliable evidence. As a result, retrieval-based grounding has become an important strategy for improving reliability in AI-assisted cybersecurity systems.

Recent literature highlights three major components that influence RAG performance:

Retrieval Quality: The retriever determines which documents are supplied to the LLM. Dense retrieval methods using embedding models such as BGE and Sentence Transformers have demonstrated strong semantic search performance compared to traditional keyword-based retrieval approaches. Vector databases such as Qdrant enable efficient similarity search over embedded cybersecurity documents and are commonly used in modern RAG pipelines.
Generation Faithfulness: Research has shown that LLMs frequently hallucinate when retrieved context is weak or incomplete. Prompt engineering and citation-aware generation are therefore commonly used to constrain model behaviour and improve faithfulness. Local open-source models such as Google Gemma have become increasingly popular for privacy-preserving deployments because they avoid transmitting potentially sensitive organisational data to external APIs.
Evaluation of RAG Systems: Traditional NLP evaluation methods rely on reference answers, which are often unavailable in real-world enterprise settings. The RAGAS framework introduced automated reference-free evaluation metrics specifically for RAG systems, including:
- Context Relevance — whether retrieved documents are relevant to the query,
- Answer Relevance — whether the generated response addresses the user’s request,
- Faithfulness — whether the answer is grounded in retrieved evidence rather than hallucinated content.

This project builds upon these ideas by designing a cybersecurity-focused RAG assistant that retrieves guidance from curated security resources and generates practitioner-oriented recommendations. Unlike generic chatbot systems, this implementation emphasises:

local deployment,
retrieval transparency through citations,
source-aware generation,
and evaluation using RAGAS metrics.

The system also explores multimodal retrieval by supporting optional image-based inputs, allowing screenshots or security-related images to be incorporated into the retrieval pipeline. This extends the traditional text-only RAG architecture toward more practical cybersecurity workflows.

Key References

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey.
Hugging Face documentation on embedding models and retrieval pipelines.
Qdrant documentation on vector similarity search and semantic retrieval.

5. Highlights of the Project

Retrieval: a benchmarkable QA dataset as the main source, plus additional data sources for variation and testing.
Source transparency: the user chooses what the model is allowed to search.
Image-assisted input: screenshots or images can be described and folded into the query.
Citation-first response design: the UI exposes retrieved sources and relevance scores.
Local deployment: the full system can run on your own machine without external hosting.

6. Current feature set

Frontend

GPT-style chat interface
sidebar for source selection
source counts and active-state indicators
image upload
citations panel with clickable retrieval previews
assistant metadata for rewritten queries and vision output
loading state and typing indicator

Backend

/collections to list available knowledge collections
/rewrite to rewrite a query
/chat to run the full pipeline
/vision to process an uploaded image separately
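
A client call to the /chat endpoint might be built as in the sketch below. The endpoint path comes from the list above; the base URL (Flask's default port) and the JSON field names query, collections, and rewrite are assumptions for illustration and should be checked against backend/app.py.

```python
import json
from typing import List
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # assumption: Flask's default port


def build_chat_request(
    query: str,
    collections: List[str],
    rewrite: bool = True,
    base_url: str = BASE_URL,
) -> Request:
    """Build (but do not send) a POST request for the /chat endpoint."""
    # Hypothetical payload layout; the real field names may differ.
    payload = {"query": query, "collections": collections, "rewrite": rewrite}
    return Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(req) once the backend is running, and the JSON body of the response decoded with json.loads.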

Pipeline

optional query rewrite
optional image understanding
retrieval over selected collections
generation with local LLM
structured citation return
logging of query / retrieval / answer traces

7. High-level architecture

User Query / Image
        ↓
React Chat UI
        ↓
Flask API
        ↓
[Optional query rewrite]
        ↓
[Optional vision / OCR step]
        ↓
Retriever over selected collections
        ↓
Top-k chunk selection
        ↓
Local LLM generation
        ↓
Structured answer + citations
        ↓
UI shows answer, sources, and previews
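
The flow in the diagram can be sketched as a single orchestration function. This is an illustrative outline under stated assumptions, not the code in backend/app.py: the stage functions are passed in as callables, and every name and return key here is hypothetical.

```python
from typing import Callable, List, Optional


def run_pipeline(
    query: str,
    collections: List[str],
    retrieve: Callable[[str, List[str], int], List[dict]],
    generate: Callable[[str, List[dict]], str],
    rewrite: Optional[Callable[[str], str]] = None,
    describe_image: Optional[Callable[[bytes], str]] = None,
    image: Optional[bytes] = None,
    top_k: int = 5,
) -> dict:
    """Run rewrite -> vision -> retrieval -> generation and return a trace."""
    # Optional query rewrite; fall back to the raw query.
    rewritten = rewrite(query) if rewrite else None
    search_query = rewritten or query

    # Optional vision step; its text is merged into the retrieval query.
    vision_text = None
    if describe_image is not None and image is not None:
        vision_text = describe_image(image)
    if vision_text:
        search_query = f"{search_query}\n{vision_text}"

    chunks = retrieve(search_query, collections, top_k)
    answer = generate(query, chunks)
    return {
        "answer": answer,
        "citations": chunks,
        "rewritten_query": rewritten,
        "vision_output": vision_text,
    }
```

Passing the stages as arguments keeps each step replaceable, so retrieval or generation can be stubbed out when testing the others in isolation.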

8. Tech stack

Frontend

React
Vite
Tailwind CSS
Axios

Backend

Flask
Flask-CORS
Python

RAG / Retrieval

local embeddings and vector retrieval
collection-based source filtering
optional query rewriting

Generation

local open-source LLM (Gemma 3 12B via Ollama) served on an RTX 3090

Optional vision path

local image understanding / OCR-style preprocessing
vision output merged into the query before retrieval

Evaluation

RAGAS-style metrics
context precision
answer relevancy
faithfulness
retrieval quality analysis

9. Evaluation plan

RAGAS Evaluation (Local RAG System)

We evaluate the performance of the retrieval-augmented generation (RAG) pipeline using RAGAS (Retrieval-Augmented Generation Assessment) on a static benchmark derived from the HuggingFace cybersecurity QA dataset.

Evaluation Setup

The evaluation is performed on a fixed subset of 100 randomly sampled QA pairs from the dataset stored in hf_qaa_chunks. Each entry contains:

A natural language question
A reference ground-truth answer
Associated topic metadata
Source evidence passages (used only for dataset construction, so they do not leak into evaluation)

The evaluation is fully reproducible using a fixed random seed.
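
Seeded sampling of this kind can be done with the standard library alone. The sketch below is an assumption about how such a subset could be drawn, not the code in ragas_eval_local_ollama.py; the seed value 42 and the function name are placeholders.

```python
import random
from typing import List


def sample_benchmark(rows: List[dict], n: int = 100, seed: int = 42) -> List[dict]:
    """Draw n distinct rows; the same seed always yields the same subset."""
    if n > len(rows):
        raise ValueError(f"requested {n} samples from only {len(rows)} rows")
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(rows, n)
```

The subset stays identical across runs only as long as the input rows are loaded in the same order.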

Evaluation Pipeline

The evaluation script (ragas_eval_local_ollama.py) follows this workflow:

Dataset Loading
- Loads the HuggingFace-formatted JSON dataset
- Extracts structured fields: question, answer, evidence, and metadata
Query Execution (RAG Pipeline)
- Each question is sent to the local Flask /chat endpoint
- The backend handles:
  - Query rewriting (optional)
  - Embedding-based retrieval via Qdrant
  - Context injection into the LLM (Gemma 3 via Ollama)
- The response includes:
  - Generated answer
  - Retrieved context chunks (citations from Qdrant)
Context Extraction
- Retrieved contexts are extracted from backend citations (preview fields)
- Only actual retrieved documents are used (no ground-truth leakage)
RAGAS Evaluation
- The following metrics are computed:
  - Faithfulness → measures whether the answer is supported by retrieved context
  - Answer Relevancy → measures semantic alignment between question and answer
  - Context Precision → measures how relevant retrieved chunks are to the query
- Evaluation uses:
  - Local Ollama LLM (Gemma 3 12B IT QAT) for scoring
  - Local HuggingFace embedding model (BAAI/bge-large-en-v1.5) for semantic similarity
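
The context extraction step and the rule that samples without contexts are skipped can be sketched as follows. The response keys answer, citations, and preview follow the description above but are assumptions about the exact JSON layout, and the output keys are illustrative: they would need to be mapped to the column names expected by the installed RAGAS version.

```python
from typing import List, Optional


def extract_contexts(response: dict) -> List[str]:
    """Collect non-empty preview texts from the backend citations."""
    return [c["preview"] for c in response.get("citations", []) if c.get("preview")]


def build_eval_row(
    question: str, response: dict, reference: Optional[str] = None
) -> Optional[dict]:
    """Return one evaluation row, or None if nothing was retrieved."""
    contexts = extract_contexts(response)
    if not contexts:
        return None  # sample is skipped when no contexts are returned
    row = {
        "question": question,
        "answer": response.get("answer", ""),
        "contexts": contexts,
    }
    if reference is not None:
        row["ground_truth"] = reference
    return row
```

Only text returned by the backend ends up in contexts, so the dataset's evidence passages never reach the scorer.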
