A local Retrieval-Augmented Generation (RAG) system for generating actionable cybersecurity guidelines for practitioners. The system combines:
- a curated knowledge base from trusted cybersecurity sources,
- a local open-source LLM for response generation,
- an optional image-to-text / vision-assisted input path,
- and a RAGAS evaluation layer for reference-free assessment.
- Clone this repo.
- Install Ollama and run `ollama run gemma3:12b-it-qat`.
- Install npm and the required dependencies.
- In the `notebooks` folder you will find 3 notebooks that set up the vector DB and the scraped data.
- Open 3 terminals in parallel and, from the base directory of the repo, run one of the following in each:
  - `ollama serve`
  - `python backend/app.py`
  - `cd frontend` followed by `npm run dev`
- To open the frontend, go to http://localhost:5173/. You can use it like any other chatbot.
.
├── backend/
│ ├── logs/
│ │ └── chat_logs.jsonl # Local storage for raw conversational inputs/outputs
│ ├── app.py # Core backend API server (e.g., FastAPI or Flask)
│ ├── config.py # Environment variables, Qdrant URLs, and model settings
│ ├── generation.py # LLM orchestration and system prompting
│ ├── qdrant_manager.py # Collection management, schema definitions, and vector upserts
│ ├── query_rewrite.py # Query expansion/de-contextualization logic
│ ├── requirements.txt # Backend Python dependencies
│ ├── retrieval.py # Vector similarity search and filtering logic
│ ├── test_embedding.py # Unit test for checking vector embedding generation
│ ├── test_generate.py # Unit test for isolating LLM output quality
│ ├── test_qdrant.py # Integration test for Qdrant connection stability
│ ├── test_retrieve.py # Unit test verifying vector search retrieval quality
│ ├── utils.py # Shared helper functions (text sanitization, logging configurations)
│ └── vision.py # Logic for processing multimodal/image inputs (if applicable)
│
├── eval_outputs/ # Output destination for evaluation frameworks (e.g., RAGAS)
│ ├── final_metrics.txt # Aggregated summary scores (Faithfulness, Relevancy, etc.)
│ ├── rows.json # Raw JSON dump of individual test query results
│ └── scores.csv # Tabular matrix of evaluation runs for easy visualization
│
├── frontend/ # React + Vite frontend application
│ ├── public/
│ │ ├── favicon.svg # Browser tab icon
│ │ └── icons.svg # Shared UI icons (SVG spritesheet)
│ ├── src/
│ │ ├── components/ # Reusable UI elements
│ │ │ ├── ChatBox.jsx # Main chat interface wrapper
│ │ │ ├── CollectionSelector.jsx # UI component to toggle between vector datasets
│ │ │ └── Message.jsx # Renders individual user/assistant chat bubbles
│ │ ├── api.js # Axios/Fetch wrappers to talk to the backend API
│ │ ├── App.jsx # Main application view layout
│ │ ├── index.css # Tailwind baseline configurations
│ │ └── main.jsx # React DOM entry point
│ ├── .gitignore # Frontend-specific git exclusions (node_modules, dist)
│ ├── eslint.config.js # Linter configuration for code quality
│ ├── index.html # Base HTML shell
│ ├── package.json # Frontend dependencies and run scripts
│ ├── postcss.config.js # CSS post-processing configuration (Tailwind requirement)
│ ├── README.md # Frontend documentation
│ ├── repomix-output.xml # Codebase packing output for LLM context ingestion
│ ├── tailwind.config.js # Design system specifications (themes, colors, spacing)
│ └── vite.config.js # Vite bundler options and proxy server settings
│
├── notebooks/ # Data engineering and corpus parsing phase
│ ├── convert_hf_cyberqaa_to_vector.ipynb # Process HuggingFace CyberQA datasets into vectors
│ ├── convert_hf_secqa_to_vector.ipynb # Process security QA datasets into vectors
│ └── convert_kaggle_cyberqa_to_vector.ipynb # Process Kaggle cyber infrastructure data into vectors
│
├── scripts/ # Document ETL pipeline (Extraction, Transformation, Loading)
│ ├── chunk_cisa.py # Implements text splitting (e.g., recursive character or semantic chunking)
│ ├── discover_cisa_links.py # Web crawler targeting specific cyber advisory URLs
│ ├── index_qdrant.py # Pushes finalized embeddings to production vector store
│ └── scrape_cisa.py # Extracts text content from identified CISA advisory pages
│
├── .gitignore # Root-level git rules (ignores virtual environments, sensitive keys)
├── genAI_logs.txt # Development logging or standard output tracking for model calls
├── ragas_eval_local_ollama.py # Script using RAGAS and Ollama to evaluate the local pipeline offline
└── README.md # Main project layout, architecture blueprint, and setup guide
- build a real RAG pipeline and chat interface
- use trustworthy cybersecurity sources
- support retrieval, generation, and evaluation
- analyse failure cases and trustworthiness issues
- keep everything runnable locally for a live demo
- provide RAGAS evaluation and code for everything mentioned above
Given a practitioner-style query such as:
“How should we respond to a spear-phishing attempt in a mid-sized enterprise?”
the system can:
- accept the user query in a GPT-style chat interface
- optionally accept an image upload
- optionally rewrite the query for better retrieval
- retrieve the top relevant chunks from selected data sources
- generate concise cybersecurity guidance using a local LLM
- return structured citations for the retrieved context
- display source cards so the user can inspect evidence
The interface also supports source selection so the user can choose between:
- static pre-indexed collections of both datasets and a web-scraped source (CISA)
- or a combination of both.
- local deployment,
- privacy-preserving,
- avoids API dependency,
- sufficient reasoning quality,
- feasible on RTX 3090.
- strong retrieval benchmark performance,
- optimized for semantic similarity,
- widely used in RAG literature.
- lightweight local vector DB,
- reproducibility,
- hybrid-ready,
- low operational overhead.
- domain grounding,
- benchmarkability,
- structured practitioner language.
Evaluation on the hf_qaa_chunks collection (100 samples) produced the following results:
| Metric | Score |
|---|---|
| Faithfulness | 0.968 |
| Answer Relevancy | 0.812 |
| Context Precision | 0.980 |
| Metric | Score Range | Performance Profile | Architectural Implication |
|---|---|---|---|
| Context Precision | | Excellent / Highly Stable | Top-tier retriever performance; minimal noise in the top-k results. |
| Answer Relevancy | | Moderate / Variable | Generator addresses user intent but introduces formatting/verbosity variances. |
| Faithfulness | | Sub-optimal / Stepped | Systematic generation drift; minor hallucinations or ungrounded synthesis. |
The evaluation exposes a classic RAG bottleneck: near-perfect retrieval paired with ungrounded generation.
- The Data: Context Precision consistently reaches $1.00$ or $0.83$, confirming that the embedding model and vector search retrieve and rank the context required to answer the queries.
- The Phenomenon: Despite receiving pristine context, the generator (LLM) fails to maintain full fidelity, with Faithfulness dropping to $0.66$ across multiple samples (e.g., WannaCry, Cryptographic Implementations, CTAS Design Phase queries).
- Paper Takeaway: This quantitative gap indicates that optimizing embedding strategies yields diminishing returns if the LLM system prompt fails to enforce strict information grounding. The bottleneck in this architecture lies in the generation phase.
A striking pattern in the dataset is the recurrence of an identical Faithfulness score of $0.66$:
- Statistical Root Cause: RAGAS calculates faithfulness as the ratio of generated statements that can be inferred from the context to the total number of generated statements ($F = \frac{|\text{Statements}_{\text{grounded}}|}{|\text{Statements}_{\text{total}}|}$). A repeating score of exactly $\frac{2}{3}$ across diverse queries implies a systematic structural behavior.
- Architectural Insight: The LLM prompt template likely encourages a three-point or three-sentence output structure by default. If the retrieved context explicitly validates only two of those points, while the third is derived from the LLM's parametric "common sense" or linguistic smoothing, the model is penalized a strict $33.3\%$ for over-synthesis.
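The arithmetic behind the recurring score can be reproduced in a few lines. This is a minimal sketch of the ratio only, not RAGAS's implementation: the `faithfulness` helper and the list of boolean verdicts are hypothetical stand-ins for the per-statement judgments that RAGAS obtains from an LLM.

```python
from fractions import Fraction


def faithfulness(verdicts: list[bool]) -> Fraction:
    """Return grounded statements / total statements as an exact fraction.

    `verdicts` holds one boolean per generated statement: True if the
    statement is supported by the retrieved context (hypothetical input).
    """
    if not verdicts:
        raise ValueError("at least one statement is required")
    return Fraction(sum(verdicts), len(verdicts))


# A three-statement answer where only two statements are grounded.
score = faithfulness([True, True, False])
print(score, round(float(score), 3))  # prints: 2/3 0.667
```

Any answer with exactly three extracted statements and one ungrounded statement lands on the same value, which is consistent with the stepped score distribution described above.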
The dataset shows severe drops in Answer Relevancy.
- The Root Cause: Closer inspection of the logs reveals severe text encoding corruption (e.g., `Cyber Essentialsâ\x80\x99`, `ââ\x82¬â\x84¢s`, `ââ\x82¬Ë\x9cappropriateââ\x82¬â\x84¢`). RAGAS Answer Relevancy relies on semantic embeddings of the generated output, so corrupted characters distort the similarity score.
- Paper Takeaway: String tokenization failures and encoding noise drastically skew automated evaluation metrics. In a production evaluation pipeline, a strict string-cleaning preprocessing layer (UTF-8 normalization) is mandatory to prevent evaluation artifacts from distorting LLM performance metrics.
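A string-cleaning layer of the kind described above could look like the following. It is a minimal sketch using only the standard library, and it assumes the corruption comes from UTF-8 bytes that were decoded as Latin-1 exactly once; the helper names `repair_mojibake` and `clean_text` are hypothetical and not part of the repository.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo one layer of UTF-8 bytes that were mis-decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text holds characters outside Latin-1 (already proper
        # Unicode) or the bytes are not valid UTF-8: leave it unchanged.
        return text


def clean_text(text: str) -> str:
    """Repair mojibake, then apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", repair_mojibake(text))


corrupted = "Cyber Essentials\u00e2\x80\x99"  # same pattern as in the logs
print(clean_text(corrupted) == "Cyber Essentials\u2019")  # prints: True
```

The repair is all-or-nothing per string, and doubly corrupted text (such as the second and third log examples) would need additional handling that this sketch does not attempt.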
While our retrieval architecture achieves highly precise document ranking (Context Precision
python ragas_eval_local_ollama.py \
--dataset /home/nam/projects/sid/RAG-Assignment3/data/hf_cyberqaa/processed/hf_cyberqaa_chunks.json \
--collection hf_qaa_chunks

- Ensure the Flask backend is running before executing evaluation.
- Ensure Qdrant contains the correct collection (`hf_qaa_chunks`).
- Evaluation depends entirely on retrieval quality; if no contexts are returned, the sample is skipped.
- All evaluation is performed locally without external API calls.
Retrieval-Augmented Generation (RAG) systems combine information retrieval techniques with Large Language Models (LLMs) to generate responses grounded in external knowledge sources rather than relying solely on the model’s internal parameters. RAG was formally introduced by Patrick Lewis et al. (2020), where retrieved documents were incorporated into the generation process to improve factual accuracy and reduce hallucination. Since then, RAG architectures have become widely adopted in domains requiring trustworthy and explainable outputs, including healthcare, legal systems, and cybersecurity.
In cybersecurity, trustworthy and evidence-grounded responses are especially important because inaccurate or hallucinated recommendations can introduce operational risks. Several studies have shown that general-purpose LLMs may produce technically plausible but incorrect security advice when unsupported by reliable evidence. As a result, retrieval-based grounding has become an important strategy for improving reliability in AI-assisted cybersecurity systems.
Recent literature highlights three major components that influence RAG performance:
- Retrieval Quality: The retriever determines which documents are supplied to the LLM. Dense retrieval methods using embedding models such as BGE and Sentence Transformers have demonstrated strong semantic search performance compared to traditional keyword-based retrieval approaches. Vector databases such as Qdrant enable efficient similarity search over embedded cybersecurity documents and are commonly used in modern RAG pipelines.
- Generation Faithfulness: Research has shown that LLMs frequently hallucinate when retrieved context is weak or incomplete. Prompt engineering and citation-aware generation are therefore commonly used to constrain model behaviour and improve faithfulness. Local open-source models such as Google Gemma have become increasingly popular for privacy-preserving deployments because they avoid transmitting potentially sensitive organisational data to external APIs.
- Evaluation of RAG Systems: Traditional NLP evaluation methods rely on reference answers, which are often unavailable in real-world enterprise settings. The RAGAS framework introduced automated reference-free evaluation metrics specifically for RAG systems, including:
  - Context Relevance: whether retrieved documents are relevant to the query,
  - Answer Relevance: whether the generated response addresses the user's request,
  - Faithfulness: whether the answer is grounded in retrieved evidence rather than hallucinated content.
This project builds upon these ideas by designing a cybersecurity-focused RAG assistant that retrieves guidance from curated security resources and generates practitioner-oriented recommendations. Unlike generic chatbot systems, this implementation emphasises:
- local deployment,
- retrieval transparency through citations,
- source-aware generation,
- and evaluation using RAGAS metrics.
The system also explores multimodal retrieval by supporting optional image-based inputs, allowing screenshots or security-related images to be incorporated into the retrieval pipeline. This extends the traditional text-only RAG architecture toward more practical cybersecurity workflows.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey.
- Hugging Face documentation on embedding models and retrieval pipelines.
- Qdrant documentation on vector similarity search and semantic retrieval.
- Retrieval: a benchmarkable data source plus additional data sources for variation and testing.
- Source transparency: the user chooses what the model is allowed to search.
- Image-assisted input: screenshots or images can be described and folded into the query.
- Citation-first response design: the UI exposes retrieved sources and relevance scores.
- Local deployment: the full system can run on your own machine without external hosting.
- GPT-style chat interface
- sidebar for source selection
- source counts and active-state indicators
- image upload
- citations panel with clickable retrieval previews
- assistant metadata for rewritten queries and vision output
- loading state and typing indicator
- `/collections` to list available knowledge collections
- `/rewrite` to rewrite a query
- `/chat` to run the full pipeline
- `/vision` to process an uploaded image separately
- optional query rewrite
- optional image understanding
- retrieval over selected collections
- generation with local LLM
- structured citation return
- logging of query / retrieval / answer traces
User Query / Image
↓
React Chat UI
↓
Flask API
↓
[Optional query rewrite]
↓
[Optional vision / OCR step]
↓
Retriever over selected collections
↓
Top-k chunk selection
↓
Local LLM generation
↓
Structured answer + citations
↓
UI shows answer, sources, and previews
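The flow above can be sketched as a single orchestration function. This is an illustration under stated assumptions rather than the code in `backend/app.py`: `run_pipeline`, its parameters, the stub stages, and the returned dictionary keys are all hypothetical, and it assumes the retriever returns chunks already ranked by relevance and that generation receives the original (not rewritten) query.

```python
from typing import Callable, Optional


def run_pipeline(
    query: str,
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[str, list[dict]], str],
    rewrite: Optional[Callable[[str], str]] = None,
    image_text: Optional[str] = None,
    top_k: int = 3,
) -> dict:
    """Run the stages in the order shown in the diagram."""
    # Optional query rewrite.
    search_query = rewrite(query) if rewrite else query
    # Optional vision step: merge the image description into the query.
    if image_text:
        search_query = f"{search_query}\n{image_text}"
    # Retrieval followed by top-k chunk selection.
    chunks = retrieve(search_query)[:top_k]
    # Local LLM generation over the selected chunks.
    answer = generate(query, chunks)
    return {"answer": answer, "citations": chunks, "search_query": search_query}


# Stub stages so the sketch runs without Qdrant or Ollama.
def fake_retrieve(q: str) -> list[dict]:
    return [{"preview": f"chunk {i} for {q}", "score": 1.0 - i / 10} for i in range(5)]


def fake_generate(q: str, chunks: list[dict]) -> str:
    return f"Answer to {q!r} using {len(chunks)} chunks"


result = run_pipeline("phishing response", fake_retrieve, fake_generate, rewrite=str.lower)
print(result["answer"])  # prints: Answer to 'phishing response' using 3 chunks
```

Passing the stages in as callables keeps each optional step easy to disable and lets the flow be exercised with stubs, as done here.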
- React
- Vite
- Tailwind CSS
- Axios
- Flask
- Flask-CORS
- Python
- local embeddings and vector retrieval
- collection-based source filtering
- optional query rewriting
- local open-source LLM served on a 3090 setup
- local image understanding / OCR-style preprocessing
- vision output merged into the query before retrieval
- RAGAS-style metrics
- context relevance
- answer relevance
- faithfulness
- retrieval quality analysis
We evaluate the performance of the retrieval-augmented generation (RAG) pipeline using RAGAS (Retrieval-Augmented Generation Assessment) on a static benchmark derived from the HuggingFace cybersecurity QA dataset.
The evaluation is performed on a fixed subset of 100 randomly sampled QA pairs from the dataset stored in hf_qaa_chunks. Each entry contains:
- A natural language question
- A reference ground-truth answer
- Associated topic metadata
- Source evidence passages (used only for dataset construction and never supplied as context during evaluation)
The evaluation is fully reproducible using a fixed random seed.
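Seeded sampling of this kind can be sketched as follows. The function name `sample_benchmark`, the seed value `42`, and the row layout are assumptions for illustration; the actual seed used by the evaluation script is not stated here.

```python
import random


def sample_benchmark(rows: list[dict], n: int = 100, seed: int = 42) -> list[dict]:
    """Draw the same n rows on every run by seeding a private RNG."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.sample(rows, min(n, len(rows)))


rows = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(500)]
first = sample_benchmark(rows)
second = sample_benchmark(rows)
print(len(first), first == second)  # prints: 100 True
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the subset stable even if other code consumes the global random state.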
The evaluation script (`ragas_eval_local_ollama.py`) follows this workflow:
- Dataset Loading
  - Loads the HuggingFace-formatted JSON dataset
  - Extracts structured fields: question, answer, evidence, and metadata
- Query Execution (RAG Pipeline)
  - Each question is sent to the local Flask `/chat` endpoint
  - The backend handles:
    - Query rewriting (optional)
    - Embedding-based retrieval via Qdrant
    - Context injection into the LLM (Gemma 3 via Ollama)
  - The response includes:
    - Generated answer
    - Retrieved context chunks (citations from Qdrant)
- Context Extraction
  - Retrieved contexts are extracted from backend citations (`preview` fields)
  - Only actual retrieved documents are used (no ground-truth leakage)
- RAGAS Evaluation
  - The following metrics are computed:
    - Faithfulness → measures whether the answer is supported by retrieved context
    - Answer Relevancy → measures semantic alignment between question and answer
    - Context Precision → measures how relevant retrieved chunks are to the query
  - Evaluation uses:
    - Local Ollama LLM (Gemma 3 12B IT QAT) for scoring
    - Local HuggingFace embedding model (BAAI/bge-large-en-v1.5) for semantic similarity
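The context extraction step, including the rule that samples without retrieved context are skipped, might look like this. It is a sketch under assumptions: only the `preview` field is taken from the description above, while the `citations` and `answer` keys, the output row layout, and the helper name `build_eval_row` are hypothetical.

```python
from typing import Optional


def build_eval_row(question: str, response: dict) -> Optional[dict]:
    """Turn one /chat response into an evaluation row, or None to skip it.

    Assumed response shape (hypothetical):
    {"answer": str, "citations": [{"preview": str, ...}, ...]}
    """
    contexts = []
    for citation in response.get("citations", []):
        preview = (citation.get("preview") or "").strip()
        if preview:  # ignore empty or whitespace-only previews
            contexts.append(preview)
    if not contexts:
        return None  # no retrieved context: the sample is skipped
    return {
        "question": question,
        "answer": response.get("answer", ""),
        "contexts": contexts,
    }


ok = build_eval_row("What is WannaCry?", {
    "answer": "A ransomware worm.",
    "citations": [{"preview": "WannaCry is ransomware..."}, {"preview": "  "}],
})
skipped = build_eval_row("Unknown topic", {"answer": "n/a", "citations": []})
print(ok["contexts"], skipped)  # prints: ['WannaCry is ransomware...'] None
```

Only the retrieved previews enter the row, so the ground-truth evidence stored in the dataset never reaches the evaluator.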