khaled-ghanem/VQA-RAD

Medical Visual Question Answering (VQA-RAD)

A vision-language pipeline that answers natural-language questions about radiology images, grounding answers in the image content. Built around a fine-tuned BLIP model and a multi-step reasoning agent.

⚠️ Research and educational project. Not a medical device.

Overview

| Stage | What | Result |
| --- | --- | --- |
| Baseline | Pretrained BLIP on medical VQA | Open F1 0.05, closed accuracy 0.57 |
| Fine-tuning | BLIP fine-tuned on VQA-RAD | Open F1 0.30 (about 6× baseline), closed accuracy 0.61 |
| Agent | Multi-step reasoning (plan → perceive → synthesize) | Coherent multi-step answers |
| Demo | Streaming Gradio interface | Real-time responses |

The Problem

General-purpose vision-language models perform poorly on medical images. A pretrained BLIP scored just 0.05 token-F1 on open-ended VQA-RAD questions: for example, it answered "blurry" when asked about modality and "none" for an image with a clear lesion. This project fine-tunes BLIP on medical data and wraps it in a reasoning agent.

Dataset

VQA-RAD — 315 radiology images (X-ray, CT, MRI) with ~3500 question/answer pairs across categories: presence, position, abnormality, modality, organ, size, and more. Questions are closed (yes/no) or open-ended.

Approach

  • Baseline evaluation: measured pretrained BLIP separately on closed (exact-match accuracy) and open (token-overlap F1) questions to expose where it fails.
  • Fine-tuning: trained BLIP's generative VQA head on the VQA-RAD train split (5 epochs, lr 2e-5), teaching it medical vocabulary.
  • Reasoning agent: an LLM (Gemini) decomposes a complex question into sub-questions, BLIP answers each (perception), and the LLM synthesizes a coherent final response — surfacing uncertainty when BLIP's evidence conflicts.
  • Streaming demo: a Gradio interface streams the agent's reasoning and answer token-by-token via Python generators.
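The plan → perceive → synthesize loop and its generator-based streaming can be sketched as follows. This is a minimal illustration, not the repository's code: `run_agent` and the three `fake_*` functions are hypothetical stand-ins for the Gemini planner, the fine-tuned BLIP model, and the Gemini synthesizer, and the final answer is streamed word by word rather than by true model tokens.

```python
from typing import Callable, Iterator, List, Tuple

Evidence = List[Tuple[str, str]]


def run_agent(
    question: str,
    plan: Callable[[str], List[str]],
    perceive: Callable[[str], str],
    synthesize: Callable[[str, Evidence], str],
) -> Iterator[str]:
    """Yield the agent's trace piece by piece: plan, perceive, synthesize."""
    # Plan: the LLM splits the question into sub-questions.
    sub_questions = plan(question)
    yield f"Plan: {len(sub_questions)} sub-question(s)\n"

    # Perceive: the VQA model answers each sub-question about the image.
    evidence: Evidence = []
    for sub_q in sub_questions:
        answer = perceive(sub_q)
        evidence.append((sub_q, answer))
        yield f"Q: {sub_q} -> A: {answer}\n"

    # Synthesize: the LLM merges the evidence into one response,
    # emitted here one word at a time to mimic token streaming.
    for word in synthesize(question, evidence).split():
        yield word + " "


# Hypothetical stubs so the sketch runs without any model or API key.
def fake_plan(question: str) -> List[str]:
    return ["what modality is this?", "is there a lesion?"]


def fake_perceive(sub_question: str) -> str:
    canned = {"what modality is this?": "ct", "is there a lesion?": "yes"}
    return canned[sub_question]


def fake_synthesize(question: str, evidence: Evidence) -> str:
    return "CT scan; a lesion appears present."


chunks = list(run_agent("Describe the scan", fake_plan, fake_perceive, fake_synthesize))
print("".join(chunks))
```

Gradio accepts generator functions and displays each yielded value in turn, so a wrapper for the demo would typically yield the accumulated text so far rather than the individual chunks.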

Results

| Metric | Pretrained BLIP | Fine-tuned BLIP |
| --- | --- | --- |
| Open token-F1 | 0.047 | 0.295 |
| Closed accuracy | 0.568 | 0.614 |

Fine-tuning improved open-question token-F1 roughly sixfold (0.047 to 0.295), while closed accuracy rose modestly (0.568 to 0.614). The absolute open score (~0.30) reflects how hard open-ended medical VQA is: the gain over baseline is the key result, not a solved task.
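As a quick check, the absolute and relative gains can be recomputed from the reported scores. The snippet below is only arithmetic on the four numbers in the results table; the dictionary layout and variable names are illustrative.

```python
# Reported scores: (pretrained BLIP, fine-tuned BLIP)
results = {
    "open_token_f1": (0.047, 0.295),
    "closed_accuracy": (0.568, 0.614),
}

gains = {}
for metric, (baseline, tuned) in results.items():
    absolute = tuned - baseline   # gain in raw score points
    relative = tuned / baseline   # multiple of the baseline score
    gains[metric] = (round(absolute, 3), round(relative, 2))
    print(f"{metric}: +{absolute:.3f} absolute, {relative:.2f}x baseline")
```

The open-question score is about 6.28 times the baseline, while closed accuracy is about 1.08 times the baseline, so nearly all of the improvement comes from open-ended questions.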

Evaluation Notes (Honest)

  • Open-ended answers are scored with token-overlap F1 (partial credit), which is fairer than exact match but still imperfect for medical synonyms.
  • The reasoning agent improves answer structure and surfaces uncertainty; it does not increase raw accuracy beyond the underlying BLIP model.
  • The fine-tuned model can still produce confident wrong answers — not for clinical use.
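For reference, the two metrics can be sketched as below. This is a common SQuAD-style formulation of token-overlap F1 with plain lowercase and whitespace tokenization; the project's exact normalization is not documented here, so `normalize`, `token_f1`, and `exact_match` are assumed names and the details may differ from the actual evaluation script.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (assumed normalization)."""
    return text.lower().strip().split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (partial credit)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0 (closed questions)."""
    return float(normalize(prediction) == normalize(reference))


print(token_f1("right lower lobe", "right lobe"))  # partial credit
print(token_f1("x-ray", "radiograph"))             # synonym gets no credit
print(exact_match("Yes", "yes"))
```

The second call illustrates the synonym limitation: "x-ray" and "radiograph" share no tokens, so a medically equivalent answer scores zero.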

Tech Stack

Python, PyTorch, HuggingFace Transformers (BLIP), Google Gemini, Gradio

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

VQA-RAD is not included — download it from the official source or Kaggle.
