Medical Visual Question Answering (VQA-RAD)

A vision-language pipeline that answers natural-language questions about radiology images, grounding answers in the image content. Built around a fine-tuned BLIP model and a multi-step reasoning agent.

⚠️ Research and educational project. Not a medical device.

Overview

Stage	What	Result
Baseline	Pretrained BLIP on medical VQA	Open F1 0.05, Closed acc 0.57
Fine-tuning	BLIP fine-tuned on VQA-RAD	Open F1 0.30 (about 6× the baseline), Closed acc 0.61
Agent	Multi-step reasoning (plan→perceive→synthesize)	coherent multi-step answers
Demo	Streaming Gradio interface	real-time responses

The Problem

General-purpose vision-language models often fail on medical images. A pretrained BLIP scored just 0.05 token-F1 on open-ended VQA-RAD questions: for example, it answered "blurry" when asked about the imaging modality and "none" when asked about a clearly visible lesion. This project fine-tunes BLIP on medical data and wraps it in a reasoning agent.

Dataset

VQA-RAD: 315 radiology images (X-ray, CT, MRI) with ~3500 question/answer pairs across categories such as presence, position, abnormality, modality, organ, and size. Each question is either closed (yes/no) or open-ended.

Approach

Baseline evaluation: measured pretrained BLIP separately on closed (exact-match accuracy) and open (token-overlap F1) questions to expose where it fails.
Fine-tuning: trained BLIP's generative VQA head on the VQA-RAD train split (5 epochs, lr 2e-5), teaching it medical vocabulary.
Reasoning agent: an LLM (Gemini) decomposes a complex question into sub-questions, BLIP answers each one against the image (perception), and the LLM synthesizes a coherent final response, flagging uncertainty when BLIP's answers conflict.
Streaming demo: a Gradio interface streams the agent's reasoning and answer token-by-token via Python generators.
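The plan→perceive→synthesize loop and the generator-based streaming can be sketched as follows. This is a minimal illustration, not the notebook's code: `run_agent`, the three callables, and the `fake_*` stubs are hypothetical names, the stubs stand in for Gemini and the fine-tuned BLIP model, and the final answer is streamed word by word as a simplification of token-by-token streaming.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical interfaces; the real project would back these with Gemini and BLIP.
Planner = Callable[[str], List[str]]                       # question -> sub-questions
Perceiver = Callable[[str], str]                           # sub-question -> short answer (image bound inside)
Synthesizer = Callable[[str, List[Tuple[str, str]]], str]  # question + evidence -> final answer


def run_agent(question: str, plan: Planner, perceive: Perceiver,
              synthesize: Synthesizer) -> Iterator[str]:
    """Yield the growing transcript so a UI can re-render it after each step."""
    transcript = ""
    evidence: List[Tuple[str, str]] = []

    # Plan: the LLM splits the question into simpler sub-questions.
    for sub_q in plan(question):
        # Perceive: the VQA model answers each sub-question about the image.
        answer = perceive(sub_q)
        evidence.append((sub_q, answer))
        transcript += f"Q: {sub_q}\nA: {answer}\n"
        yield transcript

    # Synthesize: the LLM combines the evidence; emit the answer one word at a time.
    final = synthesize(question, evidence)
    transcript += "Final: "
    for i, word in enumerate(final.split()):
        transcript += ("" if i == 0 else " ") + word
        yield transcript


# Stub components (assumptions for illustration only).
def fake_plan(question: str) -> List[str]:
    return ["What modality is this?", "Is there a lesion?"]


def fake_perceive(sub_question: str) -> str:
    return {"What modality is this?": "ct", "Is there a lesion?": "yes"}[sub_question]


def fake_synthesize(question: str, evidence: List[Tuple[str, str]]) -> str:
    return "CT scan showing a lesion."


if __name__ == "__main__":
    for partial in run_agent("Describe the scan.", fake_plan, fake_perceive, fake_synthesize):
        pass
    print(partial)  # the complete transcript after the last yield
```

Each yielded value is the full transcript so far rather than a delta, which matches how a Gradio output component is updated when its handler is a generator function: every yield replaces the displayed value.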

Results

Metric	Pretrained BLIP	Fine-tuned BLIP
Open token-F1	0.047	0.295
Closed accuracy	0.568	0.614

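Fine-tuning raised open-question token-F1 from 0.047 to 0.295, roughly a 6× improvement; the gain in closed accuracy (0.568 to 0.614) was much smaller. The absolute open score (~0.30) reflects how hard open-ended medical VQA is: the gain over baseline is the key result, not a solved task.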
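For reference, the size of each gain can be checked directly from the table values. The variable names below are illustrative only.

```python
# Scores copied from the results table.
open_f1 = {"pretrained": 0.047, "fine_tuned": 0.295}
closed_acc = {"pretrained": 0.568, "fine_tuned": 0.614}

# Relative gain on open questions: ratio of the two token-F1 scores.
open_ratio = open_f1["fine_tuned"] / open_f1["pretrained"]

# Absolute gain on closed questions, in percentage points.
closed_gain_pts = (closed_acc["fine_tuned"] - closed_acc["pretrained"]) * 100

print(f"Open token-F1 ratio: {open_ratio:.1f}x")              # 6.3x
print(f"Closed accuracy gain: {closed_gain_pts:.1f} points")  # 4.6 points
```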

Evaluation Notes (Honest)

Open-ended answers are scored with token-overlap F1, which gives partial credit and is fairer than exact match, but it still under-scores correct answers that use a medical synonym of the reference.
The reasoning agent improves answer structure and surfaces uncertainty; it does not increase raw accuracy beyond the underlying BLIP model.
The fine-tuned model can still produce confidently wrong answers and is not suitable for clinical use.
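The two metrics mentioned above can be sketched as follows. This is one common formulation of token-overlap F1 and exact match; the normalization (lowercasing and stripping punctuation) and the function names are assumptions, and the notebook's scoring may differ in detail.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (partial credit)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0 (used for closed questions)."""
    return float(normalize(prediction) == normalize(reference))


if __name__ == "__main__":
    print(exact_match("Yes.", "yes"))                # 1.0
    print(token_f1("right lower lobe", "right lobe"))  # about 0.8: partial credit
    print(token_f1("ct", "computed tomography"))     # 0.0: a correct synonym gets no credit
```

The last call shows the synonym limitation: "ct" and "computed tomography" share no tokens, so a correct answer scores zero.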

Tech Stack

Python, PyTorch, HuggingFace Transformers (BLIP), Google Gemini, Gradio

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The VQA-RAD dataset is not included in this repository; download it from the official source or Kaggle.
