A vision-language pipeline that answers natural-language questions about radiology images, grounding answers in the image content. Built around a fine-tuned BLIP model and a multi-step reasoning agent.
⚠️ Research and educational project. Not a medical device.
| Stage | What | Result |
|---|---|---|
| Baseline | Pretrained BLIP on medical VQA | Open F1 0.05, Closed acc 0.57 |
| Fine-tuning | BLIP fine-tuned on VQA-RAD | Open F1 0.30 (6×), Closed acc 0.61 |
| Agent | Multi-step reasoning (plan→perceive→synthesize) | coherent multi-step answers |
| Demo | Streaming Gradio interface | real-time responses |
General-purpose vision-language models transfer poorly to medical images. A pretrained BLIP scored just 0.05 token-F1 on open-ended VQA-RAD questions: for example, it answered "blurry" when asked about modality and "none" when shown a clear lesion. This project fine-tunes BLIP on medical data and wraps it in a reasoning agent.
VQA-RAD — 315 radiology images (X-ray, CT, MRI) with ~3500 question/answer pairs across categories: presence, position, abnormality, modality, organ, size, and more. Questions are closed (yes/no) or open-ended.
- Baseline evaluation: measured pretrained BLIP separately on closed (exact-match accuracy) and open (token-overlap F1) questions to expose where it fails.
- Fine-tuning: trained BLIP's generative VQA head on the VQA-RAD train split (5 epochs, lr 2e-5), teaching it medical vocabulary.
- Reasoning agent: an LLM (Gemini) decomposes a complex question into sub-questions, BLIP answers each (perception), and the LLM synthesizes a coherent final response — surfacing uncertainty when BLIP's evidence conflicts.
- Streaming demo: a Gradio interface streams the agent's reasoning and answer token-by-token via Python generators.
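
The plan→perceive→synthesize loop and the generator-based streaming described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_agent`, `stream_accumulated`, and the three stub callables are hypothetical names, and the stubs stand in for the Gemini and BLIP calls so the example runs offline.

```python
from typing import Callable, Iterator, List, Tuple

Evidence = List[Tuple[str, str]]  # (sub-question, perceived answer) pairs


def run_agent(
    question: str,
    plan: Callable[[str], List[str]],
    perceive: Callable[[str], str],
    synthesize: Callable[[str, Evidence], str],
) -> Iterator[str]:
    """Yield the agent's trace chunk by chunk: one line per perception
    step, then the final answer one whitespace-separated token at a time."""
    evidence: Evidence = []
    for sub_question in plan(question):      # plan: LLM splits the question
        answer = perceive(sub_question)      # perceive: VQA model answers
        evidence.append((sub_question, answer))
        yield f"[perceive] {sub_question} -> {answer}\n"
    yield "[answer] "
    # synthesize: LLM combines the evidence into one reply
    for token in synthesize(question, evidence).split():
        yield token + " "


def stream_accumulated(chunks: Iterator[str]) -> Iterator[str]:
    """Yield the text so far after each chunk, for UIs that redraw the
    whole output on every yielded value."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text


# Hypothetical stand-ins for the Gemini and BLIP calls (assumptions, offline only).
def plan_stub(question: str) -> List[str]:
    return ["what modality is this?", "is there a lesion?"]


def perceive_stub(sub_question: str) -> str:
    answers = {"what modality is this?": "ct", "is there a lesion?": "yes"}
    return answers[sub_question]


def synthesize_stub(question: str, evidence: Evidence) -> str:
    facts = "; ".join(f"{q} {a}" for q, a in evidence)
    return f"Based on the image: {facts}"


if __name__ == "__main__":
    for chunk in run_agent(
        "Describe the scan.", plan_stub, perceive_stub, synthesize_stub
    ):
        print(chunk, end="", flush=True)
    print()
```

In a Gradio handler written as a generator, each yielded value typically replaces what the output component shows, so a wrapper like `stream_accumulated` would be the piece passed to the interface rather than the raw chunks.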
| Metric | Pretrained BLIP | Fine-tuned BLIP |
|---|---|---|
| Open token-F1 | 0.047 | 0.295 |
| Closed accuracy | 0.568 | 0.614 |
Fine-tuning improved open-question token-F1 roughly 6× (0.047 → 0.295), while closed-question accuracy rose more modestly (0.568 → 0.614). The absolute open score (~0.30) reflects how hard open-ended medical VQA is: the gain over baseline is the key result, not a solved task.
- Open-ended answers are scored with token-overlap F1 (partial credit), which is fairer than exact match but still imperfect for medical synonyms.
- The reasoning agent improves answer structure and surfaces uncertainty; it does not increase raw accuracy beyond the underlying BLIP model.
- The fine-tuned model can still produce confidently wrong answers, so it is not suitable for clinical use.
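
For reference, the two metrics mentioned above (token-overlap F1 for open questions, exact-match accuracy for closed ones) can be sketched as below. The normalization used here (lowercasing, whitespace splitting, stripping surrounding punctuation) is an assumption for illustration; the project's exact scoring code may differ, and the function names are hypothetical.

```python
import string
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (partial credit)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection: each token is credited at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def closed_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of answers that match exactly after normalization."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    correct = sum(tokenize(p) == tokenize(r) for p, r in pairs)
    return correct / len(pairs)
```

With this sketch, `token_f1("left lung", "left upper lung")` gives about 0.8 (precision 1.0, recall 2/3), whereas `token_f1("radiograph", "x-ray")` gives 0.0 even though the two answers mean the same thing, which is the synonym problem noted in the list above.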
Python, PyTorch, HuggingFace Transformers (BLIP), Google Gemini, Gradio
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

VQA-RAD is not included; download it from the official source or Kaggle.