Image Token Study: Vision-Language Model Robustness to Token-Reducing Image Transformations

A Systematic Cross-Provider Empirical Study of How Standard Image Transformations Affect VLM Description Accuracy and Input Token Cost


Python: 3.10+ | License: MIT | PyTorch: 2.0+ | Pillow: 10.0+ | BERTScore | Sentence-Transformers


Author: Ryan Kamp
Affiliation: Department of Computer Science, University of Cincinnati
Location: Cincinnati, OH, USA
Email: [email protected]
GitHub: https://github.com/ryanjosephkamp
Date: April 2026


Table of Contents

  1. Project Overview
  2. Research Questions
  3. Models Under Study
  4. Image Transformations
  5. Datasets
  6. Evaluation Methodology
  7. Experimental Design
  8. Key Findings
  9. Selected Results
  10. Getting Started
  11. Usage
  12. Project Structure
  13. Reports and Documentation
  14. References
  15. License

Project Overview

Research Motivation

Vision-language models (VLMs) and large language models with vision capabilities process input images by converting them into sequences of tokens that occupy the model's context window. Each image token incurs both computational cost — latency and context window consumption — and direct financial cost through per-token API pricing. For researchers and practitioners who routinely submit images to VLM APIs, whether scientific figures for automated report generation, photographs for content analysis, or diagrams for accessibility annotation, the cumulative token cost of image inputs represents a significant and growing expense.

The number of image tokens consumed is determined by provider-specific tokenization mechanisms that depend primarily on image dimensions. For Anthropic Claude models, the token count follows an area-based formula:

$$N \approx \frac{W \times H}{750}$$

where $W$ and $H$ are the pixel dimensions after any provider-side resizing. OpenAI GPT models use a patch-based system with $32 \times 32$ pixel patches, while Google Gemini and xAI Grok employ their own dimension-dependent tokenization strategies.
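
The two tokenization rules described above can be sketched as small estimators. This is a rough illustration, not any provider's actual tokenizer: the function names are hypothetical, and the sketch assumes no provider-side resizing or token caps and that partial edge patches count as whole patches.

```python
import math


def estimate_area_tokens(width: int, height: int) -> int:
    """Approximate token count from the area-based formula N ~ W*H / 750."""
    return round(width * height / 750)


def estimate_patch_tokens(width: int, height: int, patch: int = 32) -> int:
    """Count the patch x patch tiles needed to cover the image.

    Assumption: partial patches at the right and bottom edges count as
    whole patches, and no provider-side resizing or token cap applies.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


if __name__ == "__main__":
    for w, h in [(1000, 1000), (500, 500), (250, 250)]:
        print(w, h, estimate_area_tokens(w, h), estimate_patch_tokens(w, h))
```

Both estimates depend only on pixel dimensions, which is why file-size reductions alone would not be expected to change them.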

A natural and practically important question arises: how far can the input image token footprint be reduced — through standard image transformations such as downscaling, compression, and color reduction — before model performance on image understanding tasks degrades to an unacceptable level?

Research Gap

Despite the ubiquity of this question, the literature lacks systematic cross-provider empirical data on the relationship between image transformations and VLM description accuracy. Prior work on token reduction has focused primarily on internal model mechanisms such as token merging, rather than on user-side input transformations that any practitioner can apply before submitting images to an API. No published study has compared the robustness of current flagship models from multiple providers under identical transformation conditions, nor has any study characterized whether scientific figures and general-purpose images exhibit meaningfully different degradation profiles.

Project Scope

This project is a formal, scientifically rigorous empirical study that systematically measures how the image description accuracy of six current flagship VLMs changes when token-reducing image transformations are applied to input images. The study spans two distinct image categories — scientific figures and general images — and four transformation families:

| Family | Transformation | Mechanism |
|---|---|---|
| T1 | Resolution Reduction | Reduces pixel dimensions via interpolation, directly reducing token count |
| T2 | Lossy Compression | Reduces file size through JPEG/WebP encoding without changing pixel dimensions |
| T3 | Color Depth Reduction | Reduces chromatic information through grayscale conversion or color quantization |
| T4 | Adaptive Multi-Stage Pipeline | Combines moderate downscaling, format optimization (WebP), and quality-targeted compression |

This design exploits the fact that these transformations affect different aspects of the image — spatial resolution, compression artifacts, and color information — and interact differently with provider-specific tokenization mechanisms. This enables the study to disentangle the effects of token count reduction from image quality degradation, since some transformations (T2, T3) alter image content without changing pixel dimensions and thus without affecting dimension-based token counts.

Study Scale

The study evaluates six flagship VLMs from four providers across 19 transformation levels, yielding 12,000 scored API observations. Description accuracy is assessed using four complementary metrics: BERTScore F1, Sentence-BERT cosine similarity, ROUGE-L, and an LLM-as-judge evaluation on a 300-observation subset. All experiments treat each model as a black-box API endpoint, interacting only with the public API surface.

Contributions

  1. Empirical characterization of transformation–accuracy tradeoffs across six flagship VLMs and 19 transformation levels, yielding the first systematic cross-provider dataset on this topic.
  2. Confirmation that lossy compression does not reduce token counts for any tested provider, establishing that image tokenization in current commercial VLM APIs is dimension-based rather than file-size-based.
  3. Identification of practical degradation thresholds: models maintain stable accuracy through $4\times$ resolution reduction ($\text{scale} = 0.25$) for general images and through $8\times$ reduction ($\text{scale} = 0.125$) for scientific figures on SBERT cosine similarity, before sharp performance drops.
  4. Cross-model robustness rankings revealing that GPT-5.4 achieves the highest overall description accuracy while Gemini 3.1 Pro Preview employs a fixed-token-budget strategy that renders input-side token reduction ineffective.
  5. Evidence that scientific figures and general images exhibit distinct degradation profiles, with implications for practitioners optimizing images across different content domains.
  6. Pareto-optimal transformation strategies for each model and image category at 50% and 75% token budget constraints.

Research Questions

The study is organized around seven research questions, each grounded in the research gaps identified in the literature review and addressed by one or more experiments.

RQ1: How does each model's description accuracy degrade as a function of downscaling factor, and at what resolution does degradation become practically unacceptable?

VLM image tokenizers compute token counts from pixel dimensions — via $32 \times 32$ patches for OpenAI, an area-based formula ($N \approx WH/750$) for Anthropic, and provider-specific mechanisms for Google and xAI. Downscaling is the most direct lever for reducing token cost. The critical unknown is the empirical shape of the degradation curve for each model and image category. (Addressed by Experiment 2.)

RQ2: Does lossy compression reduce input image token cost, or does it only degrade image quality without token savings?

Lossy JPEG and WebP compression reduce file size without changing pixel dimensions. If VLM tokenization is purely dimension-based, then compression should have zero effect on token count while still degrading image quality — making compression a "cost-free degradation" from the tokenization perspective. (Addressed by Experiment 3.)

RQ3: Do VLMs extract image understanding primarily from spatial structure or from chromatic detail, as revealed by performance under color depth reduction?

Color depth reduction (grayscale conversion, color quantization) removes chromatic information while preserving pixel dimensions and spatial structure. Scientific figures often rely on structural detail rather than color; general photographs may depend more on chromatic semantics. (Addressed by Experiment 4.)

RQ4: Can an adaptive multi-stage pipeline achieve a superior Pareto tradeoff (token cost vs. accuracy) compared to any single-axis transformation?

Distributing an information reduction budget across multiple independent axes (spatial resolution, encoding quality) should introduce less perceptual degradation than concentrating the same reduction along a single axis. The adaptive pipeline (T4) operationalizes this principle. (Addressed by Experiment 5.)

RQ5: Do scientific figures and general images exhibit meaningfully different degradation profiles under token-reducing transformations?

Scientific figures have distinctive properties — dense text, fine spatial detail, symbolic conventions, and structured layouts — that may make them differentially sensitive to certain transformations relative to general photographs. (Addressed by Experiment 6.)

RQ6: Which models are most robust to each transformation type, and is there a consistent robustness ranking across transformations?

The six models span four providers with different architectures, training data, and tokenization mechanisms. Whether robustness to image degradation is a general model property or a transformation-specific property has direct practical implications for model selection. (Addressed by Experiment 6.)

RQ7: Do the primary evaluation metrics (BERTScore, Sentence-BERT cosine similarity) agree on degradation patterns, or do they reveal different aspects of quality loss?

If the evaluation metrics agree strongly, conclusions are robust to metric choice. If they diverge, the study must characterize how and why — with implications for future evaluation methodology. (Addressed by Experiment 7.)


Models Under Study

Six flagship VLMs from four providers were evaluated as black-box API endpoints in inference-only mode. All models were accessed through their respective provider APIs with temperature set to 0.

| # | Provider | Model | API Identifier | Tokenization Strategy |
|---|---|---|---|---|
| 1 | Anthropic | Claude Opus 4.6 | claude-opus-4-6-20260301 | Area-based: $N \approx WH/750$ |
| 2 | Anthropic | Claude Sonnet 4.6 | claude-sonnet-4-6-20260301 | Area-based: $N \approx WH/750$ |
| 3 | Anthropic | Claude Opus 4.5 | claude-opus-4-5-20250520 | Area-based: $N \approx WH/750$ |
| 4 | OpenAI | GPT-5.4 | gpt-5.4 | Patch-based: $32 \times 32$ patches |
| 5 | Google | Gemini 3.1 Pro Preview | gemini-3.1-pro-preview | Fixed-token-budget ($\approx 1{,}094$ tokens) |
| 6 | xAI | Grok 4.20 (Reasoning) | grok-4-20-reasoning | Dimension-based (provider-specific) |

Provider-Specific Tokenization Details

  • Anthropic (Claude models): All three Claude models share a deterministic tokenizer. Token count scales with pixel area ($N \approx WH/750$ after any provider-side resizing). At baseline, mean input tokens were 929 for the combined dataset (456 general, 1,401 scientific). Token counts decreased proportionally with downscaling.
  • OpenAI (GPT-5.4): Uses a $32 \times 32$ pixel patch system. Slightly fewer tokens than Anthropic at baseline (mean 855; 399 general, 1,311 scientific). Token counts decreased with downscaling.
  • Google (Gemini 3.1 Pro Preview): Employs a fixed-token-budget strategy: token counts remained near-constant at approximately 1,094 ($\sigma = 8.4$) across all images and all transformation levels, rendering dimension-based token reduction ineffective.
  • xAI (Grok 4.20): Dimension-based tokenization with proportional but less aggressive reduction than Anthropic/OpenAI. Mean baseline: 836 tokens (456 general, 1,217 scientific).

Experimental Configuration

| Parameter | Setting |
|---|---|
| Temperature | 0 for all providers |
| Random seed | 42 (OpenAI only; logged as N/A for other providers) |
| Prompt | Fixed across all 12,000 primary API calls |
| Output format | One-to-two sentence image description |
| Mode | Inference only (no training or fine-tuning) |

Experimental Design

Overview

The study comprises 7 experiments organized into 4 execution groups, testing 4 transformation families across 19 transformation levels, yielding 12,000 scored API observations plus 300 LLM-as-judge evaluations (12,300 total API calls).

Experiment Summary

| # | Experiment | Transformation | Primary RQ | API Calls |
|---|---|---|---|---|
| 1 | Baseline Image Description | None (original) | | 600 |
| 2 | Resolution Reduction Robustness | T1 | RQ1 | 3,000 |
| 3 | Lossy Compression Robustness | T2 | RQ2 | 3,600 |
| 4 | Color Depth Reduction Robustness | T3 | RQ3 | 2,400 |
| 5 | Adaptive Pipeline Evaluation | T4 | RQ4 | 2,400 |
| 6 | Cross-Model Ranking & Category Divergence | T1–T4 | RQ5, RQ6 | 0 (analysis only) |
| 7 | Metric Agreement Analysis | T1–T4 | RQ7 | 300 (LLM-as-judge) |
| Total | | | | 12,300 |
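
The call counts in the summary follow from the design: every condition (the baseline or one transformation level) is run on 100 images by 6 models. A short check of that arithmetic, using only numbers stated in this README (the variable names are illustrative):

```python
IMAGES = 100       # 50 scientific + 50 general
MODELS = 6
LEVELS = {"T1": 5, "T2": 6, "T3": 4, "T4": 4}  # levels per transformation family
JUDGE_CALLS = 300  # LLM-as-judge subset

calls_per_condition = IMAGES * MODELS  # one call per image-model pair
baseline_calls = calls_per_condition
calls_by_family = {fam: n * calls_per_condition for fam, n in LEVELS.items()}

scored_calls = baseline_calls + sum(calls_by_family.values())
total_calls = scored_calls + JUDGE_CALLS

if __name__ == "__main__":
    print("per condition:", calls_per_condition)
    print("by family:", calls_by_family)
    print("scored:", scored_calls, "total:", total_calls)
```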

Image Transformations

T1 — Resolution Reduction

Images downscaled using Lanczos interpolation at five scaling factors:

| Level | Scale Factor | Effect on $1000 \times 1000$ Image | Expected Token Impact |
|---|---|---|---|
| T1-L1 | 0.75 | $750 \times 750$ | Moderate reduction |
| T1-L2 | 0.50 | $500 \times 500$ | Substantial reduction |
| T1-L3 | 0.25 | $250 \times 250$ | Large reduction |
| T1-L4 | 0.125 | $125 \times 125$ | Very large reduction |
| T1-L5 | 0.0625 | $63 \times 63$ | Near-maximum reduction |

Output format: PNG (lossless, to isolate resolution effects from compression artifacts).

T2 — Lossy Compression

Six compression levels applied without changing pixel dimensions:

| Level | Format | Quality | Artifact Severity |
|---|---|---|---|
| T2-L1 | JPEG | 85 | Negligible |
| T2-L2 | JPEG | 50 | Mild blocking |
| T2-L3 | JPEG | 20 | Visible blocking and color banding |
| T2-L4 | JPEG | 5 | Severe blocking, ringing, color loss |
| T2-L5 | WebP | 85 | Negligible |
| T2-L6 | WebP | 50 | Mild predictive artifacts |
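
The following sketch shows one way such a compression level could be applied with Pillow, by round-tripping the image through an in-memory encoder. The helper name is hypothetical and this is not necessarily how the repository implements T2. Passing "WEBP" instead of "JPEG" should work the same way, assuming the installed Pillow build includes WebP support.

```python
import io

from PIL import Image


def compress(img: Image.Image, fmt: str, quality: int) -> Image.Image:
    """Round-trip an image through a lossy encoder without resizing it."""
    buffer = io.BytesIO()
    img.convert("RGB").save(buffer, format=fmt, quality=quality)
    buffer.seek(0)
    out = Image.open(buffer)
    out.load()  # decode now so the result does not depend on the buffer
    return out


if __name__ == "__main__":
    original = Image.new("RGB", (640, 480), "steelblue")  # stand-in image
    for quality in (85, 50, 20, 5):
        result = compress(original, "JPEG", quality)
        # Pixel dimensions are unchanged at every quality level.
        print(quality, result.size)
```

Because the decoded size never changes, any tokenizer that depends only on dimensions would see the same input size at every level.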

T3 — Color Depth Reduction

Four color conditions applied without changing pixel dimensions:

| Level | Condition | Colors | Method |
|---|---|---|---|
| T3-L1 | Quantize 64 | 64 | Median-cut algorithm |
| T3-L2 | Quantize 16 | 16 | Median-cut algorithm |
| T3-L3 | Quantize 4 | 4 | Median-cut algorithm |
| T3-L4 | Grayscale | 256 (luminance) | ITU-R BT.601: $Y = 0.299R + 0.587G + 0.114B$ |

Grayscale images reconverted to 3-channel RGB for API compatibility. Output format: PNG (lossless).

T4 — Adaptive Multi-Stage Pipeline

Sequential pipeline: Lanczos downscale → WebP conversion → quality-targeted compression.

| Level | Config | Scale | Format | Quality | Token Reduction |
|---|---|---|---|---|---|
| T4-L1 | Conservative | 0.75 | WebP | 85 | Moderate |
| T4-L2 | Balanced | 0.50 | WebP | 70 | Substantial |
| T4-L3 | Aggressive | 0.25 | WebP | 50 | Large |
| T4-L4 | Maximum | 0.125 | WebP | 30 | Very large |

Execution Groups

Experiments were organized into four execution groups to manage API costs, computational resources, and the dry-run/full validation workflow.

| Group | Scope | Content | Execution |
|---|---|---|---|
| A | Dataset preparation | Image curation, format standardization, ground-truth validation, transformation application (19 levels × 100 images) | Local (no API calls) |
| B | Scientific images | Experiments 1–5 on 50 scientific images × 6 models × all transformation levels | 6,000 API calls |
| C | General images | Experiments 1–5 on 50 general images × 6 models × all transformation levels | 6,000 API calls |
| D | Evaluation & analysis | BERTScore, SBERT, ROUGE-L on 12,000 observations; LLM-as-judge on 300-observation subset; Experiments 6–7 | 300 API calls + local compute |

Dry-Run / Full Notebook Workflow

Each execution group followed a two-stage validation protocol:

  1. Dry-run notebook: A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Run interactively from VS Code via the Google Colab extension on an A100/H100 GPU. Debugged iteratively until successful completion.
  2. Full notebook: The complete experiment run, initially planned for Colab H100 execution. Groups A and B executed on Colab; Groups C and D executed locally on macOS (Apple M3 Pro) after GPU vs. CPU analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.

Both notebooks implemented:

  • Automatic checkpointing for pause-and-resume across sessions
  • Cell print-stream capture to log files
  • Result packaging into downloadable .zip archives
  • JSON checkpoint files for tracking completion state
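
The pause-and-resume behavior listed above can be illustrated with a small JSON checkpoint loop. This is a generic sketch, not the repository's data_utils.py: the function names and the {"completed": [...]} file layout are assumptions made for the example.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable


def load_done(path: Path) -> set[str]:
    """Read the set of completed task IDs, or an empty set on first run."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text())["completed"])


def save_done(path: Path, done: set[str]) -> None:
    """Write via a temporary file so an interrupted write cannot corrupt it."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"completed": sorted(done)}))
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_with_checkpoint(
    tasks: Iterable[str], worker: Callable[[str], None], path: Path
) -> set[str]:
    """Run worker on each task not yet recorded, saving after every task."""
    done = load_done(path)
    for task_id in tasks:
        if task_id in done:
            continue  # finished in an earlier session
        worker(task_id)
        done.add(task_id)
        save_done(path, done)
    return done
```

Calling run_with_checkpoint again with the same checkpoint path skips every task already recorded, which is what lets a long API run resume after an interrupted session.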

Evaluation Methodology

| Metric | Type | Model/Method | Role |
|---|---|---|---|
| BERTScore F1 | Token-level embedding | microsoft/deberta-xlarge-mnli | Primary: semantic alignment at token granularity |
| SBERT Cosine Similarity | Sentence-level embedding | all-MiniLM-L6-v2 | Primary: holistic semantic equivalence |
| ROUGE-L F1 | Lexical overlap | Longest common subsequence (LCS) matching | Secondary: surface-level reference baseline |
| LLM-as-Judge | LLM evaluation | GPT-5.4, 1–5 rubric | Supplementary: 300-observation subset |
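
Of the four metrics, ROUGE-L is simple enough to show in full: it scores the longest common subsequence (LCS) of tokens shared by the reference and the candidate. The sketch below is a simplified standard-library version that assumes lowercased whitespace tokenization; the rouge-score package listed in requirements.txt applies its own tokenization, so its values may differ slightly.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 with lowercased whitespace tokens (simplifying assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("a cat sits on the mat", "a cat on a mat"))
```

Because it rewards only shared word order, a faithful paraphrase can score low on ROUGE-L, which is consistent with its secondary role here.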

Standardized Prompt

All 12,000 primary API calls used the same fixed prompt:

Describe what this image shows in one to two concise sentences. Focus on the main content and any important details visible in the image.

Datasets

| Category | Size | Source | Ground-Truth Labels |
|---|---|---|---|
| Scientific | 50 images (sci_001–sci_050) | Open-access publications, curated datasets | One-to-two sentence figure descriptions |
| General | 50 images (gen_001–gen_050) | MS COCO 2017 validation set | First of five COCO captions per image |

All images standardized to PNG format, RGB color space, minimum $640 \times 480$ resolution.


Key Findings

The study's 12,000 scored API observations yield seven principal findings, each tied to a research question.

1. Resolution Reduction Is the Only Effective Token Reduction Lever

Resolution reduction (T1) is the only single-axis transformation that consistently lowers input token counts across all providers with dimension-based tokenization. Models maintain stable accuracy through 4× downscaling (scale 0.25), which reduces token counts by 89–94% for Anthropic and OpenAI models. The degradation curve exhibits a plateau-then-cliff shape: accuracy is stable from 0.75× to 0.25× scale, then drops sharply at 0.125× and 0.0625×. At 16× downscaling (scale 0.0625), all models converge to a degraded floor with BERTScore F1 in the 0.15–0.19 range.
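
As an idealized check on these savings, consider the area-based formula alone. Scaling both sides of an image by a factor $s$ scales its area, and therefore its approximate token count, by $s^2$:

$$N(s) \approx \frac{(sW)(sH)}{750} = s^2 \, N(1), \qquad \text{reduction}(s) = 1 - s^2$$

This gives a 75% reduction at $s = 0.5$, 93.75% at $s = 0.25$, and about 98.4% at $s = 0.125$. The measured 89–94% range at scale 0.25 sits at or slightly below the idealized value, which is unsurprising given that the formula is approximate and applies only after any provider-side resizing.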

2. Lossy Compression Does Not Reduce Token Counts

JPEG and WebP compression did not reduce input token counts for any of the four providers tested, confirming that current commercial VLM tokenization is dimension-based rather than file-size-based. Despite having no token benefit, models demonstrated remarkable robustness to compression artifacts: accuracy remained effectively constant from JPEG $q = 85$ down to $q = 5$.

3. VLMs Rely Primarily on Spatial Structure, Not Color

Five of six models maintained accuracy within 5% of baseline under complete grayscale conversion, indicating that spatial structure — edges, text, object boundaries — carries far more information for VLM understanding than chromatic detail. Gemini 3.1 Pro Preview was the sole exception, showing a 36% relative BERTScore decline under grayscale conversion. Scientific figures were especially color-invariant, consistent with their reliance on layout and text rather than color semantics.

4. The Adaptive Pipeline Offers Practical Convenience

The multi-stage pipeline (T4) combining downscaling with WebP compression appeared on or near the Pareto frontier for most models, providing simultaneous token reduction and file size reduction. The Balanced configuration (T4-L2: 0.50× scale, WebP $q = 70$) achieves approximately 71% token savings with good accuracy preservation (SBERT $\approx 0.55$ for Anthropic models). However, the pipeline does not fundamentally dominate single-axis resolution reduction at equivalent scaling factors.
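
For readers unfamiliar with the term, a configuration is on the Pareto frontier if no other configuration has both lower token cost and higher accuracy. A minimal sketch of that selection follows; the Config type and the example values in the usage section are hypothetical and are not study results.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    tokens: float    # mean input tokens (lower is better)
    accuracy: float  # mean accuracy score (higher is better)


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations not dominated on both tokens and accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by ascending token cost; break ties by descending accuracy.
    for cfg in sorted(configs, key=lambda c: (c.tokens, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    candidates = [
        Config("A", 100, 0.50),
        Config("B", 200, 0.60),
        Config("C", 150, 0.40),  # dominated by A
        Config("D", 200, 0.55),  # dominated by B
    ]
    print([c.name for c in pareto_frontier(candidates)])
```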

5. Scientific and General Images Degrade Differently

The two image categories exhibited an unexpected metric asymmetry: general images scored higher on BERTScore F1 (0.285 versus 0.141 at baseline) but lower on SBERT cosine similarity (0.548 versus 0.670). Under resolution reduction, general images degraded faster on SBERT (45% decline from baseline to 16× downscaling) than scientific images (28% decline), contradicting the initial hypothesis that scientific figures would be more sensitive due to fine text and line detail.

6. GPT-5.4 Is the Most Accurate and Robust Model

Across all 12,000 observations, GPT-5.4 achieved the highest mean BERTScore F1 (0.239) and SBERT cosine similarity (0.623). Grok 4.20 Reasoning ranked second, followed by the three Anthropic models. Gemini 3.1 Pro Preview employed a fixed-token-budget strategy ($\approx 1{,}094$ tokens regardless of image dimensions), rendering all input-side transformations ineffective for token reduction on that provider.

7. BERTScore and SBERT Capture Different Quality Dimensions

BERTScore F1 and SBERT cosine similarity were only weakly correlated ($r = 0.188$ across the full dataset, $r = 0.085$ on the 300-observation judge subset). SBERT correlated most strongly with LLM-as-judge scores ($r = 0.503$), whereas BERTScore correlated with them only weakly ($r = 0.125$). This divergence argues for multi-metric evaluation as standard practice in VLM studies.


Selected Results

The following figures illustrate key experimental outcomes. All 32 figures and their descriptions are catalogued in the figure map.

Baseline Model Performance (Experiment 1)

Figure 1. Mean BERTScore F1 at baseline (original images) for each model, faceted by scientific and general image categories. GPT-5.4 achieves the highest accuracy on both categories.

Resolution Degradation Curves (Experiment 2)

Figure 3. BERTScore F1 versus scaling factor (0.0625–0.75×) per model and image category. The plateau-then-cliff degradation pattern is visible across all models, with sharp accuracy drops at 0.125× and 0.0625× scale.

Pareto Frontier — Token Cost vs. Accuracy (Experiment 5)

Figure 16. Pareto frontier of mean input token cost versus mean BERTScore F1 across all transformation families (T1–T4). Points on the frontier are configurations not dominated on both axes by any other configuration; resolution reduction and pipeline configurations dominate the efficient frontier.

Cross-Model Robustness Ranking (Experiment 6)

Figure 20. Heatmap of Area Under the Degradation Curve (AUDC) for BERTScore F1 across model–transformation pairs. Higher values indicate greater robustness to the corresponding transformation; GPT-5.4 shows the highest robustness across most transformation types.

Metric Correlation Analysis (Experiment 7)

Figure 25. Pairwise Pearson correlation matrix of the evaluation metrics (BERTScore F1, SBERT cosine similarity, ROUGE-L F1, and LLM judge score) on the 300-observation LLM-as-judge subset. SBERT and the LLM judge show the strongest agreement (r = 0.503), while BERTScore F1 and SBERT are nearly uncorrelated (r = 0.085).

Transformation Examples

Transformation examples. Visual illustration of the four transformation families (resolution reduction, lossy compression, color depth reduction, and the adaptive pipeline) applied to a sample image at increasing intensity levels.


Getting Started

Prerequisites

  • Python 3.10+ (tested on Python 3.12, macOS 14 with Apple M3 Pro)
  • pip (or any compatible package manager)
  • API keys for all four providers (see API Key Configuration below)

Installation

  1. Clone the repository:
     git clone https://github.com/ryanjosephkamp/image_token_study.git
     cd image_token_study
  2. Create and activate a virtual environment:
     python3 -m venv .venv
     source .venv/bin/activate
  3. Install dependencies:
     pip install -r requirements.txt

The requirements.txt installs:

  • API client libraries: anthropic, openai, google-genai, requests
  • Image processing: Pillow, scikit-image
  • Scientific Python: numpy, pandas, scipy, matplotlib
  • Evaluation metrics: bert-score, sentence-transformers, rouge-score
  • Deep learning backend: torch (required by BERTScore and Sentence-Transformers)
  • Utilities: tqdm

API Key Configuration

The study requires API keys from all four providers. Keys are loaded from environment variables by api_client.py at runtime. Never commit API keys to the repository.

| Provider | Environment Variable | Sign-Up |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | console.anthropic.com |
| OpenAI | OPENAI_API_KEY | platform.openai.com |
| Google | GOOGLE_API_KEY | aistudio.google.dev |
| xAI | XAI_API_KEY | console.x.ai |

Option A — .env file (recommended for local development):

Create a .env file in the project root (already in .gitignore):

# .env — do NOT commit this file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...
XAI_API_KEY=xai-...

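Then load it before running. Because the file contains plain assignments without export, enable auto-export (set -a) while sourcing so the variables reach child processes such as Python:

set -a; source .env; set +a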

Option B — Shell export:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
export XAI_API_KEY="xai-..."

Option C — Google Colab Secrets:

When running notebooks on Google Colab, store keys as Colab Secrets and access them via google.colab.userdata.get().


Usage

Main Orchestrator

The main entry point is image_token_study_scripts/main.py, which defines the model registry, shared constants, and group-level entry points. It can be invoked from notebooks or from the command line:

# Run a specific execution group in dry-run mode
python image_token_study_scripts/main.py --group a --mode dry_run

# Run a group in full mode
python image_token_study_scripts/main.py --group b --mode full

Execution Groups

| Group | Description | Typical Runtime |
|---|---|---|
| a | Dataset preparation and image transformations (no API calls) | ~5 min |
| b | Experiments 1–5 on 50 scientific images (6,000 API calls) | ~2–4 hours |
| c | Experiments 1–5 on 50 general images (6,000 API calls) | ~2–4 hours |
| d | BERTScore/SBERT/ROUGE-L evaluation + LLM-as-judge (300 API calls) | ~1–2 hours |

Dry-Run / Full Notebook Workflow

Each execution group was run through a two-stage validation protocol using Jupyter notebooks in image_token_study_scripts/notebooks/:

Stage 1 — Dry-Run Notebook

A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Purpose: confirm that all API calls, transformations, CSV logging, and checkpointing work correctly before committing to the full run.

notebooks/group_a_dry_run.ipynb
notebooks/group_b_dry_run.ipynb
notebooks/group_c_dry_run.ipynb
notebooks/group_d_dry_run.ipynb

Run interactively in VS Code (local or via the Google Colab extension). Debug iteratively until all cells pass.

Stage 2 — Full Notebook

The complete experiment run for the group. Implements automatic checkpointing for pause-and-resume, cell print-stream capture to log files, and result packaging.

notebooks/group_a_full.ipynb
notebooks/group_b_full.ipynb   (+ group_b_rerun.ipynb for re-execution)
notebooks/group_c_full.ipynb
notebooks/group_d_full.ipynb

Groups A and B executed on Google Colab (A100/H100 GPU). Groups C and D executed locally on macOS (Apple M3 Pro) after analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.

Verification Scripts

After each group completes, run the corresponding verification script to validate outputs:

python image_token_study_scripts/verification_scripts/verify_group_a.py
python image_token_study_scripts/verification_scripts/verify_group_b.py
python image_token_study_scripts/verification_scripts/verify_group_c.py
python image_token_study_scripts/verification_scripts/verify_group_d.py

Analysis and figure verification:

python image_token_study_scripts/verification_scripts/verify_analysis.py
python image_token_study_scripts/verification_scripts/verify_figures.py

Project Structure

image_token_study/
├── README.md
├── LICENSE                          # MIT License
├── requirements.txt                 # Python dependencies
│
├── image_token_study_scripts/       # All source code
│   ├── main.py                      # Main orchestrator — model registry, constants, group entry points
│   ├── implementation_scripts/      # Core implementation modules
│   │   ├── api_client.py            # Unified API client for all 4 providers
│   │   ├── image_transforms.py      # T1–T4 transformation pipeline
│   │   ├── dataset_utils.py         # Dataset preparation and curation
│   │   ├── data_utils.py            # Checkpointing, CSV logging, utilities
│   │   ├── evaluation.py            # BERTScore, SBERT, ROUGE-L scoring
│   │   ├── analysis.py              # Per-experiment statistical analysis
│   │   ├── cross_analysis.py        # Cross-experiment synthesis (Exp 6, 7)
│   │   └── figure_gen.py            # Matplotlib figure generation (32 figures)
│   ├── notebooks/                   # Jupyter notebooks for each execution group
│   │   ├── group_a_dry_run.ipynb
│   │   ├── group_a_full.ipynb
│   │   ├── group_b_dry_run.ipynb
│   │   ├── group_b_full.ipynb
│   │   ├── group_b_rerun.ipynb      # Re-execution notebook for Group B
│   │   ├── group_c_dry_run.ipynb
│   │   ├── group_c_full.ipynb
│   │   ├── group_d_dry_run.ipynb
│   │   └── group_d_full.ipynb
│   └── verification_scripts/        # Post-execution validation scripts
│       ├── verify_group_a.py
│       ├── verify_group_b.py
│       ├── verify_group_c.py
│       ├── verify_group_d.py
│       ├── verify_analysis.py
│       └── verify_figures.py
│
├── image_token_study_experiments/   # All experiment data and artifacts
│   ├── checkpoint.json              # Global checkpoint state
│   ├── group_b_rerun_checkpoint.json
│   ├── group_c_checkpoint.json
│   ├── group_d_checkpoint.json
│   ├── datasets/                    # Curated image datasets
│   │   ├── scientific/              # 50 scientific figures (sci_001–sci_050)
│   │   ├── general/                 # 50 general images (gen_001–gen_050)
│   │   ├── image_metadata.csv       # Image dimensions, file sizes, formats
│   │   ├── ground_truth_labels.csv  # Ground-truth descriptions per image
│   │   └── captions.json            # Caption mapping sidecar
│   ├── transformed_images/          # 19 transformation levels × 100 images
│   │   └── T1_resolution/ ...       # Subdirectories per transformation
│   ├── group_a/                     # Group A outputs (dry_run/, full/)
│   ├── group_b_scientific/          # Group B outputs (dry_run/, full/)
│   ├── group_b_rerun_working/       # Group B re-run intermediate CSVs
│   ├── group_c_general/             # Group C outputs (dry_run/, full/)
│   ├── group_c_working/             # Group C intermediate CSVs
│   ├── group_d_evaluation/          # Group D outputs (dry_run/, full/)
│   ├── group_d_working/             # Group D intermediate and merged CSVs
│   ├── analysis/                    # Final analysis artifacts
│   │   ├── consolidated_metrics.csv # 12,000-row master dataset
│   │   ├── pareto_frontiers.csv     # Pareto-optimal configurations
│   │   ├── practical_recommendations.csv
│   │   └── summary_statistics.json  # Aggregate statistics
│   └── group_*_results_*/           # Timestamped result archives
│
├── image_token_study_reports/       # Reports and figures
│   ├── figures/                     # 32 PNG figures + figure_map.csv
│   ├── markdowns/
│   │   └── image_token_study_comprehensive_report.md
│   ├── latex/
│   │   ├── final_report.tex         # IEEE-formatted LaTeX report
│   │   ├── final_report.pdf         # Compiled PDF
│   │   └── figures/                 # LaTeX figure copies
│   └── pdfs/
│
└── image_token_study_printouts/     # Console logs from each execution group
    ├── group_a_full_log.txt
    ├── group_b_rerun_log.txt
    ├── group_c_full_log.txt
    ├── group_d_full_log.txt
    └── analysis_log.txt

Reports and Documentation

Comprehensive Report

The full study report is available in two formats:

| Format | Location | Description |
|---|---|---|
| Markdown | image_token_study_reports/markdowns/image_token_study_comprehensive_report.md | Full report with embedded figure references |
| IEEE LaTeX | image_token_study_reports/latex/final_report.tex | IEEE conference-formatted LaTeX source |
| Compiled PDF | image_token_study_reports/latex/final_report.pdf | Compiled PDF of the LaTeX report |

Figures

All 32 analysis figures are in image_token_study_reports/figures/. The complete catalogue with descriptions is in image_token_study_reports/figures/figure_map.csv.

Figure Inventory by Experiment
| Experiment | Figures | Count |
|---|---|---|
| Exp 1: Baseline | exp1_baseline_performance.png, exp1_token_distribution.png | 2 |
| Exp 2: Resolution | exp2_bertscore_vs_scale.png, exp2_sbert_vs_scale.png, exp2_tokens_vs_scale.png, exp2_degradation_heatmap.png | 4 |
| Exp 3: Compression | exp3_tokens_vs_compression.png, exp3_bertscore_vs_compression.png, exp3_jpeg_vs_webp.png, exp3_filesize_vs_accuracy.png, exp3_psnr_ssim_vs_accuracy.png | 5 |
| Exp 4: Color Depth | exp4_bertscore_by_color.png, exp4_grayscale_impact.png, exp4_quantization_curve.png, exp4_color_importance_distribution.png | 4 |
| Exp 5: Pipeline | exp5_pareto_frontier.png, exp5_pipeline_comparison.png, exp5_category_pareto.png, exp5_efficiency_ranking.png | 4 |
| Exp 6: Cross-Model | exp6_robustness_ranking_heatmap.png, exp6_category_divergence.png, exp6_rank_correlation.png, exp6_model_robustness_radar.png, exp6_scientific_vs_general_curves.png | 5 |
| Exp 7: Metric Agreement | exp7_metric_correlation_matrix.png, exp7_metric_agreement_scatter.png, exp7_disagreement_cases.png | 3 |
| Cross-Experiment | cross_model_ranking_table.png, cross_pareto_summary.png, cross_practical_recommendations_v2.png, cross_category_summary.png | 4 |
| Visual | transformation_examples_panel.png | 1 |
| Total | | 32 |

Analysis Artifacts

Key analysis outputs in image_token_study_experiments/analysis/:

| File | Description |
|---|---|
| consolidated_metrics.csv | 12,000-row master dataset with all metrics per observation |
| pareto_frontiers.csv | Pareto-optimal transformation configurations per model and category |
| practical_recommendations.csv | Best configurations at 50% and 75% token budget constraints |
| summary_statistics.json | Aggregate statistics across all experiments |
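To illustrate what "Pareto-optimal" and "best configuration under a token budget" mean for these artifacts, here is a minimal sketch. It is not the repository's analysis code: the `Config` fields, the configuration names, and every number are hypothetical, and the budget is assumed to be a fraction of the untransformed baseline's token count. A configuration is treated as Pareto-optimal when no other configuration uses the same or fewer tokens while scoring at least as high on accuracy.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str        # hypothetical transformation label
    tokens: float    # mean input tokens (lower is better)
    accuracy: float  # mean accuracy score, e.g. BERTScore F1 (higher is better)


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configs that no other config dominates."""
    # Sort by tokens ascending; among equal token counts, highest accuracy first.
    ordered = sorted(configs, key=lambda c: (c.tokens, -c.accuracy))
    frontier: list[Config] = []
    best_accuracy = float("-inf")
    for c in ordered:
        # Keep a config only if it beats every cheaper (or equally cheap) one.
        if c.accuracy > best_accuracy:
            frontier.append(c)
            best_accuracy = c.accuracy
    return frontier


def best_under_budget(
    configs: list[Config], baseline_tokens: float, fraction: float
) -> Optional[Config]:
    """Return the most accurate config within fraction * baseline_tokens."""
    budget = fraction * baseline_tokens
    affordable = [c for c in configs if c.tokens <= budget]
    return max(affordable, key=lambda c: c.accuracy) if affordable else None


# Hypothetical values for illustration only; these are not study results.
configs = [
    Config("baseline", 1000, 0.900),
    Config("res_75", 560, 0.890),
    Config("res_50", 250, 0.860),
    Config("res_25", 65, 0.700),
    Config("grayscale", 1000, 0.880),
    Config("jpeg_q50", 1000, 0.895),
]

frontier_names = [c.name for c in pareto_frontier(configs)]
pick_50 = best_under_budget(configs, baseline_tokens=1000, fraction=0.50)
pick_75 = best_under_budget(configs, baseline_tokens=1000, fraction=0.75)
print(frontier_names, pick_50.name, pick_75.name)
```

In this toy data the grayscale and JPEG configurations cost as many tokens as the baseline but score lower, so they fall off the frontier. The budget picks are simply the most accurate configurations at or below 50% and 75% of the baseline token count.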

Execution Logs

Console output from each execution group is preserved in image_token_study_printouts/ for reproducibility and debugging reference.


References

[1] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token Merging: Your ViT But Faster," in Proc. 11th Int. Conf. Learning Representations (ICLR), 2023.

[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Advances in Neural Information Processing Systems, vol. 30, 2017.

[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021.

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," in Proc. 38th Int. Conf. Machine Learning (ICML), pp. 8748–8763, 2021.

[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, "Flamingo: A Visual Language Model for Few-Shot Learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.

[6] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models," in Proc. 40th Int. Conf. Machine Learning (ICML), PMLR 202, 2023.

[7] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.

[8] Anthropic, "Vision," API Documentation. [Online]. Available: https://platform.claude.com/docs/en/docs/build-with-claude/vision

[9] OpenAI, "Images and Vision," API Documentation. [Online]. Available: https://developers.openai.com/api/docs/guides/images-vision

[10] Google, "Gemini API: Vision." [Online]. Available: https://ai.google.dev/gemini-api/docs/vision

[11] R. Keys, "Cubic Convolution Interpolation for Digital Image Processing," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.

[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 4th ed. Pearson, 2018.

[13] G. K. Wallace, "The JPEG Still Picture Compression Standard," IEEE Trans. Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.

[14] Google Developers, "WebP: An Image Format for the Web." [Online]. Available: https://developers.google.com/speed/webp

[15] P. Heckbert, "Color Image Quantization for Frame Buffer Display," ACM SIGGRAPH Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating Text Generation with BERT," in Proc. 8th Int. Conf. Learning Representations (ICLR), 2020.

[17] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 3982–3992, 2019.

[18] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Text Summarization Branches Out, pp. 74–81, 2004.

[19] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.

[20] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning," in Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279, 2022.

[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. European Conf. Computer Vision (ECCV), pp. 740–755, 2014.


License

This project is licensed under the MIT License. See the LICENSE file for details.

Copyright (c) 2026 Ryan Kamp

