Image Token Study: Vision-Language Model Robustness to Token-Reducing Image Transformations

A Systematic Cross-Provider Empirical Study of How Standard Image Transformations Affect VLM Description Accuracy and Input Token Cost

Author: Ryan Kamp Affiliation: Department of Computer Science, University of Cincinnati Location: Cincinnati, OH, USA Email: [email protected] GitHub: https://github.com/ryanjosephkamp Date: April 2026

Project Overview

Research Motivation

Vision-language models (VLMs) and large language models with vision capabilities process input images by converting them into sequences of tokens that occupy the model's context window. Each image token incurs both computational cost — latency and context window consumption — and direct financial cost through per-token API pricing. For researchers and practitioners who routinely submit images to VLM APIs, whether scientific figures for automated report generation, photographs for content analysis, or diagrams for accessibility annotation, the cumulative token cost of image inputs represents a significant and growing expense.

The number of image tokens consumed is determined by provider-specific tokenization mechanisms that depend primarily on image dimensions. For Anthropic Claude models, the token count follows an area-based formula:

$$N \approx \frac{W \times H}{750}$$

where $W$ and $H$ are the pixel dimensions after any provider-side resizing. OpenAI GPT models use a patch-based system with $32 \times 32$ pixel patches, while Google Gemini and xAI Grok employ their own dimension-dependent tokenization strategies.

A natural and practically important question arises: how far can the input image token footprint be reduced — through standard image transformations such as downscaling, compression, and color reduction — before model performance on image understanding tasks degrades to an unacceptable level?

Research Gap

Despite the ubiquity of this question, the literature lacks systematic cross-provider empirical data on the relationship between image transformations and VLM description accuracy. Prior work on token reduction has focused primarily on internal model mechanisms such as token merging, rather than on user-side input transformations that any practitioner can apply before submitting images to an API. No published study has compared the robustness of current flagship models from multiple providers under identical transformation conditions, nor has any study characterized whether scientific figures and general-purpose images exhibit meaningfully different degradation profiles.

Project Scope

This project is a formal, scientifically rigorous empirical study that systematically measures how the image description accuracy of six current flagship VLMs changes when token-reducing image transformations are applied to input images. The study spans two distinct image categories — scientific figures and general images — and four transformation families:

Family	Transformation	Mechanism
T1	Resolution Reduction	Reduces pixel dimensions via interpolation, directly reducing token count
T2	Lossy Compression	Reduces file size through JPEG/WebP encoding without changing pixel dimensions
T3	Color Depth Reduction	Reduces chromatic information through grayscale conversion or color quantization
T4	Adaptive Multi-Stage Pipeline	Combines moderate downscaling, format optimization (WebP), and quality-targeted compression

This design exploits the fact that these transformations affect different aspects of the image — spatial resolution, compression artifacts, and color information — and interact differently with provider-specific tokenization mechanisms. This enables the study to disentangle the effects of token count reduction from image quality degradation, since some transformations (T2, T3) alter image content without changing pixel dimensions and thus without affecting dimension-based token counts.

Study Scale

The study evaluates six flagship VLMs from four providers across 19 transformation levels, yielding 12,000 scored API observations. Description accuracy is assessed using four complementary metrics: BERTScore F1, Sentence-BERT cosine similarity, ROUGE-L, and an LLM-as-judge evaluation on a 300-observation subset. All experiments treat each model as a black-box API endpoint, interacting only with the public API surface.

Contributions

Empirical characterization of transformation–accuracy tradeoffs across six flagship VLMs and 19 transformation levels, yielding the first systematic cross-provider dataset on this topic.
Confirmation that lossy compression does not reduce token counts for any tested provider, establishing that image tokenization in current commercial VLM APIs is dimension-based rather than file-size-based.
Identification of practical degradation thresholds: models maintain stable accuracy through $4\times$ resolution reduction ($\text{scale} = 0.25$) for general images and through $8\times$ reduction ($\text{scale} = 0.125$) for scientific figures on SBERT cosine similarity, before sharp performance drops.
Cross-model robustness rankings revealing that GPT-5.4 achieves the highest overall description accuracy while Gemini 3.1 Pro Preview employs a fixed-token-budget strategy that renders input-side token reduction ineffective.
Evidence that scientific figures and general images exhibit distinct degradation profiles, with implications for practitioners optimizing images across different content domains.
Pareto-optimal transformation strategies for each model and image category at 50% and 75% token budget constraints.

Research Questions

The study is organized around seven research questions, each grounded in the research gaps identified in the literature review and addressed by one or more experiments.

RQ1: How does each model's description accuracy degrade as a function of downscaling factor, and at what resolution does degradation become practically unacceptable?

VLM image tokenizers compute token counts from pixel dimensions — via $32 \times 32$ patches for OpenAI, an area-based formula ($N \approx WH/750$) for Anthropic, and provider-specific mechanisms for Google and xAI. Downscaling is the most direct lever for reducing token cost. The critical unknown is the empirical shape of the degradation curve for each model and image category. (Addressed by Experiment 2.)

RQ2: Does lossy compression reduce input image token cost, or does it only degrade image quality without token savings?

Lossy JPEG and WebP compression reduce file size without changing pixel dimensions. If VLM tokenization is purely dimension-based, then compression should have zero effect on token count while still degrading image quality — making compression a "cost-free degradation" from the tokenization perspective. (Addressed by Experiment 3.)

RQ3: Do VLMs extract image understanding primarily from spatial structure or from chromatic detail, as revealed by performance under color depth reduction?

Color depth reduction (grayscale conversion, color quantization) removes chromatic information while preserving pixel dimensions and spatial structure. Scientific figures often rely on structural detail rather than color; general photographs may depend more on chromatic semantics. (Addressed by Experiment 4.)

RQ4: Can an adaptive multi-stage pipeline achieve a superior Pareto tradeoff (token cost vs. accuracy) compared to any single-axis transformation?

Distributing an information reduction budget across multiple independent axes (spatial resolution, encoding quality) should introduce less perceptual degradation than concentrating the same reduction along a single axis. The adaptive pipeline (T4) operationalizes this principle. (Addressed by Experiment 5.)

RQ5: Do scientific figures and general images exhibit meaningfully different degradation profiles under token-reducing transformations?

Scientific figures have distinctive properties — dense text, fine spatial detail, symbolic conventions, and structured layouts — that may make them differentially sensitive to certain transformations relative to general photographs. (Addressed by Experiment 6.)

RQ6: Which models are most robust to each transformation type, and is there a consistent robustness ranking across transformations?

The six models span four providers with different architectures, training data, and tokenization mechanisms. Whether robustness to image degradation is a general model property or a transformation-specific property has direct practical implications for model selection. (Addressed by Experiment 6.)

RQ7: Do the primary evaluation metrics (BERTScore, Sentence-BERT cosine similarity) agree on degradation patterns, or do they reveal different aspects of quality loss?

If the evaluation metrics agree strongly, conclusions are robust to metric choice. If they diverge, the study must characterize how and why — with implications for future evaluation methodology. (Addressed by Experiment 7.)

Models Under Study

Six flagship VLMs from four providers were evaluated as black-box API endpoints in inference-only mode. All models were accessed through their respective provider APIs with temperature set to 0.

#	Provider	Model	API Identifier	Tokenization Strategy
1	Anthropic	Claude Opus 4.6	`claude-opus-4-6-20260301`	Area-based: $N \approx WH/750$
2	Anthropic	Claude Sonnet 4.6	`claude-sonnet-4-6-20260301`	Area-based: $N \approx WH/750$
3	Anthropic	Claude Opus 4.5	`claude-opus-4-5-20250520`	Area-based: $N \approx WH/750$
4	OpenAI	GPT-5.4	`gpt-5.4`	Patch-based: $32 \times 32$ patches
5	Google	Gemini 3.1 Pro Preview	`gemini-3.1-pro-preview`	Fixed-token-budget ($\approx 1{,}094$ tokens)
6	xAI	Grok 4.20 (Reasoning)	`grok-4-20-reasoning`	Dimension-based (provider-specific)

Provider-Specific Tokenization Details

Anthropic (Claude models): All three Claude models share a deterministic tokenizer. Token count scales with pixel area ($N \approx WH/750$ after any provider-side resizing). At baseline, mean input tokens were 929 for the combined dataset (456 general, 1,401 scientific). Token counts decreased proportionally with downscaling.
OpenAI (GPT-5.4): Uses a $32 \times 32$ pixel patch system. Slightly fewer tokens than Anthropic at baseline (mean 855; 399 general, 1,311 scientific). Token counts decreased with downscaling.
Google (Gemini 3.1 Pro Preview): Employs a fixed-token-budget strategy: token counts remained near-constant at approximately 1,094 ($\sigma = 8.4$) across all images and all transformation levels, rendering dimension-based token reduction ineffective.
xAI (Grok 4.20): Dimension-based tokenization with proportional but less aggressive reduction than Anthropic/OpenAI. Mean baseline: 836 tokens (456 general, 1,217 scientific).

Experimental Configuration

Parameter	Setting
Temperature	0 for all providers
Random seed	42 (OpenAI only; logged as N/A for other providers)
Prompt	Fixed across all 12,000 primary API calls
Output format	One-to-two sentence image description
Mode	Inference only (no training or fine-tuning)

Experimental Design

Overview

The study comprises 7 experiments organized into 4 execution groups, testing 4 transformation families across 19 transformation levels, yielding 12,000 scored API observations plus 300 LLM-as-judge evaluations (12,300 total API calls).

Experiment Summary

#	Experiment	Transformation	Primary RQ	API Calls
1	Baseline Image Description	None (original)	—	600
2	Resolution Reduction Robustness	T1	RQ1	3,000
3	Lossy Compression Robustness	T2	RQ2	3,600
4	Color Depth Reduction Robustness	T3	RQ3	2,400
5	Adaptive Pipeline Evaluation	T4	RQ4	2,400
6	Cross-Model Ranking & Category Divergence	T1–T4	RQ5, RQ6	0 (analysis only)
7	Metric Agreement Analysis	T1–T4	RQ7	300 (LLM-as-judge)
	Total			12,300

Image Transformations

T1 — Resolution Reduction

Images downscaled using Lanczos interpolation at five scaling factors:

Level	Scale Factor	Effect on $1000 \times 1000$ Image	Expected Token Impact
T1-L1	0.75	$750 \times 750$	Moderate reduction
T1-L2	0.50	$500 \times 500$	Substantial reduction
T1-L3	0.25	$250 \times 250$	Large reduction
T1-L4	0.125	$125 \times 125$	Very large reduction
T1-L5	0.0625	$63 \times 63$	Near-maximum reduction

Output format: PNG (lossless, to isolate resolution effects from compression artifacts).

T2 — Lossy Compression

Six compression levels applied without changing pixel dimensions:

Level	Format	Quality	Artifact Severity
T2-L1	JPEG	85	Negligible
T2-L2	JPEG	50	Mild blocking
T2-L3	JPEG	20	Visible blocking and color banding
T2-L4	JPEG	5	Severe blocking, ringing, color loss
T2-L5	WebP	85	Negligible
T2-L6	WebP	50	Mild predictive artifacts

T3 — Color Depth Reduction

Four color conditions applied without changing pixel dimensions:

Level	Condition	Colors	Method
T3-L1	Quantize 64	64	Median-cut algorithm
T3-L2	Quantize 16	16	Median-cut algorithm
T3-L3	Quantize 4	4	Median-cut algorithm
T3-L4	Grayscale	256 (luminance)	ITU-R BT.601: $Y = 0.299R + 0.587G + 0.114B$

Grayscale images reconverted to 3-channel RGB for API compatibility. Output format: PNG (lossless).

T4 — Adaptive Multi-Stage Pipeline

Sequential pipeline: Lanczos downscale → WebP conversion → quality-targeted compression.

Level	Config	Scale	Format	Quality	Token Reduction
T4-L1	Conservative	0.75	WebP	85	Moderate
T4-L2	Balanced	0.50	WebP	70	Substantial
T4-L3	Aggressive	0.25	WebP	50	Large
T4-L4	Maximum	0.125	WebP	30	Very large

Execution Groups

Experiments were organized into four execution groups to manage API costs, computational resources, and the dry-run/full validation workflow.

Group	Scope	Content	Execution
A	Dataset preparation	Image curation, format standardization, ground-truth validation, transformation application (19 levels × 100 images)	Local — no API calls
B	Scientific images	Experiments 1–5 on 50 scientific images × 6 models × all transformation levels	6,000 API calls
C	General images	Experiments 1–5 on 50 general images × 6 models × all transformation levels	6,000 API calls
D	Evaluation & analysis	BERTScore, SBERT, ROUGE-L on 12,000 observations; LLM-as-judge on 300-observation subset; Experiments 6–7	300 API calls + local compute

Dry-Run / Full Notebook Workflow

Each execution group followed a two-stage validation protocol:

Dry-run notebook: A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Run interactively from VS Code via the Google Colab extension on an A100/H100 GPU. Debugged iteratively until successful completion.
Full notebook: The complete experiment run, initially planned for Colab H100 execution. Groups A and B executed on Colab; Groups C and D executed locally on macOS (Apple M3 Pro) after GPU vs. CPU analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.

Both notebooks implemented:

Automatic checkpointing for pause-and-resume across sessions
Cell print-stream capture to log files
Result packaging into downloadable .zip archives
JSON checkpoint files for tracking completion state

Evaluation Methodology

Metric	Type	Model/Method	Role
BERTScore F1	Token-level embedding	`microsoft/deberta-xlarge-mnli`	Primary — semantic alignment at token granularity
SBERT Cosine Similarity	Sentence-level embedding	`all-MiniLM-L6-v2`	Primary — holistic semantic equivalence
ROUGE-L F1	Lexical overlap	N-gram matching	Secondary — surface-level reference baseline
LLM-as-Judge	LLM evaluation	GPT-5.4, 1–5 rubric	Supplementary — 300-observation subset

Standardized Prompt

All 12,000 primary API calls used the same fixed prompt:

Describe what this image shows in one to two concise sentences. Focus on the main content and any important details visible in the image.

Datasets

Category	Size	Source	Ground-Truth Labels
Scientific	50 images (`sci_001`–`sci_050`)	Open-access publications, curated datasets	One-to-two sentence figure descriptions
General	50 images (`gen_001`–`gen_050`)	MS COCO 2017 validation set	First of five COCO captions per image

All images standardized to PNG format, RGB color space, minimum $640 \times 480$ resolution.

Key Findings

The study's 12,000 scored API observations yield seven principal findings, each tied to a research question.

1. Resolution Reduction Is the Only Effective Token Reduction Lever

Resolution reduction (T1) is the only single-axis transformation that consistently lowers input token counts across all providers with dimension-based tokenization. Models maintain stable accuracy through 4× downscaling (scale 0.25), which reduces token counts by 89–94% for Anthropic and OpenAI models. The degradation curve exhibits a plateau-then-cliff shape: accuracy is stable from 0.75× to 0.25× scale, then drops sharply at 0.125× and 0.0625×. At 16× downscaling (scale 0.0625), all models converge to a degraded floor with BERTScore F1 in the 0.15–0.19 range.

2. Lossy Compression Does Not Reduce Token Counts

JPEG and WebP compression did not reduce input token counts for any of the four providers tested, confirming that current commercial VLM tokenization is dimension-based rather than file-size-based. Despite having no token benefit, models demonstrated remarkable robustness to compression artifacts: accuracy remained effectively constant from JPEG $q = 85$ down to $q = 5$.

3. VLMs Rely Primarily on Spatial Structure, Not Color

Five of six models maintained accuracy within 5% of baseline under complete grayscale conversion, indicating that spatial structure — edges, text, object boundaries — carries far more information for VLM understanding than chromatic detail. Gemini 3.1 Pro Preview was the sole exception, showing a 36% relative BERTScore decline under grayscale conversion. Scientific figures were especially color-invariant, consistent with their reliance on layout and text rather than color semantics.

4. The Adaptive Pipeline Offers Practical Convenience

The multi-stage pipeline (T4) combining downscaling with WebP compression appeared on or near the Pareto frontier for most models, providing simultaneous token reduction and file size reduction. The Balanced configuration (T4-L2: 0.50× scale, WebP $q = 70$) achieves approximately 71% token savings with good accuracy preservation (SBERT $\approx 0.55$ for Anthropic models). However, the pipeline does not fundamentally dominate single-axis resolution reduction at equivalent scaling factors.

5. Scientific and General Images Degrade Differently

The two image categories exhibited an unexpected metric asymmetry: general images scored higher on BERTScore F1 (0.285 versus 0.141 at baseline) but lower on SBERT cosine similarity (0.548 versus 0.670). Under resolution reduction, general images degraded faster on SBERT (45% decline from baseline to 16× downscaling) than scientific images (28% decline), contradicting the initial hypothesis that scientific figures would be more sensitive due to fine text and line detail.

6. GPT-5.4 Is the Most Accurate and Robust Model

Across all 12,000 observations, GPT-5.4 achieved the highest mean BERTScore F1 (0.239) and SBERT cosine similarity (0.623). Grok 4.20 Reasoning ranked second, followed by the three Anthropic models. Gemini 3.1 Pro Preview employed a fixed-token-budget strategy ($\approx 1{,}094$ tokens regardless of image dimensions), rendering all input-side transformations ineffective for token reduction on that provider.

7. BERTScore and SBERT Capture Different Quality Dimensions

BERTScore F1 and SBERT cosine similarity showed near-zero correlation ($r = 0.188$ across the full dataset, $r = 0.085$ on the 300-observation judge subset). SBERT correlated most strongly with LLM-as-judge scores ($r = 0.503$), while BERTScore showed weak correlation ($r = 0.125$). This finding argues strongly for multi-metric evaluation as standard practice in VLM studies.

Selected Results

The following figures illustrate key experimental outcomes. All 31 figures and their descriptions are catalogued in the figure map.

Baseline Model Performance (Experiment 1)

Figure 1. Mean BERTScore F1 at baseline (original images) for each model, faceted by scientific and general image categories.

Resolution Degradation Curves (Experiment 2)

Figure 3. BERTScore F1 versus scaling factor (0.0625–0.75×) per model and image category, showing the characteristic plateau-then-cliff degradation pattern.

Pareto Frontier — Token Cost vs. Accuracy (Experiment 5)

Figure 16. Pareto frontier of input token cost versus BERTScore F1 across all transformation families. Points on the frontier represent configurations that are not dominated on both axes by any other configuration.

Cross-Model Robustness Ranking (Experiment 6)

Figure 20. AUDC BERTScore F1 heatmap across models and transformation families. Higher values indicate greater robustness to the corresponding transformation.

Metric Correlation Analysis (Experiment 7)

Figure 25. Pairwise Pearson correlation matrix of evaluation metrics on the 300-observation LLM-as-judge subset. SBERT cosine similarity correlates most strongly with LLM judge scores.

Transformation Examples

Transformation examples. Visual illustration of the four transformation families applied to a sample image at increasing intensity levels.

Getting Started

Prerequisites

Python 3.10+ (tested on Python 3.12, macOS 14 with Apple M3 Pro)
pip (or any compatible package manager)
API keys for all four providers (see API Key Configuration below)

Installation

Clone the repository:

git clone https://github.com/ryanjosephkamp/image_token_study.git
cd image_token_study

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

The requirements.txt installs:

API client libraries: anthropic, openai, google-genai, requests
Image processing: Pillow, scikit-image
Scientific Python: numpy, pandas, scipy, matplotlib
Evaluation metrics: bert-score, sentence-transformers, rouge-score
Deep learning backend: torch (required by BERTScore and Sentence-Transformers)
Utilities: tqdm

API Key Configuration

The study requires API keys from all four providers. Keys are loaded from environment variables by api_client.py at runtime. Never commit API keys to the repository.

Provider	Environment Variable	Sign-Up
Anthropic	`ANTHROPIC_API_KEY`	console.anthropic.com
OpenAI	`OPENAI_API_KEY`	platform.openai.com
Google	`GOOGLE_API_KEY`	aistudio.google.dev
xAI	`XAI_API_KEY`	console.x.ai

Option A — .env file (recommended for local development):

Create a .env file in the project root (already in .gitignore):

# .env — do NOT commit this file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...
XAI_API_KEY=xai-...

Then source it before running:

source .env

Option B — Shell export:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
export XAI_API_KEY="xai-..."

Option C — Google Colab Secrets:

When running notebooks on Google Colab, store keys as Colab Secrets and access them via google.colab.userdata.get().

Usage

Main Orchestrator

The main entry point is image_token_study_scripts/main.py, which defines the model registry, shared constants, and group-level entry points. It can be invoked from notebooks or from the command line:

# Run a specific execution group in dry-run mode
python image_token_study_scripts/main.py --group a --mode dry_run

# Run a group in full mode
python image_token_study_scripts/main.py --group b --mode full

Execution Groups

Group	Description	Typical Runtime
`a`	Dataset preparation and image transformations (no API calls)	~5 min
`b`	Experiments 1–5 on 50 scientific images (6,000 API calls)	~2–4 hours
`c`	Experiments 1–5 on 50 general images (6,000 API calls)	~2–4 hours
`d`	BERTScore/SBERT/ROUGE-L evaluation + LLM-as-judge (300 API calls)	~1–2 hours

Dry-Run / Full Notebook Workflow

Each execution group was run through a two-stage validation protocol using Jupyter notebooks in image_token_study_scripts/notebooks/:

Stage 1 — Dry-Run Notebook

A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Purpose: confirm that all API calls, transformations, CSV logging, and checkpointing work correctly before committing to the full run.

notebooks/group_a_dry_run.ipynb
notebooks/group_b_dry_run.ipynb
notebooks/group_c_dry_run.ipynb
notebooks/group_d_dry_run.ipynb

Run interactively in VS Code (local or via the Google Colab extension). Debug iteratively until all cells pass.

Stage 2 — Full Notebook

The complete experiment run for the group. Implements automatic checkpointing for pause-and-resume, cell print-stream capture to log files, and result packaging.

notebooks/group_a_full.ipynb
notebooks/group_b_full.ipynb   (+ group_b_rerun.ipynb for re-execution)
notebooks/group_c_full.ipynb
notebooks/group_d_full.ipynb

Groups A and B executed on Google Colab (A100/H100 GPU). Groups C and D executed locally on macOS (Apple M3 Pro) after analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.

Verification Scripts

After each group completes, run the corresponding verification script to validate outputs:

python image_token_study_scripts/verification_scripts/verify_group_a.py
python image_token_study_scripts/verification_scripts/verify_group_b.py
python image_token_study_scripts/verification_scripts/verify_group_c.py
python image_token_study_scripts/verification_scripts/verify_group_d.py

Analysis and figure verification:

python image_token_study_scripts/verification_scripts/verify_analysis.py
python image_token_study_scripts/verification_scripts/verify_figures.py

Project Structure

image_token_study/
├── README.md
├── LICENSE                          # MIT License
├── requirements.txt                 # Python dependencies
│
├── image_token_study_scripts/       # All source code
│   ├── main.py                      # Main orchestrator — model registry, constants, group entry points
│   ├── implementation_scripts/      # Core implementation modules
│   │   ├── api_client.py            # Unified API client for all 4 providers
│   │   ├── image_transforms.py      # T1–T4 transformation pipeline
│   │   ├── dataset_utils.py         # Dataset preparation and curation
│   │   ├── data_utils.py            # Checkpointing, CSV logging, utilities
│   │   ├── evaluation.py            # BERTScore, SBERT, ROUGE-L scoring
│   │   ├── analysis.py              # Per-experiment statistical analysis
│   │   ├── cross_analysis.py        # Cross-experiment synthesis (Exp 6, 7)
│   │   └── figure_gen.py            # Matplotlib figure generation (32 figures)
│   ├── notebooks/                   # Jupyter notebooks for each execution group
│   │   ├── group_a_dry_run.ipynb
│   │   ├── group_a_full.ipynb
│   │   ├── group_b_dry_run.ipynb
│   │   ├── group_b_full.ipynb
│   │   ├── group_b_rerun.ipynb      # Re-execution notebook for Group B
│   │   ├── group_c_dry_run.ipynb
│   │   ├── group_c_full.ipynb
│   │   ├── group_d_dry_run.ipynb
│   │   └── group_d_full.ipynb
│   └── verification_scripts/        # Post-execution validation scripts
│       ├── verify_group_a.py
│       ├── verify_group_b.py
│       ├── verify_group_c.py
│       ├── verify_group_d.py
│       ├── verify_analysis.py
│       └── verify_figures.py
│
├── image_token_study_experiments/   # All experiment data and artifacts
│   ├── checkpoint.json              # Global checkpoint state
│   ├── group_b_rerun_checkpoint.json
│   ├── group_c_checkpoint.json
│   ├── group_d_checkpoint.json
│   ├── datasets/                    # Curated image datasets
│   │   ├── scientific/              # 50 scientific figures (sci_001–sci_050)
│   │   ├── general/                 # 50 general images (gen_001–gen_050)
│   │   ├── image_metadata.csv       # Image dimensions, file sizes, formats
│   │   ├── ground_truth_labels.csv  # Ground-truth descriptions per image
│   │   └── captions.json            # Caption mapping sidecar
│   ├── transformed_images/          # 19 transformation levels × 100 images
│   │   └── T1_resolution/ ...       # Subdirectories per transformation
│   ├── group_a/                     # Group A outputs (dry_run/, full/)
│   ├── group_b_scientific/          # Group B outputs (dry_run/, full/)
│   ├── group_b_rerun_working/       # Group B re-run intermediate CSVs
│   ├── group_c_general/             # Group C outputs (dry_run/, full/)
│   ├── group_c_working/             # Group C intermediate CSVs
│   ├── group_d_evaluation/          # Group D outputs (dry_run/, full/)
│   ├── group_d_working/             # Group D intermediate and merged CSVs
│   ├── analysis/                    # Final analysis artifacts
│   │   ├── consolidated_metrics.csv # 12,000-row master dataset
│   │   ├── pareto_frontiers.csv     # Pareto-optimal configurations
│   │   ├── practical_recommendations.csv
│   │   └── summary_statistics.json  # Aggregate statistics
│   └── group_*_results_*/           # Timestamped result archives
│
├── image_token_study_reports/       # Reports and figures
│   ├── figures/                     # 32 PNG figures + figure_map.csv
│   ├── markdowns/
│   │   └── image_token_study_comprehensive_report.md
│   ├── latex/
│   │   ├── final_report.tex         # IEEE-formatted LaTeX report
│   │   ├── final_report.pdf         # Compiled PDF
│   │   └── figures/                 # LaTeX figure copies
│   └── pdfs/
│
└── image_token_study_printouts/     # Console logs from each execution group
    ├── group_a_full_log.txt
    ├── group_b_rerun_log.txt
    ├── group_c_full_log.txt
    ├── group_d_full_log.txt
    └── analysis_log.txt

Reports and Documentation

Comprehensive Report

The full study report is available in two formats:

Format	Location	Description
Markdown	`image_token_study_reports/markdowns/image_token_study_comprehensive_report.md`	Full report with embedded figure references
IEEE LaTeX	`image_token_study_reports/latex/final_report.tex`	IEEE conference-formatted LaTeX source
Compiled PDF	`image_token_study_reports/latex/final_report.pdf`	Compiled PDF of the LaTeX report

Figures

All 32 analysis figures are in image_token_study_reports/figures/. The complete catalogue with descriptions is in image_token_study_reports/figures/figure_map.csv.

Figure Inventory by Experiment

Experiment	Figures	Count
Exp 1 — Baseline	`exp1_baseline_performance.png`, `exp1_token_distribution.png`	2
Exp 2 — Resolution	`exp2_bertscore_vs_scale.png`, `exp2_sbert_vs_scale.png`, `exp2_tokens_vs_scale.png`, `exp2_degradation_heatmap.png`	4
Exp 3 — Compression	`exp3_tokens_vs_compression.png`, `exp3_bertscore_vs_compression.png`, `exp3_jpeg_vs_webp.png`, `exp3_filesize_vs_accuracy.png`, `exp3_psnr_ssim_vs_accuracy.png`	5
Exp 4 — Color Depth	`exp4_bertscore_by_color.png`, `exp4_grayscale_impact.png`, `exp4_quantization_curve.png`, `exp4_color_importance_distribution.png`	4
Exp 5 — Pipeline	`exp5_pareto_frontier.png`, `exp5_pipeline_comparison.png`, `exp5_category_pareto.png`, `exp5_efficiency_ranking.png`	4
Exp 6 — Cross-Model	`exp6_robustness_ranking_heatmap.png`, `exp6_category_divergence.png`, `exp6_rank_correlation.png`, `exp6_model_robustness_radar.png`, `exp6_scientific_vs_general_curves.png`	5
Exp 7 — Metric Agreement	`exp7_metric_correlation_matrix.png`, `exp7_metric_agreement_scatter.png`, `exp7_disagreement_cases.png`	3
Cross-Experiment	`cross_model_ranking_table.png`, `cross_pareto_summary.png`, `cross_practical_recommendations_v2.png`, `cross_category_summary.png`	4
Visual	`transformation_examples_panel.png`	1
Total		32

Analysis Artifacts

Key analysis outputs in image_token_study_experiments/analysis/:

File	Description
`consolidated_metrics.csv`	12,000-row master dataset with all metrics per observation
`pareto_frontiers.csv`	Pareto-optimal transformation configurations per model and category
`practical_recommendations.csv`	Best configurations at 50% and 75% token budget constraints
`summary_statistics.json`	Aggregate statistics across all experiments

Execution Logs

Console output from each execution group is preserved in image_token_study_printouts/ for reproducibility and debugging reference.

References

[1] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token Merging: Your ViT But Faster," in Proc. 11th Int. Conf. Learning Representations (ICLR), 2023.

[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Advances in Neural Information Processing Systems, vol. 30, 2017.

[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021.

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," in Proc. 38th Int. Conf. Machine Learning (ICML), pp. 8748–8763, 2021.

[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, "Flamingo: A Visual Language Model for Few-Shot Learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.

[6] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models," in Proc. 40th Int. Conf. Machine Learning (ICML), PMLR 202, 2023.

[7] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.

[8] Anthropic, "Vision," API Documentation. [Online]. Available: https://platform.claude.com/docs/en/docs/build-with-claude/vision

[9] OpenAI, "Images and Vision," API Documentation. [Online]. Available: https://developers.openai.com/api/docs/guides/images-vision

[10] Google, "Gemini API: Vision." [Online]. Available: https://ai.google.dev/gemini-api/docs/vision

[11] R. Keys, "Cubic Convolution Interpolation for Digital Image Processing," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.

[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 4th ed. Pearson, 2018.

[13] G. K. Wallace, "The JPEG Still Picture Compression Standard," IEEE Trans. Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.

[14] Google Developers, "WebP: An Image Format for the Web." [Online]. Available: https://developers.google.com/speed/webp

[15] P. Heckbert, "Color Image Quantization for Frame Buffer Display," ACM SIGGRAPH Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.

[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating Text Generation with BERT," in Proc. 8th Int. Conf. Learning Representations (ICLR), 2020.

[17] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 3982–3992, 2019.

[18] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Text Summarization Branches Out, pp. 74–81, 2004.

[19] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.

[20] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning," in Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279, 2022.

[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. European Conf. Computer Vision (ECCV), pp. 740–755, 2014.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
image_token_study_experiments		image_token_study_experiments
image_token_study_printouts		image_token_study_printouts
image_token_study_reports		image_token_study_reports
image_token_study_scripts		image_token_study_scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Image Token Study: Vision-Language Model Robustness to Token-Reducing Image Transformations

A Systematic Cross-Provider Empirical Study of How Standard Image Transformations Affect VLM Description Accuracy and Input Token Cost

Table of Contents

Project Overview

Research Motivation

Research Gap

Project Scope

Study Scale

Contributions

Research Questions

Models Under Study

Experimental Design

Overview

Experiment Summary

Image Transformations

Execution Groups

Evaluation Methodology

Standardized Prompt

Datasets

Key Findings

1. Resolution Reduction Is the Only Effective Token Reduction Lever

2. Lossy Compression Does Not Reduce Token Counts

3. VLMs Rely Primarily on Spatial Structure, Not Color

4. The Adaptive Pipeline Offers Practical Convenience

5. Scientific and General Images Degrade Differently

6. GPT-5.4 Is the Most Accurate and Robust Model

7. BERTScore and SBERT Capture Different Quality Dimensions

Selected Results

Baseline Model Performance (Experiment 1)

Resolution Degradation Curves (Experiment 2)

Pareto Frontier — Token Cost vs. Accuracy (Experiment 5)

Cross-Model Robustness Ranking (Experiment 6)

Metric Correlation Analysis (Experiment 7)

Transformation Examples

Getting Started

Prerequisites

Installation

API Key Configuration

Usage

Main Orchestrator

Execution Groups

Dry-Run / Full Notebook Workflow

Verification Scripts

Project Structure

Reports and Documentation

Comprehensive Report

Figures

Analysis Artifacts

Execution Logs

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages