A Systematic Cross-Provider Empirical Study of How Standard Image Transformations Affect VLM Description Accuracy and Input Token Cost
Author: Ryan Kamp Affiliation: Department of Computer Science, University of Cincinnati Location: Cincinnati, OH, USA Email: [email protected] GitHub: https://github.com/ryanjosephkamp Date: April 2026
- Project Overview
- Research Questions
- Models Under Study
- Image Transformations
- Datasets
- Evaluation Methodology
- Experimental Design
- Key Findings
- Selected Results
- Getting Started
- Usage
- Project Structure
- Reports and Documentation
- References
- License
Vision-language models (VLMs) and large language models with vision capabilities process input images by converting them into sequences of tokens that occupy the model's context window. Each image token incurs both computational cost — latency and context window consumption — and direct financial cost through per-token API pricing. For researchers and practitioners who routinely submit images to VLM APIs, whether scientific figures for automated report generation, photographs for content analysis, or diagrams for accessibility annotation, the cumulative token cost of image inputs represents a significant and growing expense.
The number of image tokens consumed is determined by provider-specific tokenization mechanisms that depend primarily on image dimensions. For Anthropic Claude models, the token count follows an area-based formula:
where
A natural and practically important question arises: how far can the input image token footprint be reduced — through standard image transformations such as downscaling, compression, and color reduction — before model performance on image understanding tasks degrades to an unacceptable level?
Despite the ubiquity of this question, the literature lacks systematic cross-provider empirical data on the relationship between image transformations and VLM description accuracy. Prior work on token reduction has focused primarily on internal model mechanisms such as token merging, rather than on user-side input transformations that any practitioner can apply before submitting images to an API. No published study has compared the robustness of current flagship models from multiple providers under identical transformation conditions, nor has any study characterized whether scientific figures and general-purpose images exhibit meaningfully different degradation profiles.
This project is a formal, scientifically rigorous empirical study that systematically measures how the image description accuracy of six current flagship VLMs changes when token-reducing image transformations are applied to input images. The study spans two distinct image categories — scientific figures and general images — and four transformation families:
| Family | Transformation | Mechanism |
|---|---|---|
| T1 | Resolution Reduction | Reduces pixel dimensions via interpolation, directly reducing token count |
| T2 | Lossy Compression | Reduces file size through JPEG/WebP encoding without changing pixel dimensions |
| T3 | Color Depth Reduction | Reduces chromatic information through grayscale conversion or color quantization |
| T4 | Adaptive Multi-Stage Pipeline | Combines moderate downscaling, format optimization (WebP), and quality-targeted compression |
This design exploits the fact that these transformations affect different aspects of the image — spatial resolution, compression artifacts, and color information — and interact differently with provider-specific tokenization mechanisms. This enables the study to disentangle the effects of token count reduction from image quality degradation, since some transformations (T2, T3) alter image content without changing pixel dimensions and thus without affecting dimension-based token counts.
The study evaluates six flagship VLMs from four providers across 19 transformation levels, yielding 12,000 scored API observations. Description accuracy is assessed using four complementary metrics: BERTScore F1, Sentence-BERT cosine similarity, ROUGE-L, and an LLM-as-judge evaluation on a 300-observation subset. All experiments treat each model as a black-box API endpoint, interacting only with the public API surface.
- Empirical characterization of transformation–accuracy tradeoffs across six flagship VLMs and 19 transformation levels, yielding the first systematic cross-provider dataset on this topic.
- Confirmation that lossy compression does not reduce token counts for any tested provider, establishing that image tokenization in current commercial VLM APIs is dimension-based rather than file-size-based.
-
Identification of practical degradation thresholds: models maintain stable accuracy through
$4\times$ resolution reduction ($\text{scale} = 0.25$ ) for general images and through$8\times$ reduction ($\text{scale} = 0.125$ ) for scientific figures on SBERT cosine similarity, before sharp performance drops. - Cross-model robustness rankings revealing that GPT-5.4 achieves the highest overall description accuracy while Gemini 3.1 Pro Preview employs a fixed-token-budget strategy that renders input-side token reduction ineffective.
- Evidence that scientific figures and general images exhibit distinct degradation profiles, with implications for practitioners optimizing images across different content domains.
- Pareto-optimal transformation strategies for each model and image category at 50% and 75% token budget constraints.
The study is organized around seven research questions, each grounded in the research gaps identified in the literature review and addressed by one or more experiments.
RQ1: How does each model's description accuracy degrade as a function of downscaling factor, and at what resolution does degradation become practically unacceptable?
VLM image tokenizers compute token counts from pixel dimensions — via
RQ2: Does lossy compression reduce input image token cost, or does it only degrade image quality without token savings?
Lossy JPEG and WebP compression reduce file size without changing pixel dimensions. If VLM tokenization is purely dimension-based, then compression should have zero effect on token count while still degrading image quality — making compression a "cost-free degradation" from the tokenization perspective. (Addressed by Experiment 3.)
RQ3: Do VLMs extract image understanding primarily from spatial structure or from chromatic detail, as revealed by performance under color depth reduction?
Color depth reduction (grayscale conversion, color quantization) removes chromatic information while preserving pixel dimensions and spatial structure. Scientific figures often rely on structural detail rather than color; general photographs may depend more on chromatic semantics. (Addressed by Experiment 4.)
RQ4: Can an adaptive multi-stage pipeline achieve a superior Pareto tradeoff (token cost vs. accuracy) compared to any single-axis transformation?
Distributing an information reduction budget across multiple independent axes (spatial resolution, encoding quality) should introduce less perceptual degradation than concentrating the same reduction along a single axis. The adaptive pipeline (T4) operationalizes this principle. (Addressed by Experiment 5.)
RQ5: Do scientific figures and general images exhibit meaningfully different degradation profiles under token-reducing transformations?
Scientific figures have distinctive properties — dense text, fine spatial detail, symbolic conventions, and structured layouts — that may make them differentially sensitive to certain transformations relative to general photographs. (Addressed by Experiment 6.)
RQ6: Which models are most robust to each transformation type, and is there a consistent robustness ranking across transformations?
The six models span four providers with different architectures, training data, and tokenization mechanisms. Whether robustness to image degradation is a general model property or a transformation-specific property has direct practical implications for model selection. (Addressed by Experiment 6.)
RQ7: Do the primary evaluation metrics (BERTScore, Sentence-BERT cosine similarity) agree on degradation patterns, or do they reveal different aspects of quality loss?
If the evaluation metrics agree strongly, conclusions are robust to metric choice. If they diverge, the study must characterize how and why — with implications for future evaluation methodology. (Addressed by Experiment 7.)
Six flagship VLMs from four providers were evaluated as black-box API endpoints in inference-only mode. All models were accessed through their respective provider APIs with temperature set to 0.
| # | Provider | Model | API Identifier | Tokenization Strategy |
|---|---|---|---|---|
| 1 | Anthropic | Claude Opus 4.6 | claude-opus-4-6-20260301 |
Area-based: |
| 2 | Anthropic | Claude Sonnet 4.6 | claude-sonnet-4-6-20260301 |
Area-based: |
| 3 | Anthropic | Claude Opus 4.5 | claude-opus-4-5-20250520 |
Area-based: |
| 4 | OpenAI | GPT-5.4 | gpt-5.4 |
Patch-based: |
| 5 | Gemini 3.1 Pro Preview | gemini-3.1-pro-preview |
Fixed-token-budget ( |
|
| 6 | xAI | Grok 4.20 (Reasoning) | grok-4-20-reasoning |
Dimension-based (provider-specific) |
Provider-Specific Tokenization Details
-
Anthropic (Claude models): All three Claude models share a deterministic tokenizer. Token count scales with pixel area (
$N \approx WH/750$ after any provider-side resizing). At baseline, mean input tokens were 929 for the combined dataset (456 general, 1,401 scientific). Token counts decreased proportionally with downscaling. -
OpenAI (GPT-5.4): Uses a
$32 \times 32$ pixel patch system. Slightly fewer tokens than Anthropic at baseline (mean 855; 399 general, 1,311 scientific). Token counts decreased with downscaling. -
Google (Gemini 3.1 Pro Preview): Employs a fixed-token-budget strategy: token counts remained near-constant at approximately 1,094 (
$\sigma = 8.4$ ) across all images and all transformation levels, rendering dimension-based token reduction ineffective. - xAI (Grok 4.20): Dimension-based tokenization with proportional but less aggressive reduction than Anthropic/OpenAI. Mean baseline: 836 tokens (456 general, 1,217 scientific).
Experimental Configuration
| Parameter | Setting |
|---|---|
| Temperature | 0 for all providers |
| Random seed | 42 (OpenAI only; logged as N/A for other providers) |
| Prompt | Fixed across all 12,000 primary API calls |
| Output format | One-to-two sentence image description |
| Mode | Inference only (no training or fine-tuning) |
The study comprises 7 experiments organized into 4 execution groups, testing 4 transformation families across 19 transformation levels, yielding 12,000 scored API observations plus 300 LLM-as-judge evaluations (12,300 total API calls).
| # | Experiment | Transformation | Primary RQ | API Calls |
|---|---|---|---|---|
| 1 | Baseline Image Description | None (original) | — | 600 |
| 2 | Resolution Reduction Robustness | T1 | RQ1 | 3,000 |
| 3 | Lossy Compression Robustness | T2 | RQ2 | 3,600 |
| 4 | Color Depth Reduction Robustness | T3 | RQ3 | 2,400 |
| 5 | Adaptive Pipeline Evaluation | T4 | RQ4 | 2,400 |
| 6 | Cross-Model Ranking & Category Divergence | T1–T4 | RQ5, RQ6 | 0 (analysis only) |
| 7 | Metric Agreement Analysis | T1–T4 | RQ7 | 300 (LLM-as-judge) |
| Total | 12,300 |
T1 — Resolution Reduction
Images downscaled using Lanczos interpolation at five scaling factors:
| Level | Scale Factor | Effect on |
Expected Token Impact |
|---|---|---|---|
| T1-L1 | 0.75 | Moderate reduction | |
| T1-L2 | 0.50 | Substantial reduction | |
| T1-L3 | 0.25 | Large reduction | |
| T1-L4 | 0.125 | Very large reduction | |
| T1-L5 | 0.0625 | Near-maximum reduction |
Output format: PNG (lossless, to isolate resolution effects from compression artifacts).
T2 — Lossy Compression
Six compression levels applied without changing pixel dimensions:
| Level | Format | Quality | Artifact Severity |
|---|---|---|---|
| T2-L1 | JPEG | 85 | Negligible |
| T2-L2 | JPEG | 50 | Mild blocking |
| T2-L3 | JPEG | 20 | Visible blocking and color banding |
| T2-L4 | JPEG | 5 | Severe blocking, ringing, color loss |
| T2-L5 | WebP | 85 | Negligible |
| T2-L6 | WebP | 50 | Mild predictive artifacts |
T3 — Color Depth Reduction
Four color conditions applied without changing pixel dimensions:
| Level | Condition | Colors | Method |
|---|---|---|---|
| T3-L1 | Quantize 64 | 64 | Median-cut algorithm |
| T3-L2 | Quantize 16 | 16 | Median-cut algorithm |
| T3-L3 | Quantize 4 | 4 | Median-cut algorithm |
| T3-L4 | Grayscale | 256 (luminance) | ITU-R BT.601: |
Grayscale images reconverted to 3-channel RGB for API compatibility. Output format: PNG (lossless).
T4 — Adaptive Multi-Stage Pipeline
Sequential pipeline: Lanczos downscale → WebP conversion → quality-targeted compression.
| Level | Config | Scale | Format | Quality | Token Reduction |
|---|---|---|---|---|---|
| T4-L1 | Conservative | 0.75 | WebP | 85 | Moderate |
| T4-L2 | Balanced | 0.50 | WebP | 70 | Substantial |
| T4-L3 | Aggressive | 0.25 | WebP | 50 | Large |
| T4-L4 | Maximum | 0.125 | WebP | 30 | Very large |
Experiments were organized into four execution groups to manage API costs, computational resources, and the dry-run/full validation workflow.
| Group | Scope | Content | Execution |
|---|---|---|---|
| A | Dataset preparation | Image curation, format standardization, ground-truth validation, transformation application (19 levels × 100 images) | Local — no API calls |
| B | Scientific images | Experiments 1–5 on 50 scientific images × 6 models × all transformation levels | 6,000 API calls |
| C | General images | Experiments 1–5 on 50 general images × 6 models × all transformation levels | 6,000 API calls |
| D | Evaluation & analysis | BERTScore, SBERT, ROUGE-L on 12,000 observations; LLM-as-judge on 300-observation subset; Experiments 6–7 | 300 API calls + local compute |
Dry-Run / Full Notebook Workflow
Each execution group followed a two-stage validation protocol:
- Dry-run notebook: A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Run interactively from VS Code via the Google Colab extension on an A100/H100 GPU. Debugged iteratively until successful completion.
- Full notebook: The complete experiment run, initially planned for Colab H100 execution. Groups A and B executed on Colab; Groups C and D executed locally on macOS (Apple M3 Pro) after GPU vs. CPU analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.
Both notebooks implemented:
- Automatic checkpointing for pause-and-resume across sessions
- Cell print-stream capture to log files
- Result packaging into downloadable
.ziparchives - JSON checkpoint files for tracking completion state
| Metric | Type | Model/Method | Role |
|---|---|---|---|
| BERTScore F1 | Token-level embedding | microsoft/deberta-xlarge-mnli |
Primary — semantic alignment at token granularity |
| SBERT Cosine Similarity | Sentence-level embedding | all-MiniLM-L6-v2 |
Primary — holistic semantic equivalence |
| ROUGE-L F1 | Lexical overlap | N-gram matching | Secondary — surface-level reference baseline |
| LLM-as-Judge | LLM evaluation | GPT-5.4, 1–5 rubric | Supplementary — 300-observation subset |
All 12,000 primary API calls used the same fixed prompt:
Describe what this image shows in one to two concise sentences. Focus on the main content and any important details visible in the image.
| Category | Size | Source | Ground-Truth Labels |
|---|---|---|---|
| Scientific | 50 images (sci_001–sci_050) |
Open-access publications, curated datasets | One-to-two sentence figure descriptions |
| General | 50 images (gen_001–gen_050) |
MS COCO 2017 validation set | First of five COCO captions per image |
All images standardized to PNG format, RGB color space, minimum
The study's 12,000 scored API observations yield seven principal findings, each tied to a research question.
Resolution reduction (T1) is the only single-axis transformation that consistently lowers input token counts across all providers with dimension-based tokenization. Models maintain stable accuracy through 4× downscaling (scale 0.25), which reduces token counts by 89–94% for Anthropic and OpenAI models. The degradation curve exhibits a plateau-then-cliff shape: accuracy is stable from 0.75× to 0.25× scale, then drops sharply at 0.125× and 0.0625×. At 16× downscaling (scale 0.0625), all models converge to a degraded floor with BERTScore F1 in the 0.15–0.19 range.
JPEG and WebP compression did not reduce input token counts for any of the four providers tested, confirming that current commercial VLM tokenization is dimension-based rather than file-size-based. Despite having no token benefit, models demonstrated remarkable robustness to compression artifacts: accuracy remained effectively constant from JPEG
Five of six models maintained accuracy within 5% of baseline under complete grayscale conversion, indicating that spatial structure — edges, text, object boundaries — carries far more information for VLM understanding than chromatic detail. Gemini 3.1 Pro Preview was the sole exception, showing a 36% relative BERTScore decline under grayscale conversion. Scientific figures were especially color-invariant, consistent with their reliance on layout and text rather than color semantics.
The multi-stage pipeline (T4) combining downscaling with WebP compression appeared on or near the Pareto frontier for most models, providing simultaneous token reduction and file size reduction. The Balanced configuration (T4-L2: 0.50× scale, WebP
The two image categories exhibited an unexpected metric asymmetry: general images scored higher on BERTScore F1 (0.285 versus 0.141 at baseline) but lower on SBERT cosine similarity (0.548 versus 0.670). Under resolution reduction, general images degraded faster on SBERT (45% decline from baseline to 16× downscaling) than scientific images (28% decline), contradicting the initial hypothesis that scientific figures would be more sensitive due to fine text and line detail.
Across all 12,000 observations, GPT-5.4 achieved the highest mean BERTScore F1 (0.239) and SBERT cosine similarity (0.623). Grok 4.20 Reasoning ranked second, followed by the three Anthropic models. Gemini 3.1 Pro Preview employed a fixed-token-budget strategy (
BERTScore F1 and SBERT cosine similarity showed near-zero correlation (
The following figures illustrate key experimental outcomes. All 31 figures and their descriptions are catalogued in the figure map.
Figure 1. Mean BERTScore F1 at baseline (original images) for each model, faceted by scientific and general image categories.
Figure 3. BERTScore F1 versus scaling factor (0.0625–0.75×) per model and image category, showing the characteristic plateau-then-cliff degradation pattern.
Figure 16. Pareto frontier of input token cost versus BERTScore F1 across all transformation families. Points on the frontier represent configurations that are not dominated on both axes by any other configuration.
Figure 20. AUDC BERTScore F1 heatmap across models and transformation families. Higher values indicate greater robustness to the corresponding transformation.
Figure 25. Pairwise Pearson correlation matrix of evaluation metrics on the 300-observation LLM-as-judge subset. SBERT cosine similarity correlates most strongly with LLM judge scores.
Transformation examples. Visual illustration of the four transformation families applied to a sample image at increasing intensity levels.
- Python 3.10+ (tested on Python 3.12, macOS 14 with Apple M3 Pro)
- pip (or any compatible package manager)
- API keys for all four providers (see API Key Configuration below)
- Clone the repository:
git clone https://github.com/ryanjosephkamp/image_token_study.git
cd image_token_study- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate- Install dependencies:
pip install -r requirements.txtThe requirements.txt installs:
- API client libraries:
anthropic,openai,google-genai,requests - Image processing:
Pillow,scikit-image - Scientific Python:
numpy,pandas,scipy,matplotlib - Evaluation metrics:
bert-score,sentence-transformers,rouge-score - Deep learning backend:
torch(required by BERTScore and Sentence-Transformers) - Utilities:
tqdm
The study requires API keys from all four providers. Keys are loaded from environment variables by api_client.py at runtime. Never commit API keys to the repository.
| Provider | Environment Variable | Sign-Up |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY |
console.anthropic.com |
| OpenAI | OPENAI_API_KEY |
platform.openai.com |
GOOGLE_API_KEY |
aistudio.google.dev | |
| xAI | XAI_API_KEY |
console.x.ai |
Option A — .env file (recommended for local development):
Create a .env file in the project root (already in .gitignore):
# .env — do NOT commit this file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AI...
XAI_API_KEY=xai-...Then source it before running:
source .envOption B — Shell export:
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
export XAI_API_KEY="xai-..."Option C — Google Colab Secrets:
When running notebooks on Google Colab, store keys as Colab Secrets and access them via google.colab.userdata.get().
The main entry point is image_token_study_scripts/main.py, which defines the model registry, shared constants, and group-level entry points. It can be invoked from notebooks or from the command line:
# Run a specific execution group in dry-run mode
python image_token_study_scripts/main.py --group a --mode dry_run
# Run a group in full mode
python image_token_study_scripts/main.py --group b --mode full| Group | Description | Typical Runtime |
|---|---|---|
a |
Dataset preparation and image transformations (no API calls) | ~5 min |
b |
Experiments 1–5 on 50 scientific images (6,000 API calls) | ~2–4 hours |
c |
Experiments 1–5 on 50 general images (6,000 API calls) | ~2–4 hours |
d |
BERTScore/SBERT/ROUGE-L evaluation + LLM-as-judge (300 API calls) | ~1–2 hours |
Each execution group was run through a two-stage validation protocol using Jupyter notebooks in image_token_study_scripts/notebooks/:
Stage 1 — Dry-Run Notebook
A compact notebook exercising every major component at small scale (2–3 images, 1–2 models, 1–2 transformation levels). Purpose: confirm that all API calls, transformations, CSV logging, and checkpointing work correctly before committing to the full run.
notebooks/group_a_dry_run.ipynb
notebooks/group_b_dry_run.ipynb
notebooks/group_c_dry_run.ipynb
notebooks/group_d_dry_run.ipynb
Run interactively in VS Code (local or via the Google Colab extension). Debug iteratively until all cells pass.
Stage 2 — Full Notebook
The complete experiment run for the group. Implements automatic checkpointing for pause-and-resume, cell print-stream capture to log files, and result packaging.
notebooks/group_a_full.ipynb
notebooks/group_b_full.ipynb (+ group_b_rerun.ipynb for re-execution)
notebooks/group_c_full.ipynb
notebooks/group_d_full.ipynb
Groups A and B executed on Google Colab (A100/H100 GPU). Groups C and D executed locally on macOS (Apple M3 Pro) after analysis confirmed that API-bound workloads and MPS-accelerated BERTScore did not require cloud GPUs.
After each group completes, run the corresponding verification script to validate outputs:
python image_token_study_scripts/verification_scripts/verify_group_a.py
python image_token_study_scripts/verification_scripts/verify_group_b.py
python image_token_study_scripts/verification_scripts/verify_group_c.py
python image_token_study_scripts/verification_scripts/verify_group_d.pyAnalysis and figure verification:
python image_token_study_scripts/verification_scripts/verify_analysis.py
python image_token_study_scripts/verification_scripts/verify_figures.pyimage_token_study/
├── README.md
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
│
├── image_token_study_scripts/ # All source code
│ ├── main.py # Main orchestrator — model registry, constants, group entry points
│ ├── implementation_scripts/ # Core implementation modules
│ │ ├── api_client.py # Unified API client for all 4 providers
│ │ ├── image_transforms.py # T1–T4 transformation pipeline
│ │ ├── dataset_utils.py # Dataset preparation and curation
│ │ ├── data_utils.py # Checkpointing, CSV logging, utilities
│ │ ├── evaluation.py # BERTScore, SBERT, ROUGE-L scoring
│ │ ├── analysis.py # Per-experiment statistical analysis
│ │ ├── cross_analysis.py # Cross-experiment synthesis (Exp 6, 7)
│ │ └── figure_gen.py # Matplotlib figure generation (32 figures)
│ ├── notebooks/ # Jupyter notebooks for each execution group
│ │ ├── group_a_dry_run.ipynb
│ │ ├── group_a_full.ipynb
│ │ ├── group_b_dry_run.ipynb
│ │ ├── group_b_full.ipynb
│ │ ├── group_b_rerun.ipynb # Re-execution notebook for Group B
│ │ ├── group_c_dry_run.ipynb
│ │ ├── group_c_full.ipynb
│ │ ├── group_d_dry_run.ipynb
│ │ └── group_d_full.ipynb
│ └── verification_scripts/ # Post-execution validation scripts
│ ├── verify_group_a.py
│ ├── verify_group_b.py
│ ├── verify_group_c.py
│ ├── verify_group_d.py
│ ├── verify_analysis.py
│ └── verify_figures.py
│
├── image_token_study_experiments/ # All experiment data and artifacts
│ ├── checkpoint.json # Global checkpoint state
│ ├── group_b_rerun_checkpoint.json
│ ├── group_c_checkpoint.json
│ ├── group_d_checkpoint.json
│ ├── datasets/ # Curated image datasets
│ │ ├── scientific/ # 50 scientific figures (sci_001–sci_050)
│ │ ├── general/ # 50 general images (gen_001–gen_050)
│ │ ├── image_metadata.csv # Image dimensions, file sizes, formats
│ │ ├── ground_truth_labels.csv # Ground-truth descriptions per image
│ │ └── captions.json # Caption mapping sidecar
│ ├── transformed_images/ # 19 transformation levels × 100 images
│ │ └── T1_resolution/ ... # Subdirectories per transformation
│ ├── group_a/ # Group A outputs (dry_run/, full/)
│ ├── group_b_scientific/ # Group B outputs (dry_run/, full/)
│ ├── group_b_rerun_working/ # Group B re-run intermediate CSVs
│ ├── group_c_general/ # Group C outputs (dry_run/, full/)
│ ├── group_c_working/ # Group C intermediate CSVs
│ ├── group_d_evaluation/ # Group D outputs (dry_run/, full/)
│ ├── group_d_working/ # Group D intermediate and merged CSVs
│ ├── analysis/ # Final analysis artifacts
│ │ ├── consolidated_metrics.csv # 12,000-row master dataset
│ │ ├── pareto_frontiers.csv # Pareto-optimal configurations
│ │ ├── practical_recommendations.csv
│ │ └── summary_statistics.json # Aggregate statistics
│ └── group_*_results_*/ # Timestamped result archives
│
├── image_token_study_reports/ # Reports and figures
│ ├── figures/ # 32 PNG figures + figure_map.csv
│ ├── markdowns/
│ │ └── image_token_study_comprehensive_report.md
│ ├── latex/
│ │ ├── final_report.tex # IEEE-formatted LaTeX report
│ │ ├── final_report.pdf # Compiled PDF
│ │ └── figures/ # LaTeX figure copies
│ └── pdfs/
│
└── image_token_study_printouts/ # Console logs from each execution group
├── group_a_full_log.txt
├── group_b_rerun_log.txt
├── group_c_full_log.txt
├── group_d_full_log.txt
└── analysis_log.txt
The full study report is available in two formats:
| Format | Location | Description |
|---|---|---|
| Markdown | image_token_study_reports/markdowns/image_token_study_comprehensive_report.md |
Full report with embedded figure references |
| IEEE LaTeX | image_token_study_reports/latex/final_report.tex |
IEEE conference-formatted LaTeX source |
| Compiled PDF | image_token_study_reports/latex/final_report.pdf |
Compiled PDF of the LaTeX report |
All 32 analysis figures are in image_token_study_reports/figures/. The complete catalogue with descriptions is in image_token_study_reports/figures/figure_map.csv.
Figure Inventory by Experiment
| Experiment | Figures | Count |
|---|---|---|
| Exp 1 — Baseline | exp1_baseline_performance.png, exp1_token_distribution.png |
2 |
| Exp 2 — Resolution | exp2_bertscore_vs_scale.png, exp2_sbert_vs_scale.png, exp2_tokens_vs_scale.png, exp2_degradation_heatmap.png |
4 |
| Exp 3 — Compression | exp3_tokens_vs_compression.png, exp3_bertscore_vs_compression.png, exp3_jpeg_vs_webp.png, exp3_filesize_vs_accuracy.png, exp3_psnr_ssim_vs_accuracy.png |
5 |
| Exp 4 — Color Depth | exp4_bertscore_by_color.png, exp4_grayscale_impact.png, exp4_quantization_curve.png, exp4_color_importance_distribution.png |
4 |
| Exp 5 — Pipeline | exp5_pareto_frontier.png, exp5_pipeline_comparison.png, exp5_category_pareto.png, exp5_efficiency_ranking.png |
4 |
| Exp 6 — Cross-Model | exp6_robustness_ranking_heatmap.png, exp6_category_divergence.png, exp6_rank_correlation.png, exp6_model_robustness_radar.png, exp6_scientific_vs_general_curves.png |
5 |
| Exp 7 — Metric Agreement | exp7_metric_correlation_matrix.png, exp7_metric_agreement_scatter.png, exp7_disagreement_cases.png |
3 |
| Cross-Experiment | cross_model_ranking_table.png, cross_pareto_summary.png, cross_practical_recommendations_v2.png, cross_category_summary.png |
4 |
| Visual | transformation_examples_panel.png |
1 |
| Total | 32 |
Key analysis outputs in image_token_study_experiments/analysis/:
| File | Description |
|---|---|
consolidated_metrics.csv |
12,000-row master dataset with all metrics per observation |
pareto_frontiers.csv |
Pareto-optimal transformation configurations per model and category |
practical_recommendations.csv |
Best configurations at 50% and 75% token budget constraints |
summary_statistics.json |
Aggregate statistics across all experiments |
Console output from each execution group is preserved in image_token_study_printouts/ for reproducibility and debugging reference.
[1] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token Merging: Your ViT But Faster," in Proc. 11th Int. Conf. Learning Representations (ICLR), 2023.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021.
[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," in Proc. 38th Int. Conf. Machine Learning (ICML), pp. 8748–8763, 2021.
[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, "Flamingo: A Visual Language Model for Few-Shot Learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
[6] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models," in Proc. 40th Int. Conf. Machine Learning (ICML), PMLR 202, 2023.
[7] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[8] Anthropic, "Vision," API Documentation. [Online]. Available: https://platform.claude.com/docs/en/docs/build-with-claude/vision
[9] OpenAI, "Images and Vision," API Documentation. [Online]. Available: https://developers.openai.com/api/docs/guides/images-vision
[10] Google, "Gemini API: Vision." [Online]. Available: https://ai.google.dev/gemini-api/docs/vision
[11] R. Keys, "Cubic Convolution Interpolation for Digital Image Processing," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 4th ed. Pearson, 2018.
[13] G. K. Wallace, "The JPEG Still Picture Compression Standard," IEEE Trans. Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[14] Google Developers, "WebP: An Image Format for the Web." [Online]. Available: https://developers.google.com/speed/webp
[15] P. Heckbert, "Color Image Quantization for Frame Buffer Display," ACM SIGGRAPH Computer Graphics, vol. 16, no. 3, pp. 297–307, 1982.
[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating Text Generation with BERT," in Proc. 8th Int. Conf. Learning Representations (ICLR), 2020.
[17] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 3982–3992, 2019.
[18] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Text Summarization Branches Out, pp. 74–81, 2004.
[19] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[20] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning," in Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279, 2022.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. European Conf. Computer Vision (ECCV), pp. 740–755, 2014.
This project is licensed under the MIT License. See the LICENSE file for details.
Copyright (c) 2026 Ryan Kamp





