VlmConvertModel drops generation settings when building VlmEngineInput #3321

@geoHeil

Description

Summary

docling/models/stages/vlm_convert/vlm_convert_model.py currently rebuilds VlmEngineInput with hardcoded or dropped generation settings in both __call__() and process_images().

On current main (commit c7615123e6b9d8b5e772e54496db24d8adc64d92), the stage forwards the prompt and image, but it does not forward the rest of the generation config carried by the VLM preset/model spec:

  • max_new_tokens from VlmModelSpec is ignored and replaced with 4096
  • stop_strings from VlmModelSpec are dropped
  • temperature is hardcoded to 0.0
  • extra_generation_config is dropped entirely

The same stage also does synchronous page.image fetch/rasterization and PIL Lanczos resize work before predict_batch() without any preprocessing timing in the debug logs. On large page rasters that can look like a hang: CPU busy, GPU mostly idle, no batch handed to the engine yet.

Why this is a problem

Runtime configuration is partially inert

Inference engines already consume these fields from VlmEngineInput, for example:

  • transformers_engine.py reads first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
  • vllm_engine.py reads first_input.temperature, first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
  • api_openai_compatible_engine.py reads input_data.temperature, input_data.max_new_tokens, input_data.stop_strings, and input_data.extra_generation_config

But VlmConvertModel strips most of that information at the stage boundary, so tuning the preset/model spec has no reliable effect on the engine's generation behavior.
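For illustration, the engine-side consumption pattern looks roughly like the sketch below. The class and function names here are simplified stand-ins, not the actual docling types; the point is only that the engines assemble their generation kwargs from the fields on the first input of the batch, so anything the stage drops can never take effect:

```python
from dataclasses import dataclass, field

# Simplified stand-in for docling's VlmEngineInput (illustrative only).
@dataclass
class EngineInput:
    prompt: str
    max_new_tokens: int = 4096
    stop_strings: list = field(default_factory=list)
    temperature: float = 0.0
    extra_generation_config: dict = field(default_factory=dict)

def build_generation_kwargs(first_input: EngineInput) -> dict:
    """Assemble generation kwargs the way the engines read them off the batch."""
    kwargs = {
        "max_new_tokens": first_input.max_new_tokens,
        "temperature": first_input.temperature,
        **first_input.extra_generation_config,
    }
    if first_input.stop_strings:
        kwargs["stop_strings"] = first_input.stop_strings
    return kwargs

# If the stage hardcodes the fields, tuning the model spec changes nothing:
dropped = EngineInput(prompt="p")  # what the stage builds today
tuned = EngineInput(prompt="p", max_new_tokens=128,
                    stop_strings=["</doctag>"])  # what the spec asked for
```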

CPU-side preprocessing is hard to distinguish from a stall

Accessing page.image may rasterize the page, and the stage may then run one or two PIL resize passes before the first predict_batch() call. That work is synchronous and invisible in the stage logs today, so a long-running conversion can look stuck even while it is still preparing pages.

Minimal repro

A stubbed engine is enough to show the config drop:

from PIL import Image

from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat
from docling.datamodel.stage_model_specs import VlmModelSpec
from docling.datamodel.vlm_engine_options import AutoInlineVlmEngineOptions
from docling.models.stages.vlm_convert.vlm_convert_model import VlmConvertModel

class StubEngine:
    """Captures the batch handed to predict_batch() so its VlmEngineInput can be inspected."""

    def __init__(self):
        self.batch = None

    def predict_batch(self, batch):
        self.batch = batch
        return []

    def cleanup(self):
        return None

model_spec = VlmModelSpec(
    name="Test Model",
    default_repo_id="org/model",
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    max_new_tokens=128,
    stop_strings=["</doctag>"],
)

# Bypass __init__ so no real engine is loaded, then inject the stub.
model = VlmConvertModel.__new__(VlmConvertModel)
model.enabled = True
model.engine = StubEngine()
model.options = VlmConvertOptions(
    model_spec=model_spec,
    engine_options=AutoInlineVlmEngineOptions(),
)

list(model.process_images([Image.new("RGB", (8, 8), "white")], "custom prompt"))
engine_input = model.engine.batch[0]

assert engine_input.max_new_tokens == 128          # actual: 4096
assert engine_input.stop_strings == ["</doctag>"] # actual: []

Expected behavior

VlmConvertModel should construct VlmEngineInput from the configured VLM generation settings instead of hardcoding or dropping them.

At minimum:

  • forward max_new_tokens
  • forward stop_strings

And ideally:

  • expose and forward temperature
  • expose and forward extra_generation_config
  • emit debug timing for rasterization/resizing and batch handoff to predict_batch()
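Concretely, the stage's input construction could forward the spec's fields instead of literals. A hedged sketch using simplified stand-in classes, since the real VlmModelSpec/VlmEngineInput signatures may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for VlmModelSpec and VlmEngineInput.
@dataclass
class ModelSpec:
    prompt: str
    max_new_tokens: int = 4096
    stop_strings: list = field(default_factory=list)
    temperature: float = 0.0
    extra_generation_config: dict = field(default_factory=dict)

@dataclass
class EngineInput:
    prompt: str
    image: object = None
    max_new_tokens: int = 4096
    stop_strings: list = field(default_factory=list)
    temperature: float = 0.0
    extra_generation_config: dict = field(default_factory=dict)

def build_engine_input(spec: ModelSpec, image, prompt: Optional[str] = None) -> EngineInput:
    """Forward the configured generation settings instead of hardcoding them."""
    return EngineInput(
        prompt=prompt if prompt is not None else spec.prompt,
        image=image,
        max_new_tokens=spec.max_new_tokens,        # was: hardcoded 4096
        stop_strings=list(spec.stop_strings),      # was: dropped
        temperature=spec.temperature,              # was: hardcoded 0.0
        extra_generation_config=dict(spec.extra_generation_config),  # was: dropped
    )
```

With this shape, the repro's assertions pass: the stub engine receives max_new_tokens=128 and stop_strings=["</doctag>"] straight from the spec.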

Notes

I have a small draft PR prepared from a fork that wires these fields through VlmConvertModel, adds regression tests for both __call__() and process_images(), and adds debug preprocessing/batch timing in the stage.
