Summary
docling/models/stages/vlm_convert/vlm_convert_model.py currently rebuilds VlmEngineInput with hardcoded or dropped generation settings in both __call__() and process_images().
On current main (commit c7615123e6b9d8b5e772e54496db24d8adc64d92), the stage forwards the prompt and image, but it does not forward the rest of the generation config carried by the VLM preset/model spec:
- max_new_tokens from VlmModelSpec is ignored and replaced with 4096
- stop_strings from VlmModelSpec are dropped
- temperature is hardcoded to 0.0
- extra_generation_config is dropped entirely
The same stage also does synchronous page.image fetch/rasterization and PIL Lanczos resize work before predict_batch(), without emitting any preprocessing timing in the debug logs. On large page rasters this can look like a hang: CPU busy, GPU mostly idle, and no batch handed to the engine yet.
Why this is a problem
Runtime configuration is partially inert
Inference engines already consume these fields from VlmEngineInput, for example:
- transformers_engine.py reads first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
- vllm_engine.py reads first_input.temperature, first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
- api_openai_compatible_engine.py reads input_data.temperature, input_data.max_new_tokens, input_data.stop_strings, and input_data.extra_generation_config
But VlmConvertModel strips most of that information at the stage boundary, so tuning the preset/model spec does not reliably reach the engine.
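For illustration, the generation fields the engines consume can be modeled as a single input record. The field names below come from the engine reads listed above; the dataclass shape is an assumption for this sketch, and the defaults mirror the values the stage currently hardcodes:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EngineInputSketch:
    # Field names match the first_input / input_data attributes the
    # engines read; the defaults mirror the stage's current hardcoding.
    prompt: str
    temperature: float = 0.0
    max_new_tokens: int = 4096
    stop_strings: list[str] = field(default_factory=list)
    extra_generation_config: dict[str, Any] = field(default_factory=dict)
```

Every field here is one that at least one engine already honors, which is why dropping them at the stage boundary silently disables tuning.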
CPU-side preprocessing is hard to distinguish from a stall
page.image access may rasterize the page, then the stage may do one or two PIL resize passes before the first predict_batch() call. That work is synchronous and invisible in the stage logs today, so a long-running conversion can look stuck even when it is still preparing pages.
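One minimal way to surface that preprocessing cost in the logs is a small timing context manager around each synchronous step. This is a sketch, not the draft PR's implementation; the logger name and the commented call sites are assumptions:

```python
import logging
import time
from contextlib import contextmanager

_log = logging.getLogger(__name__)

@contextmanager
def timed(label: str):
    # Log wall-clock duration of a synchronous preprocessing step at DEBUG level.
    start = time.perf_counter()
    try:
        yield
    finally:
        _log.debug("%s took %.3fs", label, time.perf_counter() - start)

# Hypothetical call sites inside the stage, before predict_batch():
# with timed("page rasterization"):
#     image = page.image
# with timed("resize"):
#     image = image.resize(target_size, Image.Resampling.LANCZOS)
```

With DEBUG logging enabled, a long rasterization then shows up as an explicit log line instead of silence before the first batch.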
Minimal repro
A stubbed engine is enough to show the config drop:
from PIL import Image
from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat
from docling.datamodel.stage_model_specs import VlmModelSpec
from docling.datamodel.vlm_engine_options import AutoInlineVlmEngineOptions
from docling.models.stages.vlm_convert.vlm_convert_model import VlmConvertModel
class StubEngine:
    def __init__(self):
        self.batch = None

    def predict_batch(self, batch):
        self.batch = batch
        return []

    def cleanup(self):
        return None

model_spec = VlmModelSpec(
    name="Test Model",
    default_repo_id="org/model",
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    max_new_tokens=128,
    stop_strings=["</doctag>"],
)
model = VlmConvertModel.__new__(VlmConvertModel)
model.enabled = True
model.engine = StubEngine()
model.options = VlmConvertOptions(
    model_spec=model_spec,
    engine_options=AutoInlineVlmEngineOptions(),
)
list(model.process_images([Image.new("RGB", (8, 8), "white")], "custom prompt"))
engine_input = model.engine.batch[0]
assert engine_input.max_new_tokens == 128 # actual: 4096
assert engine_input.stop_strings == ["</doctag>"] # actual: []
Expected behavior
VlmConvertModel should construct VlmEngineInput from the configured VLM generation settings instead of hardcoding or dropping them.
At minimum:
- forward max_new_tokens
- forward stop_strings

And ideally:
- expose and forward temperature
- expose and forward extra_generation_config
- emit debug timing for rasterization/resizing and batch handoff to predict_batch()
Notes
I have a small draft PR prepared from a fork that wires these fields through VlmConvertModel, adds regression tests for both __call__() and process_images(), and adds debug preprocessing/batch timing in the stage.