Summary
docling/models/stages/vlm_convert/vlm_convert_model.py currently rebuilds VlmEngineInput with hardcoded or dropped generation settings in both __call__() and process_images().
On current main (commit c7615123e6b9d8b5e772e54496db24d8adc64d92), the stage forwards the prompt and image, but it does not forward the rest of the generation config carried by the VLM preset/model spec:
- max_new_tokens from VlmModelSpec is ignored and replaced with 4096
- stop_strings from VlmModelSpec are dropped
- temperature is hardcoded to 0.0
- extra_generation_config is dropped entirely
The same stage also does synchronous page.image fetch/rasterization and PIL Lanczos resize work before predict_batch(), without emitting any preprocessing timing in the debug logs. On large page rasters this can look like a hang: CPU busy, GPU mostly idle, and no batch handed to the engine yet.
Why this is a problem
Runtime configuration is partially inert
Inference engines already consume these fields from VlmEngineInput, for example:
- transformers_engine.py reads first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
- vllm_engine.py reads first_input.temperature, first_input.max_new_tokens, first_input.stop_strings, and first_input.extra_generation_config
- api_openai_compatible_engine.py reads input_data.temperature, input_data.max_new_tokens, input_data.stop_strings, and input_data.extra_generation_config
But VlmConvertModel strips most of that information at the stage boundary, so tuning the preset/model spec does not reliably reach the engine.
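For illustration, the generation fields the engines consume can be modeled as a single input record. The field names below come from the engine reads listed above; the dataclass shape is an assumption for this sketch, and the defaults mirror the values the stage currently hardcodes:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EngineInputSketch:
    # Field names match the first_input / input_data attributes the
    # engines read; the defaults mirror the stage's current hardcoding.
    prompt: str
    temperature: float = 0.0
    max_new_tokens: int = 4096
    stop_strings: list[str] = field(default_factory=list)
    extra_generation_config: dict[str, Any] = field(default_factory=dict)
```

Every field here is one that at least one engine already honors, which is why dropping them at the stage boundary silently disables tuning.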
CPU-side preprocessing is hard to distinguish from a stall
page.image access may rasterize the page, then the stage may do one or two PIL resize passes before the first predict_batch() call. That work is synchronous and invisible in the stage logs today, so a long-running conversion can look stuck even when it is still preparing pages.
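One minimal way to surface that preprocessing cost in the logs is a small timing context manager around each synchronous step. This is a sketch, not the draft PR's implementation; the logger name and the commented call sites are assumptions:

```python
import logging
import time
from contextlib import contextmanager

_log = logging.getLogger(__name__)

@contextmanager
def timed(label: str):
    # Log wall-clock duration of a synchronous preprocessing step at DEBUG level.
    start = time.perf_counter()
    try:
        yield
    finally:
        _log.debug("%s took %.3fs", label, time.perf_counter() - start)

# Hypothetical call sites inside the stage, before predict_batch():
# with timed("page rasterization"):
#     image = page.image
# with timed("resize"):
#     image = image.resize(target_size, Image.Resampling.LANCZOS)
```

With DEBUG logging enabled, a long rasterization then shows up as an explicit log line instead of silence before the first batch.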
Minimal repro
A stubbed engine is enough to show the config drop:
from PIL import Image
from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat
from docling.datamodel.stage_model_specs import VlmModelSpec
from docling.datamodel.vlm_engine_options import AutoInlineVlmEngineOptions
from docling.models.stages.vlm_convert.vlm_convert_model import VlmConvertModel
class StubEngine:
    def __init__(self):
        self.batch = None

    def predict_batch(self, batch):
        self.batch = batch
        return []

    def cleanup(self):
        return None

model_spec = VlmModelSpec(
    name="Test Model",
    default_repo_id="org/model",
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    max_new_tokens=128,
    stop_strings=["</doctag>"],
)
model = VlmConvertModel.__new__(VlmConvertModel)
model.enabled = True
model.engine = StubEngine()
model.options = VlmConvertOptions(
    model_spec=model_spec,
    engine_options=AutoInlineVlmEngineOptions(),
)
list(model.process_images([Image.new("RGB", (8, 8), "white")], "custom prompt"))
engine_input = model.engine.batch[0]
assert engine_input.max_new_tokens == 128 # actual: 4096
assert engine_input.stop_strings == ["</doctag>"] # actual: []
Expected behavior
VlmConvertModel should construct VlmEngineInput from the configured VLM generation settings instead of hardcoding or dropping them.
At minimum:
- forward max_new_tokens
- forward stop_strings

And ideally:
- expose and forward temperature
- expose and forward extra_generation_config
- emit debug timing for rasterization/resizing and batch handoff to predict_batch()
Notes
I have a small draft PR prepared from a fork that wires these fields through VlmConvertModel, adds regression tests for both __call__() and process_images(), and adds debug preprocessing/batch timing in the stage.