Falcon-OCR on Linux/Transformers needs multiple compatibility fixes, not just eager attention #3278

@geoHeil

Description

Bug

On the Linux TransformersVlmEngine path, the falcon_ocr preset currently needs a small bundle of Falcon-specific compatibility fixes. Forcing eager attention is necessary, but it is not sufficient on its own.

This appears to be separate from:

Inference is local. The failures happen while loading and running the local model through Transformers with trust_remote_code=True.

Steps to reproduce

On Linux, with a Docling setup that resolves falcon_ocr to the Transformers engine:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmConvertOptions, VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

vlm_options = VlmConvertOptions.from_preset("falcon_ocr")
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    allow_external_plugins=True,
    enable_remote_services=True,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
converter.convert("sample.pdf")

Actual behavior

On the Linux/Transformers path we hit a sequence of Falcon-specific failures:

  • attention backend dispatch: Falcon-OCR does not support SDPA yet, so the model must be loaded with the public attn_implementation="eager" override
  • config initialization: the eager setting must already be present on the actual Falcon config object used for model init, not merely inferred later
  • generation config loading: the Falcon repo does not ship a generation_config.json, so a GenerationConfig.from_model_config(...) fallback is needed
  • prompt formatting: Falcon does not ship a usable chat template for the generic Transformers VLM path
  • inference path: Falcon's remote-code model needs its native OCR generation entrypoints instead of Docling's generic chat-template processor flow
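The config-initialization and generation-config points can be sketched with the public transformers API. This is a minimal illustration using a stand-in PretrainedConfig; the real Falcon-OCR config would come from AutoConfig.from_pretrained(..., trust_remote_code=True), and `_attn_implementation` is the attribute that the public attn_implementation="eager" kwarg ultimately records on the config:

```python
from transformers import GenerationConfig, PretrainedConfig

# Stand-in for the Falcon-OCR config (the real one is loaded via
# AutoConfig.from_pretrained with trust_remote_code=True).
config = PretrainedConfig(eos_token_id=11)

# 1. Put eager attention on the config object *before* model construction,
#    so the remote-code model never tries to dispatch to SDPA.
config._attn_implementation = "eager"

# 2. The repo ships no generation_config.json, so instead of
#    GenerationConfig.from_pretrained(repo_id), derive one from the
#    model config itself.
gen_config = GenerationConfig.from_model_config(config)
```

The key point is ordering: the eager setting lives on the config used for model init, and the generation config is derived rather than fetched.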

Relevant stack shape:

  • docling.models.inference_engines.vlm.factory.create_vlm_engine
  • AutoInlineVlmEngine
  • TransformersVlmEngine
  • model_cls.from_pretrained(...)
  • FalconOCRForCausalLM

The underlying module is loaded from the HF cache, e.g.:

.../.cache/huggingface/modules/transformers_modules/.../modeling_falcon_ocr.py

Expected behavior

falcon_ocr should initialize and run successfully on Linux when Docling routes it through the Transformers engine.

Suggested fix direction

Docling should treat Falcon-OCR as a small Transformers compatibility special case on the Linux path:

  • honor explicit public attn_implementation overrides
  • default Falcon-OCR to eager attention on the Transformers preset
  • preload the Falcon config with eager attention before model construction
  • fall back when generation_config.json is missing
  • bypass the generic chat-template prompt path and use Falcon's native OCR generation flow

Status

Tracked in #3279.

Docling version

2.86.0

Python version

3.13.13

Additional environment details

  • transformers==5.5.3
  • observed on Linux CUDA path
  • Apple Silicon MLX path is not affected because it avoids the Transformers engine for this preset
