feat: support text-image-pairs (VLM) hosted inference #461

Open
joaomarcoscrs wants to merge 1 commit into main from feat/vlm-text-image-pairs

@joaomarcoscrs joaomarcoscrs commented Apr 22, 2026

Summary

  • Adds VLMModel for text-image-pairs (PaliGemma-style) projects, wrapping the serverless endpoint that already supports these models. Returns the raw response dict unchanged — the payload is free-form (captions, VQA, OCR, token-boxes depending on the model) so coercing into a detection schema would lose information.
  • Wires the new type into Version.model and roboflow infer so the CLI stops rejecting text-image-pairs with Unsupported project type.
  • Adds TYPE_TEXT_IMAGE_PAIRS constant.

Why

serverless.roboflow.com already serves these models, but the SDK hardcoded a five-type whitelist in both roboflow/core/version.py and roboflow/cli/handlers/infer.py, which blocked hosted inference for text-image-pairs. The fix is pure plumbing; the backend is ready. The MCP models_infer tool inherits the same enum 1:1, so landing this unblocks that path once its schema is updated.

Design notes

  • VLMModel.predict(url | path, **kwargs) returns the raw serverless JSON. No parsing of box<loc_y1><loc_x1><loc_y2><loc_x2> tokens — different text-image-pairs models produce different shapes (captions vs. VQA vs. token-boxes), so structured post-processing belongs to callers / model-specific helpers.
  • Extra kwargs forward as query params. Serverless currently ignores prompt for PaliGemma, but this keeps us forward-compatible without baking in an assumption.
  • The CLI handler checks for a dict return value and passes it through verbatim; --confidence/--overlap are skipped for VLM.
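
The kwargs forwarding in the notes above could look roughly like the sketch below. It is illustrative only and is not taken from the PR's diff: `build_vlm_request`, the `<base_url>/<model_id>` URL layout, and the `api_key` and `image` parameter names are assumptions. Only the serverless host name and the idea of forwarding extra kwargs as query params come from this description.

```python
from urllib.parse import urlencode, urlparse


def build_vlm_request(base_url, model_id, api_key, image, **kwargs):
    """Return (url, is_remote) for a hypothetical VLM inference call.

    Assumed layout: <base_url>/<model_id>?api_key=...&<extra kwargs>.
    A remote image is passed as an `image` query param; a local path would
    instead be uploaded in the request body (not shown here).
    """
    is_remote = urlparse(image).scheme in ("http", "https")
    params = {"api_key": api_key}
    if is_remote:
        params["image"] = image
    # Forward caller kwargs untouched so that server-side options such as
    # `prompt` work without an SDK change; None values are dropped.
    params.update({k: v for k, v in kwargs.items() if v is not None})
    url = f"{base_url.rstrip('/')}/{model_id}?{urlencode(params)}"
    return url, is_remote
```

Forwarding unknown kwargs rather than validating them keeps the client thin: if the server starts honouring `prompt`, callers get the behaviour without a new SDK release.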

Test plan

  • python -m unittest — 467 tests pass (added 8 new: 6 in tests/models/test_vlm.py, 2 in tests/cli/test_infer_handler.py).
  • ruff format + ruff check clean on changed files.
  • Live call against amazontrial/1 (PaliGemma) + parcel image URL via direct VLMModel — response matches raw curl.
  • Live roboflow infer -m amazontrial/1 -t text-image-pairs <url> --json — previously errored, now returns raw server JSON.

Follow-ups (not in this PR)

  • MCP models_infer schema: add text-image-pairs to the project_type enum.
  • Optional parser helpers (e.g. parse_paligemma_boxes) as opt-in, not in the inference path.
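
An opt-in helper of the kind mentioned in the second follow-up might be sketched as below. The function body is hypothetical, since this PR ships no parser. It assumes PaliGemma-style location tokens of the form `<loc0000>` to `<loc1023>`, four per box in (y1, x1, y2, x2) order and normalised to a 1024-step grid, followed by a free-text label. It takes the generated text as a plain string because the response field that carries it is not specified here.

```python
import re

# Assumed token layout: four <locNNNN> tokens (y1, x1, y2, x2) then a label.
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]*)")


def parse_paligemma_boxes(text, width, height):
    """Convert location tokens in `text` into pixel-space boxes."""
    boxes = []
    for y1, x1, y2, x2, label in _BOX.findall(text):
        boxes.append(
            {
                "label": label.strip(),
                # Tokens index a 1024-step grid; rescale to image pixels.
                "x1": int(x1) / 1024 * width,
                "y1": int(y1) / 1024 * height,
                "x2": int(x2) / 1024 * width,
                "y2": int(y2) / 1024 * height,
            }
        )
    return boxes
```

Text with no location tokens (a caption or a VQA answer) yields an empty list, which is one reason to keep a helper like this out of the inference path.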

Note

Medium Risk
Adds a new inference model type with custom HTTP request/response handling and changes CLI output behavior based on return type, which could affect inference consumers if assumptions about prediction objects change.

Overview
Adds support for text-image-pairs (VLM) hosted inference end-to-end. This introduces a new VLMModel that calls the serverless endpoint and returns the raw JSON response dict, wires the new TYPE_TEXT_IMAGE_PAIRS into Version.model selection, and updates roboflow infer to accept this project type and pass dict responses through unchanged (skipping confidence/overlap, and pretty-printing when --json is not set).

Includes unit tests covering the new CLI passthrough behavior and VLMModel URL/local-file request paths, query param forwarding, and error handling.
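
The CLI passthrough described in this overview might be sketched as follows. This is a simplified stand-in, not the handler from the diff: `predict_kwargs` and `render_result` are hypothetical names, the `"object-detection"` type string is used only for illustration, and non-dict results are assumed to expose a `.json()` method in place of the SDK's prediction objects.

```python
import json


def predict_kwargs(project_type, confidence=None, overlap=None):
    """Build keyword arguments for predict(); VLM projects get none."""
    # --confidence/--overlap have no meaning for free-form VLM output.
    if project_type == "text-image-pairs":
        return {}
    candidates = {"confidence": confidence, "overlap": overlap}
    return {k: v for k, v in candidates.items() if v is not None}


def render_result(result, as_json=False):
    """Format an inference result for printing."""
    # Dict results (raw VLM payloads) pass through without conversion.
    payload = result if isinstance(result, dict) else result.json()
    # Compact output for --json, indented output for humans.
    return json.dumps(payload) if as_json else json.dumps(payload, indent=2)
```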

Reviewed by Cursor Bugbot for commit b5758c9. Bugbot is set up for automated code reviews on this repo.

Adds a VLMModel that wraps the serverless endpoint for text-image-pairs
projects (PaliGemma-style). Returns the raw response dict since the
payload is free-form (captions, VQA, OCR, token-boxes) and shouldn't be
coerced into a detection schema.

Wires the new type into Version.model and the CLI infer handler so
`roboflow infer` no longer errors with "Unsupported project type" for
text-image-pairs models.
@joaomarcoscrs joaomarcoscrs self-assigned this Apr 22, 2026
@joaomarcoscrs

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b5758c9.
