feat: support text-image-pairs (VLM) hosted inference #461
Open
joaomarcoscrs wants to merge 1 commit into main from
Adds a VLMModel that wraps the serverless endpoint for text-image-pairs projects (PaliGemma-style). Returns the raw response dict since the payload is free-form (captions, VQA, OCR, token-boxes) and shouldn't be coerced into a detection schema. Wires the new type into Version.model and the CLI infer handler so `roboflow infer` no longer errors with "Unsupported project type" for text-image-pairs models.
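A minimal sketch of what such a wrapper can look like. The class name and `predict` entry point come from this PR, but the endpoint URL layout, constructor arguments, and request handling below are illustrative assumptions, not the PR's actual code:

```python
import pathlib


class VLMModel:
    """Wraps a serverless endpoint for text-image-pairs (VLM) projects.

    Hypothetical sketch: returns the raw response dict because the payload
    is free-form (captions, VQA, OCR, token-boxes) and should not be
    coerced into a detection schema.
    """

    def __init__(self, api_key: str, model_id: str):
        self.api_key = api_key
        self.model_id = model_id  # e.g. "amazontrial/1"

    def _build_request(self, image: str, **kwargs):
        # Assumed URL layout; extra kwargs (e.g. prompt=...) ride along as
        # query params so new model capabilities need no SDK change.
        url = f"https://serverless.roboflow.com/{self.model_id}"
        params = {"api_key": self.api_key, **kwargs}
        if image.startswith(("http://", "https://")):
            params["image"] = image  # hosted image: pass the URL through
            return url, params, None
        # local file: send the bytes in the request body instead
        return url, params, pathlib.Path(image).read_bytes()

    def predict(self, image: str, **kwargs) -> dict:
        import requests  # runtime dependency; lazy so the sketch stays importable

        url, params, body = self._build_request(image, **kwargs)
        resp = requests.post(url, params=params, data=body)
        resp.raise_for_status()
        # Free-form payload: hand the raw dict back to the caller.
        return resp.json()
```

The URL-vs-local-file split mirrors the request paths the PR's tests cover; everything after `_build_request` is a plain HTTP POST with no post-processing.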
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit b5758c9.
Summary
- `VLMModel` for `text-image-pairs` (PaliGemma-style) projects, wrapping the serverless endpoint that already supports these models. Returns the raw response dict unchanged: the payload is free-form (captions, VQA, OCR, token-boxes depending on the model), so coercing it into a detection schema would lose information.
- Wires the new type into `Version.model` and `roboflow infer`, so the CLI stops rejecting `text-image-pairs` with "Unsupported project type".
- Adds a `TYPE_TEXT_IMAGE_PAIRS` constant.

Why
serverless.roboflow.com already serves these models, but the SDK hardcoded a 5-type whitelist in both roboflow/core/version.py and roboflow/cli/handlers/infer.py, blocking hosted inference for `text-image-pairs`. This was pure plumbing: the backend is ready. The MCP `models_infer` tool inherits the same enum 1:1, so landing this unblocks that path once its schema is updated.

Design notes
- `VLMModel.predict(url | path, **kwargs)` returns the raw serverless JSON. No parsing of `box<loc_y1><loc_x1><loc_y2><loc_x2>` tokens: different `text-image-pairs` models produce different shapes (captions vs. VQA vs. token-boxes), so structured post-processing belongs to callers / model-specific helpers.
- Extra kwargs are forwarded as query params; today that's only `prompt` for PaliGemma, but this keeps us forward-compatible without baking in an assumption.
- The CLI treats a `dict` return as pass-through verbatim; `--confidence`/`--overlap` are skipped for VLM.

Test plan
- `python -m unittest`: 467 tests pass (added 8 new: 6 in tests/models/test_vlm.py, 2 in tests/cli/test_infer_handler.py).
- `ruff format` + `ruff check` clean on changed files.
- `amazontrial/1` (PaliGemma) + parcel image URL via direct `VLMModel`: response matches raw curl.
- `roboflow infer -m amazontrial/1 -t text-image-pairs <url> --json`: previously errored, now returns raw server JSON.

Follow-ups (not in this PR)
- MCP `models_infer` schema: add `text-image-pairs` to the `project_type` enum.
- A `parse_paligemma_boxes` helper as opt-in, not in the inference path.

Note
Medium Risk
Adds a new inference model type with custom HTTP request/response handling and changes CLI output behavior based on return type, which could affect inference consumers if assumptions about prediction objects change.
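To make the flagged risk concrete, here is an illustrative sketch of return-type dispatch in an infer handler. The function name and flag are hypothetical, not the PR's code; it only shows the shape of the behavior change (dict responses bypass the detection path):

```python
import json


def render_prediction(result, as_json: bool = False) -> str:
    """Render a prediction for CLI output, dispatching on return type."""
    if isinstance(result, dict):
        # VLM / text-image-pairs: free-form payload, passed through
        # verbatim; pretty-printed unless --json asks for compact output.
        return json.dumps(result) if as_json else json.dumps(result, indent=2)
    # Detection-style prediction objects keep their existing path.
    return str(result)
```

Consumers that assumed every result is a prediction object would break on this branch, which is what the Medium Risk note is pointing at.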
Overview
Adds support for `text-image-pairs` (VLM) hosted inference end-to-end. This introduces a new `VLMModel` that calls the serverless endpoint and returns the raw JSON response dict, wires the new `TYPE_TEXT_IMAGE_PAIRS` into `Version.model` selection, and updates `roboflow infer` to accept this project type and pass dict responses through (skipping `confidence`/`overlap` and pretty-printing when not using `--json`).

Includes unit tests covering the new CLI passthrough behavior and `VLMModel` URL/local-file request paths, query param forwarding, and error handling.
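The follow-ups mention an opt-in `parse_paligemma_boxes` helper. A hedged sketch of what such a caller-side helper might do, assuming PaliGemma-style `<locNNNN>` tokens (values 0-1023 in y1, x1, y2, x2 order, scaled to the image size); the helper name is from this PR, but the token format and behavior here are assumptions, not code from the change:

```python
import re

# PaliGemma-style location token, e.g. <loc0512> (assumed format).
LOC = re.compile(r"<loc(\d{4})>")


def parse_paligemma_boxes(text: str, width: int, height: int) -> list[dict]:
    """Parse box tokens from a raw VLM response into pixel-space boxes."""
    boxes = []
    for segment in text.split(";"):
        tokens = LOC.findall(segment)
        if len(tokens) != 4:
            continue  # caption/VQA segments carry no box tokens
        # Tokens encode normalized coords on a 0-1023 grid: y1 x1 y2 x2.
        y1, x1, y2, x2 = (int(t) / 1024 for t in tokens)
        label = LOC.sub("", segment).strip()
        boxes.append({
            "label": label,
            "x1": x1 * width, "y1": y1 * height,
            "x2": x2 * width, "y2": y2 * height,
        })
    return boxes
```

Keeping this out of `predict` preserves the raw pass-through contract described in the design notes: structured post-processing stays with callers who know which model they queried.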