feat: support text-image-pairs (VLM) hosted inference #461
Open
joaomarcoscrs wants to merge 1 commit into main from
Adds a VLMModel that wraps the serverless endpoint for text-image-pairs projects (PaliGemma-style). Returns the raw response dict since the payload is free-form (captions, VQA, OCR, token-boxes) and shouldn't be coerced into a detection schema. Wires the new type into Version.model and the CLI infer handler so `roboflow infer` no longer errors with "Unsupported project type" for text-image-pairs models.
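A minimal sketch of what such a wrapper can look like. The class name and `predict` entry point come from this PR, but the endpoint URL layout, constructor arguments, and request handling below are illustrative assumptions, not the PR's actual code:

```python
import pathlib


class VLMModel:
    """Wraps a serverless endpoint for text-image-pairs (VLM) projects.

    Hypothetical sketch: returns the raw response dict because the payload
    is free-form (captions, VQA, OCR, token-boxes) and should not be
    coerced into a detection schema.
    """

    def __init__(self, api_key: str, model_id: str):
        self.api_key = api_key
        self.model_id = model_id  # e.g. "amazontrial/1"

    def _build_request(self, image: str, **kwargs):
        # Assumed URL layout; extra kwargs (e.g. prompt=...) ride along as
        # query params so new model capabilities need no SDK change.
        url = f"https://serverless.roboflow.com/{self.model_id}"
        params = {"api_key": self.api_key, **kwargs}
        if image.startswith(("http://", "https://")):
            params["image"] = image  # hosted image: pass the URL through
            return url, params, None
        # local file: send the bytes in the request body instead
        return url, params, pathlib.Path(image).read_bytes()

    def predict(self, image: str, **kwargs) -> dict:
        import requests  # runtime dependency; lazy so the sketch stays importable

        url, params, body = self._build_request(image, **kwargs)
        resp = requests.post(url, params=params, data=body)
        resp.raise_for_status()
        # Free-form payload: hand the raw dict back to the caller.
        return resp.json()
```

The URL-vs-local-file split mirrors the request paths the PR's tests cover; everything after `_build_request` is a plain HTTP POST with no post-processing.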
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit b5758c9.
Summary
- `VLMModel` for `text-image-pairs` (PaliGemma-style) projects, wrapping the serverless endpoint that already supports these models. Returns the raw response dict unchanged: the payload is free-form (captions, VQA, OCR, token-boxes depending on the model), so coercing it into a detection schema would lose information.
- Wires the new type into `Version.model` and `roboflow infer`, so the CLI stops rejecting `text-image-pairs` with "Unsupported project type".
- Adds a `TYPE_TEXT_IMAGE_PAIRS` constant.

Why
serverless.roboflow.com already serves these models, but the SDK hardcoded a 5-type whitelist in both roboflow/core/version.py and roboflow/cli/handlers/infer.py, blocking hosted inference for `text-image-pairs`. This was pure plumbing: the backend is ready. The MCP `models_infer` tool inherits the same enum 1:1, so landing this unblocks that path once its schema is updated.

Design notes
- `VLMModel.predict(url | path, **kwargs)` returns the raw serverless JSON. No parsing of `box<loc_y1><loc_x1><loc_y2><loc_x2>` tokens: different `text-image-pairs` models produce different shapes (captions vs. VQA vs. token-boxes), so structured post-processing belongs to callers / model-specific helpers.
- Extra kwargs are forwarded as query params; today that's only `prompt` for PaliGemma, but this keeps us forward-compatible without baking in an assumption.
- The CLI treats a `dict` return as pass-through verbatim; `--confidence`/`--overlap` are skipped for VLM.

Test plan
- `python -m unittest`: 467 tests pass (added 8 new: 6 in tests/models/test_vlm.py, 2 in tests/cli/test_infer_handler.py).
- `ruff format` + `ruff check` clean on changed files.
- `amazontrial/1` (PaliGemma) + parcel image URL via direct `VLMModel`: response matches raw curl.
- `roboflow infer -m amazontrial/1 -t text-image-pairs <url> --json`: previously errored, now returns raw server JSON.

Follow-ups (not in this PR)
- MCP `models_infer` schema: add `text-image-pairs` to the `project_type` enum.
- A `parse_paligemma_boxes` helper as opt-in, not in the inference path.

Note
Medium Risk
Adds a new inference model type with custom HTTP request/response handling and changes CLI output behavior based on return type, which could affect inference consumers if assumptions about prediction objects change.
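To make the flagged risk concrete, here is an illustrative sketch of return-type dispatch in an infer handler. The function name and flag are hypothetical, not the PR's code; it only shows the shape of the behavior change (dict responses bypass the detection path):

```python
import json


def render_prediction(result, as_json: bool = False) -> str:
    """Render a prediction for CLI output, dispatching on return type."""
    if isinstance(result, dict):
        # VLM / text-image-pairs: free-form payload, passed through
        # verbatim; pretty-printed unless --json asks for compact output.
        return json.dumps(result) if as_json else json.dumps(result, indent=2)
    # Detection-style prediction objects keep their existing path.
    return str(result)
```

Consumers that assumed every result is a prediction object would break on this branch, which is what the Medium Risk note is pointing at.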
Overview
Adds support for `text-image-pairs` (VLM) hosted inference end-to-end. This introduces a new `VLMModel` that calls the serverless endpoint and returns the raw JSON response dict, wires the new `TYPE_TEXT_IMAGE_PAIRS` into `Version.model` selection, and updates `roboflow infer` to accept this project type and pass dict responses through (skipping `confidence`/`overlap` and pretty-printing when not using `--json`).

Includes unit tests covering the new CLI passthrough behavior and `VLMModel` URL/local-file request paths, query param forwarding, and error handling.
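The follow-ups mention an opt-in `parse_paligemma_boxes` helper. A hedged sketch of what such a caller-side helper might do, assuming PaliGemma-style `<locNNNN>` tokens (values 0-1023 in y1, x1, y2, x2 order, scaled to the image size); the helper name is from this PR, but the token format and behavior here are assumptions, not code from the change:

```python
import re

# PaliGemma-style location token, e.g. <loc0512> (assumed format).
LOC = re.compile(r"<loc(\d{4})>")


def parse_paligemma_boxes(text: str, width: int, height: int) -> list[dict]:
    """Parse box tokens from a raw VLM response into pixel-space boxes."""
    boxes = []
    for segment in text.split(";"):
        tokens = LOC.findall(segment)
        if len(tokens) != 4:
            continue  # caption/VQA segments carry no box tokens
        # Tokens encode normalized coords on a 0-1023 grid: y1 x1 y2 x2.
        y1, x1, y2, x2 = (int(t) / 1024 for t in tokens)
        label = LOC.sub("", segment).strip()
        boxes.append({
            "label": label,
            "x1": x1 * width, "y1": y1 * height,
            "x2": x2 * width, "y2": y2 * height,
        })
    return boxes
```

Keeping this out of `predict` preserves the raw pass-through contract described in the design notes: structured post-processing stays with callers who know which model they queried.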