GenAI servables - refactor input processing #4318
Draft
mzegla wants to merge 1 commit into
Pull request overview
This PR refactors GenAI servable input handling to use a unified InputRequest + InputProcessor chain, moving generation-config extraction into the API handler and deferring multimodal image decoding (and related validation) out of the OpenAI request parsers.
Changes:
- Introduces `InputRequest` and an `InputProcessor` pipeline (raw prompt extraction, chat template application, tokenization, deferred image decoding, text-content normalization).
- Updates LM/VLM servables and executors to consume `executionContext->inputRequest` (and removes legacy `prepareInputs` overrides in VLM servables).
- Updates OpenAI handlers/tests to preserve multimodal `content` arrays in `ChatHistory`, removes `processedJson`/`imageHistory`, and adjusts tools parsing assertions.
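
The pipeline described in the list above can be pictured as a request object passed through an ordered list of steps, stopping at the first error. The following is a minimal sketch under stated assumptions, not the PR's implementation: `SketchRequest`, `SketchStep`, `RawPromptStep`, `ByteTokenStep` and `runChain` are hypothetical names, the chat history is reduced to a vector of strings, and the "tokenizer" simply emits one id per byte.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the PR's types (not the real definitions).
using ChatMessages = std::vector<std::string>;            // simplified chat history
using Payload = std::variant<std::string, ChatMessages>;  // raw prompt or chat

struct SketchRequest {
    Payload input;
    std::string promptText;     // filled by a prompt-building step
    std::vector<int> inputIds;  // filled by a tokenization step
};

// One processing step; returns an error message on failure, nullopt on success.
struct SketchStep {
    virtual ~SketchStep() = default;
    virtual std::optional<std::string> process(SketchRequest& req) = 0;
};

// Copies a raw prompt payload into promptText; rejects chat payloads.
struct RawPromptStep : SketchStep {
    std::optional<std::string> process(SketchRequest& req) override {
        const auto* prompt = std::get_if<std::string>(&req.input);
        if (prompt == nullptr) {
            return std::string("expected a raw prompt payload");
        }
        req.promptText = *prompt;
        return std::nullopt;
    }
};

// Toy tokenizer (assumption for illustration): one id per byte of promptText.
struct ByteTokenStep : SketchStep {
    std::optional<std::string> process(SketchRequest& req) override {
        req.inputIds.assign(req.promptText.begin(), req.promptText.end());
        return std::nullopt;
    }
};

// Runs the steps in order and stops at the first error.
std::optional<std::string> runChain(
    const std::vector<std::unique_ptr<SketchStep>>& steps, SketchRequest& req) {
    for (const auto& step : steps) {
        assert(step != nullptr);
        if (auto err = step->process(req)) {
            return err;
        }
    }
    return std::nullopt;
}
```

In this sketch each step only reads and writes fields on the shared request, so servables and executors would need to look at a single object after the chain has run.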
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/test/llm/llmnode_test.cpp | Updates tests to use executionContext.inputRequest.inputIds. |
| src/test/llm/input_processing/raw_prompt_extractor_test.cpp | Adds unit tests for RawPromptExtractor. |
| src/test/llm/input_processing/image_decoding_processor_test.cpp | Adds unit tests for ImageDecodingProcessor behavior without actual decoding. |
| src/test/http_openai_handler_test.cpp | Updates parsing tests to assert ChatHistory preservation and tool map contents (no processedJson/imageHistory). |
| src/llm/visual_language_model/legacy/servable.hpp | Removes VLM legacy inputText/inputImages fields and sets isVLM in processor context. |
| src/llm/visual_language_model/legacy/servable.cpp | Switches to extractInputRequest(); removes legacy VLM prepareInputs implementation. |
| src/llm/visual_language_model/legacy/legacy_executor.cpp | Uses inputRequest.promptText/inputImages/generationConfig for generation. |
| src/llm/visual_language_model/continuous_batching/servable.hpp | Removes VLM CB inputText/inputImages fields and sets isVLM in processor context. |
| src/llm/visual_language_model/continuous_batching/servable.cpp | Uses inputRequest.* when adding requests; removes legacy VLM prepareInputs. |
| src/llm/servable.hpp | Replaces inputIds/GenerationConfigBuilder in execution context with InputRequest; adds InputProcessorContext. |
| src/llm/servable.cpp | Refactors base parseRequest/prepareInputs to build and process InputRequest. |
| src/llm/servable_initializer.cpp | Populates InputProcessorContext (tokenizer + optional Python template processor). |
| src/llm/language_model/legacy/servable.cpp | Uses inputRequest for generation config and NPU input-length validation. |
| src/llm/language_model/legacy/legacy_executor.cpp | Uses inputRequest.inputIds/generationConfig for generation. |
| src/llm/language_model/continuous_batching/servable.cpp | Uses inputRequest for scheduler limits and pipeline add_request. |
| src/llm/io_processing/input_request.hpp | Adds InputRequest and InputPayload variant. |
| src/llm/io_processing/input_processors/tokenization_processor.hpp | Adds tokenization processor definition. |
| src/llm/io_processing/input_processors/tokenization_processor.cpp | Implements tokenization into req.inputIds. |
| src/llm/io_processing/input_processors/text_content_normalization_processor.hpp | Adds text-only content-array normalizer (LM paths). |
| src/llm/io_processing/input_processors/text_content_normalization_processor.cpp | Implements content-array flattening to string with \\n joins. |
| src/llm/io_processing/input_processors/raw_prompt_extractor.hpp | Adds raw prompt extractor (COMPLETIONS path). |
| src/llm/io_processing/input_processors/image_decoding_processor.hpp | Adds deferred image decoding processor (VLM paths). |
| src/llm/io_processing/input_processors/image_decoding_processor.cpp | Implements image decoding + <ov_genai_image_N> injection into message content. |
| src/llm/io_processing/input_processors/chat_template_processor.hpp | Adds chat template processor (Python and native paths). |
| src/llm/io_processing/input_processors/chat_template_processor.cpp | Implements prompt building from ChatHistory. |
| src/llm/io_processing/input_processor.hpp | Adds orchestrator selecting processors based on config + payload variant. |
| src/llm/io_processing/input_processor.cpp | Builds and executes the processor chain. |
| src/llm/io_processing/input_processor_context.hpp | Adds per-deployment resources for input processing. |
| src/llm/io_processing/input_processing_config.hpp | Adds deployment-level processing config (isVLM). |
| src/llm/io_processing/base_input_processor.hpp | Adds base interface for processing steps. |
| src/llm/BUILD | Adds Bazel targets/deps for new IO processing components. |
| src/llm/apis/openai_responses.cpp | Preserves content arrays in ChatHistory and removes Python processedJson path + eager image decoding. |
| src/llm/apis/openai_request.hpp | Removes processedJson and imageHistory from OpenAIRequest. |
| src/llm/apis/openai_completions.cpp | Preserves multimodal content arrays in ChatHistory and removes eager image decoding + processedJson rebuild. |
| src/llm/apis/openai_api_handler.hpp | Removes getProcessedJson/getImageHistory; adds extractInputRequest(). |
| src/llm/apis/openai_api_handler.cpp | Implements extractInputRequest() and removes processedJson mutations from tools parsing. |
Comment on lines +19 to +21

    #include <string>
    #include <unordered_map>
    #include <utility>
Comment on lines +39 to +44

    for (size_t i = 0; i < chatHistory.size(); i++) {
        const auto content = chatHistory[i]["content"];
        if (content.as_string().value_or("").find("<ov_genai_image_") != std::string::npos) {
            return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
        }
    }
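
The check above inspects `content` only through `as_string()`, while the PR overview says multimodal `content` arrays are now preserved in `ChatHistory`. As a hedged illustration of a restricted-tag check that covers both shapes, here is a self-contained sketch. It assumes a simplified content model (`Content` is either a plain string or a list of text parts); `TextParts`, `Content`, `containsRestrictedTag` and `contentHasRestrictedTag` are hypothetical names and do not come from the PR.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical simplified message content: a plain string or a list of text parts.
using TextParts = std::vector<std::string>;
using Content = std::variant<std::string, TextParts>;

// Prefix taken from the reviewed snippet.
const std::string kRestrictedPrefix = "<ov_genai_image_";

bool containsRestrictedTag(const std::string& text) {
    return text.find(kRestrictedPrefix) != std::string::npos;
}

// Returns true if the restricted prefix appears in the string form
// or in any text part of the array form.
bool contentHasRestrictedTag(const Content& content) {
    if (const auto* text = std::get_if<std::string>(&content)) {
        return containsRestrictedTag(*text);
    }
    for (const auto& part : std::get<TextParts>(content)) {
        if (containsRestrictedTag(part)) {
            return true;
        }
    }
    return false;
}
```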
Comment on lines +69 to +71

    } else if (type == "text") {
        textContent += part["text"].as_string().value_or("");
    }
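
For comparison, the file summary describes `text_content_normalization_processor.cpp` as flattening content arrays to a string with `\n` joins. A minimal sketch of that kind of flattening, assuming a hypothetical `ContentPart` struct with `type` and `text` fields and a hypothetical `flattenTextParts` helper (neither is taken from the PR), could look like this:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical content part: "type" is e.g. "text" or "image_url".
struct ContentPart {
    std::string type;
    std::string text;
};

// Joins the text of all "text" parts with a single '\n' between them.
// Non-text parts are skipped in this sketch.
std::string flattenTextParts(const std::vector<ContentPart>& parts) {
    std::string out;
    bool first = true;
    for (const auto& part : parts) {
        if (part.type != "text") {
            continue;
        }
        if (!first) {
            out += '\n';
        }
        out += part.text;
        first = false;
    }
    return out;
}
```

Tracking `first` separately, rather than testing whether `out` is empty, keeps the separator correct when an earlier text part is itself an empty string.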
Comment on lines 207 to +210

      if (getProperties()->maxModelLength.has_value()) {
    -     if (executionContext->inputIds.get_size() > getProperties()->maxModelLength.value()) {
    +     if (req.inputIds.get_size() > getProperties()->maxModelLength.value()) {
              std::stringstream ss;
    -         ss << "Number of prompt tokens: " << executionContext->inputIds.get_size() << " exceeds model max length: " << getProperties()->maxModelLength.value();
    +         ss << "Number of prompt tokens: " << req.inputIds.get_size()
Comment on lines +499 to +507

    InputRequest req;
    req.generationConfig = configBuilder.getConfig();
    if (endpoint == Endpoint::COMPLETIONS) {
        req.input = request.prompt.value_or("");
    } else {
        // CHAT_COMPLETIONS and RESPONSES both use ChatHistory.
        // Copied (not moved) so the handler retains its own copy for response serialization.
        req.input = request.chatHistory;
    }
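
The snippet above stores either a raw prompt or the chat history in one `req.input` field, which the file summary describes as an `InputPayload` variant. A self-contained sketch of that selection, assuming simplified stand-in types (`SketchChat`, `SketchPayload`, `SketchParsedRequest`, `selectPayload` and `isChatPayload` are hypothetical, and the local `Endpoint` enum only mirrors the names visible in the snippet), might look like this:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Local enum mirroring the endpoint names visible in the reviewed snippet.
enum class Endpoint { COMPLETIONS, CHAT_COMPLETIONS, RESPONSES };

// Hypothetical stand-ins: chat history reduced to a list of message strings.
using SketchChat = std::vector<std::string>;
using SketchPayload = std::variant<std::string, SketchChat>;

struct SketchParsedRequest {
    std::optional<std::string> prompt;
    SketchChat chatHistory;
};

// COMPLETIONS carries a raw prompt; the other endpoints carry chat history.
// The chat history is copied, so the caller keeps its own instance.
SketchPayload selectPayload(Endpoint endpoint, const SketchParsedRequest& request) {
    if (endpoint == Endpoint::COMPLETIONS) {
        return request.prompt.value_or("");
    }
    return request.chatHistory;
}

// Later stages can branch on the active alternative instead of the endpoint.
bool isChatPayload(const SketchPayload& payload) {
    return std::holds_alternative<SketchChat>(payload);
}
```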
Comment on lines +49 to +65

    if (isChatPath) {
    #if (PYTHON_DISABLE == 0)
        processors.emplace_back(std::make_unique<ChatTemplateProcessor>(
            context.tokenizer,
            *context.templateProcessor,
            context.modelsPath));
    #else
        processors.emplace_back(std::make_unique<ChatTemplateProcessor>(context.tokenizer));
    #endif
    } else {
        processors.emplace_back(std::make_unique<RawPromptExtractor>());
    }

    if (!context.config.isVLM) {
        processors.emplace_back(std::make_unique<TokenizationProcessor>(
            context.tokenizer, addSpecialTokens));
    }
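
The selection logic in the snippet above depends on two flags: whether the payload is a chat history, and whether the deployment is a VLM. The sketch below models only the lines visible in this comment and returns step names instead of constructing processors, so it says nothing about steps added elsewhere in the file (for example image decoding or text-content normalization). `planVisibleSteps` is a hypothetical helper, not part of the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the names of the steps that the visible snippet would append,
// in order, for a given combination of flags.
std::vector<std::string> planVisibleSteps(bool isChatPath, bool isVLM) {
    std::vector<std::string> steps;
    if (isChatPath) {
        steps.emplace_back("ChatTemplateProcessor");
    } else {
        steps.emplace_back("RawPromptExtractor");
    }
    // Tokenization is only appended for non-VLM deployments in the snippet.
    if (!isVLM) {
        steps.emplace_back("TokenizationProcessor");
    }
    return steps;
}
```

Enumerating the four flag combinations this way makes it easy to see that, in the visible code, a VLM deployment ends with prompt text only and no token ids.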