GenAI servables - refactor input processing#4318
Conversation
f8f5106 to
6fe4461
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors GenAI servable input handling to use a unified InputRequest + InputProcessor chain, moving generation-config extraction into the API handler and deferring multimodal image decoding (and related validation) out of the OpenAI request parsers.
Changes:
- Introduces
InputRequestand anInputProcessorpipeline (raw prompt extraction, chat template application, tokenization, deferred image decoding, text-content normalization). - Updates LM/VLM servables and executors to consume
executionContext->inputRequest(and removes legacyprepareInputsoverrides in VLM servables). - Updates OpenAI handlers/tests to preserve multimodal
contentarrays inChatHistory, removesprocessedJson/imageHistory, and adjusts tools parsing assertions.
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/test/llm/llmnode_test.cpp | Updates tests to use executionContext.inputRequest.inputIds. |
| src/test/llm/input_processing/raw_prompt_extractor_test.cpp | Adds unit tests for RawPromptExtractor. |
| src/test/llm/input_processing/image_decoding_processor_test.cpp | Adds unit tests for ImageDecodingProcessor behavior without actual decoding. |
| src/test/http_openai_handler_test.cpp | Updates parsing tests to assert ChatHistory preservation and tool map contents (no processedJson/imageHistory). |
| src/llm/visual_language_model/legacy/servable.hpp | Removes VLM legacy inputText/inputImages fields and sets isVLM in processor context. |
| src/llm/visual_language_model/legacy/servable.cpp | Switches to extractInputRequest(); removes legacy VLM prepareInputs implementation. |
| src/llm/visual_language_model/legacy/legacy_executor.cpp | Uses inputRequest.promptText/inputImages/generationConfig for generation. |
| src/llm/visual_language_model/continuous_batching/servable.hpp | Removes VLM CB inputText/inputImages fields and sets isVLM in processor context. |
| src/llm/visual_language_model/continuous_batching/servable.cpp | Uses inputRequest.* when adding requests; removes legacy VLM prepareInputs. |
| src/llm/servable.hpp | Replaces inputIds/GenerationConfigBuilder in execution context with InputRequest; adds InputProcessorContext. |
| src/llm/servable.cpp | Refactors base parseRequest/prepareInputs to build and process InputRequest. |
| src/llm/servable_initializer.cpp | Populates InputProcessorContext (tokenizer + optional Python template processor). |
| src/llm/language_model/legacy/servable.cpp | Uses inputRequest for generation config and NPU input-length validation. |
| src/llm/language_model/legacy/legacy_executor.cpp | Uses inputRequest.inputIds/generationConfig for generation. |
| src/llm/language_model/continuous_batching/servable.cpp | Uses inputRequest for scheduler limits and pipeline add_request. |
| src/llm/io_processing/input_request.hpp | Adds InputRequest and InputPayload variant. |
| src/llm/io_processing/input_processors/tokenization_processor.hpp | Adds tokenization processor definition. |
| src/llm/io_processing/input_processors/tokenization_processor.cpp | Implements tokenization into req.inputIds. |
| src/llm/io_processing/input_processors/text_content_normalization_processor.hpp | Adds text-only content-array normalizer (LM paths). |
| src/llm/io_processing/input_processors/text_content_normalization_processor.cpp | Implements content-array flattening to string with \\n joins. |
| src/llm/io_processing/input_processors/raw_prompt_extractor.hpp | Adds raw prompt extractor (COMPLETIONS path). |
| src/llm/io_processing/input_processors/image_decoding_processor.hpp | Adds deferred image decoding processor (VLM paths). |
| src/llm/io_processing/input_processors/image_decoding_processor.cpp | Implements image decoding + <ov_genai_image_N> injection into message content. |
| src/llm/io_processing/input_processors/chat_template_processor.hpp | Adds chat template processor (Python and native paths). |
| src/llm/io_processing/input_processors/chat_template_processor.cpp | Implements prompt building from ChatHistory. |
| src/llm/io_processing/input_processor.hpp | Adds orchestrator selecting processors based on config + payload variant. |
| src/llm/io_processing/input_processor.cpp | Builds and executes the processor chain. |
| src/llm/io_processing/input_processor_context.hpp | Adds per-deployment resources for input processing. |
| src/llm/io_processing/input_processing_config.hpp | Adds deployment-level processing config (isVLM). |
| src/llm/io_processing/base_input_processor.hpp | Adds base interface for processing steps. |
| src/llm/BUILD | Adds Bazel targets/deps for new IO processing components. |
| src/llm/apis/openai_responses.cpp | Preserves content arrays in ChatHistory and removes Python processedJson path + eager image decoding. |
| src/llm/apis/openai_request.hpp | Removes processedJson and imageHistory from OpenAIRequest. |
| src/llm/apis/openai_completions.cpp | Preserves multimodal content arrays in ChatHistory and removes eager image decoding + processedJson rebuild. |
| src/llm/apis/openai_api_handler.hpp | Removes getProcessedJson/getImageHistory; adds extractInputRequest(). |
| src/llm/apis/openai_api_handler.cpp | Implements extractInputRequest() and removes processedJson mutations from tools parsing. |
6fe4461 to
f5b6037
Compare
| req.input = request.chatHistory; | ||
| // Populate tools and chat_template_kwargs on the copied ChatHistory so | ||
| // ChatTemplateProcessor can access them via get_tools()/get_extra_context(). | ||
| auto& chatHistory = std::get<ov::genai::ChatHistory>(req.input); |
There was a problem hiding this comment.
it can throw an exception if req.input is not of type ChatHistory. can we try catch that and return error if it happens?
| absl::Status process(InputRequest& req) override; | ||
|
|
||
| private: | ||
| ov::genai::Tokenizer* tokenizer; // non-owning; lifetime tied to InputProcessorContext |
There was a problem hiding this comment.
nit pick - we can do that since all constructors take tokenizer ref as argument
| ov::genai::Tokenizer* tokenizer; // non-owning; lifetime tied to InputProcessorContext | |
| ov::genai::Tokenizer& tokenizer; // non-owning; lifetime tied to InputProcessorContext |
| allowedMediaDomains(std::move(allowedMediaDomains)) {} | ||
|
|
||
| absl::Status ImageDecodingProcessor::process(InputRequest& req) { | ||
| ov::genai::ChatHistory& chatHistory = std::get<ov::genai::ChatHistory>(req.input); |
There was a problem hiding this comment.
also, do we need try catch and return error?
| absl::Status process(InputRequest& req) override; | ||
|
|
||
| private: | ||
| ov::genai::Tokenizer* tokenizer; // non-owning; lifetime tied to InputProcessorContext |
There was a problem hiding this comment.
| ov::genai::Tokenizer* tokenizer; // non-owning; lifetime tied to InputProcessorContext | |
| ov::genai::Tokenizer& tokenizer; // non-owning; lifetime tied to InputProcessorContext |
| // True when the GenAI built-in tokenizer.apply_chat_template() should be used | ||
| // even on Python-enabled builds (i.e. ChatTemplateMode::MINJA). | ||
| // False (default) uses PyJinjaTemplateProcessor when PYTHON_DISABLE==0. | ||
| bool useMinja = false; |
There was a problem hiding this comment.
do I put auto deduced caps/workarounds here?
| const auto& chatTemplateKwargs = chatTemplateKwargsParsingResult.value(); | ||
| if (llm_calculator_logger->should_log(spdlog::level::trace)) { | ||
| SPDLOG_LOGGER_TRACE(llm_calculator_logger, "VLM chatHistory messages: {}", chatHistory.get_messages().to_json_string()); | ||
| SPDLOG_LOGGER_TRACE(llm_calculator_logger, "VLM chatHistory.get_tools(): {}", chatHistory.get_tools().to_json_string()); |
There was a problem hiding this comment.
do we still have such logs somewhere? I think they were useful
| if (!getProperties()->inputProcessorContext.config.isVLM && | ||
| std::holds_alternative<ov::genai::ChatHistory>(executionContext->inputRequest.input)) { | ||
| const auto& ch = std::get<ov::genai::ChatHistory>(executionContext->inputRequest.input); | ||
| for (size_t i = 0; i < ch.size(); i++) { |
There was a problem hiding this comment.
could be separate util function
| "/wd6240", | ||
| "/wd6326", | ||
| "/wd6385", | ||
| "/wd6386", |
| TEST_P(HttpOpenAIHandlerChatAndResponsesParsingTest, ParsingImageStringWithNoMimePrefixFails) { | ||
| // Without a "data:..." prefix the URL falls through to the local-filesystem loader, | ||
| // which is disabled by default. | ||
| TEST_P(HttpOpenAIHandlerChatAndResponsesParsingTest, ParsingImageStringWithNoMimePrefixPreservedInChatHistory) { |
There was a problem hiding this comment.
if we change the assert here, do we have separate test checking if these requests are rejected during other phase?
| TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesWithNoContentFieldAddsEmptyStringToChatHistory) { | ||
| // Assistant turns that carry only tool_calls legitimately omit the "content" field. | ||
| // parseMessages injects an empty-string content entry so chat templates always see | ||
| // the field. Verifies the JsonContainer proxy write-through: lastMessage["content"] = "" |
There was a problem hiding this comment.
this is one of "workarounds" llama.cpp and minja applies. I see we already have it - for all models without autodetection. Is it new behavior or you added this in this PR? If so, which processor handles that?
No description provided.