Summary
On a _python_on Windows build of OVMS, /v3/chat/completions fails for a text-decoder IR extracted from a multimodal (text+image+audio+video) qwen3_5 model:
Mediapipe execution failed. MP status - INVALID_ARGUMENT: CalculatorGraph::Run() failed:
Calculator::Process() for node "LLMExecutor" failed:
Error: Chat template not loaded correctly, so it cannot be applied
/v3/completions (raw prompt) on the same served model works perfectly, so the model, tokenizer, and inference are fine — only the chat-template application path fails. The model is loaded as an LLM (continuous-batching) pipeline, which defaults to the Python-Jinja2 template processor.
Environment
- OVMS 2026.2.1 (ovms_windows_2026.2.1_python_on), GenAI backend 2026.2.1.0-3123. Also reproduced on 2026.2.0.
- Windows, Intel Arc GPU (targetDevice: GPU).
- Model: text decoder extracted from a qwen3_5 omni model, exported to INT4 OpenVINO IR via optimum-cli. Standard Qwen ChatML template; <|im_start|>/<|im_end|> present in vocab; bos=None, eos=<|im_end|> (identical to a working Qwen3-14B).
Reproduction
- Serve the omni-derived text-decoder IR as an LLM continuous-batching pipeline (default graph.pbtxt).
- POST /v3/completions with a raw ChatML prompt → works, correct output.
- POST /v3/chat/completions with messages → fails with the error above.
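The two requests can be reproduced with a short script along these lines. This is a minimal sketch: BASE_URL and MODEL are placeholders rather than values from this report, and the request bodies assume the OpenAI-compatible shape of the two endpoints.

```python
import json
import urllib.error
import urllib.request

# Assumptions: adjust both to the local deployment; neither comes from the report.
BASE_URL = "http://localhost:8000"
MODEL = "qwen3_5-text-decoder"


def completions_body(model):
    """Raw-prompt request: the ChatML markup is written by hand."""
    prompt = "<|im_start|>user\nSay hello.<|im_end|>\n<|im_start|>assistant\n"
    return {"model": model, "prompt": prompt, "max_tokens": 32}


def chat_body(model):
    """Chat request: the server has to apply the chat template itself."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    }


def post(path, body):
    """POST a JSON body and return (HTTP status, response text)."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Error responses carry the server's message in the body.
        return err.code, err.read().decode("utf-8")
```

Calling post("/v3/completions", completions_body(MODEL)) and then post("/v3/chat/completions", chat_body(MODEL)) against the served model should show the first request succeeding and the second returning the "Chat template not loaded correctly" error.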
Root cause (traced in source)
/chat uses the embedded Python Jinja2 processor, and the template object ends up null:
src/llm/py_jinja_template_processor.cpp (~L39-40):
if (templateProcessor.chatTemplate == nullptr) {
    output = "Error: Chat template not loaded correctly, so it cannot be applied";
    return false;
}
src/llm/servable_initializer.cpp → loadPyTemplateProcessor (~L147+): reads tokenizer.get_original_chat_template() then compiles it in an ImmutableSandboxedEnvironment. For this tokenizer the load/compile does not produce a usable template, so chatTemplate stays null and /chat fails at apply time.
Importantly, GenAI's own Tokenizer.apply_chat_template() succeeds on the exact same tokenizer (verified standalone with openvino_genai 2026.2.1.0, which renders correct ChatML). So the failure is specific to OVMS's Python-Jinja serving path, not GenAI's template engine. Upgrading GenAI alone does not fix /chat.
The mechanism to fix it already exists in main — but not in the release
main has LLMCalculatorOptions.chat_template_mode (src/llm/llm_calculator.proto):
- MINJA = 0: use GenAI apply_chat_template (the path that works here). "default for VLM pipelines."
- JINJA = 1: Python Jinja2. "default for LLM pipelines", i.e. the failing path for this model.
There is even an in-code TODO(dkalinow) to make MINJA the default for VLM. Setting chat_template_mode: MINJA in the graph would route through the working engine — but the 2026.2.1 release binary rejects the field:
libprotobuf ERROR ... text_format.cc: Message type "mediapipe.LLMCalculatorOptions"
has no field named "chat_template_mode".
So the option is main-only and the graph fails to load when it's added on 2026.2.1.
Requests
- Release the chat_template_mode option in a 2026.2.x/2026.3 build so users can opt VL-derived/omni LLM pipelines into MINJA.
- (Robustness) In loadChatTemplate/loadPyTemplateProcessor, when the Python-Jinja processor leaves chatTemplate == nullptr, auto-fall-back to MINJA (GenAI's engine) instead of failing /chat outright, since GenAI already handles these templates correctly.
- Consider making MINJA the default (or auto-selected) for LLM pipelines whose tokenizer originates from a VL/omni model (aligns with the existing VLM TODO).
Current workaround
A thin reverse proxy that applies ChatML itself and forwards to /v3/completions restores /chat/completions fully (verified, correct outputs). Happy to share if useful.
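For reference, a proxy of this kind can be sketched as follows. This is not the proxy used above but a minimal illustration under stated assumptions: UPSTREAM and the listening port are placeholders, message content is plain text, the ChatML rendering is the basic form (no tool or thinking handling), and the response mapping assumes OpenAI-style completion and chat-completion shapes. Streaming and upstream error handling are left out.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: address of the OVMS REST endpoint that works.
UPSTREAM = "http://localhost:8000/v3/completions"


def render_chatml(messages):
    """Render messages as basic ChatML and open the assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    return "".join(parts) + "<|im_start|>assistant\n"


def to_completions_request(chat_req):
    """Replace 'messages' with a rendered 'prompt'; pass other fields through."""
    body = {k: v for k, v in chat_req.items() if k != "messages"}
    body["prompt"] = render_chatml(chat_req["messages"])
    return body


def to_chat_response(completion):
    """Map a completions-style response to a chat-completions-style one."""
    choices = [
        {
            "index": c.get("index", i),
            "message": {"role": "assistant", "content": c.get("text", "")},
            "finish_reason": c.get("finish_reason"),
        }
        for i, c in enumerate(completion.get("choices", []))
    ]
    return dict(completion, object="chat.completion", choices=choices)


class ChatProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v3/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        chat_req = json.loads(self.rfile.read(length))
        upstream = urllib.request.Request(
            UPSTREAM,
            data=json.dumps(to_completions_request(chat_req)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(upstream) as resp:
            completion = json.loads(resp.read())
        payload = json.dumps(to_chat_response(completion)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8001):
    """Listen locally and forward chat requests to UPSTREAM."""
    HTTPServer(("127.0.0.1", port), ChatProxy).serve_forever()
```

Running serve() and pointing clients at port 8001 instead of OVMS keeps the client-side API unchanged. Chat-only fields such as tools are forwarded untouched here, so a fuller version would need to filter or translate them.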