Disable thinking mode in vLLM instruct chat by dmndxld · Pull Request #457 · IINemo/lm-polygraph

dmndxld · 2026-04-22T09:34:35Z

Problem

Qwen3 enables thinking mode by default in its chat template, so the model generates <think> blocks before the
actual response. This interferes with uncertainty estimation, because the generated text then contains reasoning
traces rather than only the direct answer.

When using WhiteboxModelvLLM with instruct=True, the model.chat() call invokes apply_chat_template internally,
which enables thinking mode by default for Qwen3.
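
To illustrate the problem, the sketch below separates a reasoning trace from the answer in a generated string. It is not part of this PR: the helper name split_reasoning is hypothetical, and it assumes the trace appears literally as <think>...</think> in the decoded text.

```python
import re
from typing import Tuple

# Matches a complete <think>...</think> block, or an unterminated one that
# runs to the end of the text (e.g. when generation stops at the token limit).
_THINK_RE = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) for one generated string.

    Hypothetical helper for illustration only; assumes the reasoning
    trace is wrapped in literal <think> tags.
    """
    reasoning = "".join(m.group(0) for m in _THINK_RE.finditer(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

With thinking mode on, the reasoning part is non-empty and any estimator that scores the full generated text would be scoring the trace as well as the answer.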

Fix

Pass chat_template_kwargs={"enable_thinking": False} to model.chat(), which propagates to apply_chat_template
and disables thinking mode:

output = self.model.chat(
    *args, chats, sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)

This uses vLLM's native chat_template_kwargs parameter rather than manually calling apply_chat_template +
model.generate(), keeping the implementation consistent with vLLM's API.
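
If hard-coding the flag turns out to be too restrictive, one possible variation is to make it configurable. This is a minimal sketch, not what the PR implements: the function name chat_with_thinking_flag and its enable_thinking argument are hypothetical, and model is only assumed to expose a chat() method that accepts chat_template_kwargs, as vLLM's LLM.chat does.

```python
from typing import Any, Dict, List, Optional


def chat_with_thinking_flag(
    model: Any,
    chats: List[List[Dict[str, str]]],
    sampling_params: Any = None,
    enable_thinking: Optional[bool] = False,
) -> Any:
    """Call model.chat(), forwarding enable_thinking to the chat template.

    Hypothetical wrapper for illustration. Passing enable_thinking=None
    omits chat_template_kwargs, leaving the template's default untouched.
    """
    extra: Dict[str, Any] = {}
    if enable_thinking is not None:
        extra["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return model.chat(chats, sampling_params, **extra)
```

Defaulting to False keeps the behaviour described in this PR, while None gives callers a way to opt out for models whose templates do not use the variable.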

Affected models

Models that enable thinking mode by default in their chat template, such as Qwen3.
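
A rough way to guess whether a given model is affected is to check whether its chat template refers to the enable_thinking variable at all. The sketch below is a heuristic under that assumption, not a guarantee: the helper name is hypothetical, and a template could implement thinking mode under a different variable name.

```python
from typing import Optional


def template_mentions_thinking_flag(
    chat_template: Optional[str], flag: str = "enable_thinking"
) -> bool:
    """Heuristic: True if the chat template text references `flag`.

    Hypothetical helper for illustration. A substring match cannot tell
    whether thinking is on or off by default, only that the template
    reacts to the variable.
    """
    return bool(chat_template) and flag in chat_template
```

With a Hugging Face tokenizer, the template string is typically available as tokenizer.chat_template, so the check can be run before deciding whether to pass chat_template_kwargs.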

Disable thinking mode in vLLM instruct chat

4d4335a

smirnovlad mentioned this pull request May 2, 2026

feat: single-pass native HS capture with prefix-cache fill and multi-step accumulator #453

Merged

dmndxld wants to merge 1 commit into IINemo:main from dmndxld:fix/vllm-disable-thinking-mode
