Disable thinking mode in vLLM instruct chat #457

Open
dmndxld wants to merge 1 commit into IINemo:main from dmndxld:fix/vllm-disable-thinking-mode

dmndxld commented Apr 22, 2026

Problem

Qwen3 enables thinking mode by default in its chat template, so it generates <think> blocks before the actual response.
This interferes with uncertainty estimation, because the generated text then begins with a reasoning trace rather than
the direct answer.

When using WhiteboxModelvLLM with instruct=True, the model.chat() call invokes apply_chat_template internally,
which enables thinking mode by default for Qwen3.

Fix

Pass chat_template_kwargs={"enable_thinking": False} to model.chat(), which propagates to apply_chat_template
and disables thinking mode:

output = self.model.chat(
    *args, chats, sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)

This uses vLLM's native chat_template_kwargs parameter rather than manually calling apply_chat_template +
model.generate(), keeping the implementation consistent with vLLM's API.
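
For illustration, the same forwarding can be sketched as a small standalone helper. This is only a sketch under assumptions: `chat_without_thinking` and `extra_template_kwargs` are hypothetical names that do not appear in this PR, and `model` is assumed to be any object whose `chat()` method accepts `sampling_params` and `chat_template_kwargs` keyword arguments, as vLLM's `LLM.chat` does.

```python
from typing import Any, Dict, Optional


def chat_without_thinking(
    model: Any,
    chats: Any,
    sampling_params: Any = None,
    extra_template_kwargs: Optional[Dict[str, Any]] = None,
) -> Any:
    """Call model.chat() with thinking mode disabled in the chat template.

    Hypothetical helper, not part of this PR. `model` is assumed to expose
    a chat() method that accepts `chat_template_kwargs`.
    """
    # Merge caller-supplied template kwargs first, then force
    # enable_thinking=False so it cannot be overridden by accident.
    template_kwargs = {**(extra_template_kwargs or {}), "enable_thinking": False}
    return model.chat(
        chats,
        sampling_params=sampling_params,
        chat_template_kwargs=template_kwargs,
    )
```

Applying `enable_thinking` last is a design choice of this sketch; the change in this PR simply passes a fixed dictionary.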

Affected models

Models that enable thinking mode by default in their chat template, such as Qwen3.
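
To check whether a given model is affected, or whether the fix took effect, one option is to scan generated texts for reasoning tags. A minimal sketch, using only the standard library; `contains_think_block` is a hypothetical name and not part of this repository:

```python
def contains_think_block(text: str) -> bool:
    """Return True if the generated text carries a reasoning-trace tag.

    Hypothetical helper for sanity-checking outputs. Both the opening and
    the closing tag are checked, as a conservative choice, in case only
    one of them ends up in the decoded text.
    """
    return "<think>" in text or "</think>" in text
```

Running this over a handful of generations before and after the change gives a quick signal of whether thinking mode is still active.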
