
fix(frontend): auto-detect force_reasoning when chat template appends <think>#8240

Open
navmarri14 wants to merge 2 commits into ai-dynamo:main from navmarri14:main

Conversation

@navmarri14

@navmarri14 navmarri14 commented Apr 15, 2026

Overview:

auto-detect force_reasoning when chat template appends <think>

Details:

When a chat template's generation prompt ends with <think>, the
reasoning parser must start in reasoning mode so it correctly separates
thinking content from normal output. Detect this by inspecting the tail
of the tokenized prompt and pass force_reasoning=True to the
ReasoningParser. Also pass return_dict=False to apply_chat_template
for consistent tokenizer output.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Automatic detection of reasoning-enabled prompts to ensure proper parsing.
    • Enhanced reasoning parser configuration for improved handling of edge cases and complex scenarios.
  • Improvements

    • Optimized tokenizer template handling for more efficient prompt processing.
    • Refined multi-process reasoning support to maintain consistency across request processing.

@navmarri14 navmarri14 requested review from a team as code owners April 15, 2026 18:34
@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Apr 15, 2026
@github-actions
Contributor

👋 Hi navmarri14! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added the external-contribution Pull request is from an external contributor label Apr 15, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

Walkthrough

A new detect_force_reasoning helper function checks if prompts end with "<think>" tokens. The function is applied in preprocessing to detect force reasoning intent, and the result is passed through parser creation methods. Tokenizer template application was also updated to disable dictionary output.

Changes

Cohort / File(s) Summary
Force Reasoning Detection
components/src/dynamo/frontend/sglang_prepost.py
Added detect_force_reasoning() helper function that decodes prompt tail to detect "<think>" endings. Updated create_parsers() signature with force_reasoning parameter and modified preprocess_chat_request() to compute and pass force reasoning status to parser creation. Changed tokenizer template application to include return_dict: False.
Multi-Process Integration
components/src/dynamo/frontend/sglang_processor.py
Imported detect_force_reasoning and integrated force reasoning computation in _generator_inner_pool(). Updated create_parsers() call in the multi-process generation path to forward the computed force_reasoning flag.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: auto-detecting force_reasoning when a chat template appends <think>, which aligns directly with the primary objective and file changes.
Description check ✅ Passed The description covers the overview, details of changes, and implementation approach, though it lacks specific file recommendations and contains a placeholder for related issues.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
components/src/dynamo/frontend/sglang_prepost.py (1)

61-73: Consider adding a constant and type hint for clarity.

The function is well-implemented, but a couple of minor improvements would enhance readability:

  1. The magic number 10 could be documented or made a constant
  2. The tokenizer parameter lacks a type hint (though this may be intentional due to multiple tokenizer types)
♻️ Optional: Add constant and brief inline comment
+# Lookback tokens to decode for <think> detection; covers typical tokenizations
+_THINK_DETECT_LOOKBACK = 10
+
+
 def detect_force_reasoning(tokenizer, prompt_token_ids: list[int]) -> bool:
     """Check if the chat template's generation prompt ends with ``<think>``.
 
     When the template appends ``<think>`` to the prompt, the model output
     starts inside a reasoning block without an explicit opening tag.
     The reasoning parser must be told to begin in reasoning mode
     (``force_reasoning=True``) so that it correctly separates reasoning
     content from normal content.
     """
     if not prompt_token_ids:
         return False
-    tail = tokenizer.decode(prompt_token_ids[-10:], skip_special_tokens=False)
+    tail = tokenizer.decode(
+        prompt_token_ids[-_THINK_DETECT_LOOKBACK:], skip_special_tokens=False
+    )
     return tail.rstrip().endswith("<think>")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/frontend/sglang_prepost.py` around lines 61 - 73, The
detect_force_reasoning function uses a magic number 10 and lacks a tokenizer
type hint; introduce a module-level constant (e.g., PROMPT_TAIL_TOKEN_WINDOW =
10) and replace the literal 10 in detect_force_reasoning with that constant, add
a brief comment above the constant explaining it controls how many tail tokens
to inspect, and add a permissive type hint for tokenizer (e.g., TokenizerLike or
Any) on the detect_force_reasoning signature to document expected type while
preserving compatibility with multiple tokenizer implementations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 427fe291-98f3-44c4-91b9-e5fdfd957a26

📥 Commits

Reviewing files that changed from the base of the PR and between 2ac22df and 8b06e79.

📒 Files selected for processing (2)
  • components/src/dynamo/frontend/sglang_prepost.py
  • components/src/dynamo/frontend/sglang_processor.py

@navmarri14 navmarri14 marked this pull request as draft April 16, 2026 05:38
@navmarri14 navmarri14 marked this pull request as ready for review April 21, 2026 17:34
@ishandhanani
Contributor

/ok to test 52248ee

"""
if not prompt_token_ids:
return False
tail = tokenizer.decode(prompt_token_ids[-10:], skip_special_tokens=False)
Contributor


can you explain this magic number?

@ishandhanani
Contributor

Can you please attach a brief before and after in the PR description to help us debug in the future?

Comment on lines +79 to +91
def detect_force_reasoning(tokenizer, prompt_token_ids: list[int]) -> bool:
    """Check if the chat template's generation prompt ends with ``<think>``.

    When the template appends ``<think>`` to the prompt, the model output
    starts inside a reasoning block without an explicit opening tag.
    The reasoning parser must be told to begin in reasoning mode
    (``force_reasoning=True``) so that it correctly separates reasoning
    content from normal content.
    """
    if not prompt_token_ids:
        return False
    tail = tokenizer.decode(prompt_token_ids[-10:], skip_special_tokens=False)
    return tail.rstrip().endswith("<think>")
Contributor

@KrishnanPrash KrishnanPrash Apr 22, 2026


If possible, add unit tests for detect_force_reasoning that use a lightweight or mock tokenizer. Some possible test cases:

  • Empty prompt → False
  • Prompt ending with <think> → True
  • Prompt ending with <think>\n (trailing whitespace) → True
  • Prompt NOT ending with <think> → False

Comment on lines +135 to +136
if force_reasoning:
    kwargs["force_reasoning"] = True
Contributor

@KrishnanPrash KrishnanPrash Apr 22, 2026


nit: Could remove the double force_reasoning derivation and just do something like:

if reasoning_parser_name and tokenizer and prompt_token_ids:
    kwargs["force_reasoning"] = detect_force_reasoning(tokenizer, prompt_token_ids)

Would require adding tokenizer and prompt_token_ids to arg list.

@navmarri14
Author

@ishandhanani
payload

curl -s http://localhost:8000/v1/chat/completions     -H "Content-Type: application/json"     -d '{
  "model": "/tmp-nvme/models/glm5-nvfp4",
  "messages": [
    {"role": "user", "content": "Compute 1+1!"}
  ],
  "stream": false,
  "max_tokens": 200,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calculate_sum",
        "description": "Calculates the sum of two numbers.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {"type": "number", "description": "The first number to add."},
            "b": {"type": "number", "description": "The second number to add."}
          },
          "required": ["a", "b"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'

before

{"id":"chatcmpl-98a27d7a-7b7a-4056-88cd-66c3a999853f","choices":[{"index":0,"message":{"content":"The user is asking me to compute 1+1. This is a simple addition problem. I have a function called \"calculate_sum\" that takes two numbers and returns their sum. Since this is exactly what the user is asking for, I should use this function.\n\nThe parameters needed are:\n- a: 1 (the first number)\n- b: 1 (the second number)\n\nBoth parameters are provided implicitly in the request \"1+1\".</think>I'll help you compute 1+1 using the available calculation function.<tool_call>calculate_sum<arg_key>a</arg_key><arg_value>1</arg_value><arg_key>b</arg_key><arg_value>1</arg_value></tool_call>","role":"assistant","reasoning_content":null},"finish_reason":"stop"}],"created":1776202733,"model":"/mnt/models","object":"chat.completion","usage":{"prompt_tokens":173,"completion_tokens":125,"total_tokens":298},"nvext":{"worker_id":{"prefill_worker_id":782000407404662,"prefill_dp_rank":0,"decode_worker_id":782000407404662,"decode_dp_rank":0},"timing":{"request_received_ms":1776202733323,"prefill_wait_time_ms":0.8677819999999999,"prefill_time_ms":171.25239,"ttft_ms":172.120172,"total_time_ms":1700.60159,"kv_hit_rate":0.0,"router_queue_depth":0}}}

notice reasoning_content is null

after

{"id":"969fe798d036b549","choices":[{"index":0,"message":{"content":"I'll calculate 1+1 for you using the sum function.","tool_calls":[{"id":"call_a387f52be1b5be7e","type":"function","function":{"name":"calculate_sum","arguments":"{\"a\": 1, \"b\": 1}"}}],"role":"assistant","reasoning_content":"The user wants me to compute 1+1. I have a function called calculate_sum that can calculate the sum of two numbers. I should use this function to compute 1+1.\n\nLooking at the function parameters:\n- a: number (required) - I'll use 1\n- b: number (required) - I'll use 1\n\nBoth required parameters are provided, so I can make the function call."},"finish_reason":"tool_calls"}],"created":1776276679,"model":"/tmp-nvme/models/glm5-nvfp4","object":"chat.completion","usage":{"prompt_tokens":181,"completion_tokens":119,"total_tokens":300}}

notice reasoning content is populated.

@richardhuo-nv
Contributor

richardhuo-nv commented Apr 23, 2026

Does sglang also do this? I don't think so TBH.

I think we should not diverge from SGLang's chat implementation. Because a lot of other models are relying on this processor as well.

Is there a specific case that sglang can do but dynamo cannot? we need to find out where the real divergence happened.

@richardhuo-nv
Contributor

What's the sglang's output of the request?

If this is a bug in sglang, we should fix the upstream first.

@navmarri14
Author

Does sglang also do this? I don't think so TBH.

I think we should not diverge from SGLang's chat implementation. Because a lot of other models are relying on this processor as well.

Is there a specific case that sglang can do but dynamo cannot? we need to find out where the real divergence happened.

sglang handles this as well, although slightly differently:

  • Template-level static flag populated once on TemplateManager.force_reasoning (ref), used in serving_chat.py.
  • Per-request override in _get_reasoning_from_request, which is just a parser lookup table.
  • Both are then OR'd together and gated by request.separate_reasoning (ref).

I can re-align the implementation to be close to sglang.
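Combined, that logic amounts to something like the following (illustrative only; the names REASONING_PARSER_OVERRIDES and resolve_force_reasoning are assumptions, not sglang's actual identifiers):

```python
# Per-parser override lookup table (assumed contents, for illustration).
REASONING_PARSER_OVERRIDES = {"deepseek-r1": True}


def resolve_force_reasoning(template_force_reasoning, parser_name, separate_reasoning):
    """OR the static template-level flag with the per-request parser
    override, gated by the request's separate_reasoning setting."""
    if not separate_reasoning:
        # Gate: no reasoning separation requested, so never force it.
        return False
    per_request = REASONING_PARSER_OVERRIDES.get(parser_name, False)
    return template_force_reasoning or per_request
```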

@richardhuo-nv
Contributor


Thanks! Yeah, let's align with sglang. Because someone could simply use vibe coding and say "align the dynamo sglang chat processor with sglang" in the prompt and the coding agent could get your change reverted, and no one will notice it. We'd better align closely.

@pull-request-size pull-request-size Bot added size/L and removed size/M labels Apr 23, 2026

@richardhuo-nv
Contributor

@navmarri14 LGTM! could you sign your commits?

In your local branch, run: git rebase HEAD~4 --signoff
Force push your changes to overwrite the branch: git push --force-with-lease origin main

also could you resolve the issue in the pre-commit hooks: https://github.com/ai-dynamo/dynamo/actions/runs/24859235642/job/72780270461?pr=8240

Thanks for your contribution!

@richardhuo-nv
Contributor

/ok to test e8f0006

@richardhuo-nv
Contributor

Sorry, I think there are merge conflicts now, could you rebase and rerun?

@richardhuo-nv
Contributor

/ok to test 6693101

@richardhuo-nv
Contributor

richardhuo-nv commented Apr 30, 2026

@navmarri14
looks like these three tool calling tests fail pretty consistently after your change:

FAILED tests/frontend/test_tool_calling_sglang.py::TestToolCallingProtocol::test_tool_choice_required_forces_a_tool_call[chat_processor_frontend]
FAILED tests/frontend/test_tool_calling_sglang.py::TestToolCallingProtocol::test_named_tool_choice_forces_specific_function[chat_processor_frontend]
FAILED tests/frontend/test_tool_calling_sglang.py::TestToolCallingProtocol::test_array_argument_schema_valid[chat_processor_frontend]

can you test locally with these three tests and see if it's solvable? Thanks!

navmarri14 and others added 2 commits May 6, 2026 04:30
… <think>

Signed-off-by: Naveen Marri <[email protected]>
Signed-off-by: Liangjun Feng <[email protected]>
@liangjuf

liangjuf commented May 6, 2026

@richardhuo-nv (navmarri14 is on leave, so I'm taking over the work) I've fixed the failing test cases in the new commit. It turned out the test model is Qwen3; with this change we enable reasoning by default, and the model puts its output in the reasoning part instead of the normal response that the tool-call parser reads. I fixed the issue by disabling the reasoning parser when tool_choice is set to required or a specific function (which aligns with sglang's behavior), and added more unit tests for the new logic. PTAL and run the tests, thank you.
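The gating described here might look roughly like this (a sketch with hypothetical names, not the PR's exact code):

```python
def reasoning_parser_enabled(reasoning_parser_name, tool_choice):
    """Disable the reasoning parser when tool_choice forces a tool call,
    so the model output reaches the tool-call parser instead of being
    captured as reasoning content (mirroring sglang's behavior)."""
    if reasoning_parser_name is None:
        return False
    # tool_choice may be "auto", "none", "required", or a dict naming
    # a specific function (per the OpenAI chat-completions schema).
    if tool_choice == "required" or isinstance(tool_choice, dict):
        return False
    return True
```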

@richardhuo-nv
Contributor

/ok to test 4daad7b

@richardhuo-nv
Contributor


thank you so much!


Labels

external-contribution Pull request is from an external contributor fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants