Skip to content

[None][fix] Add streaming support to no </think> for nemotron model#12176

Merged
tijyojwad merged 5 commits into
NVIDIA:mainfrom
tijyojwad:jdaw/add-streaming-support-think-fix
Mar 14, 2026
Merged

[None][fix] Add streaming support to no </think> for nemotron model#12176
tijyojwad merged 5 commits into
NVIDIA:mainfrom
tijyojwad:jdaw/add-streaming-support-think-fix

Conversation

@tijyojwad
Copy link
Copy Markdown
Collaborator

@tijyojwad tijyojwad commented Mar 13, 2026

Summary by CodeRabbit

  • New Features

    • Enhanced streaming reasoning completion to properly finalize reasoning content across multiple parser types.
  • Bug Fixes

    • Fixed handling of incomplete reasoning segments when streaming ends without closing tags; reasoning content is now correctly emitted and marked as complete.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Override finish() in NemotronV3ReasoningParser so that when
force_nonempty_content is not set and no </think> closing tag is found,
reasoning stays in reasoning_content and content remains empty.
When force_nonempty_content is set, the parent behavior is preserved
(reasoning moved to content). Also clears accumulated reasoning on
closing tag detection to reduce memory.

Signed-off-by: Joyjit Daw <[email protected]>
Made-with: Cursor
@tijyojwad tijyojwad requested a review from a team as a code owner March 13, 2026 02:10
@tijyojwad tijyojwad requested a review from syuoni March 13, 2026 02:10
@tijyojwad tijyojwad requested review from 2ez4bz and Wanli-Jiang March 13, 2026 02:11
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 13, 2026

📝 Walkthrough

Walkthrough

The PR introduces streaming finish handling to reasoning parsers by adding finish() methods and reasoning_completed properties to parser classes. State tracking for incomplete reasoning segments is added, and the finished flag is propagated through post-processing and response utilities to finalize reasoning content when streaming ends.

Changes

Cohort / File(s) Summary
Reasoning Parser Core
tensorrt_llm/llmapi/reasoning_parser.py
Added finish() method and reasoning_completed property to BaseReasoningParser, DeepSeekR1Parser, and NemotronV3ReasoningParser. Introduced state fields _accumulated_reasoning and _found_closing_tag to track incomplete reasoning segments. Modified parse() and parse_delta() to accumulate partial content and properly handle closing tags.
Post-processing Integration
tensorrt_llm/serve/postprocess_handlers.py, tensorrt_llm/serve/responses_utils.py
Extended apply_reasoning_parser and _apply_reasoning_parser signatures with finished parameter. Added logic to invoke finish() on reasoning parsers when streaming completes, merging results back into content and reasoning_content. Imported ReasoningParserResult for result composition.
Test Coverage
tests/unittest/llmapi/test_reasoning_parser.py
Expanded test suite with finish() scenario tests for multiple parser types (deepseek-r1, qwen3, nano-v3). Added verification of finishing state (content, reasoning_content) and reasoning_completed flag across various delta sequences and optional chat template configurations.

Sequence Diagram(s)

sequenceDiagram
    participant Streaming as Streaming Handler
    participant PostProc as PostProcess Handler
    participant Parser as Reasoning Parser
    participant State as Parser State

    Streaming->>PostProc: apply_reasoning_parser(text, streaming=true, finished=false)
    PostProc->>Parser: parse_delta(text)
    Parser->>State: Accumulate reasoning segment
    Parser-->>PostProc: (content, reasoning_content)
    PostProc-->>Streaming: (processed_content, reasoning)

    Note over Streaming: Stream continues...

    Streaming->>PostProc: apply_reasoning_parser(text, streaming=true, finished=true)
    PostProc->>Parser: parse_delta(final_text)
    PostProc->>Parser: finish()
    Parser->>State: Flush accumulated_reasoning
    Parser->>State: Check closing tag status
    Parser-->>PostProc: ReasoningParserResult
    PostProc->>PostProc: Merge finish result
    PostProc-->>Streaming: (final_content, final_reasoning)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning Pull request description is almost entirely blank template content with only a checkbox marked; substantive content sections (Description and Test Coverage) are empty. Author should fill in the Description section explaining the issue, solution, and reasoning; and the Test Coverage section listing relevant tests. Commit messages indicate this adds streaming support for missing tag handling in reasoning parsers.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: adding streaming support for handling the missing closing tag in the Nemotron reasoning parser model.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/llmapi/test_reasoning_parser.py (1)

208-225: Flake8 indentation warnings.

Static analysis flagged indentation issues at lines 51, 106, and 225. These are minor formatting issues but should be addressed for consistency.

Suggested fix for line 225
 `@pytest.mark.parametrize`(("delta_texts", "finish_content", "finish_reasoning",
                           "reasoning_completed", "chat_template_kwargs"), [
-                              (["a", "b"], "", "", False, None),
+                             (["a", "b"], "", "", False, None),

Apply similar fixes to lines 51 and 106 to align with the hanging indent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/llmapi/test_reasoning_parser.py` around lines 208 - 225, The
parametrize block for the test (pytest.mark.parametrize with parameter names
"delta_texts", "finish_content", "finish_reasoning", "reasoning_completed",
"chat_template_kwargs") has inconsistent hanging indent causing Flake8 warnings;
reformat the multi-line tuple and the list of parameter tuples so continuation
lines align under the first opening parenthesis (or use a consistent hanging
indent style) and apply the same alignment fix to the other similar blocks
referenced around the symbols R1_END and the parameter sets to eliminate the
indentation warnings at the three reported locations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 234-244: The finish() implementation in NemotronV3 clears
_accumulated_reasoning when the parser is in_reasoning and no closing tag was
found, but only returns the remaining _buffer as reasoning_content, losing
accumulated content; modify finish() so that when the early-return branch runs
(not _force_nonempty_content and in_reasoning and not _found_closing_tag) it
combines _accumulated_reasoning with _buffer (e.g., concatenate non-empty pieces
or prefer accumulated if present) into the returned
ReasoningParserResult(reasoning_content=...) before clearing state, ensuring
behavior matches DeepSeekR1Parser.finish(); keep uses of _buffer,
_accumulated_reasoning, in_reasoning and _found_closing_tag and still
call/return _maybe_swap_content(super().finish()) in the other path.

---

Nitpick comments:
In `@tests/unittest/llmapi/test_reasoning_parser.py`:
- Around line 208-225: The parametrize block for the test
(pytest.mark.parametrize with parameter names "delta_texts", "finish_content",
"finish_reasoning", "reasoning_completed", "chat_template_kwargs") has
inconsistent hanging indent causing Flake8 warnings; reformat the multi-line
tuple and the list of parameter tuples so continuation lines align under the
first opening parenthesis (or use a consistent hanging indent style) and apply
the same alignment fix to the other similar blocks referenced around the symbols
R1_END and the parameter sets to eliminate the indentation warnings at the three
reported locations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c0a48ce-7a10-49e5-a90d-b2c12692695a

📥 Commits

Reviewing files that changed from the base of the PR and between 0507609 and 18718a6.

📒 Files selected for processing (4)
  • tensorrt_llm/llmapi/reasoning_parser.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/serve/responses_utils.py
  • tests/unittest/llmapi/test_reasoning_parser.py

Comment thread tensorrt_llm/llmapi/reasoning_parser.py Outdated
Comment thread tensorrt_llm/llmapi/reasoning_parser.py Outdated
Comment thread tests/unittest/llmapi/test_reasoning_parser.py Outdated
Comment thread tensorrt_llm/llmapi/reasoning_parser.py Outdated
Comment thread tensorrt_llm/llmapi/reasoning_parser.py Outdated
@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38887 [ run ] triggered by Bot. Commit: 66dab4b Link to invocation

@tijyojwad tijyojwad requested a review from 2ez4bz March 13, 2026 18:36
Comment thread tensorrt_llm/llmapi/reasoning_parser.py Outdated
@tijyojwad tijyojwad force-pushed the jdaw/add-streaming-support-think-fix branch from 1d6fc47 to 78d36d5 Compare March 13, 2026 19:29
@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38895 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot help

@github-actions
Copy link
Copy Markdown

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38926 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38926 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38895 [ run ] completed with state FAILURE. Commit: 78d36d5
/LLM/main/L0_MergeRequest_PR pipeline #30203 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot run --reuse-test 30203

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38928 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38928 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@tijyojwad
Copy link
Copy Markdown
Collaborator Author

/bot run --reuse-test 30203

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38952 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

@tijyojwad tijyojwad enabled auto-merge (squash) March 14, 2026 20:25
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38952 [ run ] completed with state SUCCESS. Commit: 78d36d5
/LLM/main/L0_MergeRequest_PR pipeline #30234 completed with status: 'SUCCESS'

CI Report

Link to invocation

@tijyojwad tijyojwad merged commit 8cdcce9 into NVIDIA:main Mar 14, 2026
5 checks passed
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants