[None][fix] Add streaming support to no </think> for nemotron model by tijyojwad · Pull Request #12176 · NVIDIA/TensorRT-LLM

tijyojwad · 2026-03-13T02:10:47Z

Summary by CodeRabbit

New Features
- Enhanced streaming reasoning completion to properly finalize reasoning content across multiple parser types.
Bug Fixes
- Fixed handling of incomplete reasoning segments when streaming ends without closing tags; reasoning content is now correctly emitted and marked as complete.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Joyjit Daw <[email protected]> Made-with: Cursor

Override finish() in NemotronV3ReasoningParser so that when force_nonempty_content is not set and no </think> closing tag is found, reasoning stays in reasoning_content and content remains empty. When force_nonempty_content is set, the parent behavior is preserved (reasoning moved to content). Also clears accumulated reasoning on closing tag detection to reduce memory. Signed-off-by: Joyjit Daw <[email protected]> Made-with: Cursor

coderabbitai · 2026-03-13T02:15:46Z

📝 Walkthrough

Walkthrough

The PR introduces streaming finish handling to reasoning parsers by adding finish() methods and reasoning_completed properties to parser classes. State tracking for incomplete reasoning segments is added, and the finished flag is propagated through post-processing and response utilities to finalize reasoning content when streaming ends.

Changes

Cohort / File(s)	Summary
Reasoning Parser Core `tensorrt_llm/llmapi/reasoning_parser.py`	Added `finish()` method and `reasoning_completed` property to `BaseReasoningParser`, `DeepSeekR1Parser`, and `NemotronV3ReasoningParser`. Introduced state fields `_accumulated_reasoning` and `_found_closing_tag` to track incomplete reasoning segments. Modified `parse()` and `parse_delta()` to accumulate partial content and properly handle closing tags.
Post-processing Integration `tensorrt_llm/serve/postprocess_handlers.py`, `tensorrt_llm/serve/responses_utils.py`	Extended `apply_reasoning_parser` and `_apply_reasoning_parser` signatures with `finished` parameter. Added logic to invoke `finish()` on reasoning parsers when streaming completes, merging results back into content and reasoning_content. Imported `ReasoningParserResult` for result composition.
Test Coverage `tests/unittest/llmapi/test_reasoning_parser.py`	Expanded test suite with `finish()` scenario tests for multiple parser types (deepseek-r1, qwen3, nano-v3). Added verification of finishing state (content, reasoning_content) and `reasoning_completed` flag across various delta sequences and optional chat template configurations.

Sequence Diagram(s)

sequenceDiagram
    participant Streaming as Streaming Handler
    participant PostProc as PostProcess Handler
    participant Parser as Reasoning Parser
    participant State as Parser State

    Streaming->>PostProc: apply_reasoning_parser(text, streaming=true, finished=false)
    PostProc->>Parser: parse_delta(text)
    Parser->>State: Accumulate reasoning segment
    Parser-->>PostProc: (content, reasoning_content)
    PostProc-->>Streaming: (processed_content, reasoning)

    Note over Streaming: Stream continues...

    Streaming->>PostProc: apply_reasoning_parser(text, streaming=true, finished=true)
    PostProc->>Parser: parse_delta(final_text)
    PostProc->>Parser: finish()
    Parser->>State: Flush accumulated_reasoning
    Parser->>State: Check closing tag status
    Parser-->>PostProc: ReasoningParserResult
    PostProc->>PostProc: Merge finish result
    PostProc-->>Streaming: (final_content, final_reasoning)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	Pull request description is almost entirely blank template content with only a checkbox marked; substantive content sections (Description and Test Coverage) are empty.	Author should fill in the Description section explaining the issue, solution, and reasoning; and the Test Coverage section listing relevant tests. Commit messages indicate this adds streaming support for missing tag handling in reasoning parsers.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: adding streaming support for handling the missing closing tag in the Nemotron reasoning parser model.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/unittest/llmapi/test_reasoning_parser.py (1)

208-225: Flake8 indentation warnings.

Static analysis flagged indentation issues at lines 51, 106, and 225. These are minor formatting issues but should be addressed for consistency.

Suggested fix for line 225

 `@pytest.mark.parametrize`(("delta_texts", "finish_content", "finish_reasoning",
                           "reasoning_completed", "chat_template_kwargs"), [
-                              (["a", "b"], "", "", False, None),
+                             (["a", "b"], "", "", False, None),

Apply similar fixes to lines 51 and 106 to align with the hanging indent.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/llmapi/test_reasoning_parser.py` around lines 208 - 225, The
parametrize block for the test (pytest.mark.parametrize with parameter names
"delta_texts", "finish_content", "finish_reasoning", "reasoning_completed",
"chat_template_kwargs") has inconsistent hanging indent causing Flake8 warnings;
reformat the multi-line tuple and the list of parameter tuples so continuation
lines align under the first opening parenthesis (or use a consistent hanging
indent style) and apply the same alignment fix to the other similar blocks
referenced around the symbols R1_END and the parameter sets to eliminate the
indentation warnings at the three reported locations.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 234-244: The finish() implementation in NemotronV3 clears
_accumulated_reasoning when the parser is in_reasoning and no closing tag was
found, but only returns the remaining _buffer as reasoning_content, losing
accumulated content; modify finish() so that when the early-return branch runs
(not _force_nonempty_content and in_reasoning and not _found_closing_tag) it
combines _accumulated_reasoning with _buffer (e.g., concatenate non-empty pieces
or prefer accumulated if present) into the returned
ReasoningParserResult(reasoning_content=...) before clearing state, ensuring
behavior matches DeepSeekR1Parser.finish(); keep uses of _buffer,
_accumulated_reasoning, in_reasoning and _found_closing_tag and still
call/return _maybe_swap_content(super().finish()) in the other path.

---

Nitpick comments:
In `@tests/unittest/llmapi/test_reasoning_parser.py`:
- Around line 208-225: The parametrize block for the test
(pytest.mark.parametrize with parameter names "delta_texts", "finish_content",
"finish_reasoning", "reasoning_completed", "chat_template_kwargs") has
inconsistent hanging indent causing Flake8 warnings; reformat the multi-line
tuple and the list of parameter tuples so continuation lines align under the
first opening parenthesis (or use a consistent hanging indent style) and apply
the same alignment fix to the other similar blocks referenced around the symbols
R1_END and the parameter sets to eliminate the indentation warnings at the three
reported locations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c0a48ce-7a10-49e5-a90d-b2c12692695a

📥 Commits

Reviewing files that changed from the base of the PR and between 0507609 and 18718a6.

📒 Files selected for processing (4)

tensorrt_llm/llmapi/reasoning_parser.py
tensorrt_llm/serve/postprocess_handlers.py
tensorrt_llm/serve/responses_utils.py
tests/unittest/llmapi/test_reasoning_parser.py

Signed-off-by: Joyjit Daw <[email protected]>

tijyojwad · 2026-03-13T17:57:58Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T18:05:19Z

PR_Github #38887 [ run ] triggered by Bot. Commit: 66dab4b Link to invocation

Signed-off-by: Joyjit Daw <[email protected]>

tijyojwad · 2026-03-13T19:30:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T19:37:04Z

PR_Github #38895 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

tijyojwad · 2026-03-14T01:40:13Z

/bot help

github-actions · 2026-03-14T01:40:21Z

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

tijyojwad · 2026-03-14T02:03:38Z

/bot run

tensorrt-cicd · 2026-03-14T02:12:39Z

PR_Github #38926 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

tensorrt-cicd · 2026-03-14T02:12:40Z

PR_Github #38926 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

tensorrt-cicd · 2026-03-14T02:18:46Z

PR_Github #38895 [ run ] completed with state FAILURE. Commit: 78d36d5
/LLM/main/L0_MergeRequest_PR pipeline #30203 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

tijyojwad · 2026-03-14T02:25:18Z

/bot run --reuse-test 30203

tensorrt-cicd · 2026-03-14T02:31:42Z

PR_Github #38928 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

tensorrt-cicd · 2026-03-14T02:31:42Z

PR_Github #38928 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

tijyojwad · 2026-03-14T19:38:08Z

/bot run --reuse-test 30203

tensorrt-cicd · 2026-03-14T19:44:25Z

PR_Github #38952 [ run ] triggered by Bot. Commit: 78d36d5 Link to invocation

tensorrt-cicd · 2026-03-14T21:03:33Z

PR_Github #38952 [ run ] completed with state SUCCESS. Commit: 78d36d5
/LLM/main/L0_MergeRequest_PR pipeline #30234 completed with status: 'SUCCESS'

CI Report

Link to invocation

…VIDIA#12176) Signed-off-by: Joyjit Daw <[email protected]>

tijyojwad added 2 commits March 12, 2026 18:55

add streaming support for missing think token fix in reasoning parser

51d5b61

Signed-off-by: Joyjit Daw <[email protected]> Made-with: Cursor

tijyojwad requested a review from a team as a code owner March 13, 2026 02:10

tijyojwad requested a review from syuoni March 13, 2026 02:10

github-actions Bot assigned tijyojwad Mar 13, 2026

tijyojwad requested review from 2ez4bz and Wanli-Jiang March 13, 2026 02:11

coderabbitai Bot reviewed Mar 13, 2026

View reviewed changes