[https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness by yufeiwu-nv · Pull Request #12655 · NVIDIA/TensorRT-LLM

yufeiwu-nv · 2026-04-01T09:39:33Z

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed report generation failures when parent output directories don't exist.
New Features
- Implemented timeout and stall detection for long-running operations with automatic process termination.
- Improved error visibility by capturing and displaying partial output from interrupted processes.
Chores
- Reorganized performance benchmark test distributions across different GPU configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: yufeiwu-nv <[email protected]>

coderabbitai · 2026-04-01T09:52:39Z

📝 Walkthrough

Walkthrough

This change adds automatic directory creation before writing benchmark reports, implements threaded subprocess execution with stall detection and partial output capture in performance test utilities, and reorganizes test case assignments across GPU-specific test lists.

Changes

Cohort / File(s)	Summary
Report Directory Initialization `tensorrt_llm/bench/benchmark/__init__.py`	Ensures parent directory exists before writing benchmark report JSON files using `mkdir(parents=True, exist_ok=True)`.
Subprocess Execution with Stall Detection `tests/integration/defs/perf/utils.py`	Introduces threaded `_run_command_with_captured_output()` function that captures combined stdout/stderr incrementally, detects stalls via configurable timeouts, matches fatal error patterns, and enables process termination. Replaces direct `subprocess.check_output()` calls in engine build, MPI benchmarking, and serve-client commands. Updates error handling to capture and preserve partial output on failures.
Performance Test List Reorganization `tests/integration/test_lists/qa/llm_perf_core.yml`, `tests/integration/test_lists/qa/llm_perf_sanity.yml`	Reassigns GPU-specific test cases across condition blocks: moves Mistral-Small and chunked prefill tests from general GPU lists to H100/H20/H200-and-newer-GPU lists; shifts `nemotron_nano_12b_v2` test from initial condition to later GPU condition.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Performance Test
    participant Runner as _run_command_<br/>with_captured_output
    participant Thread as Output Capture<br/>Thread
    participant Proc as Subprocess
    participant Monitor as Stall Detection<br/>Monitor

    Test->>Runner: Execute command with env
    activate Runner
    Runner->>Proc: Start subprocess
    activate Proc
    
    Runner->>Thread: Launch capture thread
    activate Thread
    Runner->>Monitor: Initialize stall timers
    activate Monitor
    
    Thread->>Proc: Read stdout/stderr
    Proc-->>Thread: Output data
    Thread->>Runner: Accumulate output
    
    Monitor->>Proc: Check for stall<br/>(idle timeout)
    alt Stall Detected
        Monitor->>Monitor: Check for fatal patterns
        alt Fatal Pattern Match
            Monitor->>Monitor: Shorten timeout
        end
        Monitor->>Proc: kill()
    end
    
    Proc-->>Runner: Process exits
    deactivate Proc
    Thread-->>Runner: Finalize output
    deactivate Thread
    Monitor-->>Runner: Timeout state
    deactivate Monitor
    
    alt Non-zero Return
        Runner-->>Test: CalledProcessError<br/>(partial output in stdout)
    else Success
        Runner-->>Test: Captured output
    end
    deactivate Runner

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is blank except for the template placeholder comments; required sections like 'Description' and 'Test Coverage' are missing with only comments indicating where content should be.	Fill in the 'Description' section explaining the issue and solution, and the 'Test Coverage' section listing relevant tests that validate the changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly references the NVBugs ticket and indicates the PR fixes an issue while improving trtllm-bench robustness, matching the file changes related to test configurations, subprocess handling, and directory creation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/integration/defs/perf/utils.py (1)
145-145: Type the new helper signature.

Please annotate cmd, env, and the str return type on this new helper so the new subprocess path stays type-checkable.
✍️ Suggested refactor
-def _run_command_with_captured_output(cmd, env=None):
+def _run_command_with_captured_output(
+        cmd: list[str], env: dict[str, str] | None = None) -> str:
As per coding guidelines, "Always annotate functions with return types" and "Prefer using the built-in Python types list, dict, and tuple to the legacy typing.List, typing.Dict, and typing.Tuple."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/utils.py` at line 145, Annotate the new helper
_run_command_with_captured_output by adding type hints: set cmd: list[str], env:
dict[str, str] | None (or None by default), and the return type to -> str;
update the function signature to def _run_command_with_captured_output(cmd:
list[str], env: dict[str, str] | None = None) -> str so the subprocess path
remains type-checkable and uses built-in collection types.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/utils.py`:
- Around line 622-624: The CalledProcessError handling calls e.stdout.decode()
(and elsewhere e.stderr.decode()) without defensive decoding; change decoding to
use e.stdout.decode("utf-8", errors="replace") and likewise
e.stderr.decode("utf-8", errors="replace") so UnicodeDecodeError won't mask the
real failure, ensuring you still pass the decoded string into clean_myelin_time
and any stderr handling in the CalledProcessError branch.
- Around line 162-165: The subprocess launcher in utils.py spawns MPI worker
processes that survive proc.kill(); modify the subprocess.Popen call that
creates proc (the one building benchmark_cmd / mpi_cmd) to include
start_new_session=True and add import signal to the module imports, then replace
all proc.kill() calls that attempt to terminate the stalled benchmark (the ones
referencing proc at the stall/cleanup paths) with os.killpg(proc.pid,
signal.SIGKILL) so the entire process group (MPI launcher + worker ranks) is
killed; ensure you import signal (and confirm os is already imported) and update
the three places where proc is terminated accordingly.
- Around line 226-240: The except block currently catches BaseException and
converts control-flow signals into a CalledProcessError; change it to first
catch Exception to perform the cleanup, conversion and raise CalledProcessError
(referencing proc, thread, output_lines, lock, cmd,
subprocess.CalledProcessError), and then add a second except BaseException:
handler that does the same cleanup (kill proc, wait, join thread, grab partial
under lock) but does NOT convert or wrap the exception—simply re-raise the
original exc to preserve interrupts and test-harness signals.

---

Nitpick comments:
In `@tests/integration/defs/perf/utils.py`:
- Line 145: Annotate the new helper _run_command_with_captured_output by adding
type hints: set cmd: list[str], env: dict[str, str] | None (or None by default),
and the return type to -> str; update the function signature to def
_run_command_with_captured_output(cmd: list[str], env: dict[str, str] | None =
None) -> str so the subprocess path remains type-checkable and uses built-in
collection types.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5730d1b8-2575-4b1f-b63c-756d30cc925d

📥 Commits

Reviewing files that changed from the base of the PR and between 1e51f1c and 4e41cba.

📒 Files selected for processing (4)

tensorrt_llm/bench/benchmark/__init__.py
tests/integration/defs/perf/utils.py
tests/integration/test_lists/qa/llm_perf_core.yml
tests/integration/test_lists/qa/llm_perf_sanity.yml

yufeiwu-nv · 2026-04-01T10:18:19Z

/bot run

Signed-off-by: yufeiwu-nv <[email protected]>

tensorrt-cicd · 2026-04-01T10:23:54Z

PR_Github #41185 [ run ] triggered by Bot. Commit: f8532a0 Link to invocation

tensorrt-cicd · 2026-04-01T17:35:50Z

PR_Github #41185 [ run ] completed with state SUCCESS. Commit: f8532a0
/LLM/main/L0_MergeRequest_PR pipeline #32148 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yufeiwu-nv · 2026-04-02T02:56:10Z

/bot skip --comment "only test list modify"

yufeiwu-nv · 2026-04-02T03:00:13Z

/bot skip --comment "only test list modify"

tensorrt-cicd · 2026-04-02T03:10:20Z

PR_Github #41317 [ skip ] triggered by Bot. Commit: 8bb12be Link to invocation

tensorrt-cicd · 2026-04-02T03:23:32Z

PR_Github #41317 [ skip ] completed with state SUCCESS. Commit: 8bb12be
Skipping testing for commit 8bb12be

Link to invocation

… improve its robustness (NVIDIA#12655) Signed-off-by: yufeiwu-nv <[email protected]>

yufeiwu-nv added 3 commits March 31, 2026 08:47

Refactor some mistral_small model out of L40S because memory limit

be9f6b5

Signed-off-by: yufeiwu-nv <[email protected]>

refact another

4e768a1

Signed-off-by: yufeiwu-nv <[email protected]>

improve robustness

4e41cba

Signed-off-by: yufeiwu-nv <[email protected]>

yufeiwu-nv requested review from a team as code owners April 1, 2026 09:39

yufeiwu-nv requested a review from FrankD412 April 1, 2026 09:39

github-actions Bot assigned yufeiwu-nv Apr 1, 2026

yufeiwu-nv changed the title ~~[https://nvbugs/6038228 ][test] Fix L40S test issue and improve trtllm-bench robustness~~ [https://nvbugs/6038228 ][test] Add WA for trtllm-bench hang issue and improve its robustness Apr 1, 2026

yufeiwu-nv changed the title ~~[https://nvbugs/6038228 ][test] Add WA for trtllm-bench hang issue and improve its robustness~~ [https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness Apr 1, 2026

coderabbitai Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread tests/integration/defs/perf/utils.py Outdated

Comment thread tests/integration/defs/perf/utils.py Outdated

Comment thread tests/integration/defs/perf/utils.py Outdated

yufeiwu-nv marked this pull request as draft April 1, 2026 10:00

yufeiwu-nv marked this pull request as ready for review April 1, 2026 10:17

yufeiwu-nv added 2 commits April 1, 2026 10:18

fix according to AI

274926b

Signed-off-by: yufeiwu-nv <[email protected]>

Merge branch 'main' into fix_L40S

f8532a0

ruodil approved these changes Apr 2, 2026

View reviewed changes

Merge branch 'main' into fix_L40S

8bb12be

FrankD412 approved these changes Apr 2, 2026

View reviewed changes

yufeiwu-nv merged commit 1079b1a into NVIDIA:main Apr 2, 2026
5 checks passed

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and…

caa4aff

… improve its robustness (NVIDIA#12655) Signed-off-by: yufeiwu-nv <[email protected]>

Conversation

yufeiwu-nv commented Apr 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Apr 1, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yufeiwu-nv commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

yufeiwu-nv commented Apr 2, 2026

Uh oh!

yufeiwu-nv commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yufeiwu-nv commented Apr 1, 2026 •

edited by coderabbitai Bot

Loading