Skip to content

[https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness#12655

Merged
yufeiwu-nv merged 6 commits into
NVIDIA:mainfrom
yufeiwu-nv:fix_L40S
Apr 2, 2026
Merged

[https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness#12655
yufeiwu-nv merged 6 commits into
NVIDIA:mainfrom
yufeiwu-nv:fix_L40S

Conversation

@yufeiwu-nv
Copy link
Copy Markdown
Collaborator

@yufeiwu-nv yufeiwu-nv commented Apr 1, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed report generation failures when parent output directories don't exist.
  • New Features

    • Implemented timeout and stall detection for long-running operations with automatic process termination.
    • Improved error visibility by capturing and displaying partial output from interrupted processes.
  • Chores

    • Reorganized performance benchmark test distributions across different GPU configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yufeiwu-nv yufeiwu-nv requested review from a team as code owners April 1, 2026 09:39
@yufeiwu-nv yufeiwu-nv requested a review from FrankD412 April 1, 2026 09:39
@yufeiwu-nv yufeiwu-nv changed the title [https://nvbugs/6038228 ][test] Fix L40S test issue and improve trtllm-bench robustness [https://nvbugs/6038228 ][test] Add WA for trtllm-bench hang issue and improve its robustness Apr 1, 2026
@yufeiwu-nv yufeiwu-nv changed the title [https://nvbugs/6038228 ][test] Add WA for trtllm-bench hang issue and improve its robustness [https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness Apr 1, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

This change adds automatic directory creation before writing benchmark reports, implements threaded subprocess execution with stall detection and partial output capture in performance test utilities, and reorganizes test case assignments across GPU-specific test lists.

Changes

Cohort / File(s) Summary
Report Directory Initialization
tensorrt_llm/bench/benchmark/__init__.py
Ensures parent directory exists before writing benchmark report JSON files using mkdir(parents=True, exist_ok=True).
Subprocess Execution with Stall Detection
tests/integration/defs/perf/utils.py
Introduces threaded _run_command_with_captured_output() function that captures combined stdout/stderr incrementally, detects stalls via configurable timeouts, matches fatal error patterns, and enables process termination. Replaces direct subprocess.check_output() calls in engine build, MPI benchmarking, and serve-client commands. Updates error handling to capture and preserve partial output on failures.
Performance Test List Reorganization
tests/integration/test_lists/qa/llm_perf_core.yml, tests/integration/test_lists/qa/llm_perf_sanity.yml
Reassigns GPU-specific test cases across condition blocks: moves Mistral-Small and chunked prefill tests from general GPU lists to H100/H20/H200-and-newer-GPU lists; shifts nemotron_nano_12b_v2 test from initial condition to later GPU condition.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Performance Test
    participant Runner as _run_command_<br/>with_captured_output
    participant Thread as Output Capture<br/>Thread
    participant Proc as Subprocess
    participant Monitor as Stall Detection<br/>Monitor

    Test->>Runner: Execute command with env
    activate Runner
    Runner->>Proc: Start subprocess
    activate Proc
    
    Runner->>Thread: Launch capture thread
    activate Thread
    Runner->>Monitor: Initialize stall timers
    activate Monitor
    
    Thread->>Proc: Read stdout/stderr
    Proc-->>Thread: Output data
    Thread->>Runner: Accumulate output
    
    Monitor->>Proc: Check for stall<br/>(idle timeout)
    alt Stall Detected
        Monitor->>Monitor: Check for fatal patterns
        alt Fatal Pattern Match
            Monitor->>Monitor: Shorten timeout
        end
        Monitor->>Proc: kill()
    end
    
    Proc-->>Runner: Process exits
    deactivate Proc
    Thread-->>Runner: Finalize output
    deactivate Thread
    Monitor-->>Runner: Timeout state
    deactivate Monitor
    
    alt Non-zero Return
        Runner-->>Test: CalledProcessError<br/>(partial output in stdout)
    else Success
        Runner-->>Test: Captured output
    end
    deactivate Runner
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is blank except for the template placeholder comments; required sections like 'Description' and 'Test Coverage' are missing with only comments indicating where content should be. Fill in the 'Description' section explaining the issue and solution, and the 'Test Coverage' section listing relevant tests that validate the changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly references the NVBugs ticket and indicates the PR fixes an issue while improving trtllm-bench robustness, matching the file changes related to test configurations, subprocess handling, and directory creation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/integration/defs/perf/utils.py (1)

145-145: Type the new helper signature.

Please annotate cmd, env, and the str return type on this new helper so the new subprocess path stays type-checkable.

✍️ Suggested refactor
-def _run_command_with_captured_output(cmd, env=None):
+def _run_command_with_captured_output(
+        cmd: list[str], env: dict[str, str] | None = None) -> str:

As per coding guidelines, "Always annotate functions with return types" and "Prefer using the built-in Python types list, dict, and tuple to the legacy typing.List, typing.Dict, and typing.Tuple."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/utils.py` at line 145, Annotate the new helper
_run_command_with_captured_output by adding type hints: set cmd: list[str], env:
dict[str, str] | None (or None by default), and the return type to -> str;
update the function signature to def _run_command_with_captured_output(cmd:
list[str], env: dict[str, str] | None = None) -> str so the subprocess path
remains type-checkable and uses built-in collection types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/utils.py`:
- Around line 622-624: The CalledProcessError handling calls e.stdout.decode()
(and elsewhere e.stderr.decode()) without defensive decoding; change decoding to
use e.stdout.decode("utf-8", errors="replace") and likewise
e.stderr.decode("utf-8", errors="replace") so UnicodeDecodeError won't mask the
real failure, ensuring you still pass the decoded string into clean_myelin_time
and any stderr handling in the CalledProcessError branch.
- Around line 162-165: The subprocess launcher in utils.py spawns MPI worker
processes that survive proc.kill(); modify the subprocess.Popen call that
creates proc (the one building benchmark_cmd / mpi_cmd) to include
start_new_session=True and add import signal to the module imports, then replace
all proc.kill() calls that attempt to terminate the stalled benchmark (the ones
referencing proc at the stall/cleanup paths) with os.killpg(proc.pid,
signal.SIGKILL) so the entire process group (MPI launcher + worker ranks) is
killed; ensure you import signal (and confirm os is already imported) and update
the three places where proc is terminated accordingly.
- Around line 226-240: The except block currently catches BaseException and
converts control-flow signals into a CalledProcessError; change it to first
catch Exception to perform the cleanup, conversion and raise CalledProcessError
(referencing proc, thread, output_lines, lock, cmd,
subprocess.CalledProcessError), and then add a second except BaseException:
handler that does the same cleanup (kill proc, wait, join thread, grab partial
under lock) but does NOT convert or wrap the exception—simply re-raise the
original exc to preserve interrupts and test-harness signals.

---

Nitpick comments:
In `@tests/integration/defs/perf/utils.py`:
- Line 145: Annotate the new helper _run_command_with_captured_output by adding
type hints: set cmd: list[str], env: dict[str, str] | None (or None by default),
and the return type to -> str; update the function signature to def
_run_command_with_captured_output(cmd: list[str], env: dict[str, str] | None =
None) -> str so the subprocess path remains type-checkable and uses built-in
collection types.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5730d1b8-2575-4b1f-b63c-756d30cc925d

📥 Commits

Reviewing files that changed from the base of the PR and between 1e51f1c and 4e41cba.

📒 Files selected for processing (4)
  • tensorrt_llm/bench/benchmark/__init__.py
  • tests/integration/defs/perf/utils.py
  • tests/integration/test_lists/qa/llm_perf_core.yml
  • tests/integration/test_lists/qa/llm_perf_sanity.yml

Comment thread tests/integration/defs/perf/utils.py Outdated
Comment thread tests/integration/defs/perf/utils.py Outdated
Comment thread tests/integration/defs/perf/utils.py Outdated
@yufeiwu-nv yufeiwu-nv marked this pull request as draft April 1, 2026 10:00
@yufeiwu-nv yufeiwu-nv marked this pull request as ready for review April 1, 2026 10:17
@yufeiwu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41185 [ run ] triggered by Bot. Commit: f8532a0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41185 [ run ] completed with state SUCCESS. Commit: f8532a0
/LLM/main/L0_MergeRequest_PR pipeline #32148 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yufeiwu-nv
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "only test list modify"

1 similar comment
@yufeiwu-nv
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "only test list modify"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41317 [ skip ] triggered by Bot. Commit: 8bb12be Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41317 [ skip ] completed with state SUCCESS. Commit: 8bb12be
Skipping testing for commit 8bb12be

Link to invocation

@yufeiwu-nv yufeiwu-nv merged commit 1079b1a into NVIDIA:main Apr 2, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants