[https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness#12655
Conversation
Signed-off-by: yufeiwu-nv <[email protected]>
Signed-off-by: yufeiwu-nv <[email protected]>
Signed-off-by: yufeiwu-nv <[email protected]>
📝 WalkthroughWalkthroughThis change adds automatic directory creation before writing benchmark reports, implements threaded subprocess execution with stall detection and partial output capture in performance test utilities, and reorganizes test case assignments across GPU-specific test lists. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Performance Test
participant Runner as _run_command_<br/>with_captured_output
participant Thread as Output Capture<br/>Thread
participant Proc as Subprocess
participant Monitor as Stall Detection<br/>Monitor
Test->>Runner: Execute command with env
activate Runner
Runner->>Proc: Start subprocess
activate Proc
Runner->>Thread: Launch capture thread
activate Thread
Runner->>Monitor: Initialize stall timers
activate Monitor
Thread->>Proc: Read stdout/stderr
Proc-->>Thread: Output data
Thread->>Runner: Accumulate output
Monitor->>Proc: Check for stall<br/>(idle timeout)
alt Stall Detected
Monitor->>Monitor: Check for fatal patterns
alt Fatal Pattern Match
Monitor->>Monitor: Shorten timeout
end
Monitor->>Proc: kill()
end
Proc-->>Runner: Process exits
deactivate Proc
Thread-->>Runner: Finalize output
deactivate Thread
Monitor-->>Runner: Timeout state
deactivate Monitor
alt Non-zero Return
Runner-->>Test: CalledProcessError<br/>(partial output in stdout)
else Success
Runner-->>Test: Captured output
end
deactivate Runner
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/integration/defs/perf/utils.py (1)
145-145: Type the new helper signature.Please annotate
cmd,env, and thestrreturn type on this new helper so the new subprocess path stays type-checkable.✍️ Suggested refactor
-def _run_command_with_captured_output(cmd, env=None): +def _run_command_with_captured_output( + cmd: list[str], env: dict[str, str] | None = None) -> str:As per coding guidelines, "Always annotate functions with return types" and "Prefer using the built-in Python types
list,dict, andtupleto the legacytyping.List,typing.Dict, andtyping.Tuple."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/utils.py` at line 145, Annotate the new helper _run_command_with_captured_output by adding type hints: set cmd: list[str], env: dict[str, str] | None (or None by default), and the return type to -> str; update the function signature to def _run_command_with_captured_output(cmd: list[str], env: dict[str, str] | None = None) -> str so the subprocess path remains type-checkable and uses built-in collection types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/defs/perf/utils.py`:
- Around line 622-624: The CalledProcessError handling calls e.stdout.decode()
(and elsewhere e.stderr.decode()) without defensive decoding; change decoding to
use e.stdout.decode("utf-8", errors="replace") and likewise
e.stderr.decode("utf-8", errors="replace") so UnicodeDecodeError won't mask the
real failure, ensuring you still pass the decoded string into clean_myelin_time
and any stderr handling in the CalledProcessError branch.
- Around line 162-165: The subprocess launcher in utils.py spawns MPI worker
processes that survive proc.kill(); modify the subprocess.Popen call that
creates proc (the one building benchmark_cmd / mpi_cmd) to include
start_new_session=True and add import signal to the module imports, then replace
all proc.kill() calls that attempt to terminate the stalled benchmark (the ones
referencing proc at the stall/cleanup paths) with os.killpg(proc.pid,
signal.SIGKILL) so the entire process group (MPI launcher + worker ranks) is
killed; ensure you import signal (and confirm os is already imported) and update
the three places where proc is terminated accordingly.
- Around line 226-240: The except block currently catches BaseException and
converts control-flow signals into a CalledProcessError; change it to first
catch Exception to perform the cleanup, conversion and raise CalledProcessError
(referencing proc, thread, output_lines, lock, cmd,
subprocess.CalledProcessError), and then add a second except BaseException:
handler that does the same cleanup (kill proc, wait, join thread, grab partial
under lock) but does NOT convert or wrap the exception—simply re-raise the
original exc to preserve interrupts and test-harness signals.
---
Nitpick comments:
In `@tests/integration/defs/perf/utils.py`:
- Line 145: Annotate the new helper _run_command_with_captured_output by adding
type hints: set cmd: list[str], env: dict[str, str] | None (or None by default),
and the return type to -> str; update the function signature to def
_run_command_with_captured_output(cmd: list[str], env: dict[str, str] | None =
None) -> str so the subprocess path remains type-checkable and uses built-in
collection types.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5730d1b8-2575-4b1f-b63c-756d30cc925d
📒 Files selected for processing (4)
tensorrt_llm/bench/benchmark/__init__.pytests/integration/defs/perf/utils.pytests/integration/test_lists/qa/llm_perf_core.ymltests/integration/test_lists/qa/llm_perf_sanity.yml
|
/bot run |
Signed-off-by: yufeiwu-nv <[email protected]>
|
PR_Github #41185 [ run ] triggered by Bot. Commit: |
|
PR_Github #41185 [ run ] completed with state
|
|
/bot skip --comment "only test list modify" |
1 similar comment
|
/bot skip --comment "only test list modify" |
|
PR_Github #41317 [ skip ] triggered by Bot. Commit: |
|
PR_Github #41317 [ skip ] completed with state |
… improve its robustness (NVIDIA#12655) Signed-off-by: yufeiwu-nv <[email protected]>
Summary by CodeRabbit
Release Notes
Bug Fixes
New Features
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.