[None][perf] Skip request broadcast when world_size is 1 #13412

Merged
yechank-nvidia merged 1 commit into NVIDIA:main from yechank-nvidia:remove_broadcast on Apr 24, 2026
Conversation

@yechank-nvidia
Collaborator

@yechank-nvidia yechank-nvidia commented Apr 24, 2026

Skip the MPI broadcast in RequestBroadcaster._broadcast_requests when world_size == 1. Even with a single rank, the broadcast call still pays pickle serialization overhead, which is wasted work for single-GPU runs, especially multimodal ones.
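To give a rough feel for the serialization cost mentioned above, here is a minimal sketch that times a plain pickle round trip on a request-like payload. It does not call MPI or any TensorRT-LLM code. The payload shape, the 8 MB image size, and the helper name `pickle_round_trip_seconds` are assumptions for illustration only.

```python
import pickle
import time


def pickle_round_trip_seconds(payload, repeats=5):
    """Return the best-of-N wall time to pickle and unpickle `payload` once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        data = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
        restored = pickle.loads(data)
        best = min(best, time.perf_counter() - start)
        # Sanity check: the round trip must preserve the payload.
        assert restored == payload
    return best


# Hypothetical multimodal-style request: a short prompt plus ~8 MB of raw
# image bytes. Real request objects will look different.
request = {"prompt": "describe the image", "image": bytes(8 * 1024 * 1024)}

elapsed = pickle_round_trip_seconds(request)
print(f"pickle round trip: {elapsed * 1e3:.2f} ms for one request")
```

The measured time depends on the machine and payload, so no expected output is given. The point is only that the cost scales with payload size and buys nothing when there is no other rank to receive the data.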

The check is placed before the has_pp branch so it covers every parallelism configuration (TP / PP / CP / EP / DP).
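The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the code from request_utils.py: `FakeDist`, `RequestBroadcasterSketch`, `_broadcast_with_pp`, and the `broadcast` method are hypothetical stand-ins, and only the placement of the world_size check ahead of the has_pp branch is taken from the description.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FakeDist:
    """Hypothetical stand-in for the distributed context (not the real class)."""
    world_size: int
    has_pp: bool = False
    broadcast_calls: int = 0

    def broadcast(self, obj: Any, root: int = 0) -> Any:
        # Count calls so the fast path is observable. A real broadcast would
        # serialize `obj` and ship it to the other ranks.
        self.broadcast_calls += 1
        return obj


class RequestBroadcasterSketch:
    """Illustrative only; mirrors the shape of the change, not its details."""

    def __init__(self, dist: FakeDist):
        self.dist = dist

    def _broadcast_requests(self, requests: List[Any]) -> List[Any]:
        # Fast path: with a single rank there is nobody to broadcast to, so
        # return the requests untouched. The check sits before the has_pp
        # branch so it applies to every parallelism configuration.
        if self.dist.world_size == 1:
            return requests
        if self.dist.has_pp:
            return self._broadcast_with_pp(requests)
        return self.dist.broadcast(requests, root=0)

    def _broadcast_with_pp(self, requests: List[Any]) -> List[Any]:
        # Placeholder for the pipeline-parallel send/recv path.
        return self.dist.broadcast(requests, root=0)


single = FakeDist(world_size=1, has_pp=True)
reqs = ["req-0", "req-1"]
out = RequestBroadcasterSketch(single)._broadcast_requests(reqs)
print(out is reqs, single.broadcast_calls)
```

Returning the same list object avoids both the copy and the serialization step in the single-rank case. With world_size greater than 1 the sketch falls through to the existing branches unchanged.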

Summary by CodeRabbit

  • Bug Fixes
    • Optimized request distribution in single-process deployments to eliminate unnecessary communication overhead and improve performance.

@yechank-nvidia yechank-nvidia self-assigned this Apr 24, 2026
@yechank-nvidia yechank-nvidia requested a review from a team as a code owner April 24, 2026 07:06
@yechank-nvidia
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Adds an early return in _broadcast_requests that skips distributed communication when running in a single-process environment, bypassing the broadcast logic and its dependence on the PP topology.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Single-process optimization: tensorrt_llm/_torch/pyexecutor/request_utils.py | Adds fast path early return in _broadcast_requests when world_size is 1, bypassing PP/TP/recv/send control flow for single-process scenarios. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The PR description is concise and explains the optimization, but lacks structured sections from the template (Test Coverage, PR Checklist). | Add Test Coverage section explaining which tests validate the single-GPU scenario, and confirm PR Checklist items are addressed. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and specifically describes the main change: skipping request broadcast optimization when world_size is 1, matching the code's fast-path addition. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@tensorrt-cicd
Collaborator

PR_Github #45364 [ run ] triggered by Bot. Commit: 05a2264 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45364 [ run ] completed with state SUCCESS. Commit: 05a2264
/LLM/main/L0_MergeRequest_PR pipeline #35608 completed with status: 'SUCCESS'

@yechank-nvidia yechank-nvidia merged commit 8e2bdfc into NVIDIA:main Apr 24, 2026
10 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

3 participants