[None][chore] Alltoall benchmark script refine (second time). by bobboli · Pull Request #12192 · NVIDIA/TensorRT-LLM

bobboli · 2026-03-13T07:31:58Z

Further refine AlltoAll benchmark script.

Correctly handle envvar in both spawn mode and MPI mode.
Use the range between first kernel start and last kernel end to estimate OP duration, instead of CUDA event which may have overhead.
Refactor the relationship between profile and other args.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

New Features
- Added CLI flags for explicit benchmark configuration (--hidden_size, --top_k, --num_experts, --quant) to override or supplement profile-based settings.
Improvements
- Enhanced timing measurements with per-iteration kernel tracking for improved accuracy in dispatch and combine operations.
- Updated CUPTI fallback messaging to clarify timing method when detailed kernel breakdown is unavailable.
Changes
- Modified profile configuration return format and validation behavior.
- Improved environment variable propagation for distributed benchmark execution.

bobboli · 2026-03-13T07:32:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T07:38:18Z

PR_Github #38847 [ run ] triggered by Bot. Commit: c9fccd5 Link to invocation

coderabbitai · 2026-03-13T07:42:49Z

📝 Walkthrough

Walkthrough

The changes refactor the benchmarking script's profile resolution to return four values instead of five, add explicit CLI overrides for profile parameters, integrate per-iteration kernel timing spans from CUPTI analysis, relax CUPTI initialization gating, and propagate worker environment variables across MPI launch modes.

Changes

Cohort / File(s)	Summary
Profile and Configuration Handling `tests/microbenchmarks/bench_moe_comm.py`	Modified `_resolve_profile_args()` signature to return `(hidden_size, top_k, num_experts_total, quant_algo)` instead of including `local_num_tokens`. Removed early validity check in `_iter_local_batch_sizes()`. Replaced `NO_QUANT` default with explicit `quant_algo` selection. Profile usage now enforces explicit CLI flag supply if no profile provided.
Timing Measurement and CUPTI Integration `tests/microbenchmarks/bench_moe_comm.py`	Updated `_build_cuda_graph_kernel_stats_cupti()` signature from `(cupti_kernels, cupti_events, warmup, iters)` to `(cupti_kernels, cupti_events, iters)`. Integrated per-iteration kernel-span timing (dispatch_iter_span, combine_iter_span) to compute refined `dispatch_times_us` and `combine_times_us`. Modified `_record_external()` to use per-iteration kernel-span times with fallback to event timings. Updated CUDA-graph timing documentation to reflect L2 cache flushing behavior within iteration loops.
CLI Interface Expansion `tests/microbenchmarks/bench_moe_comm.py`	Added new CLI flags `--hidden_size`, `--top_k`, `--num_experts`, and `--quant` in `parse_args()`. Updated `--profile` help text to document override behavior. Flags enable explicit parameter supply when no profile is provided.
Worker Environment and MPI Integration `tests/microbenchmarks/bench_moe_comm.py`	Introduced shared `_WORKER_ENV` dictionary containing `TRTLLM_ENABLE_PDL`. Relaxed CUPTI initialization gating in `_run_benchmark_worker_under_current_mpi()` to initialize CUPTI whenever CUDA graph is enabled (not just on kernel_breakdown). Ensured `_WORKER_ENV` propagation to both spawn-mode and external MPI launch paths. Updated user-facing messaging for CUPTI unavailability to specify fallback to CUDA event elapsed_time.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description covers the three main objectives (envvar handling, kernel timing, profile refactoring) but lacks details on testing, impact assessment, and checklist completion.	Complete the description template sections: add Test Coverage details, mark PR Checklist items as reviewed, and expand Description section with rationale for changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main changes: refining the AlltoAll benchmark script with specific mention of it being the second iteration.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can get early access to new features in CodeRabbit.

Enable the early_access setting to enable early access features such as new models, tools, and more.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/microbenchmarks/bench_moe_comm.py`:
- Around line 1005-1008: Replace the boolean-or fallbacks with explicit None
checks and validation: for hidden_size, top_k, and num_experts_total use
int(args.hidden_size) if args.hidden_size is not None else int(prof.hidden_size)
(and similarly for top_k and num_experts_total), then immediately validate each
is a positive integer and raise a clear ValueError if <= 0; for quant_algo keep
the explicit None check (quant_algo = args.quant if args.quant is not None else
prof.quant_algo) rather than using "or" so empty/zero values aren't silently
discarded. Ensure you reference the existing symbols hidden_size, top_k,
num_experts_total, quant_algo, args, and prof when making the change.
- Around line 871-875: The --quant argparse entry is failing because type=lambda
s: QuantAlgo[str(s).upper()] returns a QuantAlgo member while choices=[q.name
for q in QuantAlgo] contains uppercase names, causing mismatch; fix by making
both type and choices use the enum values (lowercase) consistently: replace the
type lambda with something like lambda s: QuantAlgo(s.lower()) (or QuantAlgo(s)
if inputs are already canonical) and set choices to [q.value for q in QuantAlgo]
so argparse validates against the same string form; adjust the "--quant" option
where QuantAlgo and the current lambda are referenced.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ee64098-1feb-496c-9391-d1ccd730ff04

📥 Commits

Reviewing files that changed from the base of the PR and between 2727092 and c9fccd5.

📒 Files selected for processing (1)

tests/microbenchmarks/bench_moe_comm.py

tensorrt-cicd · 2026-03-13T11:19:16Z

PR_Github #38847 [ run ] completed with state SUCCESS. Commit: c9fccd5
/LLM/main/L0_MergeRequest_PR pipeline #30157 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bobboli · 2026-03-13T12:10:50Z

/bot run --disable-fail-fast --reuse-test

tensorrt-cicd · 2026-03-13T12:16:48Z

PR_Github #38869 [ run ] triggered by Bot. Commit: c9fccd5 Link to invocation

tensorrt-cicd · 2026-03-13T13:13:15Z

PR_Github #38869 [ run ] completed with state SUCCESS. Commit: c9fccd5
/LLM/main/L0_MergeRequest_PR pipeline #30179 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Bo Li <[email protected]>

…ate OP duration, instead of CUDA event which may have overhead. Signed-off-by: Bo Li <[email protected]>

Signed-off-by: Bo Li <[email protected]>

bobboli · 2026-03-13T16:52:56Z

/bot run --disable-fail-fast

bobboli · 2026-03-15T16:28:48Z

/bot run --disable-fail-fast

Signed-off-by: Bo Li <[email protected]>

bobboli · 2026-03-15T16:32:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-15T16:35:51Z

PR_Github #38993 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

tensorrt-cicd · 2026-03-15T16:37:49Z

PR_Github #38994 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

tensorrt-cicd · 2026-03-15T20:17:45Z

PR_Github #38994 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30274 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bobboli · 2026-03-16T03:48:57Z

/bot run --disable-fail-fast --reuse-test

tensorrt-cicd · 2026-03-16T03:55:20Z

PR_Github #39025 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

tensorrt-cicd · 2026-03-16T05:05:21Z

PR_Github #39025 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30301 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bobboli · 2026-03-16T07:13:44Z

/bot run --disable-fail-fast --reuse-test

tensorrt-cicd · 2026-03-16T07:19:45Z

PR_Github #39046 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

tensorrt-cicd · 2026-03-16T07:27:07Z

PR_Github #39046 [ run ] completed with state FAILURE. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30319 completed with status: 'ABORTED'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bobboli · 2026-03-16T08:22:11Z

/bot run --disable-fail-fast --reuse-test

tensorrt-cicd · 2026-03-16T08:27:56Z

PR_Github #39061 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

tensorrt-cicd · 2026-03-16T09:07:04Z

PR_Github #39061 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30329 completed with status: 'SUCCESS'

CI Report

Link to invocation

…#12192) Signed-off-by: Bo Li <[email protected]>

bobboli requested a review from xxi-nv March 13, 2026 07:32

github-actions Bot assigned bobboli Mar 13, 2026

bobboli enabled auto-merge (squash) March 13, 2026 07:32

coderabbitai Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread tests/microbenchmarks/bench_moe_comm.py

Comment thread tests/microbenchmarks/bench_moe_comm.py

bobboli added 3 commits March 13, 2026 09:51

Refactor profile.

8772625

Signed-off-by: Bo Li <[email protected]>

Use the range between first kernel start and last kernel end to estim…

eecee38

…ate OP duration, instead of CUDA event which may have overhead. Signed-off-by: Bo Li <[email protected]>

Correctly handle envvar in both spawn mode and MPI mode.

2bafd5d

Signed-off-by: Bo Li <[email protected]>

bobboli force-pushed the alltoall_bench_refine_2 branch from c9fccd5 to 2bafd5d Compare March 13, 2026 16:52

Fix calling _create_forced_method with use_flashinfer=False.

c5eb487

Signed-off-by: Bo Li <[email protected]>

xxi-nv approved these changes Mar 16, 2026

View reviewed changes

bobboli merged commit 87658f2 into NVIDIA:main Mar 16, 2026
4 of 5 checks passed

reasonsolo pushed a commit to reasonsolo/TensorRT-LLM that referenced this pull request Mar 17, 2026

[None][chore] Alltoall benchmark script refine (second time). (NVIDIA…

3f299d7

…#12192) Signed-off-by: Bo Li <[email protected]>

limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

[None][chore] Alltoall benchmark script refine (second time). (NVIDIA…

9e48e44

…#12192) Signed-off-by: Bo Li <[email protected]>

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[None][chore] Alltoall benchmark script refine (second time). (NVIDIA…

aa2c5ec

…#12192) Signed-off-by: Bo Li <[email protected]>

Conversation

bobboli commented Mar 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Test Coverage

PR Checklist

GitHub Bot Help

Summary by CodeRabbit

Uh oh!

bobboli commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

coderabbitai Bot commented Mar 13, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

bobboli commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

bobboli commented Mar 13, 2026

Uh oh!

bobboli commented Mar 15, 2026

Uh oh!

bobboli commented Mar 15, 2026

Uh oh!

tensorrt-cicd commented Mar 15, 2026

Uh oh!

tensorrt-cicd commented Mar 15, 2026

Uh oh!

tensorrt-cicd commented Mar 15, 2026

Uh oh!

bobboli commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

bobboli commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

bobboli commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bobboli commented Mar 13, 2026 •

edited by coderabbitai Bot

Loading