Skip to content

[None][chore] Alltoall benchmark script refine (second time).#12192

Merged
bobboli merged 4 commits into
NVIDIA:mainfrom
bobboli:alltoall_bench_refine_2
Mar 16, 2026
Merged

[None][chore] Alltoall benchmark script refine (second time).#12192
bobboli merged 4 commits into
NVIDIA:mainfrom
bobboli:alltoall_bench_refine_2

Conversation

@bobboli
Copy link
Copy Markdown
Collaborator

@bobboli bobboli commented Mar 13, 2026

Further refine AlltoAll benchmark script.

  • Correctly handle envvar in both spawn mode and MPI mode.
  • Use the range between first kernel start and last kernel end to estimate OP duration, instead of CUDA event which may have overhead.
  • Refactor the relationship between profile and other args.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added CLI flags for explicit benchmark configuration (--hidden_size, --top_k, --num_experts, --quant) to override or supplement profile-based settings.
  • Improvements

    • Enhanced timing measurements with per-iteration kernel tracking for improved accuracy in dispatch and combine operations.
    • Updated CUPTI fallback messaging to clarify timing method when detailed kernel breakdown is unavailable.
  • Changes

    • Modified profile configuration return format and validation behavior.
    • Improved environment variable propagation for distributed benchmark execution.

@bobboli bobboli requested a review from xxi-nv March 13, 2026 07:32
@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 13, 2026

/bot run --disable-fail-fast

@bobboli bobboli enabled auto-merge (squash) March 13, 2026 07:32
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38847 [ run ] triggered by Bot. Commit: c9fccd5 Link to invocation

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 13, 2026

📝 Walkthrough

Walkthrough

The changes refactor the benchmarking script's profile resolution to return four values instead of five, add explicit CLI overrides for profile parameters, integrate per-iteration kernel timing spans from CUPTI analysis, relax CUPTI initialization gating, and propagate worker environment variables across MPI launch modes.

Changes

Cohort / File(s) Summary
Profile and Configuration Handling
tests/microbenchmarks/bench_moe_comm.py
Modified _resolve_profile_args() signature to return (hidden_size, top_k, num_experts_total, quant_algo) instead of including local_num_tokens. Removed early validity check in _iter_local_batch_sizes(). Replaced NO_QUANT default with explicit quant_algo selection. Profile usage now enforces explicit CLI flag supply if no profile provided.
Timing Measurement and CUPTI Integration
tests/microbenchmarks/bench_moe_comm.py
Updated _build_cuda_graph_kernel_stats_cupti() signature from (cupti_kernels, cupti_events, warmup, iters) to (cupti_kernels, cupti_events, iters). Integrated per-iteration kernel-span timing (dispatch_iter_span, combine_iter_span) to compute refined dispatch_times_us and combine_times_us. Modified _record_external() to use per-iteration kernel-span times with fallback to event timings. Updated CUDA-graph timing documentation to reflect L2 cache flushing behavior within iteration loops.
CLI Interface Expansion
tests/microbenchmarks/bench_moe_comm.py
Added new CLI flags --hidden_size, --top_k, --num_experts, and --quant in parse_args(). Updated --profile help text to document override behavior. Flags enable explicit parameter supply when no profile is provided.
Worker Environment and MPI Integration
tests/microbenchmarks/bench_moe_comm.py
Introduced shared _WORKER_ENV dictionary containing TRTLLM_ENABLE_PDL. Relaxed CUPTI initialization gating in _run_benchmark_worker_under_current_mpi() to initialize CUPTI whenever CUDA graph is enabled (not just on kernel_breakdown). Ensured _WORKER_ENV propagation to both spawn-mode and external MPI launch paths. Updated user-facing messaging for CUPTI unavailability to specify fallback to CUDA event elapsed_time.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description covers the three main objectives (envvar handling, kernel timing, profile refactoring) but lacks details on testing, impact assessment, and checklist completion. Complete the description template sections: add Test Coverage details, mark PR Checklist items as reviewed, and expand Description section with rationale for changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main changes: refining the AlltoAll benchmark script with specific mention of it being the second iteration.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can get early access to new features in CodeRabbit.

Enable the early_access setting to enable early access features such as new models, tools, and more.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/microbenchmarks/bench_moe_comm.py`:
- Around line 1005-1008: Replace the boolean-or fallbacks with explicit None
checks and validation: for hidden_size, top_k, and num_experts_total use
int(args.hidden_size) if args.hidden_size is not None else int(prof.hidden_size)
(and similarly for top_k and num_experts_total), then immediately validate each
is a positive integer and raise a clear ValueError if <= 0; for quant_algo keep
the explicit None check (quant_algo = args.quant if args.quant is not None else
prof.quant_algo) rather than using "or" so empty/zero values aren't silently
discarded. Ensure you reference the existing symbols hidden_size, top_k,
num_experts_total, quant_algo, args, and prof when making the change.
- Around line 871-875: The --quant argparse entry is failing because type=lambda
s: QuantAlgo[str(s).upper()] returns a QuantAlgo member while choices=[q.name
for q in QuantAlgo] contains uppercase names, causing mismatch; fix by making
both type and choices use the enum values (lowercase) consistently: replace the
type lambda with something like lambda s: QuantAlgo(s.lower()) (or QuantAlgo(s)
if inputs are already canonical) and set choices to [q.value for q in QuantAlgo]
so argparse validates against the same string form; adjust the "--quant" option
where QuantAlgo and the current lambda are referenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ee64098-1feb-496c-9391-d1ccd730ff04

📥 Commits

Reviewing files that changed from the base of the PR and between 2727092 and c9fccd5.

📒 Files selected for processing (1)
  • tests/microbenchmarks/bench_moe_comm.py

Comment thread tests/microbenchmarks/bench_moe_comm.py
Comment thread tests/microbenchmarks/bench_moe_comm.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38847 [ run ] completed with state SUCCESS. Commit: c9fccd5
/LLM/main/L0_MergeRequest_PR pipeline #30157 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 13, 2026

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38869 [ run ] triggered by Bot. Commit: c9fccd5 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38869 [ run ] completed with state SUCCESS. Commit: c9fccd5
/LLM/main/L0_MergeRequest_PR pipeline #30179 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

bobboli added 3 commits March 13, 2026 09:51
Signed-off-by: Bo Li <[email protected]>
…ate OP duration, instead of CUDA event which may have overhead.

Signed-off-by: Bo Li <[email protected]>
@bobboli bobboli force-pushed the alltoall_bench_refine_2 branch from c9fccd5 to 2bafd5d Compare March 13, 2026 16:52
@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 13, 2026

/bot run --disable-fail-fast

1 similar comment
@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 15, 2026

/bot run --disable-fail-fast

@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 15, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38993 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38994 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38994 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30274 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 16, 2026

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39025 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39025 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30301 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 16, 2026

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39046 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39046 [ run ] completed with state FAILURE. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30319 completed with status: 'ABORTED'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli
Copy link
Copy Markdown
Collaborator Author

bobboli commented Mar 16, 2026

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39061 [ run ] triggered by Bot. Commit: c5eb487 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39061 [ run ] completed with state SUCCESS. Commit: c5eb487
/LLM/main/L0_MergeRequest_PR pipeline #30329 completed with status: 'SUCCESS'

CI Report

Link to invocation

@bobboli bobboli merged commit 87658f2 into NVIDIA:main Mar 16, 2026
4 of 5 checks passed
reasonsolo pushed a commit to reasonsolo/TensorRT-LLM that referenced this pull request Mar 17, 2026
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants