[None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel by liyuhannnnn · Pull Request #12079 · NVIDIA/TensorRT-LLM

liyuhannnnn · 2026-03-10T09:39:26Z

- Add tile scheduler hooks and a `raster_along_m` parameter to `Sm100BlockScaledContiguousGroupedGemmKernel`, with an early-exit optimization for the default raster-along-N mode.
- Add a `--raster_along_m` CLI argument to `run_blockscaled_contiguous_grouped_gemm.py`.
- Fix a bug in `run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py` where `raster_along_m` was passed positionally and ended up bound to `use_blkred`.
- Add a `--use_blkred` CLI argument to the finalize fusion test script.
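For readers unfamiliar with the term, the following is a minimal sketch of what "raster along M" versus "raster along N" usually means for a tile scheduler: which tile dimension varies fastest as the linear tile index increases. The function names `tile_coords` and `raster_order` are hypothetical, and the sketch deliberately ignores clusters, swizzling, grouped problems, and the early-exit path mentioned above, so it should not be read as the kernel's actual scheduling code.

```python
def tile_coords(linear_idx, num_tiles_m, num_tiles_n, raster_along_m):
    """Map a linear tile index to (tile_m, tile_n) coordinates.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if raster_along_m:
        # M varies fastest: consecutive indices walk down a column of tiles.
        tile_n, tile_m = divmod(linear_idx, num_tiles_m)
    else:
        # N varies fastest: consecutive indices walk along a row of tiles.
        tile_m, tile_n = divmod(linear_idx, num_tiles_n)
    return tile_m, tile_n


def raster_order(num_tiles_m, num_tiles_n, raster_along_m):
    """Return the full visiting order of tiles for a small grid."""
    total = num_tiles_m * num_tiles_n
    return [
        tile_coords(i, num_tiles_m, num_tiles_n, raster_along_m)
        for i in range(total)
    ]


if __name__ == "__main__":
    print("along N:", raster_order(2, 3, raster_along_m=False))
    print("along M:", raster_order(2, 3, raster_along_m=True))
```

Which order performs better generally depends on which operand benefits more from being reused by tiles that run close together in time, which is why exposing the choice as a flag is useful for tuning.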

Summary by CodeRabbit

New Features
- Added raster_along_m parameter to configure block-scaled GEMM kernel scheduling options.
- Added use_blkred parameter to configure block-scaled GEMM finalization behavior.
- Extended benchmark test scripts with new command-line flags (--raster_along_m and --use_blkred) to enable kernel performance tuning and experimentation.
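As a rough illustration of how two such flags could be exposed, here is a small `argparse` sketch. The flag names come from the PR; the use of `action="store_true"`, the help strings, and the `build_parser` function are assumptions, since the page does not show how the test scripts actually declare them.

```python
import argparse


def build_parser():
    """Build a parser with the two flags named in the PR (declaration style assumed)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical stand-in for the benchmark script's CLI."
    )
    # Assumption: both flags are plain booleans that default to False.
    parser.add_argument(
        "--raster_along_m",
        action="store_true",
        help="Rasterize tiles along M instead of the default N.",
    )
    parser.add_argument(
        "--use_blkred",
        action="store_true",
        help="Enable the use_blkred option of the finalize fusion path.",
    )
    return parser


if __name__ == "__main__":
    # Parse an explicit list so the sketch does not depend on sys.argv.
    args = build_parser().parse_args(["--raster_along_m"])
    print(args.raster_along_m, args.use_blkred)
```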

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

liyuhannnnn · 2026-03-10T09:47:01Z

/bot run

coderabbitai · 2026-03-10T09:50:30Z

📝 Walkthrough

The changes introduce a new raster_along_m scheduling parameter to the Blackwell block-scaled GEMM kernel, enabling an alternative fast-divmod-based tile scheduling path. Conditional scheduler logic and grid computation are added to support this mode. Two test scripts are updated to expose this flag and a separate use_blkred flag via CLI arguments.
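The walkthrough does not say how the "fast divmod" is implemented. A common way to get one is the textbook multiply-and-shift technique for division by a fixed divisor, sketched below in plain Python. The class name `FastDivmod32` and the 32-bit dividend limit are assumptions made for the example; this is not the CuTe DSL or CUTLASS implementation.

```python
class FastDivmod32:
    """Divide 32-bit unsigned integers by a fixed positive divisor.

    Illustrative sketch: a multiplier and shift are precomputed once, so each
    later division costs one multiply, one shift, and one subtract.
    """

    def __init__(self, divisor):
        if divisor <= 0:
            raise ValueError("divisor must be positive")
        self.divisor = divisor
        # shift = ceil(log2(divisor)); zero when divisor == 1.
        self.shift = (divisor - 1).bit_length()
        # multiplier = ceil(2**(32 + shift) / divisor)
        self.multiplier = -(-(1 << (32 + self.shift)) // divisor)

    def divmod(self, n):
        """Return (n // divisor, n % divisor) for 0 <= n < 2**32."""
        if not 0 <= n < (1 << 32):
            raise ValueError("dividend must fit in 32 unsigned bits")
        quotient = (n * self.multiplier) >> (32 + self.shift)
        remainder = n - quotient * self.divisor
        return quotient, remainder


if __name__ == "__main__":
    tiles_per_row = FastDivmod32(7)
    # Split a linear tile index into (row, column) without a hardware divide.
    print(tiles_per_row.divmod(45))
```

In a persistent scheduler the divisor (for example, the number of tiles along one dimension) is fixed for the whole launch, so the precomputation is paid once on the host while every tile lookup on the device avoids an integer division.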

Changes

| Cohort / File(s) | Summary |
|---|---|
| GEMM Kernel Scheduling `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py` | Adds `raster_along_m` parameter to kernel initialization and grid computation. Introduces conditional scheduling paths: fast-divmod-based workflow when `raster_along_m=True`, with updated tile acquisition, barrier synchronization, and producer/consumer sequencing. Updates swizzle/layout handling and extends `PersistentTileSchedulerParams` to validate and configure the new mode. |
| Test Script Parameter Threading `tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm.py`, `tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py` | Adds `raster_along_m` and `use_blkred` boolean flags to test harnesses. Both flags are threaded through function signatures, kernel constructor calls, CLI argument parsing, and logging output. No algorithmic changes; parameter propagation only. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | PR description is incomplete. It lacks the required Description and Test Coverage sections from the template. | Add a Description section explaining the issue and solution, and a Test Coverage section listing relevant tests that safeguard the changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding raster along M/N support for a blockscaled contiguous backbone kernel in the CuteDSL MOE system. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1)

1074-1097: Use keyword arguments for the trailing runtime flags.

Lines 1074-1097 still rely on a long positional tail. This PR already fixed one boolean misbinding here; keeping permuted_m, topK, seq_len, raster_along_m, use_blkred, and use_cupti keyworded will keep the next signature edit from reintroducing it.

♻️ Suggested call-site hardening

-    exec_time = run(
-        nkl,
-        group_m_list,
-        args.ab_dtype,
-        args.out_dtype,
-        args.sf_dtype,
-        args.sf_vec_size,
-        args.final_scale_dtype,
-        args.a_major,
-        args.b_major,
-        args.out_major,
-        args.mma_tiler_mn,
-        args.cluster_shape_mn,
-        args.tolerance,
-        args.warmup_iterations,
-        args.iterations,
-        args.skip_ref_check,
-        args.use_cold_l2,
-        args.permuted_m,
-        args.topk,
-        args.seq_len,
-        args.raster_along_m,
-        args.use_blkred,
-        args.use_cupti,
-    )
+    exec_time = run(
+        nkl,
+        group_m_list,
+        args.ab_dtype,
+        args.out_dtype,
+        args.sf_dtype,
+        args.sf_vec_size,
+        args.final_scale_dtype,
+        args.a_major,
+        args.b_major,
+        args.out_major,
+        args.mma_tiler_mn,
+        args.cluster_shape_mn,
+        args.tolerance,
+        args.warmup_iterations,
+        args.iterations,
+        args.skip_ref_check,
+        args.use_cold_l2,
+        permuted_m=args.permuted_m,
+        topK=args.topk,
+        seq_len=args.seq_len,
+        raster_along_m=args.raster_along_m,
+        use_blkred=args.use_blkred,
+        use_cupti=args.use_cupti,
+    )
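The misbinding this suggestion guards against can be reproduced in a few lines. The sketch below uses two hypothetical, heavily shortened stand-ins for `run` (the real function takes many more parameters) to show how a positional boolean can land in the wrong slot, and how a keyword-only signature turns that mistake into an immediate error.

```python
def run_positional(problem, use_blkred=False, raster_along_m=False):
    """Hypothetical signature where trailing flags may be passed positionally."""
    return {"use_blkred": use_blkred, "raster_along_m": raster_along_m}


def run_keyword_only(problem, *, use_blkred=False, raster_along_m=False):
    """Same flags, but the bare `*` forces callers to name them."""
    return {"use_blkred": use_blkred, "raster_along_m": raster_along_m}


if __name__ == "__main__":
    # The caller means "raster along M", but the value binds to use_blkred.
    print(run_positional("problem", True))

    # With keyword-only flags the same call fails loudly instead.
    try:
        run_keyword_only("problem", True)
    except TypeError as exc:
        print("rejected:", exc)

    # Naming the flag is unambiguous and survives signature reordering.
    print(run_keyword_only("problem", raster_along_m=True))
```

Passing keywords at the call site, as the diff above does, gives the same protection without changing the function signature; making the parameters keyword-only goes one step further and enforces it for every caller.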

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py`
around lines 1074 - 1097, The call to run(...) uses a long positional tail for
runtime flags which risks misbinding; change the final arguments to keyword
arguments for clarity and safety by passing permuted_m=args.permuted_m,
topK=args.topk, seq_len=args.seq_len, raster_along_m=args.raster_along_m,
use_blkred=args.use_blkred, and use_cupti=args.use_cupti (and any other trailing
booleans) so the invocation of the run function uses explicit parameter names
and won't break if the signature changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py`:
- Around line 87-90: Initialize problem_shape_ncluster_mnl before the swizzle
branch by computing it with cute.ceil_div(...) (using the existing problem_shape
and cluster sizes) and assign that value to a local variable
problem_shape_ncluster_mnl; then in the existing swizzle branch (where
swizzle_size > 1) call cute.round_up(...) against that precomputed
problem_shape_ncluster_mnl (instead of referencing
self.problem_layout_ncluster_mnl.shape which is not yet set), and finally assign
self.problem_layout_ncluster_mnl from the finalized shape after swizzling so
later code can safely use self.problem_layout_ncluster_mnl; update references to
raster_along_m and swizzle_size as needed but keep the ceil_div-based
initialization outside the if-block.
- Around line 63-68: The hooked_PersistentTileSchedulerParams_init currently
reads self.problem_layout_ncluster_mnl.shape before that attribute is guaranteed
to be assigned (causing AttributeError when swizzle_size > 1) and also applies
process-wide CUTLASS global mutations duplicated across three kernel modules; to
fix, ensure problem_layout_ncluster_mnl is initialized before any access in
hooked_PersistentTileSchedulerParams_init (assign a default/layout
unconditionally or move the shape access after the conditional assignments so
both swizzle and non-swizzle paths set self.problem_layout_ncluster_mnl), and
consolidate the global scheduler monkey-patch (the mutations currently
duplicated in blockscaled_contiguous_grouped_gemm.py,
blockscaled_contiguous_grouped_gemm_finalize_fusion.py, and
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py) into a single
module-level patch point to avoid import-order dependent behavior (remove the
duplicate overrides and reference that single centralized patch).
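The initialization-order fix described in the first inline comment can be sketched in plain Python. Everything here is illustrative: `ceil_div` and `round_up` are local integer helpers rather than the `cute.*` functions, `cluster_grid_shape` is a hypothetical name, and the choice of which dimension gets padded for each raster mode is an assumption, not something the review comment states.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def round_up(a, b):
    """Round a up to the next multiple of b."""
    return ceil_div(a, b) * b


def cluster_grid_shape(problem_shape_ntile_mnl, cluster_shape_mn,
                       swizzle_size, raster_along_m):
    """Return the (M, N, L) shape of the cluster grid.

    The base shape is computed unconditionally, before the swizzle branch,
    so the branch never reads a value that has not been assigned yet.
    """
    tiles_m, tiles_n, tiles_l = problem_shape_ntile_mnl
    cluster_m, cluster_n = cluster_shape_mn

    ncluster_m = ceil_div(tiles_m, cluster_m)
    ncluster_n = ceil_div(tiles_n, cluster_n)

    if swizzle_size > 1:
        # Assumption for illustration: pad the dimension that is NOT the
        # fast raster dimension up to a multiple of swizzle_size.
        if raster_along_m:
            ncluster_n = round_up(ncluster_n, swizzle_size)
        else:
            ncluster_m = round_up(ncluster_m, swizzle_size)

    # The finalized shape is produced once, after any swizzle padding.
    return (ncluster_m, ncluster_n, tiles_l)


if __name__ == "__main__":
    print(cluster_grid_shape((5, 7, 1), (2, 1), swizzle_size=1, raster_along_m=False))
    print(cluster_grid_shape((5, 7, 1), (2, 1), swizzle_size=4, raster_along_m=False))
```

The point of the structure is only the ordering: compute the base shape first, adjust it inside the branch, and assign the stored layout last, so both the swizzled and non-swizzled paths leave the attribute set.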

---

Nitpick comments:
In
`@tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py`:
- Around line 1074-1097: The call to run(...) uses a long positional tail for
runtime flags which risks misbinding; change the final arguments to keyword
arguments for clarity and safety by passing permuted_m=args.permuted_m,
topK=args.topk, seq_len=args.seq_len, raster_along_m=args.raster_along_m,
use_blkred=args.use_blkred, and use_cupti=args.use_cupti (and any other trailing
booleans) so the invocation of the run function uses explicit parameter names
and won't break if the signature changes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b966a6c1-ff50-4007-9ff5-2187fc6f2d76

📥 Commits

Reviewing files that changed from the base of the PR and between 1fef88e and e51dd22.

📒 Files selected for processing (3)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm.py
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py

tensorrt-cicd · 2026-03-10T09:52:37Z

PR_Github #38425 [ run ] triggered by Bot. Commit: a6a01d5 Link to invocation

tensorrt-cicd · 2026-03-10T14:24:43Z

PR_Github #38425 [ run ] completed with state SUCCESS. Commit: a6a01d5
/LLM/main/L0_MergeRequest_PR pipeline #29786 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…grouped GEMM - Add tile scheduler hooks and raster_along_m parameter to Sm100BlockScaledContiguousGroupedGemmKernel with early-exit optimization for raster along N (default) mode - Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py - Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py where raster_along_m was passed as positional arg to use_blkred - Add --use_blkred CLI arg to finalize fusion test script Signed-off-by: Yuhan Li <[email protected]>

…grouped GEMM - Add tile scheduler hooks and raster_along_m parameter to Sm100BlockScaledContiguousGroupedGemmKernel with early-exit optimization for raster along N (default) mode - Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py - Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py where raster_along_m was incorrectly passed as positional arg to use_blkred - Add --use_blkred CLI arg to finalize fusion test script Signed-off-by: Yuhan Li <[email protected]>

liyuhannnnn · 2026-03-18T02:05:13Z

/bot run

tensorrt-cicd · 2026-03-18T02:11:53Z

PR_Github #39352 [ run ] triggered by Bot. Commit: 75b1471 Link to invocation

tensorrt-cicd · 2026-03-18T05:38:07Z

PR_Github #39352 [ run ] completed with state SUCCESS. Commit: 75b1471
/LLM/main/L0_MergeRequest_PR pipeline #30596 completed with status: 'FAILURE'

liyuhannnnn · 2026-03-19T08:54:11Z

/bot run

tensorrt-cicd · 2026-03-19T08:59:36Z

PR_Github #39579 [ run ] triggered by Bot. Commit: 75b1471 Link to invocation

tensorrt-cicd · 2026-03-19T09:55:01Z

PR_Github #39579 [ run ] completed with state SUCCESS. Commit: 75b1471
/LLM/main/L0_MergeRequest_PR pipeline #30792 completed with status: 'FAILURE'

liyuhannnnn · 2026-03-23T01:41:05Z

/bot run

tensorrt-cicd · 2026-03-23T01:47:10Z

PR_Github #39853 [ run ] triggered by Bot. Commit: a0c42e3 Link to invocation

tensorrt-cicd · 2026-03-23T04:45:41Z

PR_Github #39853 [ run ] completed with state SUCCESS. Commit: a0c42e3
/LLM/main/L0_MergeRequest_PR pipeline #31027 completed with status: 'FAILURE'

liyuhannnnn · 2026-03-23T04:47:53Z

/bot run

tensorrt-cicd · 2026-03-23T04:54:15Z

PR_Github #39878 [ run ] triggered by Bot. Commit: a0c42e3 Link to invocation

tensorrt-cicd · 2026-03-23T09:20:43Z

PR_Github #39878 [ run ] completed with state SUCCESS. Commit: a0c42e3
/LLM/main/L0_MergeRequest_PR pipeline #31048 completed with status: 'SUCCESS'

CI Report

Link to invocation

…d contiguous backbone kernel (NVIDIA#12079) Signed-off-by: Yuhan Li <[email protected]>

liyuhannnnn requested a review from a team as a code owner March 10, 2026 09:39

liyuhannnnn requested review from Naveassaf and yilin-void March 10, 2026 09:39

github-actions Bot assigned liyuhannnnn Mar 10, 2026

coderabbitai Bot reviewed Mar 10, 2026

liyuhannnnn added 2 commits March 17, 2026 00:28

liyuhannnnn force-pushed the yuhan/raster_along_m_support branch from a6a01d5 to 75b1471 Compare March 17, 2026 07:28

liyuhannnnn requested review from kaiyux, sherry-1001, syuoni and zongfeijing March 17, 2026 08:17

syuoni approved these changes Mar 17, 2026

sherry-1001 approved these changes Mar 17, 2026

Merge branch 'main' into yuhan/raster_along_m_support

a0c42e3

syuoni merged commit bdd8558 into NVIDIA:main Mar 23, 2026
5 checks passed

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[None][feat] CuteDSL MOE: Add raster along M/N support for blockscale…

a060385

…d contiguous backbone kernel (NVIDIA#12079) Signed-off-by: Yuhan Li <[email protected]>
