[None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel #12079

Merged
syuoni merged 3 commits into NVIDIA:main from liyuhannnnn:yuhan/raster_along_m_support
Mar 23, 2026

Conversation

@liyuhannnnn
Collaborator

@liyuhannnnn liyuhannnnn commented Mar 10, 2026

  • Add tile scheduler hooks and raster_along_m parameter to Sm100BlockScaledContiguousGroupedGemmKernel with early-exit optimization for raster along N (default) mode
  • Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py
  • Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py where raster_along_m was passed positionally and therefore bound to the use_blkred parameter
  • Add --use_blkred CLI arg to finalize fusion test script
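To make the two raster modes concrete, here is a minimal Python sketch of how a linear tile index could map to (m, n) tile coordinates under each order. The names `tile_coord` and `raster_order` and the grid sizes are hypothetical, and the sketch assumes that "raster along M" means M is the fastest-varying tile dimension. It illustrates the general idea only and is not the kernel's scheduler code.

```python
def tile_coord(linear_idx: int, tiles_m: int, tiles_n: int,
               raster_along_m: bool) -> tuple[int, int]:
    """Map a linear tile index to (tile_m, tile_n) on a tiles_m x tiles_n grid."""
    if not 0 <= linear_idx < tiles_m * tiles_n:
        raise ValueError("linear_idx out of range")
    if raster_along_m:
        # Assumption: M varies fastest, so consecutive indices walk along M.
        tile_n, tile_m = divmod(linear_idx, tiles_m)
    else:
        # N varies fastest, so consecutive indices walk along N.
        tile_m, tile_n = divmod(linear_idx, tiles_n)
    return tile_m, tile_n


def raster_order(tiles_m: int, tiles_n: int,
                 raster_along_m: bool) -> list[tuple[int, int]]:
    """List every tile coordinate in the order the indices visit them."""
    return [tile_coord(i, tiles_m, tiles_n, raster_along_m)
            for i in range(tiles_m * tiles_n)]


along_m = raster_order(2, 3, raster_along_m=True)
along_n = raster_order(2, 3, raster_along_m=False)
```

Both orders visit the same set of tiles; only the sequence differs, which changes which operand tiles neighbouring work items share.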

Summary by CodeRabbit

  • New Features
    • Added raster_along_m parameter to configure block-scaled GEMM kernel scheduling options.
    • Added use_blkred parameter to configure block-scaled GEMM finalization behavior.
    • Extended benchmark test scripts with new command-line flags (--raster_along_m and --use_blkred) to enable kernel performance tuning and experimentation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liyuhannnnn liyuhannnnn requested a review from a team as a code owner March 10, 2026 09:39
@liyuhannnnn
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

📝 Walkthrough

The changes introduce a new raster_along_m scheduling parameter to the Blackwell block-scaled GEMM kernel, enabling an alternative fast-divmod-based tile scheduling path. Conditional scheduler logic and grid computation are added to support this mode. Two test scripts are updated to expose this flag and a separate use_blkred flag via CLI arguments.
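For readers unfamiliar with the term, "fast divmod" generally means replacing division by a fixed divisor with a precomputed multiply and shift. The sketch below shows that general technique in plain Python under stated limits (operands below 2**31). The class name `FastDivmod` and its fields are illustrative and are not taken from the kernel or from the CUTLASS API.

```python
class FastDivmod:
    """Divide by a fixed divisor using one multiply and one shift.

    Valid for 0 <= n < 2**31 and 1 <= divisor < 2**31.
    """

    def __init__(self, divisor: int) -> None:
        if not 1 <= divisor < 2**31:
            raise ValueError("divisor must be in [1, 2**31)")
        self.divisor = divisor
        # (divisor - 1).bit_length() equals ceil(log2(divisor)) for divisor >= 1.
        self.shift = 31 + (divisor - 1).bit_length()
        # multiplier = ceil(2**shift / divisor), precomputed once.
        self.multiplier = ((1 << self.shift) + divisor - 1) // divisor

    def divmod(self, n: int) -> tuple[int, int]:
        if not 0 <= n < 2**31:
            raise ValueError("n must be in [0, 2**31)")
        quotient = (n * self.multiplier) >> self.shift
        return quotient, n - quotient * self.divisor


# Hypothetical use: split a linear tile index over 5 tiles along M.
fd = FastDivmod(5)
tile_n, tile_m = fd.divmod(13)
```

On a GPU the payoff is that the per-tile index decomposition avoids an integer division instruction; in Python the sketch only demonstrates that the arithmetic matches ordinary `divmod`.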

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GEMM Kernel Scheduling: tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py | Adds raster_along_m parameter to kernel initialization and grid computation. Introduces conditional scheduling paths: fast-divmod-based workflow when raster_along_m=True, with updated tile acquisition, barrier synchronization, and producer/consumer sequencing. Updates swizzle/layout handling and extends PersistentTileSchedulerParams to validate and configure the new mode. |
| Test Script Parameter Threading: tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm.py, tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py | Adds raster_along_m and use_blkred boolean flags to test harnesses. Both flags are threaded through function signatures, kernel constructor calls, CLI argument parsing, and logging output. No algorithmic changes; parameter propagation only. |
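The flag threading described above could look roughly like the following argparse sketch. It assumes both flags are plain on/off switches (`action="store_true"`); the PR conversation does not show how the real scripts declare them, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a benchmark CLI")
    # Assumption: both options are boolean switches that default to off.
    parser.add_argument("--raster_along_m", action="store_true",
                        help="Enable the raster_along_m option")
    parser.add_argument("--use_blkred", action="store_true",
                        help="Enable the use_blkred option")
    return parser


# Parse an explicit argument list so the sketch does not read sys.argv.
args = build_parser().parse_args(["--raster_along_m"])
defaults = build_parser().parse_args([])
print(f"raster_along_m={args.raster_along_m}, use_blkred={args.use_blkred}")
```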

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | PR description is incomplete. It lacks the required Description and Test Coverage sections from the template. | Add a Description section explaining the issue and solution, and a Test Coverage section listing relevant tests that safeguard the changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding raster along M/N support for a blockscaled contiguous backbone kernel in the CuteDSL MOE system. |

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1)

1074-1097: Use keyword arguments for the trailing runtime flags.

Lines 1074-1097 still rely on a long positional tail. This PR already fixed one boolean misbinding here; keeping permuted_m, topK, seq_len, raster_along_m, use_blkred, and use_cupti keyworded will keep the next signature edit from reintroducing it.

♻️ Suggested call-site hardening
-    exec_time = run(
-        nkl,
-        group_m_list,
-        args.ab_dtype,
-        args.out_dtype,
-        args.sf_dtype,
-        args.sf_vec_size,
-        args.final_scale_dtype,
-        args.a_major,
-        args.b_major,
-        args.out_major,
-        args.mma_tiler_mn,
-        args.cluster_shape_mn,
-        args.tolerance,
-        args.warmup_iterations,
-        args.iterations,
-        args.skip_ref_check,
-        args.use_cold_l2,
-        args.permuted_m,
-        args.topk,
-        args.seq_len,
-        args.raster_along_m,
-        args.use_blkred,
-        args.use_cupti,
-    )
+    exec_time = run(
+        nkl,
+        group_m_list,
+        args.ab_dtype,
+        args.out_dtype,
+        args.sf_dtype,
+        args.sf_vec_size,
+        args.final_scale_dtype,
+        args.a_major,
+        args.b_major,
+        args.out_major,
+        args.mma_tiler_mn,
+        args.cluster_shape_mn,
+        args.tolerance,
+        args.warmup_iterations,
+        args.iterations,
+        args.skip_ref_check,
+        args.use_cold_l2,
+        permuted_m=args.permuted_m,
+        topK=args.topk,
+        seq_len=args.seq_len,
+        raster_along_m=args.raster_along_m,
+        use_blkred=args.use_blkred,
+        use_cupti=args.use_cupti,
+    )
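The misbinding risk behind this suggestion can be reproduced in a few lines. The sketch below uses two hypothetical functions with a shortened signature (the real `run` takes many more parameters, and their order is not shown in this thread) to show how a positional boolean lands in the wrong slot, and how a keyword-only marker (`*`) turns that mistake into an immediate error.

```python
def run_positional(shape, use_blkred=False, raster_along_m=False):
    """Hypothetical signature with two adjacent boolean parameters."""
    return {"use_blkred": use_blkred, "raster_along_m": raster_along_m}


def run_keyword_only(shape, *, use_blkred=False, raster_along_m=False):
    """Same parameters, but the flags can only be passed by name."""
    return {"use_blkred": use_blkred, "raster_along_m": raster_along_m}


shape = (128, 128, 1)
raster_along_m = True

# The caller means raster_along_m, but the value binds to use_blkred.
misbound = run_positional(shape, raster_along_m)

# Passing by keyword binds the value to the intended parameter.
fixed = run_keyword_only(shape, raster_along_m=raster_along_m)

# With keyword-only flags, the positional mistake fails loudly.
try:
    run_keyword_only(shape, raster_along_m)
    rejected = False
except TypeError:
    rejected = True
```

Keyword arguments at the call site, as suggested above, fix the caller; a keyword-only signature additionally protects every future caller.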
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py`
around lines 1074 - 1097, The call to run(...) uses a long positional tail for
runtime flags, which risks misbinding; pass the final arguments by keyword,
i.e. permuted_m=args.permuted_m, topK=args.topk, seq_len=args.seq_len,
raster_along_m=args.raster_along_m, use_blkred=args.use_blkred, and
use_cupti=args.use_cupti (and any other trailing booleans), so the call names
each parameter explicitly and won't break if the signature changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py`:
- Around line 87-90: Initialize problem_shape_ncluster_mnl before the swizzle
branch: compute it with cute.ceil_div(...) (using the existing problem_shape
and cluster sizes) and assign it to a local variable
problem_shape_ncluster_mnl. In the existing swizzle branch (where
swizzle_size > 1), call cute.round_up(...) on that precomputed
problem_shape_ncluster_mnl instead of referencing
self.problem_layout_ncluster_mnl.shape, which is not yet set. Finally, assign
self.problem_layout_ncluster_mnl from the finalized shape after swizzling so
later code can safely use it. Update references to raster_along_m and
swizzle_size as needed, but keep the ceil_div-based initialization outside the
if-block.
- Around line 63-68: hooked_PersistentTileSchedulerParams_init currently reads
self.problem_layout_ncluster_mnl.shape before that attribute is guaranteed to
be assigned, causing an AttributeError when swizzle_size > 1. It also applies
process-wide CUTLASS global mutations that are duplicated across three kernel
modules. To fix the first issue, ensure problem_layout_ncluster_mnl is
initialized before any access in hooked_PersistentTileSchedulerParams_init:
either assign a default layout unconditionally, or move the shape access after
the conditional assignments so both the swizzle and non-swizzle paths set
self.problem_layout_ncluster_mnl. To fix the second, consolidate the global
scheduler monkey-patch (currently duplicated in
blockscaled_contiguous_grouped_gemm.py,
blockscaled_contiguous_grouped_gemm_finalize_fusion.py, and
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py) into a single
module-level patch point to avoid import-order-dependent behavior: remove the
duplicate overrides and reference that single centralized patch.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b966a6c1-ff50-4007-9ff5-2187fc6f2d76

📥 Commits

Reviewing files that changed from the base of the PR and between 1fef88e and e51dd22.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py

@tensorrt-cicd
Collaborator

PR_Github #38425 [ run ] triggered by Bot. Commit: a6a01d5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38425 [ run ] completed with state SUCCESS. Commit: a6a01d5
/LLM/main/L0_MergeRequest_PR pipeline #29786 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…grouped GEMM

- Add tile scheduler hooks and raster_along_m parameter to
  Sm100BlockScaledContiguousGroupedGemmKernel with early-exit
  optimization for raster along N (default) mode
- Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py
- Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  where raster_along_m was passed as positional arg to use_blkred
- Add --use_blkred CLI arg to finalize fusion test script

Signed-off-by: Yuhan Li <[email protected]>
…grouped GEMM

- Add tile scheduler hooks and raster_along_m parameter to
  Sm100BlockScaledContiguousGroupedGemmKernel with early-exit
  optimization for raster along N (default) mode
- Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py
- Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  where raster_along_m was incorrectly passed as positional arg to use_blkred
- Add --use_blkred CLI arg to finalize fusion test script

Signed-off-by: Yuhan Li <[email protected]>
@liyuhannnnn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39352 [ run ] triggered by Bot. Commit: 75b1471 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39352 [ run ] completed with state SUCCESS. Commit: 75b1471
/LLM/main/L0_MergeRequest_PR pipeline #30596 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liyuhannnnn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39579 [ run ] triggered by Bot. Commit: 75b1471 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39579 [ run ] completed with state SUCCESS. Commit: 75b1471
/LLM/main/L0_MergeRequest_PR pipeline #30792 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liyuhannnnn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39853 [ run ] triggered by Bot. Commit: a0c42e3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39853 [ run ] completed with state SUCCESS. Commit: a0c42e3
/LLM/main/L0_MergeRequest_PR pipeline #31027 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liyuhannnnn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39878 [ run ] triggered by Bot. Commit: a0c42e3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39878 [ run ] completed with state SUCCESS. Commit: a0c42e3
/LLM/main/L0_MergeRequest_PR pipeline #31048 completed with status: 'SUCCESS'

CI Report

Link to invocation

@syuoni syuoni merged commit bdd8558 into NVIDIA:main Mar 23, 2026
5 checks passed
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026


4 participants