[None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel#12079
Conversation
|
/bot run |
📝 WalkthroughWalkthroughThe changes introduce a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py (1)
1074-1097: Use keyword arguments for the trailing runtime flags.Lines 1074-1097 still rely on a long positional tail. This PR already fixed one boolean misbinding here; keeping
permuted_m,topK,seq_len,raster_along_m,use_blkred, anduse_cuptikeyworded will keep the next signature edit from reintroducing it.♻️ Suggested call-site hardening
- exec_time = run( - nkl, - group_m_list, - args.ab_dtype, - args.out_dtype, - args.sf_dtype, - args.sf_vec_size, - args.final_scale_dtype, - args.a_major, - args.b_major, - args.out_major, - args.mma_tiler_mn, - args.cluster_shape_mn, - args.tolerance, - args.warmup_iterations, - args.iterations, - args.skip_ref_check, - args.use_cold_l2, - args.permuted_m, - args.topk, - args.seq_len, - args.raster_along_m, - args.use_blkred, - args.use_cupti, - ) + exec_time = run( + nkl, + group_m_list, + args.ab_dtype, + args.out_dtype, + args.sf_dtype, + args.sf_vec_size, + args.final_scale_dtype, + args.a_major, + args.b_major, + args.out_major, + args.mma_tiler_mn, + args.cluster_shape_mn, + args.tolerance, + args.warmup_iterations, + args.iterations, + args.skip_ref_check, + args.use_cold_l2, + permuted_m=args.permuted_m, + topK=args.topk, + seq_len=args.seq_len, + raster_along_m=args.raster_along_m, + use_blkred=args.use_blkred, + use_cupti=args.use_cupti, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py` around lines 1074 - 1097, The call to run(...) uses a long positional tail for runtime flags which risks misbinding; change the final arguments to keyword arguments for clarity and safety by passing permuted_m=permuted_m, topk=topk, seq_len=seq_len, raster_along_m=raster_along_m, use_blkred=use_blkred, and use_cupti=use_cupti (and any other trailing booleans) so the invocation of the run function uses explicit parameter names and won't break if the signature changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.py`:
- Around line 87-90: Initialize problem_shape_ncluster_mnl before the swizzle
branch by computing it with cute.ceil_div(...) (using the existing problem_shape
and cluster sizes) and assign that value to a local variable
problem_shape_ncluster_mnl; then in the existing swizzle branch (where
swizzle_size > 1) call cute.round_up(...) against that precomputed
problem_shape_ncluster_mnl (instead of referencing
self.problem_layout_ncluster_mnl.shape which is not yet set), and finally assign
self.problem_layout_ncluster_mnl from the finalized shape after swizzling so
later code can safely use self.problem_layout_ncluster_mnl; update references to
raster_along_m and swizzle_size as needed but keep the ceil_div-based
initialization outside the if-block.
- Around line 63-68: The hooked_PersistentTileSchedulerParams_init currently
reads self.problem_layout_ncluster_mnl.shape before that attribute is guaranteed
to be assigned (causing AttributeError when swizzle_size > 1) and also applies
process-wide CUTLASS global mutations duplicated across three kernel modules; to
fix, ensure problem_layout_ncluster_mnl is initialized before any access in
hooked_PersistentTileSchedulerParams_init (assign a default/layout
unconditionally or move the shape access after the conditional assignments so
both swizzle and non-swizzle paths set self.problem_layout_ncluster_mnl), and
consolidate the global scheduler monkey-patch (the mutations currently
duplicated in blockscaled_contiguous_grouped_gemm.py,
blockscaled_contiguous_grouped_gemm_finalize_fusion.py, and
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py) into a single
module-level patch point to avoid import-order dependent behavior (remove the
duplicate overrides and reference that single centralized patch).
---
Nitpick comments:
In
`@tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py`:
- Around line 1074-1097: The call to run(...) uses a long positional tail for
runtime flags which risks misbinding; change the final arguments to keyword
arguments for clarity and safety by passing permuted_m=permuted_m, topk=topk,
seq_len=seq_len, raster_along_m=raster_along_m, use_blkred=use_blkred, and
use_cupti=use_cupti (and any other trailing booleans) so the invocation of the
run function uses explicit parameter names and won't break if the signature
changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b966a6c1-ff50-4007-9ff5-2187fc6f2d76
📒 Files selected for processing (3)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm.pytests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm.pytests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py
|
PR_Github #38425 [ run ] triggered by Bot. Commit: |
|
PR_Github #38425 [ run ] completed with state
|
…grouped GEMM - Add tile scheduler hooks and raster_along_m parameter to Sm100BlockScaledContiguousGroupedGemmKernel with early-exit optimization for raster along N (default) mode - Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py - Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py where raster_along_m was passed as positional arg to use_blkred - Add --use_blkred CLI arg to finalize fusion test script Signed-off-by: Yuhan Li <[email protected]>
…grouped GEMM - Add tile scheduler hooks and raster_along_m parameter to Sm100BlockScaledContiguousGroupedGemmKernel with early-exit optimization for raster along N (default) mode - Add --raster_along_m CLI arg to run_blockscaled_contiguous_grouped_gemm.py - Fix bug in run_blockscaled_contiguous_grouped_gemm_finalize_fusion.py where raster_along_m was incorrectly passed as positional arg to use_blkred - Add --use_blkred CLI arg to finalize fusion test script Signed-off-by: Yuhan Li <[email protected]>
a6a01d5 to
75b1471
Compare
|
/bot run |
|
PR_Github #39352 [ run ] triggered by Bot. Commit: |
|
PR_Github #39352 [ run ] completed with state
|
|
/bot run |
|
PR_Github #39579 [ run ] triggered by Bot. Commit: |
|
PR_Github #39579 [ run ] completed with state
|
|
/bot run |
|
PR_Github #39853 [ run ] triggered by Bot. Commit: |
|
PR_Github #39853 [ run ] completed with state
|
|
/bot run |
|
PR_Github #39878 [ run ] triggered by Bot. Commit: |
|
PR_Github #39878 [ run ] completed with state |
…d contiguous backbone kernel (NVIDIA#12079) Signed-off-by: Yuhan Li <[email protected]>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.