
[None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe #13384

Merged
Barry-Delaney merged 6 commits into NVIDIA:main from Barry-Delaney:user/jinshik/dg_mega_moe on May 8, 2026

Conversation

@Barry-Delaney
Collaborator

@Barry-Delaney Barry-Delaney commented Apr 23, 2026

This PR enables the mega-MoE kernel from DeepGEMM and adds the related backend to ConfigurableMoE.

@xxi-nv xxi-nv self-requested a review April 27, 2026 07:17
@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 124d4f0 to b652f19 Compare April 27, 2026 07:19
@xxi-nv
Collaborator

xxi-nv commented Apr 27, 2026

Discussed with @Barry-Delaney. Since Barry has tested the functionality locally and we are in a hurry for performance, I suggest renaming the newly added backend to MEGAMOE_DEEPGEMM, because we are developing our own MEGA kernels.
This PR mainly serves DeepSeek V4 and has no side effects on other paths, so we can merge it first, and I will refactor it afterwards to better fit our architecture.
@longlee0622 for viz.

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from b652f19 to 5112801 Compare April 27, 2026 09:09
@Barry-Delaney Barry-Delaney marked this pull request as ready for review April 27, 2026 09:28
@Barry-Delaney Barry-Delaney requested review from a team as code owners April 27, 2026 09:28
@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 70c4946 to 4ce2c3c Compare April 27, 2026 09:29
@Barry-Delaney Barry-Delaney changed the title from "[None][feat] Add MegaMoEFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe" to "[None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe" Apr 27, 2026
@xxi-nv
Collaborator

xxi-nv commented Apr 27, 2026

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

Walkthrough

This PR introduces a new MegaMoE backend for fused MoE operations powered by DeepGEMM, updating the deepgemm dependency to a newer commit and integrating the backend into the configurable MoE framework.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates: 3rdparty/fetch_content.json, scripts/attribution/data/dependency_metadata.yml, scripts/attribution/data/files_to_dependency.yml | Updates deepgemm git revision from 4ff3f54... to c491439..., reflecting a new commit with modified dependency hashes. |
| Build Configuration: cpp/tensorrt_llm/deep_gemm/CMakeLists.txt | Adds Python file processing to rewrite DeepGEMM imports from `from .. import _C` to `import tensorrt_llm.deep_gemm_cpp_tllm`, aligning subpackage namespace bindings. |
| MegaMoE Backend Implementation: tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py, tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py | Introduces new MegaMoEDeepGemmFusedMoE backend with SM100 support, BF16/FP8 quantization, MXFP4 weight storage, and DeepGEMM kernel dispatch. |
| MoE Framework Integration: tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py, tensorrt_llm/_torch/modules/fused_moe/create_moe.py | Integrates MegaMoE backend into configurable MoE: adds _forward_chunk_mega_impl execution path with per-rank token slicing and quantization; extends create_moe.py to instantiate and validate MegaMoE backend with required quant mode (W4A8_MXFP4_MXFP8). |

Sequence Diagram

sequenceDiagram
    participant User as User
    participant CMoE as ConfigurableMoE
    participant Router as Router
    participant Backend as MegaMoEDeepGemmFusedMoE
    participant Quant as Quantizer
    participant DG as DeepGEMM Kernel
    
    User->>CMoE: forward(x, router_logits)
    CMoE->>CMoE: _forward_chunk_mega_impl()
    CMoE->>Backend: count tokens per rank
    CMoE->>CMoE: slice x, router_logits to real tokens
    CMoE->>Router: apply routing
    CMoE->>Router: topk casting
    CMoE->>Backend: quantize_input(x)
    Backend->>Quant: mxfp8_quantize(BF16→FP8)
    Quant-->>Backend: x_fp8, x_sf
    CMoE->>Backend: copy to DeepGEMM SymmBuffer
    CMoE->>DG: fp8_fp4_mega_moe (collective kernel)
    DG-->>CMoE: output (FP32)
    CMoE-->>User: return BF16 combined output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.26% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a new MegaMoEFusedMoE backend that wraps DeepGEMM's fp8_fp4_mega_moe kernel. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the feature overview, integration approach, phase-1 constraints, three commits, and detailed test coverage with clear rationale for deferring hot-path validation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py (3)

209-209: Mutable class attribute should use frozenset.

Using a mutable set as a class attribute can lead to unexpected behavior if modified. Since this is a constant set of supported dtypes, use frozenset instead.

♻️ Proposed fix
-    _SUPPORTED_ACTIVATION_DTYPES = {torch.bfloat16}
+    _SUPPORTED_ACTIVATION_DTYPES = frozenset({torch.bfloat16})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` at line 209, The
class-level constant _SUPPORTED_ACTIVATION_DTYPES is defined as a mutable set;
change it to an immutable frozenset to avoid accidental mutation. Locate the
_SUPPORTED_ACTIVATION_DTYPES symbol in the Mega MoE backend module and replace
its set literal with a frozenset containing the same element(s) (e.g., use
frozenset(...) around torch.bfloat16) so the attribute is immutable at class
scope.
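
The concern behind this nitpick can be illustrated with a minimal sketch in plain Python. The class names and the string stand-ins for dtypes below are hypothetical and are not taken from backend.py; the sketch only shows that a class-level set can be extended in place while a frozenset cannot.

```python
class MutableBackend:
    # Hypothetical stand-in: a plain set shared by every instance of the class.
    SUPPORTED_DTYPES = {"bfloat16"}


class FrozenBackend:
    # Same constant as a frozenset: it has no add/remove methods.
    SUPPORTED_DTYPES = frozenset({"bfloat16"})


def try_to_extend(backend_cls, dtype):
    """Attempt an in-place add; return True if the class constant accepted it."""
    try:
        backend_cls.SUPPORTED_DTYPES.add(dtype)
    except AttributeError:
        # frozenset exposes no .add, so the constant stays untouched.
        return False
    return True


mutated = try_to_extend(MutableBackend, "float16")
frozen_mutated = try_to_extend(FrozenBackend, "float16")
```

After this runs, `MutableBackend.SUPPORTED_DTYPES` silently contains the extra entry for all users of the class, while the frozenset variant rejects the change.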

182-192: Add defensive validation for hidden dimension alignment.

The reshape at line 192 assumes n % 128 == 0 (since n // 32 must be divisible by 4 for the int32 view). While can_implement enforces this at factory time, direct construction or external calls to quantize_input could hit a cryptic reshape error.

🛡️ Proposed fix
 def _quantize_bf16_to_fp8_ue8m0(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
     """Return (x_fp8, x_sf) in DG mega_moe's expected layout (packed int32)."""
     if _trtllm_mxfp8_quantize_available():
         m, n = x.shape
+        if n % 128 != 0:
+            raise ValueError(
+                f"MegaMoE quantize_input requires hidden_size % 128 == 0 for "
+                f"packed-UE8M0 int32 SF layout; got n={n}"
+            )
         # ``is_sf_swizzled_layout=False`` → flat row-major uint8 SF, one
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` around lines 182 -
192, In _quantize_bf16_to_fp8_ue8m0 validate the input's hidden dim before
reshaping: check the tensor shape (m, n) and assert that n is aligned so (n //
32) is divisible by 4 (equivalently n % 128 == 0); if not, raise a clear
ValueError describing required alignment and the offending n. Place this check
just after extracting m, n and before calling x_sf_u8.view(m, n // 32). This
prevents the cryptic reshape/view failure when quantize_input or external
callers pass misaligned tensors.
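
The arithmetic behind the `n % 128 == 0` requirement can be sketched without torch. This is an illustration under the assumptions stated in the comment above (one uint8 scale factor per 32 elements, four scale factors packed per int32); the helper names `packed_sf_shape` and `pack_row` are hypothetical and do not exist in the PR.

```python
def packed_sf_shape(m: int, n: int, block: int = 32, pack: int = 4):
    """Shape of the packed scale-factor tensor for an (m, n) activation.

    One uint8 scale per `block` elements, then `pack` uint8 scales per int32,
    so n must be a multiple of block * pack (128 with the defaults).
    """
    if n % (block * pack) != 0:
        raise ValueError(
            f"hidden size must be a multiple of {block * pack}; got n={n}"
        )
    return (m, n // block // pack)


def pack_row(scales: bytes) -> list:
    """Reinterpret a row of uint8 scales as int32 words, four bytes per word."""
    # memoryview.cast raises TypeError when len(scales) is not a multiple of 4,
    # which mirrors the reshape/view failure described in the comment.
    return list(memoryview(scales).cast("i"))
```

For example, `n = 256` gives 8 uint8 scales per row, which pack into 2 int32 words, whereas `n = 160` gives 5 scales per row, which cannot be viewed as int32.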

273-273: Function call in default argument creates shared instance.

ModelConfig() is evaluated once at function definition time, not per-call. If ModelConfig is mutable, this could lead to shared state issues. Consider using None as default and creating the instance inside the function.

♻️ Proposed fix
-        model_config: ModelConfig = ModelConfig(),
+        model_config: ModelConfig | None = None,

Then at the start of __init__:

if model_config is None:
    model_config = ModelConfig()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` at line 273, The
constructor currently uses a mutable default ModelConfig() which is instantiated
at definition time; change the parameter default to model_config:
Optional[ModelConfig] = None (or just None) in the __init__ signature and then
inside __init__ (for the class that defines this constructor) add a guard like
"if model_config is None: model_config = ModelConfig()" so each call gets a
fresh ModelConfig instance; update any type hints/imports accordingly and ensure
all references to the parameter keep the name model_config.
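
A minimal sketch of the pitfall, assuming a mutable config object. `FakeConfig`, `SharedDefault`, and `FreshDefault` are hypothetical stand-ins and say nothing about whether the real `ModelConfig` is actually mutated anywhere.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeConfig:
    # Hypothetical stand-in for ModelConfig; it only holds mutable state.
    tags: list = field(default_factory=list)


class SharedDefault:
    def __init__(self, config: FakeConfig = FakeConfig()):
        # The default FakeConfig() was built once, when the def was executed.
        self.config = config


class FreshDefault:
    def __init__(self, config: Optional[FakeConfig] = None):
        # Build a new instance per call when the caller passes nothing.
        self.config = FakeConfig() if config is None else config


a, b = SharedDefault(), SharedDefault()
a.config.tags.append("mutated")  # also visible through b.config

c, d = FreshDefault(), FreshDefault()
c.config.tags.append("mutated")  # d.config is unaffected
```

With the `None` sentinel each instance owns its config, at the cost of one extra branch in `__init__`.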
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (2)

635-644: Unused parameter use_dp_padding.

The use_dp_padding parameter is accepted but never used in the method body. If MegaMoE intentionally ignores padding (since it gets raw token counts), consider either:

  1. Documenting this in the docstring, or
  2. Removing the parameter if callers don't need to pass it
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 635 -
644, The parameter use_dp_padding on _forward_chunk_mega_impl is never used;
either remove it from the method signature and all callers (update any
invocations in this module/class) to avoid dead API surface, or keep it but
explicitly mark it as intentionally unused by adding a short docstring note and
a sentinel usage (e.g., assign to _ = use_dp_padding or rename to
_use_dp_padding) so linters/readers know MegaMoE ignores dp padding because it
uses raw token counts; choose one approach and apply consistently across the
class (update _forward_chunk_mega_impl signature, callers, and the method
docstring).

689-691: Nit: Ambiguous Unicode character in comment.

Line 691 uses × (Unicode multiplication sign) instead of x. While readable, this can cause issues with some tools/editors.

✏️ Suggested fix
-        # contract exactly (4 × buf.copy_ + kernel).
+        # contract exactly (4 x buf.copy_ + kernel).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 689 -
691, Replace the Unicode multiplication sign in the comment that references the
backend's "run_with_prequant" and DG's "run_fused" shape contract (the phrase "4
× buf.copy_ + kernel") with the ASCII letter "x" (change "4 × buf.copy_ +
kernel" to "4 x buf.copy_ + kernel") so tools/editors won't choke on the
ambiguous character; locate the comment near the mentions of run_with_prequant
and run_fused in configurable_moe.py and update the comment text accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 700-711: The zero-token placeholder creates x_sf with shape (0, 0)
which violates the quantize_input contract used by _quantize_bf16_to_fp8_ue8m0
and may break downstream validation; change the x_sf creation in the else branch
of configurable_moe (next to x_fp8, topk_idx, topk_weights) to use shape (0,
self.hidden_size // 32) with dtype torch.int32 on the same device so its shape
matches expectations used by run_with_prequant and the quantization helper.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 351-371: create_moe currently returns MegaMoEDeepGemmFusedMoE
directly, bypassing the ConfigurableMoE path (so _forward_chunk_mega_impl and
override_quant_config are skipped); change the logic so when
ENABLE_CONFIGURABLE_MOE is active the MegaMoE backend is included in the
ConfigurableMoE routing (i.e., treat MegaMoEDeepGemmFusedMoE as one of the
classes handled by ConfigurableMoE so it flows through the same construction
path that calls _forward_chunk_mega_impl and accepts override_quant_config), and
keep the direct import/return branch only as a legacy fallback executed when
configurable mode is disabled; update create_moe, the ConfigurableMoE
dispatch/tuple, and any factory mapping that selects ConfigurableMoE to ensure
override_quant_config and _forward_chunk_mega_impl are preserved for
MegaMoEDeepGemmFusedMoE.
- Around line 104-111: The call to MegaMoEDeepGemmFusedMoE.can_implement is
using a hardcoded activation dtype and the dense intermediate_size; update it to
use the actual pretrained config values from model_config.pretrained_config (use
pretrained.torch_dtype for dtype_activation, and use
pretrained.moe_intermediate_size if present otherwise
pretrained.intermediate_size for the FFN size) while keeping hidden_size and
other flags the same so the capability check matches the later backend creation.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py`:
- Around line 1-14: The file backend.py has formatting differences flagged by
CI; run the ruff formatter on this file (e.g., ruff format backend.py) to apply
the project's style rules and commit the reformatted file so the SPDX header and
surrounding code match ruff-format output.
- Around line 726-728: The code indexes all_rank_num_tokens with
self.mapping.tp_rank which is always 0 in EP-only Phase 1; replace that index
with self.mapping.moe_ep_rank so each EP rank uses its correct token count
(change the statement that sets num_tokens from
all_rank_num_tokens[self.mapping.tp_rank] to use self.mapping.moe_ep_rank), and
update the nearby comment that currently references “[tp_rank]” to reference the
correct EP rank dimension (moe_ep_rank); apply the same replacement wherever the
pattern appears in configurable_moe.py (the analogous uses around num_tokens
retrieval).
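
The last inline comment can be made concrete with a small sketch. It takes the comment's premise at face value (in an EP-only layout `tp_rank` is 0 on every rank); `FakeMapping` and `local_num_tokens` are hypothetical names, and only the two field names come from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeMapping:
    # Hypothetical stand-in for the mapping object; field names follow the comment.
    tp_rank: int
    moe_ep_rank: int


def local_num_tokens(all_rank_num_tokens, mapping, use_ep_rank):
    """Pick this rank's token count from the per-rank list."""
    index = mapping.moe_ep_rank if use_ep_rank else mapping.tp_rank
    return all_rank_num_tokens[index]


all_rank_num_tokens = [5, 0, 9, 3]
# Assumed EP-only layout from the comment: tp_rank is 0 on every rank.
ranks = [FakeMapping(tp_rank=0, moe_ep_rank=r) for r in range(4)]

by_tp = [local_num_tokens(all_rank_num_tokens, m, use_ep_rank=False) for m in ranks]
by_ep = [local_num_tokens(all_rank_num_tokens, m, use_ep_rank=True) for m in ranks]
```

Under that premise, indexing by `tp_rank` makes every rank read rank 0's count, while indexing by `moe_ep_rank` gives each rank its own count.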


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ac4c50e-d778-462c-bffa-bc5bb8fae516

📥 Commits

Reviewing files that changed from the base of the PR and between e3bddf6 and 70c4946.

📒 Files selected for processing (8)
  • 3rdparty/fetch_content.json
  • cpp/tensorrt_llm/deep_gemm/CMakeLists.txt
  • scripts/attribution/data/dependency_metadata.yml
  • scripts/attribution/data/files_to_dependency.yml
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py
  • tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py

@tensorrt-cicd
Collaborator

PR_Github #45701 [ run ] triggered by Bot. Commit: 4ce2c3c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45701 [ run ] completed with state FAILURE. Commit: 4ce2c3c
/LLM/main/L0_MergeRequest_PR pipeline #35904 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45728 [ run ] triggered by Bot. Commit: e64d85a Link to invocation

Collaborator

@juney-nvidia juney-nvidia left a comment


Approved from oss compliance perspective.

@tensorrt-cicd
Collaborator

PR_Github #45728 [ run ] completed with state ABORTED. Commit: e64d85a

Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 6d13f9b to edb7e59 Compare April 28, 2026 15:22
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45945 [ run ] triggered by Bot. Commit: edb7e59 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45945 [ run ] completed with state FAILURE. Commit: edb7e59

Link to invocation

@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45954 [ run ] triggered by Bot. Commit: edb7e59 Link to invocation

xxi-nv added a commit to xxi-nv/TensorRT-LLM that referenced this pull request Apr 29, 2026
…od path + tests

Builds on NVIDIA#13384 (Barry's MegaMoEDeepGemmFusedMoE backend) and refactors
it to share the standard ConfigurableMoE construction / weight-loading
pipeline used by CutlassFusedMoE / TRTLLMGenFusedMoE.

Refactor:
* Move weight lifecycle (DG-native MXFP4 + UE8M0 SF tensors, checkpoint
  loading, scale conversion, SymmBuffer allocation, DG weight transform)
  out of the MegaMoE backend file and into a dedicated
  ``W4A8MXFP4MXFP8MegaMoEDeepGemmMethod(FusedMoEMethodBase)`` in
  ``tensorrt_llm/_torch/modules/fused_moe/quantization.py``.
* Rename ``mega_moe/backend.py`` to ``mega_moe/mega_moe_deepgemm.py``
  and shrink it to capability checks, routing/activation quantization,
  and the fused kernel entry point.
* Wire the new method through ``ConfigurableMoE`` / ``create_moe`` so
  MegaMoEDeepGemm flows through the same construction and load_weights
  pipeline as the other backends.

Fixes:
* ``W4A8MXFP4MXFP8MegaMoEDeepGemmMethod.create_weights`` asserts
  ``hidden_size % 128 == 0`` and ``intermediate_size % 128 == 0`` up
  front (DG packs UE8M0 SF as int32 over 32-element blocks; misaligned
  configs hit a cryptic reshape downstream).
* ``w3_w1_weight`` is stored as ``[w1 | w3] = [gate | up]`` (matches
  DG's ``_interleave_l1_weights`` and TRT-LLM's gate_proj=w1,
  up_proj=w3 convention; same semantic as NVIDIA#13384 commit edb7e59
  applied in the new quant-method path).

Tests:
* Generic MoE module tests cover MegaMoEDeepGemm via ConfigurableMoE.
* Pure-PyTorch QDQ reference for MegaMoEDeepGemm in
  ``tests/unittest/_torch/modules/moe/quantize_utils.py``.
* Multi-GPU module-level coverage (TP/EP/DEP, NVLink one/two-sided)
  for MegaMoEDeepGemm, plus extended parametric coverage on the module
  side (DeepSeek-V4 / Kimi-K2 expert/hidden/intermediate combinations).
* Wire matching integration test stages for B200 / B300.

Squashes prior local development commits (Wire MegaMoEDeepGemm backend
path, fix MegaMoEDeepGemm bugs, Add MegaMoE generic MoE tests, Add
MegaMoE DeepGEMM reference, Gate MegaMoE DeepGEMM SF alignment, Add
MegaMoE module multi-GPU coverage, Use MegaMoE module reference, Extend
MegaMoE module coverage x2) into a single commit.

Signed-off-by: xxi <[email protected]>
@tensorrt-cicd
Collaborator

PR_Github #45954 [ run ] completed with state ABORTED. Commit: edb7e59

Link to invocation

@Barry-Delaney Barry-Delaney enabled auto-merge (squash) April 30, 2026 01:25
@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 660acf2 to e76ddc4 Compare May 1, 2026 10:14
mikeiovine pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 1, 2026
Update scripts/attribution/data/dependency_metadata.yml and
files_to_dependency.yml to reflect the deepgemm upgrade in
3rdparty/fetch_content.json (4ff3f54d... -> c491439e...). Aligned with
the upstream attribution refresh for the same bump (PR NVIDIA#13384) so this
branch only carries the deepgemm-related entries; no cutlass / cuda /
nccl / torch entries are introduced and no new cas/ blobs are added.

Signed-off-by: Fanrong Li <[email protected]>
mikeiovine pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 4, 2026
@xxi-nv
Collaborator

xxi-nv commented May 4, 2026

/bot run --disable-fail-fast

@xxi-nv
Collaborator

xxi-nv commented May 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46755 [ run ] triggered by Bot. Commit: e7bfb23 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46755 [ run ] completed with state SUCCESS. Commit: e7bfb23
/LLM/main/L0_MergeRequest_PR pipeline #36782 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

mikeiovine pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 5, 2026
Barry-Delaney pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 6, 2026
Barry-Delaney pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 6, 2026
Update scripts/attribution/data/dependency_metadata.yml and
files_to_dependency.yml to reflect the deepgemm upgrade in
3rdparty/fetch_content.json (4ff3f54d... -> c491439e...). Aligned with
the upstream attribution refresh for the same bump (PR NVIDIA#13384) so this
branch only carries the deepgemm-related entries; no cutlass / cuda /
nccl / torch entries are introduced and no new cas/ blobs are added.

Signed-off-by: Fanrong Li <[email protected]>
Introduces ``MegaMoEFusedMoE``, a new MoE backend wrapping DeepGEMM's
fused ``fp8_fp4_mega_moe`` kernel (dispatch + GEMM1 + SwiGLU + GEMM2 +
combine into a single launch via NVLink SymmBuffer). Accepts the same
W4A8_MXFP4_MXFP8 weight layout as ``TRTLLMGenFusedMoE`` so VANILLA /
FUSED_GATE_UP_PROJ loaders work unchanged.

Integration follows the ConfigurableMoE pattern per @xingfei:
- ``fused_moe/mega_moe/__init__.py`` / ``backend.py``: new
  ``MegaMoEFusedMoE`` that (a) creates DG-native uint8 MXFP4 + UE8M0
  weight tensors, (b) resolves the EP ProcessGroup at construction
  (no collective at forward time), (c) allocates the DG ``SymmBuffer``
  via a process-level cache on ``post_load_weights``, and (d)
  dispatches ``deep_gemm.fp8_fp4_mega_moe`` in ``forward_impl``.
- ``fused_moe/create_moe.py``: add ``MEGAMOE`` backend type;
  ``get_moe_cls`` routes W4A8_MXFP4_MXFP8 to ``MegaMoEFusedMoE`` and
  falls back to ``CutlassFusedMoE`` for every other quant (mirrors
  the TRTLLM / CUTEDSL fallback pattern). ``create_moe_backend``
  gets a ``MegaMoEFusedMoE`` constructor branch.
- ``fused_moe/configurable_moe.py``: add a fast-path guard at the
  top of ``_forward_chunk_impl`` and a dedicated
  ``_forward_chunk_mega_impl`` that forwards the ADP shape contract
  (``all_rank_num_tokens``, ``use_dp_padding``) directly to
  ``backend.forward_impl`` and skips the EPLB / quant / combine
  orchestration.

Unit tests in ``tests/unittest/_torch/modules/moe/test_mega_moe.py``
cover:
* ``can_implement`` capability matrix — accepts
  ``W4A8_MXFP4_MXFP8 + bfloat16 + SM100``, rejects every other combination
  with a descriptive reason.
* ``get_moe_cls("MEGAMOE")`` dispatch — returns ``MegaMoEFusedMoE``
  for the supported quant, falls back to ``CutlassFusedMoE`` for
  anything else (and on non-SM100 / no-DG runners so CI on non-Blackwell
  machines passes without skipping).
* ``apply_router_weight_on_input`` rejection at construction time
  (the fused kernel applies routing weights on the MoE output, not
  the input — the two paths are not equivalent under SwiGLU).
* ADP topology guard: ``use_dp and parallel_size > 1`` requires
  ``ep_size == parallel_size``.
* Weight-loader shape contract for both ``VANILLA`` and
  ``FUSED_GATE_UP_PROJ`` loading modes — verifies the expected
  MXFP4/UE8M0 tensor shapes are produced after loading.
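
The ``apply_router_weight_on_input`` rejection above rests on SwiGLU being non-linear in its input. A scalar sketch of that claim follows; the expert is reduced to three scalar weights, which are arbitrary illustrative values and not taken from any model.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def expert(x: float, w1: float, w3: float, w2: float) -> float:
    """Scalar SwiGLU expert: w2 * (silu(w1 * x) * (w3 * x))."""
    return w2 * (silu(w1 * x) * (w3 * x))


x, route_w = 1.5, 0.25
w1, w3, w2 = 0.8, -1.1, 0.6  # arbitrary illustrative weights

# Routing weight applied to the expert output (what the fused kernel does).
weight_on_output = route_w * expert(x, w1, w3, w2)
# Routing weight applied to the expert input instead.
weight_on_input = expert(route_w * x, w1, w3, w2)
```

The two results differ because the routing weight passes through both the SiLU gate and the up projection when it is applied to the input, so the two modes cannot be swapped transparently.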

Hot-path validation is left to the multi-GPU harness under
``tmp_test_scripts`` (requires 4+ GPUs + bundled DeepGEMM with
``fp8_fp4_mega_moe``); unit tests skip cleanly when those prerequisites
are missing.

Phase 1 constraints: EPLB disabled (DG's SymmBuffer dispatch is
incompatible with ``prepare_dispatch`` / NVLinkTwoSided — will revisit
in a follow-up), moe_tp_size=1 (EP-only), shapes must be divisible by
128 (packed-UE8M0 SF int32 stride).

Signed-off-by: Barry Kang <[email protected]>
Collapses the pre-kernel overhead in the MegaMoE path from ~460 us down
to ~85 us (DSV3 ep=4, 4 × GB200, uniform routing) via two complementary
changes:

1. **Hoist routing + BF16→FP8 quant out of the backend.** The backend
   used to own its own ``routing_method.apply`` + ``per_token_cast_to_fp8``
   + buffer copies + kernel launch inside ``forward_impl``. That buried
   the pre-processing behind an extra two Python frames (ConfigurableMoE
   → _forward_chunk_impl → forward_impl → backend.forward_impl) and
   recomputed what the outer pipeline already knows how to do for the
   CUTLASS / CUTEDSL backends.

   ``_forward_chunk_mega_impl`` now mirrors the standard separated-
   routing contract: it slices ``x`` / ``router_logits`` to the unpadded
   ADP count, runs ``self.routing_method.apply`` once, runs
   ``self.backend.quantize_input(x_real)`` once, then calls a new
   kernel-only backend entry
   ``run_with_prequant(x_fp8, x_sf, topk_idx, topk_weights, num_tokens,
   output_dtype)`` that just does the 4 × ``buf.copy_()`` + fused kernel
   — matching DG's own ``run_fused`` shape contract, so the GPU work
   inside the backend call is now what DG's benchmarks measure.

   Zero-token ranks still enter ``run_with_prequant`` with fabricated
   empty tensors so the SymmBuffer collective doesn't hang peers.

   The existing ``backend.forward_impl`` path stays as a stand-alone
   fallback and now also routes through ``self.quantize_input``.

2. **Use TRT-LLM's C++ MXFP8 quant kernel.** DG's Python
   ``per_token_cast_to_fp8(..., gran_k=32, use_packed_ue8m0=True)``
   decomposes into ~8 elementwise/reduction ops (empty+fill, copy, abs,
   cast, amax, clamp, div, ue8m0 round, mul, cast to fp8, pack int32).
   Wrapping it with ``torch.compile`` fuses these to ~1-2 Triton kernels
   and gets to ~60-260 us depending on seq.

   ``torch.ops.trtllm.mxfp8_quantize(x, False, alignment=32)`` is an
   existing C++ CUDA kernel in this tree (``thop/mxFp8Quantize.cpp``)
   — CUTLASS's MoE path already uses it. Byte-identical output to DG's
   helper (roundtrip-verified on random BF16: fp8 bytes and SF int32
   both ``torch.equal`` after reshape), and 5-25× faster — consistently
   ~11 us independent of seq.

   ``_quantize_bf16_to_fp8_ue8m0`` prefers the TRT-LLM op when
   ``libth_common.so`` has registered it (always the case inside
   ConfigurableMoE because ``create_moe.py`` imports CutlassFusedMoE
   at module top, which triggers the library load). Falls back to the
   ``torch.compile``'d DG helper for slim builds.
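
For readers unfamiliar with the packed-UE8M0 layout mentioned in item 2, the scale-factor side of the quantization can be illustrated in plain Python. This is a sketch only, not the DG helper or the TRT-LLM kernel: the rounding rule (round the scale up to a power of two), the ``1e-4`` floor, and the little-endian byte order within each 32-bit word are assumptions, and the function names are made up for the example.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value in FP8 E4M3
BLOCK = 32             # elements sharing one scale factor (gran_k=32)
UE8M0_BIAS = 127       # exponent-only format: byte = exponent + 127


def ue8m0_exponent_bytes(row):
    """Return one biased exponent byte per 32-element block of `row`."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of 32")
    out = []
    for start in range(0, len(row), BLOCK):
        amax = max(abs(v) for v in row[start:start + BLOCK])
        amax = max(amax, 1e-4)  # assumed floor, avoids log2(0)
        # Smallest power-of-two scale such that amax / scale <= 448.
        exponent = math.ceil(math.log2(amax / FP8_E4M3_MAX))
        out.append(exponent + UE8M0_BIAS)
    return out


def pack_ue8m0(exponent_bytes):
    """Pack four exponent bytes into each 32-bit word (lowest byte first)."""
    if len(exponent_bytes) % 4 != 0:
        raise ValueError("need a multiple of four scale factors to pack")
    return [
        int.from_bytes(bytes(exponent_bytes[i:i + 4]), "little")
        for i in range(0, len(exponent_bytes), 4)
    ]
```

Four blocks of 32 elements fill one packed word, which is where the 128-element divisibility requirement on the hidden dimension comes from.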

Perf deltas (us/iter, DSV3 E=256 k=8 H=7168 I=2048, ep=4):

  seq   initial backend   +hoist   +C++ quant   final vs CUTLASS
  ---   ---------------   ------   ----------   ----------------
    1        895            527        391           0.70×
   32        890            454        369           0.74×
  128        863            453        432           0.87×
  512        866            462        437           0.88×
 2048       1426           1007        939           1.88×

MegaMoE now beats CUTLASS by 10-30% across seq 1-512 (uniform
routing). Large-seq (≥ 1024) remains limited by DG kernel scaling
itself — in-kernel tuning, not something the wrapper can address.
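
The per-step savings implied by the table can be recomputed with a few lines of arithmetic. The numbers are copied from the table above; the helper name and dictionary keys are chosen for this example only.

```python
# seq -> (initial backend, +hoist, +C++ quant, final vs CUTLASS ratio)
PERF_US = {
    1: (895, 527, 391, 0.70),
    32: (890, 454, 369, 0.74),
    128: (863, 453, 432, 0.87),
    512: (866, 462, 437, 0.88),
    2048: (1426, 1007, 939, 1.88),
}


def breakdown(seq):
    """Split the total improvement into the two changes for one row."""
    initial, hoisted, final, ratio = PERF_US[seq]
    return {
        "hoist_saving_us": initial - hoisted,
        "quant_saving_us": hoisted - final,
        # Positive means faster than CUTLASS, negative means slower.
        "vs_cutlass_pct": round((1 - ratio) * 100),
    }


for seq in PERF_US:
    print(seq, breakdown(seq))
```

A positive ``vs_cutlass_pct`` corresponds to the "beats CUTLASS" rows, and the negative value at seq 2048 corresponds to the large-seq regime called out above.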

Unit tests (``test_mega_moe.py``, 15 cases) remain green.

Signed-off-by: Barry Kang <[email protected]>
R1: configurable_moe._forward_chunk_mega_impl — zero-token x_sf
placeholder used shape (0, 0) which violates the contract returned
by quantize_input (packed-UE8M0 int32 over 32-element blocks, 4 u8
per int32). Use (0, hidden_size // 128) int32 to match.

R2: create_moe — MegaMoEDeepGemmFusedMoE was bypassing
ConfigurableMoE because it wasn't in the supported tuple, leaving
ConfigurableMoE._forward_chunk_mega_impl (the perf hoist from
5112801) unreachable. Add it lazily so non-MegaMoE callers don't
import DeepGEMM at module load.

R3: get_moe_cls — capability check for MegaMoE used hardcoded
torch.bfloat16 and dense intermediate_size; resolve from
pretrained_config (torch_dtype + moe_intermediate_size first) so
can_implement matches the values used at construction.

R5: backend.forward_impl + ConfigurableMoE._forward_chunk_mega_impl
indexed all_rank_num_tokens with mapping.tp_rank, which is the
outer TP rank rather than the EP rank. Phase 1 asserts
ep_size == parallel_size, so use mapping.moe_ep_rank for clarity
and topology robustness.

Nits:
* drop unused use_dp_padding from _forward_chunk_mega_impl signature
  (and caller) — MegaMoE uses raw token counts, not DP padding.
* replace Unicode "×" multiplication sign with ASCII "x" in two
  comments/docstrings (configurable_moe.py + backend.py).
* _SUPPORTED_ACTIVATION_DTYPES set -> frozenset.
* _quantize_bf16_to_fp8_ue8m0 raises ValueError early when n % 128
  != 0 instead of failing at the int32 reshape.
* reshape five docstrings in mega_moe/backend.py to satisfy ruff D205
  (blank line between summary and description). ruff format and
  ruff check now pass with the project config.
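
The shape contract behind R1 and the early ``ValueError`` nit can be written down as a small helper. This is an illustrative sketch with a hypothetical function name, assuming one UE8M0 byte per 32-element block and four bytes per int32 as described in R1.

```python
SF_BLOCK = 32                        # elements covered by one scale factor
SF_PER_INT32 = 4                     # UE8M0 bytes packed per int32
ALIGNMENT = SF_BLOCK * SF_PER_INT32  # 128 elements per packed int32


def packed_sf_shape(num_tokens, hidden_size):
    """Shape of the packed scale-factor tensor for a (tokens, hidden) input."""
    if hidden_size % ALIGNMENT != 0:
        # Fail early with a clear message instead of at a later reshape.
        raise ValueError(
            f"hidden_size must be divisible by {ALIGNMENT}, got {hidden_size}")
    return (num_tokens, hidden_size // ALIGNMENT)
```

With ``hidden_size=7168`` a zero-token rank gets ``(0, 56)`` rather than ``(0, 0)``, so the placeholder has the same trailing dimension as a real quantized input.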

Signed-off-by: Barry Kang <[email protected]>
Pre-commit's yapf hook prefers wrapping the single-element tuple
extension across two lines. No semantic change.

Signed-off-by: Barry Kang <[email protected]>
DeepGEMM's ``fp8_fp4_mega_moe`` kernel interprets the first half of the
L1 weight tensor as the SwiGLU gate side and the second half as the up
side (deep_gemm/mega/__init__.py:78 ``_interleave_l1_weights``: ``gate
= t[:, :half]; up = t[:, half:]``). TRT-LLM's MoE convention --
consistent across ``modeling_gpt_oss.py:743-746`` and the
``FUSED_GATE_UP_PROJ`` loader at ``quantization.py:362-365`` -- maps
``w1 = gate_proj`` and ``w3 = up_proj`` (HF's ``gate_up_proj`` is laid
out as ``[gate | up]`` along the output dim, ``chunk(2)[0]`` -> w1,
``chunk(2)[1]`` -> w3).

The previous ``cat([w3, w1])`` order silently swapped which side the
``silu`` was applied to, computing ``silu(up_proj @ x) * (gate_proj @
x)`` instead of ``silu(gate_proj @ x) * (up_proj @ x)``.
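
A toy example shows why the concatenation order matters. The sketch below works on already-projected values rather than weights, and the function names are hypothetical; it only assumes, as stated above, that the consumer treats the first half as the gate side.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_first_half_gate(fused):
    """Apply SwiGLU, treating the first half as gate and the second as up."""
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


gate_out = [1.0, -2.0]  # stands in for gate_proj @ x (w1)
up_out = [3.0, 0.5]     # stands in for up_proj @ x (w3)

correct = swiglu_first_half_gate(gate_out + up_out)  # cat([w1, w3]) order
swapped = swiglu_first_half_gate(up_out + gate_out)  # cat([w3, w1]) order
print(correct, swapped)
```

Because ``silu`` is applied to only one operand, swapping the halves changes the result for almost every element, which is consistent with a high mismatch rate rather than a small numerical drift.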

Verified against a pure-PyTorch QDQ reference (no DG / TRT-LLM kernel
ops): with this fix, MegaMoE output matches the reference bit-exact.
Without it, the per-element mismatch rate is ~94% vs the reference.

Signed-off-by: Barry Kang <[email protected]>
@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 994b5a3 to e7d80e9 Compare May 7, 2026 09:58
@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47186 [ run ] triggered by Bot. Commit: e7d80e9 Link to invocation

@Barry-Delaney
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #47243 [ kill ] triggered by Bot. Commit: e7d80e9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47243 [ kill ] completed with state SUCCESS. Commit: e7d80e9
Successfully killed previous jobs for commit e7d80e9

Link to invocation

@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47245 [ run ] triggered by Bot. Commit: e7d80e9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47245 [ run ] completed with state SUCCESS. Commit: e7d80e9
/LLM/main/L0_MergeRequest_PR pipeline #37194 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47302 [ run ] triggered by Bot. Commit: e7d80e9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47302 [ run ] completed with state SUCCESS. Commit: e7d80e9
/LLM/main/L0_MergeRequest_PR pipeline #37244 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Barry-Delaney Barry-Delaney merged commit 6e069b6 into NVIDIA:main May 8, 2026
6 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026