[None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe by Barry-Delaney · Pull Request #13384 · NVIDIA/TensorRT-LLM

Barry-Delaney · 2026-04-23T14:15:13Z

This PR enables the mega-MoE kernel from DeepGEMM and adds the related backend to ConfigurableMoE.

xxi-nv · 2026-04-27T07:21:39Z

Discussed with @Barry-Delaney. Since Barry has tested the functionality locally and we are in a hurry for performance, I suggest renaming the newly added backend to MEGAMOE_DEEPGEMM, because we are developing our own MEGA kernels.
This PR mainly serves DeepSeek V4 and has no side effects on other paths, so we can merge it first, and I will refactor it afterwards to better fit our architecture.
@longlee0622 for viz.

xxi-nv · 2026-04-27T09:34:28Z

/bot run --disable-fail-fast

coderabbitai · 2026-04-27T09:35:01Z

📝 Walkthrough

This PR introduces a new MegaMoE backend for fused MoE operations powered by DeepGEMM, updating the deepgemm dependency to a newer commit and integrating the backend into the configurable MoE framework.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates `3rdparty/fetch_content.json`, `scripts/attribution/data/dependency_metadata.yml`, `scripts/attribution/data/files_to_dependency.yml` | Updates deepgemm git revision from `4ff3f54...` to `c491439...`, reflecting a new commit with modified dependency hashes. |
| Build Configuration `cpp/tensorrt_llm/deep_gemm/CMakeLists.txt` | Adds Python file processing to rewrite DeepGEMM imports from `from .. import _C` to `import tensorrt_llm.deep_gemm_cpp_tllm`, aligning subpackage namespace bindings. |
| MegaMoE Backend Implementation `tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py`, `tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` | Introduces new `MegaMoEDeepGemmFusedMoE` backend with SM100 support, BF16/FP8 quantization, MXFP4 weight storage, and DeepGEMM kernel dispatch. |
| MoE Framework Integration `tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/create_moe.py` | Integrates MegaMoE backend into configurable MoE: adds `_forward_chunk_mega_impl` execution path with per-rank token slicing and quantization; extends `create_moe.py` to instantiate and validate MegaMoE backend with required quant mode (`W4A8_MXFP4_MXFP8`). |

Sequence Diagram

sequenceDiagram
    participant User as User
    participant CMoE as ConfigurableMoE
    participant Router as Router
    participant Backend as MegaMoEDeepGemmFusedMoE
    participant Quant as Quantizer
    participant DG as DeepGEMM Kernel
    
    User->>CMoE: forward(x, router_logits)
    CMoE->>CMoE: _forward_chunk_mega_impl()
    CMoE->>Backend: count tokens per rank
    CMoE->>CMoE: slice x, router_logits to real tokens
    CMoE->>Router: apply routing
    CMoE->>Router: topk casting
    CMoE->>Backend: quantize_input(x)
    Backend->>Quant: mxfp8_quantize(BF16→FP8)
    Quant-->>Backend: x_fp8, x_sf
    CMoE->>Backend: copy to DeepGEMM SymmBuffer
    CMoE->>DG: fp8_fp4_mega_moe (collective kernel)
    DG-->>CMoE: output (FP32)
    CMoE-->>User: return BF16 combined output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.26% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a new MegaMoEFusedMoE backend that wraps DeepGEMM's fp8_fp4_mega_moe kernel. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the feature overview, integration approach, phase-1 constraints, three commits, and detailed test coverage with clear rationale for deferring hot-path validation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (5)

tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py (3)

209-209: Mutable class attribute should use frozenset.

Using a mutable set as a class attribute can lead to unexpected behavior if modified. Since this is a constant set of supported dtypes, use frozenset instead.

♻️ Proposed fix

-    _SUPPORTED_ACTIVATION_DTYPES = {torch.bfloat16}
+    _SUPPORTED_ACTIVATION_DTYPES = frozenset({torch.bfloat16})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` at line 209, The
class-level constant _SUPPORTED_ACTIVATION_DTYPES is defined as a mutable set;
change it to an immutable frozenset to avoid accidental mutation. Locate the
_SUPPORTED_ACTIVATION_DTYPES symbol in the Mega MoE backend module and replace
its set literal with a frozenset containing the same element(s) (e.g., use
frozenset(...) around torch.bfloat16) so the attribute is immutable at class
scope.
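
The difference between the two forms can be seen in a minimal sketch. The class name `Backend` and the string dtype names below are placeholders for illustration, not the real backend class or `torch` dtypes:

```python
class Backend:
    # Hypothetical stand-in for the real backend; strings replace torch dtypes.
    MUTABLE_DTYPES = {"bfloat16"}
    FROZEN_DTYPES = frozenset({"bfloat16"})


# A mutable class-level set is shared by every instance, so one stray
# mutation silently widens the "supported" list for all of them.
Backend().MUTABLE_DTYPES.add("float16")
leaked = "float16" in Backend.MUTABLE_DTYPES

# frozenset has no add(); the same mistake fails loudly instead.
try:
    Backend.FROZEN_DTYPES.add("float16")
    frozen_rejected = False
except AttributeError:
    frozen_rejected = True
```

Membership tests (`dtype in FROZEN_DTYPES`) behave the same for both types, so the change should not affect callers that only read the constant.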

182-192: Add defensive validation for hidden dimension alignment.

The reshape at line 192 assumes n % 128 == 0 (since n // 32 must be divisible by 4 for the int32 view). While can_implement enforces this at factory time, direct construction or external calls to quantize_input could hit a cryptic reshape error.

🛡️ Proposed fix

 def _quantize_bf16_to_fp8_ue8m0(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
     """Return (x_fp8, x_sf) in DG mega_moe's expected layout (packed int32)."""
     if _trtllm_mxfp8_quantize_available():
         m, n = x.shape
+        if n % 128 != 0:
+            raise ValueError(
+                f"MegaMoE quantize_input requires hidden_size % 128 == 0 for "
+                f"packed-UE8M0 int32 SF layout; got n={n}"
+            )
         # ``is_sf_swizzled_layout=False`` → flat row-major uint8 SF, one

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` around lines 182 -
192, In _quantize_bf16_to_fp8_ue8m0 validate the input's hidden dim before
reshaping: check the tensor shape (m, n) and assert that n is aligned so (n //
32) is divisible by 4 (equivalently n % 128 == 0); if not, raise a clear
ValueError describing required alignment and the offending n. Place this check
just after extracting m, n and before calling x_sf_u8.view(m, n // 32). This
prevents the cryptic reshape/view failure when quantize_input or external
callers pass misaligned tensors.
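
The `n % 128` requirement follows from the packing arithmetic: one uint8 scale per 32 elements, four scales per int32. A standalone sketch of that arithmetic, using a hypothetical helper `pack_ue8m0_row` and assuming little-endian byte order (what a raw reinterpretation of uint8 data as int32 gives on common hosts):

```python
import struct


def pack_ue8m0_row(sf_bytes: bytes, hidden_size: int) -> list[int]:
    """Pack one row of per-32-element uint8 scale factors into int32 words.

    Hypothetical helper for illustration only; little-endian order is an
    assumption, not something confirmed by the PR.
    """
    if hidden_size % 128 != 0:
        raise ValueError(
            f"hidden_size must be a multiple of 128 for packed int32 scale "
            f"factors; got {hidden_size}"
        )
    num_blocks = hidden_size // 32  # one uint8 scale per 32 elements
    if len(sf_bytes) != num_blocks:
        raise ValueError(f"expected {num_blocks} scale bytes, got {len(sf_bytes)}")
    # Four uint8 scales share one int32, so the row has hidden_size // 128 words.
    return list(struct.unpack(f"<{num_blocks // 4}i", sf_bytes))


packed = pack_ue8m0_row(bytes([127] * 8), hidden_size=256)
```

With `hidden_size=160` the helper raises immediately, which is the early, descriptive failure the review comment asks for.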

273-273: Function call in default argument creates shared instance.

ModelConfig() is evaluated once at function definition time, not per-call. If ModelConfig is mutable, this could lead to shared state issues. Consider using None as default and creating the instance inside the function.

♻️ Proposed fix

-        model_config: ModelConfig = ModelConfig(),
+        model_config: ModelConfig | None = None,

Then at the start of __init__:

if model_config is None:
    model_config = ModelConfig()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py` at line 273, The
constructor currently uses a mutable default ModelConfig() which is instantiated
at definition time; change the parameter default to model_config:
Optional[ModelConfig] = None (or just None) in the __init__ signature and then
inside __init__ (for the class that defines this constructor) add a guard like
"if model_config is None: model_config = ModelConfig()" so each call gets a
fresh ModelConfig instance; update any type hints/imports accordingly and ensure
all references to the parameter keep the name model_config.
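
The shared-instance behaviour is easy to reproduce in isolation. The `ModelConfig` below is a hypothetical stand-in with a single made-up field, not the real TensorRT-LLM class:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelConfig:
    # Hypothetical stand-in; the real ModelConfig has different fields.
    extra: dict = field(default_factory=dict)


def build_shared(model_config: ModelConfig = ModelConfig()) -> ModelConfig:
    # The default is evaluated once, when the function is defined.
    return model_config


def build_fresh(model_config: Optional[ModelConfig] = None) -> ModelConfig:
    # A new instance is created on every call that omits the argument.
    if model_config is None:
        model_config = ModelConfig()
    return model_config


build_shared().extra["patched"] = True
shared_leaks = "patched" in build_shared().extra  # state leaked across calls

build_fresh().extra["patched"] = True
fresh_leaks = "patched" in build_fresh().extra  # each call starts clean
```

Whether this matters in practice depends on whether the real `ModelConfig` is ever mutated after construction; the `None` default is the conservative choice either way.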

tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (2)

635-644: Unused parameter use_dp_padding.

The use_dp_padding parameter is accepted but never used in the method body. If MegaMoE intentionally ignores padding (since it gets raw token counts), consider either:

Documenting this in the docstring, or
Removing the parameter if callers don't need to pass it

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 635 -
644, The parameter use_dp_padding on _forward_chunk_mega_impl is never used;
either remove it from the method signature and all callers (update any
invocations in this module/class) to avoid dead API surface, or keep it but
explicitly mark it as intentionally unused by adding a short docstring note and
a sentinel usage (e.g., assign to _ = use_dp_padding or rename to
_use_dp_padding) so linters/readers know MegaMoE ignores dp padding because it
uses raw token counts; choose one approach and apply consistently across the
class (update _forward_chunk_mega_impl signature, callers, and the method
docstring).

689-691: Nit: Ambiguous Unicode character in comment.

Line 691 uses × (Unicode multiplication sign) instead of x. While readable, this can cause issues with some tools/editors.

✏️ Suggested fix

-        # contract exactly (4 × buf.copy_ + kernel).
+        # contract exactly (4 x buf.copy_ + kernel).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 689 -
691, Replace the Unicode multiplication sign in the comment that references the
backend's "run_with_prequant" and DG's "run_fused" shape contract (the phrase "4
× buf.copy_ + kernel") with the ASCII letter "x" (change "4 × buf.copy_ +
kernel" to "4 x buf.copy_ + kernel") so tools/editors won't choke on the
ambiguous character; locate the comment near the mentions of run_with_prequant
and run_fused in configurable_moe.py and update the comment text accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 700-711: The zero-token placeholder creates x_sf with shape (0, 0)
which violates the quantize_input contract used by _quantize_bf16_to_fp8_ue8m0
and may break downstream validation; change the x_sf creation in the else branch
of configurable_moe (next to x_fp8, topk_idx, topk_weights) to use shape (0,
self.hidden_size // 32) with dtype torch.int32 on the same device so its shape
matches expectations used by run_with_prequant and the quantization helper.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 351-371: create_moe currently returns MegaMoEDeepGemmFusedMoE
directly, bypassing the ConfigurableMoE path (so _forward_chunk_mega_impl and
override_quant_config are skipped); change the logic so when
ENABLE_CONFIGURABLE_MOE is active the MegaMoE backend is included in the
ConfigurableMoE routing (i.e., treat MegaMoEDeepGemmFusedMoE as one of the
classes handled by ConfigurableMoE so it flows through the same construction
path that calls _forward_chunk_mega_impl and accepts override_quant_config), and
keep the direct import/return branch only as a legacy fallback executed when
configurable mode is disabled; update create_moe, the ConfigurableMoE
dispatch/tuple, and any factory mapping that selects ConfigurableMoE to ensure
override_quant_config and _forward_chunk_mega_impl are preserved for
MegaMoEDeepGemmFusedMoE.
- Around line 104-111: The call to MegaMoEDeepGemmFusedMoE.can_implement is
using a hardcoded activation dtype and the dense intermediate_size; update it to
use the actual pretrained config values from model_config.pretrained_config (use
pretrained.torch_dtype for dtype_activation, and use
pretrained.moe_intermediate_size if present otherwise
pretrained.intermediate_size for the FFN size) while keeping hidden_size and
other flags the same so the capability check matches the later backend creation.

In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py`:
- Around line 1-14: The file backend.py has formatting differences flagged by
CI; run the ruff formatter on this file (e.g., ruff format backend.py) to apply
the project's style rules and commit the reformatted file so the SPDX header and
surrounding code match ruff-format output.
- Around line 726-728: The code indexes all_rank_num_tokens with
self.mapping.tp_rank which is always 0 in EP-only Phase 1; replace that index
with self.mapping.moe_ep_rank so each EP rank uses its correct token count
(change the statement that sets num_tokens from
all_rank_num_tokens[self.mapping.tp_rank] to use self.mapping.moe_ep_rank), and
update the nearby comment that currently references “[tp_rank]” to reference the
correct EP rank dimension (moe_ep_rank); apply the same replacement wherever the
pattern appears in configurable_moe.py (the analogous uses around num_tokens
retrieval).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ac4c50e-d778-462c-bffa-bc5bb8fae516

📥 Commits

Reviewing files that changed from the base of the PR and between e3bddf6 and 70c4946.

📒 Files selected for processing (8)

3rdparty/fetch_content.json
cpp/tensorrt_llm/deep_gemm/CMakeLists.txt
scripts/attribution/data/dependency_metadata.yml
scripts/attribution/data/files_to_dependency.yml
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
tensorrt_llm/_torch/modules/fused_moe/create_moe.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py

tensorrt-cicd · 2026-04-27T09:41:14Z

PR_Github #45701 [ run ] triggered by Bot. Commit: 4ce2c3c Link to invocation

tensorrt-cicd · 2026-04-27T10:03:11Z

PR_Github #45701 [ run ] completed with state FAILURE. Commit: 4ce2c3c
/LLM/main/L0_MergeRequest_PR pipeline #35904 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Barry-Delaney · 2026-04-27T11:28:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-27T11:34:24Z

PR_Github #45728 [ run ] triggered by Bot. Commit: e64d85a Link to invocation

juney-nvidia

Approved from oss compliance perspective.

tensorrt-cicd · 2026-04-28T11:35:12Z

PR_Github #45728 [ run ] completed with state ABORTED. Commit: e64d85a

Link to invocation

Barry-Delaney · 2026-04-28T15:22:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-28T15:28:52Z

PR_Github #45945 [ run ] triggered by Bot. Commit: edb7e59 Link to invocation

tensorrt-cicd · 2026-04-28T15:30:21Z

PR_Github #45945 [ run ] completed with state FAILURE. Commit: edb7e59

Link to invocation

Barry-Delaney · 2026-04-28T15:45:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-28T15:51:51Z

PR_Github #45954 [ run ] triggered by Bot. Commit: edb7e59 Link to invocation

…od path + tests

Builds on NVIDIA#13384 (Barry's MegaMoEDeepGemmFusedMoE backend) and refactors it to share the standard ConfigurableMoE construction / weight-loading pipeline used by CutlassFusedMoE / TRTLLMGenFusedMoE.

Refactor:
* Move weight lifecycle (DG-native MXFP4 + UE8M0 SF tensors, checkpoint loading, scale conversion, SymmBuffer allocation, DG weight transform) out of the MegaMoE backend file and into a dedicated ``W4A8MXFP4MXFP8MegaMoEDeepGemmMethod(FusedMoEMethodBase)`` in ``tensorrt_llm/_torch/modules/fused_moe/quantization.py``.
* Rename ``mega_moe/backend.py`` to ``mega_moe/mega_moe_deepgemm.py`` and shrink it to capability checks, routing/activation quantization, and the fused kernel entry point.
* Wire the new method through ``ConfigurableMoE`` / ``create_moe`` so MegaMoEDeepGemm flows through the same construction and load_weights pipeline as the other backends.

Fixes:
* ``W4A8MXFP4MXFP8MegaMoEDeepGemmMethod.create_weights`` asserts ``hidden_size % 128 == 0`` and ``intermediate_size % 128 == 0`` up front (DG packs UE8M0 SF as int32 over 32-element blocks; misaligned configs hit a cryptic reshape downstream).
* ``w3_w1_weight`` is stored as ``[w1 | w3] = [gate | up]`` (matches DG's ``_interleave_l1_weights`` and TRT-LLM's gate_proj=w1, up_proj=w3 convention; same semantic as NVIDIA#13384 commit edb7e59 applied in the new quant-method path).

Tests:
* Generic MoE module tests cover MegaMoEDeepGemm via ConfigurableMoE.
* Pure-PyTorch QDQ reference for MegaMoEDeepGemm in ``tests/unittest/_torch/modules/moe/quantize_utils.py``.
* Multi-GPU module-level coverage (TP/EP/DEP, NVLink one/two-sided) for MegaMoEDeepGemm, plus extended parametric coverage on the module side (DeepSeek-V4 / Kimi-K2 expert/hidden/intermediate combos).
* Wire matching integration test stages for B200 / B300.

Squashes prior local development commits (Wire MegaMoEDeepGemm backend path, fix MegaMoEDeepGemm bugs, Add MegaMoE generic MoE tests, Add MegaMoE DeepGEMM reference, Gate MegaMoE DeepGEMM SF alignment, Add MegaMoE module multi-GPU coverage, Use MegaMoE module reference, Extend MegaMoE module coverage x2) into a single commit.

Signed-off-by: xxi <[email protected]>

tensorrt-cicd · 2026-04-29T15:52:41Z

PR_Github #45954 [ run ] completed with state ABORTED. Commit: edb7e59

Link to invocation

Update scripts/attribution/data/dependency_metadata.yml and files_to_dependency.yml to reflect the deepgemm upgrade in 3rdparty/fetch_content.json (4ff3f54d... -> c491439e...). Aligned with the upstream attribution refresh for the same bump (PR NVIDIA#13384) so this branch only carries the deepgemm-related entries; no cutlass / cuda / nccl / torch entries are introduced and no new cas/ blobs are added. Signed-off-by: Fanrong Li <[email protected]>

xxi-nv · 2026-05-04T23:57:10Z

/bot run --disable-fail-fast

xxi-nv · 2026-05-05T05:52:03Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-05T05:58:31Z

PR_Github #46755 [ run ] triggered by Bot. Commit: e7bfb23 Link to invocation

tensorrt-cicd · 2026-05-05T10:58:32Z

PR_Github #46755 [ run ] completed with state SUCCESS. Commit: e7bfb23
/LLM/main/L0_MergeRequest_PR pipeline #36782 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xingfei

Introduces ``MegaMoEFusedMoE``, a new MoE backend wrapping DeepGEMM's fused ``fp8_fp4_mega_moe`` kernel (dispatch + GEMM1 + SwiGLU + GEMM2 + combine into a single launch via NVLink SymmBuffer). Accepts the same W4A8_MXFP4_MXFP8 weight layout as ``TRTLLMGenFusedMoE`` so VANILLA / FUSED_GATE_UP_PROJ loaders work unchanged.

Integration follows the ConfigurableMoE pattern per @xingfei:
- ``fused_moe/mega_moe/__init__.py`` / ``backend.py``: new ``MegaMoEFusedMoE`` that (a) creates DG-native uint8 MXFP4 + UE8M0 weight tensors, (b) resolves the EP ProcessGroup at construction (no collective at forward time), (c) allocates the DG ``SymmBuffer`` via a process-level cache on ``post_load_weights``, and (d) dispatches ``deep_gemm.fp8_fp4_mega_moe`` in ``forward_impl``.
- ``fused_moe/create_moe.py``: add ``MEGAMOE`` backend type; ``get_moe_cls`` routes W4A8_MXFP4_MXFP8 to ``MegaMoEFusedMoE`` and falls back to ``CutlassFusedMoE`` for every other quant (mirrors the TRTLLM / CUTEDSL fallback pattern). ``create_moe_backend`` gets a ``MegaMoEFusedMoE`` constructor branch.
- ``fused_moe/configurable_moe.py``: add a fast-path guard at the top of ``_forward_chunk_impl`` and a dedicated ``_forward_chunk_mega_impl`` that forwards the ADP shape contract (``all_rank_num_tokens``, ``use_dp_padding``) directly to ``backend.forward_impl`` and skips the EPLB / quant / combine orchestration.

Unit tests in ``tests/unittest/_torch/modules/moe/test_mega_moe.py`` cover:
* ``can_implement`` capability matrix: accepts ``W4A8_MXFP4_MXFP8 + bfloat16 + SM100``, rejects every other combo with a descriptive reason.
* ``get_moe_cls("MEGAMOE")`` dispatch: returns ``MegaMoEFusedMoE`` for the supported quant, falls back to ``CutlassFusedMoE`` for anything else (and on non-SM100 / no-DG runners so CI on non-Blackwell machines passes without skipping).
* ``apply_router_weight_on_input`` rejection at construction time (the fused kernel applies routing weights on the MoE output, not the input; the two paths are not equivalent under SwiGLU).
* ADP topology guard: ``use_dp and parallel_size > 1`` requires ``ep_size == parallel_size``.
* Weight-loader shape contract for both ``VANILLA`` and ``FUSED_GATE_UP_PROJ`` loading modes: verifies the expected MXFP4/UE8M0 tensor shapes are produced after loading.

Hot-path validation is left to the multi-GPU harness under ``tmp_test_scripts`` (requires 4+ GPUs + bundled DeepGEMM with ``fp8_fp4_mega_moe``); unit tests skip cleanly when those prerequisites are missing.

Phase 1 constraints: EPLB disabled (DG's SymmBuffer dispatch is incompatible with ``prepare_dispatch`` / NVLinkTwoSided; will revisit in a follow-up), moe_tp_size=1 (EP-only), shapes must be divisible by 128 (packed-UE8M0 SF int32 stride).

Signed-off-by: Barry Kang <[email protected]>
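
The fallback rule described for ``get_moe_cls`` can be sketched as a plain function. This is a hypothetical simplification: the real code returns classes and consults ``can_implement``, and the argument names below are illustrative assumptions rather than the actual signature.

```python
def pick_moe_backend(quant_mode: str, sm_version: int, has_deepgemm: bool) -> str:
    """Return the backend name a MEGAMOE request would resolve to.

    Hypothetical sketch; names and arguments are not the real API.
    """
    supported = (
        quant_mode == "W4A8_MXFP4_MXFP8"
        and sm_version == 100
        and has_deepgemm
    )
    # Anything outside the supported combination falls back to CUTLASS,
    # so runners without Blackwell or DeepGEMM still get a working backend.
    return "MegaMoEFusedMoE" if supported else "CutlassFusedMoE"
```

Falling back instead of raising is what lets the unit tests run on non-SM100 machines without being skipped.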

Collapses the pre-kernel overhead in the MegaMoE path from ~460 us down to ~85 us (DSV3 ep=4, 4 × GB200, uniform routing) via two complementary changes:

1. **Hoist routing + BF16→FP8 quant out of the backend.** The backend used to own its own ``routing_method.apply`` + ``per_token_cast_to_fp8`` + buffer copies + kernel launch inside ``forward_impl``. That buried the pre-processing behind an extra two Python frames (ConfigurableMoE → _forward_chunk_impl → forward_impl → backend.forward_impl) and recomputed what the outer pipeline already knows how to do for the CUTLASS / CUTEDSL backends. ``_forward_chunk_mega_impl`` now mirrors the standard separated-routing contract: it slices ``x`` / ``router_logits`` to the unpadded ADP count, runs ``self.routing_method.apply`` once, runs ``self.backend.quantize_input(x_real)`` once, then calls a new kernel-only backend entry ``run_with_prequant(x_fp8, x_sf, topk_idx, topk_weights, num_tokens, output_dtype)`` that just does the 4 × ``buf.copy_()`` + fused kernel, matching DG's own ``run_fused`` shape contract, so the GPU work inside the backend call is now what DG's benchmarks measure. Zero-token ranks still enter ``run_with_prequant`` with fabricated empty tensors so the SymmBuffer collective doesn't hang peers. The existing ``backend.forward_impl`` path stays as a stand-alone fallback and now also routes through ``self.quantize_input``.

2. **Use TRT-LLM's C++ MXFP8 quant kernel.** DG's Python ``per_token_cast_to_fp8(..., gran_k=32, use_packed_ue8m0=True)`` decomposes into ~8 elementwise/reduction ops (empty+fill, copy, abs, cast, amax, clamp, div, ue8m0 round, mul, cast to fp8, pack int32). Wrapping it with ``torch.compile`` fuses these to ~1-2 Triton kernels and gets to ~60-260 us depending on seq. ``torch.ops.trtllm.mxfp8_quantize(x, False, alignment=32)`` is an existing C++ CUDA kernel in this tree (``thop/mxFp8Quantize.cpp``); CUTLASS's MoE path already uses it. Byte-identical output to DG's helper (roundtrip-verified on random BF16: fp8 bytes and SF int32 both ``torch.equal`` after reshape), and 5-25× faster, consistently ~11 us independent of seq. ``_quantize_bf16_to_fp8_ue8m0`` prefers the TRT-LLM op when ``libth_common.so`` has registered it (always the case inside ConfigurableMoE because ``create_moe.py`` imports CutlassFusedMoE at module top, which triggers the library load). Falls back to the ``torch.compile``'d DG helper for slim builds.

Perf deltas (us/iter, DSV3 E=256 k=8 H=7168 I=2048, ep=4):

seq    initial backend   +hoist   +C++ quant   final vs CUTLASS
----   ---------------   ------   ----------   ----------------
1      895               527      391          0.70×
32     890               454      369          0.74×
128    863               453      432          0.87×
512    866               462      437          0.88×
2048   1426              1007     939          1.88×

MegaMoE now beats CUTLASS by 10-30% across seq 1-512 (uniform routing). Large-seq (≥ 1024) remains limited by DG kernel scaling itself (in-kernel tuning, not something the wrapper can address).

Unit tests (``test_mega_moe.py``, 15 cases) remain green.

Signed-off-by: Barry Kang <[email protected]>

Signed-off-by: Barry Kang <[email protected]>

R1: configurable_moe._forward_chunk_mega_impl: the zero-token x_sf placeholder used shape (0, 0), which violates the contract returned by quantize_input (packed-UE8M0 int32 over 32-element blocks, 4 u8 per int32). Use (0, hidden_size // 128) int32 to match.

R2: create_moe: MegaMoEDeepGemmFusedMoE was bypassing ConfigurableMoE because it wasn't in the supported tuple, leaving ConfigurableMoE._forward_chunk_mega_impl (the perf hoist from 5112801) unreachable. Add it lazily so non-MegaMoE callers don't import DeepGEMM at module load.

R3: get_moe_cls: the capability check for MegaMoE used hardcoded torch.bfloat16 and dense intermediate_size; resolve from pretrained_config (torch_dtype + moe_intermediate_size first) so can_implement matches the values used at construction.

R5: backend.forward_impl + ConfigurableMoE._forward_chunk_mega_impl indexed all_rank_num_tokens with mapping.tp_rank, which is the outer TP rank rather than the EP rank. Phase 1 asserts ep_size == parallel_size, so use mapping.moe_ep_rank for clarity and topology robustness.

Nits:
* drop unused use_dp_padding from _forward_chunk_mega_impl signature (and caller); MegaMoE uses raw token counts, not DP padding.
* replace the Unicode multiplication sign "×" with ASCII "x" in two comments/docstrings (configurable_moe.py + backend.py).
* _SUPPORTED_ACTIVATION_DTYPES set -> frozenset.
* _quantize_bf16_to_fp8_ue8m0 raises ValueError early when n % 128 != 0 instead of failing at the int32 reshape.
* reshape five docstrings in mega_moe/backend.py to satisfy ruff D205 (blank line between summary and description).

ruff format and ruff check now pass with the project config.

Signed-off-by: Barry Kang <[email protected]>
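
R1 and R5 together determine what each rank hands to the kernel. A shape-only sketch, using a hypothetical helper `slice_for_rank` in which plain lists stand in for tensors and no quantization is performed:

```python
def slice_for_rank(padded_rows, all_rank_num_tokens, ep_rank, hidden_size):
    """Trim a padded batch to this rank's real tokens and size the SF placeholder.

    Hypothetical helper for illustration; only shapes are modelled.
    """
    num_tokens = all_rank_num_tokens[ep_rank]  # index by EP rank, not TP rank
    x_real = padded_rows[:num_tokens]
    # Packed UE8M0 scale factors: one int32 per 128 hidden elements, so an
    # empty rank still reports (0, hidden_size // 128) rather than (0, 0).
    x_sf_shape = (num_tokens, hidden_size // 128)
    return x_real, x_sf_shape


rows = [[0.0] * 256 for _ in range(4)]  # batch padded to 4 tokens
busy_x, busy_sf = slice_for_rank(rows, [3, 0], ep_rank=0, hidden_size=256)
idle_x, idle_sf = slice_for_rank(rows, [3, 0], ep_rank=1, hidden_size=256)
```

The idle rank ends up with an empty batch whose scale-factor shape still has the expected second dimension, so it can join the collective with consistently shaped inputs.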

Pre-commit's yapf hook prefers wrapping the single-element tuple extension across two lines. No semantic change. Signed-off-by: Barry Kang <[email protected]>

DeepGEMM's ``fp8_fp4_mega_moe`` kernel interprets the first half of the L1 weight tensor as the SwiGLU gate side and the second half as the up side (deep_gemm/mega/__init__.py:78 ``_interleave_l1_weights``: ``gate = t[:, :half]; up = t[:, half:]``). TRT-LLM's MoE convention -- consistent across ``modeling_gpt_oss.py:743-746`` and the ``FUSED_GATE_UP_PROJ`` loader at ``quantization.py:362-365`` -- maps ``w1 = gate_proj`` and ``w3 = up_proj`` (HF's ``gate_up_proj`` is laid out as ``[gate | up]`` along the output dim, ``chunk(2)[0]`` -> w1, ``chunk(2)[1]`` -> w3). The previous ``cat([w3, w1])`` order silently swapped which side the ``silu`` was applied to, computing ``silu(up_proj @ x) * (gate_proj @ x)`` instead of ``silu(gate_proj @ x) * (up_proj @ x)``. Verified against a pure-PyTorch QDQ reference (no DG / TRT-LLM kernel ops): with this fix, MegaMoE output matches the reference bit-exact. Without it, the per-element mismatch rate is ~94% vs the reference. Signed-off-by: Barry Kang <[email protected]>
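
A toy calculation shows why the concatenation order matters. The numbers below are made up, and `swiglu` is a hypothetical scalar helper that only mirrors the ``[gate | up]`` split described above, not the DeepGEMM kernel:

```python
import math


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def swiglu(fused: list[float]) -> list[float]:
    """Apply SwiGLU to a fused row laid out as [gate | up]."""
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


gate_out = [1.0, -2.0]  # toy values standing in for gate_proj @ x
up_out = [0.5, 3.0]     # toy values standing in for up_proj @ x

correct = swiglu(gate_out + up_out)  # [w1 | w3] = [gate | up]
swapped = swiglu(up_out + gate_out)  # [w3 | w1]: silu lands on the up side
```

Because `silu` is nonlinear, `correct` and `swapped` differ for generic inputs, which is consistent with the large mismatch rate reported before the fix.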

Barry-Delaney · 2026-05-07T09:58:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T10:04:52Z

PR_Github #47186 [ run ] triggered by Bot. Commit: e7d80e9 Link to invocation

Barry-Delaney · 2026-05-07T19:56:29Z

/bot kill

tensorrt-cicd · 2026-05-07T20:02:14Z

PR_Github #47243 [ kill ] triggered by Bot. Commit: e7d80e9 Link to invocation

tensorrt-cicd · 2026-05-07T20:02:58Z

PR_Github #47243 [ kill ] completed with state SUCCESS. Commit: e7d80e9
Successfully killed previous jobs for commit e7d80e9

Link to invocation

Barry-Delaney · 2026-05-07T20:03:20Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T20:09:04Z

PR_Github #47245 [ run ] triggered by Bot. Commit: e7d80e9 Link to invocation

tensorrt-cicd · 2026-05-08T03:41:37Z

PR_Github #47245 [ run ] completed with state SUCCESS. Commit: e7d80e9
/LLM/main/L0_MergeRequest_PR pipeline #37194 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Barry-Delaney · 2026-05-08T03:42:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-08T03:51:45Z

PR_Github #47302 [ run ] triggered by Bot. Commit: e7d80e9

tensorrt-cicd · 2026-05-08T04:49:07Z

PR_Github #47302 [ run ] completed with state SUCCESS. Commit: e7d80e9
/LLM/main/L0_MergeRequest_PR pipeline #37244 completed with status: 'SUCCESS'

github-actions Bot assigned Barry-Delaney Apr 23, 2026

xxi-nv self-requested a review April 27, 2026 07:17

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 124d4f0 to b652f19 Compare April 27, 2026 07:19

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from b652f19 to 5112801 Compare April 27, 2026 09:09

Barry-Delaney marked this pull request as ready for review April 27, 2026 09:28

Barry-Delaney requested review from a team as code owners April 27, 2026 09:28

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 70c4946 to 4ce2c3c Compare April 27, 2026 09:29

Barry-Delaney changed the title ~~[None][feat] Add MegaMoEFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe~~ [None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe Apr 27, 2026

xxi-nv approved these changes Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

juney-nvidia approved these changes Apr 28, 2026

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 6d13f9b to edb7e59 Compare April 28, 2026 15:22

Barry-Delaney enabled auto-merge (squash) April 30, 2026 01:25

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 660acf2 to e76ddc4 Compare May 1, 2026 10:14

Barry-Delaney added 6 commits May 7, 2026 02:56

Address review comments

dfb72fc

Signed-off-by: Barry Kang <[email protected]>

[None][fix] yapf line-wrap on MegaMoE configurable_supported tuple

492ac0c

Pre-commit's yapf hook prefers wrapping the single-element tuple extension across two lines. No semantic change.

Signed-off-by: Barry Kang <[email protected]>

Barry-Delaney force-pushed the user/jinshik/dg_mega_moe branch from 994b5a3 to e7d80e9 Compare May 7, 2026 09:58

Barry-Delaney merged commit 6e069b6 into NVIDIA:main May 8, 2026
6 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp…

3c70679

…8_fp4_mega_moe (NVIDIA#13384) Signed-off-by: Barry Kang <[email protected]>
