[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export by longlee0622 · Pull Request #14179 · NVIDIA/TensorRT-LLM

longlee0622 · 2026-05-15T09:04:55Z

The previous fix (PR #14027) added a sibling layer.block_sparse_moe with the legacy per-expert layout, but modelopt iterates decoder_layer.named_ children() and hits the original layer.mlp (still backed by transformers 5.x's MixtralExperts) first, so build_moe_config still calls len(module.experts) and raises TypeError.

Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing storage with the 5.x fused gate_up_proj/down_proj tensors via parameter views (zero-copy). The compat block's forward is mathematically identical to the original (bit-exact on fp32 sanity check). The class name retains "MixtralSparseMoeBlock" so modelopt's is_moe substring check still matches, and the per-expert nn.Linear modules let modelopt attach per-expert weight quantizers during calibration.

Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37: without the fix, export reproduces the bug's exact stack (modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts' has no len()); with the fix, FP8 calibration + export_tensorrt_llm_ checkpoint succeeds for both TP=1 and TP=2.

Summary by CodeRabbit

Refactor
- Updated Mixtral model quantization processing to improve framework compatibility and optimize mixture-of-experts handling during model quantization workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…elopt export The previous fix (PR NVIDIA#14027) added a sibling layer.block_sparse_moe with the legacy per-expert layout, but modelopt iterates decoder_layer.named_ children() and hits the original layer.mlp (still backed by transformers 5.x's MixtralExperts) first, so build_moe_config still calls len(module.experts) and raises TypeError. Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing storage with the 5.x fused gate_up_proj/down_proj tensors via parameter views (zero-copy). The compat block's forward is mathematically identical to the original (bit-exact on fp32 sanity check). The class name retains "MixtralSparseMoeBlock" so modelopt's is_moe substring check still matches, and the per-expert nn.Linear modules let modelopt attach per-expert weight quantizers during calibration. Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37: without the fix, export reproduces the bug's exact stack (modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts' has no len()); with the fix, FP8 calibration + export_tensorrt_llm_ checkpoint succeeds for both TP=1 and TP=2. Signed-off-by: Jonas Li <[email protected]>

coderabbitai · 2026-05-15T09:10:40Z

📝 Walkthrough

Walkthrough

This PR refactors Mixtral MoE handling in the modelopt quantization path by replacing shim-based wrappers with explicit compat nn.Module implementations that restore the pre-transformers-5.x per-expert layout using zero-copy parameter views and custom token routing logic.

Changes

Mixtral MoE Compat Module Refactor

Layer / File(s)	Summary
Compat module implementations `tensorrt_llm/quantization/quantize_by_modelopt.py`	New `_MixtralBlockSparseTop2MLPCompat` and `_MixtralSparseMoeBlockCompat` classes replace prior shims. Per-expert MLPs use `nn.Linear` modules backed by parameter views into fused tensors. MoE block handles token routing with training jitter, top-k expert selection, scaling, and accumulation.
Integration in unfuse function `tensorrt_llm/quantization/quantize_by_modelopt.py`	`_unfuse_mixtral_for_modelopt` updated to detect and replace `layer.mlp` with compat MoE module when `mlp.experts` is `MixtralExperts`. Comment adjusted to reflect per-expert layout restoration for both calibration and export.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	PR description fills in the issue and solution but omits required sections for title, test coverage, and incomplete PR checklist verification.	Add a properly formatted PR title following the template, document test coverage details, and ensure all checklist items are explicitly verified before submission.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is directly related to the main change: replacing layer.mlp in place for Mixtral modelopt export, which is the core fix described in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tensorrt_llm/quantization/quantize_by_modelopt.py (1)

323-337: ⚡ Quick win

Fail fast on the expected MixtralExperts tensor layout.

The slices at Lines 333-335 hardcode the current fused layout and ordering. Since this compat path depends on a transformers implementation detail, add shape checks before wiring the views so a future HF change throws immediately instead of silently exporting incorrect expert weights.

🛡️ Proposed guard

         gate_up = experts.gate_up_proj  # [N, 2*I, H]
         down = experts.down_proj  # [N, H, I]
         act_fn = experts.act_fn
+
+        expected_gate_up_shape = (
+            self.num_experts,
+            2 * self.intermediate_dim,
+            self.hidden_dim,
+        )
+        expected_down_shape = (
+            self.num_experts,
+            self.hidden_dim,
+            self.intermediate_dim,
+        )
+        if tuple(gate_up.shape) != expected_gate_up_shape or tuple(down.shape) != expected_down_shape:
+            raise ValueError(
+                "Unsupported MixtralExperts layout for modelopt compatibility: "
+                f"gate_up_proj={tuple(gate_up.shape)}, down_proj={tuple(down.shape)}"
+            )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/quantization/quantize_by_modelopt.py` around lines 323 - 337,
Add explicit shape and rank checks before slicing experts.gate_up_proj and
experts.down_proj to fail fast if the HF fused layout changes: verify gate_up
has 3 dims, gate_up.shape[0] == self.num_experts, gate_up.shape[1] == 2 *
self.intermediate_dim and gate_up.shape[2] is present, and verify down has 3
dims with down.shape[0] == self.num_experts and down.shape[2] ==
self.intermediate_dim; if any check fails raise a clear ValueError explaining
the expected shapes and referencing experts.gate_up_proj, experts.down_proj,
self.num_experts, self.intermediate_dim, and _MixtralBlockSparseTop2MLPCompat so
incorrect tensor layouts are caught immediately.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 394-397: The new compat block _MixtralSparseMoeBlockCompat is
replacing the original mlp but does not preserve the original module's
train/eval state, which re-enables the jitter branch during calibration/export;
after creating the compat instance in _unfuse_mixtral_for_modelopt() (where you
assign layer.mlp = _MixtralSparseMoeBlockCompat(mlp)), immediately set its mode
to match the original mlp by checking mlp.training and calling layer.mlp.train()
if True or layer.mlp.eval() if False (or set layer.mlp.training = mlp.training
and ensure any submodules follow that state) so the replaced module retains the
original train/eval behavior.

---

Nitpick comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 323-337: Add explicit shape and rank checks before slicing
experts.gate_up_proj and experts.down_proj to fail fast if the HF fused layout
changes: verify gate_up has 3 dims, gate_up.shape[0] == self.num_experts,
gate_up.shape[1] == 2 * self.intermediate_dim and gate_up.shape[2] is present,
and verify down has 3 dims with down.shape[0] == self.num_experts and
down.shape[2] == self.intermediate_dim; if any check fails raise a clear
ValueError explaining the expected shapes and referencing experts.gate_up_proj,
experts.down_proj, self.num_experts, self.intermediate_dim, and
_MixtralBlockSparseTop2MLPCompat so incorrect tensor layouts are caught
immediately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fcc6c566-c8e0-4b91-aff4-989949c4e2b0

📥 Commits

Reviewing files that changed from the base of the PR and between 6042b4c and 9b82f12.

📒 Files selected for processing (1)

tensorrt_llm/quantization/quantize_by_modelopt.py

…l mlp get_model() calls model.eval() before _unfuse_mixtral_for_modelopt, but the freshly constructed _MixtralSparseMoeBlockCompat defaults to training=True. That re-enables the router-input jitter branch during calibration/export when router_jitter_noise > 0, making FP8 amax nondeterministic. Mirror the original mlp's training mode after construction so a prior model.eval() is honored. Signed-off-by: Jonas Li <[email protected]>

longlee0622 · 2026-05-15T09:29:17Z

/bot run

tensorrt-cicd · 2026-05-15T09:35:20Z

PR_Github #48579 [ run ] triggered by Bot. Commit: 6f4c68e Link to invocation

tensorrt-cicd · 2026-05-15T21:10:42Z

PR_Github #48579 [ run ] completed with state SUCCESS. Commit: 6f4c68e
/LLM/main/L0_MergeRequest_PR pipeline #38364 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

…elopt export (NVIDIA#14179) Signed-off-by: Jonas Li <[email protected]>

github-actions Bot assigned longlee0622 May 15, 2026

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread tensorrt_llm/quantization/quantize_by_modelopt.py Outdated

longlee0622 enabled auto-merge (squash) May 16, 2026 04:01

ziyixiong-nv approved these changes May 18, 2026

View reviewed changes

longlee0622 merged commit 0886d6a into NVIDIA:main May 18, 2026
8 checks passed

longlee0622 deleted the fix/nvbug-6163147-mixtral-modelopt-export branch May 18, 2026 01:37

KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 19, 2026

[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral mod…

e1aae37

…elopt export (NVIDIA#14179) Signed-off-by: Jonas Li <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export#14179

[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export#14179
longlee0622 merged 2 commits into
NVIDIA:mainfrom
longlee0622:fix/nvbug-6163147-mixtral-modelopt-export

longlee0622 commented May 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 15, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

longlee0622 commented May 15, 2026

Uh oh!

tensorrt-cicd commented May 15, 2026

Uh oh!

tensorrt-cicd commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

longlee0622 commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

longlee0622 commented May 15, 2026

Uh oh!

tensorrt-cicd commented May 15, 2026

Uh oh!

tensorrt-cicd commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

longlee0622 commented May 15, 2026 •

edited

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading