Skip to content

[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export#14179

Merged
longlee0622 merged 2 commits into
NVIDIA:mainfrom
longlee0622:fix/nvbug-6163147-mixtral-modelopt-export
May 18, 2026
Merged

[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export#14179
longlee0622 merged 2 commits into
NVIDIA:mainfrom
longlee0622:fix/nvbug-6163147-mixtral-modelopt-export

Conversation

@longlee0622
Copy link
Copy Markdown
Collaborator

@longlee0622 longlee0622 commented May 15, 2026

The previous fix (PR #14027) added a sibling layer.block_sparse_moe with the legacy per-expert layout, but modelopt iterates decoder_layer.named_ children() and hits the original layer.mlp (still backed by transformers 5.x's MixtralExperts) first, so build_moe_config still calls len(module.experts) and raises TypeError.

Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing storage with the 5.x fused gate_up_proj/down_proj tensors via parameter views (zero-copy). The compat block's forward is mathematically identical to the original (bit-exact on fp32 sanity check). The class name retains "MixtralSparseMoeBlock" so modelopt's is_moe substring check still matches, and the per-expert nn.Linear modules let modelopt attach per-expert weight quantizers during calibration.

Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37: without the fix, export reproduces the bug's exact stack (modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts' has no len()); with the fix, FP8 calibration + export_tensorrt_llm_ checkpoint succeeds for both TP=1 and TP=2.

Summary by CodeRabbit

  • Refactor
    • Updated Mixtral model quantization processing to improve framework compatibility and optimize mixture-of-experts handling during model quantization workflows.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…elopt export

The previous fix (PR NVIDIA#14027) added a sibling layer.block_sparse_moe with
the legacy per-expert layout, but modelopt iterates decoder_layer.named_
children() and hits the original layer.mlp (still backed by transformers
5.x's MixtralExperts) first, so build_moe_config still calls
len(module.experts) and raises TypeError.

Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose
experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing
storage with the 5.x fused gate_up_proj/down_proj tensors via parameter
views (zero-copy). The compat block's forward is mathematically identical
to the original (bit-exact on fp32 sanity check). The class name retains
"MixtralSparseMoeBlock" so modelopt's is_moe substring check still
matches, and the per-expert nn.Linear modules let modelopt attach
per-expert weight quantizers during calibration.

Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37:
without the fix, export reproduces the bug's exact stack
(modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts'
has no len()); with the fix, FP8 calibration + export_tensorrt_llm_
checkpoint succeeds for both TP=1 and TP=2.

Signed-off-by: Jonas Li <[email protected]>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

Walkthrough

This PR refactors Mixtral MoE handling in the modelopt quantization path by replacing shim-based wrappers with explicit compat nn.Module implementations that restore the pre-transformers-5.x per-expert layout using zero-copy parameter views and custom token routing logic.

Changes

Mixtral MoE Compat Module Refactor

Layer / File(s) Summary
Compat module implementations
tensorrt_llm/quantization/quantize_by_modelopt.py
New _MixtralBlockSparseTop2MLPCompat and _MixtralSparseMoeBlockCompat classes replace prior shims. Per-expert MLPs use nn.Linear modules backed by parameter views into fused tensors. MoE block handles token routing with training jitter, top-k expert selection, scaling, and accumulation.
Integration in unfuse function
tensorrt_llm/quantization/quantize_by_modelopt.py
_unfuse_mixtral_for_modelopt updated to detect and replace layer.mlp with compat MoE module when mlp.experts is MixtralExperts. Comment adjusted to reflect per-expert layout restoration for both calibration and export.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning PR description fills in the issue and solution but omits required sections for title, test coverage, and incomplete PR checklist verification. Add a properly formatted PR title following the template, document test coverage details, and ensure all checklist items are explicitly verified before submission.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is directly related to the main change: replacing layer.mlp in place for Mixtral modelopt export, which is the core fix described in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/quantization/quantize_by_modelopt.py (1)

323-337: ⚡ Quick win

Fail fast on the expected MixtralExperts tensor layout.

The slices at Lines 333-335 hardcode the current fused layout and ordering. Since this compat path depends on a transformers implementation detail, add shape checks before wiring the views so a future HF change throws immediately instead of silently exporting incorrect expert weights.

🛡️ Proposed guard
         gate_up = experts.gate_up_proj  # [N, 2*I, H]
         down = experts.down_proj  # [N, H, I]
         act_fn = experts.act_fn
+
+        expected_gate_up_shape = (
+            self.num_experts,
+            2 * self.intermediate_dim,
+            self.hidden_dim,
+        )
+        expected_down_shape = (
+            self.num_experts,
+            self.hidden_dim,
+            self.intermediate_dim,
+        )
+        if tuple(gate_up.shape) != expected_gate_up_shape or tuple(down.shape) != expected_down_shape:
+            raise ValueError(
+                "Unsupported MixtralExperts layout for modelopt compatibility: "
+                f"gate_up_proj={tuple(gate_up.shape)}, down_proj={tuple(down.shape)}"
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/quantization/quantize_by_modelopt.py` around lines 323 - 337,
Add explicit shape and rank checks before slicing experts.gate_up_proj and
experts.down_proj to fail fast if the HF fused layout changes: verify gate_up
has 3 dims, gate_up.shape[0] == self.num_experts, gate_up.shape[1] == 2 *
self.intermediate_dim and gate_up.shape[2] is present, and verify down has 3
dims with down.shape[0] == self.num_experts and down.shape[2] ==
self.intermediate_dim; if any check fails raise a clear ValueError explaining
the expected shapes and referencing experts.gate_up_proj, experts.down_proj,
self.num_experts, self.intermediate_dim, and _MixtralBlockSparseTop2MLPCompat so
incorrect tensor layouts are caught immediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 394-397: The new compat block _MixtralSparseMoeBlockCompat is
replacing the original mlp but does not preserve the original module's
train/eval state, which re-enables the jitter branch during calibration/export;
after creating the compat instance in _unfuse_mixtral_for_modelopt() (where you
assign layer.mlp = _MixtralSparseMoeBlockCompat(mlp)), immediately set its mode
to match the original mlp by checking mlp.training and calling layer.mlp.train()
if True or layer.mlp.eval() if False (or set layer.mlp.training = mlp.training
and ensure any submodules follow that state) so the replaced module retains the
original train/eval behavior.

---

Nitpick comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 323-337: Add explicit shape and rank checks before slicing
experts.gate_up_proj and experts.down_proj to fail fast if the HF fused layout
changes: verify gate_up has 3 dims, gate_up.shape[0] == self.num_experts,
gate_up.shape[1] == 2 * self.intermediate_dim and gate_up.shape[2] is present,
and verify down has 3 dims with down.shape[0] == self.num_experts and
down.shape[2] == self.intermediate_dim; if any check fails raise a clear
ValueError explaining the expected shapes and referencing experts.gate_up_proj,
experts.down_proj, self.num_experts, self.intermediate_dim, and
_MixtralBlockSparseTop2MLPCompat so incorrect tensor layouts are caught
immediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fcc6c566-c8e0-4b91-aff4-989949c4e2b0

📥 Commits

Reviewing files that changed from the base of the PR and between 6042b4c and 9b82f12.

📒 Files selected for processing (1)
  • tensorrt_llm/quantization/quantize_by_modelopt.py

Comment thread tensorrt_llm/quantization/quantize_by_modelopt.py Outdated
…l mlp

get_model() calls model.eval() before _unfuse_mixtral_for_modelopt, but
the freshly constructed _MixtralSparseMoeBlockCompat defaults to
training=True. That re-enables the router-input jitter branch during
calibration/export when router_jitter_noise > 0, making FP8 amax
nondeterministic. Mirror the original mlp's training mode after
construction so a prior model.eval() is honored.

Signed-off-by: Jonas Li <[email protected]>
@longlee0622
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48579 [ run ] triggered by Bot. Commit: 6f4c68e Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #48579 [ run ] completed with state SUCCESS. Commit: 6f4c68e
/LLM/main/L0_MergeRequest_PR pipeline #38364 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@longlee0622 longlee0622 enabled auto-merge (squash) May 16, 2026 04:01
@longlee0622 longlee0622 merged commit 0886d6a into NVIDIA:main May 18, 2026
8 checks passed
@longlee0622 longlee0622 deleted the fix/nvbug-6163147-mixtral-modelopt-export branch May 18, 2026 01:37
KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants