[https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export#14179
Conversation
…elopt export The previous fix (PR NVIDIA#14027) added a sibling layer.block_sparse_moe with the legacy per-expert layout, but modelopt iterates decoder_layer.named_ children() and hits the original layer.mlp (still backed by transformers 5.x's MixtralExperts) first, so build_moe_config still calls len(module.experts) and raises TypeError. Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing storage with the 5.x fused gate_up_proj/down_proj tensors via parameter views (zero-copy). The compat block's forward is mathematically identical to the original (bit-exact on fp32 sanity check). The class name retains "MixtralSparseMoeBlock" so modelopt's is_moe substring check still matches, and the per-expert nn.Linear modules let modelopt attach per-expert weight quantizers during calibration. Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37: without the fix, export reproduces the bug's exact stack (modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts' has no len()); with the fix, FP8 calibration + export_tensorrt_llm_ checkpoint succeeds for both TP=1 and TP=2. Signed-off-by: Jonas Li <[email protected]>
📝 WalkthroughWalkthroughThis PR refactors Mixtral MoE handling in the modelopt quantization path by replacing shim-based wrappers with explicit compat ChangesMixtral MoE Compat Module Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/quantization/quantize_by_modelopt.py (1)
323-337: ⚡ Quick winFail fast on the expected
MixtralExpertstensor layout.The slices at Lines 333-335 hardcode the current fused layout and ordering. Since this compat path depends on a
transformersimplementation detail, add shape checks before wiring the views so a future HF change throws immediately instead of silently exporting incorrect expert weights.🛡️ Proposed guard
gate_up = experts.gate_up_proj # [N, 2*I, H] down = experts.down_proj # [N, H, I] act_fn = experts.act_fn + + expected_gate_up_shape = ( + self.num_experts, + 2 * self.intermediate_dim, + self.hidden_dim, + ) + expected_down_shape = ( + self.num_experts, + self.hidden_dim, + self.intermediate_dim, + ) + if tuple(gate_up.shape) != expected_gate_up_shape or tuple(down.shape) != expected_down_shape: + raise ValueError( + "Unsupported MixtralExperts layout for modelopt compatibility: " + f"gate_up_proj={tuple(gate_up.shape)}, down_proj={tuple(down.shape)}" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/quantization/quantize_by_modelopt.py` around lines 323 - 337, Add explicit shape and rank checks before slicing experts.gate_up_proj and experts.down_proj to fail fast if the HF fused layout changes: verify gate_up has 3 dims, gate_up.shape[0] == self.num_experts, gate_up.shape[1] == 2 * self.intermediate_dim and gate_up.shape[2] is present, and verify down has 3 dims with down.shape[0] == self.num_experts and down.shape[2] == self.intermediate_dim; if any check fails raise a clear ValueError explaining the expected shapes and referencing experts.gate_up_proj, experts.down_proj, self.num_experts, self.intermediate_dim, and _MixtralBlockSparseTop2MLPCompat so incorrect tensor layouts are caught immediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 394-397: The new compat block _MixtralSparseMoeBlockCompat is
replacing the original mlp but does not preserve the original module's
train/eval state, which re-enables the jitter branch during calibration/export;
after creating the compat instance in _unfuse_mixtral_for_modelopt() (where you
assign layer.mlp = _MixtralSparseMoeBlockCompat(mlp)), immediately set its mode
to match the original mlp by checking mlp.training and calling layer.mlp.train()
if True or layer.mlp.eval() if False (or set layer.mlp.training = mlp.training
and ensure any submodules follow that state) so the replaced module retains the
original train/eval behavior.
---
Nitpick comments:
In `@tensorrt_llm/quantization/quantize_by_modelopt.py`:
- Around line 323-337: Add explicit shape and rank checks before slicing
experts.gate_up_proj and experts.down_proj to fail fast if the HF fused layout
changes: verify gate_up has 3 dims, gate_up.shape[0] == self.num_experts,
gate_up.shape[1] == 2 * self.intermediate_dim and gate_up.shape[2] is present,
and verify down has 3 dims with down.shape[0] == self.num_experts and
down.shape[2] == self.intermediate_dim; if any check fails raise a clear
ValueError explaining the expected shapes and referencing experts.gate_up_proj,
experts.down_proj, self.num_experts, self.intermediate_dim, and
_MixtralBlockSparseTop2MLPCompat so incorrect tensor layouts are caught
immediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fcc6c566-c8e0-4b91-aff4-989949c4e2b0
📒 Files selected for processing (1)
tensorrt_llm/quantization/quantize_by_modelopt.py
…l mlp get_model() calls model.eval() before _unfuse_mixtral_for_modelopt, but the freshly constructed _MixtralSparseMoeBlockCompat defaults to training=True. That re-enables the router-input jitter branch during calibration/export when router_jitter_noise > 0, making FP8 amax nondeterministic. Mirror the original mlp's training mode after construction so a prior model.eval() is honored. Signed-off-by: Jonas Li <[email protected]>
|
/bot run |
|
PR_Github #48579 [ run ] triggered by Bot. Commit: |
|
PR_Github #48579 [ run ] completed with state |
…elopt export (NVIDIA#14179) Signed-off-by: Jonas Li <[email protected]>
The previous fix (PR #14027) added a sibling layer.block_sparse_moe with the legacy per-expert layout, but modelopt iterates decoder_layer.named_ children() and hits the original layer.mlp (still backed by transformers 5.x's MixtralExperts) first, so build_moe_config still calls len(module.experts) and raises TypeError.
Replace layer.mlp in place with a compat MixtralSparseMoeBlock whose experts is a ModuleList of nn.Linear-backed per-expert MLPs, sharing storage with the 5.x fused gate_up_proj/down_proj tensors via parameter views (zero-copy). The compat block's forward is mathematically identical to the original (bit-exact on fp32 sanity check). The class name retains "MixtralSparseMoeBlock" so modelopt's is_moe substring check still matches, and the per-expert nn.Linear modules let modelopt attach per-expert weight quantizers during calibration.
Verified locally on H100x2 with transformers 5.3.0 + nvidia-modelopt 0.37: without the fix, export reproduces the bug's exact stack (modelopt/torch/export/layer_utils.py:1347 -> TypeError 'MixtralExperts' has no len()); with the fix, FP8 calibration + export_tensorrt_llm_ checkpoint succeeds for both TP=1 and TP=2.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.