[TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4 by peihu-nv · Pull Request #11989 · NVIDIA/TensorRT-LLM

peihu-nv · 2026-03-06T18:39:25Z

Summary by CodeRabbit

Bug Fixes
- Fixed dimension calculation issues in DeepSeekV3 attention projection layers, eliminating oversized allocations.
- Improved weight handling logic in attention computation.
Refactor
- Simplified and optimized attention forward pass computation for better efficiency and maintainability.

Description

With model weight loading shared between DeepSeek R1/V3/V3.2, in existing main, indexer.wk was fused into kv_a_proj_with_mqa at load time via post_load_weights, forcing it to share the fused module's NVFP4 quantization. This can cause accuracy issues when serving DS V3.2 NVFP4 quantization with attention also quantized to NVFP4 because indexer.wk is more sensitive to quantization.

This change unfuses indexer.wk so it loads through the standard weight loading path but retains whatever precision the checkpoint provides. This is also a prerequisite for the later task, which will convert the indexer ops to TF32 and fuse indexer.wk with indexer.w_proj into a single TF32 kernel for a complete support of attention quantization in NVFP4 for DS V3.2.

Changes

Remove indexer.head_dim from kv_a_proj_with_mqa output dimension (2240 → 2112 for V3.2)
Compute indexer_k via separate indexer.wk(hidden_states) call instead of extracting it from the fused 4-way split
Remove post_load_weights (no longer needed since weights aren't fused)
Guard q_a_proj dtype check with is_lite since V3-Lite has no q_a_proj
Remove stale "oversized" comments from weight loading

Based on Binghan's closed draft PR #9776.

Test Coverage

test_sparse_mla_forward.py — 28 passed on B200, 17 passed on H100
test_short_seq_mha.py — all passed on both architectures
NVFP4 checkpoint inference (DeepSeek-V3.2, TP=4, B200) — coherent output confirmed
Added and tested with test_nvfp4_attn_multi_gpus with new accuracy section: MMLU 87.9%, GSM8K 95.2%
Corresponding checkpoint is uploaded to CI llm-models internally: DeepSeek-V3.2-NVFP4-FP4attn

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-03-06T18:44:12Z

📝 Walkthrough

Walkthrough

The changes refactor DeepSeek V3 attention handling by removing the extra head dimension from the kv projection output, eliminating the post-load weight fusion method, and moving indexer computation from the weight loading phase to explicit separate calls during the forward pass.

Changes

Cohort / File(s)	Summary
DeepSeekV3 Attention Configuration `tensorrt_llm/_torch/models/modeling_deepseekv3.py`	Removed `indexer.head_dim` from `kv_a_proj_with_mqa` output feature size in DeepseekV32Attention initialization. Updated NVFP4_e2m1x2 fusion logic to require both `kv_a_proj_with_mqa` and `q_a_proj` quantization only for non-Lite models. Removed `post_load_weights` method from DeepseekV3WeightLoader and associated comments about post-load weight fusion.
Attention Forward Implementation `tensorrt_llm/_torch/modules/attention.py`	Modified `forward_impl_with_dsa` to unpack three tensors instead of four from `kv_a_proj_with_mqa` split. Moved `indexer_k` computation to an explicit separate call via `self.indexer.wk(hidden_states)` rather than extracting it from the projection split.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: unfusing indexer.wk from the attention GEMM for DeepSeek-V3.2 NVFP4 quantization.
Description check	✅ Passed	The PR description thoroughly explains the issue, solution, test coverage, and adheres to the template with all required sections completed and checked.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can disable the changed files summary in the walkthrough.

Disable the reviews.changed_files_summary setting to disable the changed files summary in the walkthrough.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 1486-1488: The unconditional call to indexer.wk (producing
indexer_k) is executed even on the short-MHA skip path; change the code so
indexer_k is only computed when indexer routing is actually used (i.e., when not
skipping short-MHA or when num_generations > 0). Concretely, wrap or move the
indexer_k = self.indexer.wk(hidden_states) invocation behind the same condition
that decides routing (referencing use_short_mha_for_ctx, num_generations and the
short-MHA skip branch) so the projection is avoided on the short-MHA bypass
path; ensure downstream uses of indexer_k remain valid by computing it only in
the branch that requires it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23695228-c908-4822-b182-e7a2c6bc9ffe

📥 Commits

Reviewing files that changed from the base of the PR and between 5918348 and 88a637a.

📒 Files selected for processing (2)

tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/modules/attention.py

…V3.2 NVFP4 Signed-off-by: peihengh <[email protected]>

pengbowang-nv

Attention part LGTM

peihu-nv · 2026-03-09T16:12:15Z

/bot run

tensorrt-cicd · 2026-03-09T16:18:09Z

PR_Github #38292 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

tensorrt-cicd · 2026-03-09T17:04:29Z

PR_Github #38292 [ run ] completed with state FAILURE. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29671 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-09T17:28:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-09T17:35:40Z

PR_Github #38305 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

tensorrt-cicd · 2026-03-09T21:51:50Z

PR_Github #38305 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29680 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-09T22:16:25Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-09T22:23:18Z

PR_Github #38328 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

tensorrt-cicd · 2026-03-09T23:59:36Z

PR_Github #38328 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29704 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-10T00:33:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-10T00:40:48Z

PR_Github #38340 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

tensorrt-cicd · 2026-03-10T02:28:06Z

PR_Github #38340 [ run ] completed with state FAILURE. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29714 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-10T16:04:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-10T16:11:49Z

PR_Github #38458 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

tensorrt-cicd · 2026-03-10T17:18:26Z

PR_Github #38458 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29814 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-10T18:44:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-10T18:52:31Z

PR_Github #38486 [ run ] triggered by Bot. Commit: 7eaa7ae Link to invocation

tensorrt-cicd · 2026-03-11T00:51:21Z

PR_Github #38486 [ run ] completed with state SUCCESS. Commit: 7eaa7ae
/LLM/main/L0_MergeRequest_PR pipeline #29837 completed with status: 'SUCCESS'

Link to invocation

Signed-off-by: peihengh <[email protected]>

peihu-nv · 2026-03-13T20:19:00Z

/bot run

tensorrt-cicd · 2026-03-13T20:26:10Z

PR_Github #38898 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

tensorrt-cicd · 2026-03-13T22:47:19Z

PR_Github #38898 [ run ] completed with state FAILURE. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30206 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-13T22:50:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T22:58:37Z

PR_Github #38919 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

tensorrt-cicd · 2026-03-14T03:50:22Z

PR_Github #38919 [ run ] completed with state FAILURE. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30224 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-03-16T16:03:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-16T16:11:17Z

PR_Github #39104 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

tensorrt-cicd · 2026-03-16T19:20:50Z

PR_Github #39104 [ run ] completed with state SUCCESS. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30365 completed with status: 'SUCCESS'

CI Report

Link to invocation

tburt-nv · 2026-03-16T22:24:14Z

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

tensorrt-cicd · 2026-03-16T22:32:10Z

PR_Github #39133 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

tensorrt-cicd · 2026-03-17T01:00:34Z

PR_Github #39133 [ run ] completed with state SUCCESS. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30392 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

longlee0622 · 2026-03-17T02:08:53Z

/bot skip --comment "already has a green CI 6 hours ago"

tensorrt-cicd · 2026-03-17T02:16:21Z

PR_Github #39160 [ skip ] triggered by Bot. Commit: 2217c6d Link to invocation

tensorrt-cicd · 2026-03-17T03:32:08Z

PR_Github #39160 [ skip ] completed with state SUCCESS. Commit: 2217c6d
Skipping testing for commit 2217c6d

Link to invocation

… NVFP4 (NVIDIA#11989) Signed-off-by: peihengh <[email protected]>

peihu-nv requested review from a team as code owners March 6, 2026 18:39

peihu-nv requested review from Wanli-Jiang, hlu1 and pengbowang-nv March 6, 2026 18:39

github-actions Bot assigned peihu-nv Mar 6, 2026

coderabbitai Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/modules/attention.py Outdated

[TRTLLM-9521][feat] Unfuse indexer.wk from kv_a_proj_with_mqa for DS-…

9c95ca0

…V3.2 NVFP4 Signed-off-by: peihengh <[email protected]>

peihu-nv force-pushed the feat/trtllm-9521-dsv32-nvfp4-unfuse-wlk branch from 88a637a to 9c95ca0 Compare March 6, 2026 21:11

Tabrizian requested a review from kaiyux March 7, 2026 05:30

pengbowang-nv approved these changes Mar 9, 2026

View reviewed changes

peihu-nv mentioned this pull request Mar 9, 2026

[TRTLLM-11285][feat] Fuse indexer wk + weights_proj into single GEMM in TF32 for DS-V3.2 #12055

Merged

5 tasks

Merge branch 'main' into feat/trtllm-9521-dsv32-nvfp4-unfuse-wlk

7eaa7ae

kaiyux approved these changes Mar 11, 2026

View reviewed changes

Add test NVFP4 attention with CI model and accuracy reference

2217c6d

Signed-off-by: peihengh <[email protected]>

peihu-nv requested a review from a team as a code owner March 13, 2026 18:37

tburt-nv approved these changes Mar 16, 2026

View reviewed changes

longlee0622 enabled auto-merge (squash) March 17, 2026 02:08

crazydemo reviewed Mar 17, 2026

View reviewed changes

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml

longlee0622 merged commit 2f45640 into NVIDIA:main Mar 17, 2026
7 checks passed

limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

[TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2…

f656a88

… NVFP4 (NVIDIA#11989) Signed-off-by: peihengh <[email protected]>

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2…

f19c994

… NVFP4 (NVIDIA#11989) Signed-off-by: peihengh <[email protected]>

Conversation

peihu-nv commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Changes

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pengbowang-nv left a comment

Choose a reason for hiding this comment

Uh oh!

peihu-nv commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

peihu-nv commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

peihu-nv commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

peihu-nv commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 10, 2026

Uh oh!

peihu-nv commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 10, 2026

Uh oh!

peihu-nv commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 10, 2026

Uh oh!

tensorrt-cicd commented Mar 11, 2026

Uh oh!

peihu-nv commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

peihu-nv commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 14, 2026

Uh oh!

peihu-nv commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

tensorrt-cicd commented Mar 16, 2026

Uh oh!

tburt-nv commented Mar 16, 2026

Uh oh!

peihu-nv commented Mar 6, 2026 •

edited

Loading

coderabbitai Bot commented Mar 6, 2026 •

edited

Loading