Skip to content

[TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4#11989

Merged
longlee0622 merged 3 commits into
NVIDIA:mainfrom
peihu-nv:feat/trtllm-9521-dsv32-nvfp4-unfuse-wlk
Mar 17, 2026
Merged

[TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4#11989
longlee0622 merged 3 commits into
NVIDIA:mainfrom
peihu-nv:feat/trtllm-9521-dsv32-nvfp4-unfuse-wlk

Conversation

@peihu-nv
Copy link
Copy Markdown
Collaborator

@peihu-nv peihu-nv commented Mar 6, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed dimension calculation issues in DeepSeekV3 attention projection layers, eliminating oversized allocations.
    • Improved weight handling logic in attention computation.
  • Refactor

    • Simplified and optimized attention forward pass computation for better efficiency and maintainability.

Description

With model weight loading shared between DeepSeek R1/V3/V3.2, in existing main, indexer.wk was fused into kv_a_proj_with_mqa at load time via post_load_weights, forcing it to share the fused module's NVFP4 quantization. This can cause accuracy issues when serving DS V3.2 NVFP4 quantization with attention also quantized to NVFP4 because indexer.wk is more sensitive to quantization.

This change unfuses indexer.wk so it loads through the standard weight loading path but retains whatever precision the checkpoint provides. This is also a prerequisite for the later task, which will convert the indexer ops to TF32 and fuse indexer.wk with indexer.w_proj into a single TF32 kernel for a complete support of attention quantization in NVFP4 for DS V3.2.

Changes

  • Remove indexer.head_dim from kv_a_proj_with_mqa output dimension (2240 → 2112 for V3.2)
  • Compute indexer_k via separate indexer.wk(hidden_states) call instead of extracting it from the fused 4-way split
  • Remove post_load_weights (no longer needed since weights aren't fused)
  • Guard q_a_proj dtype check with is_lite since V3-Lite has no q_a_proj
  • Remove stale "oversized" comments from weight loading

Based on Binghan's closed draft PR #9776.

Test Coverage

  • test_sparse_mla_forward.py — 28 passed on B200, 17 passed on H100
  • test_short_seq_mha.py — all passed on both architectures
  • NVFP4 checkpoint inference (DeepSeek-V3.2, TP=4, B200) — coherent output confirmed
  • Added and tested with test_nvfp4_attn_multi_gpus with new accuracy section: MMLU 87.9%, GSM8K 95.2%
  • Corresponding checkpoint is uploaded to CI llm-models internally: DeepSeek-V3.2-NVFP4-FP4attn

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@peihu-nv peihu-nv requested review from a team as code owners March 6, 2026 18:39
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 6, 2026

📝 Walkthrough

Walkthrough

The changes refactor DeepSeek V3 attention handling by removing the extra head dimension from the kv projection output, eliminating the post-load weight fusion method, and moving indexer computation from the weight loading phase to explicit separate calls during the forward pass.

Changes

Cohort / File(s) Summary
DeepSeekV3 Attention Configuration
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Removed indexer.head_dim from kv_a_proj_with_mqa output feature size in DeepseekV32Attention initialization. Updated NVFP4_e2m1x2 fusion logic to require both kv_a_proj_with_mqa and q_a_proj quantization only for non-Lite models. Removed post_load_weights method from DeepseekV3WeightLoader and associated comments about post-load weight fusion.
Attention Forward Implementation
tensorrt_llm/_torch/modules/attention.py
Modified forward_impl_with_dsa to unpack three tensors instead of four from kv_a_proj_with_mqa split. Moved indexer_k computation to an explicit separate call via self.indexer.wk(hidden_states) rather than extracting it from the projection split.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: unfusing indexer.wk from the attention GEMM for DeepSeek-V3.2 NVFP4 quantization.
Description check ✅ Passed The PR description thoroughly explains the issue, solution, test coverage, and adheres to the template with all required sections completed and checked.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable the changed files summary in the walkthrough.

Disable the reviews.changed_files_summary setting to disable the changed files summary in the walkthrough.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/attention.py`:
- Around line 1486-1488: The unconditional call to indexer.wk (producing
indexer_k) is executed even on the short-MHA skip path; change the code so
indexer_k is only computed when indexer routing is actually used (i.e., when not
skipping short-MHA or when num_generations > 0). Concretely, wrap or move the
indexer_k = self.indexer.wk(hidden_states) invocation behind the same condition
that decides routing (referencing use_short_mha_for_ctx, num_generations and the
short-MHA skip branch) so the projection is avoided on the short-MHA bypass
path; ensure downstream uses of indexer_k remain valid by computing it only in
the branch that requires it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23695228-c908-4822-b182-e7a2c6bc9ffe

📥 Commits

Reviewing files that changed from the base of the PR and between 5918348 and 88a637a.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/attention.py

Comment thread tensorrt_llm/_torch/modules/attention.py Outdated
@peihu-nv peihu-nv force-pushed the feat/trtllm-9521-dsv32-nvfp4-unfuse-wlk branch from 88a637a to 9c95ca0 Compare March 6, 2026 21:11
@Tabrizian Tabrizian requested a review from kaiyux March 7, 2026 05:30
Copy link
Copy Markdown
Collaborator

@pengbowang-nv pengbowang-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attention part LGTM

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Mar 9, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38292 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38292 [ run ] completed with state FAILURE. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29671 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Mar 9, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38305 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38305 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29680 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Mar 9, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38328 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38328 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29704 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38340 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38340 [ run ] completed with state FAILURE. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29714 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38458 [ run ] triggered by Bot. Commit: 9c95ca0 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38458 [ run ] completed with state SUCCESS. Commit: 9c95ca0
/LLM/main/L0_MergeRequest_PR pipeline #29814 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38486 [ run ] triggered by Bot. Commit: 7eaa7ae Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38486 [ run ] completed with state SUCCESS. Commit: 7eaa7ae
/LLM/main/L0_MergeRequest_PR pipeline #29837 completed with status: 'SUCCESS'

Link to invocation

@peihu-nv peihu-nv requested a review from a team as a code owner March 13, 2026 18:37
@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38898 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38898 [ run ] completed with state FAILURE. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30206 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38919 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38919 [ run ] completed with state FAILURE. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30224 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39104 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39104 [ run ] completed with state SUCCESS. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30365 completed with status: 'SUCCESS'

CI Report

Link to invocation

@tburt-nv
Copy link
Copy Markdown
Collaborator

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39133 [ run ] triggered by Bot. Commit: 2217c6d Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39133 [ run ] completed with state SUCCESS. Commit: 2217c6d
/LLM/main/L0_MergeRequest_PR pipeline #30392 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@longlee0622 longlee0622 enabled auto-merge (squash) March 17, 2026 02:08
@longlee0622
Copy link
Copy Markdown
Collaborator

/bot skip --comment "already has a green CI 6 hours ago"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39160 [ skip ] triggered by Bot. Commit: 2217c6d Link to invocation

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39160 [ skip ] completed with state SUCCESS. Commit: 2217c6d
Skipping testing for commit 2217c6d

Link to invocation

@longlee0622 longlee0622 merged commit 2f45640 into NVIDIA:main Mar 17, 2026
7 checks passed
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants