[None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA by dmtri35 · Pull Request #12010 · NVIDIA/TensorRT-LLM

dmtri35 · 2026-03-08T16:37:36Z

Summary by CodeRabbit

New Features
- Added indexer_rope_interleave configuration option for controlling rotational embedding interleaving behavior in attention operations.
Bug Fixes
- Improved safety checks for block table access to prevent out-of-bounds errors.
- Added kernel synchronization to prevent potential deadlocks in concurrent stream execution.
Improvements
- Enhanced memory pool management for sparse attention operations.
- Refined padding and alignment safeguards for more robust block table handling.

Description

Important fixes that we landed internally on the way to productionize GLM5:

Support indexer host cache for DSA.
Fix a fairly often hang at fp8_mqa_logits due to it being a persistent kernel, so we added a sync before launching it.
Fix a quality issue that comes up when running MTP + overlap scheduler + with DSA it generate garbled outputs after a while.
Support indexer_rope_interleave (hard requirement for GLM5 or it generates garbage after a while)

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Tri Dao <[email protected]>

coderabbitai · 2026-03-08T16:46:28Z

📝 Walkthrough

Walkthrough

These changes introduce an indexer_rope_interleave flag throughout the attention backend, refactor KV cache block indexing to use memory pool-derived indices, and add block table padding safety mechanisms. Includes block clamping in kernel code, synchronization safeguards, and RoPE embedding configuration adjustments.

Changes

Cohort / File(s)	Summary
Sparse Attention Core Logic `tensorrt_llm/_torch/attention_backend/sparse/dsa.py`	Added `_get_pool_block_indices()` method for computing memory pool block indices. Refactored block table construction and index population to use pool-derived indices. Introduced `indexer_rope_interleave` flag to `Indexer` and `DSATrtllmAttention.__init__()`. Updated `RotaryEmbedding` initialization to depend on `is_neox=not indexer_rope_interleave`. Enhanced `on_update_kv_lens()` with slot mapping recalculation. Added `torch.cuda.synchronize()` calls for prefill and decode paths. Expanded block table safety with -1 padding and clamping.
Sparse Attention Kernel Safety `tensorrt_llm/_torch/attention_backend/sparse/kernel.py`	Added safety clamp for negative token values to prevent out-of-bounds block_table accesses. Updated block_id and inblock_off computations to use clamped values. Tightened loading mask condition and expanded invalid-output logic to consider block_table padding (base < 0) alongside token invalidity.
Attention Factory and Module Integration `tensorrt_llm/_torch/attention_backend/utils.py`, `tensorrt_llm/_torch/modules/attention.py`	Added `indexer_rope_interleave: bool = False` parameter to `create_attention()` function and propagated it through to attention backend construction in both utils.py and MLA module integration in attention.py.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The PR description is incomplete and missing critical required sections. The Test Coverage section is empty, and the PR Checklist is not properly filled out with meaningful responses.	Complete the Test Coverage section by listing relevant tests that safeguard these changes. Provide detailed explanations for each checklist item (coding guidelines compliance, test case details, dependency scans, documentation updates) rather than just checking a box. Add the PR title in the required format [ticket/issue][type] Summary.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main changes: reliability fixes for MTP with DSA and host cache offload support for DSA.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1435-1439: The single-pass fallback that calls fp8_mqa_logits when
metadata.indexer_prefill_chunks is None isn't being synchronized and can
deadlock; modify the control flow so both the chunked prefill branch and the
non-chunked (single-pass) branch route through the same synchronized launch
path: before invoking fp8_mqa_logits (the persistent kernel), call
torch.cuda.synchronize() (or reuse the existing synchronization block) so the
persistent kernel launch is always preceded by the stream drain used for the
chunked path; update any conditionals around metadata.indexer_prefill_chunks to
reuse the synchronized launch logic rather than duplicating an unsynchronized
call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4c777f2-fbf5-4564-989e-669709f5b5bb

📥 Commits

Reviewing files that changed from the base of the PR and between 69b6203 and 994c3cd.

📒 Files selected for processing (4)

tensorrt_llm/_torch/attention_backend/sparse/dsa.py
tensorrt_llm/_torch/attention_backend/sparse/kernel.py
tensorrt_llm/_torch/attention_backend/utils.py
tensorrt_llm/_torch/modules/attention.py

nvxuanyuc · 2026-03-09T16:36:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-09T16:44:34Z

PR_Github #38298 [ run ] triggered by Bot. Commit: 994c3cd Link to invocation

tensorrt-cicd · 2026-03-09T21:24:28Z

PR_Github #38298 [ run ] completed with state SUCCESS. Commit: 994c3cd
/LLM/main/L0_MergeRequest_PR pipeline #29676 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

NVShreyas · 2026-03-10T00:25:06Z

Will add these fixes if not already there in this PR #11990

chang-l

Should we add a test to validate DSA host-offloading for both pools (MLA KV cache + indexer-K cache)?

Signed-off-by: Tri Dao <[email protected]>

dmtri35 · 2026-03-11T03:00:48Z

I will remove the torch.cuda.synchronize() call to unblock merging. Our fork is a couple RCs behind so it might have already got fixed in master (maybe https://github.com/NVIDIA/TensorRT-LLM/pull/11916/changes considering the indexer takes up a lot of mem)

…prod-fixes Signed-off-by: Tri Dao <[email protected]>

nvxuanyuc · 2026-03-11T06:14:50Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-11T06:21:51Z

PR_Github #38545 [ run ] triggered by Bot. Commit: 4ab4fd8 Link to invocation

tensorrt-cicd · 2026-03-11T13:07:16Z

PR_Github #38545 [ run ] completed with state SUCCESS. Commit: 4ab4fd8
/LLM/main/L0_MergeRequest_PR pipeline #29889 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Tri Dao <[email protected]>

nvxuanyuc · 2026-03-11T17:14:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-11T17:22:13Z

PR_Github #38616 [ run ] triggered by Bot. Commit: 8b8f40c Link to invocation

tensorrt-cicd · 2026-03-12T02:51:37Z

PR_Github #38616 [ run ] completed with state SUCCESS. Commit: 8b8f40c
/LLM/main/L0_MergeRequest_PR pipeline #29950 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Tri Dao <[email protected]>

…prod-fixes Signed-off-by: Tri Dao <[email protected]>

dmtri35 · 2026-03-12T06:28:53Z

/bot run --disable-fail-fast

nvxuanyuc · 2026-03-12T06:50:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-12T06:57:02Z

PR_Github #38685 [ run ] triggered by Bot. Commit: 2b6776f Link to invocation

tensorrt-cicd · 2026-03-12T13:50:08Z

PR_Github #38685 [ run ] completed with state SUCCESS. Commit: 2b6776f
/LLM/main/L0_MergeRequest_PR pipeline #30007 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

nvxuanyuc · 2026-03-12T20:11:44Z

/bot run

tensorrt-cicd · 2026-03-12T20:20:41Z

PR_Github #38774 [ run ] triggered by Bot. Commit: 2b6776f Link to invocation

tensorrt-cicd · 2026-03-13T00:14:03Z

PR_Github #38774 [ run ] completed with state SUCCESS. Commit: 2b6776f
/LLM/main/L0_MergeRequest_PR pipeline #30088 completed with status: 'SUCCESS'

CI Report

Link to invocation

nv-guomingz

LGTM for llm api part.

Add test_dsa_host_cache_offload[host_cache_offload] and test_dsa_host_cache_offload[host_cache_offload_mtp1] from TestDeepSeekV32 to QA llm_function_core.txt and CI test-db for DGX B200 and H200 (8-GPU configurations). Resolves review comment on PR NVIDIA#12010. Signed-off-by: Jonas Li <[email protected]>

… offload for DSA (NVIDIA#12010) Signed-off-by: Tri Dao <[email protected]>

dmtri35 added 4 commits March 8, 2026 16:27

proper fix for indexer rope interleave

0fa265c

Signed-off-by: Tri Dao <[email protected]>

support indexer host cache

4ce37a4

Signed-off-by: Tri Dao <[email protected]>

fix mtp indexer hang

53e7e40

Signed-off-by: Tri Dao <[email protected]>

fix mtp + indexer quality degradation

994c3cd

Signed-off-by: Tri Dao <[email protected]>

dmtri35 requested a review from a team as a code owner March 8, 2026 16:37

dmtri35 requested a review from brb-nv March 8, 2026 16:37

coderabbitai Bot reviewed Mar 8, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 8, 2026

nvxuanyuc requested a review from chang-l March 9, 2026 23:06

lfr-0531 reviewed Mar 10, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated

yuxianq reviewed Mar 10, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated

yuxianq reviewed Mar 10, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated

chang-l reviewed Mar 10, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/attention_backend/utils.py Outdated

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py

first pass resolving comments

019fb2c

Signed-off-by: Tri Dao <[email protected]>

dmtri35 requested review from a team as code owners March 11, 2026 01:56

dmtri35 requested review from hchings, shaharmor98 and yilin-void March 11, 2026 01:56

add dsa host cache offload integration test

62514de

Signed-off-by: Tri Dao <[email protected]>

dmtri35 requested a review from a team as a code owner March 11, 2026 02:02

rm torch.cuda.sync

72d0e96

Signed-off-by: Tri Dao <[email protected]>

Merge branch 'main' of github.com:NVIDIA/TensorRT-LLM into trid/glm5-…

4ab4fd8

…prod-fixes Signed-off-by: Tri Dao <[email protected]>

dmtri35 requested a review from lfr-0531 March 11, 2026 03:10

dmtri35 requested a review from yuxianq March 11, 2026 03:10

yuxianq approved these changes Mar 11, 2026

View reviewed changes

fix unit test

8b8f40c

Signed-off-by: Tri Dao <[email protected]>

chang-l approved these changes Mar 12, 2026

View reviewed changes

dmtri35 added 2 commits March 12, 2026 04:39

fix cuda graph capture

7428347

Signed-off-by: Tri Dao <[email protected]>

Merge branch 'main' of github.com:NVIDIA/TensorRT-LLM into trid/glm5-…

2b6776f

…prod-fixes Signed-off-by: Tri Dao <[email protected]>

lfr-0531 approved these changes Mar 13, 2026

View reviewed changes

StanleySun639 approved these changes Mar 13, 2026

View reviewed changes

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py

nv-guomingz approved these changes Mar 17, 2026

View reviewed changes

longlee0622 enabled auto-merge (squash) March 17, 2026 07:26

longlee0622 mentioned this pull request Mar 17, 2026

[None][test] Add DSA host cache offload tests to CI and QA test lists #12278

Merged

2 tasks

longlee0622 merged commit 402a056 into NVIDIA:main Mar 17, 2026
5 checks passed

limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

[None][fix] Reliability fixes for MTP with DSA and support host cache…

5737b23

… offload for DSA (NVIDIA#12010) Signed-off-by: Tri Dao <[email protected]>

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[None][fix] Reliability fixes for MTP with DSA and support host cache…

c734cb3

… offload for DSA (NVIDIA#12010) Signed-off-by: Tri Dao <[email protected]>

Conversation

dmtri35 commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nvxuanyuc commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

tensorrt-cicd commented Mar 9, 2026

Uh oh!

NVShreyas commented Mar 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chang-l left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dmtri35 commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nvxuanyuc commented Mar 11, 2026

Uh oh!

tensorrt-cicd commented Mar 11, 2026

Uh oh!

tensorrt-cicd commented Mar 11, 2026

Uh oh!

nvxuanyuc commented Mar 11, 2026

Uh oh!

tensorrt-cicd commented Mar 11, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

dmtri35 commented Mar 12, 2026

Uh oh!

nvxuanyuc commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

nvxuanyuc commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

Uh oh!

nv-guomingz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

dmtri35 commented Mar 8, 2026 •

edited

Loading

coderabbitai Bot commented Mar 8, 2026 •

edited

Loading

chang-l left a comment •

edited

Loading

dmtri35 commented Mar 11, 2026 •

edited

Loading