Skip to content

[None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA#12010

Merged
longlee0622 merged 11 commits into
NVIDIA:mainfrom
dmtri35:trid/glm5-prod-fixes
Mar 17, 2026
Merged

[None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA#12010
longlee0622 merged 11 commits into
NVIDIA:mainfrom
dmtri35:trid/glm5-prod-fixes

Conversation

@dmtri35
Copy link
Copy Markdown
Contributor

@dmtri35 dmtri35 commented Mar 8, 2026

Summary by CodeRabbit

  • New Features

    • Added indexer_rope_interleave configuration option for controlling rotational embedding interleaving behavior in attention operations.
  • Bug Fixes

    • Improved safety checks for block table access to prevent out-of-bounds errors.
    • Added kernel synchronization to prevent potential deadlocks in concurrent stream execution.
  • Improvements

    • Enhanced memory pool management for sparse attention operations.
    • Refined padding and alignment safeguards for more robust block table handling.

Description

Important fixes that we landed internally on the way to productionize GLM5:

  • Support indexer host cache for DSA.
  • Fix a fairly often hang at fp8_mqa_logits due to it being a persistent kernel, so we added a sync before launching it.
  • Fix a quality issue that comes up when running MTP + overlap scheduler + with DSA it generate garbled outputs after a while.
  • Support indexer_rope_interleave (hard requirement for GLM5 or it generates garbage after a while)

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dmtri35 dmtri35 requested a review from a team as a code owner March 8, 2026 16:37
@dmtri35 dmtri35 requested a review from brb-nv March 8, 2026 16:37
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 8, 2026

📝 Walkthrough

Walkthrough

These changes introduce an indexer_rope_interleave flag throughout the attention backend, refactor KV cache block indexing to use memory pool-derived indices, and add block table padding safety mechanisms. Includes block clamping in kernel code, synchronization safeguards, and RoPE embedding configuration adjustments.

Changes

Cohort / File(s) Summary
Sparse Attention Core Logic
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
Added _get_pool_block_indices() method for computing memory pool block indices. Refactored block table construction and index population to use pool-derived indices. Introduced indexer_rope_interleave flag to Indexer and DSATrtllmAttention.__init__(). Updated RotaryEmbedding initialization to depend on is_neox=not indexer_rope_interleave. Enhanced on_update_kv_lens() with slot mapping recalculation. Added torch.cuda.synchronize() calls for prefill and decode paths. Expanded block table safety with -1 padding and clamping.
Sparse Attention Kernel Safety
tensorrt_llm/_torch/attention_backend/sparse/kernel.py
Added safety clamp for negative token values to prevent out-of-bounds block_table accesses. Updated block_id and inblock_off computations to use clamped values. Tightened loading mask condition and expanded invalid-output logic to consider block_table padding (base < 0) alongside token invalidity.
Attention Factory and Module Integration
tensorrt_llm/_torch/attention_backend/utils.py, tensorrt_llm/_torch/modules/attention.py
Added indexer_rope_interleave: bool = False parameter to create_attention() function and propagated it through to attention backend construction in both utils.py and MLA module integration in attention.py.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description is incomplete and missing critical required sections. The Test Coverage section is empty, and the PR Checklist is not properly filled out with meaningful responses. Complete the Test Coverage section by listing relevant tests that safeguard these changes. Provide detailed explanations for each checklist item (coding guidelines compliance, test case details, dependency scans, documentation updates) rather than just checking a box. Add the PR title in the required format [ticket/issue][type] Summary.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main changes: reliability fixes for MTP with DSA and host cache offload support for DSA.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1435-1439: The single-pass fallback that calls fp8_mqa_logits when
metadata.indexer_prefill_chunks is None isn't being synchronized and can
deadlock; modify the control flow so both the chunked prefill branch and the
non-chunked (single-pass) branch route through the same synchronized launch
path: before invoking fp8_mqa_logits (the persistent kernel), call
torch.cuda.synchronize() (or reuse the existing synchronization block) so the
persistent kernel launch is always preceded by the stream drain used for the
chunked path; update any conditionals around metadata.indexer_prefill_chunks to
reuse the synchronized launch logic rather than duplicating an unsynchronized
call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4c777f2-fbf5-4564-989e-669709f5b5bb

📥 Commits

Reviewing files that changed from the base of the PR and between 69b6203 and 994c3cd.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/attention_backend/sparse/kernel.py
  • tensorrt_llm/_torch/attention_backend/utils.py
  • tensorrt_llm/_torch/modules/attention.py

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 8, 2026
@nvxuanyuc
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38298 [ run ] triggered by Bot. Commit: 994c3cd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38298 [ run ] completed with state SUCCESS. Commit: 994c3cd
/LLM/main/L0_MergeRequest_PR pipeline #29676 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nvxuanyuc nvxuanyuc requested a review from chang-l March 9, 2026 23:06
@NVShreyas
Copy link
Copy Markdown
Collaborator

Will add these fixes if not already there in this PR #11990

Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py
Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py Outdated
Copy link
Copy Markdown
Collaborator

@chang-l chang-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test to validate DSA host-offloading for both pools (MLA KV cache + indexer-K cache)?

Comment thread tensorrt_llm/_torch/attention_backend/utils.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/sparse/dsa.py
@dmtri35 dmtri35 requested review from a team as code owners March 11, 2026 01:56
@dmtri35 dmtri35 requested a review from a team as a code owner March 11, 2026 02:02
Signed-off-by: Tri Dao <[email protected]>
@dmtri35
Copy link
Copy Markdown
Contributor Author

dmtri35 commented Mar 11, 2026

I will remove the torch.cuda.synchronize() call to unblock merging. Our fork is a couple RCs behind so it might have already got fixed in master (maybe https://github.com/NVIDIA/TensorRT-LLM/pull/11916/changes considering the indexer takes up a lot of mem)

@dmtri35 dmtri35 requested a review from lfr-0531 March 11, 2026 03:10
@dmtri35 dmtri35 requested a review from yuxianq March 11, 2026 03:10
@nvxuanyuc
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38545 [ run ] triggered by Bot. Commit: 4ab4fd8 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38545 [ run ] completed with state SUCCESS. Commit: 4ab4fd8
/LLM/main/L0_MergeRequest_PR pipeline #29889 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Tri Dao <[email protected]>
@nvxuanyuc
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38616 [ run ] triggered by Bot. Commit: 8b8f40c Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38616 [ run ] completed with state SUCCESS. Commit: 8b8f40c
/LLM/main/L0_MergeRequest_PR pipeline #29950 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dmtri35
Copy link
Copy Markdown
Contributor Author

dmtri35 commented Mar 12, 2026

/bot run --disable-fail-fast

1 similar comment
@nvxuanyuc
Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38685 [ run ] triggered by Bot. Commit: 2b6776f Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38685 [ run ] completed with state SUCCESS. Commit: 2b6776f
/LLM/main/L0_MergeRequest_PR pipeline #30007 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nvxuanyuc
Copy link
Copy Markdown
Collaborator

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38774 [ run ] triggered by Bot. Commit: 2b6776f Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #38774 [ run ] completed with state SUCCESS. Commit: 2b6776f
/LLM/main/L0_MergeRequest_PR pipeline #30088 completed with status: 'SUCCESS'

CI Report

Link to invocation

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch.py
Copy link
Copy Markdown
Collaborator

@nv-guomingz nv-guomingz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for llm api part.

@longlee0622 longlee0622 enabled auto-merge (squash) March 17, 2026 07:26
longlee0622 added a commit to longlee0622/TensorRT-LLM that referenced this pull request Mar 17, 2026
Add test_dsa_host_cache_offload[host_cache_offload] and
test_dsa_host_cache_offload[host_cache_offload_mtp1] from
TestDeepSeekV32 to QA llm_function_core.txt and CI test-db
for DGX B200 and H200 (8-GPU configurations).

Resolves review comment on PR NVIDIA#12010.

Signed-off-by: Jonas Li <[email protected]>
@longlee0622 longlee0622 merged commit 402a056 into NVIDIA:main Mar 17, 2026
5 checks passed
longlee0622 added a commit to longlee0622/TensorRT-LLM that referenced this pull request Mar 17, 2026
Add test_dsa_host_cache_offload[host_cache_offload] and
test_dsa_host_cache_offload[host_cache_offload_mtp1] from
TestDeepSeekV32 to QA llm_function_core.txt and CI test-db
for DGX B200 and H200 (8-GPU configurations).

Resolves review comment on PR NVIDIA#12010.

Signed-off-by: Jonas Li <[email protected]>
longlee0622 added a commit to longlee0622/TensorRT-LLM that referenced this pull request Mar 18, 2026
Add test_dsa_host_cache_offload[host_cache_offload] and
test_dsa_host_cache_offload[host_cache_offload_mtp1] from
TestDeepSeekV32 to QA llm_function_core.txt and CI test-db
for DGX B200 and H200 (8-GPU configurations).

Resolves review comment on PR NVIDIA#12010.

Signed-off-by: Jonas Li <[email protected]>
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.