[None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA#12010
Signed-off-by: Tri Dao <[email protected]>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 1435-1439: The single-pass fallback that calls fp8_mqa_logits when
metadata.indexer_prefill_chunks is None isn't being synchronized and can
deadlock; modify the control flow so both the chunked prefill branch and the
non-chunked (single-pass) branch route through the same synchronized launch
path: before invoking fp8_mqa_logits (the persistent kernel), call
torch.cuda.synchronize() (or reuse the existing synchronization block) so the
persistent kernel launch is always preceded by the stream drain used for the
chunked path; update any conditionals around metadata.indexer_prefill_chunks to
reuse the synchronized launch logic rather than duplicating an unsynchronized
call.
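The control flow suggested in this comment can be sketched as follows. This is a minimal illustration, not the code in `dsa.py`: `launch_indexer_logits`, `synchronize`, and `run_kernel` are hypothetical names, and the callables stand in for `torch.cuda.synchronize` and a wrapper around `fp8_mqa_logits`. Whether the real fix synchronizes once or before every chunk is not stated here; the sketch assumes a sync before every launch.

```python
from typing import Callable, Optional, Sequence


def launch_indexer_logits(
    prefill_chunks: Optional[Sequence[object]],
    synchronize: Callable[[], None],
    run_kernel: Callable[[Optional[object]], None],
) -> None:
    """Launch the logits kernel once per chunk, or once when chunking is off.

    Hypothetical helper: both branches share the same synchronized launch path.
    """
    # None means "no chunking": model it as a single pass over the whole input.
    work_items = [None] if prefill_chunks is None else list(prefill_chunks)
    for item in work_items:
        # Assumption: drain outstanding work before every persistent-kernel launch.
        synchronize()
        run_kernel(item)


# Record the call order with plain Python stand-ins (no GPU required).
calls = []
launch_indexer_logits(
    None,
    synchronize=lambda: calls.append("sync"),
    run_kernel=lambda chunk: calls.append(("kernel", chunk)),
)
```

Because the single-pass case is modeled as a one-element work list, there is no second, unsynchronized call site left to forget.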
📒 Files selected for processing (4)
- `tensorrt_llm/_torch/attention_backend/sparse/dsa.py`
- `tensorrt_llm/_torch/attention_backend/sparse/kernel.py`
- `tensorrt_llm/_torch/attention_backend/utils.py`
- `tensorrt_llm/_torch/modules/attention.py`
/bot run --disable-fail-fast
PR_Github #38298 [ run ] triggered by Bot. Commit:
PR_Github #38298 [ run ] completed with state
Will add these fixes if not already there in this PR #11990
I will remove the
…prod-fixes Signed-off-by: Tri Dao <[email protected]>
/bot run --disable-fail-fast
PR_Github #38545 [ run ] triggered by Bot. Commit:
PR_Github #38545 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #38616 [ run ] triggered by Bot. Commit:
PR_Github #38616 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #38685 [ run ] triggered by Bot. Commit:
PR_Github #38685 [ run ] completed with state
/bot run
PR_Github #38774 [ run ] triggered by Bot. Commit:
PR_Github #38774 [ run ] completed with state
nv-guomingz left a comment
LGTM for the LLM API part.
Add test_dsa_host_cache_offload[host_cache_offload] and test_dsa_host_cache_offload[host_cache_offload_mtp1] from TestDeepSeekV32 to QA llm_function_core.txt and CI test-db for DGX B200 and H200 (8-GPU configurations). Resolves review comment on PR NVIDIA#12010. Signed-off-by: Jonas Li <[email protected]>
… offload for DSA (NVIDIA#12010) Signed-off-by: Tri Dao <[email protected]>
Summary by CodeRabbit
New Features
- `indexer_rope_interleave` configuration option for controlling rotary embedding interleaving behavior in attention operations.
Bug Fixes
Improvements
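For readers unfamiliar with the term, the sketch below contrasts the two common rotary-embedding layouts that an interleave switch usually selects between. It is an illustration under assumptions: the exact semantics of `indexer_rope_interleave` in TensorRT-LLM are not described in this PR text, and `rope_rotate` with its `interleave` and `base` parameters is a hypothetical helper, not a library function.

```python
import math
from typing import List


def rope_rotate(x: List[float], position: int, interleave: bool,
                base: float = 10000.0) -> List[float]:
    """Apply a rotary embedding to one vector (hypothetical helper).

    interleave=True  rotates adjacent pairs (x[0], x[1]), (x[2], x[3]), ...
    interleave=False rotates split halves  (x[i], x[i + d/2]).
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    half = d // 2
    out = list(x)
    for i in range(half):
        # Standard RoPE frequency schedule for the i-th rotation pair.
        theta = position * base ** (-2.0 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = (2 * i, 2 * i + 1) if interleave else (i, i + half)
        out[a] = x[a] * cos_t - x[b] * sin_t
        out[b] = x[a] * sin_t + x[b] * cos_t
    return out


interleaved = rope_rotate([1.0, 0.0, 0.0, 0.0], position=1, interleave=True)
split_half = rope_rotate([1.0, 0.0, 0.0, 0.0], position=1, interleave=False)
```

The two layouts apply the same rotations but to different element pairs, so weights trained with one convention produce wrong results if run with the other; a configuration flag of this kind lets the runtime match the checkpoint.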
Description
Important fixes that we landed internally on the way to productionize GLM5:
`fp8_mqa_logits` due to it being a persistent kernel, so we added a sync before launching it.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.