[None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for Sparse Attention#12385

Merged
juney-nvidia merged 22 commits into
NVIDIA:mainfrom
longcheng-nv:feat/heuristic-indexer-topk
Apr 3, 2026

Conversation

@longcheng-nv
Collaborator

@longcheng-nv longcheng-nv commented Mar 20, 2026

Summary

  • Adds a heuristic-guided TopK decode path named GVR Top-K ("Guess-Verify-Refine") to the DSA (DeepSeek Sparse Attention) indexer. It exploits temporal correlation between consecutive decode steps, reusing the previous step's TopK indices as candidate hints to reduce search depth
  • Implements a 3-phase adaptive threshold search with 2048-bin parallel histogram (V2e kernel with OPT5/6/7 optimizations) in a standalone compilation unit (heuristic_topk.cuh)
  • Supports multi-row (MTP, next_n > 1) execution and CUDA Graph compatibility via per-layer metadata buffers with stable addresses
  • Coexists with upstream CuTE DSL TopK in a 3-way priority dispatch: heuristic → CuTE DSL → CUDA C++ radix sort
  • Disabled by default (enable_heuristic_topk=False); opt-in via DeepSeekSparseAttentionConfig(enable_heuristic_topk=True)
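
The guess-verify-refine idea in the first bullet can be sketched in plain Python. This is a simplified illustration, not the V2e kernel: it refines the threshold by bisection rather than with the 2048-bin histogram, and the names `gvr_topk`, `max_candidates`, and `max_refine` are hypothetical.

```python
def gvr_topk(logits, prev_topk, k, max_candidates, max_refine=32):
    """Return indices of the k largest logits, using prev_topk as hints."""
    n = len(logits)
    if n <= k:
        return list(range(n))

    # Guess: if the hints are good, the smallest hinted value sits close to
    # the true k-th largest value, so it makes a reasonable first threshold.
    hinted = [logits[i] for i in prev_topk if 0 <= i < n]
    lo, hi = min(logits), max(logits)
    threshold = min(hinted) if hinted else (lo + hi) / 2.0

    candidates = None
    for _ in range(max_refine):
        # Verify: count how many elements pass the current threshold.
        count = sum(1 for v in logits if v >= threshold)
        if k <= count <= max_candidates:
            candidates = [i for i, v in enumerate(logits) if v >= threshold]
            break
        # Refine: bisect toward a threshold with an acceptable count.
        if count < k:
            hi = threshold  # too strict, lower the threshold
        else:
            lo = threshold  # too loose, raise the threshold
        threshold = (lo + hi) / 2.0

    if candidates is None:
        # Fallback when refinement does not converge (for example, many ties).
        candidates = list(range(n))

    # Final selection runs over the small candidate set only.
    candidates.sort(key=lambda i: (-logits[i], i))
    return candidates[:k]
```

When the hints are accurate, the first verify pass already lands in the accepted range, so the full input is scanned once and only the candidate set is sorted.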

Commit Breakdown (11 commits)

  1. feat: Add heuristic pre-indexed candidate path to IndexerTopK decode kernel
  2. perf: Split heuristic TopK kernel into separate compilation unit
  3. perf: Reuse shared heuristic TopK kernel across decode paths
  4. feat: Extend heuristic TopK decode to multi-row (MTP) support
  5. perf: Refactor heuristic TopK to __noinline__ device function pattern
  6. perf: Use adaptive launch_bounds and restore sort threshold
  7. feat: Reuse previous TopK indices as heuristic hints for decode indexer
  8. fix: Fix heuristic TopK pre_idx offset and config propagation
  9. feat: CUDA-graph-safe heuristic TopK with per-layer metadata buffers
  10. feat: Add DSA sparse attention and MTP support to LongBench evaluation
  11. perf: Upgrade heuristic TopK kernel to V2e with three optimizations (OPT5/6/7)

Key Files

| File | Description |
| --- | --- |
| cpp/tensorrt_llm/kernels/heuristic_topk.cuh | Core V2e kernel: 3-phase adaptive threshold with 2048-bin histogram |
| cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu | Multi-row kernel wrapper |
| cpp/tensorrt_llm/kernels/indexerTopK.cu | Dispatch logic with heuristic eligibility check |
| tensorrt_llm/_torch/attention_backend/sparse/dsa.py | DSA backend: 3-way dispatch, per-layer CUDA Graph buffers |
| tensorrt_llm/_torch/model_config.py | enable_heuristic_topk config propagation |
| tensorrt_llm/llmapi/llm_args.py | DeepSeekSparseAttentionConfig.enable_heuristic_topk user option |

API

Enable via DeepSeekSparseAttentionConfig(enable_heuristic_topk=True) in LlmArgs.

Test plan

  • Unit tests: pytest tests/unittest/_torch/thop/parallel/test_indexer_topk.py (distribution-parameterized correctness tests)
  • Verify CUDA Graph capture/replay with heuristic path enabled
  • Kernel correctness: all output indices verified identical to torch.topk across all test configurations

Accuracy Evaluation (DeepSeek V3.2 NVFP4 on B200, 8×TP/EP, DSA sparse attention)

MMLU (14,042 questions)

| | Exp 1 | Exp 2 | Exp 3 | Average |
| --- | --- | --- | --- | --- |
| Default TRT-LLM indexer TopK | 87.51 | | | 87.51 |
| Heuristic-guided indexer TopK | 87.52 | 87.51 | 87.49 | 87.51 |

Delta: 0.00 — accuracy-neutral.

GSM8K (5-shot)

| | Exp 1 | Exp 2 | Exp 3 | Average |
| --- | --- | --- | --- | --- |
| Default TRT-LLM indexer TopK | 95.11 | | | 95.11 |
| Heuristic-guided indexer TopK | 95.03 | 95.49 | 95.19 | 95.24 |

Delta: +0.13 — accuracy-neutral (within noise).

GPQA-Diamond (CoT zero-shot)

| | Exp 1 | Exp 2 | Exp 3 | Average |
| --- | --- | --- | --- | --- |
| Default TRT-LLM indexer TopK | 77.27 | | | 77.27 |
| Heuristic-guided indexer TopK | 75.76 | 75.25 | 78.79 | 76.60 |

Delta: −0.67 — within GPQA-Diamond's ~3pt stderr.

LongBench v1 (8 runs, MTP=1)

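| | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Default TRT-LLM indexer TopK | 44.37 | 45.01 | 44.68 | 44.72 | 44.47 | 44.25 | 44.64 | 44.74 | 44.61 |
| Heuristic-guided indexer TopK | 44.00 | 44.46 | 44.18 | 44.67 | 44.52 | 43.73 | 44.42 | 44.29 | 44.28 |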

Delta: −0.33 — accuracy-neutral (< 0.75%).

LongBench v2 (215 questions, medium context length, MTP=1)

| | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Heuristic TopK | 49.30% | 50.23% | 48.37% | 49.77% | 47.91% | 49.12% |
| Default TopK | 50.70% | 49.77% | 52.09% | 47.91% | 47.44% | 49.58% |

Delta: −0.46% — LongBench v2 has high per-run variance (215 questions, range 47–52% even for baseline). With 5 matched runs per config, the delta is within expected noise.

Summary

| Benchmark | Default TopK | Heuristic TopK | Delta |
| --- | --- | --- | --- |
| MMLU (14K questions) | 87.51 | 87.51 | 0.00 |
| GSM8K (5-shot) | 95.11 | 95.24 | +0.13 |
| GPQA-Diamond (CoT) | 77.27 | 76.60 | −0.67 |
| LongBench v1 (8 runs) | 44.61 | 44.28 | −0.33 |
| LongBench v2 (medium context length, 5 runs) | 49.58% | 49.12% | −0.46% |

All benchmarks are accuracy-neutral within their respective noise margins. MMLU and GSM8K (large-N, deterministic) show effectively zero impact. GPQA and LongBench (small-N, high variance) show deltas within expected run-to-run noise.
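
As a rough check on the "within noise" reading for the small-N benchmarks, the single-run standard error of an accuracy score can be estimated with the binomial formula sqrt(p(1 − p)/n). The sketch below assumes each question is an independent pass/fail trial; the 198-question size for GPQA-Diamond is the benchmark's published size and is not stated in this PR.

```python
import math


def binomial_stderr_pct(accuracy_pct, num_questions):
    """Single-run standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_questions)


# LongBench v2: 215 questions (from the table heading), baseline average 49.58%.
longbench_v2_se = binomial_stderr_pct(49.58, 215)

# GPQA-Diamond: 198 questions (assumed, see lead-in), baseline 77.27%.
gpqa_se = binomial_stderr_pct(77.27, 198)

print(f"LongBench v2 stderr: {longbench_v2_se:.2f} points")
print(f"GPQA-Diamond stderr: {gpqa_se:.2f} points")
```

Under these assumptions both standard errors come out at roughly 3 points per run, several times larger than the −0.46 and −0.67 deltas reported above.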

Performance Results (B200)

The heuristic TopK micro-kernel (heuristicTopKMultiRowKernel) is benchmarked against the default radix-sort path (topKPerRowDecode) on single-row logits input (one CTA per row). All benchmark runs first verify output indices match torch.topk exactly.

Random input (near-normal distribution, y = 1 + 0.1 * N(0, 1))

| N (seq length) | Heuristic (ns) | Radix Sort (ns) | Speedup |
| --- | --- | --- | --- |
| 8,192 | 16,512 | 11,200 | 0.68x |
| 16,384 | 21,856 | 21,984 | 1.01x |
| 32,768 | 26,112 | 32,928 | 1.26x |
| 65,536 | 31,904 | 47,936 | 1.50x |
| 70,690 | 36,864 | 51,200 | 1.39x |
| 131,072 | 43,392 | 76,128 | 1.75x |

Heuristic kernel wins at N >= 16K; at shorter sequences, the radix sort is faster (expected — the heuristic's histogram overhead dominates at small N, which is why the dispatch threshold gates it).
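
The speedup column is the radix-sort latency divided by the heuristic latency. A small script, using only the numbers from the table above, reproduces the column and the crossover point; the helper name `speedup` is illustrative.

```python
# (N, heuristic_ns, radix_ns) copied from the random-input table.
ROWS = [
    (8192, 16512, 11200),
    (16384, 21856, 21984),
    (32768, 26112, 32928),
    (65536, 31904, 47936),
    (70690, 36864, 51200),
    (131072, 43392, 76128),
]


def speedup(heuristic_ns, radix_ns):
    """Speedup of the heuristic kernel relative to the radix-sort baseline."""
    return radix_ns / heuristic_ns


for n, heuristic_ns, radix_ns in ROWS:
    print(f"N={n:>7,}  speedup={speedup(heuristic_ns, radix_ns):.2f}x")

# Smallest measured N at which the heuristic kernel is at least as fast.
crossover = min(n for n, h, r in ROWS if speedup(h, r) >= 1.0)
print(f"crossover at N={crossover:,}")
```

At N = 16,384 the two kernels are within about 1% of each other, so the measured crossover is closer to parity than to a clear win.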

Realistic input (DeepSeek V3.2 decode logits, SWE-Bench ISL/OSL=64K/2K)

Profiled across 9 layers x 17 decode steps (N ~ 68.7K-70.7K):

| Metric | Value |
| --- | --- |
| Average speedup | 1.81x |
| Range | 1.15x – 2.36x |
| Radix sort baseline | ~52–57 us (stable across layers) |
| Heuristic kernel | ~24–46 us (varies by layer/step) |

The heuristic kernel achieves 1.81x average speedup over the radix sort on realistic DeepSeek V3.2 decode workloads at typical sequence lengths (~70K).

@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

Walkthrough

This PR introduces a heuristic top-K decoding mechanism that accelerates top-K selection by reusing previous step's top-K indices as hints. It adds new CUDA kernel implementations, extends the IndexerTopK interface to accept pre-computed indices, integrates with PyTorch operators and DSA sparse attention, and introduces comprehensive distribution-based tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kernel API Updates<br/>cpp/tensorrt_llm/kernels/IndexerTopK.h | Updated invokeIndexerTopKDecode signature to accept optional pre-index parameters (preIdx, preIdxStride, preIdxCount) for heuristic decoding path. |
| New Heuristic TopK Headers<br/>cpp/tensorrt_llm/kernels/heuristicTopKDecode.h, cpp/tensorrt_llm/kernels/heuristic_topk.cuh | Introduced header files defining CUDA kernel launcher launchHeuristicTopKDecode and low-level heuristic TopK implementation with histogram-based selection, threshold refinement, and candidate collection. |
| CUDA Kernel Implementations<br/>cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu | Added multi-row heuristic TopK decode launcher with dynamic shared memory configuration, scratch buffer management, and optional logits alignment for CUDA float4 requirements. |
| IndexerTopK Implementation<br/>cpp/tensorrt_llm/kernels/indexerTopK.cu | Extended invokeIndexerTopKDecode to conditionally invoke heuristic path when pre-index data is provided and conditions are met; introduced dynamic split-work threshold logic. |
| PyTorch Operator Integration<br/>cpp/tensorrt_llm/thop/IndexerTopKOp.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py | Updated indexer_topk_decode operator to accept optional pre_idx tensor and propagate to CUDA kernel; updated fake op registration. |
| DSA Sparse Attention Integration<br/>tensorrt_llm/_torch/attention_backend/sparse/dsa.py | Added heuristic TopK path in decode with per-layer persistent buffers that store and reuse each layer's TopK indices from the previous decode step; conditional staging and indexing logic. |
| Configuration and Model Setup<br/>tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/model_config.py | Added enable_heuristic_topk field to DeepSeekSparseAttentionConfig; wired flag through model initialization for Deepseek and GLM-MoE architectures. |
| Evaluation Script<br/>examples/longbench/eval_longbench_v1.py | Extended CLI with DSA-specific flags (--dsa_sparse, --enable_heuristic_topk, --dsa_index_topk), speculative decoding option (--mtp), and conditional KV cache dtype configuration. |
| Distribution-based Tests<br/>tests/unittest/_torch/thop/parallel/test_indexer_topk.py | Added parameterized correctness tests with distribution-shaped logits (beta, logistic, lognorm, weibull), MTP correlation structures, and heuristic pre_idx generation with configurable hit/fill ratios. |

Sequence Diagram

sequenceDiagram
    participant PyTorch as PyTorch Inference
    participant DSA as DSA Attention Layer
    participant Indexer as Indexer (sparse_attn_indexer)
    participant TopK as indexer_topk_decode Op
    participant Kernel as CUDA Kernels
    
    PyTorch->>DSA: Generate tokens (decode step)
    DSA->>Indexer: Call with use_custom_topk flag
    alt enable_heuristic_topk enabled
        Indexer->>Indexer: Derive pre_idx from heuristic_prev_topk<br/>(this layer's TopK stored at the previous decode step)
        Indexer->>TopK: Call with pre_idx parameter
        TopK->>Kernel: Dispatch to launchHeuristicTopKDecode
        Kernel->>Kernel: Check if N <= topK (simple case)<br/>else use heuristic path
        Kernel-->>TopK: Return top-K indices
        TopK-->>Indexer: top-K results
        Indexer->>Indexer: Update heuristic_prev_topk<br/>with current layer's topK
    else enable_heuristic_topk disabled
        Indexer->>TopK: Call with pre_idx=None
        TopK->>Kernel: Dispatch to standard path<br/>(insertion/radix/multi-block)
        Kernel-->>TopK: Return top-K indices
        TopK-->>Indexer: top-K results
    end
    Indexer-->>DSA: Sparse attention indices
    DSA-->>PyTorch: Attention output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • hyukn
  • yilin-void
  • litaotju
  • longlee0622
  • Kefeng-Duan
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.03% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main feature being added: a temporally-correlated heuristic-guided TopK decoder for sparse attention. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, providing detailed information on the feature, implementation, test coverage, and performance results. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/heuristic_topk.cuh`:
- Around line 751-776: The launcher launchHeuristicTopK is unsafe because
heuristicTopKJob/heuristicTopKKernel assumes the compile-time TOP_K (2048) but
launchHeuristicTopK accepts a runtime topK; either enforce and reject mismatched
sizes or propagate the runtime topK into the kernel/job. Fix by validating topK
against TOP_K at the top of launchHeuristicTopK and return cudaErrorInvalidValue
(or similar) if they differ, or modify the kernel/job interfaces
(heuristicTopKKernel and heuristicTopKJob) to accept and use the runtime topK
everywhere (remove compile-time-only assumptions/padding/thresholding) so no
writes occur past the provided outputValues/outputIndices.
- Around line 772-774: The explicit launch with a 64-bit index type
(launchHeuristicTopK<float, int64_t>) corrupts buffers because the kernel launch
reinterprets preIdx/outputIndices as int*/int const* and uses 32-bit element
sizes; fix by making the kernel and launch correctly use the index template
IdxT: either remove the int64_t instantiation and only use 32-bit indices, or
fully templatize the kernel/job on IdxT and change the reinterpret_casts and
buffer pointer types in the launch sites (e.g., heuristicTopKKernel launch and
the calls around the launchHeuristicTopK instantiations at the same block and at
lines ~779-782) so pointers, casts, and writes/reads use reinterpret_cast<IdxT
const*> / reinterpret_cast<IdxT*> and the correct element sizes.

In `@cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu`:
- Around line 85-98: The current process-wide static flag `configured` causes
`cudaFuncSetAttribute(heuristicTopKMultiRowKernel,
cudaFuncAttributeMaxDynamicSharedMemorySize, ...)` to be applied only once
across all GPUs; make this device-aware by replacing the single `configured`
with a per-device guard (e.g., a container keyed by the current device id
obtained via `cudaGetDevice`) or simply remove the guard and call
`cudaFuncSetAttribute` unconditionally when `smemSize > 48u * 1024u && smemSize
<= static_cast<size_t>(maxSmem)`; ensure you check `device` via `cudaGetDevice`
and set the per-device flag after a successful `cudaFuncSetAttribute` for
`heuristicTopKMultiRowKernel` to avoid suppressing opt-in on other GPUs.

In `@cpp/tensorrt_llm/thop/IndexerTopKOp.cpp`:
- Around line 74-85: The pre_idx handling only checks is_cuda() but must also
ensure it lives on the same CUDA device as logits to avoid passing a foreign
device pointer into invokeIndexerTopKDecode; update the pre_idx branch (the
block using preIdxTensor, preIdxPtr, preIdxStride, preIdxCount) to assert the
devices match (compare preIdxTensor.device() with logits.device() or
logits.device().index()) using TORCH_CHECK and a clear error message, so the
kernel always receives a pointer on the same CUDA device as logits.
- Around line 80-81: The TORCH_CHECK in IndexerTopKOp.cpp incorrectly allows
preIdxTensor.size(0) == numRows64 when next_n > 1, which leads
heuristicTopKMultiRowKernel to index hints by rowIdx / next_n and ignore per-row
hints; update the validation around preIdxTensor, next_n and numRows64 so that
when next_n > 1 the only accepted hint shape is preIdxTensor.size(0) * next_n ==
numRows64 (or equivalently preIdxTensor.size(0) == numRows64 / next_n), and only
allow the preIdxTensor.size(0) == numRows64 shorthand when next_n == 1; locate
and change the TORCH_CHECK that currently references preIdxTensor.size(0),
next_n and numRows64 (and any nearby comments) to enforce this stricter
condition so callers cannot silently pass per-row hints that will be ignored by
heuristicTopKMultiRowKernel.
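
The stricter shape rule suggested above can be written as a single condition, since `pre_idx_rows * next_n == num_rows` reduces to `pre_idx_rows == num_rows` when `next_n == 1`. A minimal Python sketch of that check follows; the function name is hypothetical and the real validation lives in the C++ binding as a `TORCH_CHECK`.

```python
def check_pre_idx_rows(pre_idx_rows, num_rows, next_n):
    """Validate the hint tensor's leading dimension; return the request count."""
    if next_n < 1:
        raise ValueError("next_n must be >= 1")
    if num_rows % next_n != 0:
        raise ValueError("num_rows must be a multiple of next_n")
    # One hint row per request: the kernel reads hints at row_idx // next_n,
    # so per-row hints would be silently ignored when next_n > 1.
    if pre_idx_rows * next_n != num_rows:
        raise ValueError(
            f"pre_idx has {pre_idx_rows} rows, expected {num_rows // next_n}"
        )
    return num_rows // next_n
```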

In `@examples/longbench/eval_longbench_v1.py`:
- Around line 389-409: Add upfront argument validation to reject combinations
that require PyTorch when the user selected the TensorRT backend: if
args.backend == "tensorrt" and (args.dsa_sparse is True or args.mtp > 0) then
raise a clear CLI error (e.g., parser.error or raise ValueError) before building
DeepSeekSparseAttentionConfig or MTPDecodingConfig; reference the symbols
args.dsa_sparse, DeepSeekSparseAttentionConfig, args.mtp, and MTPDecodingConfig
so the check runs prior to the blocks that construct those configs and prevents
LLM(...) from failing later.
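
One way to express this upfront check with `argparse` is sketched below. The flag names `--dsa_sparse` and `--mtp` come from the PR; the `--backend` choices, the defaults, and the function names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a reduced parser with only the flags relevant to the check."""
    parser = argparse.ArgumentParser(description="LongBench v1 evaluation (sketch)")
    # Assumed choices and defaults; the real script may differ.
    parser.add_argument("--backend", choices=["pytorch", "tensorrt"], default="pytorch")
    parser.add_argument("--dsa_sparse", action="store_true")
    parser.add_argument("--mtp", type=int, default=0)
    return parser


def parse_args(argv=None):
    """Parse arguments and reject combinations that need the PyTorch backend."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.backend == "tensorrt" and (args.dsa_sparse or args.mtp > 0):
        # parser.error prints the usage message and exits with status 2.
        parser.error("--dsa_sparse and --mtp require --backend pytorch")
    return args
```

Failing at parse time gives the user a usage message instead of a later error while the model is being constructed.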

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 533-540: The allocated buffer heuristic_prev_topk (created via
get_empty(...)) is being read/shifted before being seeded; reset newly allocated
or newly-assigned slots to -1 and ensure the shift logic (where staging += 1 is
applied) only increments entries that are >= 0 so padding/sentinel -1 stays as
“no hint”. Update the code paths that touch heuristic_prev_topk (and the same
pattern at the other occurrences) to explicitly initialize new request slots to
-1 after get_empty, and change the increment/shift logic to conditionally add 1
only for non-negative elements (e.g., mask with >=0 or use where/conditional
update) so stale -1 values never turn into candidate 0.

In `@tests/unittest/_torch/thop/parallel/test_indexer_topk.py`:
- Around line 313-367: The issue is that a block was accidentally dedented to
module scope so _run_cute_dsl_topk_test() returns immediately after seed setup,
making the rest of the test body run at import time and creating a duplicate
test name; fix it by re-indenting the displaced block so all code that generates
logits, runs assertions, and parametrized test logic is inside the
_run_cute_dsl_topk_test function (identify by the function name
_run_cute_dsl_topk_test and the seed setup lines), and remove or rename the
stray duplicate test_indexer_topk_decode_dist definition at module scope so
there is no F811 duplicate symbol.
- Around line 834-835: Clamp seq_lens to ensure no sequence is shorter than
next_n before computing row_ends: after calling generate_seq_lens(...) assign
seq_lens = seq_lens.clamp(min=next_n) so that row_ends = seq_lens[row_indices] -
next_n + next_n_offset + 1 cannot produce non-positive values; this prevents
generate_pre_idx from receiving non-positive valid_len and blowing up. Ensure
you update the variable used by subsequent logic (row_ends, generate_pre_idx) so
the clamp takes effect.
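
The effect of the clamp can be seen with a pure-Python version of the `row_ends` formula quoted above. The sketch assumes rows are laid out request-major with `next_n` consecutive rows per request; the function name and layout are illustrative rather than taken from the test file.

```python
def row_ends(seq_lens, next_n, clamp=True):
    """Per-row valid lengths: seq_len - next_n + next_n_offset + 1."""
    ends = []
    for seq_len in seq_lens:
        if clamp:
            # Equivalent of seq_lens.clamp(min=next_n) on a tensor.
            seq_len = max(seq_len, next_n)
        for next_n_offset in range(next_n):
            ends.append(seq_len - next_n + next_n_offset + 1)
    return ends


print(row_ends([1, 5], next_n=3, clamp=False))  # first row is non-positive
print(row_ends([1, 5], next_n=3))               # every row is at least 1
```

With the clamp in place the smallest possible value is `next_n - next_n + 0 + 1 = 1`, so every row has at least one valid position.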


📥 Commits

Reviewing files that changed from the base of the PR and between 7dd0865 and 09b44a6.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/IndexerTopK.h
  • cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu
  • cpp/tensorrt_llm/kernels/heuristicTopKDecode.h
  • cpp/tensorrt_llm/kernels/heuristic_topk.cuh
  • cpp/tensorrt_llm/kernels/indexerTopK.cu
  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
  • examples/longbench/eval_longbench_v1.py
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py

@longcheng-nv
Collaborator Author

/bot run

@longcheng-nv
Collaborator Author

/bot run

@longcheng-nv
Collaborator Author

/bot run

@longcheng-nv
Collaborator Author

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@longcheng-nv
Collaborator Author

/bot run

@HuiGao-NV
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39861 [ run ] triggered by Bot. Commit: 7de8fcf Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39861 [ run ] completed with state SUCCESS. Commit: 7de8fcf
/LLM/main/L0_MergeRequest_PR pipeline #31034 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@longcheng-nv
Collaborator Author

/bot run

@HuiGao-NV
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39901 [ run ] triggered by Bot. Commit: 7de8fcf Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39901 [ run ] completed with state SUCCESS. Commit: 7de8fcf
/LLM/main/L0_MergeRequest_PR pipeline #31070 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@longcheng-nv longcheng-nv force-pushed the feat/heuristic-indexer-topk branch 3 times, most recently from 3ef3244 to 5f02778 Compare March 24, 2026 03:01
@longcheng-nv
Collaborator Author

/bot run

…e indexer

Enable the heuristic TopK decode path by persisting each layer's
previous-step TopK indices and passing them as pre_idx hints to
indexer_topk_decode. Consecutive decode steps have nearly identical
attention patterns, so prior indices bootstrap a better initial
threshold and reduce interpolation iterations in the heuristic kernel.

Key changes:
- Add enable_heuristic_topk config field (default False) to
  DeepSeekSparseAttentionConfig for opt-in activation.
- Per-layer lazy-allocated buffers in Indexer for TopK index persistence
  with request-id-based slot management and stale-request cleanup.
- Save prefill last-token TopK as seed for first decode step; save
  decode last-MTP-position TopK for subsequent steps.
- Fix MTP preIdx row indexing in heuristic kernel (rowIdx/next_n).
- Relax pre_idx size check in thop binding for per-request shape.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
Two fixes for the heuristic TopK decode path:

1. Add +1 offset to pre_idx in _gather_prev_topk_for_decode: the saved
   TopK came from a query at position P (last MTP pos of previous step),
   while the current step's first query is at P+1. Shifting by +1
   preserves relative distances under RoPE for a more accurate initial
   threshold in the heuristic kernel.

2. Propagate enable_heuristic_topk through model_config.py: the
   DeepSeekV32 config builder was reconstructing DeepSeekSparseAttention
   Config without forwarding enable_heuristic_topk from the user's
   sparse_attention_config, causing the field to always default to False.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…uffers

Redesign the heuristic TopK pre_idx management to be fully CUDA Graph
compatible. Replace the old Indexer-side dict/slot approach (which used
.item() and torch.tensor() forbidden during graph capture) with
pre-allocated metadata buffers using a feedback loop pattern.

Key changes:
- Per-layer 3D buffer heuristic_prev_topk on metadata, allocated via
  get_empty(capture_graph=True) for stable addresses across replays.
- Shared staging buffer for +1 RoPE offset, all in-place ops.
- Each graph replay's write becomes the next replay's read.
- Heuristic now works with CUDA Graphs enabled (no need to disable).
- Refine heuristic_topk.cuh kernel for improved convergence.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…aluation

Extend eval_longbench_v1.py to support DeepSeek DSA sparse attention
alongside existing RocketKV, with heuristic TopK and MTP speculative
decoding options.

Key changes:
- Add --dsa_sparse flag as mutually exclusive with --rocket_sparse
- Add --enable_heuristic_topk for DSA heuristic TopK pre_idx reuse
- Add --mtp for MTP speculative decoding layers
- Update usage examples for both RocketKV and DSA workflows

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…tions

Three kernel-level optimizations (OPT5/OPT6/OPT7) to reduce latency of
the single-CTA heuristic TopK on Blackwell (sm_100):

OPT5 — Skip Phase 3 blockCountGE re-scan when Phase 2 already confirmed
the candidate count is in [TOP_K, MAX_CANDIDATES] (done==1).  Eliminates
one full N-element scan in the common fast-convergence case.

OPT6 — Increase NUM_BINS from 256 to 2048 and replace the serial O(2048)
K-th-bin scan with a two-step parallel search: each warp accumulates its
128-bin slice (NUM_BINS/NUM_WARPS), then thread 0 locates the target
warp in 16 steps and one warp-lane scans 128 bins.  Serial depth drops
from 2048 to 144 steps.  Shared memory grows from ~50 KB to ~59 KB; the
histogram is now 8 KB instead of 1 KB.

OPT7 — Cache per-thread element counts into smem->per_thread_counts[]
inside blockCountGE.  Phase 3 prefix-sum reuses these cached values
instead of repeating the N-scan, saving one full global-memory pass.
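
The two-step search in OPT6 can be modelled serially in Python to show where the 16 + 128 = 144 step bound comes from. The constants match the commit message; scanning from the highest bin downward and the name `find_kth_bin` are assumptions, and on the GPU step 1 runs in parallel across warps.

```python
NUM_BINS = 2048
NUM_WARPS = 16
BINS_PER_WARP = NUM_BINS // NUM_WARPS  # 128 bins per warp slice


def find_kth_bin(hist, k):
    """Return (bin, count_above) for the bin holding the k-th largest element."""
    # Step 1 (parallel on the GPU): each warp sums its own 128-bin slice.
    warp_sums = [
        sum(hist[w * BINS_PER_WARP:(w + 1) * BINS_PER_WARP])
        for w in range(NUM_WARPS)
    ]
    # Step 2 (serial, at most 16 steps): find the slice containing the k-th element.
    above = 0
    for w in reversed(range(NUM_WARPS)):
        if above + warp_sums[w] >= k:
            break
        above += warp_sums[w]
    else:
        raise ValueError("histogram holds fewer than k elements")
    # Step 3 (serial, at most 128 steps): scan the bins inside that slice.
    for b in reversed(range(w * BINS_PER_WARP, (w + 1) * BINS_PER_WARP)):
        if above + hist[b] >= k:
            return b, above
        above += hist[b]
    raise AssertionError("unreachable: the slice sum guarantees a hit")
```

The returned `count_above` is the number of elements in strictly higher bins, which is what a later phase would need to know how many elements to take from the boundary bin.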

Also add distribution-parameterised correctness tests for the heuristic
indexer_topk_decode path in test_indexer_topk.py:

- Four logit families: beta (bounded), logistic (heavy-tailed), lognorm
  (positively skewed), weibull_min (right-skewed extreme-value)
- MTP correlation: consecutive rows within each batch element share tail
  logits (next_n up to 3)
- pre_idx accuracy sweep: success_ratio in {0.5, 0.9}
- Tolerance set to full_range/256 to accept histogram-bin boundary ties

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…flicts

Pass caller-owned scratch buffer through heuristic TopK pipeline to
eliminate cudaMallocAsync/cudaFreeAsync inside the kernel, enabling
CUDA Graph capture. Also resolves stash merge conflicts in test file
and adds seq_lens clamping for valid row lengths with next_n > 1.

Signed-off-by: longcheng-nv <[email protected]>
Made-with: claude-4.6-opus-high
…pK tests

Add docstrings to all test functions and helpers to improve docstring
coverage. Remove broken duplicate test_indexer_topk_decode_dist that
referenced undefined names (dtype, run_fn). Suppress F811 on intentional
CuTE DSL test redefinitions with different parametrize configurations.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…n test_dsa_indexer

The DSA backend accesses sparse_attention_config.enable_heuristic_topk
but the mock SparseAttentionConfig in test_dsa_indexer.py was missing
this field, causing all 76 DSA indexer tests to fail with AttributeError.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
Add one-line docstrings to all public functions, classes, and methods
across the 5 Python files modified by this PR to satisfy the CI
docstring coverage gate (was 41%, now ~100%).

Signed-off-by: longcheng-nv <[email protected]>
Made-with: claude-4.6-opus-high
Without heuristic_scratch the call silently falls back to the radix kernel.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
The heuristic TopK kernel requires stride0 divisible by 4 for float4
loads. In production fp8_paged_mqa_logits always returns tensors with
stride aligned to 256, but the test helpers created logits with
arbitrary max_len from random seq_lens, causing assertion failures
when max_len was not a multiple of 4.
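
A test helper can satisfy the float4 requirement by rounding the row length up to the next multiple of 4 before allocating logits. The sketch below shows the arithmetic; `align_up` is a hypothetical helper name, not necessarily what the test file uses.

```python
def align_up(value, multiple=4):
    """Round value up to the nearest multiple (4 floats = one float4 load)."""
    return (value + multiple - 1) // multiple * multiple


# 70,690 is one of the sequence lengths benchmarked in this PR; it is not a
# multiple of 4, so a row of that length needs two elements of padding.
print(align_up(70690))
```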

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…duplicate allocation

- Remove duplicate `heuristic_scratch_values` allocation from
  `create_expanded_buffers()` (already created in `__post_init__`);
  move resize logic into `update_spec_dec_param()` where shape changes.
- Eliminate `heuristic_pre_idx_staging` buffer entirely: pass
  `heuristic_prev_topk` directly as `pre_idx` to the kernel.
- Move +1 temporal offset into the C++ kernel (`preIdxOffset =
  (rowIdx % next_n) + 1`), removing two Python tensor ops (copy_ and
  += 1) from the CUDA Graph captured region per layer per decode step.
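
The row-to-hint mapping described in these bullets can be sketched in Python. The hint row `rowIdx / next_n` and the offset `(rowIdx % next_n) + 1` are taken from the commit messages; the function name is illustrative.

```python
def hint_row_and_offset(row_idx, next_n):
    """Map a logits row to its request's hint row and temporal offset."""
    hint_row = row_idx // next_n        # one saved TopK row per request
    offset = (row_idx % next_n) + 1     # positions elapsed since the saved query
    return hint_row, offset


# With next_n = 3, rows 0..2 share request 0's hints with offsets 1, 2, 3.
for row_idx in range(6):
    print(row_idx, hint_row_and_offset(row_idx, next_n=3))
```

Computing the offset from the row index inside the kernel means the hint buffer can be passed through unchanged, which is what allows the staging buffer and the two in-graph tensor ops to be dropped.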

Addresses review comments from lfr-0531 on PR NVIDIA#12385.

Signed-off-by: longcheng-nv <[email protected]>
…on pre-Blackwell

- Add `get_sm_version() >= 100` guard to `enable_heuristic_topk` in both
  `DSAtrtllmAttentionMetadata` and `Indexer`, so the heuristic path
  silently falls back to radix sort on Hopper and older architectures.
- Add `@skip_pre_blackwell` decorator to `test_indexer_topk_decode_dist`
  in `test_indexer_topk.py` and `test_indexer_topk_dist.py`, since the
  heuristic kernel is only validated for Blackwell GPUs.

Fixes CI failure on DGX_H100: test_indexer_topk_decode_dist[8192-2048-2-128-logistic_m0.47_s1.46-0.5].

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
…eedback

Revert docstring-only changes in cpp_custom_ops.py, model_config.py, and
llm_args.py that were added in eda0351 but are outside the scope of this
PR. Docstrings in files directly modified by this feature (dsa.py and
test_dsa_indexer.py) are retained.

Addresses feedback from lfr-0531 and kaiyux on PR NVIDIA#12385.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
… and multi-GPU

Add runtime topK == TOP_K validation, compile-time static_assert blocking
non-32-bit index types, remove process-wide static flag in favor of
unconditional cudaFuncSetAttribute, and drop unsafe reinterpret_casts.

Signed-off-by: Long Cheng <[email protected]>
Signed-off-by: longcheng-nv <[email protected]>
Add a parametrized variant to TestDeepSeekV32.test_fp8_blockscale that
exercises the end-to-end heuristic TopK decode path with MTP (next_n=1).
Gated to Blackwell (SM >= 100). Validates accuracy on MMLU and GSM8K
with enable_heuristic_topk=True through the LLM API.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
@longcheng-nv longcheng-nv force-pushed the feat/heuristic-indexer-topk branch from 386cd7b to 5f59b99 Compare April 2, 2026 01:48
@longcheng-nv longcheng-nv requested a review from a team as a code owner April 2, 2026 01:48
@longcheng-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41297 [ run ] triggered by Bot. Commit: 5f59b99 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41297 [ run ] completed with state SUCCESS. Commit: 5f59b99
/LLM/main/L0_MergeRequest_PR pipeline #32252 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@longcheng-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41536 [ run ] triggered by Bot. Commit: 5f59b99 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41536 [ run ] completed with state SUCCESS. Commit: 5f59b99
/LLM/main/L0_MergeRequest_PR pipeline #32451 completed with status: 'SUCCESS'

CI Report

Link to invocation

@juney-nvidia juney-nvidia merged commit 29fac6b into NVIDIA:main Apr 3, 2026
5 checks passed
govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Apr 6, 2026
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
6 participants