[None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for Sparse Attention by longcheng-nv · Pull Request #12385 · NVIDIA/TensorRT-LLM

longcheng-nv · 2026-03-20T05:12:29Z

Summary

Adds a heuristic-guided TopK decode path named GVR Top-K ("Guess-Verify-Refine") to the DSA (DeepSeek Sparse Attention) indexer; it exploits temporal correlation between consecutive decode steps, reusing the previous step's TopK indices as candidate hints to reduce search depth
Implements a 3-phase adaptive threshold search with 2048-bin parallel histogram (V2e kernel with OPT5/6/7 optimizations) in a standalone compilation unit (heuristic_topk.cuh)
Supports multi-row (MTP, next_n > 1) execution and CUDA Graph compatibility via per-layer metadata buffers with stable addresses
Coexists with upstream CuTE DSL TopK in a 3-way priority dispatch: heuristic → CuTE DSL → CUDA C++ radix sort
Disabled by default (enable_heuristic_topk=False); opt-in via DeepSeekSparseAttentionConfig(enable_heuristic_topk=True)
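To make the Guess-Verify-Refine idea concrete, here is a minimal pure-Python sketch. It is not the V2e kernel: the name `gvr_topk`, the `max_candidates` bound, and the bisection-based refinement are assumptions chosen for illustration, whereas the real kernel uses a 2048-bin histogram inside a single CTA. The sketch only shows why stale hints cannot break exactness: the guessed threshold is always verified to keep at least `k` elements before the final selection.

```python
import random


def gvr_topk(logits, k, prev_idx, max_candidates=None, max_iters=32):
    """Illustrative Guess-Verify-Refine top-k (hypothetical helper, not the kernel)."""
    n = len(logits)
    if n <= k:
        return sorted(range(n), key=lambda i: -logits[i])
    max_candidates = max_candidates or 2 * k

    # Guess: with at least k distinct hints, their smallest logit is a
    # threshold that at least k elements are guaranteed to satisfy.
    hints = {i for i in prev_idx if 0 <= i < n}
    if len(hints) >= k:
        lo = min(logits[i] for i in hints)
    else:
        lo = min(logits)  # too few hints: start from the trivial threshold
    hi = max(logits)

    def count_ge(threshold):
        return sum(1 for v in logits if v >= threshold)

    # Verify / Refine: raise the threshold only while it still keeps >= k
    # elements, until the candidate set is small enough to sort cheaply.
    for _ in range(max_iters):
        if count_ge(lo) <= max_candidates:
            break
        mid = (lo + hi) / 2
        if count_ge(mid) >= k:
            lo = mid
        else:
            hi = mid

    candidates = [i for i in range(n) if logits[i] >= lo]
    candidates.sort(key=lambda i: -logits[i])
    return candidates[:k]


random.seed(0)
demo_logits = [random.gauss(0.0, 1.0) for _ in range(4096)]
demo_exact = sorted(range(4096), key=lambda i: -demo_logits[i])[:64]
# Reuse the exact answer as a perfect "previous step" hint.
assert gvr_topk(demo_logits, 64, demo_exact) == demo_exact
```

Good hints make the first guess land close to the true K-th value, so few refinement passes are needed; bad hints only cost extra passes.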

Commit Breakdown (11 commits)

feat: Add heuristic pre-indexed candidate path to IndexerTopK decode kernel
perf: Split heuristic TopK kernel into separate compilation unit
perf: Reuse shared heuristic TopK kernel across decode paths
feat: Extend heuristic TopK decode to multi-row (MTP) support
perf: Refactor heuristic TopK to __noinline__ device function pattern
perf: Use adaptive launch_bounds and restore sort threshold
feat: Reuse previous TopK indices as heuristic hints for decode indexer
fix: Fix heuristic TopK pre_idx offset and config propagation
feat: CUDA-graph-safe heuristic TopK with per-layer metadata buffers
feat: Add DSA sparse attention and MTP support to LongBench evaluation
perf: Upgrade heuristic TopK kernel to V2e with three optimizations (OPT5/6/7)

Key Files

File	Description
`cpp/tensorrt_llm/kernels/heuristic_topk.cuh`	Core V2e kernel: 3-phase adaptive threshold with 2048-bin histogram
`cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu`	Multi-row kernel wrapper
`cpp/tensorrt_llm/kernels/indexerTopK.cu`	Dispatch logic with heuristic eligibility check
`tensorrt_llm/_torch/attention_backend/sparse/dsa.py`	DSA backend: 3-way dispatch, per-layer CUDA Graph buffers
`tensorrt_llm/_torch/model_config.py`	`enable_heuristic_topk` config propagation
`tensorrt_llm/llmapi/llm_args.py`	`DeepSeekSparseAttentionConfig.enable_heuristic_topk` user option

API

Enable via DeepSeekSparseAttentionConfig(enable_heuristic_topk=True) in LlmArgs.

Test plan

Unit tests: pytest tests/unittest/_torch/thop/parallel/test_indexer_topk.py (distribution-parameterized correctness tests)
Verify CUDA Graph capture/replay with heuristic path enabled
Kernel correctness: all output indices verified identical to torch.topk across all test configurations

Accuracy Evaluation (DeepSeek V3.2 NVFP4 on B200, 8×TP/EP, DSA sparse attention)

MMLU (14,042 questions)

	Exp 1	Exp 2	Exp 3	Average
Default TRT-LLM indexer TopK	87.51	—	—	87.51
Heuristic-guided indexer TopK	87.52	87.51	87.49	87.51

Delta: 0.00 — accuracy-neutral.

GSM8K (5-shot)

	Exp 1	Exp 2	Exp 3	Average
Default TRT-LLM indexer TopK	95.11	—	—	95.11
Heuristic-guided indexer TopK	95.03	95.49	95.19	95.24

Delta: +0.13 — accuracy-neutral (within noise).

GPQA-Diamond (CoT zero-shot)

	Exp 1	Exp 2	Exp 3	Average
Default TRT-LLM indexer TopK	77.27	—	—	77.27
Heuristic-guided indexer TopK	75.76	75.25	78.79	76.60

Delta: −0.67 — within GPQA-Diamond's ~3pt stderr.

LongBench v1 (8 runs, MTP=1)

	Exp 1	Exp 2	Exp 3	Exp 4	Exp 5	Exp 6	Exp 7	Exp 8	Average
Default TRT-LLM indexer TopK	44.37	45.01	44.68	44.72	44.47	44.25	44.64	44.74	44.61
Heuristic-guided indexer TopK	44.00	44.46	44.18	44.67	44.52	43.73	44.42	44.29	44.28

Delta: −0.33, accuracy-neutral (relative change below 0.75%).

LongBench v2 (215 questions, medium context length, MTP=1)

	Exp 1	Exp 2	Exp 3	Exp 4	Exp 5	Average
Heuristic TopK	49.30%	50.23%	48.37%	49.77%	47.91%	49.12%
Default TopK	50.70%	49.77%	52.09%	47.91%	47.44%	49.58%

Delta: −0.46% — LongBench v2 has high per-run variance (215 questions, range 47–52% even for baseline). With 5 matched runs per config, the delta is within expected noise.
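As a rough check on the noise claim, one can treat each question as an independent Bernoulli trial, which is a simplifying assumption. The standard error of an accuracy p measured on n questions is then sqrt(p(1-p)/n). The helper name below is illustrative only.

```python
import math


def accuracy_stderr_pct(accuracy_pct, num_questions):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_questions)


# LongBench v2 medium subset: 215 questions, baseline average 49.58%.
single_run = accuracy_stderr_pct(49.58, 215)
print(f"single-run standard error: {single_run:.2f} percentage points")
```

Under this model a single run carries a standard error of roughly 3.4 percentage points, several times larger than the 0.46-point delta.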

Summary

Benchmark	Default TopK	Heuristic TopK	Delta
MMLU (14K questions)	87.51	87.51	0.00
GSM8K (5-shot)	95.11	95.24	+0.13
GPQA-Diamond (CoT)	77.27	76.60	−0.67
LongBench v1 (8 runs)	44.61	44.28	−0.33
LongBench v2 (medium context length, 5 runs)	49.58%	49.12%	−0.46%

All benchmarks are accuracy-neutral within their respective noise margins. MMLU and GSM8K (large-N, deterministic) show effectively zero impact. GPQA and LongBench (small-N, high variance) show deltas within expected run-to-run noise.
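The averages and deltas in the summary table can be re-derived from the per-run scores. The snippet below is a small convenience sketch (the names `RUNS` and `summarize` are not part of the PR); it takes each delta between the rounded averages, which is how the table's figures line up.

```python
from statistics import mean

# Per-run scores copied from the tables above: (default runs, heuristic runs).
RUNS = {
    "MMLU": ([87.51], [87.52, 87.51, 87.49]),
    "GSM8K": ([95.11], [95.03, 95.49, 95.19]),
    "GPQA-Diamond": ([77.27], [75.76, 75.25, 78.79]),
    "LongBench v1": (
        [44.37, 45.01, 44.68, 44.72, 44.47, 44.25, 44.64, 44.74],
        [44.00, 44.46, 44.18, 44.67, 44.52, 43.73, 44.42, 44.29],
    ),
    "LongBench v2": (
        [50.70, 49.77, 52.09, 47.91, 47.44],
        [49.30, 50.23, 48.37, 49.77, 47.91],
    ),
}


def summarize(runs):
    """Return {benchmark: (default_avg, heuristic_avg, delta)}, rounded to 2 places.

    The delta is computed between the rounded averages.
    """
    out = {}
    for name, (default, heuristic) in runs.items():
        d = round(mean(default), 2)
        h = round(mean(heuristic), 2)
        out[name] = (d, h, round(h - d, 2))
    return out


for name, (d, h, delta) in summarize(RUNS).items():
    print(f"{name:14s} default={d:.2f} heuristic={h:.2f} delta={delta:+.2f}")
```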

Performance Results (B200)

The heuristic TopK micro-kernel (heuristicTopKMultiRowKernel) is benchmarked against the default radix-sort path (topKPerRowDecode) on single-row logits input (one CTA per row). All benchmark runs first verify output indices match torch.topk exactly.

Random input (approximately normal distribution, y = 1 + 0.1 * N(0, 1))

N (seq length)	Heuristic (ns)	Radix Sort (ns)	Speedup
8,192	16,512	11,200	0.68x
16,384	21,856	21,984	1.01x
32,768	26,112	32,928	1.26x
65,536	31,904	47,936	1.50x
70,690	36,864	51,200	1.39x
131,072	43,392	76,128	1.75x

The heuristic kernel is at parity with radix sort at N = 16K (1.01x) and pulls ahead from N = 32K upward; at N = 8K the radix sort is faster (0.68x). This is expected: the heuristic's histogram overhead dominates at small N, which is why the dispatch threshold gates it.
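The gating and the 3-way priority order can be summarized in a small decision function. This is a sketch, not the code in `indexerTopK.cu` or `dsa.py`: `TopKRequest`, `choose_topk_path`, and the `MIN_HEURISTIC_SEQ_LEN` value are hypothetical, and the eligibility conditions are collected from statements elsewhere in this PR thread (hints and scratch buffer present, SM >= 100, topK equal to the compile-time TOP_K of 2048, row stride divisible by 4).

```python
from dataclasses import dataclass, replace


@dataclass
class TopKRequest:
    """Hypothetical bundle of the facts a dispatcher would look at."""
    seq_len: int
    top_k: int
    row_stride: int
    has_pre_idx: bool
    has_scratch: bool
    sm_version: int
    heuristic_enabled: bool = False
    cute_dsl_available: bool = False


# Placeholder value: the PR does not state the real cutoff.
MIN_HEURISTIC_SEQ_LEN = 16_384


def choose_topk_path(req: TopKRequest) -> str:
    """Priority order: heuristic, then CuTE DSL, then CUDA C++ radix sort."""
    heuristic_ok = (
        req.heuristic_enabled
        and req.has_pre_idx
        and req.has_scratch           # without scratch the call falls back
        and req.sm_version >= 100     # Blackwell or newer
        and req.top_k == 2048         # kernel assumes compile-time TOP_K
        and req.row_stride % 4 == 0   # float4 loads need aligned rows
        and req.seq_len >= MIN_HEURISTIC_SEQ_LEN
    )
    if heuristic_ok:
        return "heuristic"
    if req.cute_dsl_available:
        return "cute_dsl"
    return "radix"


req = TopKRequest(seq_len=70_000, top_k=2048, row_stride=70_144,
                  has_pre_idx=True, has_scratch=True, sm_version=100,
                  heuristic_enabled=True, cute_dsl_available=True)
print(choose_topk_path(req))
print(choose_topk_path(replace(req, sm_version=90)))
```

Keeping the eligibility test as a single boolean makes the fallback behaviour explicit: any failed condition silently routes to the next path rather than raising.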

Realistic input (DeepSeek V3.2 decode logits, SWE-Bench ISL/OSL=64K/2K)

Profiled across 9 layers x 17 decode steps (N ~ 68.7K-70.7K):

Metric	Value
Average speedup	1.81x
Range	1.15x – 2.36x
Radix sort baseline	~52–57 us (stable across layers)
Heuristic kernel	~24–46 us (varies by layer/step)

The heuristic kernel achieves 1.81x average speedup over the radix sort on realistic DeepSeek V3.2 decode workloads at typical sequence lengths (~70K).

coderabbitai · 2026-03-20T05:45:05Z

📝 Walkthrough

This PR introduces a heuristic top-K decoding mechanism that accelerates top-K selection by reusing previous step's top-K indices as hints. It adds new CUDA kernel implementations, extends the IndexerTopK interface to accept pre-computed indices, integrates with PyTorch operators and DSA sparse attention, and introduces comprehensive distribution-based tests.

Changes

Cohort / File(s)	Summary
Kernel API Updates `cpp/tensorrt_llm/kernels/IndexerTopK.h`	Updated `invokeIndexerTopKDecode` signature to accept optional pre-index parameters (`preIdx`, `preIdxStride`, `preIdxCount`) for heuristic decoding path.
New Heuristic TopK Headers `cpp/tensorrt_llm/kernels/heuristicTopKDecode.h`, `cpp/tensorrt_llm/kernels/heuristic_topk.cuh`	Introduced header files defining CUDA kernel launcher `launchHeuristicTopKDecode` and low-level heuristic TopK implementation with histogram-based selection, threshold refinement, and candidate collection.
CUDA Kernel Implementations `cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu`	Added multi-row heuristic TopK decode launcher with dynamic shared memory configuration, scratch buffer management, and optional logits alignment for CUDA float4 requirements.
IndexerTopK Implementation `cpp/tensorrt_llm/kernels/indexerTopK.cu`	Extended `invokeIndexerTopKDecode` to conditionally invoke heuristic path when pre-index data is provided and conditions are met; introduced dynamic split-work threshold logic.
PyTorch Operator Integration `cpp/tensorrt_llm/thop/IndexerTopKOp.cpp`, `tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`	Updated `indexer_topk_decode` operator to accept optional `pre_idx` tensor and propagate to CUDA kernel; updated fake op registration.
DSA Sparse Attention Integration `tensorrt_llm/_torch/attention_backend/sparse/dsa.py`	Added heuristic TopK path in decode with per-layer persistent buffers that store each layer's TopK indices and reuse them as hints on the next decode step; conditional staging and indexing logic.
Configuration and Model Setup `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/model_config.py`	Added `enable_heuristic_topk` field to `DeepSeekSparseAttentionConfig`; wired flag through model initialization for Deepseek and GLM-MoE architectures.
Evaluation Script `examples/longbench/eval_longbench_v1.py`	Extended CLI with DSA-specific flags (`--dsa_sparse`, `--enable_heuristic_topk`, `--dsa_index_topk`), speculative decoding option (`--mtp`), and conditional KV cache dtype configuration.
Distribution-based Tests `tests/unittest/_torch/thop/parallel/test_indexer_topk.py`	Added parameterized correctness tests with distribution-shaped logits (beta, logistic, lognorm, weibull), MTP correlation structures, and heuristic `pre_idx` generation with configurable hit/fill ratios.

Sequence Diagram

sequenceDiagram
    participant PyTorch as PyTorch Inference
    participant DSA as DSA Attention Layer
    participant Indexer as Indexer (sparse_attn_indexer)
    participant TopK as indexer_topk_decode Op
    participant Kernel as CUDA Kernels
    
    PyTorch->>DSA: Generate tokens (decode step)
    DSA->>Indexer: Call with use_custom_topk flag
    alt enable_heuristic_topk enabled
        Indexer->>Indexer: Derive pre_idx from heuristic_prev_topk<br/>(this layer's TopK stored on the previous step)
        Indexer->>TopK: Call with pre_idx parameter
        TopK->>Kernel: Dispatch to launchHeuristicTopKDecode
        Kernel->>Kernel: Check if N <= topK (simple case)<br/>else use heuristic path
        Kernel-->>TopK: Return top-K indices
        TopK-->>Indexer: top-K results
        Indexer->>Indexer: Update heuristic_prev_topk<br/>with current layer's topK
    else enable_heuristic_topk disabled
        Indexer->>TopK: Call with pre_idx=None
        TopK->>Kernel: Dispatch to standard path<br/>(insertion/radix/multi-block)
        Kernel-->>TopK: Return top-K indices
        TopK-->>Indexer: top-K results
    end
    Indexer-->>DSA: Sparse attention indices
    DSA-->>PyTorch: Attention output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

hyukn
yilin-void
litaotju
longlee0622
Kefeng-Duan

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.03% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main feature being added: a temporally-correlated heuristic-guided TopK decoder for sparse attention.
Description check	✅ Passed	The PR description is comprehensive and well-structured, providing detailed information on the feature, implementation, test coverage, and performance results.

coderabbitai

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/heuristic_topk.cuh`:
- Around line 751-776: The launcher launchHeuristicTopK is unsafe because
heuristicTopKJob/heuristicTopKKernel assumes the compile-time TOP_K (2048) but
launchHeuristicTopK accepts a runtime topK; either enforce and reject mismatched
sizes or propagate the runtime topK into the kernel/job. Fix by validating topK
against TOP_K at the top of launchHeuristicTopK and return cudaErrorInvalidValue
(or similar) if they differ, or modify the kernel/job interfaces
(heuristicTopKKernel and heuristicTopKJob) to accept and use the runtime topK
everywhere (remove compile-time-only assumptions/padding/thresholding) so no
writes occur past the provided outputValues/outputIndices.
- Around line 772-774: The explicit launch with a 64-bit index type
(launchHeuristicTopK<float, int64_t>) corrupts buffers because the kernel launch
reinterprets preIdx/outputIndices as int*/int const* and uses 32-bit element
sizes; fix by making the kernel and launch correctly use the index template
IdxT: either remove the int64_t instantiation and only use 32-bit indices, or
fully templatize the kernel/job on IdxT and change the reinterpret_casts and
buffer pointer types in the launch sites (e.g., heuristicTopKKernel launch and
the calls around the launchHeuristicTopK instantiations at the same block and at
lines ~779-782) so pointers, casts, and writes/reads use reinterpret_cast<IdxT
const*> / reinterpret_cast<IdxT*> and the correct element sizes.

In `@cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu`:
- Around line 85-98: The current process-wide static flag `configured` causes
`cudaFuncSetAttribute(heuristicTopKMultiRowKernel,
cudaFuncAttributeMaxDynamicSharedMemorySize, ...)` to be applied only once
across all GPUs; make this device-aware by replacing the single `configured`
with a per-device guard (e.g., a container keyed by the current device id
obtained via `cudaGetDevice`) or simply remove the guard and call
`cudaFuncSetAttribute` unconditionally when `smemSize > 48u * 1024u && smemSize
<= static_cast<size_t>(maxSmem)`; ensure you check `device` via `cudaGetDevice`
and set the per-device flag after a successful `cudaFuncSetAttribute` for
`heuristicTopKMultiRowKernel` to avoid suppressing opt-in on other GPUs.

In `@cpp/tensorrt_llm/thop/IndexerTopKOp.cpp`:
- Around line 74-85: The pre_idx handling only checks is_cuda() but must also
ensure it lives on the same CUDA device as logits to avoid passing a foreign
device pointer into invokeIndexerTopKDecode; update the pre_idx branch (the
block using preIdxTensor, preIdxPtr, preIdxStride, preIdxCount) to assert the
devices match (compare preIdxTensor.device() with logits.device() or
logits.device().index()) using TORCH_CHECK and a clear error message, so the
kernel always receives a pointer on the same CUDA device as logits.
- Around line 80-81: The TORCH_CHECK in IndexerTopKOp.cpp incorrectly allows
preIdxTensor.size(0) == numRows64 when next_n > 1, which leads
heuristicTopKMultiRowKernel to index hints by rowIdx / next_n and ignore per-row
hints; update the validation around preIdxTensor, next_n and numRows64 so that
when next_n > 1 the only accepted hint shape is preIdxTensor.size(0) * next_n ==
numRows64 (or equivalently preIdxTensor.size(0) == numRows64 / next_n), and only
allow the preIdxTensor.size(0) == numRows64 shorthand when next_n == 1; locate
and change the TORCH_CHECK that currently references preIdxTensor.size(0),
next_n and numRows64 (and any nearby comments) to enforce this stricter
condition so callers cannot silently pass per-row hints that will be ignored by
heuristicTopKMultiRowKernel.
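The stricter shape rule suggested above can be stated compactly. The following is a plain-Python sketch of the check, not the actual `TORCH_CHECK` in `IndexerTopKOp.cpp`; the function name and error messages are made up for illustration.

```python
def check_pre_idx_rows(pre_idx_rows: int, num_rows: int, next_n: int) -> None:
    """Raise ValueError unless the hint tensor has exactly one row per request.

    With next_n == 1 this reduces to pre_idx_rows == num_rows, so the
    per-row shorthand stays valid only in that case.
    """
    if next_n < 1 or num_rows % next_n != 0:
        raise ValueError(f"num_rows={num_rows} is not a multiple of next_n={next_n}")
    expected = num_rows // next_n
    if pre_idx_rows != expected:
        raise ValueError(
            f"pre_idx has {pre_idx_rows} rows, expected {expected} "
            f"(num_rows / next_n)"
        )


check_pre_idx_rows(pre_idx_rows=4, num_rows=4, next_n=1)  # one hint row per row
check_pre_idx_rows(pre_idx_rows=2, num_rows=6, next_n=3)  # one hint row per request
```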

In `@examples/longbench/eval_longbench_v1.py`:
- Around line 389-409: Add upfront argument validation to reject combinations
that require PyTorch when the user selected the TensorRT backend: if
args.backend == "tensorrt" and (args.dsa_sparse is True or args.mtp > 0) then
raise a clear CLI error (e.g., parser.error or raise ValueError) before building
DeepSeekSparseAttentionConfig or MTPDecodingConfig; reference the symbols
args.dsa_sparse, DeepSeekSparseAttentionConfig, args.mtp, and MTPDecodingConfig
so the check runs prior to the blocks that construct those configs and prevents
LLM(...) from failing later.
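A minimal `argparse` sketch of such an upfront check follows. The flag names `--dsa_sparse`, `--enable_heuristic_topk`, and `--mtp` come from this PR; the `--backend` choices and the helper names are assumptions, and the real script defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LongBench v1 evaluation (sketch)")
    # Assumed choices; the real script may accept other backend names.
    parser.add_argument("--backend", choices=["pytorch", "tensorrt"], default="pytorch")
    parser.add_argument("--dsa_sparse", action="store_true")
    parser.add_argument("--enable_heuristic_topk", action="store_true")
    parser.add_argument("--mtp", type=int, default=0)
    return parser


def parse_and_validate(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject unsupported combinations before any model config is built.
    if args.backend == "tensorrt" and (args.dsa_sparse or args.mtp > 0):
        parser.error("--dsa_sparse and --mtp require --backend pytorch")
    return args
```

`parser.error` prints the usage string plus the message to stderr and exits with status 2, so the user sees the problem before any model loading starts.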

In `@tensorrt_llm/_torch/attention_backend/sparse/dsa.py`:
- Around line 533-540: The allocated buffer heuristic_prev_topk (created via
get_empty(...)) is being read/shifted before being seeded; reset newly allocated
or newly-assigned slots to -1 and ensure the shift logic (where staging += 1 is
applied) only increments entries that are >= 0 so padding/sentinel -1 stays as
“no hint”. Update the code paths that touch heuristic_prev_topk (and the same
pattern at the other occurrences) to explicitly initialize new request slots to
-1 after get_empty, and change the increment/shift logic to conditionally add 1
only for non-negative elements (e.g., mask with >=0 or use where/conditional
update) so stale -1 values never turn into candidate 0.

In `@tests/unittest/_torch/thop/parallel/test_indexer_topk.py`:
- Around line 313-367: The issue is that a block was accidentally dedented to
module scope so _run_cute_dsl_topk_test() returns immediately after seed setup,
making the rest of the test body run at import time and creating a duplicate
test name; fix it by re-indenting the displaced block so all code that generates
logits, runs assertions, and parametrized test logic is inside the
_run_cute_dsl_topk_test function (identify by the function name
_run_cute_dsl_topk_test and the seed setup lines), and remove or rename the
stray duplicate test_indexer_topk_decode_dist definition at module scope so
there is no F811 duplicate symbol.
- Around line 834-835: Clamp seq_lens to ensure no sequence is shorter than
next_n before computing row_ends: after calling generate_seq_lens(...) assign
seq_lens = seq_lens.clamp(min=next_n) so that row_ends = seq_lens[row_indices] -
next_n + next_n_offset + 1 cannot produce non-positive values; this prevents
generate_pre_idx from receiving non-positive valid_len and blowing up. Ensure
you update the variable used by subsequent logic (row_ends, generate_pre_idx) so
the clamp takes effect.
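A list-based sketch of the clamp shows why it keeps every row length positive. The function name is illustrative, and rows are assumed to be laid out request-major (`next_n` consecutive rows per request), which matches the `rowIdx / next_n` indexing mentioned earlier in the review.

```python
def row_ends_for_batch(seq_lens, next_n: int) -> list:
    """Valid length of each of the next_n rows per request, after clamping."""
    clamped = [max(s, next_n) for s in seq_lens]
    # row_end = seq_len - next_n + next_n_offset + 1, for offset in [0, next_n)
    return [s - next_n + offset + 1 for s in clamped for offset in range(next_n)]


print(row_ends_for_batch([1, 10], next_n=3))
```

Because the smallest offset is 0 and the clamped length is at least `next_n`, the smallest possible row end is 1.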

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 948bc074-35b6-4176-80ee-3ff5990b6509

📥 Commits

Reviewing files that changed from the base of the PR and between 7dd0865 and 09b44a6.

📒 Files selected for processing (12)

cpp/tensorrt_llm/kernels/IndexerTopK.h
cpp/tensorrt_llm/kernels/heuristicTopKDecode.cu
cpp/tensorrt_llm/kernels/heuristicTopKDecode.h
cpp/tensorrt_llm/kernels/heuristic_topk.cuh
cpp/tensorrt_llm/kernels/indexerTopK.cu
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
examples/longbench/eval_longbench_v1.py
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/thop/parallel/test_indexer_topk.py

longcheng-nv · 2026-03-21T03:14:44Z

/bot run

longcheng-nv · 2026-03-21T08:17:23Z

/bot run

longcheng-nv · 2026-03-23T01:48:25Z

/bot run

longcheng-nv · 2026-03-23T01:57:28Z

/bot help

github-actions · 2026-03-23T01:57:36Z

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. By DEFAULT, the bot reuses build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Supported values: [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

longcheng-nv · 2026-03-23T02:00:59Z

/bot run

HuiGao-NV · 2026-03-23T02:37:23Z

/bot run

tensorrt-cicd · 2026-03-23T02:44:05Z

PR_Github #39861 [ run ] triggered by Bot. Commit: 7de8fcf Link to invocation

tensorrt-cicd · 2026-03-23T04:58:55Z

PR_Github #39861 [ run ] completed with state SUCCESS. Commit: 7de8fcf
/LLM/main/L0_MergeRequest_PR pipeline #31034 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

longcheng-nv · 2026-03-23T07:10:46Z

/bot run

HuiGao-NV · 2026-03-23T07:49:54Z

/bot run

tensorrt-cicd · 2026-03-23T07:55:40Z

PR_Github #39901 [ run ] triggered by Bot. Commit: 7de8fcf Link to invocation

tensorrt-cicd · 2026-03-23T11:00:04Z

PR_Github #39901 [ run ] completed with state SUCCESS. Commit: 7de8fcf
/LLM/main/L0_MergeRequest_PR pipeline #31070 completed with status: 'FAILURE'

longcheng-nv · 2026-03-24T03:31:38Z

/bot run

…e indexer
Enable the heuristic TopK decode path by persisting each layer's previous-step TopK indices and passing them as pre_idx hints to indexer_topk_decode. Consecutive decode steps have nearly identical attention patterns, so prior indices bootstrap a better initial threshold and reduce interpolation iterations in the heuristic kernel.
Key changes:
- Add enable_heuristic_topk config field (default False) to DeepSeekSparseAttentionConfig for opt-in activation.
- Per-layer lazy-allocated buffers in Indexer for TopK index persistence with request-id-based slot management and stale-request cleanup.
- Save prefill last-token TopK as seed for first decode step; save decode last-MTP-position TopK for subsequent steps.
- Fix MTP preIdx row indexing in heuristic kernel (rowIdx/next_n).
- Relax pre_idx size check in thop binding for per-request shape.
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>

Two fixes for the heuristic TopK decode path:
1. Add +1 offset to pre_idx in _gather_prev_topk_for_decode: the saved TopK came from a query at position P (last MTP pos of previous step), while the current step's first query is at P+1. Shifting by +1 preserves relative distances under RoPE for a more accurate initial threshold in the heuristic kernel.
2. Propagate enable_heuristic_topk through model_config.py: the DeepSeekV32 config builder was reconstructing DeepSeekSparseAttentionConfig without forwarding enable_heuristic_topk from the user's sparse_attention_config, causing the field to always default to False.
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>

…uffers
Redesign the heuristic TopK pre_idx management to be fully CUDA Graph compatible. Replace the old Indexer-side dict/slot approach (which used .item() and torch.tensor() forbidden during graph capture) with pre-allocated metadata buffers using a feedback loop pattern.
Key changes:
- Per-layer 3D buffer heuristic_prev_topk on metadata, allocated via get_empty(capture_graph=True) for stable addresses across replays.
- Shared staging buffer for +1 RoPE offset, all in-place ops.
- Each graph replay's write becomes the next replay's read.
- Heuristic now works with CUDA Graphs enabled (no need to disable).
- Refine heuristic_topk.cuh kernel for improved convergence.
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
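The feedback-loop pattern (each replay's write becomes the next replay's read, through storage that never moves) can be mimicked in pure Python. This is only an analogy: `PrevTopKBuffer` and `decode_step` are hypothetical names, the row layout is an assumption, and Python lists stand in for GPU tensors with fixed addresses.

```python
class PrevTopKBuffer:
    """Pure-Python stand-in for a pre-allocated hint tensor."""

    def __init__(self, num_slots: int, top_k: int):
        # Allocated once up front; -1 marks "no hint yet".
        self.rows = [[-1] * top_k for _ in range(num_slots)]

    def write(self, slot: int, indices) -> None:
        if len(indices) != len(self.rows[slot]):
            raise ValueError("hint row must keep its fixed length")
        # Slice assignment mutates the existing row in place, the list
        # analogue of copying into a tensor whose address must not change.
        self.rows[slot][:] = indices


def decode_step(buffer: PrevTopKBuffer, slot: int, logits, top_k: int):
    """One 'replay': read last step's hints, compute top-k, write it back."""
    hints = list(buffer.rows[slot])  # what a kernel would receive as pre_idx
    order = sorted(range(len(logits)), key=lambda i: -logits[i])[:top_k]
    buffer.write(slot, order)        # becomes the next step's hints
    return hints, order
```

Nothing in the step allocates or rebinds the buffer, which is the property a captured graph relies on: the recorded reads and writes keep pointing at valid storage on every replay.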

…aluation
Extend eval_longbench_v1.py to support DeepSeek DSA sparse attention alongside existing RocketKV, with heuristic TopK and MTP speculative decoding options.
Key changes:
- Add --dsa_sparse flag as mutually exclusive with --rocket_sparse
- Add --enable_heuristic_topk for DSA heuristic TopK pre_idx reuse
- Add --mtp for MTP speculative decoding layers
- Update usage examples for both RocketKV and DSA workflows
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>

…tions
Three kernel-level optimizations (OPT5/OPT6/OPT7) to reduce latency of the single-CTA heuristic TopK on Blackwell (sm_100):
- OPT5: Skip Phase 3 blockCountGE re-scan when Phase 2 already confirmed the candidate count is in [TOP_K, MAX_CANDIDATES] (done==1). Eliminates one full N-element scan in the common fast-convergence case.
- OPT6: Increase NUM_BINS from 256 to 2048 and replace the serial O(2048) K-th-bin scan with a two-step parallel search: each warp accumulates its 128-bin slice (NUM_BINS/NUM_WARPS), then thread 0 locates the target warp in 16 steps and one warp-lane scans 128 bins. Serial depth drops from 2048 to 144 steps. Shared memory grows from ~50 KB to ~59 KB; the histogram is now 8 KB instead of 1 KB.
- OPT7: Cache per-thread element counts into smem->per_thread_counts[] inside blockCountGE. Phase 3 prefix-sum reuses these cached values instead of repeating the N-scan, saving one full global-memory pass.
Also add distribution-parameterised correctness tests for the heuristic indexer_topk_decode path in test_indexer_topk.py:
- Four logit families: beta (bounded), logistic (heavy-tailed), lognorm (positively skewed), weibull_min (right-skewed extreme-value)
- MTP correlation: consecutive rows within each batch element share tail logits (next_n up to 3)
- pre_idx accuracy sweep: success_ratio in {0.5, 0.9}
- Tolerance set to full_range/256 to accept histogram-bin boundary ties
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
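The two-step K-th-bin search from OPT6 can be emulated sequentially to see where the 16 + 128 = 144 serial steps come from. This is a sketch under assumptions: it scans from the highest bin downward (the commit message does not state the direction), and the function names are made up; on the GPU the per-slice sums are computed by the warps in parallel.

```python
import random

NUM_BINS = 2048
NUM_WARPS = 16
BINS_PER_WARP = NUM_BINS // NUM_WARPS  # 128


def kth_bin_serial(hist, k):
    """Reference: walk bins from the top until the running count reaches k."""
    running = 0
    for b in range(NUM_BINS - 1, -1, -1):
        running += hist[b]
        if running >= k:
            return b
    raise ValueError("histogram holds fewer than k elements")


def kth_bin_two_level(hist, k):
    """Two-step search: pick the 128-bin slice first, then scan inside it."""
    # Step 0 (parallel on the GPU): each warp sums its own slice.
    slice_sums = [
        sum(hist[w * BINS_PER_WARP:(w + 1) * BINS_PER_WARP])
        for w in range(NUM_WARPS)
    ]
    # Step 1: at most NUM_WARPS serial steps to find the slice with the k-th element.
    above = 0
    for w in range(NUM_WARPS - 1, -1, -1):
        if above + slice_sums[w] >= k:
            break
        above += slice_sums[w]
    else:
        raise ValueError("histogram holds fewer than k elements")
    # Step 2: at most BINS_PER_WARP serial steps inside that slice.
    running = above
    for b in range((w + 1) * BINS_PER_WARP - 1, w * BINS_PER_WARP - 1, -1):
        running += hist[b]
        if running >= k:
            return b
    raise AssertionError("unreachable: the slice sum guarantees a hit")


random.seed(1)
demo_hist = [random.randint(0, 40) for _ in range(NUM_BINS)]
assert kth_bin_two_level(demo_hist, 2048) == kth_bin_serial(demo_hist, 2048)
```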

…flicts Pass caller-owned scratch buffer through heuristic TopK pipeline to eliminate cudaMallocAsync/cudaFreeAsync inside the kernel, enabling CUDA Graph capture. Also resolves stash merge conflicts in test file and adds seq_lens clamping for valid row lengths with next_n > 1. Signed-off-by: longcheng-nv <[email protected]> Made-with: claude-4.6-opus-high

…pK tests Add docstrings to all test functions and helpers to improve docstring coverage. Remove broken duplicate test_indexer_topk_decode_dist that referenced undefined names (dtype, run_fn). Suppress F811 on intentional CuTE DSL test redefinitions with different parametrize configurations. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>

…n test_dsa_indexer The DSA backend accesses sparse_attention_config.enable_heuristic_topk but the mock SparseAttentionConfig in test_dsa_indexer.py was missing this field, causing all 76 DSA indexer tests to fail with AttributeError. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>

Add one-line docstrings to all public functions, classes, and methods across the 5 Python files modified by this PR to satisfy the CI docstring coverage gate (was 41%, now ~100%). Signed-off-by: longcheng-nv <[email protected]> Made-with: claude-4.6-opus-high

Without heuristic_scratch the call silently falls back to the radix kernel. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>

The heuristic TopK kernel requires stride0 divisible by 4 for float4 loads. In production fp8_paged_mqa_logits always returns tensors with stride aligned to 256, but the test helpers created logits with arbitrary max_len from random seq_lens, causing assertion failures when max_len was not a multiple of 4. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>
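One way a test helper could satisfy the alignment requirement is to round the row length up to a multiple of 4 and pad the tail. This is a sketch only: the helper names and the choice of negative infinity as the pad value are assumptions, not what the PR's test code necessarily does.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return (value + multiple - 1) // multiple * multiple


def padded_row(logits_row, multiple: int = 4, pad_value: float = float("-inf")):
    """Pad one row so its length (used as the row stride) is a multiple of 4."""
    target = round_up(len(logits_row), multiple)
    return list(logits_row) + [pad_value] * (target - len(logits_row))


print(round_up(70_690, 4), len(padded_row([0.5] * 5)))
```

Padding with negative infinity keeps the extra slots from ever being selected, as long as the row holds at least K real values.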

…duplicate allocation
- Remove duplicate `heuristic_scratch_values` allocation from `create_expanded_buffers()` (already created in `__post_init__`); move resize logic into `update_spec_dec_param()` where shape changes.
- Eliminate `heuristic_pre_idx_staging` buffer entirely: pass `heuristic_prev_topk` directly as `pre_idx` to the kernel.
- Move +1 temporal offset into the C++ kernel (`preIdxOffset = (rowIdx % next_n) + 1`), removing two Python tensor ops (copy_ and += 1) from the CUDA Graph captured region per layer per decode step.
Addresses review comments from lfr-0531 on PR NVIDIA#12385.
Signed-off-by: longcheng-nv <[email protected]>
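The row-to-hint mapping can be written out in a few lines. The formulas `rowIdx / next_n` and `(rowIdx % next_n) + 1` come from the commit messages; the function names and the guard that leaves negative sentinels unshifted are assumptions made for this sketch.

```python
def hint_row_and_offset(row_idx: int, next_n: int):
    """Map an output row to (hint row, temporal offset)."""
    return row_idx // next_n, (row_idx % next_n) + 1


def shifted_hints(prev_topk, row_idx: int, next_n: int):
    """Hints for `row_idx`: the request's saved indices moved forward by the offset."""
    hint_row, offset = hint_row_and_offset(row_idx, next_n)
    return [i + offset if i >= 0 else i for i in prev_topk[hint_row]]


saved = [[10, 20, -1], [7, 8, 9]]  # one saved TopK row per request
for row in range(6):               # 2 requests x next_n = 3 rows
    print(row, hint_row_and_offset(row, 3), shifted_hints(saved, row, 3))
```

All `next_n` rows of a request share one saved hint row, and later MTP positions receive a larger shift because their queries sit further from the position at which the hints were recorded.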

…on pre-Blackwell
- Add `get_sm_version() >= 100` guard to `enable_heuristic_topk` in both `DSAtrtllmAttentionMetadata` and `Indexer`, so the heuristic path silently falls back to radix sort on Hopper and older architectures.
- Add `@skip_pre_blackwell` decorator to `test_indexer_topk_decode_dist` in `test_indexer_topk.py` and `test_indexer_topk_dist.py`, since the heuristic kernel is only validated for Blackwell GPUs.
Fixes CI failure on DGX_H100: test_indexer_topk_decode_dist[8192-2048-2-128-logistic_m0.47_s1.46-0.5].
Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>

…eedback Revert docstring-only changes in cpp_custom_ops.py, model_config.py, and llm_args.py that were added in eda0351 but are outside the scope of this PR. Docstrings in files directly modified by this feature (dsa.py and test_dsa_indexer.py) are retained. Addresses feedback from lfr-0531 and kaiyux on PR NVIDIA#12385. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>

… and multi-GPU Add runtime topK == TOP_K validation, compile-time static_assert blocking non-32-bit index types, remove process-wide static flag in favor of unconditional cudaFuncSetAttribute, and drop unsafe reinterpret_casts. Signed-off-by: Long Cheng <[email protected]> Signed-off-by: longcheng-nv <[email protected]>

Add a parametrized variant to TestDeepSeekV32.test_fp8_blockscale that exercises the end-to-end heuristic TopK decode path with MTP (next_n=1). Gated to Blackwell (SM >= 100). Validates accuracy on MMLU and GSM8K with enable_heuristic_topk=True through the LLM API. Made-with: claude-4.6-opus-high Signed-off-by: longcheng-nv <[email protected]>

longcheng-nv · 2026-04-02T01:50:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-02T01:56:46Z

PR_Github #41297 [ run ] triggered by Bot. Commit: 5f59b99 Link to invocation

tensorrt-cicd · 2026-04-02T09:45:51Z

PR_Github #41297 [ run ] completed with state SUCCESS. Commit: 5f59b99
/LLM/main/L0_MergeRequest_PR pipeline #32252 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

longcheng-nv · 2026-04-03T01:33:48Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T01:39:57Z

PR_Github #41536 [ run ] triggered by Bot. Commit: 5f59b99 Link to invocation

tensorrt-cicd · 2026-04-03T04:45:58Z

PR_Github #41536 [ run ] completed with state SUCCESS. Commit: 5f59b99
/LLM/main/L0_MergeRequest_PR pipeline #32451 completed with status: 'SUCCESS'

CI Report

Link to invocation

…Sparse Attention (NVIDIA#12385) Signed-off-by: longcheng-nv <[email protected]>

longcheng-nv requested review from a team as code owners March 20, 2026 05:12

longcheng-nv requested review from HuiGao-NV, Shixiaowei02, Superjomn, arysef, brb-nv and yizhang-nv March 20, 2026 05:12

coderabbitai Bot reviewed Mar 20, 2026

longcheng-nv force-pushed the feat/heuristic-indexer-topk branch 3 times, most recently from 3ef3244 to 5f02778 Compare March 24, 2026 03:01

longcheng-nv added 16 commits April 2, 2026 01:11

[None][fix] Pass heuristic_scratch to indexer_topk_decode with pre_idx

7a8eb36

Without `heuristic_scratch` the call silently falls back to the radix kernel.

Made-with: claude-4.6-opus-high
Signed-off-by: longcheng-nv <[email protected]>
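The fallback this commit fixes can be sketched as a simple eligibility check. This is an illustration only: `select_topk_path` is a hypothetical name, the real `indexer_topk_decode` signature is not shown in this thread, and the actual dispatch applies further conditions (the PR summary also lists a CuTE DSL tier between the heuristic and radix paths).

```python
from typing import Optional, Sequence


def select_topk_path(
    pre_idx: Optional[Sequence[int]],
    heuristic_scratch: Optional[Sequence[float]],
) -> str:
    """Pick a decode path from the optional inputs that were supplied.

    Assumption: the heuristic kernel needs both the hint indices and the
    scratch buffer; if either is missing, the radix kernel is used without
    raising an error.
    """
    if pre_idx is not None and heuristic_scratch is not None:
        return "heuristic"
    return "radix"
```

Passing `pre_idx` alone returns `"radix"` under this sketch, which is the silent fallback the commit message describes.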

longcheng-nv force-pushed the feat/heuristic-indexer-topk branch from 386cd7b to 5f59b99 Compare April 2, 2026 01:48

longcheng-nv requested a review from a team as a code owner April 2, 2026 01:48

juney-nvidia approved these changes Apr 3, 2026

juney-nvidia merged commit 29fac6b into NVIDIA:main Apr 3, 2026
5 checks passed

govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Apr 6, 2026

[None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for …

c498877

…Sparse Attention (NVIDIA#12385) Signed-off-by: longcheng-nv <[email protected]>

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for …

51eebf7

…Sparse Attention (NVIDIA#12385) Signed-off-by: longcheng-nv <[email protected]>

This was referenced Apr 26, 2026

[None][perf] Scheme X L2-aware dispatcher and PDL launchers for sparse-attention GVR Top-K #13477

Merged

[None][docs] add GVR Top-K technical blog #13714

Merged

Conversation

longcheng-nv commented Mar 20, 2026 • edited

Summary

Commit Breakdown (11 commits)

Key Files

API

Test plan

Accuracy Evaluation (DeepSeek V3.2 NVFP4 on B200, 8×TP/EP, DSA sparse attention)

MMLU (14,042 questions)

GSM8K (5-shot)

GPQA-Diamond (CoT zero-shot)

LongBench v1 (8 runs, MTP=1)

LongBench v2 (215 questions, medium context length, MTP=1)

Summary

Performance Results (B200)

Random input (norm-similar distribution, y = 1 + 0.1 * N[0,1])

Realistic input (DeepSeek V3.2 decode logits, SWE-Bench ISL/OSL=64K/2K)

coderabbitai Bot commented Mar 20, 2026 • edited

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

coderabbitai Bot left a comment

longcheng-nv commented Mar 21, 2026

longcheng-nv commented Mar 21, 2026

longcheng-nv commented Mar 23, 2026

longcheng-nv commented Mar 23, 2026

github-actions Bot commented Mar 23, 2026

GitHub Bot Help

kill

skip

reuse-pipeline

longcheng-nv commented Mar 23, 2026

HuiGao-NV commented Mar 23, 2026

tensorrt-cicd commented Mar 23, 2026

tensorrt-cicd commented Mar 23, 2026

longcheng-nv commented Mar 23, 2026

HuiGao-NV commented Mar 23, 2026

tensorrt-cicd commented Mar 23, 2026

tensorrt-cicd commented Mar 23, 2026

longcheng-nv commented Mar 24, 2026

longcheng-nv commented Apr 2, 2026

tensorrt-cicd commented Apr 2, 2026

tensorrt-cicd commented Apr 2, 2026

longcheng-nv commented Apr 3, 2026
