[None][perf] Use +64 batch sizes for padding-enabled CUDA graphs by yijingl-nvidia · Pull Request #12895 · NVIDIA/TensorRT-LLM

yijingl-nvidia · 2026-04-09T16:52:25Z

Description

When enable_padding=True, replace the sparse powers-of-2 schedule (256, 512, 1024, 2048) with uniform +64 increments after the initial [1,2,4,8,...,128] base, giving denser coverage (192, 256, 320, …, max_batch_size) and reducing padding waste at intermediate batch sizes.
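To make the padding-waste argument concrete, here is a minimal sketch that compares how far a batch is padded under the two schedules. It assumes the runtime pads a batch up to the smallest captured size that fits it, and that the base is `[1, 2, 4]` plus multiples of 8 up to 128, as in the review diff further down. `padded_batch_size` is a hypothetical helper written for illustration, not a TensorRT-LLM function.

```python
import bisect

def padded_batch_size(batch_size, graph_batch_sizes):
    """Return the smallest captured size >= batch_size.

    Hypothetical helper: graph_batch_sizes must be sorted ascending.
    """
    idx = bisect.bisect_left(graph_batch_sizes, batch_size)
    if idx == len(graph_batch_sizes):
        raise ValueError(f"batch size {batch_size} exceeds the largest captured size")
    return graph_batch_sizes[idx]

# Shared base: 1, 2, 4, then multiples of 8 up to 128.
base = [1, 2, 4] + [8 * i for i in range(1, 17)]
# Old padding schedule for max_batch_size=2048: sparse powers of 2 above 128.
old_schedule = base + [256, 512, 1024, 2048]
# New padding schedule: uniform +64 steps above 128.
new_schedule = base + list(range(192, 2049, 64))

for n in (130, 260, 1030):
    old = padded_batch_size(n, old_schedule)
    new = padded_batch_size(n, new_schedule)
    print(f"batch {n}: old pads to {old} (+{old - n}), new pads to {new} (+{new - n})")
```

Under these assumptions, a batch of 260 requests is padded to 512 with the old schedule but only to 320 with the +64 schedule.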

To measure how denser batch sizes affect output throughput, we tested four major models (DeepSeek R1, DeepSeek V3.2, Kimi K2, and Qwen 3) in both aggregated (agg) and disaggregated (disagg) serving modes.

A denser set of batch sizes costs more GPU memory. For DeepSeek R1, the metadata for each CUDA graph takes about 10 MiB. With +64 batch sizes, the total grows from 220 MiB to 490 MiB per GPU; with +8 batch sizes, it grows further to 2.6 GiB.
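The memory figures can be sanity-checked with a back-of-the-envelope count of captured graphs. The sketch below assumes max batch size 2048, a flat 10 MiB per graph (the approximate DeepSeek R1 figure above), and that the +8 experiment used the same base followed by +8 steps above 128; the exact +8 schedule is not stated here, so treat that part as a guess. `stepped_schedule` is a hypothetical helper.

```python
MIB_PER_GRAPH = 10  # approximate per-graph cost quoted for DeepSeek R1

def stepped_schedule(max_batch_size, step):
    """Hypothetical helper: base sizes up to 128, then uniform +step increments."""
    base = [1, 2, 4] + [8 * i for i in range(1, 17)]
    return base + list(range(128 + step, max_batch_size + 1, step))

# Old padding schedule for max_batch_size=2048 (powers of 2 above 128).
old_schedule = [1, 2, 4] + [8 * i for i in range(1, 17)] + [256, 512, 1024, 2048]
plus_64 = stepped_schedule(2048, 64)
plus_8 = stepped_schedule(2048, 8)

for name, sizes in (("old", old_schedule), ("+64", plus_64), ("+8", plus_8)):
    est_mib = len(sizes) * MIB_PER_GRAPH
    print(f"{name}: {len(sizes)} graphs, roughly {est_mib} MiB")
```

With these assumptions the +64 schedule has 49 graphs, which lines up with the 490 MiB figure; the old and +8 estimates land close to, but not exactly on, the measured numbers, as expected from a rounded per-graph cost.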

From the experiments, we found that denser batch sizes lead to a smaller drop in output throughput. Both +64 and +8 batch sizes improve the metrics. +64 batch sizes improve output throughput by up to 1.3x for agg and 1.5x for disagg at some large concurrencies.

+8 batch sizes improve a bit more than +64 on most models. However, we did notice a regression with +8 batch sizes on DeepSeek V3.2. We suspect the cause is that DeepSeek V3.2 uses DSA, which accelerates the attention computation but makes each CUDA graph more complex to store, costing more than 2x the GPU memory. This squeezes out space available for the KV cache and reduces server capacity: each GPU in the +8 experiment has 5 GiB less KV cache memory than in the +64 experiment.

On the downside, +64 batch sizes increase server startup time. TRTLLM runs the graph capture twice: first as a dry run, to measure how much GPU memory is used at runtime and estimate how much KV cache can safely be allocated, and then as the actual startup with the full KV cache allocated. Tested on GB200 with DeepSeek R1, TP=4, IFB, and max batch size 2048, the change increases CUDA graph capture time by 1.4x and total LLM init time (measured as the time of get_llm()) by 1.17x. A possible future remedy is to capture fewer CUDA graphs in the dry run phase and estimate the memory usage of the remaining graphs from the ones that were captured.

We conclude by recommending that the change to +64 batch sizes land in the next release. We are also writing a blog post with the data and guidance on how to set CUDA graph batch sizes to improve performance. Future work includes testing more options, such as +16, and designing a way to set the batch sizes automatically to reach the best balance of memory cost and performance gains.

Test Coverage

CI

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Refactor
- Updated CUDA graph batch size generation with adjusted scheduling patterns for different padding configurations.

coderabbitai · 2026-04-09T16:57:08Z

📝 Walkthrough

Walkthrough

Modified the batch size generation logic in CudaGraphConfig._generate_cuda_graph_batch_sizes. When enable_padding is true, the function now generates batch sizes using [1, 2, 4] followed by multiples of 8 up to 128, then increments by 64 until reaching max_batch_size. When enable_padding is false, existing power-of-two extension logic is moved into an else block, preserving functional behavior.

Changes

Cohort / File(s)	Summary
CUDA Graph Batch Size Generation `tensorrt_llm/llmapi/llm_args.py`	Modified `_generate_cuda_graph_batch_sizes` method: when padding is enabled, uses new batch size strategy (8-step increments transitioning to 64-step increments); when disabled, consolidates power-of-two extension logic into else block while maintaining prior behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check	✅ Passed	The title accurately describes the main change: implementing +64 batch size increments for padding-enabled CUDA graphs, which is the core functional modification.
Description check	✅ Passed	PR description comprehensively explains the change, rationale, experimental results, and trade-offs, exceeding template requirements.


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/llmapi/llm_args.py (1)

169-187: ⚠️ Potential issue | 🟠 Major

Clamp padding-mode batch sizes to max_batch_size before returning.

On lines 181-183, filtering and sorting now run only in the else branch.
With enable_padding=True, batch_sizes can exceed max_batch_size (e.g., max_batch_size=64 still keeps values up to 128), which can lead to unnecessary graph capture and inconsistent config behavior.

🔧 Proposed fix

         if enable_padding:
             # Start with [1, 2, 4, 8, 16, 24, ..., 128] (multiples of 8)
             batch_sizes = [1, 2, 4] + [i * 8 for i in range(1, 17)]
             # Sliding 64: extend by increments of 64 up to max_batch_size
             while batch_sizes[-1] + 64 <= max_batch_size:
                 batch_sizes.append(batch_sizes[-1] + 64)
         else:
             batch_sizes = list(range(1, 32)) + [32, 64, 128]
             # Add powers of 2 up to max_batch_size
             batch_sizes += [
                 2**i for i in range(8, math.ceil(math.log(max_batch_size, 2)))
             ]
-            # Filter and sort batch sizes
-            batch_sizes = sorted(
-                [size for size in batch_sizes if size <= max_batch_size])
+        # Filter and sort batch sizes for both branches
+        batch_sizes = sorted(size for size in batch_sizes if size <= max_batch_size)
 
         # Add max_batch_size if not already included
-        if max_batch_size != batch_sizes[-1]:
+        if not batch_sizes or max_batch_size != batch_sizes[-1]:
             batch_sizes.append(max_batch_size)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 169 - 187, The enable_padding
branch for building batch_sizes can produce values > max_batch_size; update the
block that builds batch_sizes when enable_padding is True (the batch_sizes list
creation) to clamp/filter values to <= max_batch_size and sort/unique them just
like the else branch does, ensuring batch_sizes = sorted([s for s in batch_sizes
if s <= max_batch_size]) before the final check that appends max_batch_size if
missing; keep the subsequent append-of-max_batch_size logic unchanged.

🧹 Nitpick comments (1)

tensorrt_llm/llmapi/llm_args.py (1)
169-189: Add regression coverage for padding schedule edge cases.

Please add tests for enable_padding=True with at least: max_batch_size=64, 129, and 320 to assert all values are <= max_batch_size, sorted, and include max_batch_size. This will prevent regressions from branch-local filtering changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 169 - 189, Add regression tests
covering the padding schedule branch where enable_padding=True by calling the
batch-size generator (the function that takes enable_padding and max_batch_size
and returns batch_sizes) with max_batch_size values 64, 129, and 320; for each
case assert that every returned value in batch_sizes is <= max_batch_size, that
batch_sizes is sorted (non-decreasing), and that max_batch_size is present in
the returned list; target the branch that builds batch_sizes using the
enable_padding path (referencing the enable_padding parameter, max_batch_size
variable, and the returned batch_sizes) so future changes to the padding
schedule are validated.
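A minimal sketch of what such regression checks could look like. It is self-contained: `generate_padded_batch_sizes` is a hypothetical stand-in that mirrors the padding branch of the proposed fix above, not the real `CudaGraphConfig._generate_cuda_graph_batch_sizes` method, whose signature is not shown in this thread.

```python
def generate_padded_batch_sizes(max_batch_size):
    """Hypothetical stand-in for the enable_padding=True branch of the proposed fix."""
    # Base: 1, 2, 4, then multiples of 8 up to 128.
    batch_sizes = [1, 2, 4] + [i * 8 for i in range(1, 17)]
    # Extend by increments of 64 up to max_batch_size.
    while batch_sizes[-1] + 64 <= max_batch_size:
        batch_sizes.append(batch_sizes[-1] + 64)
    # Drop anything above max_batch_size, then make sure max_batch_size is present.
    batch_sizes = sorted(s for s in batch_sizes if s <= max_batch_size)
    if not batch_sizes or max_batch_size != batch_sizes[-1]:
        batch_sizes.append(max_batch_size)
    return batch_sizes

def check_schedule(max_batch_size):
    """Assert the three properties requested in the review comment."""
    sizes = generate_padded_batch_sizes(max_batch_size)
    assert all(s <= max_batch_size for s in sizes)
    assert sizes == sorted(sizes)
    assert max_batch_size in sizes
    return sizes

for max_bs in (64, 129, 320):
    print(max_bs, check_schedule(max_bs)[-3:])
```

The three values exercise different paths: 64 falls inside the base list, 129 sits just above it without reaching the first +64 step, and 320 lands exactly on a +64 step.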


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5920502e-9591-4b1c-983a-406885dd4e30

📥 Commits

Reviewing files that changed from the base of the PR and between 28b6afb and 6e65d44.

📒 Files selected for processing (1)

tensorrt_llm/llmapi/llm_args.py

yijingl-nvidia · 2026-04-09T19:40:16Z

/bot run

tensorrt-cicd · 2026-04-09T19:47:50Z

PR_Github #42571 [ run ] triggered by Bot. Commit: 6e65d44 Link to invocation

tensorrt-cicd · 2026-04-10T01:51:42Z

PR_Github #42571 [ run ] completed with state SUCCESS. Commit: 6e65d44
/LLM/main/L0_MergeRequest_PR pipeline #33304 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-10T04:07:08Z

/bot run

tensorrt-cicd · 2026-04-10T04:13:22Z

PR_Github #42656 [ run ] triggered by Bot. Commit: a557b8b Link to invocation

kaiyux · 2026-04-10T08:06:32Z

@yijingl-nvidia can you put a summary on the performance validation results on the PR description? Thanks.

tensorrt-cicd · 2026-04-10T14:35:17Z

PR_Github #42656 [ run ] completed with state SUCCESS. Commit: a557b8b
/LLM/main/L0_MergeRequest_PR pipeline #33365 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-10T18:29:28Z

/bot run

tensorrt-cicd · 2026-04-10T18:46:03Z

PR_Github #42731 [ run ] triggered by Bot. Commit: a557b8b Link to invocation

tensorrt-cicd · 2026-04-11T01:21:24Z

PR_Github #42731 [ run ] completed with state SUCCESS. Commit: a557b8b
/LLM/main/L0_MergeRequest_PR pipeline #33414 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-13T04:42:27Z

> @yijingl-nvidia can you put a summary on the performance validation results on the PR description? Thanks.

Thanks for the suggestion. I edited the PR description to include a brief summary of our performance findings.

yijingl-nvidia · 2026-04-13T14:23:14Z

/bot run

tensorrt-cicd · 2026-04-13T14:30:09Z

PR_Github #43059 [ run ] triggered by Bot. Commit: a557b8b Link to invocation

tensorrt-cicd · 2026-04-13T23:35:22Z

PR_Github #43059 [ run ] completed with state SUCCESS. Commit: a557b8b
/LLM/main/L0_MergeRequest_PR pipeline #33702 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-20T17:58:50Z

/bot run

tensorrt-cicd · 2026-04-20T18:05:57Z

PR_Github #44494 [ run ] triggered by Bot. Commit: f77a4d2 Link to invocation

tensorrt-cicd · 2026-04-20T22:42:38Z

PR_Github #44494 [ run ] completed with state SUCCESS. Commit: f77a4d2
/LLM/main/L0_MergeRequest_PR pipeline #34894 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-21T17:46:56Z

/bot run

tensorrt-cicd · 2026-04-21T17:53:35Z

PR_Github #44781 [ run ] triggered by Bot. Commit: e4fa2e3 Link to invocation

tensorrt-cicd · 2026-04-21T21:04:28Z

PR_Github #44781 [ run ] completed with state SUCCESS. Commit: e4fa2e3
/LLM/main/L0_MergeRequest_PR pipeline #35135 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-21T23:50:07Z

/bot run

tensorrt-cicd · 2026-04-21T23:55:40Z

PR_Github #44822 [ run ] triggered by Bot. Commit: e4fa2e3 Link to invocation

When enable_padding=True, replace the sparse powers-of-2 schedule (256, 512, 1024, 2048) with uniform +64 increments after the initial [1,2,4,8,...,128] base, giving denser coverage (192, 256, 320, …, max_batch_size) and reducing padding waste at intermediate batch sizes. Signed-off-by: Yijing Li <[email protected]>

Move filter/sort outside the if/else so sizes exceeding max_batch_size are dropped in the enable_padding=True branch as well. Add guard for empty list before the max_batch_size append. Add regression tests for edge cases: max_batch_size=64, 129, 320. Signed-off-by: Yijing Li <[email protected]>

yijingl-nvidia · 2026-04-21T23:59:01Z

/bot run

tensorrt-cicd · 2026-04-22T00:05:38Z

PR_Github #44826 [ run ] triggered by Bot. Commit: 6be963b Link to invocation

tensorrt-cicd · 2026-04-22T03:00:20Z

PR_Github #44826 [ run ] completed with state SUCCESS. Commit: 6be963b
/LLM/main/L0_MergeRequest_PR pipeline #35172 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-22T05:56:47Z

/bot run

tensorrt-cicd · 2026-04-22T06:03:13Z

PR_Github #44902 [ run ] triggered by Bot. Commit: 6be963b Link to invocation

tensorrt-cicd · 2026-04-22T12:57:31Z

PR_Github #44902 [ run ] completed with state SUCCESS. Commit: 6be963b
/LLM/main/L0_MergeRequest_PR pipeline #35236 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-22T16:44:38Z

/bot run

tensorrt-cicd · 2026-04-22T16:50:42Z

PR_Github #44990 [ run ] triggered by Bot. Commit: 6be963b Link to invocation

tensorrt-cicd · 2026-04-22T22:12:10Z

PR_Github #44990 [ run ] completed with state SUCCESS. Commit: 6be963b
/LLM/main/L0_MergeRequest_PR pipeline #35311 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yijingl-nvidia · 2026-04-23T04:27:59Z

/bot run

tensorrt-cicd · 2026-04-23T04:34:40Z

PR_Github #45101 [ run ] triggered by Bot. Commit: 6be963b Link to invocation

tensorrt-cicd · 2026-04-23T14:51:09Z

PR_Github #45101 [ run ] completed with state SUCCESS. Commit: 6be963b
/LLM/main/L0_MergeRequest_PR pipeline #35398 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

yijingl-nvidia · 2026-04-23T19:57:05Z

Corresponding blog post at #13393

…DIA#12895) Signed-off-by: Yijing Li <[email protected]>

yijingl-nvidia requested a review from a team as a code owner April 9, 2026 16:52

yijingl-nvidia requested a review from syuoni April 9, 2026 16:52

github-actions Bot assigned yijingl-nvidia Apr 9, 2026

coderabbitai Bot reviewed Apr 9, 2026

yijingl-nvidia changed the title ~~[None][perf] Use sliding-64 batch sizes for padding-enabled CUDA graphs~~ [None][perf] Use +64 batch sizes for padding-enabled CUDA graphs Apr 9, 2026

kaiyux approved these changes Apr 10, 2026

venkywonka approved these changes Apr 17, 2026

yijingl-nvidia force-pushed the default_cuda_graph_batch_sizes branch from a557b8b to f77a4d2 Compare April 20, 2026 17:57

yijingl-nvidia force-pushed the default_cuda_graph_batch_sizes branch from f77a4d2 to e4fa2e3 Compare April 21, 2026 17:43

yijingl-nvidia added 2 commits April 21, 2026 16:57

yijingl-nvidia force-pushed the default_cuda_graph_batch_sizes branch from e4fa2e3 to 6be963b Compare April 21, 2026 23:57

yijingl-nvidia mentioned this pull request Apr 23, 2026

[None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it #13393

Merged

1 task

taylor-yb-lee merged commit 7e5275f into NVIDIA:main Apr 24, 2026
5 checks passed

yijingl-nvidia mentioned this pull request May 5, 2026

[https://nvbugs/6115290][fix] Fix GPT OSS 120B GB200 Test Regression #13743

Merged

1 task

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][perf] Use +64 batch sizes for padding-enabled CUDA graphs (NVI…

4a8720c

…DIA#12895) Signed-off-by: Yijing Li <[email protected]>
