[TRTLLM-11287][feat] Implement python based scheduler for KVCacheManagerV2 by lancelly · Pull Request #11939 · NVIDIA/TensorRT-LLM

lancelly · 2026-03-05T08:59:23Z

Summary

Implement a Python-based interleaved scheduler (KVCacheV2Scheduler) for KVCacheManagerV2 that unifies capacity scheduling and micro-batch scheduling into a single pass, replacing the previous KVCacheV2DummyScheduler + BindCapacityScheduler/BindMicroBatchScheduler two-stage path for V2.
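
To make the single-pass idea concrete, here is a minimal sketch; it is not the PR's implementation. `Req`, `blocks_for`, `schedule_step`, and every parameter name are hypothetical. The sketch assumes FCFS order and block-granular KV allocation, and it omits resume of suspended requests, block reuse, PEFT pages, and draft-token accounting.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Req:
    """Hypothetical request record, not the TensorRT-LLM request type."""
    rid: int
    cached_tokens: int = 0      # tokens whose KV is already resident on GPU
    remaining_context: int = 0  # 0 means the request is in the generation phase
    suspended: bool = False


def blocks_for(tokens: int, tokens_per_block: int) -> int:
    """Number of KV blocks needed to hold `tokens` (ceiling division)."""
    return -(-tokens // tokens_per_block)


def schedule_step(reqs: List[Req], free_blocks: int, token_budget: int,
                  tokens_per_block: int, chunk_unit: int
                  ) -> Tuple[List[Tuple[int, int]], int]:
    """One FCFS pass that reserves KV blocks and spends token budget together."""
    scheduled: List[Tuple[int, int]] = []
    for idx, req in enumerate(reqs):
        if req.suspended or token_budget <= 0:
            continue
        is_context = req.remaining_context > 0
        if is_context:
            want = min(req.remaining_context, token_budget)
            if want < req.remaining_context:
                # A partial chunk is rounded down to a multiple of chunk_unit.
                want -= want % chunk_unit
        else:
            want = 1  # one new token per generation step
        if want == 0:
            continue
        need = (blocks_for(req.cached_tokens + want, tokens_per_block)
                - blocks_for(req.cached_tokens, tokens_per_block))
        # Tail eviction: suspend requests from the end of the queue first.
        for victim in reversed(reqs[idx + 1:]):
            if need <= free_blocks:
                break
            if victim.suspended or victim.cached_tokens == 0:
                continue
            free_blocks += blocks_for(victim.cached_tokens, tokens_per_block)
            victim.suspended = True
        if need > free_blocks:
            if not is_context:
                # Self-eviction: a generation request that cannot grow
                # gives up its own blocks.
                free_blocks += blocks_for(req.cached_tokens, tokens_per_block)
                req.suspended = True
            continue
        free_blocks -= need
        token_budget -= want
        req.cached_tokens += want
        if is_context:
            req.remaining_context -= want
        scheduled.append((req.rid, want))
    return scheduled, free_blocks
```

With three requests (a generation request holding 8 tokens, a new 10-token context request, and another generation request holding 8 tokens), 1 free block, a token budget of 8, 4 tokens per block, and a chunk unit of 4, the sketch schedules one generation token and a 4-token context chunk, and suspends the last request to free the blocks the chunk needs.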

Key changes:

- New KVCacheV2Scheduler (scheduler_v2.py): merges KV cache allocation (via resize()) and token budget assignment into one unified scheduling loop. Supports:
  - MAX_UTILIZATION eviction policy (suspend-to-host) with tail-eviction strategy
  - Self-eviction for generation requests that cannot allocate
  - Block reuse integration for context requests
  - FCFS chunked prefill with configurable chunk unit size
  - PEFT (LoRA) page accounting
  - Draft model (MTP) KV cache coordination
  - LoRA-aware request sorting
- KVCacheManagerV2 scheduling API (resource_manager.py): exposes fine-grained allocation methods called by the scheduler:
  - prepare_context(): create _KVCache, handle block reuse lookup, resume from suspended state
  - resize_context(): resize KV cache to cover context tokens
  - try_allocate_generation(): allocate one additional KV slot for generation (with resume)
  - suspend_request(): suspend a request's KV cache to host tier
  - is_request_active(): check if a request has a live, non-suspended KV cache
- Draft KV cache manager refactor (resource_manager.py): extract _prepare_draft_resources(), which mirrors the main manager's allocations for MTP draft layers, since the V2 scheduler only manages the primary KV cache
- PyExecutor integration (py_executor.py): when the V2 scheduler is active, skip executor-level _terminate_requests and _pause_requests for paused requests (the scheduler manages suspend/resume directly)
- Bug fixes:
  - Fix BAD_PAGE_INDEX (-1) causing CUDA_ERROR_ILLEGAL_ADDRESS in the Flash MLA kernel by replacing it with 0 for unused page slots
  - Cap get_num_available_tokens by GPU-only capacity when max_tokens is explicitly set
  - Add assertions blocking Star attention and multi-beam with V2
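
The BAD_PAGE_INDEX fix listed above can be illustrated with a small sketch. It assumes a page table stored as rows of page indices where -1 marks unused slots, and that a kernel may form an address from every slot, including unused ones. `sanitize_page_table` is a hypothetical helper, not a function from the PR, and plain Python lists stand in for the real tensors.

```python
from typing import List

BAD_PAGE_INDEX = -1  # sentinel value named in the PR description


def sanitize_page_table(page_table: List[List[int]], fill: int = 0) -> List[List[int]]:
    """Return a copy of the table with unused slots pointing at a valid page.

    Index 0 is always in range, so an address computed from an unused slot
    stays inside the KV pool instead of landing at an out-of-bounds location.
    """
    return [[fill if page == BAD_PAGE_INDEX else page for page in row]
            for row in page_table]


table = [[3, 7, BAD_PAGE_INDEX, BAD_PAGE_INDEX],
         [5, BAD_PAGE_INDEX, BAD_PAGE_INDEX, BAD_PAGE_INDEX]]
clean = sanitize_page_table(table)
```

The sketch relies on sequence lengths, not the page table, to decide which slots hold real data, so pointing unused slots at page 0 does not change which tokens are attended to.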

Test plan

- New unit tests: tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py covering scheduling logic, eviction, chunked prefill, PEFT, draft tokens, and edge cases
- New integration tests: tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py (+542 lines) covering LLaMA accuracy with V2 KV cache + chunked context + MTP + block reuse
- Existing accuracy ITs parametrized with v2_kv_cache=True/False variants
- CI test-db entries added for B200, DGX-B200, and H100
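
Because test-db entries refer to tests by their parametrized ids, the listed names have to match the generated ids exactly. The sketch below assumes ids of the form `name=value`, as the v2_kv_cache=True/False variants suggest. `make_ids` and `stale_entries` are hypothetical helpers for illustration and do not reproduce the repository's parametrize_with_ids.

```python
from typing import Iterable, List


def make_ids(name: str, values: Iterable[object]) -> List[str]:
    """Build one id per parameter value in the assumed `name=value` form."""
    return [f"{name}={value}" for value in values]


def stale_entries(listed_ids: Iterable[str], generated_ids: Iterable[str]) -> List[str]:
    """Return listed ids that no longer correspond to any generated id."""
    generated = set(generated_ids)
    return [entry for entry in listed_ids if entry not in generated]


ids = make_ids("v2_kv_cache", [True, False])
# Hand-written names would be reported as stale against the generated ids.
stale = stale_entries(["v1_kv_cache", "v2_kv_cache"], ids)
```

A check like this, run at collection time, would surface a mismatch between a test list and the generated ids before CI reports missing tests.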

Summary by CodeRabbit

Release Notes

New Features
- Enhanced KV cache scheduling strategy with improved memory management and resource efficiency
- Added support for draft token generation during inference
- Implemented GPU memory-aware caching for better resource utilization
- Expanded support for adapter-based inference workflows
Tests
- Added comprehensive scheduling validation test suite
- Extended test coverage with new caching configuration options

Signed-off-by: Lanyu Liao <[email protected]>

- Fix NameError in _create_one_model_draft_kv_cache_manager: use self._kv_cache_config instead of undefined draft_kv_cache_config
- Remove test_reject_no_evict_policy: code now logs a warning instead of asserting on non-MAX_UTILIZATION policy

Signed-off-by: Lanyu Liao <[email protected]>
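
The switch from an assertion to a warning for non-MAX_UTILIZATION policies might look like the sketch below. `check_policy`, the logger name, and the use of plain strings for policy names are assumptions for illustration. What the scheduler does after warning is not stated in this conversation, so the sketch only reports the outcome.

```python
import logging

logger = logging.getLogger("kv_cache_v2_sketch")  # hypothetical logger name


def check_policy(policy: str) -> bool:
    """Return True for MAX_UTILIZATION; otherwise warn and return False.

    Before the change, the equivalent check would have been an assert,
    which aborts construction instead of letting the caller continue.
    """
    if policy != "MAX_UTILIZATION":
        logger.warning(
            "Scheduler sketch expects MAX_UTILIZATION, got %s", policy)
        return False
    return True
```

A unit test that expected an AssertionError for another policy would no longer pass under this behavior, which is consistent with the removal of test_reject_no_evict_policy.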

lancelly · 2026-03-12T15:13:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-12T15:20:30Z

PR_Github #38745 [ run ] triggered by Bot. Commit: 521d419 Link to invocation

tensorrt-cicd · 2026-03-12T17:36:22Z

PR_Github #38745 [ run ] completed with state SUCCESS. Commit: 521d419
/LLM/main/L0_MergeRequest_PR pipeline #30064 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lancelly · 2026-03-13T01:22:52Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T01:30:36Z

PR_Github #38801 [ run ] triggered by Bot. Commit: 521d419 Link to invocation

lancelly · 2026-03-13T03:12:28Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T03:22:40Z

PR_Github #38816 [ run ] triggered by Bot. Commit: a1a7531 Link to invocation

lancelly · 2026-03-13T05:56:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T06:04:02Z

PR_Github #38833 [ run ] triggered by Bot. Commit: de3e8ea Link to invocation

lancelly · 2026-03-19T05:50:35Z

/bot run --disable-fail-fast

eopXD · 2026-03-19T05:55:36Z

Thank you for patiently addressing the comments.

tensorrt-cicd · 2026-03-19T05:57:21Z

PR_Github #39547 [ run ] triggered by Bot. Commit: 0d2ef32 Link to invocation

yizhang-nv · 2026-03-19T06:01:55Z

This PR contains some workarounds (WAR) for MTP. They should be removed once the underlying issue is properly fixed.

lancelly · 2026-03-19T06:03:47Z

> Thank you for patiently addressing the comments.

Thanks for the thorough review! The feedback really helped improve the design.

lancelly · 2026-03-19T08:11:19Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-19T08:17:03Z

PR_Github #39571 [ run ] triggered by Bot. Commit: 23a073a Link to invocation

lancelly · 2026-03-19T09:45:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-19T09:51:26Z

PR_Github #39586 [ run ] triggered by Bot. Commit: 208a207 Link to invocation

lancelly · 2026-03-19T14:33:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-19T14:41:46Z

PR_Github #39607 [ run ] triggered by Bot. Commit: 6ab16fe Link to invocation

tensorrt-cicd · 2026-03-19T21:32:04Z

PR_Github #39607 [ run ] completed with state SUCCESS. Commit: 6ab16fe
/LLM/main/L0_MergeRequest_PR pipeline #30815 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

lancelly · 2026-03-20T02:19:03Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-20T02:24:52Z

PR_Github #39669 [ run ] triggered by Bot. Commit: 7de5c05 Link to invocation

tensorrt-cicd · 2026-03-20T09:23:32Z

PR_Github #39669 [ run ] completed with state SUCCESS. Commit: 7de5c05
/LLM/main/L0_MergeRequest_PR pipeline #30872 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lancelly · 2026-03-20T09:25:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-20T09:32:06Z

PR_Github #39726 [ run ] triggered by Bot. Commit: 7de5c05 Link to invocation

tensorrt-cicd · 2026-03-20T10:48:02Z

PR_Github #39726 [ run ] completed with state SUCCESS. Commit: 7de5c05
/LLM/main/L0_MergeRequest_PR pipeline #30922 completed with status: 'SUCCESS'

CI Report

Link to invocation

…gerV2 (NVIDIA#11939) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]>

init draft for v2 scheduler

7d18d32

Signed-off-by: Lanyu Liao <[email protected]>

github-actions Bot assigned lancelly Mar 5, 2026

lancelly added 16 commits March 5, 2026 22:19

improve check logic

93e8352

Signed-off-by: Lanyu Liao <[email protected]>

add UTs for schedulerv2

bcfd812

Signed-off-by: Lanyu Liao <[email protected]>

fix broken UTs

09f23b2

Signed-off-by: Lanyu Liao <[email protected]>

add ITs

d8ad21c

Signed-off-by: Lanyu Liao <[email protected]>

draft ITs

f785297

Signed-off-by: Lanyu Liao <[email protected]>

remove dummy scheduler

fbde282

Signed-off-by: Lanyu Liao <[email protected]>

fix block alignment in prepare_resources

b9ce63f

Signed-off-by: Lanyu Liao <[email protected]>

fix some of IT bugs and scheduler eviction logic

8585e66

Signed-off-by: Lanyu Liao <[email protected]>

add evict its

17ee69b

Signed-off-by: Lanyu Liao <[email protected]>

almost clean state except for draft manager

8450571

Signed-off-by: Lanyu Liao <[email protected]>

clean some ITs

082b0cf

Signed-off-by: Lanyu Liao <[email protected]>

clean ITs

54b6a30

Signed-off-by: Lanyu Liao <[email protected]>

merge main and move IT

51e6618

Signed-off-by: Lanyu Liao <[email protected]>

pre-commit

bca23d0

Signed-off-by: Lanyu Liao <[email protected]>

remove redundant IT

823be62

Signed-off-by: Lanyu Liao <[email protected]>

Fix CI collection errors

a1a7531

- Add kv_cache/__init__.py for relative imports in IT
- Remove duplicate ids kwarg in parametrize_with_ids for v2_kv_cache

Signed-off-by: Lanyu Liao <[email protected]>

Fix test-db names to match parametrize_with_ids output

de3e8ea

Rename v1_kv_cache -> v2_kv_cache=False and v2_kv_cache -> v2_kv_cache=True to match the auto-generated ids from parametrize_with_ids.

Signed-off-by: Lanyu Liao <[email protected]>

fix some review comments

cba8198

Signed-off-by: Lanyu Liao <[email protected]>

abstract required_gen_capacity

0d2ef32

Signed-off-by: Lanyu Liao <[email protected]>

eopXD approved these changes Mar 19, 2026

yizhang-nv approved these changes Mar 19, 2026

lancelly added 2 commits March 19, 2026 00:18

consider draft tokens for both main and draft request

06a2b41

Signed-off-by: Lanyu Liao <[email protected]>

minor change to save function call

23a073a

Signed-off-by: Lanyu Liao <[email protected]>

nvpohanh mentioned this pull request Mar 19, 2026

[TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler #11307

Closed

fix UT after using draft len helper

208a207

Signed-off-by: Lanyu Liao <[email protected]>

fix UT

6ab16fe

Signed-off-by: Lanyu Liao <[email protected]>

litaotju approved these changes Mar 20, 2026

add more ITs

7de5c05

Signed-off-by: Lanyu Liao <[email protected]>

lancelly merged commit 68001ce into NVIDIA:main Mar 20, 2026
5 checks passed

7 participants

lancelly commented Mar 5, 2026 •

edited

Loading