
[TRTLLM-11287][feat] Implement python based scheduler for KVCacheManagerV2 #11939

Merged
lancelly merged 43 commits into NVIDIA:main from lancelly:kvcache_manager_v2_scheduler
Mar 20, 2026



@lancelly
Collaborator

@lancelly lancelly commented Mar 5, 2026

Summary

Implement a Python-based interleaved scheduler (KVCacheV2Scheduler) for KVCacheManagerV2 that unifies capacity scheduling and micro-batch scheduling into a single pass, replacing the previous KVCacheV2DummyScheduler + BindCapacityScheduler/BindMicroBatchScheduler two-stage path for V2.
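To make the single-pass idea concrete, here is a minimal sketch. It is not the code in scheduler_v2.py: it assumes a toy model in which KV capacity is counted in fixed-size blocks and each request either prefills its prompt in chunks or generates one token per step. All names (`Request`, `schedule_step`, `tokens_per_block`, `chunk_unit`) are hypothetical, and eviction, block reuse, PEFT, and draft handling are left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_len: int         # total context (prompt) tokens
    cached_tokens: int = 0  # tokens whose KV entries are already allocated
    blocks: int = 0         # KV blocks currently held

    @property
    def in_generation(self) -> bool:
        return self.cached_tokens >= self.prompt_len


def blocks_needed(tokens: int, tokens_per_block: int) -> int:
    return -(-tokens // tokens_per_block)  # ceiling division


def schedule_step(requests, free_blocks, token_budget,
                  tokens_per_block=4, chunk_unit=4):
    """One pass that checks KV capacity and spends token budget together.

    Returns (scheduled, free_blocks), where scheduled is a list of
    (rid, num_tokens) pairs. Requests are mutated in place.
    """
    scheduled = []
    for req in requests:  # FCFS: earlier requests are served first
        if token_budget <= 0:
            break
        if req.in_generation:
            want = 1  # one new token per generation step
        else:
            remaining = req.prompt_len - req.cached_tokens
            want = min(remaining, token_budget)
            if want < remaining:
                # Partial chunk: round down to a multiple of chunk_unit.
                want -= want % chunk_unit
            if want == 0:
                continue
        new_total = req.cached_tokens + want
        extra = blocks_needed(new_total, tokens_per_block) - req.blocks
        if extra > free_blocks:
            continue  # not enough KV capacity; a real scheduler might evict here
        free_blocks -= extra
        req.blocks += extra
        req.cached_tokens = new_total
        token_budget -= want
        scheduled.append((req.rid, want))
    return scheduled, free_blocks
```

In this toy version a request is scheduled only if both its block allocation and its token assignment succeed in the same iteration, so there is no second stage that could disagree with the first. Calling `schedule_step` repeatedly on the same request list advances prefill chunk by chunk and then switches each request to one-token generation steps.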

Key changes:

  • New KVCacheV2Scheduler (scheduler_v2.py): merges KV cache allocation (via resize()) and token budget assignment into one unified scheduling loop. Supports:

    • MAX_UTILIZATION eviction policy (suspend-to-host) with tail-eviction strategy
    • Self-eviction for generation requests that cannot allocate
    • Block reuse integration for context requests
    • FCFS chunked prefill with configurable chunk unit size
    • PEFT (LoRA) page accounting
    • Draft model (MTP) KV cache coordination
    • LoRA-aware request sorting
  • KVCacheManagerV2 scheduling API (resource_manager.py): exposes fine-grained allocation methods called by the scheduler:

    • prepare_context() — create _KVCache, handle block reuse lookup, resume from suspended state
    • resize_context() — resize KV cache to cover context tokens
    • try_allocate_generation() — allocate one additional KV slot for generation (with resume)
    • suspend_request() — suspend a request's KV cache to host tier
    • is_request_active() — check if a request has a live, non-suspended KV cache
  • Draft KV cache manager refactor (resource_manager.py): extract _prepare_draft_resources() that mirrors the main manager's allocations for MTP draft layers, since the V2 scheduler only manages the primary KV cache

  • PyExecutor integration (py_executor.py): when V2 scheduler is active, skip executor-level _terminate_requests and _pause_requests for paused requests (the scheduler manages suspend/resume directly)

  • Bug fixes:

    • Fix BAD_PAGE_INDEX (-1) causing CUDA_ERROR_ILLEGAL_ADDRESS in Flash MLA kernel — replace with 0 for unused page slots
    • Cap get_num_available_tokens by GPU-only capacity when max_tokens is explicitly set
    • Add assertions blocking Star attention and multi-beam with V2
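The BAD_PAGE_INDEX fix in the list above can be illustrated with a small sketch. It assumes the page table is a list of per-request rows padded with -1 for unused slots, and that a kernel may compute an address from every slot in a row, including the padded ones. The helper name `sanitize_page_table` is hypothetical and plain Python lists stand in for device tensors.

```python
BAD_PAGE_INDEX = -1  # sentinel for "no page assigned", as named in the PR


def sanitize_page_table(page_table):
    """Return a copy of page_table with sentinel entries replaced by 0.

    Page 0 is a valid index into the page pool, so a kernel that touches
    a padded slot reads from a legal address instead of an out-of-range one.
    This sketch assumes the values read from unused slots are ignored
    downstream (for example, masked out by the sequence length).
    """
    return [
        [0 if index == BAD_PAGE_INDEX else index for index in row]
        for row in page_table
    ]


# Two requests: the first uses two pages, the second uses one.
table = [[3, 7, BAD_PAGE_INDEX, BAD_PAGE_INDEX],
         [5, BAD_PAGE_INDEX, BAD_PAGE_INDEX, BAD_PAGE_INDEX]]
clean = sanitize_page_table(table)
```

The original table is left untouched, so code that relies on the sentinel to count used pages can keep doing so while the kernel receives the sanitized copy.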

Test plan

  • New unit tests: tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py covering scheduling logic, eviction, chunked prefill, PEFT, draft tokens, and edge cases
  • New integration tests: tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py (+542 lines) covering LLaMA accuracy with V2 KV cache + chunked context + MTP + block reuse
  • Existing accuracy ITs parametrized with v2_kv_cache=True/False variants
  • CI test-db entries added for B200, DGX-B200, and H100
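As a rough illustration of the v2_kv_cache=True/False variants, the sketch below builds test ids of the form `name=value`. The helper `ids_for` is hypothetical and is not the repository's own parametrization helper; it only shows the id format, which could then be passed to pytest's standard `ids=` argument.

```python
def ids_for(argname, values):
    """Build readable test ids such as 'v2_kv_cache=True'."""
    return [f"{argname}={value}" for value in values]


variants = [False, True]
test_ids = ids_for("v2_kv_cache", variants)

# With pytest installed, the ids could be attached like this:
#   @pytest.mark.parametrize("v2_kv_cache", variants, ids=test_ids)
#   def test_accuracy(v2_kv_cache): ...
```

Ids in this form make it clear in test reports and test-db entries which KV cache manager a given run exercised.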

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced KV cache scheduling strategy with improved memory management and resource efficiency
    • Added support for draft token generation during inference
    • Implemented GPU memory-aware caching for better resource utilization
    • Expanded support for adapter-based inference workflows
  • Tests

    • Added comprehensive scheduling validation test suite
    • Extended test coverage with new caching configuration options

lancelly added 16 commits March 5, 2026 22:19
Signed-off-by: Lanyu Liao <[email protected]>
- Fix NameError in _create_one_model_draft_kv_cache_manager: use
  self._kv_cache_config instead of undefined draft_kv_cache_config
- Remove test_reject_no_evict_policy: code now logs a warning instead
  of asserting on non-MAX_UTILIZATION policy

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38745 [ run ] triggered by Bot. Commit: 521d419 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38745 [ run ] completed with state SUCCESS. Commit: 521d419
/LLM/main/L0_MergeRequest_PR pipeline #30064 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38801 [ run ] triggered by Bot. Commit: 521d419 Link to invocation

- Add kv_cache/__init__.py for relative imports in IT
- Remove duplicate ids kwarg in parametrize_with_ids for v2_kv_cache

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38816 [ run ] triggered by Bot. Commit: a1a7531 Link to invocation

Rename v1_kv_cache -> v2_kv_cache=False and v2_kv_cache -> v2_kv_cache=True
to match the auto-generated ids from parametrize_with_ids.

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38833 [ run ] triggered by Bot. Commit: de3e8ea Link to invocation

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@eopXD
Collaborator

eopXD commented Mar 19, 2026

Thank you for patiently addressing the comments.

@tensorrt-cicd
Collaborator

PR_Github #39547 [ run ] triggered by Bot. Commit: 0d2ef32 Link to invocation

@yizhang-nv
Member

This PR includes some workarounds (WAR) for MTP. They should be removed once the underlying issue is properly fixed.

@lancelly
Collaborator Author

> Thank you for patiently addressing the comments.

Thanks for the thorough review! The feedback really helped improve the design.

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39571 [ run ] triggered by Bot. Commit: 23a073a Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39586 [ run ] triggered by Bot. Commit: 208a207 Link to invocation

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39607 [ run ] triggered by Bot. Commit: 6ab16fe Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39607 [ run ] completed with state SUCCESS. Commit: 6ab16fe
/LLM/main/L0_MergeRequest_PR pipeline #30815 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39669 [ run ] triggered by Bot. Commit: 7de5c05 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39669 [ run ] completed with state SUCCESS. Commit: 7de5c05
/LLM/main/L0_MergeRequest_PR pipeline #30872 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39726 [ run ] triggered by Bot. Commit: 7de5c05 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39726 [ run ] completed with state SUCCESS. Commit: 7de5c05
/LLM/main/L0_MergeRequest_PR pipeline #30922 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lancelly lancelly merged commit 68001ce into NVIDIA:main Mar 20, 2026
5 checks passed
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

7 participants