[TRTLLM-11287][feat] Implement python based scheduler for KVCacheManagerV2 #11939
Signed-off-by: Lanyu Liao <[email protected]>
- Fix NameError in _create_one_model_draft_kv_cache_manager: use self._kv_cache_config instead of undefined draft_kv_cache_config
- Remove test_reject_no_evict_policy: code now logs a warning instead of asserting on non-MAX_UTILIZATION policy

Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
PR_Github #38745 [ run ] triggered by Bot. Commit:
PR_Github #38745 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #38801 [ run ] triggered by Bot. Commit:
- Add kv_cache/__init__.py for relative imports in IT
- Remove duplicate ids kwarg in parametrize_with_ids for v2_kv_cache

Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
PR_Github #38816 [ run ] triggered by Bot. Commit:
Rename v1_kv_cache -> v2_kv_cache=False and v2_kv_cache -> v2_kv_cache=True to match the auto-generated ids from parametrize_with_ids.

Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
PR_Github #38833 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
Thank you for patiently addressing the comments.
PR_Github #39547 [ run ] triggered by Bot. Commit:
This PR contains some workarounds (WAR) for MTP. They should be removed once the underlying issue is properly fixed.
Thanks for the thorough review! The feedback really helped improve the design.
/bot run --disable-fail-fast
PR_Github #39571 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39586 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39607 [ run ] triggered by Bot. Commit:
PR_Github #39607 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39669 [ run ] triggered by Bot. Commit:
PR_Github #39669 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39726 [ run ] triggered by Bot. Commit:
PR_Github #39726 [ run ] completed with state
…gerV2 (NVIDIA#11939) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]>
Summary
Implement a Python-based interleaved scheduler (`KVCacheV2Scheduler`) for `KVCacheManagerV2` that unifies capacity scheduling and micro-batch scheduling into a single pass, replacing the previous `KVCacheV2DummyScheduler` + `BindCapacityScheduler`/`BindMicroBatchScheduler` two-stage path for V2.

Key changes:
- New `KVCacheV2Scheduler` (`scheduler_v2.py`): merges KV cache allocation (via `resize()`) and token budget assignment into one unified scheduling loop. Supports:
- `KVCacheManagerV2` scheduling API (`resource_manager.py`): exposes fine-grained allocation methods called by the scheduler:
  - `prepare_context()`: create `_KVCache`, handle block reuse lookup, resume from suspended state
  - `resize_context()`: resize KV cache to cover context tokens
  - `try_allocate_generation()`: allocate one additional KV slot for generation (with resume)
  - `suspend_request()`: suspend a request's KV cache to host tier
  - `is_request_active()`: check if a request has a live, non-suspended KV cache
- Draft KV cache manager refactor (`resource_manager.py`): extract `_prepare_draft_resources()` that mirrors the main manager's allocations for MTP draft layers, since the V2 scheduler only manages the primary KV cache
- PyExecutor integration (`py_executor.py`): when V2 scheduler is active, skip executor-level `_terminate_requests` and `_pause_requests` for paused requests (the scheduler manages suspend/resume directly)

Bug fixes:
- `BAD_PAGE_INDEX` (-1) causing `CUDA_ERROR_ILLEGAL_ADDRESS` in Flash MLA kernel: replace with 0 for unused page slots
- `get_num_available_tokens` by GPU-only capacity when `max_tokens` is explicitly set

Test plan
- `tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` covering scheduling logic, eviction, chunked prefill, PEFT, draft tokens, and edge cases
- `tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py` (+542 lines) covering LLaMA accuracy with V2 KV cache + chunked context + MTP + block reuse
- `v2_kv_cache=True/False` variants
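
To illustrate what "unifying capacity scheduling and micro-batch scheduling into a single pass" can look like, here is a minimal toy sketch. It is not the code from `scheduler_v2.py`: `ToyRequest`, `schedule_step`, `blocks_needed`, and the block size of 4 are all hypothetical names and values made up for this example, and suspension is modeled simply as handing a request's blocks back to the free pool. Each request gets its KV blocks and its share of the token budget in the same loop iteration, so there is no separate capacity stage followed by a micro-batch stage.

```python
from dataclasses import dataclass
from typing import List, Tuple

TOKENS_PER_BLOCK = 4  # hypothetical block size, kept small for the example


@dataclass
class ToyRequest:
    """Hypothetical request record; not the TensorRT-LLM request type."""
    req_id: int
    prompt_len: int      # total context (prefill) tokens
    computed: int = 0    # context tokens already processed
    generated: int = 0   # tokens generated so far
    blocks: int = 0      # KV blocks currently held on the GPU

    @property
    def in_generation(self) -> bool:
        return self.computed >= self.prompt_len


def blocks_needed(num_tokens: int) -> int:
    """Ceiling division: blocks required to hold num_tokens."""
    return -(-num_tokens // TOKENS_PER_BLOCK)


def schedule_step(requests: List[ToyRequest], free_blocks: int,
                  token_budget: int) -> Tuple[List[Tuple[int, int]], List[int], int]:
    """Return (scheduled [(req_id, n_tokens)], suspended ids, free blocks left)."""
    scheduled: List[Tuple[int, int]] = []
    suspended: List[int] = []

    # Generation requests first: each needs exactly one more token slot.
    for req in requests:
        if not req.in_generation or token_budget == 0:
            continue
        extra = blocks_needed(req.computed + req.generated + 1) - req.blocks
        if extra > free_blocks:
            # Toy suspension: release the request's blocks back to the pool.
            free_blocks += req.blocks
            req.blocks = 0
            suspended.append(req.req_id)
            continue
        free_blocks -= extra
        req.blocks += extra
        req.generated += 1
        token_budget -= 1
        scheduled.append((req.req_id, 1))

    # Context requests: chunk size is bounded by token budget AND block capacity.
    for req in requests:
        if req.in_generation or token_budget == 0:
            continue
        capacity = (req.blocks + free_blocks) * TOKENS_PER_BLOCK - req.computed
        chunk = min(req.prompt_len - req.computed, token_budget, capacity)
        if chunk <= 0:
            continue
        extra = blocks_needed(req.computed + chunk) - req.blocks
        free_blocks -= extra
        req.blocks += extra
        req.computed += chunk
        token_budget -= chunk
        scheduled.append((req.req_id, chunk))

    return scheduled, suspended, free_blocks


gen = ToyRequest(0, prompt_len=6, computed=6, blocks=2)
ctx = ToyRequest(1, prompt_len=10)
print(schedule_step([gen, ctx], free_blocks=2, token_budget=8))
# -> ([(0, 1), (1, 7)], [], 0)
```

In the usage line, the generation request takes one token without needing a new block, and the context request receives a 7-token chunk (the remaining budget), which consumes both free blocks. A real scheduler would also have to handle concerns the PR mentions, such as block reuse, PEFT, and draft tokens, which this sketch leaves out.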
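
The `BAD_PAGE_INDEX` fix in the summary can be pictured as a small sanitization step on the page table before it reaches the kernel. The sketch below is an assumption-laden illustration, not the PR's code: `sanitize_page_table` is a hypothetical helper, it works on plain Python lists rather than GPU tensors, and it presumes that the consumer ignores the contents of unused slots (for example by masking on sequence length), so pointing them at page 0 is harmless.

```python
from typing import List

BAD_PAGE_INDEX = -1  # sentinel for an unused page slot (value from the PR summary)


def sanitize_page_table(page_table: List[List[int]], fill: int = 0) -> List[List[int]]:
    """Replace the unused-slot sentinel with a valid page index.

    Hypothetical helper: a kernel that dereferences every slot without
    checking for -1 would compute an out-of-range address, so unused
    slots are redirected to an existing page whose data is never used.
    """
    return [[fill if idx == BAD_PAGE_INDEX else idx for idx in row]
            for row in page_table]


table = [[3, 5, BAD_PAGE_INDEX],
         [BAD_PAGE_INDEX, BAD_PAGE_INDEX, BAD_PAGE_INDEX]]
print(sanitize_page_table(table))
# -> [[3, 5, 0], [0, 0, 0]]
```

The function returns a new table and leaves the input untouched, which keeps the sentinel available to any host-side logic that still needs to distinguish used from unused slots.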