[None][fix] Release deferred ctx KV pages in V2 delay batching#13805
Conversation
|
/bot run --disable-fail-fast |
|
PR_Github #47135 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughPyExecutor now uses ChangesKV-Cache Context Revert for V2 Manager
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
2351-2381:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClear stale
py_ctx_pre_resize_capbefore attempting resize.If
kv_cache.resize(...)fails (Line 2373),req.py_ctx_pre_resize_capis left untouched. A prior iteration’s snapshot can then be incorrectly consumed byrevert_allocate_context, causing an outdated shrink target.💡 Suggested fix
def resize_context(self, req: LlmRequest, num_tokens: int) -> bool: @@ target = req.context_current_position + num_tokens + self.num_extra_kv_tokens capacity = max(kv_cache.capacity, target) pre_cap = kv_cache.capacity + # Clear stale snapshot from prior iterations before resize attempt. + req.py_ctx_pre_resize_cap = None if not kv_cache.resize(capacity): if req.is_first_context_chunk: kv_cache.suspend() return False - # None means "no growth this iter, nothing to revert"; this also - # invalidates a stale snapshot from a prior iter on the same req. - req.py_ctx_pre_resize_cap = pre_cap if capacity > pre_cap else None + if capacity > pre_cap: + req.py_ctx_pre_resize_cap = pre_cap return True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2351 - 2381, In resize_context, clear any stale snapshot before attempting the resize by setting req.py_ctx_pre_resize_cap = None prior to calling kv_cache.resize(...); then compute pre_cap and on successful growth (capacity > pre_cap) set req.py_ctx_pre_resize_cap = pre_cap, otherwise leave it None; keep the existing kv_cache.suspend() behavior when kv_cache.resize(...) fails for req.is_first_context_chunk and return False as before so revert_allocate_context cannot consume an outdated snapshot.tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1566-1566:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlso revert context growth when
can_queueskips the batch.
_schedule()now unwinds ctx growth for requests dropped by delay batching, but the later skip paths still only call_revert_gen_alloc(). On KV-manager V2,resize_contexthas already grownscheduled_batch.context_requests, so an attention-DP skip / post-connector recheck / PP queue skip can still leave those ctx pages pinned until a future iteration. That recreates the same pool-pressure leak for skipped batches.Suggested direction
- def _revert_gen_alloc(self, scheduled_batch): - """Revert KV cache capacity growth when the batch is skipped. + def _revert_skipped_batch_alloc(self, scheduled_batch): + """Revert KV cache capacity growth when a scheduled batch is skipped. - With attention DP, can_queue=False means another rank has an empty - batch so no forward pass will run. The V2 scheduler already grew - each generation request's KV cache capacity during scheduling; - revert that growth so it does not accumulate across skipped - iterations and overflow the host page-index buffer. + With KV cache manager V2, scheduling may already have grown both + context and generation capacity before we discover that the batch + cannot run on this iteration. """ if self._is_kv_manager_v2: + for req in scheduled_batch.context_requests: + self.kv_cache_manager.revert_allocate_context(req) for req in scheduled_batch.generation_requests: self.kv_cache_manager.revert_allocate_generation(req)Then switch the
not can_queuebranches in PP / overlap / non-overlap to call the new helper.Also applies to: 1951-1983, 2257-2258, 2513-2514
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 1566, When a batch is skipped by a can_queue check, also undo any context growth (resize_context) instead of only calling _revert_gen_alloc; replace the direct calls to self._revert_gen_alloc(scheduled_batch) in _schedule and in the PP / overlap / non-overlap "not can_queue" branches with the new helper that unrolls context growth (the helper introduced alongside _revert_gen_alloc that also reverts scheduled_batch.context_requests), so skipped batches do not leave ctx pages pinned.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Line 1566: When a batch is skipped by a can_queue check, also undo any context
growth (resize_context) instead of only calling _revert_gen_alloc; replace the
direct calls to self._revert_gen_alloc(scheduled_batch) in _schedule and in the
PP / overlap / non-overlap "not can_queue" branches with the new helper that
unrolls context growth (the helper introduced alongside _revert_gen_alloc that
also reverts scheduled_batch.context_requests), so skipped batches do not leave
ctx pages pinned.
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2351-2381: In resize_context, clear any stale snapshot before
attempting the resize by setting req.py_ctx_pre_resize_cap = None prior to
calling kv_cache.resize(...); then compute pre_cap and on successful growth
(capacity > pre_cap) set req.py_ctx_pre_resize_cap = pre_cap, otherwise leave it
None; keep the existing kv_cache.suspend() behavior when kv_cache.resize(...)
fails for req.is_first_context_chunk and return False as before so
revert_allocate_context cannot consume an outdated snapshot.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 216f91f5-f041-4231-9b0b-314f40cc5bd7
📒 Files selected for processing (5)
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/accuracy/test_llm_api_pytorch.pytests/integration/test_lists/qa/llm_function_core.txttests/integration/test_lists/qa/llm_function_rtx6k.txt
|
PR_Github #47135 [ run ] completed with state
|
a750f62 to
4cce44f
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47363 [ run ] triggered by Bot. Commit: |
4cce44f to
150ddcb
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47364 [ run ] triggered by Bot. Commit: |
When delay batching (_balance_adp_requests / _waiting_requests) defers context requests in the V2 scheduler path, those requests have already had their KV cache capacity grown by resize_context during scheduling. Without a revert, those freshly allocated pages sit idle for the entire wait window, blocking the rest of the pool — a real cost for long-context workloads where one deferred ctx can hold GBs of KV. Adds revert_allocate_context, mirroring the existing revert_allocate_generation: resize_context snapshots the pre-resize cap on req.py_ctx_pre_resize_cap (None when no growth), and the executor calls revert_allocate_context for any ctx dropped by delay batching at the end of _schedule(). The shrink path on _KVCache.resize is the same one already exercised by spec-decoding rewind, so behavior and perf are known-good. For non-first-chunk defers (pre_cap > 0), the revert additionally suspends the cache so V2's pool-pressure machinery can evict the prior chunks' pages if needed; the next iter's _resume_and_restore re-enters through the _never_resumed=False short path, with batched_lock_to_gpu re-locking _active_pages() only if eviction actually happened. First- chunk defers (pre_cap == 0) skip the suspend since resize already emptied _blocks. Also rename _scheduler_manages_kv_suspend to _is_kv_manager_v2 since the flag is exactly an isinstance(KVCacheManagerV2) check and is now used by both the gen and ctx revert paths, beyond the original suspend path. Tests: - Parametrize test_nvfp4_batch_waiting with v2_kv_cache=[False, True] so the V2 path runs in CI; update qa/llm_function_core.txt and qa/llm_function_rtx6k.txt with the new -v2_kv_cache= test IDs. - Add the V2 batch_waiting case to the B200 pre-merge stage. Signed-off-by: Lanyu Liao <[email protected]>
150ddcb to
1411498
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47380 [ run ] triggered by Bot. Commit: |
|
PR_Github #47380 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47643 [ run ] triggered by Bot. Commit: |
|
PR_Github #47643 [ run ] completed with state |
…A#13805) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]>
When delay batching (_balance_adp_requests / _waiting_requests) defers context requests in the V2 scheduler path, those requests have already had their KV cache capacity grown by resize_context during scheduling. Without a revert, those freshly allocated pages sit idle for the entire wait window, blocking the rest of the pool — a real cost for long-context workloads where one deferred ctx can hold GBs of KV.
Adds revert_allocate_context, mirroring the existing revert_allocate_generation: resize_context snapshots the pre-resize cap on req.py_ctx_pre_resize_cap (None when no growth), and the executor calls revert_allocate_context for any ctx dropped by delay batching at the end of _schedule(). The shrink path on _KVCache.resize is the same one already exercised by spec-decoding rewind, so behavior and perf are known-good.
Also rename _scheduler_manages_kv_suspend to _is_kv_manager_v2 since the flag is exactly an isinstance(KVCacheManagerV2) check and is now used by both the gen and ctx revert paths, beyond the original suspend path.
The added QA tests passed locally on GB200.
Alternative considered: gate delay batching before resume
A natural variant came up: do _create_kv_cache (cheap, no GPU work) for all candidates → run delay batching on the candidate set → resume + resize only the survivors. This would skip the resume cost on deferred reqs entirely. We discussed and rejected it, for two coupled reasons.
Resume failure is part of admission. kv_cache.resume() can fail on the max_util_for_resume utilization gate or on batched_lock_to_gpu OOM. V2's per-req scheduling loop uses these per-req True/False returns to drive its token / block budget across reqs — without per-req resume feedback,
the budget loop cannot make consistent decisions for the next req.
Delay batching's semantics couple compute and KV admission. Today, when delay batching "passes", the surviving set has already succeeded resume + resize, so KV admission is guaranteed. A pre-resume variant weakens this guarantee: the token-budget gate may pass 4 candidates while only 2
actually fit on KV (resume failures filter the rest). The iter still ends up running 2 reqs and the delay-batching decision gets effectively re-arbitrated by resume failures — defeating the optimization while losing the "decision == admission" property. Note that delay batching today has
two tunables (waiting-iter count, current-batch token count), so it's already implicitly waiting on both enough tokens and enough KV slots; splitting these out into separate pre/post-resume phases would be a semantic change, not just a reordering.
So the current shape — prepare_context (resume) → resize_context (grow) → delay batching → revert resize on dropped, with conditional suspend() when pre_cap > 0 — is the minimum-change form that preserves both invariants. The wasted resume work on deferred reqs is bounded: near-zero for
first-chunk fresh caches (empty _active_pages()), real but unavoidable for reuse / SSM / mid-chunk cases under the current admission coupling.