[None][feat] Add hit-rate gate and fair-share cap to KV-aware ADP router#13198
…ounting to KV-aware ADP router

Three env-gated additions to the KV-cache-aware ADP router that address cold-start load imbalance without touching existing code paths. All default off; behaviour is unchanged unless env vars are set.

1. TLLM_ADP_ROUTER_MATCH_RATE_THRESHOLD (float, default 0.0): Gate cache affinity in scoring when the best available hit rate (max match_len / req_tokens across eligible ranks) is at or below the threshold. With threshold=0.10, first turns of new conversations (where match_len is only the shared thinking-template prefix) fall through to pure load-balanced routing and seed cold ranks with fresh trajectories instead of piling on ranks that happened to cache the template first. Subsequent multi-turn requests still honour cache affinity because their hit rate far exceeds the threshold.

2. TLLM_ADP_ROUTER_RANDOMIZE_TIEBREAK (0/1, default 0): Iterate eligible ranks in a per-decision random order so score/active_tokens ties are resolved by uniform random pick instead of the default "lowest-index wins". Seeded deterministically with req_id so every TP rank produces the same shuffle and routing decisions stay consistent across ranks. Primarily helps gate_off events during cold start where multiple ranks have zero load.

3. TLLM_ADP_ROUTER_INCLUDE_TRANSFER_LOAD (0/1, default 0): Include KV-transfer-in-progress requests (held by AsyncTransferManager after prefill completes) in per-rank load accounting. Without this, disaggregated CTX workers appear idle between prefills and the router concentrates subsequent requests on already-busy ranks. PyExecutor now injects its async_transfer_manager into the router after both are constructed.

Also extends the diagnostic log already added in PR NVIDIA#13198:
- cache_affinity_active, max_match_for_req in adp_router_v2_decision
- match_rate_threshold, randomize_tiebreak in adp_router_v2_batch
- adp_router_v2_rank_state (per-rank state snapshot)
- adp_router_v2_pyexec_snapshot (pyexec-global state snapshot)

On DSV3.2 1P1D at concurrency 24 with threshold=0.10 + randomize=1, rank coverage improves from baseline min/mean = 0.18 % (rank 6/7 serving 1 request each out of ~570 average) to 52 % (rank 6/7 serving 312 and 779 respectively), with TTFT p90 within run-to-run noise and cache hit rate unchanged at 96 %.

Signed-off-by: Lance Liao <[email protected]>
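The hit-rate gate and the seeded tiebreak from the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions: `choose_rank`, its arguments, and the scoring formula (tokens left to compute plus current load) are hypothetical and are not taken from `adp_router.py`; only the gate condition (best hit rate at or below the threshold disables affinity) and the `req_id`-seeded shuffle follow the text.

```python
import random


def choose_rank(req_id, req_tokens, match_len_by_rank, active_tokens_by_rank,
                match_rate_threshold=0.10, randomize_tiebreak=True):
    """Hypothetical single-request routing decision (not the real router)."""
    ranks = sorted(match_len_by_rank)

    # Hit-rate gate: affinity counts only if the best hit rate across
    # eligible ranks is strictly above the threshold.
    best_rate = max(match_len_by_rank.values()) / max(req_tokens, 1)
    cache_affinity_active = best_rate > match_rate_threshold

    # Seeding with req_id means every TP rank computes the same shuffle,
    # so all ranks reach the same routing decision.
    if randomize_tiebreak:
        random.Random(req_id).shuffle(ranks)

    def score(rank):
        # Assumed scoring: tokens still to compute plus the rank's load.
        match_len = match_len_by_rank[rank] if cache_affinity_active else 0
        return (req_tokens - match_len) + active_tokens_by_rank[rank]

    # min() keeps the first minimum in iteration order, so ties follow the
    # (possibly shuffled) rank order instead of "lowest index wins".
    return min(ranks, key=score), cache_affinity_active
```

With a 1000-token request whose only match is a 32-token template prefix, the gate is off and the pick depends on load alone; a later turn with a 900-token match on one rank still goes to that rank even if it is moderately loaded.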
7731e41 to 640bdae
640bdae to 07df92b
/bot run --disable-fail-fast
PR_Github #44959 [ run ] triggered by Bot. Commit:
📝 Walkthrough
The pull request enhances the ADP routing system with cache-affinity awareness and configurable routing thresholds. Changes include reordering PyExecutor initialization to pass the AsyncTransferManager to the ADPRouter, extending KVCacheAwareADPRouter with match-rate and fair-share controls, updating active-request computation to account for remaining-to-compute tokens and in-flight transfers, and implementing per-request rank shuffling with cache-affinity gating. New routing configuration parameters and corresponding tests are introduced.
Sequence Diagram
sequenceDiagram
participant PyExec as PyExecutor
participant ATM as AsyncTransferManager
participant Router as ADPRouter
participant KVRouter as KVCacheAwareADPRouter
participant KVMgr as KVCacheManager
PyExec->>ATM: Initialize
PyExec->>Router: create(async_transfer_manager=ATM)
Router->>KVRouter: new(match_rate_threshold, fair_share_multiplier, async_transfer_manager)
Note over KVRouter: Routing Request Received
KVRouter->>KVMgr: Query prefix matches & cached tokens
KVRouter->>ATM: Count requests in transfer
KVRouter->>KVRouter: Compute remaining-to-compute tokens
KVRouter->>KVRouter: Apply cache-affinity gate (match_rate_threshold)
KVRouter->>KVRouter: Shuffle eligible ranks per request
KVRouter->>KVRouter: Score ranks (affinity + load balance)
KVRouter->>KVRouter: Apply fair_share_multiplier cap
KVRouter-->>PyExec: Route decision (rank assignment)
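The "Count requests in transfer" step in the diagram can be pictured with a small sketch like the one below. It is a guess at the shape of the accounting, not the code in the PR: `Req`, `RankLoad`, `build_rank_load`, and the `load_tokens` field are hypothetical names, and how the real router attributes tokens to a request that is mid-transfer is not stated in this thread.

```python
from dataclasses import dataclass


@dataclass
class Req:
    # Hypothetical: whatever token count the router attributes to a request.
    load_tokens: int


@dataclass
class RankLoad:
    num_active_requests: int
    num_active_tokens: int


def build_rank_load(active_requests, requests_in_transfer):
    """Fold in-transfer requests into a rank's load (illustrative only)."""
    # Requests whose KV cache is still being sent no longer appear in the
    # active list, so they are added back before counting.
    all_reqs = list(active_requests) + list(requests_in_transfer)
    return RankLoad(
        num_active_requests=len(all_reqs),
        num_active_tokens=sum(r.load_tokens for r in all_reqs),
    )
```

Without the second argument, a rank that has just finished a prefill and is still transferring would look idle to the router.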
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 541-547: The code computes expected_num_active_requests using
int(self.fair_share_multiplier * fair_share) which truncates float multipliers;
replace the truncation with ceiling to implement the intended "round up"
behavior: use math.ceil(self.fair_share_multiplier * fair_share) (add an import
for math if needed) in the calculation for expected_num_active_requests in
adp_router where fair_share and self.fair_share_multiplier are used, and add a
regression test that constructs a scenario with fair_share==1 and
fair_share_multiplier==1.5 (or another non-integer) to assert the cap allows 2
requests rather than 1 and that routing/eviction behavior matches the
multiplier.
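The difference between truncation and rounding up is easy to see in isolation. The sketch below is not the code from `adp_router.py`; `fair_share_cap` and its parameters are hypothetical, and the fair-share formula `ceil((total_active + new) / tp_size)` is borrowed from the PR description.

```python
import math


def fair_share_cap(total_active, num_new, tp_size, multiplier, round_up=True):
    """Per-rank request cap; round_up=False mimics the int() truncation."""
    fair_share = math.ceil((total_active + num_new) / tp_size)
    raw_cap = multiplier * fair_share
    # int() drops the fractional part, so a 1.5x multiplier on a fair share
    # of 1 silently becomes a cap of 1; math.ceil keeps the intended slack.
    return math.ceil(raw_cap) if round_up else int(raw_cap)


print(fair_share_cap(0, 4, 4, 1.5, round_up=False))  # 1 (truncated)
print(fair_share_cap(0, 4, 4, 1.5, round_up=True))   # 2 (rounded up)
```

For integer-valued multipliers such as the default `2.0`, both variants give the same cap, which is presumably why the truncation is easy to miss.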
In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 624-646: The two new routing knobs lack validation: constrain
kv_cache_routing_match_rate_threshold to be between 0.0 and 1.0 (use Field(...,
ge=0.0, le=1.0)) and constrain kv_cache_routing_fair_share_multiplier to be at
least 1.0 (use Field(..., ge=1.0)); update the Field definitions for the
variables named kv_cache_routing_match_rate_threshold and
kv_cache_routing_fair_share_multiplier in llm_args.py to include these bounds so
Pydantic will reject semantically invalid configs.
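A minimal sketch of the suggested bounds, assuming Pydantic is available. `RoutingKnobs` is a hypothetical stand-in for the real config class in `llm_args.py`, and the defaults (`0.1`, `2.0`) are the ones quoted in the PR description; only `Field(..., ge=..., le=...)` is the mechanism the review comment proposes.

```python
from pydantic import BaseModel, Field, ValidationError


class RoutingKnobs(BaseModel):
    """Hypothetical stand-in, not the actual class in llm_args.py."""
    # A hit rate is a fraction of request tokens, so it must lie in [0, 1].
    kv_cache_routing_match_rate_threshold: float = Field(default=0.1, ge=0.0, le=1.0)
    # A multiplier below 1.0 would cap ranks under their fair share.
    kv_cache_routing_fair_share_multiplier: float = Field(default=2.0, ge=1.0)


def is_valid(**kwargs) -> bool:
    """Return True if the given knob values pass validation."""
    try:
        RoutingKnobs(**kwargs)
        return True
    except ValidationError:
        return False
```

With these bounds, a threshold of `1.5` or a multiplier of `0.5` is rejected at construction time instead of producing surprising routing behaviour later.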
📒 Files selected for processing (4)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/_torch/executor/test_adp_router.py
/bot run --disable-fail-fast
PR_Github #45142 [ run ] triggered by Bot. Commit:
PR_Github #45142 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45187 [ run ] triggered by Bot. Commit:
PR_Github #45187 [ run ] completed with state
fa796af to 3660223
/bot run --disable-fail-fast
PR_Github #45388 [ run ] triggered by Bot. Commit:
PR_Github #45388 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45575 [ run ] triggered by Bot. Commit:
3660223 to a73ad8a
Expose match_rate_threshold and fair_share_multiplier as AttentionDpConfig fields. Add a hit-rate gate with random tiebreak, always account KV-transfer-in-progress requests in router load, and enforce a 2x fair-share cap on per-rank token load. Clean up debug logging to a single per-batch line, simplify rank-state handling, and extend unit tests to cover the new config fields and direct routing behavior. Signed-off-by: Lanyu Liao <[email protected]>
a73ad8a to f324316
/bot run --disable-fail-fast
PR_Github #45588 [ run ] triggered by Bot. Commit:
PR_Github #45588 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45606 [ run ] triggered by Bot. Commit:
PR_Github #45606 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #45646 [ run ] triggered by Bot. Commit:
PR_Github #45646 [ run ] completed with state
lishicheng1996-nv
left a comment
LGTM, it works great!
…ter (NVIDIA#13198) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]>
Description
Tunes `KVCacheAwareADPRouter` so cache affinity and load balance work together instead of fighting each other. Three behaviour changes, one config surface change, and a small amount of cleanup.

1. Hit-rate gate: `kv_cache_routing_match_rate_threshold` (default `0.1`)

For each request, `match_len` contributes to scoring only when `max(match_len) / request_tokens` across eligible ranks is strictly above the threshold; otherwise `match_len` is forced to `0` and routing is driven purely by load. This prevents a small universal prefix (e.g. a shared system prompt) from pinning all traffic to the first warm ranks. Set to `0.0` to honour any nonzero match.

2. Fair-share cap: `kv_cache_routing_fair_share_multiplier` (default `2.0`)

Per-rank active-request cap expressed as a multiplier of the ceil fair-share, i.e. `fair_share_multiplier * ceil((total_active + new) / tp_size)`. Once a rank hits the cap within a scheduling batch it is dropped from the eligible set for the rest of that batch. `2.0` leaves enough slack for affinity to win in the common case while preventing runaway concentration; set to `1.0` for strict fair share.

3. Transfer-in-progress load accounting

Requests mid-KV-transfer to GEN are no longer visible in `active_requests`, so the router used to under-count load on the rank that is sending. The router now pulls `requests_in_transfer()` from the PyExecutor's `AsyncTransferManager` and folds those requests into both `num_active_requests` and `num_active_tokens` when building each rank's `RankState`.

Summary by CodeRabbit
Release Notes
New Features
`kv_cache_routing_match_rate_threshold` and `kv_cache_routing_fair_share_multiplier`.
Improvements