[None][feat] Add per-rank iteration stats to /metrics endpoint#13221
Conversation
9c19fcb to
6465a96
Compare
📝 WalkthroughWalkthroughEnhanced iteration performance statistics collection for distributed tensor-parallel deployments by adding per-TP-rank stat serialization in PyExecutor with distributed gather, and updated BaseWorker's serializer to handle the new per-rank format while maintaining backward compatibility. Changes
Sequence DiagramsequenceDiagram
participant PyExecutor
participant TP_Rank_0
participant TP_Rank_N
participant Allgather
participant BaseWorker
Note over PyExecutor,BaseWorker: Per-TP-Rank Stats Collection Flow (when tp_size > 1)
rect rgba(100, 150, 200, 0.5)
PyExecutor->>PyExecutor: Serialize IterationStats + req_stats to dict
PyExecutor->>PyExecutor: Augment with kvCacheIterationStats
PyExecutor->>PyExecutor: Tag payload with rank id
end
rect rgba(150, 100, 200, 0.5)
PyExecutor->>Allgather: tp_allgather(per_rank_dict)
Allgather->>TP_Rank_0: Gather results from all ranks
Allgather->>TP_Rank_N: Gather results from all ranks
end
rect rgba(200, 150, 100, 0.5)
TP_Rank_0->>TP_Rank_0: Store gathered per-rank results<br/>with rolling-cap truncation (scaled by tp_size)
TP_Rank_0->>BaseWorker: ("per_rank_dict", gathered_data)
TP_Rank_N->>TP_Rank_N: Drop gathered result, return early
end
rect rgba(150, 200, 100, 0.5)
BaseWorker->>BaseWorker: Detect "per_rank_dict" format
BaseWorker->>BaseWorker: Direct json.dumps(per_rank_dict)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1237-1244: The current loop evicts entries inside for d in
gathered which can leave a partial per-rank iteration in self.stats; fix it by
performing trimming atomically before appending the new TP batch: inside the
with self.stats_lock block compute cap = self.max_stats_len * tp_size, then
while len(self.stats) + len(gathered) > cap pop(0) to remove enough oldest
entries, and only after that append each ("per_rank_dict", d) for d in gathered
so a complete per-rank set is added and /metrics cannot return partial
iterations; use the existing names (self.stats_lock, self.max_stats_len,
tp_size, gathered, self.stats) to locate and update the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 315d2a8e-1d80-4c7c-a79f-b6a99b237c9f
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/executor/base_worker.py
6465a96 to
48d3c65
Compare
|
/bot run |
|
PR_Github #44458 [ run ] triggered by Bot. Commit: |
|
PR_Github #44458 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #44674 [ run ] triggered by Bot. Commit: |
8c638de to
47be81a
Compare
|
PR_Github #44674 [ run ] completed with state
|
eopXD
left a comment
There was a problem hiding this comment.
Overall looks good.
Some questions:
- We we need to guard the read of iteration stats here safely with a lock?
- Is there a way avoid enumerating all fields under
kvCacheIterationStatshere?
eopXD
left a comment
There was a problem hiding this comment.
No other blocker to me otherwise. The comments are nit. Please evaluate and see if you can address them.
|
Hi eop. Thanks for the review!
For the second question, we didn't come up with a lightweight method to de-duplicate the enum. It either need another enum in C++ binding, or need to use methods like X-macro. |
|
/bot run --disable-fail-fast |
|
PR_Github #45607 [ run ] triggered by Bot. Commit: |
|
PR_Github #45607 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #45698 [ run ] triggered by Bot. Commit: |
|
PR_Github #45698 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
1 similar comment
|
/bot run --disable-fail-fast |
|
PR_Github #45914 [ run ] triggered by Bot. Commit: |
|
PR_Github #45915 [ run ] triggered by Bot. Commit: |
|
PR_Github #45914 [ run ] completed with state |
|
PR_Github #45915 [ run ] completed with state |
Head branch was pushed to by a user without write access
0ca49f2 to
65a665f
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46393 [ run ] triggered by Bot. Commit: |
|
PR_Github #46393 [ run ] completed with state |
65a665f to
62fcfd9
Compare
|
/bot run --disable-fail-fast |
1 similar comment
|
/bot run --disable-fail-fast |
Previously, /metrics and get_stats_async only returned rank-0's iteration
stats. Under attention-DP or any multi-rank PyTorch-executor deployment,
each rank has its own KV cache / scheduler / batch state and their
utilization can diverge, but only rank 0 was observable.
This change makes PyExecutor._append_iter_stats collectively gather each
rank's IterationStats (+ RequestStats, KV cache iter stats) via a
tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged
with "rank": N and stores them alongside its local stats so the same
existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list)
returns one entry per (iteration, rank).
Non-leader ranks drop the gathered result (they don't export). The gather
is opt-in via the TLLM_METRICS_ALL_RANKS=1 environment variable and only
runs when tp_size > 1 and attention-DP is enabled, so default behavior is
byte-identical to upstream.
base_worker._stats_serializer gains a fast-path: if the buffer entry is
the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of
calling to_json_str() on the already-serialized dict.
Signed-off-by: Shicheng Li <[email protected]>
62fcfd9 to
63b17b7
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47194 [ run ] triggered by Bot. Commit: |
|
PR_Github #47194 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47298 [ run ] triggered by Bot. Commit: |
|
PR_Github #47298 [ run ] completed with state |
…A#13221) Signed-off-by: Shicheng Li <[email protected]>
Previously, /metrics and get_stats_async only returned rank-0's iteration stats. Under attention-DP or any multi-rank PyTorch-executor deployment, each rank has its own KV cache / scheduler / batch state and their utilization can diverge, but only rank 0 was observable.
This change makes PyExecutor._append_iter_stats collectively gather each rank's IterationStats (+ RequestStats, KV cache iter stats) via a tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged with "rank": N and stores them alongside its local stats so the same existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list) returns one entry per (iteration, rank).
The gather is opt-in via the
TLLM_METRICS_ALL_RANKS=1environment variable and only runs whentp_size > 1andenable_attention_dp=True. Default behavior (env unset or!=1) is byte-identical to upstream: only rank 0's stats are exported. Non-leader ranks drop the gathered result (they don't export).base_worker._stats_serializer gains a fast-path: if the buffer entry is the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of calling to_json_str() on the already-serialized dict.
Usage
Opt in on every worker process (e.g. via
trtllm-serveenv) in an attention-DP deployment:with
enable_attention_dp: trueandenable_iter_perf_stats: truein the server config. Then each/metricsJSON entry carries a new"rank": Nfield; a single iteration produces N entries (one per rank). Group byiterfor cross-rank snapshots, or filter byrankfor per-rank time series.Under pure TP (no attn-DP) the gather is skipped regardless of the env var — every rank runs the same requests on the same iteration, so per-rank stats would be redundant and the allgather would only add a CPU-GPU sync on the hot path.
Overhead verification
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.