[None][feat] Add per-rank iteration stats to /metrics endpoint by lishicheng1996-nv · Pull Request #13221 · NVIDIA/TensorRT-LLM

lishicheng1996-nv · 2026-04-20T12:01:19Z

Previously, /metrics and get_stats_async only returned rank-0's iteration stats. Under attention-DP or any multi-rank PyTorch-executor deployment, each rank has its own KV cache / scheduler / batch state and their utilization can diverge, but only rank 0 was observable.

This change makes PyExecutor._append_iter_stats collectively gather each rank's IterationStats (+ RequestStats, KV cache iter stats) via a tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged with "rank": N and stores them alongside its local stats so the same existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list) returns one entry per (iteration, rank).
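To make the data flow concrete, here is a minimal sketch of the tag-gather-store sequence in plain Python. The collective is replaced by a local stand-in (`simulated_tp_allgather`, hypothetical), the helper names are illustrative, and `numActiveRequests` is only an assumed example field; the real `tp_allgather` call and `IterationStats` layout are not shown in this PR description.

```python
def build_payload(iter_stats: dict, rank: int) -> dict:
    """Copy one rank's stats dict and tag it with the producing rank."""
    payload = dict(iter_stats)
    payload["rank"] = rank
    return payload


def simulated_tp_allgather(payloads_by_rank: dict) -> list:
    """Stand-in for the collective: every rank receives all payloads, in rank order."""
    return [payloads_by_rank[r] for r in sorted(payloads_by_rank)]


def export_entries(rank: int, gathered: list) -> list:
    """Only the leader (rank 0) keeps the gathered result; other ranks drop it."""
    if rank != 0:
        return []
    return [("per_rank_dict", d) for d in gathered]


# Two ranks on the same iteration with diverging load (values are made up).
local = {
    0: {"iter": 7, "numActiveRequests": 3},
    1: {"iter": 7, "numActiveRequests": 5},
}
payloads = {r: build_payload(s, r) for r, s in local.items()}
gathered = simulated_tp_allgather(payloads)
leader_entries = export_entries(0, gathered)
follower_entries = export_entries(1, gathered)
```

The leader ends up with one tagged entry per rank for the iteration, which is the shape the existing transport then forwards.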

The gather is opt-in via the TLLM_METRICS_ALL_RANKS=1 environment variable and only runs when tp_size > 1 and enable_attention_dp=True. Default behavior (env unset or !=1) is byte-identical to upstream: only rank 0's stats are exported. Non-leader ranks drop the gathered result (they don't export).
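The gating rule above can be written as a single predicate. This is a sketch of the condition as described, not the code in the PR; the function name and the injectable `env` mapping are assumptions made for testability.

```python
import os
from typing import Mapping, Optional


def should_gather_all_ranks(tp_size: int,
                            enable_attention_dp: bool,
                            env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the per-rank gather should run.

    All three conditions from the description must hold:
    the env var equals "1", tp_size > 1, and attention-DP is enabled.
    """
    env = os.environ if env is None else env
    opted_in = env.get("TLLM_METRICS_ALL_RANKS") == "1"
    return opted_in and tp_size > 1 and enable_attention_dp
```

Any other value of the variable, including unset, falls through to the rank-0-only path.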

base_worker._stats_serializer gains a fast-path: if the buffer entry is the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of calling to_json_str() on the already-serialized dict.
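A rough sketch of that fast path follows. The legacy branch is an assumption: the description only says the old path calls `to_json_str()`, so `FakeIterationStats` and the three-element legacy tuple are hypothetical stand-ins for the real objects.

```python
import json

PER_RANK_TAG = "per_rank_dict"


class FakeIterationStats:
    """Hypothetical stand-in for an object exposing to_json_str()."""

    def __init__(self, iteration: int):
        self.iteration = iteration

    def to_json_str(self) -> str:
        return json.dumps({"iter": self.iteration})


def stats_serializer(entry: tuple) -> str:
    """Serialize one buffer entry to a JSON string."""
    # Fast path: the payload is already a plain dict, so dump it directly.
    if isinstance(entry[0], str) and entry[0] == PER_RANK_TAG:
        return json.dumps(entry[1])
    # Legacy path (shape assumed): unpack and serialize the stats object.
    iteration_stats, _req_stats, _kv_stats = entry
    return iteration_stats.to_json_str()
```

Checking the tag before unpacking keeps the legacy branch untouched for entries produced by the default rank-0-only path.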

Usage

Opt in on every worker process (e.g. via trtllm-serve env) in an attention-DP deployment:

TLLM_METRICS_ALL_RANKS=1 trtllm-serve <model> --tp_size 8 --ep_size 8 \
    --extra_llm_api_options extra_llm_api_options.yaml ...

with enable_attention_dp: true and enable_iter_perf_stats: true in the server config. Then each /metrics JSON entry carries a new "rank": N field; a single iteration produces N entries (one per rank). Group by iter for cross-rank snapshots, or filter by rank for per-rank time series.
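On the consumer side, grouping and filtering might look like the sketch below. It assumes the /metrics response has already been parsed into a list of dicts carrying `iter` and `rank`; the helper names and the `numActiveRequests` field are illustrative assumptions.

```python
from collections import defaultdict


def group_by_iter(entries: list) -> dict:
    """Build {iter: {rank: entry}} for cross-rank snapshots."""
    snapshots = defaultdict(dict)
    for entry in entries:
        snapshots[entry["iter"]][entry["rank"]] = entry
    return dict(snapshots)


def series_for_rank(entries: list, rank: int, field: str) -> list:
    """Return [(iter, value), ...] for one rank, sorted by iteration."""
    points = [(e["iter"], e[field]) for e in entries if e.get("rank") == rank]
    return sorted(points)


# Made-up sample: two iterations, two ranks each.
entries = [
    {"iter": 1, "rank": 0, "numActiveRequests": 2},
    {"iter": 1, "rank": 1, "numActiveRequests": 6},
    {"iter": 2, "rank": 0, "numActiveRequests": 3},
    {"iter": 2, "rank": 1, "numActiveRequests": 5},
]
snapshots = group_by_iter(entries)
rank1_series = series_for_rank(entries, rank=1, field="numActiveRequests")
```

Using `e.get("rank")` in the filter means entries from a deployment without the opt-in (no `rank` field) are simply skipped rather than raising.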

Under pure TP (no attn-DP) the gather is skipped regardless of the env var — every rank runs the same requests on the same iteration, so per-rank stats would be redundant and the allgather would only add a CPU-GPU sync on the hot path.

Overhead verification

Summary by CodeRabbit

Release Notes

Performance Monitoring
- Enhanced per-rank iteration performance statistics collection for distributed tensor parallel setups when multi-GPU deployment is enabled.
- Updated statistics serialization format to support per-rank metrics collection and storage.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-20T12:10:31Z

📝 Walkthrough

Walkthrough

Enhanced iteration performance statistics collection for distributed tensor-parallel deployments by adding per-TP-rank stat serialization in PyExecutor with distributed gather, and updated BaseWorker's serializer to handle the new per-rank format while maintaining backward compatibility.

Changes

| Cohort / File(s) | Summary |
|---|---|
| PyExecutor Per-TP-Rank Stats Collection `tensorrt_llm/_torch/pyexecutor/py_executor.py` | Enhanced `_append_iter_stats` to support per-TP-rank iteration performance stats when `enable_iter_perf_stats` is enabled and `dist.tp_size > 1`. Serializes stats to JSON per rank, conditionally augments with `kvCacheIterationStats`, tags with rank id, performs `tp_allgather` across TP ranks, and only rank 0 stores gathered results; non-zero ranks return early. Falls back to legacy path for single-TP scenarios. |
| BaseWorker Stats Serialization Format `tensorrt_llm/executor/base_worker.py` | Updated `_stats_serializer` to recognize and handle "per-rank" serialization format when stats tuple's first element is `"per_rank_dict"`, directly returning serialized dictionary. Preserves existing unpacking and transformation logic for standard iteration/request/kV-cache stats. |

Sequence Diagram

sequenceDiagram
    participant PyExecutor
    participant TP_Rank_0
    participant TP_Rank_N
    participant Allgather
    participant BaseWorker

    Note over PyExecutor,BaseWorker: Per-TP-Rank Stats Collection Flow (when tp_size > 1)
    
    rect rgba(100, 150, 200, 0.5)
    PyExecutor->>PyExecutor: Serialize IterationStats + req_stats to dict
    PyExecutor->>PyExecutor: Augment with kvCacheIterationStats
    PyExecutor->>PyExecutor: Tag payload with rank id
    end
    
    rect rgba(150, 100, 200, 0.5)
    PyExecutor->>Allgather: tp_allgather(per_rank_dict)
    Allgather->>TP_Rank_0: Gather results from all ranks
    Allgather->>TP_Rank_N: Gather results from all ranks
    end
    
    rect rgba(200, 150, 100, 0.5)
    TP_Rank_0->>TP_Rank_0: Store gathered per-rank results<br/>with rolling-cap truncation (scaled by tp_size)
    TP_Rank_0->>BaseWorker: ("per_rank_dict", gathered_data)
    TP_Rank_N->>TP_Rank_N: Drop gathered result, return early
    end
    
    rect rgba(150, 200, 100, 0.5)
    BaseWorker->>BaseWorker: Detect "per_rank_dict" format
    BaseWorker->>BaseWorker: Direct json.dumps(per_rank_dict)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description provides comprehensive context on the issue, solution, usage, and overhead verification, but several template sections are incomplete. | Complete the 'Description', 'Test Coverage', and 'PR Checklist' sections. Fill in the 'Description' section with the 'issue and solution' summary, and provide specific test cases in 'Test Coverage' that safeguard the new per-rank metrics collection logic. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding per-rank iteration stats to the /metrics endpoint, which is the primary feature introduced in this PR. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1237-1244: The current loop evicts entries inside for d in
gathered which can leave a partial per-rank iteration in self.stats; fix it by
performing trimming atomically before appending the new TP batch: inside the
with self.stats_lock block compute cap = self.max_stats_len * tp_size, then
while len(self.stats) + len(gathered) > cap pop(0) to remove enough oldest
entries, and only after that append each ("per_rank_dict", d) for d in gathered
so a complete per-rank set is added and /metrics cannot return partial
iterations; use the existing names (self.stats_lock, self.max_stats_len,
tp_size, gathered, self.stats) to locate and update the code.
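The trimming described in this comment can be sketched as a standalone function. It reuses the names from the comment (`stats`, `stats_lock`, `max_stats_len`, `tp_size`, `gathered`) but is written as a free function for illustration, not as the method in `py_executor.py`.

```python
import threading


def append_gathered(stats: list, stats_lock: threading.Lock, gathered: list,
                    max_stats_len: int, tp_size: int) -> None:
    """Evict enough old entries first, then append the whole per-rank batch."""
    cap = max_stats_len * tp_size
    with stats_lock:
        # Make room for the full batch before adding any of it.
        while stats and len(stats) + len(gathered) > cap:
            stats.pop(0)
        for d in gathered:
            stats.append(("per_rank_dict", d))


# Made-up run: keep at most 2 iterations across 2 ranks.
lock = threading.Lock()
buffer = []
for iteration in range(3):
    batch = [{"iter": iteration, "rank": r} for r in range(2)]
    append_gathered(buffer, lock, batch, max_stats_len=2, tp_size=2)
kept_iters = [d["iter"] for _tag, d in buffer]
```

Because each batch has exactly `tp_size` entries and the cap is a multiple of `tp_size`, eviction removes whole iterations and the buffer never holds a partial per-rank set.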

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 315d2a8e-1d80-4c7c-a79f-b6a99b237c9f

📥 Commits

Reviewing files that changed from the base of the PR and between 04915ad and 9c19fcb.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/executor/base_worker.py

lishicheng1996-nv · 2026-04-20T13:25:11Z

/bot run

tensorrt-cicd · 2026-04-20T13:30:56Z

PR_Github #44458 [ run ] triggered by Bot. Commit: 48d3c65 Link to invocation

tensorrt-cicd · 2026-04-20T17:48:04Z

PR_Github #44458 [ run ] completed with state SUCCESS. Commit: 48d3c65
/LLM/main/L0_MergeRequest_PR pipeline #34862 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lishicheng1996-nv · 2026-04-21T07:57:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-21T08:03:07Z

PR_Github #44674 [ run ] triggered by Bot. Commit: 48d3c65 Link to invocation

tensorrt-cicd · 2026-04-21T17:07:06Z

PR_Github #44674 [ run ] completed with state SUCCESS. Commit: 48d3c65
/LLM/main/L0_MergeRequest_PR pipeline #35043 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

eopXD

Overall looks good.

Some questions:

Do we need to guard the read of iteration stats here with a lock?
Is there a way to avoid enumerating all fields under kvCacheIterationStats here?

eopXD

No other blocker to me otherwise. The comments are nit. Please evaluate and see if you can address them.

lishicheng1996-nv · 2026-04-24T07:47:55Z

Hi eop. Thanks for the review!
For the first question, Claude evaluated the locking here and found it reasonable and correct. Quoting its answer:

What does need a lock — and is already correct

self.stats is the cross-thread hand-off buffer: the executor main-loop thread writes to it, and the RPC handler thread reads-and-clears it via get_stats_latest. That's why stats_lock exists.

Both code paths — the new per-rank gathered path and the legacy rank-0-only path — hold stats_lock while touching self.stats, which is what eop's "safe lock" concern is really about. The reads of stats / req_stats / _latest_kv_iter_stats outside that critical section are local / single-threaded and don't need additional guarding.
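The hand-off pattern described in this quote can be illustrated with a small buffer class. This is a generic sketch, not the executor's code: `StatsBuffer` and its method names are hypothetical, and only the idea of a lock-protected append plus read-and-clear comes from the discussion.

```python
import threading


class StatsBuffer:
    """Cross-thread hand-off: one thread appends, another reads and clears."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stats = []

    def append(self, entry) -> None:
        # Writer side (e.g. the executor main loop).
        with self._lock:
            self._stats.append(entry)

    def get_latest(self) -> list:
        # Reader side (e.g. an RPC handler): swap the list out under the lock.
        with self._lock:
            latest, self._stats = self._stats, []
        return latest
```

Only the shared list needs the lock; values computed locally by the writer before calling `append` are touched by a single thread and need no extra guarding.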

For the second question, we have not come up with a lightweight way to de-duplicate the enumeration. It would need either another enum in the C++ bindings or a technique such as X-macros.

lishicheng1996-nv · 2026-04-27T01:23:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-27T01:29:45Z

PR_Github #45607 [ run ] triggered by Bot. Commit: 47be81a Link to invocation

tensorrt-cicd · 2026-04-27T08:52:56Z

PR_Github #45607 [ run ] completed with state SUCCESS. Commit: 47be81a
/LLM/main/L0_MergeRequest_PR pipeline #35822 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lishicheng1996-nv · 2026-04-27T09:11:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-27T09:19:12Z

PR_Github #45698 [ run ] triggered by Bot. Commit: c5f6239 Link to invocation

lancelly

LGTM

tensorrt-cicd · 2026-04-28T09:19:02Z

PR_Github #45698 [ run ] completed with state ABORTED. Commit: c5f6239
/LLM/main/L0_MergeRequest_PR pipeline #35903 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lishicheng1996-nv · 2026-04-28T09:35:35Z

/bot run --disable-fail-fast

lishicheng1996-nv · 2026-04-28T09:38:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-28T09:42:37Z

PR_Github #45914 [ run ] triggered by Bot. Commit: 0ca49f2 Link to invocation

tensorrt-cicd · 2026-04-28T09:45:34Z

PR_Github #45915 [ run ] triggered by Bot. Commit: 0ca49f2 Link to invocation

tensorrt-cicd · 2026-04-28T09:45:40Z

PR_Github #45914 [ run ] completed with state ABORTED. Commit: 0ca49f2

Link to invocation

tensorrt-cicd · 2026-04-29T09:46:22Z

PR_Github #45915 [ run ] completed with state ABORTED. Commit: 0ca49f2

Link to invocation

lishicheng1996-nv · 2026-04-30T14:50:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-30T14:57:56Z

PR_Github #46393 [ run ] triggered by Bot. Commit: 65a665f Link to invocation

tensorrt-cicd · 2026-05-01T14:58:45Z

PR_Github #46393 [ run ] completed with state ABORTED. Commit: 65a665f

Link to invocation

lishicheng1996-nv · 2026-05-06T02:27:01Z

/bot run --disable-fail-fast

lishicheng1996-nv · 2026-05-06T02:32:08Z

/bot run --disable-fail-fast

Previously, /metrics and get_stats_async only returned rank-0's iteration stats. Under attention-DP or any multi-rank PyTorch-executor deployment, each rank has its own KV cache / scheduler / batch state and their utilization can diverge, but only rank 0 was observable. This change makes PyExecutor._append_iter_stats collectively gather each rank's IterationStats (+ RequestStats, KV cache iter stats) via a tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged with "rank": N and stores them alongside its local stats so the same existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list) returns one entry per (iteration, rank). Non-leader ranks drop the gathered result (they don't export). The gather is opt-in via the TLLM_METRICS_ALL_RANKS=1 environment variable and only runs when tp_size > 1 and attention-DP is enabled, so default behavior is byte-identical to upstream. base_worker._stats_serializer gains a fast-path: if the buffer entry is the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of calling to_json_str() on the already-serialized dict. Signed-off-by: Shicheng Li <[email protected]>

lishicheng1996-nv · 2026-05-07T11:00:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T11:06:15Z

PR_Github #47194 [ run ] triggered by Bot. Commit: 63b17b7 Link to invocation

tensorrt-cicd · 2026-05-08T03:14:12Z

PR_Github #47194 [ run ] completed with state SUCCESS. Commit: 63b17b7
/LLM/main/L0_MergeRequest_PR pipeline #37152 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lishicheng1996-nv · 2026-05-08T03:29:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-08T03:35:03Z

PR_Github #47298 [ run ] triggered by Bot. Commit: 63b17b7 Link to invocation

tensorrt-cicd · 2026-05-08T09:00:30Z

PR_Github #47298 [ run ] completed with state SUCCESS. Commit: 63b17b7
/LLM/main/L0_MergeRequest_PR pipeline #37239 completed with status: 'SUCCESS'

CI Report

Link to invocation

…A#13221) Signed-off-by: Shicheng Li <[email protected]>

lishicheng1996-nv requested review from a team as code owners April 20, 2026 12:01

lishicheng1996-nv requested review from hchings and joyang-nv April 20, 2026 12:01

github-actions Bot assigned lishicheng1996-nv Apr 20, 2026

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 9c19fcb to 6465a96 Compare April 20, 2026 12:08

coderabbitai Bot reviewed Apr 20, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 6465a96 to 48d3c65 Compare April 20, 2026 12:48

lishicheng1996-nv requested a review from eopXD April 20, 2026 13:32

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch 2 times, most recently from 8c638de to 47be81a Compare April 21, 2026 08:17

eopXD reviewed Apr 24, 2026

eopXD approved these changes Apr 24, 2026

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 24, 2026

lishicheng1996-nv requested review from lancelly and removed request for lancelly April 28, 2026 09:03

lancelly enabled auto-merge (squash) April 28, 2026 09:09

lancelly approved these changes Apr 28, 2026

venkywonka approved these changes Apr 28, 2026

venkywonka removed the Community want to contribute PRs initiated from Community label Apr 28, 2026

auto-merge was automatically disabled April 30, 2026 14:49
Head branch was pushed to by a user without write access

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 0ca49f2 to 65a665f Compare April 30, 2026 14:49

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 65a665f to 62fcfd9 Compare May 6, 2026 02:23

lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 62fcfd9 to 63b17b7 Compare May 6, 2026 03:30

lfr-0531 merged commit 1651d1b into NVIDIA:main May 8, 2026
6 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][feat] Add per-rank iteration stats to /metrics endpoint (NVIDI…

f758806

…A#13221) Signed-off-by: Shicheng Li <[email protected]>
