[None][feat] Add per-rank iteration stats to /metrics endpoint #13221

Merged
lfr-0531 merged 1 commit into NVIDIA:main from lishicheng1996-nv:feat/per-rank-iter-metrics
May 8, 2026

Conversation

@lishicheng1996-nv
Collaborator

@lishicheng1996-nv lishicheng1996-nv commented Apr 20, 2026

Previously, /metrics and get_stats_async only returned rank-0's iteration stats. Under attention-DP or any multi-rank PyTorch-executor deployment, each rank has its own KV cache / scheduler / batch state and their utilization can diverge, but only rank 0 was observable.

This change makes PyExecutor._append_iter_stats collectively gather each rank's IterationStats (+ RequestStats, KV cache iter stats) via a tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged with "rank": N and stores them alongside its local stats so the same existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list) returns one entry per (iteration, rank).
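
The flow above can be sketched in plain Python. This is only an illustration: `simulated_allgather` stands in for the real collective, and the helper names and the `numActiveRequests` sample field are assumptions, not the code in this PR. The real implementation may also serialize and gather in a different order.

```python
def build_rank_payload(rank, iter_stats, kv_cache_stats=None):
    """Turn one rank's stats into a plain dict tagged with its rank."""
    payload = dict(iter_stats)  # copy so the caller's dict is untouched
    if kv_cache_stats is not None:
        payload["kvCacheIterationStats"] = dict(kv_cache_stats)
    payload["rank"] = rank
    return payload


def simulated_allgather(payloads):
    """Hypothetical stand-in for a collective: every rank sees the full list."""
    return [list(payloads) for _ in payloads]


def append_iter_stats(rank, gathered, stats_buffer):
    """Only the leader (rank 0) keeps the gathered entries; others drop them."""
    if rank != 0:
        return
    for d in gathered:
        stats_buffer.append(("per_rank_dict", d))


# Four simulated ranks, all on iteration 7, with diverging load.
tp_size = 4
local = [
    build_rank_payload(r, {"iter": 7, "numActiveRequests": 3 + r})
    for r in range(tp_size)
]
views = simulated_allgather(local)
buffers = [[] for _ in range(tp_size)]
for r in range(tp_size):
    append_iter_stats(r, views[r], buffers[r])

print(len(buffers[0]), [len(b) for b in buffers[1:]])
```

In this sketch rank 0 ends up with one tagged entry per rank for the iteration, while the other ranks keep nothing.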

The gather is opt-in via the TLLM_METRICS_ALL_RANKS=1 environment variable and only runs when tp_size > 1 and enable_attention_dp=True. Default behavior (env unset or !=1) is byte-identical to upstream: only rank 0's stats are exported. Non-leader ranks drop the gathered result (they don't export).
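
A minimal sketch of the gating condition described above. The function name and the injectable `env` parameter are assumptions for illustration; only the `TLLM_METRICS_ALL_RANKS` variable and the two conditions come from this description.

```python
import os


def should_gather_all_ranks(tp_size, enable_attention_dp, env=None):
    """Return True only when every opt-in condition holds."""
    env = os.environ if env is None else env
    # Anything other than the exact string "1" counts as not opted in.
    opted_in = env.get("TLLM_METRICS_ALL_RANKS") == "1"
    return bool(opted_in and tp_size > 1 and enable_attention_dp)


print(should_gather_all_ranks(8, True, {"TLLM_METRICS_ALL_RANKS": "1"}))
print(should_gather_all_ranks(8, False, {"TLLM_METRICS_ALL_RANKS": "1"}))
```

Passing the environment as a parameter is only a convenience for testing; reading `os.environ` directly would behave the same in a running worker.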

base_worker._stats_serializer gains a fast-path: if the buffer entry is the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of calling to_json_str() on the already-serialized dict.
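
The fast-path can be pictured roughly as follows. This is a simplified sketch, not the actual `_stats_serializer`: the legacy branch is reduced to a single object with `to_json_str()`, and `LegacyStats` is a hypothetical stand-in for the existing stats objects.

```python
import json


class LegacyStats:
    """Hypothetical stand-in for a stats object exposing to_json_str()."""

    def __init__(self, data):
        self._data = data

    def to_json_str(self):
        return json.dumps(self._data)


def stats_serializer(entry):
    """Serialize one buffer entry to a JSON string."""
    # Fast path: ("per_rank_dict", {...}) already holds a plain dict,
    # so it is dumped directly.
    if isinstance(entry, tuple) and len(entry) == 2 and entry[0] == "per_rank_dict":
        return json.dumps(entry[1])
    # Legacy path (simplified for this sketch).
    return entry.to_json_str()


print(stats_serializer(("per_rank_dict", {"iter": 1, "rank": 2})))
print(stats_serializer(LegacyStats({"iter": 1})))
```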

Usage

Opt in on every worker process (e.g. via trtllm-serve env) in an attention-DP deployment:

TLLM_METRICS_ALL_RANKS=1 trtllm-serve <model> --tp_size 8 --ep_size 8 \
    --extra_llm_api_options extra_llm_api_options.yaml ...

with enable_attention_dp: true and enable_iter_perf_stats: true in the server config. Then each /metrics JSON entry carries a new "rank": N field; a single iteration produces N entries (one per rank). Group by iter for cross-rank snapshots, or filter by rank for per-rank time series.
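
On the consumer side, the grouping and filtering mentioned above might look like this. It assumes the /metrics response has already been parsed into a list of dicts that each carry `iter` and `rank`; the helper names and the `numActiveRequests` sample field are illustrative.

```python
from collections import defaultdict


def group_by_iter(entries):
    """Cross-rank snapshots: {iter: {rank: entry}}."""
    snapshots = defaultdict(dict)
    for e in entries:
        snapshots[e["iter"]][e["rank"]] = e
    return dict(snapshots)


def series_for_rank(entries, rank, field):
    """Per-rank time series of one field, ordered by iteration."""
    rows = sorted(
        (e for e in entries if e.get("rank") == rank),
        key=lambda e: e["iter"],
    )
    return [(e["iter"], e[field]) for e in rows]


# Hand-written sample data, not real server output.
entries = [
    {"iter": 1, "rank": 0, "numActiveRequests": 4},
    {"iter": 1, "rank": 1, "numActiveRequests": 9},
    {"iter": 2, "rank": 0, "numActiveRequests": 5},
    {"iter": 2, "rank": 1, "numActiveRequests": 2},
]

snapshots = group_by_iter(entries)
print(sorted(snapshots[1].keys()))
print(series_for_rank(entries, 1, "numActiveRequests"))
```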

Under pure TP (no attn-DP) the gather is skipped regardless of the env var — every rank runs the same requests on the same iteration, so per-rank stats would be redundant and the allgather would only add a CPU-GPU sync on the hot path.

Overhead verification

[Image: overhead verification results]

Summary by CodeRabbit

Release Notes

  • Performance Monitoring
    • Enhanced per-rank iteration performance statistics collection for distributed tensor parallel setups when multi-GPU deployment is enabled.
    • Updated statistics serialization format to support per-rank metrics collection and storage.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

Enhanced iteration performance statistics collection for distributed tensor-parallel deployments by adding per-TP-rank stat serialization in PyExecutor with distributed gather, and updated BaseWorker's serializer to handle the new per-rank format while maintaining backward compatibility.

Changes

| Cohort / File(s) | Summary |
|---|---|
| PyExecutor Per-TP-Rank Stats Collection<br/>tensorrt_llm/_torch/pyexecutor/py_executor.py | Enhanced _append_iter_stats to support per-TP-rank iteration performance stats when enable_iter_perf_stats is enabled and dist.tp_size > 1. Serializes stats to JSON per rank, conditionally augments with kvCacheIterationStats, tags with rank id, performs tp_allgather across TP ranks, and only rank 0 stores gathered results; non-zero ranks return early. Falls back to legacy path for single-TP scenarios. |
| BaseWorker Stats Serialization Format<br/>tensorrt_llm/executor/base_worker.py | Updated _stats_serializer to recognize and handle "per-rank" serialization format when stats tuple's first element is "per_rank_dict", directly returning serialized dictionary. Preserves existing unpacking and transformation logic for standard iteration/request/kV-cache stats. |

Sequence Diagram

sequenceDiagram
    participant PyExecutor
    participant TP_Rank_0
    participant TP_Rank_N
    participant Allgather
    participant BaseWorker

    Note over PyExecutor,BaseWorker: Per-TP-Rank Stats Collection Flow (when tp_size > 1)
    
    rect rgba(100, 150, 200, 0.5)
    PyExecutor->>PyExecutor: Serialize IterationStats + req_stats to dict
    PyExecutor->>PyExecutor: Augment with kvCacheIterationStats
    PyExecutor->>PyExecutor: Tag payload with rank id
    end
    
    rect rgba(150, 100, 200, 0.5)
    PyExecutor->>Allgather: tp_allgather(per_rank_dict)
    Allgather->>TP_Rank_0: Gather results from all ranks
    Allgather->>TP_Rank_N: Gather results from all ranks
    end
    
    rect rgba(200, 150, 100, 0.5)
    TP_Rank_0->>TP_Rank_0: Store gathered per-rank results<br/>with rolling-cap truncation (scaled by tp_size)
    TP_Rank_0->>BaseWorker: ("per_rank_dict", gathered_data)
    TP_Rank_N->>TP_Rank_N: Drop gathered result, return early
    end
    
    rect rgba(150, 200, 100, 0.5)
    BaseWorker->>BaseWorker: Detect "per_rank_dict" format
    BaseWorker->>BaseWorker: Direct json.dumps(per_rank_dict)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description provides comprehensive context on the issue, solution, usage, and overhead verification, but several template sections are incomplete. | Complete the 'Description', 'Test Coverage', and 'PR Checklist' sections. Fill in the 'Description' section with the 'issue and solution' summary, and provide specific test cases in 'Test Coverage' that safeguard the new per-rank metrics collection logic. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding per-rank iteration stats to the /metrics endpoint, which is the primary feature introduced in this PR. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1237-1244: The current loop evicts entries inside for d in
gathered which can leave a partial per-rank iteration in self.stats; fix it by
performing trimming atomically before appending the new TP batch: inside the
with self.stats_lock block compute cap = self.max_stats_len * tp_size, then
while len(self.stats) + len(gathered) > cap pop(0) to remove enough oldest
entries, and only after that append each ("per_rank_dict", d) for d in gathered
so a complete per-rank set is added and /metrics cannot return partial
iterations; use the existing names (self.stats_lock, self.max_stats_len,
tp_size, gathered, self.stats) to locate and update the code.
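
A minimal sketch of the suggested atomic trimming, assuming a standalone `StatsBuffer` class rather than the real PyExecutor. The attribute names mirror the ones quoted above; slice deletion is used here in place of the repeated pop(0) the comment describes.

```python
import threading


class StatsBuffer:
    """Hypothetical holder for the rolling per-rank stats buffer."""

    def __init__(self, max_stats_len, tp_size):
        self.stats = []
        self.stats_lock = threading.Lock()
        self.max_stats_len = max_stats_len
        self.tp_size = tp_size

    def append_gathered(self, gathered):
        """Evict enough old entries first, then add the whole batch."""
        with self.stats_lock:
            cap = self.max_stats_len * self.tp_size
            overflow = len(self.stats) + len(gathered) - cap
            if overflow > 0:
                # Drop the oldest entries in one step, before appending.
                del self.stats[:overflow]
            self.stats.extend(("per_rank_dict", d) for d in gathered)


buf = StatsBuffer(max_stats_len=2, tp_size=2)
for it in range(3):
    buf.append_gathered([{"iter": it, "rank": r} for r in range(2)])

print([(d["iter"], d["rank"]) for _, d in buf.stats])
```

Because each batch holds tp_size entries and the cap is a multiple of tp_size, eviction in this sketch always removes whole iterations, so a reader holding the lock never sees a partial one.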

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 315d2a8e-1d80-4c7c-a79f-b6a99b237c9f

📥 Commits

Reviewing files that changed from the base of the PR and between 04915ad and 9c19fcb.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/base_worker.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
@lishicheng1996-nv lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 6465a96 to 48d3c65 Compare April 20, 2026 12:48
@lishicheng1996-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44458 [ run ] triggered by Bot. Commit: 48d3c65 Link to invocation

@lishicheng1996-nv lishicheng1996-nv requested a review from eopXD April 20, 2026 13:32
@tensorrt-cicd
Collaborator

PR_Github #44458 [ run ] completed with state SUCCESS. Commit: 48d3c65
/LLM/main/L0_MergeRequest_PR pipeline #34862 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44674 [ run ] triggered by Bot. Commit: 48d3c65 Link to invocation

@lishicheng1996-nv lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch 2 times, most recently from 8c638de to 47be81a Compare April 21, 2026 08:17
@tensorrt-cicd
Collaborator

PR_Github #44674 [ run ] completed with state SUCCESS. Commit: 48d3c65
/LLM/main/L0_MergeRequest_PR pipeline #35043 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@eopXD eopXD left a comment

Overall looks good.

Some questions:

  • Do we need to guard the read of iteration stats here with a lock?
  • Is there a way to avoid enumerating all fields under kvCacheIterationStats here?

Collaborator

@eopXD eopXD left a comment

No other blockers from my side. The comments are nits. Please evaluate them and see if you can address them.

@lishicheng1996-nv
Collaborator Author

lishicheng1996-nv commented Apr 24, 2026

Hi eop. Thanks for the review!
For the first question, Claude's evaluation is that the locking here is reasonable and correct. Quoting it:

What does need a lock — and is already correct

self.stats is the cross-thread hand-off buffer: the executor main-loop thread writes to it, and the RPC handler thread reads-and-clears it via get_stats_latest. That's why stats_lock exists.

Both code paths — the new per-rank gathered path and the legacy rank-0-only path — hold stats_lock while touching self.stats, which is what eop's "safe lock" concern is really about. The reads of stats / req_stats / _latest_kv_iter_stats outside that critical section are local / single-threaded and don't need additional guarding.
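
The hand-off pattern described in the quote can be illustrated with a small sketch. `StatsHandoff` and its method names are hypothetical; the real code uses self.stats, stats_lock, and get_stats_latest on the executor.

```python
import threading


class StatsHandoff:
    """Hypothetical cross-thread buffer: one writer, one read-and-clear reader."""

    def __init__(self):
        self._stats = []
        self._lock = threading.Lock()

    def append(self, entry):
        # Writer side (executor main loop in the description above).
        with self._lock:
            self._stats.append(entry)

    def get_latest(self):
        # Reader side: swap the list out under the lock, so no entry is
        # lost or returned twice.
        with self._lock:
            latest, self._stats = self._stats, []
        return latest


handoff = StatsHandoff()


def writer():
    for i in range(1000):
        handoff.append({"iter": i})


t = threading.Thread(target=writer)
t.start()
drained = []
while t.is_alive():
    drained.extend(handoff.get_latest())
t.join()
drained.extend(handoff.get_latest())  # pick up anything written last

print(len(drained))
```

Only the shared list is touched under the lock; values built locally by the writer before calling `append` need no extra guarding, which matches the reasoning quoted above.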

For the second question, we didn't find a lightweight way to de-duplicate the enumeration. It would need either another enum in the C++ bindings or a technique such as X-macros.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 24, 2026
@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45607 [ run ] triggered by Bot. Commit: 47be81a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45607 [ run ] completed with state SUCCESS. Commit: 47be81a
/LLM/main/L0_MergeRequest_PR pipeline #35822 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45698 [ run ] triggered by Bot. Commit: c5f6239 Link to invocation

@lishicheng1996-nv lishicheng1996-nv requested review from lancelly and removed request for lancelly April 28, 2026 09:03
@lancelly lancelly enabled auto-merge (squash) April 28, 2026 09:09
Collaborator

@lancelly lancelly left a comment

LGTM

@tensorrt-cicd
Collaborator

PR_Github #45698 [ run ] completed with state ABORTED. Commit: c5f6239
/LLM/main/L0_MergeRequest_PR pipeline #35903 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45914 [ run ] triggered by Bot. Commit: 0ca49f2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45915 [ run ] triggered by Bot. Commit: 0ca49f2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45914 [ run ] completed with state ABORTED. Commit: 0ca49f2

Link to invocation

@venkywonka venkywonka removed the Community want to contribute PRs initiated from Community label Apr 28, 2026
@tensorrt-cicd
Collaborator

PR_Github #45915 [ run ] completed with state ABORTED. Commit: 0ca49f2

Link to invocation

auto-merge was automatically disabled April 30, 2026 14:49

Head branch was pushed to by a user without write access

@lishicheng1996-nv lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 0ca49f2 to 65a665f Compare April 30, 2026 14:49
@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46393 [ run ] triggered by Bot. Commit: 65a665f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46393 [ run ] completed with state ABORTED. Commit: 65a665f

Link to invocation

@lishicheng1996-nv lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 65a665f to 62fcfd9 Compare May 6, 2026 02:23
@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

Previously, /metrics and get_stats_async only returned rank-0's iteration
stats. Under attention-DP or any multi-rank PyTorch-executor deployment,
each rank has its own KV cache / scheduler / batch state and their
utilization can diverge, but only rank 0 was observable.

This change makes PyExecutor._append_iter_stats collectively gather each
rank's IterationStats (+ RequestStats, KV cache iter stats) via a
tp_allgather. Rank 0 serializes the gathered results to JSON dicts tagged
with "rank": N and stores them alongside its local stats so the same
existing /metrics transport (RPC -> _iter_stats_result queue -> JSON list)
returns one entry per (iteration, rank).

Non-leader ranks drop the gathered result (they don't export). The gather
is opt-in via the TLLM_METRICS_ALL_RANKS=1 environment variable and only
runs when tp_size > 1 and attention-DP is enabled, so default behavior is
byte-identical to upstream.

base_worker._stats_serializer gains a fast-path: if the buffer entry is
the new ("per_rank_dict", {...}) tuple, emit its JSON directly instead of
calling to_json_str() on the already-serialized dict.

Signed-off-by: Shicheng Li <[email protected]>
@lishicheng1996-nv lishicheng1996-nv force-pushed the feat/per-rank-iter-metrics branch from 62fcfd9 to 63b17b7 Compare May 6, 2026 03:30
@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47194 [ run ] triggered by Bot. Commit: 63b17b7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47194 [ run ] completed with state SUCCESS. Commit: 63b17b7
/LLM/main/L0_MergeRequest_PR pipeline #37152 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lishicheng1996-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47298 [ run ] triggered by Bot. Commit: 63b17b7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47298 [ run ] completed with state SUCCESS. Commit: 63b17b7
/LLM/main/L0_MergeRequest_PR pipeline #37239 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 merged commit 1651d1b into NVIDIA:main May 8, 2026
6 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

Labels

None yet

Projects

None yet

7 participants