[TRTLLM-12520][perf] Reduce host overhead during scheduling and sampling #13843

Merged: tongyuantongyu merged 1 commit into NVIDIA:main from tongyuantongyu:ytong/exec-host-opt on May 20, 2026

Conversation

@tongyuantongyu
Member

@tongyuantongyu tongyuantongyu commented May 7, 2026

Summary by CodeRabbit

Release Notes

  • Improvements
    • Reduced the overhead of reading the current beam width.
    • Enhanced speculative decoding with improved draft token management.
    • Optimized sampling and logprob extraction in the generation pipeline.

Description

Removed some high-overhead code:

  • Reading .sampling_config.beam_width costs two binding property accesses and creates a temporary wrapper object. Cache the value in .py_beam_width instead.
  • Avoid computing values that are never used.
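
To make the first bullet concrete, here is a minimal sketch of the caching pattern. `FakeSamplingConfig` and `FakeRequest` are hypothetical pure-Python stand-ins, not TensorRT-LLM classes; the `binding_calls` counter only models the claim that each `request.sampling_config.beam_width` read crosses the binding layer twice and builds a temporary wrapper object.

```python
class FakeSamplingConfig:
    """Hypothetical stand-in for the temporary wrapper a binding returns."""

    def __init__(self, owner):
        self._owner = owner

    @property
    def beam_width(self):
        self._owner.binding_calls += 1  # models the second binding access
        return self._owner._beam_width


class FakeRequest:
    """Hypothetical stand-in for a request backed by a bound C++ object."""

    def __init__(self, beam_width):
        self._beam_width = beam_width
        self.binding_calls = 0
        # Read through the "binding" once and keep a plain Python attribute.
        self.py_beam_width = self.sampling_config.beam_width

    @property
    def sampling_config(self):
        self.binding_calls += 1  # models the first binding access
        return FakeSamplingConfig(self)  # a fresh wrapper on every read


request = FakeRequest(beam_width=4)
calls_after_init = request.binding_calls

for _ in range(1000):
    _ = request.py_beam_width  # plain attribute read, no binding access
calls_after_cached_reads = request.binding_calls

for _ in range(1000):
    _ = request.sampling_config.beam_width  # two modelled accesses per read
calls_after_uncached_reads = request.binding_calls
```

In this model the cached reads add no binding accesses, while each uncached read adds two. The real cost per access depends on the binding library and is not measured here.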

Test Coverage

Covered by current tests

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR migrates beam-width accesses across five pyexecutor modules from direct sampling_config.beam_width reads to a cached py_beam_width property introduced in LlmRequest. The sampler module undergoes significant refactoring in logits selection and request processing to support this change while improving clarity.

Changes

Unified Beam Width Property Migration

| Layer / File(s) | Summary |
| --- | --- |
| Request Property Definition (tensorrt_llm/_torch/pyexecutor/llm_request.py) | Introduce py_beam_width as an int property cached in LlmRequest.__init__ by casting sampling_config.beam_width. Update create_response streaming logprob condition to use this property. |
| Sampler Base Methods (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update Sampler.beam_width() property to return cached py_beam_width instead of casting sampling_config.beam_width. |
| Finish Reason Handling (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update _handle_finish_reasons() and _handle_first_finish_reasons() to derive beam width from py_beam_width. |
| Logprobs Storage & Extraction (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update _store_logprobs_list_to_request() and handle_logprobs() to reference py_beam_width for topk tensor handling and beam-dependent logic. |
| Beam History & Finalization (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update _prepare_beam_history() and _finalize_beam() to compute beam dimensions from py_beam_width. |
| Beam Search Logic & Completion (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update _check_beam_search_stop_criteria() and update_requests() to use py_beam_width > 1 for beam-search gating and draft-path selection. |
| Logits Selection & Slicing (tensorrt_llm/_torch/pyexecutor/sampler.py) | Refactor _select_generated_logits() to explicitly append context-finished and generation requests in two passes, track context-return-context-logits requirement, and gate logits slicing on this flag. Update _process_logprobs() to split requests by py_beam_width == 1. |
| Post-Processing (tensorrt_llm/_torch/pyexecutor/sampler.py) | Update TRTLLMSampler._post_process_request() to derive beam_width and log_probs_offset from py_beam_width. |
| Downstream Module Integration (tensorrt_llm/_torch/pyexecutor/model_engine.py, py_executor.py, resource_manager.py) | Update generation expansion, disaggregated initialization, draft-token assignment, and context-request scheduling to read py_beam_width from requests. |
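
The "split requests by py_beam_width == 1" step mentioned for _process_logprobs() can be pictured with a small sketch. The actual function body is not shown in this conversation, so `FakeRequest` and `split_by_beam_width` below are hypothetical names used only to illustrate partitioning on the cached attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical request carrying only the fields this sketch needs."""
    request_id: int
    py_beam_width: int


def split_by_beam_width(requests):
    """Partition requests into (no beam search, beam search) lists."""
    single_beam, multi_beam = [], []
    for request in requests:
        # Only the cached Python attribute is read in the hot loop.
        if request.py_beam_width == 1:
            single_beam.append(request)
        else:
            multi_beam.append(request)
    return single_beam, multi_beam


requests = [FakeRequest(0, 1), FakeRequest(1, 4), FakeRequest(2, 1)]
single_beam, multi_beam = split_by_beam_width(requests)
single_ids = [r.request_id for r in single_beam]
multi_ids = [r.request_id for r in multi_beam]
```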

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description explains the rationale (reducing host overhead by caching beam_width) but lacks detail about which specific code paths benefit and why this change is necessary. | Add more context about the performance impact and which specific high-overhead code paths are being optimized. Clarify how caching py_beam_width reduces binding property access overhead. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective: reducing host overhead during scheduling and sampling, which aligns with all file changes that optimize beam_width access patterns. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

691-691: 💤 Low value

Consider int() instead of cast() to guarantee a Python-native int at runtime.

typing.cast is a type-checker annotation that is a no-op at runtime — it does not convert the value. All other py_* cached attributes (e.g., py_min_length, py_prompt_len) use plain assignment without cast. pybind11 ordinarily maps C++ integral types to Python int, so this works in practice, but the inconsistency is worth noting. If the binding ever returns a pybind11 integer wrapper instead of a Python int, downstream code using py_beam_width in arithmetic or isinstance checks could see unexpected behaviour.

Using int(self.sampling_config.beam_width) is a small change that guarantees a true Python int at runtime and is self-documenting.
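
The runtime difference between typing.cast and int() is easy to check with the standard library alone. The string input below is only an illustration of a value that is not already an int; it does not come from the PR.

```python
from typing import cast

raw = "4"

# cast() only informs the type checker; it returns its argument unchanged.
casted = cast(int, raw)
casted_type = type(casted)

# int() performs a real conversion and returns a Python int.
converted = int(raw)
converted_type = type(converted)
```

If the binding already returns a Python int, both forms yield the same value and the difference is only one of defensiveness.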

♻️ Suggested alternative
-        self.py_beam_width = cast(int, self.sampling_config.beam_width)
+        self.py_beam_width: int = int(self.sampling_config.beam_width)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` at line 691, The cached
attribute py_beam_width is using typing.cast which is a no-op at runtime;
replace the cast usage with an actual conversion by assigning py_beam_width =
int(self.sampling_config.beam_width) so it becomes a native Python int at
runtime (mirror how other cached scalars like py_min_length are set) — update
the assignment in the llm_request class where py_beam_width is initialized to
use int(...) instead of cast(int, ...).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d866904-c5e6-4599-954b-9dec27767e40

📥 Commits

Reviewing files that changed from the base of the PR and between cbfb02a and d220f6b.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py

@longlee0622 longlee0622 requested a review from hyukn May 7, 2026 22:47
@tongyuantongyu
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47295 [ run ] triggered by Bot. Commit: d220f6b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47295 [ run ] completed with state SUCCESS. Commit: d220f6b
/LLM/main/L0_MergeRequest_PR pipeline #37236 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tongyuantongyu
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47361 [ run ] triggered by Bot. Commit: d220f6b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47361 [ run ] completed with state SUCCESS. Commit: d220f6b
/LLM/main/L0_MergeRequest_PR pipeline #37295 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

Comment thread tensorrt_llm/_torch/pyexecutor/sampler.py Outdated
@tongyuantongyu
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48084 [ run ] triggered by Bot. Commit: 04e21ea Link to invocation

Collaborator

@SimengLiu-nv SimengLiu-nv left a comment

LGTM.

Collaborator

@eopXD eopXD left a comment

Caching at the Python level will fail silently if setBeamWidth is called after the value has already been cached here. Pushing this caching down to the C++ level would make it foolproof and prevent a silent bug.
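
The staleness concern can be reproduced with a small sketch. All three classes are hypothetical stand-ins: `set_beam_width` plays the role of the `setBeamWidth` call named in the comment, and whether that setter is reachable from Python after request creation is not established in this conversation.

```python
class FakeBoundRequest:
    """Hypothetical stand-in for the C++-backed request."""

    def __init__(self, beam_width):
        self._beam_width = beam_width

    def get_beam_width(self):
        return self._beam_width

    def set_beam_width(self, beam_width):
        self._beam_width = beam_width


class CachedRequest(FakeBoundRequest):
    """Caches the beam width once, as py_beam_width does."""

    def __init__(self, beam_width):
        super().__init__(beam_width)
        self.py_beam_width = self.get_beam_width()


class RefreshingRequest(CachedRequest):
    """One possible Python-side guard: refresh the cache in the setter."""

    def set_beam_width(self, beam_width):
        super().set_beam_width(beam_width)
        self.py_beam_width = beam_width


stale = CachedRequest(4)
stale.set_beam_width(1)
stale_pair = (stale.get_beam_width(), stale.py_beam_width)

fresh = RefreshingRequest(4)
fresh.set_beam_width(1)
fresh_pair = (fresh.get_beam_width(), fresh.py_beam_width)
```

A Python-side override like `RefreshingRequest` only catches calls made through Python. A change made directly on the C++ side would still leave the cache stale, which is the case the reviewer's C++-level suggestion is meant to cover.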

@tongyuantongyu
Member Author

tongyuantongyu commented May 13, 2026

> Caching at the Python level will fail silently if setBeamWidth is called after the value has already been cached here. Pushing this caching down to the C++ level would make it foolproof and prevent a silent bug.

Are we currently calling setBeamWidth after creating an LlmRequest? We can't afford the accumulated overhead of such frequent access to a binding property. This is why we have this long list of py_-prefixed attributes:

    self.py_sampling_strategy: "Strategy | None" = None
    self.py_logits_post_processors = kwargs.pop("py_logits_post_processors",
                                                None)
    self.py_lora_path: str | None = kwargs.pop("py_lora_path", None)
    # Multimodal data
    self.py_multimodal_data = kwargs.pop("py_multimodal_data", None)
    if llm_request is not None:
        super().__init__(llm_request)
    else:
        super().__init__(
            *args,
            client_id=client_id,
            return_log_probs=return_log_probs,
            return_context_logits=False,
            return_generation_logits=False,
            return_perf_metrics=return_perf_metrics,
            stop_words_list=torch.tensor(stop_words_list, dtype=torch.int32)
            if stop_words_list else None,
            **kwargs)
    self.py_client_id = client_id
    self.py_request_id = self.request_id
    self.py_llm_request_type = self.llm_request_type
    self.py_end_id = self.end_id
    self.py_prompt_len = self.prompt_len
    self.py_orig_prompt_len = self.orig_prompt_len
    self.py_max_new_tokens = self.max_new_tokens
    self.py_min_length = self.sampling_config.min_length
    # `seqlen_this_rank_cp`, `total_input_len_cp`, and `py_helix_is_inactive_rank` are relevant to helix parallelism.
    self.seqlen_this_rank_cp = self.prompt_len
    self.total_input_len_cp = self.prompt_len
    self.py_helix_is_inactive_rank = False
    self.py_batch_idx = None
    self.py_draft_pages_allocated = 0
    self.py_rewind_len = 0
    self.py_draft_tokens = [] if self.draft_tokens is None else self.draft_tokens
    self.py_last_context_chunk = (None, None)
    self.py_draft_logits = None
    self.py_target_probs = None
    self.py_last_draft_tokens = None
    self.py_num_accepted_draft_tokens = 0
    self.py_num_accepted_draft_tokens_indices = []
    self.py_rewind_draft_token_separate_adjustment = 0
    self.py_decoding_iter = 0
    self.is_attention_dp_dummy = False
    self.is_cuda_graph_dummy = False
    self.py_kv_transfer_start_time = None
    self.py_kv_transfer_timed_out = False
    # Performance timing info (step metrics, GPU events, context GPU timing)
    # Lazily created only when return_perf_metrics is enabled to avoid
    # overhead for every request.
    self.py_perf_timing: Optional[PerfTimingInfo] = None
    self.py_num_logprobs = num_logprobs
    self.py_return_log_probs = return_log_probs
    self.py_return_context_logits = return_context_logits
    self.py_return_generation_logits = return_generation_logits
    self.py_return_logits_device_memory = return_logits_device_memory
    self.py_additional_outputs = additional_outputs
    self.py_beam_width = cast(int, self.sampling_config.beam_width)
    self.py_is_draft = is_draft
    # The request's sequence slot ID, an index between 0 (inclusive) and max_batch_size (exclusive).
    self.py_seq_slot = seq_slot
    # If the request is a draft request, target_seq_slot is the sequence slot ID of its target request.
    self.py_target_seq_slot = target_seq_slot
    self.use_draft_model = is_draft
    self._cached_tokens = 0
    self._cached_tokens_set = False
    # Whether the request is for the first forward of the draft model.
    self.py_is_first_draft = is_first_draft
    self.d2t = None
    self.py_draft_use_greedy_sampling = False
    self.py_disable_speculative_decoding = False
    # Chunked logits parameters
    self.py_use_chunked_generation_logits = use_chunked_generation_logits
    self.py_logits_chunk_size = logits_chunk_size if not self.streaming else 1
    # TODO: remove this when use DynamicDecodeOp in pytorch flow.
    # currently, keep py_stop_words_list as python list, rather than tensor.
    self.py_stop_words_list = stop_words_list
    self.py_logprobs_mode = LogprobMode(
        logprobs_mode)  # handle passed a raw string
    self.py_disaggregated_params = None
    self.py_num_connector_matched_tokens = 0
    self.py_result = PyResult(
        prompt_len=self.py_prompt_len,
        max_new_tokens=self.py_max_new_tokens,
        use_device_memory=return_logits_device_memory,
        streaming=self.streaming,
        return_log_probs=return_log_probs,
        return_context_logits=return_context_logits,
        return_generation_logits=return_generation_logits,
        exclude_last_generation_logits=exclude_last_generation_logits,
        use_chunked_generation_logits=self.py_use_chunked_generation_logits,
        chunk_size=self.py_logits_chunk_size,
        additional_outputs=additional_outputs)
    self.child_requests = []
    self._py_embedding_bias_1d: Optional[torch.Tensor] = None
    if hasattr(self, 'embedding_bias') and self.embedding_bias is not None:
        # Pre-squeeze to 1D if needed (remove batch dimension)
        if self.embedding_bias.dim() > 1:
            self._py_embedding_bias_1d = self.embedding_bias.squeeze(0)
        else:
            self._py_embedding_bias_1d = self.embedding_bias

@tensorrt-cicd
Collaborator

PR_Github #48084 [ run ] completed with state SUCCESS. Commit: 04e21ea
/LLM/main/L0_MergeRequest_PR pipeline #37915 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tongyuantongyu
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48122 [ run ] triggered by Bot. Commit: 04e21ea Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48122 [ run ] completed with state SUCCESS. Commit: 04e21ea
/LLM/main/L0_MergeRequest_PR pipeline #37949 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

| Component | Vulnerability | Description | Severity |
| --- | --- | --- | --- |
| python-multipart | CVE-2024-53981 | python-multipart is a streaming multipart parser for Python. When parsing form data, python-multipart skips line breaks (CR \r or LF \n) in front of the first boundary and any tailing bytes after the last boundary. This happens one byte at a time and emits a log event each time, which may cause excessive logging for certain inputs. An attacker could abuse this by sending a malicious request with lots of data before the first or after the last boundary, causing high CPU load and stalling the processing thread for a significant amount of time. In case of ASGI application, this could stall the event loop and prevent other requests from being processed, resulting in a denial of service (DoS). This vulnerability is fixed in 0.0.18. | HIGH |
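
Since the report names 0.0.18 as the fixed release, a version string can be compared against it with a few lines of Python. This is a rough sketch: `parse_version` and `is_patched` are hypothetical helpers that only handle plain dotted numeric versions such as "0.0.17", not pre-release or local version suffixes.

```python
def parse_version(text):
    """Turn a plain dotted version like '0.0.18' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_patched(installed, fixed="0.0.18"):
    """Return True if `installed` is at or above the fixed release."""
    return parse_version(installed) >= parse_version(fixed)


old_ok = is_patched("0.0.17")
fixed_ok = is_patched("0.0.18")
newer_ok = is_patched("0.0.20")
```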

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@tongyuantongyu tongyuantongyu requested a review from eopXD May 14, 2026 07:38
@tensorrt-cicd
Collaborator

PR_Github #48328 [ run ] triggered by Bot. Commit: 6fe4c73 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48328 [ run ] completed with state SUCCESS. Commit: 6fe4c73
/LLM/main/L0_MergeRequest_PR pipeline #38136 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48561 [ run ] triggered by Bot. Commit: 6fe4c73 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48561 [ run ] completed with state FAILURE. Commit: 6fe4c73
/LLM/main/L0_MergeRequest_PR pipeline #38350 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48814 [ run ] triggered by Bot. Commit: 6fe4c73 Link to invocation

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48850 [ run ] triggered by Bot. Commit: 6fe4c73 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48814 [ run ] completed with state ABORTED. Commit: 6fe4c73

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48850 [ run ] completed with state FAILURE. Commit: 6fe4c73
/LLM/main/L0_MergeRequest_PR pipeline #38604 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tongyuantongyu
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49160 [ run ] triggered by Bot. Commit: 6fe4c73 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49160 [ run ] completed with state SUCCESS. Commit: 6fe4c73
/LLM/main/L0_MergeRequest_PR pipeline #38841 completed with status: 'SUCCESS'

CI Report

Link to invocation

@tongyuantongyu tongyuantongyu merged commit 4a58dc3 into NVIDIA:main May 20, 2026
7 checks passed
xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026
6 participants