[None][perf] Clear multimodal data upon prefill completion #13259

Merged
2ez4bz merged 1 commit into NVIDIA:main from 2ez4bz:dev-free-mm-io
Apr 23, 2026

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Apr 21, 2026

Summary by CodeRabbit

  • Optimization

    • Improved memory efficiency in multimodal AI models by automatically releasing encoder caches and raw tensors immediately after the prefill phase completes.
  • Refactor

    • Extracted and centralized multimodal data cleanup logic into a shared helper function for improved code maintainability and consistency across the codebase.

Description

  • Why?

The multimodal inputs (pixel values, audio tensors, etc.) as well as outputs (embeddings) are stored in a request's py_multimodal_data dictionary. Before this change, that dictionary was only ever dereferenced once the request completed.

This meant that, at high concurrency, since requests are not queued based on multimodal considerations, GPU memory would keep increasing until the number of concurrent requests that have left the queue decreases. Coupled with large multimodal inputs (e.g. long audio sequences), this could very easily lead to OOM errors at runtime.

  • What?

This commit frees a request's multimodal data once it has completed the prefill phase.
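
As a rough illustration of the change described above, the sketch below shows one way an executor loop could drop a request's multimodal payload once prefill has finished. `SketchRequest`, `prefill_done`, and `release_multimodal_data_after_prefill` are hypothetical names made up for this example, not the TensorRT-LLM classes or functions; only the `py_multimodal_data` attribute name comes from the description. Setting the attribute to `None` is a simplification of whatever the merged code actually retains.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for an executor request (not the real class)."""
    request_id: int
    prefill_done: bool = False
    # Holds raw inputs (e.g. pixel values) and encoder outputs (embeddings).
    py_multimodal_data: Optional[Dict[str, Any]] = field(default_factory=dict)


def release_multimodal_data_after_prefill(requests: List[SketchRequest]) -> int:
    """Drop the multimodal payload of every request that has finished prefill.

    Returns the number of requests whose payload was released.
    """
    released = 0
    for req in requests:
        if req.prefill_done and req.py_multimodal_data:
            # Dropping the reference lets the underlying buffers be reclaimed
            # once nothing else points at them.
            req.py_multimodal_data = None
            released += 1
    return released
```

Under these assumptions, memory held for multimodal payloads would track the number of requests still in prefill rather than the number of all in-flight requests.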

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Collaborator

@venkywonka venkywonka left a comment

Great catch!

@2ez4bz
Collaborator Author

2ez4bz commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44794 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

@2ez4bz 2ez4bz marked this pull request as ready for review April 21, 2026 22:33
@2ez4bz 2ez4bz requested review from a team as code owners April 21, 2026 22:33
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6936dc42-b22b-4ea0-9175-cd430b536db2

📥 Commits

Reviewing files that changed from the base of the PR and between cdbe26d and f217a91.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/inputs/multimodal.py

📝 Walkthrough

These changes introduce memory optimization for multimodal model generation by stripping unnecessary encoder caches and raw pre-encoder tensors after prefill. A new utility function centralizes the stripping logic and is integrated into the request state update flow to release pinned multimodal data immediately after prefill completes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Multimodal Data Stripping Utility (tensorrt_llm/inputs/multimodal.py) | Added strip_mm_data_for_generation() function that clears all multimodal data keys except mrope_config with mrope_position_deltas. Refactored MultimodalParams.strip_for_generation() to copy and delegate to the new utility for code centralization. |
| Request State Integration (tensorrt_llm/_torch/pyexecutor/py_executor.py) | Added _strip_py_multimodal_data_post_prefill() function and integrated it into _update_request_states_tp() to invoke multimodal data stripping when context requests complete prefill, releasing unnecessary tensors early. |
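
To make the stripping behavior summarized above more concrete, here is a minimal sketch of an in-place helper that clears every key except `mrope_config`, keeping only `mrope_position_deltas` inside it. This is one reading of the summary, not the merged implementation: the function name `strip_mm_data_sketch` and the example keys other than `mrope_config` and `mrope_position_deltas` are assumptions for illustration.

```python
from typing import Any, Dict


def strip_mm_data_sketch(mm_data: Dict[str, Any]) -> None:
    """Illustrative in-place strip; not the actual TensorRT-LLM helper.

    Keeps only mrope_config["mrope_position_deltas"] (if present) and
    removes every other entry so the tensors they reference can be freed.
    """
    mrope_config = mm_data.get("mrope_config")
    kept = None
    if isinstance(mrope_config, dict) and "mrope_position_deltas" in mrope_config:
        kept = {"mrope_position_deltas": mrope_config["mrope_position_deltas"]}

    # Clear in place so every holder of this dict sees the stripped view.
    mm_data.clear()
    if kept is not None:
        mm_data["mrope_config"] = kept


# Usage with placeholder values standing in for tensors (hypothetical keys).
data = {
    "pixel_values": [0.1, 0.2],
    "mrope_config": {"mrope_position_deltas": [3], "other_entry": [1]},
}
strip_mm_data_sketch(data)
```

Mutating the dictionary in place, rather than returning a new one, means any other reference to the same dictionary also stops pinning the dropped entries; whether the real helper works this way is not stated in this thread.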

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: clearing multimodal data after prefill completion, which matches the core objective of the PR. |
| Description check | ✅ Passed | The description includes the required sections (Why, What) with clear explanations of the motivation and solution, though the Test Coverage section is empty and the checklist is incomplete. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@2ez4bz
Collaborator Author

2ez4bz commented Apr 21, 2026

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #44820 [ kill ] triggered by Bot. Commit: f217a91 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44820 [ kill ] completed with state SUCCESS. Commit: f217a91
Successfully killed previous jobs for commit f217a91

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 21, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44824 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44824 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35170 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44840 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44840 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35182 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44852 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44852 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35191 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44875 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

* Why?

The multimodal inputs (pixel values, audio tensor, etc.) as well as
outputs (embeddings) get stored in a request's `py_multimodal_data`
dictionary. This is only ever dereferenced once a request is completed.

This meant that, at high concurrency, given that there is no queuing
of requests based on multimodal considerations, GPU memory would keep
increasing until the number of concurrent requests that have gotten out
of the queue decreases. Coupled with large multimodal inputs (e.g. long
audio sequences), this could very easily lead to OOM errors at runtime.

* What?

This commit frees a request's multimodal data once it has completed the
prefill phase.

Signed-off-by: William Zhang <[email protected]>
@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44876 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44875 [ run ] completed with state ABORTED. Commit: f217a91

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44876 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35213 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44923 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44982 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

@2ez4bz 2ez4bz enabled auto-merge (squash) April 23, 2026 04:51
Collaborator

@yechank-nvidia yechank-nvidia left a comment

LGTM

@tensorrt-cicd
Collaborator

PR_Github #44982 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45156 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45156 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35433 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit 837a097 into NVIDIA:main Apr 23, 2026
5 checks passed
ziyixiong-nv pushed a commit to ziyixiong-nv/TensorRT-LLM that referenced this pull request Apr 24, 2026

6 participants