[None][perf] Clear multimodal data upon prefill completion by 2ez4bz · Pull Request #13259 · NVIDIA/TensorRT-LLM

2ez4bz · 2026-04-21T06:04:00Z

Summary by CodeRabbit

Optimization
- Improved memory efficiency in multimodal AI models by automatically releasing encoder caches and raw tensors immediately after the prefill phase completes.
Refactor
- Extracted and centralized multimodal data cleanup logic into a shared helper function for improved code maintainability and consistency across the codebase.

Description

Why?

The multimodal inputs (pixel values, audio tensors, etc.) as well as the outputs (embeddings) are stored in a request's py_multimodal_data dictionary. The reference to this dictionary is only dropped once the request has completed.

As a result, at high concurrency, and because requests are not queued based on multimodal considerations, GPU memory keeps growing until the number of in-flight requests (those that have left the queue) decreases. Coupled with large multimodal inputs (e.g. long audio sequences), this can very easily lead to OOM errors at runtime.

What?

This commit frees a request's multimodal data once it has completed the prefill phase.
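To make the idea concrete, here is a minimal sketch of releasing multimodal data at the prefill-to-generation transition. It is not the code from this PR: `ToyRequest`, `RequestState`, `release_multimodal_data`, and `update_request_states` are hypothetical names, and only the attribute name `py_multimodal_data` comes from the description above. This toy version drops the whole dictionary, whereas the walkthrough further down notes that the real helper keeps some position information.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, Iterable, Optional, Set


class RequestState(Enum):
    CONTEXT = auto()     # prefill still in progress
    GENERATION = auto()  # prefill done, decoding tokens


@dataclass
class ToyRequest:
    # Hypothetical stand-in for an executor request. Only the attribute
    # name py_multimodal_data is taken from the PR description.
    request_id: int
    state: RequestState = RequestState.CONTEXT
    py_multimodal_data: Optional[Dict[str, Any]] = None


def release_multimodal_data(request: ToyRequest) -> None:
    """Drop the request's reference to its multimodal inputs and outputs."""
    # Assumption: nothing else holds a reference, so the tensors become
    # eligible for garbage collection once this reference is gone.
    request.py_multimodal_data = None


def update_request_states(requests: Iterable[ToyRequest],
                          prefill_done_ids: Set[int]) -> None:
    """Move requests that finished prefill to generation and free their data."""
    for req in requests:
        if req.state is RequestState.CONTEXT and req.request_id in prefill_done_ids:
            req.state = RequestState.GENERATION
            release_multimodal_data(req)
```

With this shape, memory held by multimodal tensors is bounded by the number of requests still in prefill rather than by the number of requests still decoding, which is the effect the description is after.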

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

venkywonka

Great catch!

2ez4bz · 2026-04-21T20:08:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-21T20:14:07Z

PR_Github #44794 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

coderabbitai · 2026-04-21T22:36:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6936dc42-b22b-4ea0-9175-cd430b536db2

📥 Commits

Reviewing files that changed from the base of the PR and between cdbe26d and f217a91.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/inputs/multimodal.py

📝 Walkthrough

These changes introduce memory optimization for multimodal model generation by stripping unnecessary encoder caches and raw pre-encoder tensors after prefill. A new utility function centralizes the stripping logic and is integrated into the request state update flow to release pinned multimodal data immediately after prefill completes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Multimodal Data Stripping Utility (`tensorrt_llm/inputs/multimodal.py`) | Added `strip_mm_data_for_generation()` function that clears all multimodal data keys except `mrope_config` with `mrope_position_deltas`. Refactored `MultimodalParams.strip_for_generation()` to copy and delegate to the new utility for code centralization. |
| Request State Integration (`tensorrt_llm/_torch/pyexecutor/py_executor.py`) | Added `_strip_py_multimodal_data_post_prefill()` function and integrated it into `_update_request_states_tp()` to invoke multimodal data stripping when context requests complete prefill, releasing unnecessary tensors early. |
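The "clear everything except `mrope_config` with `mrope_position_deltas`" behaviour and the copy-and-delegate refactor can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `strip_mm_data_sketch` and `ToyMultimodalParams` are hypothetical names, the dictionary layout (a nested `mrope_config` dict) is assumed, and the real signatures in `tensorrt_llm/inputs/multimodal.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


def strip_mm_data_sketch(mm_data: Dict[str, Any]) -> None:
    """In place: keep only mrope_config.mrope_position_deltas, if present."""
    # Assumed layout: mm_data["mrope_config"] is a dict that may contain
    # "mrope_position_deltas" next to other, larger entries.
    mrope = mm_data.get("mrope_config")
    deltas = mrope.get("mrope_position_deltas") if isinstance(mrope, dict) else None
    mm_data.clear()
    if deltas is not None:
        mm_data["mrope_config"] = {"mrope_position_deltas": deltas}


@dataclass
class ToyMultimodalParams:
    # Hypothetical stand-in for MultimodalParams.
    multimodal_data: Dict[str, Any] = field(default_factory=dict)

    def strip_for_generation(self) -> None:
        # Shallow-copy first so other holders of the original dict are
        # unaffected, then delegate to the shared in-place helper.
        data = dict(self.multimodal_data)
        strip_mm_data_sketch(data)
        self.multimodal_data = data
```

Keeping the in-place helper separate from the copying wrapper lets an executor strip a request's own dictionary directly, while callers that share the dictionary with other objects can still go through the method that copies first.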

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: clearing multimodal data after prefill completion, which matches the core objective of the PR. |
| Description check | ✅ Passed | The description includes the required sections (Why, What) with clear explanations of the motivation and solution, though the Test Coverage section is empty and the checklist is incomplete. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


2ez4bz · 2026-04-21T23:39:39Z

/bot kill

tensorrt-cicd · 2026-04-21T23:46:57Z

PR_Github #44820 [ kill ] triggered by Bot. Commit: f217a91 Link to invocation

tensorrt-cicd · 2026-04-21T23:47:33Z

PR_Github #44820 [ kill ] completed with state SUCCESS. Commit: f217a91
Successfully killed previous jobs for commit f217a91

Link to invocation

2ez4bz · 2026-04-21T23:56:38Z

/bot run

tensorrt-cicd · 2026-04-22T00:02:48Z

PR_Github #44824 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

tensorrt-cicd · 2026-04-22T00:45:29Z

PR_Github #44824 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35170 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-22T00:48:07Z

/bot run

tensorrt-cicd · 2026-04-22T00:54:31Z

PR_Github #44840 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

tensorrt-cicd · 2026-04-22T01:40:34Z

PR_Github #44840 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35182 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-22T02:23:51Z

/bot run

tensorrt-cicd · 2026-04-22T02:29:55Z

PR_Github #44852 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

tensorrt-cicd · 2026-04-22T03:32:38Z

PR_Github #44852 [ run ] completed with state SUCCESS. Commit: f217a91
/LLM/main/L0_MergeRequest_PR pipeline #35191 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-22T03:46:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-22T03:52:44Z

PR_Github #44875 [ run ] triggered by Bot. Commit: f217a91 Link to invocation

* Why?

  The multimodal inputs (pixel values, audio tensor, etc.) as well as outputs (embeddings) get stored in a request's `py_multimodal_data` dictionary. This is only ever dereferenced once a request is completed.

  This meant that, at high concurrency, given that there is no queuing of requests based on multimodal considerations, GPU memory would keep increasing until the number of concurrent requests that have gotten out of the queue decreases. Coupled with large multimodal input (e.g. long audio sequences), this could very easily lead to OOM errors at runtime.

* What?

  This commit frees a request's multimodal data once it has completed the prefill phase.

Signed-off-by: William Zhang <[email protected]>

2ez4bz · 2026-04-22T03:57:04Z

/bot run

tensorrt-cicd · 2026-04-22T04:02:55Z

PR_Github #44876 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

tensorrt-cicd · 2026-04-22T04:03:00Z

PR_Github #44875 [ run ] completed with state ABORTED. Commit: f217a91

Link to invocation

tensorrt-cicd · 2026-04-22T06:05:13Z

PR_Github #44876 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35213 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

2ez4bz · 2026-04-22T06:32:18Z

/bot run

tensorrt-cicd · 2026-04-22T06:38:41Z

PR_Github #44923 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

2ez4bz · 2026-04-22T15:52:00Z

/bot run

tensorrt-cicd · 2026-04-22T15:58:43Z

PR_Github #44982 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

yechank-nvidia

LGTM

tensorrt-cicd · 2026-04-23T08:22:31Z

PR_Github #44982 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

yechank-nvidia · 2026-04-23T08:30:25Z

/bot run

tensorrt-cicd · 2026-04-23T08:37:10Z

PR_Github #45156 [ run ] triggered by Bot. Commit: 65b7b1c Link to invocation

tensorrt-cicd · 2026-04-23T19:18:58Z

PR_Github #45156 [ run ] completed with state SUCCESS. Commit: 65b7b1c
/LLM/main/L0_MergeRequest_PR pipeline #35433 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned 2ez4bz Apr 21, 2026

venkywonka approved these changes Apr 21, 2026

2ez4bz marked this pull request as ready for review April 21, 2026 22:33

2ez4bz requested review from a team as code owners April 21, 2026 22:33

2ez4bz requested review from dongxuy04, moraxu and tijyojwad April 21, 2026 22:33

Tabrizian approved these changes Apr 21, 2026

moraxu approved these changes Apr 21, 2026

2ez4bz force-pushed the dev-free-mm-io branch from f217a91 to 65b7b1c April 22, 2026 03:56

2ez4bz enabled auto-merge (squash) April 23, 2026 04:51

yechank-nvidia approved these changes Apr 23, 2026

2ez4bz merged commit 837a097 into NVIDIA:main Apr 23, 2026
5 checks passed
