[None][perf] Clear multimodal data upon prefill completion#13259
Conversation
/bot run --disable-fail-fast
PR_Github #44794 [ run ] triggered by Bot. Commit:
Walkthrough
These changes introduce a memory optimization for multimodal model generation by stripping unnecessary encoder caches and raw pre-encoder tensors after prefill. A new utility function centralizes the stripping logic and is integrated into the request state update flow to release pinned multimodal data immediately after prefill completes.
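The stripping step described in the walkthrough could look roughly like the sketch below. This is not the code from the PR: `FakeRequest`, `strip_multimodal_data`, `update_request_state`, and the key names in `_STRIPPABLE_KEYS` are hypothetical stand-ins, and only the `py_multimodal_data` attribute name comes from the PR itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Hypothetical key names: the real keys stored in py_multimodal_data
# are not shown on this page.
_STRIPPABLE_KEYS = ("pixel_values", "audio_features", "multimodal_embedding")


@dataclass
class FakeRequest:
    """Stand-in for a request object; only py_multimodal_data is from the PR."""
    request_id: int
    is_prefill_done: bool = False
    py_multimodal_data: Optional[Dict[str, Any]] = field(default_factory=dict)


def strip_multimodal_data(request: FakeRequest) -> int:
    """Drop large multimodal entries from the request; return how many were dropped."""
    data = request.py_multimodal_data
    if not data:
        return 0
    dropped = 0
    for key in _STRIPPABLE_KEYS:
        # pop() removes the dictionary's reference so the tensor can be
        # reclaimed once nothing else points at it.
        if data.pop(key, None) is not None:
            dropped += 1
    return dropped


def update_request_state(request: FakeRequest, prefill_completed: bool) -> None:
    """Mark prefill as done and strip multimodal data exactly once."""
    if prefill_completed and not request.is_prefill_done:
        request.is_prefill_done = True
        strip_multimodal_data(request)
```

Keeping the stripping in one helper means every call site that observes the end of prefill releases the same set of entries, and small metadata that is not listed in `_STRIPPABLE_KEYS` stays available for the decode phase.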
* Why?
  The multimodal inputs (pixel values, audio tensors, etc.) as well as outputs (embeddings) are stored in a request's `py_multimodal_data` dictionary. The reference to this data was only dropped once the request completed. Because requests are not queued based on multimodal considerations, GPU memory at high concurrency kept increasing until the number of concurrent requests that had left the queue decreased. Coupled with large multimodal inputs (e.g. long audio sequences), this could very easily lead to OOM errors at runtime.
* What?
  This commit frees a request's multimodal data once it has completed the prefill phase.

Signed-off-by: William Zhang <[email protected]>
Description
The multimodal inputs (pixel values, audio tensors, etc.) as well as outputs (embeddings) are stored in a request's `py_multimodal_data` dictionary. The reference to this data was only dropped once the request completed.
Because requests are not queued based on multimodal considerations, GPU memory at high concurrency kept increasing until the number of concurrent requests that had left the queue decreased. Coupled with large multimodal inputs (e.g. long audio sequences), this could very easily lead to OOM errors at runtime.
This commit frees a request's multimodal data once it has completed the prefill phase.
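To make the memory argument above concrete, here is a toy model of how much multimodal data is held at once, with and without freeing at the end of prefill. The function name `peak_held_bytes`, the step-based timeline, and the request sizes are illustrative assumptions, not measurements or code from this PR.

```python
from typing import List, Tuple

# (arrival_step, prefill_steps, decode_steps, multimodal_bytes)
Request = Tuple[int, int, int, int]


def peak_held_bytes(requests: List[Request], free_after_prefill: bool) -> int:
    """Return the peak number of multimodal bytes held across all requests."""
    events = []
    for arrival, prefill, decode, mm_bytes in requests:
        if free_after_prefill:
            release = arrival + prefill           # freed when prefill ends
        else:
            release = arrival + prefill + decode  # freed when the request ends
        events.append((arrival, mm_bytes))
        events.append((release, -mm_bytes))

    # Tuples sort by time first, then by delta, so a release (negative delta)
    # at a given step is applied before an acquisition at the same step.
    events.sort()

    held = 0
    peak = 0
    for _, delta in events:
        held += delta
        peak = max(peak, held)
    return peak


MiB = 1024 * 1024
# Eight requests arriving one step apart, each with a 1-step prefill,
# a 100-step decode, and 512 MiB of multimodal data (made-up numbers).
workload = [(i, 1, 100, 512 * MiB) for i in range(8)]

print(peak_held_bytes(workload, free_after_prefill=False) // MiB)
print(peak_held_bytes(workload, free_after_prefill=True) // MiB)
```

In this toy workload, holding the data until completion makes the peak scale with the number of in-flight requests, while freeing after prefill bounds it by the number of requests that are in prefill at the same time.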
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.