[TRTLLM-11257][fix] release GPU memory and FDs in MnnvlMemory on pidfd failure to prevent leak #11979
Conversation
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
Details
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
|
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-2" |
|
/bot run |
|
/bot run |
|
/bot kill |
|
/bot run |
|
PR_Github #38166 [ run ] triggered by Bot. Commit: |
|
PR_Github #38166 [ run ] completed with state
|
a44a6bb to a7aaaeb
|
/bot run |
👎 Promotion blocked, new vulnerability found (Vulnerability report)
|
a7aaaeb to f0280cb
|
/bot run |
|
PR_Github #38251 [ run ] triggered by Bot. Commit: |
|
PR_Github #38251 [ run ] completed with state
|
|
/bot run |
|
PR_Github #38351 [ run ] triggered by Bot. Commit: |
f0280cb to 4aacbb4
|
/bot run |
4aacbb4 to 29636ae
|
/bot run |
|
PR_Github #38357 [ run ] triggered by Bot. Commit: |
|
PR_Github #39077 [ run ] triggered by Bot. Commit: |
|
PR_Github #39077 [ run ] completed with state
|
|
/bot run --add-multi-gpu-test |
|
PR_Github #39093 [ run ] triggered by Bot. Commit: |
|
PR_Github #39093 [ run ] completed with state
|
4b3d0a2 to 63cfc24
|
/bot run --add-multi-gpu-test |
|
PR_Github #39176 [ run ] triggered by Bot. Commit: |
|
PR_Github #39176 [ run ] completed with state
|
bcf0eeb to c551a12
|
/bot run --add-multi-gpu-test |
|
PR_Github #39244 [ run ] triggered by Bot. Commit: |
|
PR_Github #39244 [ run ] completed with state
|
Release allocated_mem_handle, exported shareable handle, and open pidfds/remote_fds before re-raise to avoid leaks.
Signed-off-by: ZhaoyangWang <[email protected]>
c551a12 to 9a41699
|
/bot run --add-multi-gpu-test |
|
PR_Github #39353 [ run ] triggered by Bot. Commit: |
|
PR_Github #39353 [ run ] completed with state
|
|
/bot run --add-multi-gpu-test |
|
PR_Github #39388 [ run ] triggered by Bot. Commit: |
|
PR_Github #39388 [ run ] completed with state |
|
Hi @bobboli, all CI checks have passed. Could you help merge this PR? Thanks. |
…d failure to prevent leak (NVIDIA#11979) Signed-off-by: ZhaoyangWang <[email protected]>
Summary by CodeRabbit
Description
When NVLink one-sided communication is used for MoE, workspace allocation in nvlink_one_sided.py calls MnnvlMemory(mapping, workspace_size_per_rank), which allocates via cuMemCreate + cuMemExportToShareableHandle and shares across processes using pidfd_open / pidfd_getfd. If pidfd_open or pidfd_getfd fails (e.g., EPERM in containers without SYS_PTRACE), the code previously raised without releasing resources created in the current attempt, including the CUDA allocation handle, exported shareable handle (FD for POSIX handle type), and any already-opened pidfds / duplicated remote FDs. Because self.WORKSPACE remains None, later retries could repeat this path, causing cumulative GPU memory and FD leaks.
This PR fixes the failure path in the POSIX (non-FABRIC) branch of open_mnnvl_memory in tensorrt_llm/_mnnvl_utils.py by ensuring proper cleanup before re-raising: it closes the exported shareable FD when applicable, calls cuMemRelease on the cuMemCreate allocation handle, and closes any pidfds and duplicated remote FDs opened during the attempt. Cleanup errors are logged as warnings and do not mask the original exception.
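The cleanup-before-re-raise pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in tensorrt_llm/_mnnvl_utils.py: the names open_shared_fds, close_fds_quietly, open_peer_fd and release_allocation are hypothetical, open_peer_fd stands in for the pidfd_open + pidfd_getfd pair, and release_allocation stands in for cuMemRelease on the local allocation handle.

```python
import logging
import os

logger = logging.getLogger("mnnvl_cleanup_sketch")


def close_fds_quietly(fds):
    """Close every descriptor; log a warning instead of raising on failure."""
    for fd in fds:
        try:
            os.close(fd)
        except OSError as err:
            logger.warning("cleanup: failed to close fd %d: %s", fd, err)


def open_shared_fds(peer_pids, open_peer_fd, release_allocation):
    """Collect one descriptor per peer, undoing the whole attempt on failure.

    open_peer_fd(pid) -> int is a hypothetical stand-in for
    pidfd_open + pidfd_getfd; release_allocation() is a hypothetical
    stand-in for cuMemRelease on the handle created in this attempt.
    """
    opened = []
    try:
        for pid in peer_pids:
            opened.append(open_peer_fd(pid))
        return opened
    except Exception:
        # Release everything acquired in this attempt so that a retry
        # does not accumulate leaked descriptors or GPU memory.
        close_fds_quietly(opened)
        try:
            release_allocation()
        except Exception as err:
            # A cleanup failure is only logged; it must not mask the
            # original exception.
            logger.warning("cleanup: failed to release allocation: %s", err)
        # Bare raise re-raises the original error (e.g. EPERM).
        raise
```

With this shape, a caller that retries after a PermissionError starts from a clean state each time, and a failure inside the cleanup itself shows up only as a warning in the log.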
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.