Skip to content

[TRTLLMINF-54][infra] Migrate typed-exception classifier to shared library#13863

Merged
dpitman-nvda merged 1 commit into
NVIDIA:mainfrom
dpitman-nvda:typed-exception-shared-lib-migration
May 8, 2026
Merged

[TRTLLMINF-54][infra] Migrate typed-exception classifier to shared library#13863
dpitman-nvda merged 1 commit into
NVIDIA:mainfrom
dpitman-nvda:typed-exception-shared-lib-migration

Conversation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator

@dpitman-nvda dpitman-nvda commented May 7, 2026

Summary by CodeRabbit

  • Refactor
    • Centralized failure classification logic in test infrastructure to improve consistency and maintainability of error handling across compute platforms.

Description

The typed exception hierarchy (TrtllmCiException, InfraFailure, UserFailure, PipelineInterruption) and FailureClassifier (PATTERN_CATALOG, classify(), flattenThrowable()) are moved out of inline pipeline-script Groovy and into the trtllm-jenkins-shared-lib at src/trtllm/{,exceptions/}. The implementation is unchanged -- only its location.

Why: the Jenkins script-security sandbox requires per-instance signature approval for RuntimeException(String, Throwable) and Throwable.getSuppressed() when those operations execute inside an inline pipeline script. We saw a real-world hit on the first downstream-instance CI run that exercised the typed throw at the SlurmConfig boundary -- five Jenkins instances would each need administrator approval, and every new method call we added would re-trigger the cycle.

Code under src/ in a Jenkins shared library runs outside the sandbox. By moving the classes and the classifier there, we sidestep the approval treadmill entirely. Pipeline scripts only call statically-resolved entry points (FailureClassifier.classify(...) and the constructors of the exception classes), which the sandbox permits without explicit approval.

Test Coverage

N/A, this is a CI change

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The typed exception hierarchy (TrtllmCiException, InfraFailure, UserFailure,
PipelineInterruption) and FailureClassifier (PATTERN_CATALOG, classify(),
flattenThrowable()) are moved out of inline pipeline-script Groovy and into
the trtllm-jenkins-shared-lib at src/trtllm/{,exceptions/}. The
implementation is unchanged -- only its location.

Why: the Jenkins script-security sandbox requires per-instance signature
approval for `RuntimeException(String, Throwable)` and
`Throwable.getSuppressed()` when those operations execute inside an inline
pipeline script. We saw a real-world hit on the first downstream-instance
CI run that exercised the typed throw at the SlurmConfig boundary --
five Jenkins instances would each need administrator approval, and every
new method call we added would re-trigger the cycle.

Code under src/ in a Jenkins shared library runs *outside* the sandbox.
By moving the classes and the classifier there, we sidestep the approval
treadmill entirely. Pipeline scripts only call statically-resolved entry
points (FailureClassifier.classify(...) and the constructors of the
exception classes), which the sandbox permits without explicit approval.

Companion change in trtllm-jenkins-shared-lib: adds
- src/trtllm/exceptions/TrtllmCiException.groovy
- src/trtllm/exceptions/InfraFailure.groovy
- src/trtllm/exceptions/UserFailure.groovy
- src/trtllm/exceptions/PipelineInterruption.groovy
- src/trtllm/FailureClassifier.groovy
The shared-lib PR must merge before this one. The @Library line at the
top of L0_Test.groovy is `@trtllm-jenkins-shared-lib@main` (unpinned),
so the next pipeline build picks up the shared-lib changes automatically
once they are on main.

Diff summary in this commit:
- Adds 5 imports under trtllm.* / trtllm.exceptions.*.
- Removes 200 lines of inline class definitions, PATTERN_CATALOG, and
  classifier methods that were introduced earlier in the typed-exception
  series.
- Re-qualifies the three classify() call sites (SLURM retry @ L1556,
  cacheErrorAndUploadResult @ L1776, K8s pod-launch retry @ L3435) to
  FailureClassifier.classify(...).
- Leaves the legacy classifyInfraFailure() and the four pre-typed-exception
  pattern lists in place. They are deleted by a later cleanup commit in
  the typed-exception series and are unaffected by this migration.

Open in-flight PRs in the typed-exception stack will need to rebase onto
this commit so the typed throws / consumer call sites resolve to the
shared-lib symbols. No code change required during the rebase -- just a
rebase to pick up the new imports.

Signed-off-by: Derek Pitman <[email protected]>
@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The PR migrates the Jenkins pipeline's failure classification logic from an inline implementation to a shared library. It adds imports for typed exceptions and FailureClassifier, removes the legacy local classifier code block, and updates three infra retry mechanisms to call the shared classifier instead of the local function, maintaining existing retry control flow patterns.

Changes

Failure Classification Migration

Layer / File(s) Summary
Shared Library Integration
jenkins/L0_Test.groovy
Imports typed exception classes (TrtllmCiException, InfraFailure, UserFailure, PipelineInterruption) and FailureClassifier from trtllm-jenkins-shared-lib; documents that the classifier hierarchy is now external due to Jenkins sandbox constraints.
Legacy Implementation Removal
jenkins/L0_Test.groovy
Removes the inline typed exception definitions, pattern catalog, throwable flattening utility, and local classify(Throwable ex, String scope) function.
Classification Call Site Updates
jenkins/L0_Test.groovy
Replaces classify(e, ...) with FailureClassifier.classify(e, ...) in three locations: SLURM infra retry loop (runLLMTestlistOnSlurm), Kubernetes cache error handling (cacheErrorAndUploadResult), and Kubernetes pod infra retry loop (runKubernetesPodWithInfraRetry), preserving downstream retry control flow in each context.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: migrating the typed-exception classifier from inline pipeline script to the shared library.
Description check ✅ Passed The description includes all major required sections: a clear explanation of what was moved and why (sandbox signature approval issue), Test Coverage section (marked N/A with justification), and a completed PR Checklist.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

19-23: 💤 Low value

Consider removing unused imports.

TrtllmCiException (line 22) and UserFailure (line 23) are imported but not directly referenced in this file. The code handles UserFailure implicitly via !(c instanceof InfraFailure). If these are intentionally kept for completeness or future use, consider adding a comment to clarify.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 19 - 23, Remove the unused imports
TrtllmCiException and UserFailure from the import block (they are not
referenced; UserFailure is implicitly handled via the !(c instanceof
InfraFailure) check) or, if you intend to keep them for clarity, add a short
comment next to the import lines referencing TrtllmCiException and UserFailure
to explain they are intentionally imported for future use/readability; update
the import block that currently includes FailureClassifier, InfraFailure,
PipelineInterruption, TrtllmCiException, UserFailure accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@jenkins/L0_Test.groovy`:
- Around line 19-23: Remove the unused imports TrtllmCiException and UserFailure
from the import block (they are not referenced; UserFailure is implicitly
handled via the !(c instanceof InfraFailure) check) or, if you intend to keep
them for clarity, add a short comment next to the import lines referencing
TrtllmCiException and UserFailure to explain they are intentionally imported for
future use/readability; update the import block that currently includes
FailureClassifier, InfraFailure, PipelineInterruption, TrtllmCiException,
UserFailure accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6bcf7278-2db7-4162-aeaa-9349f5db1dd9

📥 Commits

Reviewing files that changed from the base of the PR and between 2f79ebb and df80190.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47238 [ run ] triggered by Bot. Commit: df80190 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47238 [ run ] completed with state SUCCESS. Commit: df80190
/LLM/main/L0_MergeRequest_PR pipeline #37188 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47404 [ run ] triggered by Bot. Commit: df80190 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47404 [ run ] completed with state SUCCESS. Commit: df80190
/LLM/main/L0_MergeRequest_PR pipeline #37333 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47428 [ run ] triggered by Bot. Commit: df80190 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47428 [ run ] completed with state SUCCESS. Commit: df80190
/LLM/main/L0_MergeRequest_PR pipeline #37351 completed with status: 'SUCCESS'

CI Report

Link to invocation

@dpitman-nvda dpitman-nvda merged commit 43f4b94 into NVIDIA:main May 8, 2026
10 of 11 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants