[TRTLLMINF-54][infra] Migrate typed-exception classifier to shared library#13863
Conversation
The typed exception hierarchy (TrtllmCiException, InfraFailure, UserFailure,
PipelineInterruption) and FailureClassifier (PATTERN_CATALOG, classify(),
flattenThrowable()) are moved out of inline pipeline-script Groovy and into
the trtllm-jenkins-shared-lib at src/trtllm/{,exceptions/}. The
implementation is unchanged -- only its location.
Why: the Jenkins script-security sandbox requires per-instance signature
approval for `RuntimeException(String, Throwable)` and
`Throwable.getSuppressed()` when those operations execute inside an inline
pipeline script. We saw a real-world hit on the first downstream-instance
CI run that exercised the typed throw at the SlurmConfig boundary --
five Jenkins instances would each need administrator approval, and every
new method call we added would re-trigger the cycle.
Code under src/ in a Jenkins shared library runs *outside* the sandbox.
By moving the classes and the classifier there, we sidestep the approval
treadmill entirely. Pipeline scripts only call statically-resolved entry
points (FailureClassifier.classify(...) and the constructors of the
exception classes), which the sandbox permits without explicit approval.
Companion change in trtllm-jenkins-shared-lib: adds
- src/trtllm/exceptions/TrtllmCiException.groovy
- src/trtllm/exceptions/InfraFailure.groovy
- src/trtllm/exceptions/UserFailure.groovy
- src/trtllm/exceptions/PipelineInterruption.groovy
- src/trtllm/FailureClassifier.groovy
The shared-lib PR must merge before this one. The @Library line at the
top of L0_Test.groovy is `@trtllm-jenkins-shared-lib@main` (unpinned),
so the next pipeline build picks up the shared-lib changes automatically
once they are on main.
Diff summary in this commit:
- Adds 5 imports under trtllm.* / trtllm.exceptions.*.
- Removes 200 lines of inline class definitions, PATTERN_CATALOG, and
classifier methods that were introduced earlier in the typed-exception
series.
- Re-qualifies the three classify() call sites (SLURM retry @ L1556,
cacheErrorAndUploadResult @ L1776, K8s pod-launch retry @ L3435) to
FailureClassifier.classify(...).
- Leaves the legacy classifyInfraFailure() and the four pre-typed-exception
pattern lists in place. They are deleted by a later cleanup commit in
the typed-exception series and are unaffected by this migration.
Open in-flight PRs in the typed-exception stack will need to rebase onto
this commit so the typed throws / consumer call sites resolve to the
shared-lib symbols. No code change required during the rebase -- just a
rebase to pick up the new imports.
Signed-off-by: Derek Pitman <[email protected]>
|
/bot run |
📝 WalkthroughWalkthroughThe PR migrates the Jenkins pipeline's failure classification logic from an inline implementation to a shared library. It adds imports for typed exceptions and FailureClassifier, removes the legacy local classifier code block, and updates three infra retry mechanisms to call the shared classifier instead of the local function, maintaining existing retry control flow patterns. ChangesFailure Classification Migration
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
19-23: 💤 Low valueConsider removing unused imports.
TrtllmCiException(line 22) andUserFailure(line 23) are imported but not directly referenced in this file. The code handlesUserFailureimplicitly via!(c instanceof InfraFailure). If these are intentionally kept for completeness or future use, consider adding a comment to clarify.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@jenkins/L0_Test.groovy` around lines 19 - 23, Remove the unused imports TrtllmCiException and UserFailure from the import block (they are not referenced; UserFailure is implicitly handled via the !(c instanceof InfraFailure) check) or, if you intend to keep them for clarity, add a short comment next to the import lines referencing TrtllmCiException and UserFailure to explain they are intentionally imported for future use/readability; update the import block that currently includes FailureClassifier, InfraFailure, PipelineInterruption, TrtllmCiException, UserFailure accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@jenkins/L0_Test.groovy`:
- Around line 19-23: Remove the unused imports TrtllmCiException and UserFailure
from the import block (they are not referenced; UserFailure is implicitly
handled via the !(c instanceof InfraFailure) check) or, if you intend to keep
them for clarity, add a short comment next to the import lines referencing
TrtllmCiException and UserFailure to explain they are intentionally imported for
future use/readability; update the import block that currently includes
FailureClassifier, InfraFailure, PipelineInterruption, TrtllmCiException,
UserFailure accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6bcf7278-2db7-4162-aeaa-9349f5db1dd9
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
|
/bot run |
|
PR_Github #47238 [ run ] triggered by Bot. Commit: |
|
PR_Github #47238 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47404 [ run ] triggered by Bot. Commit: |
|
PR_Github #47404 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47428 [ run ] triggered by Bot. Commit: |
|
PR_Github #47428 [ run ] completed with state |
…brary (NVIDIA#13863) Signed-off-by: Derek Pitman <[email protected]>
Summary by CodeRabbit
Description
The typed exception hierarchy (TrtllmCiException, InfraFailure, UserFailure, PipelineInterruption) and FailureClassifier (PATTERN_CATALOG, classify(), flattenThrowable()) are moved out of inline pipeline-script Groovy and into the trtllm-jenkins-shared-lib at src/trtllm/{,exceptions/}. The implementation is unchanged -- only its location.
Why: the Jenkins script-security sandbox requires per-instance signature approval for
RuntimeException(String, Throwable)andThrowable.getSuppressed()when those operations execute inside an inline pipeline script. We saw a real-world hit on the first downstream-instance CI run that exercised the typed throw at the SlurmConfig boundary -- five Jenkins instances would each need administrator approval, and every new method call we added would re-trigger the cycle.Code under src/ in a Jenkins shared library runs outside the sandbox. By moving the classes and the classifier there, we sidestep the approval treadmill entirely. Pipeline scripts only call statically-resolved entry points (FailureClassifier.classify(...) and the constructors of the exception classes), which the sandbox permits without explicit approval.
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.