[TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed)#13196

Merged
litaotju merged 1 commit into
NVIDIA:mainfrom
tianyuxbear:fix/nvbug-5819005-unwaive
Apr 28, 2026

@tianyuxbear
Collaborator

@tianyuxbear tianyuxbear commented Apr 20, 2026

Background

NVBug 5819005 was filed on 2026-01-17 after a single CI failure on
TestDeepSeekV3Lite::test_nvfp4_4gpus with the configuration
moe_backend=CUTLASS-mtp_nextn=0-tp2pp2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-low_precision_combine=False-torch_compile=True
on DGX B300 (4 GPU). Accuracy fell to 60.235, 0.27 points below the
statistical threshold of 60.507. The test was waived in
tests/integration/test_lists/waives.txt pending investigation.
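
The configuration string above is easier to audit once it is split into individual settings. The snippet below is a minimal sketch, assuming the ID is a plain hyphen-separated list of `key=value` tokens whose values contain no hyphens; `parse_param_id` is a hypothetical helper written for illustration, not part of the test suite.

```python
def parse_param_id(param_id: str) -> dict[str, str]:
    """Split a hyphen-separated parameter ID into key/value pairs.

    Tokens without '=' (such as 'tp2pp2') are kept as keys with an
    empty value. Assumes no value contains a hyphen.
    """
    params = {}
    for token in param_id.split("-"):
        key, sep, value = token.partition("=")
        params[key] = value if sep else ""
    return params


# Configuration string copied from the PR description.
WAIVED_ID = (
    "moe_backend=CUTLASS-mtp_nextn=0-tp2pp2-fp8kv=False-attention_dp=False-"
    "cuda_graph=False-overlap_scheduler=False-low_precision_combine=False-"
    "torch_compile=True"
)

for key, value in parse_param_id(WAIVED_ID).items():
    print(f"{key:>24}  {value}")
```

Read this way, the waived variant is the CUTLASS MoE backend with tp2pp2 parallelism and torch_compile enabled; every other listed option is off.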

Evidence this is flaky / self-healed

CI history (2025-12-01 ~ 2026-04-20):

  • ~349 runs total: 348 PASSED, 1 FAILED
  • The single failure was on 2026-01-18, all other runs PASSED
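
For context, the observed failure rate implied by these counts can be worked out directly. This is a rough sketch: the run count is the approximate "~349" quoted above, and treating runs as independent with a constant failure probability is an assumption made only for illustration.

```python
from fractions import Fraction

TOTAL_RUNS = 349   # approximate count ("~349") from the CI history
FAILED_RUNS = 1

failure_rate = Fraction(FAILED_RUNS, TOTAL_RUNS)
pass_rate = 1 - failure_rate


def prob_all_pass(n_runs: int, p_pass: float) -> float:
    """Probability that n independent runs all pass (independence is assumed)."""
    return p_pass ** n_runs


print(f"observed failure rate: {float(failure_rate):.3%}")
print(f"P(10 consecutive passes at that rate): {prob_all_pass(10, float(pass_rate)):.3f}")
```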

Local verification (DGX B300, 4 GPU, 2026-04-16):
Ran the exact waived configuration 10 times on current main:

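| Run | Accuracy | Result |
|-----|----------|--------|
| 1   | 64.22    | PASSED |
| 2   | 64.63    | PASSED |
| 3   | 63.27    | PASSED |
| 4   | 63.38    | PASSED |
| 5   | 64.06    | PASSED |
| 6   | 63.27    | PASSED |
| 7   | 63.72    | PASSED |
| 8   | 63.68    | PASSED |
| 9   | 63.31    | PASSED |
| 10  | 63.72    | PASSED |
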
  • 10/10 PASSED
  • Accuracy range: 63.27 ~ 64.63 (avg 63.726, essentially equal
    to reference 63.710)
  • Worst case is 2.76 points above the threshold (60.507), ~10x the
    original failure margin (0.27 points below)
  • The 2026-01-18 low-accuracy run (60.235) is not reproducible
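
The summary figures above can be rechecked from the ten recorded accuracies. The following is a small sketch using only values quoted in this description; the constant and function names are chosen here for illustration.

```python
from statistics import mean

THRESHOLD = 60.507    # statistical threshold quoted above
REFERENCE = 63.710    # reference accuracy quoted above
FAILED_RUN = 60.235   # the single failing CI run

# Accuracies from the 10 local runs on DGX B300.
accuracies = [64.22, 64.63, 63.27, 63.38, 64.06, 63.27, 63.72, 63.68, 63.31, 63.72]


def margin(accuracy: float, threshold: float = THRESHOLD) -> float:
    """Signed distance from the threshold, in accuracy points."""
    return accuracy - threshold


avg = mean(accuracies)
worst = min(accuracies)

print(f"passed: {sum(a >= THRESHOLD for a in accuracies)}/{len(accuracies)}")
print(f"range: {worst:.2f} to {max(accuracies):.2f}, mean {avg:.3f}")
print(f"mean minus reference: {avg - REFERENCE:+.3f}")
print(f"worst-case margin: {margin(worst):+.3f}")
print(f"failed-run margin: {margin(FAILED_RUN):+.3f}")
print(f"margin ratio: {margin(worst) / abs(margin(FAILED_RUN)):.1f}x")
```

The margins are differences in accuracy points, which is the unit the threshold comparison uses.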

Conclusion

One isolated failure in ~349 CI runs, no local reproduction in 10
consecutive runs on current main, and an accuracy distribution
centered on the reference value all point to flaky behavior under an
older software stack that has since self-healed. It is safe to remove
the waive.

Changes

  • Remove the waive entry for this test from
    tests/integration/test_lists/waives.txt
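
The change itself is a one-line deletion, but the same edit can be scripted when several waives need cleaning up. This is a hedged sketch: the `<test id> SKIP (<bug url>)` line layout is an assumption, the sample lines are hypothetical (the bracketed parameters are elided as `[...]`), and `drop_waive_entries` is an illustrative helper rather than an existing repository tool.

```python
BUG_REF = "https://nvbugs/5819005"
TEST_FRAGMENT = "TestDeepSeekV3Lite::test_nvfp4_4gpus"


def drop_waive_entries(lines, test_fragment, bug_ref):
    """Return the lines that do not mention both the test and the bug reference."""
    return [
        line for line in lines
        if not (test_fragment in line and bug_ref in line)
    ]


# Hypothetical file contents; the entry layout is assumed, not taken from the repo.
sample = [
    "accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[...] SKIP (https://nvbugs/5819005)",
    "some/other_test.py::TestOther::test_case SKIP (https://nvbugs/0000000)",
]

remaining = drop_waive_entries(sample, TEST_FRAGMENT, BUG_REF)
print("\n".join(remaining))
```

Matching on both the test name and the bug reference avoids dropping other waived variants of the same test that are tracked under a different bug.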

Test plan

  • Ran the waived configuration 10 times locally on DGX B300
    (4 GPU), 10/10 PASSED
  • CI run with the waive removed (triggered via /bot run on this PR)

Linked

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

A single test waiver entry was removed from the test exemption list for a specific test case variant (accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus with defined parameters) that was previously marked to skip.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Test Waiver Removal: tests/integration/test_lists/waives.txt | Removed one SKIP waiver entry for a specific test case parameterization (fp8kv=False; bug reference: https://nvbugs/5819005). |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~1 minute

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The PR title accurately describes the main change: removing a waiver for the DeepSeekV3Lite nvfp4 4gpus test due to flaky behavior that has self-healed, which aligns with the single-line removal in the waives.txt file and the PR objectives. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, exceeding template requirements with detailed background, evidence of flakiness, local verification data, conclusions, and test plan. |

@tianyuxbear tianyuxbear force-pushed the fix/nvbug-5819005-unwaive branch from 7863023 to a0917be Compare April 20, 2026 03:06
@tianyuxbear tianyuxbear changed the title from "[TRTLLM-10491][nvbug/5819005][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed)" to "[TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed)" Apr 20, 2026
@tianyuxbear
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44278 [ run ] triggered by Bot. Commit: a0917be Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44278 [ run ] completed with state FAILURE. Commit: a0917be
/LLM/main/L0_MergeRequest_PR pipeline #34699 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tianyuxbear tianyuxbear force-pushed the fix/nvbug-5819005-unwaive branch from a0917be to 09f9bc7 Compare April 21, 2026 09:18
@tianyuxbear
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44692 [ run ] triggered by Bot. Commit: 09f9bc7 Link to invocation

@tianyuxbear tianyuxbear force-pushed the fix/nvbug-5819005-unwaive branch from 09f9bc7 to f3e0d02 Compare April 21, 2026 09:33
@tensorrt-cicd
Collaborator

PR_Github #44692 [ run ] completed with state SUCCESS. Commit: 09f9bc7
/LLM/main/L0_MergeRequest_PR pipeline #35058 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tianyuxbear tianyuxbear force-pushed the fix/nvbug-5819005-unwaive branch from f3e0d02 to 6ead19e Compare April 22, 2026 03:14
@tianyuxbear
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44874 [ run ] triggered by Bot. Commit: 6ead19e Link to invocation

@tianyuxbear
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45540 [ run ] triggered by Bot. Commit: 6ead19e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45540 [ run ] completed with state SUCCESS. Commit: 6ead19e
/LLM/main/L0_MergeRequest_PR pipeline #35760 completed with status: 'SUCCESS'

CI Report

Link to invocation

@tianyuxbear tianyuxbear force-pushed the fix/nvbug-5819005-unwaive branch from 8419a70 to 6e95a9a Compare April 27, 2026 05:44
@tianyuxbear
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45664 [ run ] triggered by Bot. Commit: 6e95a9a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45664 [ run ] completed with state SUCCESS. Commit: 6e95a9a
/LLM/main/L0_MergeRequest_PR pipeline #35875 completed with status: 'SUCCESS'

CI Report

Link to invocation

@litaotju litaotju merged commit 3a790bd into NVIDIA:main Apr 28, 2026
5 checks passed
@tianyuxbear
Collaborator Author

/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #45865 [ ] completed with state FAILURE. Commit: 6e95a9a
Not allowed on merged PR

Link to invocation
