[TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed) #13196
7863023 to a0917be
/bot run --disable-fail-fast
PR_Github #44278 [ run ] triggered by Bot.
PR_Github #44278 [ run ] completed.
a0917be to 09f9bc7
/bot run --disable-fail-fast
PR_Github #44692 [ run ] triggered by Bot.
09f9bc7 to f3e0d02
PR_Github #44692 [ run ] completed.
f3e0d02 to 6ead19e
/bot run --disable-fail-fast
PR_Github #44874 [ run ] triggered by Bot.
/bot run --disable-fail-fast
PR_Github #45540 [ run ] triggered by Bot.
PR_Github #45540 [ run ] completed.
6ead19e to 8419a70
…self-healed) Signed-off-by: Tianyu Xiong <[email protected]>
8419a70 to 6e95a9a
/bot run --disable-fail-fast
PR_Github #45664 [ run ] triggered by Bot.
PR_Github #45664 [ run ] completed.
/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-Post-Merge-1"
PR_Github #45865 [ ] completed.
…self-healed) (NVIDIA#13196) Signed-off-by: Tianyu Xiong <[email protected]>
Background
NVBug 5819005 was filed on 2026-01-17 after a single CI failure on
TestDeepSeekV3Lite::test_nvfp4_4gpus with the configuration moe_backend=CUTLASS-mtp_nextn=0-tp2pp2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-low_precision_combine=False-torch_compile=True on DGX B300 (4 GPU). Accuracy fell to 60.235, 0.27 percentage points below the
statistical threshold of 60.507. The test was waived in
tests/integration/test_lists/waives.txt pending investigation.
Evidence this is flaky / self-healed
CI history (2025-12-01 ~ 2026-04-20):
Local verification (DGX B300, 4 GPU, 2026-04-16):
Ran the exact waived configuration 10 times on current main:
to reference 63.710)
original failure margin (−0.27%)
Conclusion
One isolated failure in ~349 CI runs, no local reproduction in 10
consecutive runs on current main, and an accuracy distribution
centered on the reference value: this is flaky behavior under an
older software stack that has since self-healed. Safe to remove
the waive.
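The numbers behind this conclusion can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are made up for this example, the inputs are the figures quoted in this description, and the CI run count is the approximate value (~349) stated above. It shows that the original failure missed the threshold by 0.272 accuracy points, and that one failure in roughly 349 runs is an observed failure rate well under 1%.

```python
# Figures quoted in the PR description (names are illustrative, not from the repo).
observed_accuracy = 60.235   # accuracy of the single failing CI run
threshold = 60.507           # statistical pass threshold
ci_runs = 349                # approximate number of CI runs in the window
ci_failures = 1              # failures observed in that window

# Absolute gap in accuracy points (the "0.27" margin quoted above).
abs_gap = observed_accuracy - threshold

# The same gap expressed relative to the threshold, in percent.
rel_gap_pct = abs_gap / threshold * 100

# Observed CI failure rate, in percent.
failure_rate_pct = ci_failures / ci_runs * 100

print(f"absolute gap:  {abs_gap:.3f} points")
print(f"relative gap:  {rel_gap_pct:.2f} %")
print(f"failure rate:  {failure_rate_pct:.2f} %")
```

Note that the quoted 0.27 margin is the absolute difference in accuracy points; relative to the threshold, the shortfall is somewhat larger.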
Changes
tests/integration/test_lists/waives.txt
Test plan
(4 GPU), 10/10 PASSED
/bot run on this PR)
Linked