[None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation#12831
…, performance analysis, and compilation

Add specialized agents and skills covering:
- Kernel writing: CuTe DSL, Triton, TileIR optimization, CUDA C++
- Performance: Nsight Systems/Compute analysis, host overhead, CUDA graphs, sync-free, workload profiling
- Compilation: local and SLURM-based TRT-LLM builds
- Code contribution and codebase exploration guides
- Updates to existing AD and serve-config skills

Signed-off-by: Kaiyu Xie <[email protected]>
/bot skip --comment "currently no CI/CD coverage for skills and agents"

PR_Github #42301 [ skip ] triggered by Bot. Commit:
📝 Walkthrough
This pull request introduces a comprehensive agent and skill framework for TensorRT-LLM development and optimization. It adds new agent definitions for model compilation, kernel development across multiple frameworks (CUDA, Triton, CuTe, TileIR), and performance profiling/optimization. Accompanying skill files define detailed workflows for each agent, supported by Python utility scripts for kernel verification and benchmarking, plus extensive reference documentation covering kernel APIs, patterns, and performance analysis tooling.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (24)
.claude/skills/perf-host-analysis/references/iteration-isolation-techniques.md-85-87 (1)
85-87: ⚠️ Potential issue | 🟡 Minor
Label this fenced block to keep markdownlint happy.
The new fence is unlabeled and trips MD040. `text` is enough here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-host-analysis/references/iteration-isolation-techniques.md around lines 85 - 87, The fenced code block containing "[Executor] _forward_step N: X ctx reqs, Y gen reqs" is unlabeled and triggers markdownlint MD040; update the fence delimiter from ``` to ```text so the block is labeled (e.g., replace the opening and closing backticks around that line in iteration-isolation-techniques.md with ```text and ``` respectively) to satisfy the linter while preserving the content.

.claude/skills/exec-slurm-compile/scripts/enroot-import-27-35 (1)
27-35: ⚠️ Potential issue | 🟡 Minor
Fix the stale help text.
The usage block says the default time is 15 minutes, but the code defaults to 1 hour, and the dependency example is missing the required `afterok:` separator. Both are copy-paste footguns for a brand-new CLI.
Also applies to: 72-72
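For illustration only, a minimal Python sketch of the dependency syntax this comment asks for. `afterok_dependency` is a hypothetical helper, not part of the `enroot-import` script, and it assumes Slurm's `afterok:job_id[:job_id...]` form for `sbatch --dependency`.

```python
def afterok_dependency(job_ids):
    """Build an sbatch flag of the form --dependency=afterok:<id>[:<id>...].

    Hypothetical helper for illustration; not part of enroot-import.
    """
    # int() rejects anything that is not a plain numeric job ID.
    ids = [str(int(job_id)) for job_id in job_ids]
    if not ids:
        raise ValueError("at least one job ID is required")
    # The colon after "afterok" is the separator the stale help text omitted.
    return "--dependency=afterok:" + ":".join(ids)


print(afterok_dependency([12345]))  # --dependency=afterok:12345
```

Generating the flag from one place, rather than retyping it in help text, is one way to keep the example and the real invocation from drifting apart.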
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/exec-slurm-compile/scripts/enroot-import around lines 27 - 35, Update the help/usage text in the enroot-import script to reflect the actual defaults and correct dependency syntax: change the default time note from "15 minutes" to "1 hour" (or "60 minutes") wherever the usage block or the "[--time=...|-t]" description appears, and fix the dependency example to include the required "afterok:" separator (e.g., "--dependency=afterok:<jobid>") in the example sentence that currently shows "'--dependency=afterok<jobid>'"; ensure these edits are made in the script's usage/help string (the top-of-script usage block and the example line) so they match the actual sbatch behavior.

.claude/skills/perf-host-analysis/references/trtllm-nvtx-ranges.md-53-66 (1)
53-66: ⚠️ Potential issue | 🟡 Minor
Add language tags to the new fenced blocks.
Both unlabeled fences trip MD040. Mark them as `text` (or `python` for the gap formula) so this reference stays lint-clean.
Also applies to: 176-178
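Since MD040 recurs throughout this review, a small checker can list the offending fences up front. The sketch below is an approximation under stated assumptions, not markdownlint's own implementation: `unlabeled_fences` is a hypothetical name, and the regex only handles top-level fences indented by at most three spaces.

```python
import re

# An opening or closing fence: up to 3 spaces, then 3+ backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def unlabeled_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that lack a language tag."""
    missing = []
    open_marker = None  # fence string that opened the current block, if any
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_marker is None:
            open_marker = marker  # entering a code block
            if not info:
                missing.append(lineno)
        elif (marker[0] == open_marker[0]
              and len(marker) >= len(open_marker)
              and not info):
            open_marker = None  # a matching bare fence closes the block
    return missing


fence = "`" * 3  # built at runtime so this example contains no literal fence line
sample = "\n".join(
    ["intro", fence, "plain", fence, "", fence + "text", "labeled", fence]
)
print(unlabeled_fences(sample))  # [2]
```

Running something like this over the new `references/*.md` files before pushing would surface every unlabeled fence in one pass instead of one review comment at a time.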
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-host-analysis/references/trtllm-nvtx-ranges.md around lines 53 - 66, The unlabeled fenced code blocks showing the Executor _forward_step list and the separate gap formula block should be given language tags to satisfy MD040: add ```text before the executor step list block and ```python before the gap-formula block (or use ```text if the formula is non-executable), and update the other similar unlabeled fenced block later in the file the reviewer called out the same way so all unlabeled blocks are labeled.

.claude/skills/kernel-cute-writing/references/concepts-layouts.md-117-123 (1)
117-123: ⚠️ Potential issue | 🟡 Minor
Correct the complement example.
`6:4` denotes six positions at stride 4, but the listed complement set has only five elements and does not match that layout. Please fix either the layout expression or the enumerated indices.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-cute-writing/references/concepts-layouts.md around lines 117 - 123, The complement example for complement(4:1, 24) is inconsistent: the layout expression "6:4" implies six positions at stride 4 but the enumerated set shows only five elements; to fix, make the two representations match by either changing "6:4" to "5:4" to match the listed indices {4,8,12,16,20}, or else change the enumerated indices and total size so "6:4" is valid (e.g., include {4,8,12,16,20,24} and adjust the second argument from 24 to 25); update the example so complement(4:1, 24) and the layout token ("5:4" or "6:4") and the explicit index list are all consistent.

.claude/skills/perf-host-analysis/references/examples.md-22-32 (1)
22-32: ⚠️ Potential issue | 🟡 Minor
Add a language tag to these fenced blocks.
markdownlint is already flagging every example fence here. Using `text` for these threshold tables will clear the warnings and improve editor rendering.
🧹 Proposed fix
-``` +```text GPU idle ratio: 42.1% → >30% threshold → CROSSED Launch overhead: 12.0% → >10% threshold → CROSSED @@ Crossed: 6/6 → Verdict: YES (host overhead IS the bottleneck) Host prep confirmed: YES (3b=15% AND 3c=85% both crossed)Apply the same `text` fence to the other five example blocks in this file. </details> Also applies to: 36-45, 51-66, 72-95, 101-116, 122-137 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/references/examples.md around lines 22 -
32, Change the six fenced code blocks in
.claude/skills/perf-host-analysis/references/examples.md from plain ``` to language-tagged ```text; specifically, for each example block that begins with
lines like "GPU idle ratio: 42.1% ..." (and the other five similar threshold
tables), replace the opening triple backticks with ```text so markdownlint stops
flagging them and editor rendering improves.</details> </blockquote></details> <details> <summary>.claude/skills/exec-slurm-compile/scripts/compile.sh-21-22 (1)</summary><blockquote> `21-22`: _⚠️ Potential issue_ | _🟡 Minor_ **Documentation inconsistency: `--no-venv` mentioned in comment but not in default command.** The comment on line 22 lists `--no-venv` as a default flag, but the actual default command (lines 38-42) does not include it. Either add `--no-venv` to the default command or remove it from the comment. <details> <summary>Proposed fix (if `--no-venv` should be a default)</summary> ```diff python3 ./scripts/build_wheel.py \ --trt_root /usr/local/tensorrt \ --benchmarks \ -a "100-real" \ - --nvtx + --nvtx \ + --no-venv fi ``` </details> Also applies to: 36-42 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/exec-slurm-compile/scripts/compile.sh around lines 21 - 22, The comment lists --no-venv as a default build_wheel.py flag but the script's actual default invocation of build_wheel.py (the block that constructs the default flags/command around the build_wheel.py call) does not include --no-venv; either add --no-venv to that default flags list in the default command invocation or remove it from the comment so they match; update the build_wheel.py invocation (the default FLAGS/args used when calling build_wheel.py) to include --no-venv if you want it to be a true default, or edit the comment text to drop --no-venv if it should not be default. ``` </details> </blockquote></details> <details> <summary>.claude/skills/perf-host-optimization/references/examples.md-84-136 (1)</summary><blockquote> `84-136`: _⚠️ Potential issue_ | _🟡 Minor_ **Label the optimization transcript code fence.** Line 84 starts an unlabeled fenced block. Add `text` to satisfy MD040 and improve renderer behavior. 
<details> <summary>Suggested fix</summary> ```diff -``` +```text Round 0 (Baseline): -> Profile with default functions -> Identify _forward_step spends 98.7% in model_engine.forward() @@ _prepare_tp_inputs: 57.7s -> 40.1s (-30.5% cumulative) Mean TPOT: 37.4ms -> 28.9ms (-22.6%) Output throughput: ~3200 -> 3799 tok/s (+18.7%) ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-optimization/references/examples.md around lines 84
- 136, The fenced code block that begins with ``` (the unlabeled triple-backtick block
containing the optimization transcript) must be changed to a labeled fence by replacing the opening ``` with ```text to satisfy MD040 and ensure correct rendering; locate the block that starts with "Round 0 (Baseline):" and update only the opening fence to ```text (leave the contents and closing ```
unchanged).</details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/concepts-semantics.md-65-69 (1)</summary><blockquote> `65-69`: _⚠️ Potential issue_ | _🟡 Minor_ **Add a language tag to the fenced block.** Line 65 uses an unlabeled fenced code block, which triggers MD040. Use `text` (or `plaintext`) for this hierarchy snippet. <details> <summary>Suggested fix</summary> ```diff -``` +```text {bool} < {int8, int16, int32, int64, uint8, uint16, uint32, uint64} < {fp8, fp16, bf16, fp32, fp64} ^ ^ (integral types) ^ (floating types) kind 0 kind 1 kind 2 ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/references/concepts-semantics.md around lines 65 - 69, The fenced code block containing the type hierarchy starting with "{bool} < {int8, int16, int32, int64, uint8, uint16, uint32, uint64} < {fp8, fp16, bf16, fp32, fp64}" should include a language tag to satisfy MD040; change the opening triple-backticks to use "text" (or "plaintext") so the block becomes a labeled plaintext code fence and keep the inner lines unchanged. ``` </details> </blockquote></details> <details> <summary>.claude/skills/perf-host-analysis/references/output-format.md-82-82 (1)</summary><blockquote> `82-82`: _⚠️ Potential issue_ | _🟡 Minor_ **Use consistent skill name for handoff command.** Line 82 says `host-perf-optimization`, but the rest of this file uses `perf-host-optimization`. Please standardize to one canonical skill name. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/perf-host-analysis/references/output-format.md at line 82, The handoff skill name is inconsistent: replace the string "host-perf-optimization" with the canonical "perf-host-optimization" so the command matches the rest of the document; search for and update any occurrences of "host-perf-optimization" in this file to "perf-host-optimization" to ensure uniformity. ``` </details> </blockquote></details> <details> <summary>.claude/skills/perf-host-analysis/references/output-format.md-22-22 (1)</summary><blockquote> `22-22`: _⚠️ Potential issue_ | _🟡 Minor_ **Add a language identifier to the fenced block.** The fenced code block at Line 22 should specify a language (e.g., `text`) to satisfy markdown lint rules. <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/references/output-format.md at line 22,
The fenced code block currently uses plain ```; update that fence to include a language identifier (for example change ``` to ```text) so the block declares
its language and satisfies the markdown lint rule for fenced code blocks.</details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/patterns-gemm.md-242-242 (1)</summary><blockquote> `242-242`: _⚠️ Potential issue_ | _🟡 Minor_ **Correct typo in key pattern bullet.** “subtiling” should be “sub-tiling” (or “tiling”) for clarity. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/references/patterns-gemm.md at line 242, Fix the typo in the key pattern bullet that reads "Epilogue subtiling:" by changing it to "Epilogue sub-tiling:" (or simply "Epilogue tiling:") so the heading is clear; update the bullet text in the patterns-gemm.md entry that contains the phrase "Slice accumulator along N for the store phase to cut register pressure." to use the corrected label. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-cute-writing/references/api-runtime-utils.md-128-141 (1)</summary><blockquote> `128-141`: _⚠️ Potential issue_ | _🟡 Minor_ **Add missing import for `ClcDynamicPersistentTileSchedulerParams`.** The dynamic scheduler snippet uses `ClcDynamicPersistentTileSchedulerParams` but doesn't import it. Add to the import block: ```python from cutlass.utils import ( StaticPersistentTileScheduler, ClcDynamicPersistentTileScheduler, PersistentTileSchedulerParams, ClcDynamicPersistentTileSchedulerParams, ) ``` Without this, the snippet will fail with a `NameError` on line 138. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/kernel-cute-writing/references/api-runtime-utils.md around lines 128 - 141, The import block is missing ClcDynamicPersistentTileSchedulerParams which causes a NameError when using ClcDynamicPersistentTileScheduler.create; update the import tuple that currently lists StaticPersistentTileScheduler, ClcDynamicPersistentTileScheduler, and PersistentTileSchedulerParams to also include ClcDynamicPersistentTileSchedulerParams so the dynamic scheduler snippet can reference ClcDynamicPersistentTileSchedulerParams without error. ``` </details> </blockquote></details> <details> <summary>.claude/skills/perf-optimization/SKILL.md-122-129 (1)</summary><blockquote> `122-129`: _⚠️ Potential issue_ | _🟡 Minor_ **Add language identifiers to fenced code blocks** These fenced blocks are missing language tags, which triggers markdownlint (`MD040`) and reduces render/tooling quality. <details> <summary>Suggested fix</summary> ```diff -``` +```text Primary bottleneck: memory-bound Evidence: Memory bandwidth at 89% of peak, compute at 35% Recommendations: 1. [High] Enable FlashAttention for self-attention layers 2. [Medium] Apply memory pooling for attention buffers 3. [Low] Consider gradient checkpointing for memory reduction ``` -``` +```text # Before delegating to specialist backup_file("train.py") ... revert_file("train.py") ``` -``` +```markdown ## Optimization Applied: <optimization_name> ... SUCCESS: Achieved X% improvement ``` -``` +```markdown ## Optimization Summary ... - <reason for not applying> ``` ``` </details> Also applies to: 158-170, 293-312, 316-347 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-optimization/SKILL.md around lines 122 - 129, The fenced
code blocks in .claude/skills/perf-optimization/SKILL.md are missing language
identifiers (triggering MD040); update each triple-backtick block that contains
the snippets starting with "Primary bottleneck: memory-bound", the block that
begins with "Before delegating to specialist", the "## Optimization Applied:
<optimization_name>" block, and the "## Optimization Summary" block (and any
similar blocks in the other ranges noted) by adding appropriate language tags
such as text or markdown (e.g., ```text or ```markdown) to each opening fence so
markdownlint passes and rendering/tooling improve.</details> </blockquote></details> <details> <summary>.claude/skills/perf-nsight-compute-analysis/references/sections-guide.md-223-229 (1)</summary><blockquote> `223-229`: _⚠️ Potential issue_ | _🟡 Minor_ **Annotate metric counts as version-specific or use qualitative sizing** The `~Metrics` values in this table are version-dependent and will become stale as Nsight Compute releases add or modify section files and metrics. Mark these counts with the Nsight Compute version they were collected with (e.g., "As of Nsight Compute 2024.3: 213"), or replace exact counts with relative labels (low/medium/high) that remain accurate across versions. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-nsight-compute-analysis/references/sections-guide.md around lines 223 - 229, The "~Metrics" column contains version-dependent counts that will go stale; update the table rows for `basic`, `detailed`, `full`, `roofline`, and `nvlink` so each cell either prefixes the count with the Nsight Compute version/date (e.g., "As of Nsight Compute 2024.3: 213") or replaces the numeric count with a qualitative size label ("low/medium/high"); also add or update a short footnote above/below the table explaining that counts are versioned and how to interpret qualitative labels so future readers know which approach was used. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-tileir-optimization/references/tma-conversion.md-56-57 (1)</summary><blockquote> `56-57`: _⚠️ Potential issue_ | _🟡 Minor_ **Add safety guard for NUM_CTAS division.** The code divides `NUM_SMS` by `NUM_CTAS` without checking if `NUM_CTAS` exists in `nargs` or is non-zero. This could cause `KeyError` or `ZeroDivisionError` at runtime. 
<details> <summary>🛡️ Proposed fix with guard condition</summary> ```diff # Prevent oversubscription with 2CTA - if "NUM_SMS" in nargs and "NUM_CTAS" in nargs: + if "NUM_SMS" in nargs and "NUM_CTAS" in nargs and nargs["NUM_CTAS"] > 0: nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"] ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-tileir-optimization/references/tma-conversion.md around lines 56 - 57, The code performs integer division of nargs["NUM_SMS"] by nargs["NUM_CTAS"] without ensuring nargs contains "NUM_CTAS" and that it's non-zero, which can raise KeyError or ZeroDivisionError; update the block around the check for "NUM_SMS" and "NUM_CTAS" so you validate that "NUM_CTAS" is present in nargs and that int(nargs["NUM_CTAS"]) != 0 (or truthy) before doing nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"], and if the guard fails, either skip the division or set a safe default/raise a clear error. Ensure you reference the same keys ("NUM_SMS", "NUM_CTAS") and the nargs dict when applying the guard. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-cute-writing/references/patterns-getting-started.md-96-113 (1)</summary><blockquote> `96-113`: _⚠️ Potential issue_ | _🟡 Minor_ **Import `cutlass` in this kernel example.** This block uses `cutlass.dynamic_expr(...)` on line 109 but never imports `cutlass`, so a standalone copy-paste run fails with `NameError`. <details> <summary>💡 Suggested fix</summary> ```diff +import cutlass + `@cute.kernel` def my_kernel(gA: cute.Tensor, gC: cute.Tensor): ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/kernel-cute-writing/references/patterns-getting-started.md around lines 96 - 113, The kernel example uses cutlass.dynamic_expr inside the my_kernel function but never imports the cutlass module; add a top-level import for cutlass (e.g., import cutlass) so cutlass.dynamic_expr is defined when my_kernel runs, ensuring the reference in the my_kernel body (where cutlass.dynamic_expr(thread_idx < total_tiles) is called) no longer raises NameError. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-cute-writing/references/patterns-gemm.md-241-247 (1)</summary><blockquote> `241-247`: _⚠️ Potential issue_ | _🟡 Minor_ **Fix the broken multiline `torch.empty(...).permute(...)` examples in the code snippet.** Lines 242-246 start chained calls on new lines without wrapping the full expression, which is invalid Python syntax and raises `SyntaxError` when copied as written. <details> <summary>Suggested fix</summary> ```diff -fake_A = torch.empty(8, 8, 1, dtype=torch.float16, device="cuda") - .permute(2, 1, 0) # M-major physical layout -fake_B = torch.empty(8, 8, 1, dtype=torch.float16, device="cuda") - .permute(2, 1, 0) # N-major physical layout +fake_A = ( + torch.empty(8, 8, 1, dtype=torch.float16, device="cuda") + .permute(2, 1, 0) # M-major physical layout +) +fake_B = ( + torch.empty(8, 8, 1, dtype=torch.float16, device="cuda") + .permute(2, 1, 0) # N-major physical layout +) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-cute-writing/references/patterns-gemm.md around lines 241 - 247, The multiline examples for creating fake tensors (fake_A, fake_B, fake_C) break Python syntax because the chained .permute(...) calls are placed on a new line; fix by putting the full expression on one logical line or by enclosing the torch.empty(...) call in parentheses so the subsequent .permute(...) 
remains attached (e.g., ensure fake_A = (torch.empty(...).permute(...)) or keep fake_A = torch.empty(...).permute(...) on one line), and do the same for fake_B and fake_C to restore valid chaining for permute. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-tileir-optimization/scripts/tileir_check.py-145-154 (1)</summary><blockquote> `145-154`: _⚠️ Potential issue_ | _🟡 Minor_ **Mock scenario is internally inconsistent with recommendation logic.** At Line 148 and Line 153, mock values indicate nvtriton is not installed, but recommendation assumes nvtriton is present and only TileIR activation is missing. <details> <summary>Proposed fix</summary> ```diff def _mock_data() -> dict: """Return realistic mock data for testing without GPU/triton.""" return { - "nvtriton_installed": False, + "nvtriton_installed": True, "tileir_active": False, "blackwell_gpu": True, "triton_installed": True, "gpu_capability": [10, 0], "recommendation": ("Set TRITON_PTXAS_PATH and run with ENABLE_TILE=1 to activate TileIR"), } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-tileir-optimization/scripts/tileir_check.py around lines 145 - 154, The mock in _mock_data returns nvtriton_installed: False but the recommendation string assumes nvtriton is present; update the mock to be internally consistent by either setting "nvtriton_installed": True if you want the scenario where only TileIR activation is missing, or change the "recommendation" text to instruct installing nvtriton when "nvtriton_installed" is False; adjust the dictionary returned by _mock_data (keys: "nvtriton_installed", "tileir_active", "recommendation") accordingly so the boolean flags and the recommendation align. 
``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/SKILL.md-43-46 (1)</summary><blockquote> `43-46`: _⚠️ Potential issue_ | _🟡 Minor_ **Inconsistency: `tl.sigmoid()` warning may conflict with code examples.** Line 43 warns that `tl.sigmoid()` is unavailable in some Triton versions and recommends using `1.0 / (1.0 + tl.exp(-x_fp32))`. However, the SiLU + Multiply pattern in `references/patterns-fusion.md` (line 103) uses `tl.sigmoid(gate)` directly. Consider adding a note to the pattern file about this compatibility concern, or update the pattern to use the workaround. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/SKILL.md around lines 43 - 46, Update the SiLU + Multiply pattern in references/patterns-fusion.md to avoid calling tl.sigmoid(gate) directly because tl.sigmoid() is missing in some Triton versions; instead, replace that usage with the recommended portable expression 1.0 / (1.0 + tl.exp(-gate_fp32)) (ensure gate is cast to fp32 for the exp and then cast back to the original dtype before tl.store), or add an explicit compatibility note next to the SiLU pattern that instructs readers to use the exp-based fallback when targeting older Triton versions and to perform proper dtype casts. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/SKILL.md-245-253 (1)</summary><blockquote> `245-253`: _⚠️ Potential issue_ | _🟡 Minor_ **Reference function randomness won't match kernel output.** The `reference_fn` uses `torch.manual_seed` for dropout, but the kernel uses `tl.rand` with a different PRNG algorithm. Verification will fail due to different random masks even with the same seed. For dropout kernels, either: 1. Skip dropout comparison and verify GELU only, or 2. 
Document that dropout verification requires comparing statistical properties rather than exact values. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/SKILL.md around lines 245 - 253, The reference_fn uses PyTorch dropout with torch.manual_seed which won't match the kernel's tl.rand PRNG, so update the test to avoid byte-for-byte dropout comparison: either change reference_fn to return only the GELU output (remove/dropout in reference_fn) and compare kernel output to GELU-only results, or modify the verification logic that calls get_inputs to skip exact-mask checks and instead compare statistical properties (mean/variance or dropout rate) between the kernel output and a sampled PyTorch dropout across many seeds; reference the functions reference_fn and get_inputs and the kernel's use of tl.rand when making the change. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-99-106 (1)</summary><blockquote> `99-106`: _⚠️ Potential issue_ | _🟡 Minor_ **`tl.sigmoid()` usage conflicts with SKILL.md compatibility warning.** The SKILL.md (line 43) warns that `tl.sigmoid()` is unavailable in some Triton versions and recommends `1.0 / (1.0 + tl.exp(-x_fp32))`. This pattern uses `tl.sigmoid(gate)` directly. Consider using the workaround for broader compatibility. <details> <summary>♻️ Proposed fix for compatibility</summary> ```diff # SiLU(gate) * x = gate * sigmoid(gate) * x - silu_gate = gate * tl.sigmoid(gate) + gate_fp32 = gate.to(tl.float32) + sigmoid_gate = 1.0 / (1.0 + tl.exp(-gate_fp32)) + silu_gate = gate * sigmoid_gate.to(gate.dtype) out = silu_gate * x ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around lines 99 - 106, Replace the direct call to tl.sigmoid in the SiLU computation with the compatible expression using tl.exp: compute sig = 1.0 / (1.0 + tl.exp(-gate_fp32)) (ensuring gate is cast to fp32 if necessary), then calculate silu_gate = gate * sig and out = silu_gate * x before storing; update the expressions around tl.sigmoid(gate), silu_gate and out to use this exp-based sigmoid to maintain Triton version compatibility. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-308-316 (1)</summary><blockquote> `308-316`: _⚠️ Potential issue_ | _🟡 Minor_ **Missing fp32 cast for LayerNorm computation.** Same precision concern as RMSNorm - the variance computation and `tl.sqrt` should use fp32 intermediates for fp16/bf16 inputs. <details> <summary>♻️ Proposed fix</summary> ```diff # LayerNorm - mean = tl.sum(x, axis=0) / n_cols - x_centered = x - mean - var = tl.sum(x_centered * x_centered, axis=0) / n_cols - x_norm = x_centered / tl.sqrt(var + eps) + x_fp32 = x.to(tl.float32) + mean = tl.sum(x_fp32, axis=0) / n_cols + x_centered = x_fp32 - mean + var = tl.sum(x_centered * x_centered, axis=0) / n_cols + x_norm = x_centered / tl.sqrt(var + eps) weight = tl.load(weight_ptr + col_offsets, mask=mask, other=1.0) bias = tl.load(bias_ptr + col_offsets, mask=mask, other=0.0) - out = x_norm * weight + bias + out = (x_norm * weight + bias).to(x.dtype) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around lines 308 - 316, LayerNorm uses lower-precision intermediates for variance/sqrt which can underflow for fp16/bf16 inputs; in the LayerNorm block (mean, var, tl.sqrt, x_norm) force fp32 intermediates: cast x (or x_centered) to float32 before summing and variance computation, perform tl.sqrt on the fp32 var+eps, then cast the normalized result back to the input dtype before applying weight/bias; ensure tl.load for weight_ptr/bias_ptr with mask remains correct and only the arithmetic around mean, var and tl.sqrt uses fp32 temporaries. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-tileir-optimization/SKILL.md-81-85 (1)</summary><blockquote> `81-85`: _⚠️ Potential issue_ | _🟡 Minor_ **Inconsistent `verify_kernel.py` invocation syntax.** The command shown here uses `--kernel`, `--reference`, `--shapes`, and `--dtypes` flags, but the actual `verify_kernel.py` script in this PR uses a positional `kernel_path` argument and `--rtol`/`--atol` flags. The script expects the kernel module to export `reference_fn` and `get_inputs()` internally, not via CLI flags. Update to match the actual script interface: <details> <summary>📝 Suggested fix</summary> ```diff -python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}' +python scripts/verify_kernel.py path/to/kernel.py --rtol 1e-3 --atol 1e-3 ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.claude/skills/kernel-tileir-optimization/SKILL.md around lines 81 - 85, The README example uses flags that don't match the current verify_kernel.py interface; update the doc to call verify_kernel.py with the positional kernel_path argument and optional numeric tolerance flags (--rtol/--atol) instead of --reference/--shapes/--dtypes, and note that the kernel module must export reference_fn and get_inputs() to supply shapes/dtypes and reference data; also keep the instruction to run with ENABLE_TILE=0 when verifying the kernel. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-190-197 (1)</summary><blockquote> `190-197`: _⚠️ Potential issue_ | _🟡 Minor_ **Missing fp32 cast for RMS computation with fp16/bf16 inputs.** Per the SKILL.md precision rules, `tl.sqrt` requires fp32 input for correct results. The computation here operates on the input dtype directly, which could cause precision issues or incorrect results with fp16/bf16. <details> <summary>♻️ Proposed fix to add fp32 casts</summary> ```diff # Compute RMS - x_sq = x * x - rms = tl.sqrt(tl.sum(x_sq, axis=0) / n_cols + eps) + x_fp32 = x.to(tl.float32) + x_sq = x_fp32 * x_fp32 + rms = tl.sqrt(tl.sum(x_sq, axis=0) / n_cols + eps) # Normalize and scale - x_norm = x / rms + x_norm = x_fp32 / rms weight = tl.load(weight_ptr + col_offsets, mask=mask, other=1.0) - out = x_norm * weight + out = (x_norm * weight).to(x.dtype) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around lines 190 - 197, The RMS computation uses tl.sqrt(tl.sum(...)) directly on x which can be fp16/bf16; cast the squared terms and the eps divisor to tl.float32 before sum/sqrt and perform sqrt in fp32, then convert the result back as needed so subsequent normalization uses correct precision. 
Concretely, in the RMS block (symbols: x, x_sq, rms, tl.sum, tl.sqrt, eps) compute x_sq in fp32 (e.g., cast x to tl.float32 before squaring or cast the product), ensure eps and the division by n_cols are fp32, call tl.sqrt on that fp32 value to get rms_fp32, then use rms_fp32 appropriately when computing x_norm (cast rms_fp32 to the input dtype or cast x to fp32 for the division and cast the final out back to the original dtype); keep weight loading (weight, weight_ptr, col_offsets, mask) unchanged but ensure dtype alignment when multiplying out. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🧹 Nitpick comments (19)</summary><blockquote> <details> <summary>.claude/skills/perf-host-analysis/references/metrics.md (3)</summary><blockquote> `26-28`: **Add language specifier to formula block.** The fenced code block is missing a language identifier. Consider adding a specifier like `text` or `python` after the opening backticks for better rendering. <details> <summary>📝 Proposed fix</summary> ```diff -``` +```text gpu_idle_ratio = gpu_idle_time_us / total_time_us ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/references/metrics.md around lines 26 -
28, The fenced code block containing the formula "gpu_idle_ratio =
gpu_idle_time_us / total_time_us" lacks a language specifier; update the opening
backticks for that block to include a language tag (for example `text` or
`python`) so the renderer applies proper formatting; locate the block with the
symbol gpu_idle_ratio and add the chosen specifier immediately after the opening backticks.
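The check behind this nitpick (markdownlint MD040, an opening fence with no language tag) can be approximated with a short script. The sketch below is a hypothetical helper, not part of this PR or of markdownlint: the name `unlabeled_fences` is an assumption, and it only handles simple top-level backtick or tilde fences.

```python
import re
from typing import List, Optional, Tuple

# Up to three leading spaces, a run of 3+ backticks or tildes, then the info string.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def unlabeled_fences(markdown: str) -> List[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    hits: List[int] = []
    open_fence: Optional[Tuple[str, int]] = None  # (fence char, fence length)
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE_RE.match(line)
        if match is None:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            # Opening fence: record it and flag it when the info string is empty.
            open_fence = (marker[0], len(marker))
            if not info:
                hits.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            # Closing fence: same character, at least as long, no info string.
            open_fence = None
    return hits


fence = "`" * 3  # built programmatically so the sample holds no literal fence line
sample = "\n".join([
    "Formula:",
    fence,
    "gpu_idle_ratio = gpu_idle_time_us / total_time_us",
    fence,
    "",
    fence + "text",
    "already labeled",
    fence,
])
print(unlabeled_fences(sample))  # prints [2]
```

Running a helper like this over the new `.md` files would list every fence the remaining nitpicks in this review call out, which may be quicker than fixing them one comment at a time.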
`99-101`: **Add language specifier to formula block.** The fenced code block is missing a language identifier. Consider adding a specifier like
`text` or `python` after the opening backticks for better rendering. 📝 Proposed fix
-``` +```text host_prep_perf_impact = host_prep_exposed_us / total_time_us</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/references/metrics.md around lines 99 -
101, The fenced code block containing the formula "host_prep_perf_impact =
host_prep_exposed_us / total_time_us" lacks a language specifier; update the
opening backticks to include a language (e.g., add `text` or `python`) so the
block renders correctly — locate the block with the variables
host_prep_perf_impact, host_prep_exposed_us, and total_time_us and add the
language identifier after the opening ``` only.</details> --- `85-87`: **Add language specifier to formula block.** The fenced code block is missing a language identifier. Consider adding a specifier like `text` or `python` after the opening backticks for better rendering. <details> <summary>📝 Proposed fix</summary> ```diff -``` +```text host_prep_exposed_ratio = host_prep_exposed_us / host_prep_total_us ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/references/metrics.md around lines 85 -
87, The fenced code block containing the formula "host_prep_exposed_ratio =
host_prep_exposed_us / host_prep_total_us" lacks a language specifier; update
the opening backticks to include a language like "text" or "python" (e.g.,
`text` or `python`) so the block renders properly, leaving the formula and
variable names host_prep_exposed_ratio, host_prep_exposed_us, and
host_prep_total_us unchanged.</details> </blockquote></details> <details> <summary>.claude/skills/kernel-cute-writing/references/api-core.md (1)</summary><blockquote> `232-239`: **Add language specifier to code block.** The fenced code block is missing a language identifier. Adding `python` after the opening backticks will enable syntax highlighting. <details> <summary>📝 Proposed fix</summary> ```diff -``` +```python cutlass.dynamic_expr(cond) # Runtime conditional guard cutlass.const_expr(cond) # Compile-time conditional cutlass.Constexpr # Type annotation for compile-time args cutlass.range(n) # IR loop with optional attributes cutlass.range_constexpr(n) # Compile-time unrolled loop ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/kernel-cute-writing/references/api-core.md around lines 232 -
239, The fenced code block containing cutlass.dynamic_expr, cutlass.const_expr,
cutlass.Constexpr, cutlass.range, and cutlass.range_constexpr is missing a
language identifier; update the opening fence to include the language specifier
(python) so the block becomes ```python, preserving the existing contents and
closing fence to enable syntax highlighting for those symbols.</details> </blockquote></details> <details> <summary>.claude/skills/exec-slurm-compile/scripts/compile.slurm (1)</summary><blockquote> `27-31`: **Consider validating required positional arguments.** The script accepts 4 required positional arguments without validation. If fewer arguments are provided, the script will fail with unclear errors or use empty values. <details> <summary>Proposed fix</summary> ```diff +if [[ $# -lt 4 ]]; then + echo "Usage: compile.slurm <container_image> <mount_dir> <scripts_dir> <repo_dir> [build_wheel_args...]" >&2 + exit 1 +fi + container_image=$1 mount_dir=$2 scripts_dir=$3 repo_dir=$4 shift 4 # remaining args forwarded to compile.sh ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/exec-slurm-compile/scripts/compile.slurm around lines 27 - 31, The script uses positional variables container_image, mount_dir, scripts_dir, repo_dir (and then shift 4) but does not validate that four arguments were provided; add a guard at the top that checks $# >= 4 and, if not, prints a concise usage/help message showing required args (container_image mount_dir scripts_dir repo_dir) and exits non‑zero. Reference the variables container_image, mount_dir, scripts_dir, repo_dir and the existing shift 4 so the check runs before shift and before any use of those variables. ``` </details> </blockquote></details> <details> <summary>.claude/skills/exec-slurm-compile/scripts/submit_compile.sh (1)</summary><blockquote> `46-46`: **Potential Bash compatibility issue: empty array expansion with `set -u`.** In Bash versions prior to 4.4, expanding an empty array (`"${extra_build_args[@]}"`) with `set -u` enabled causes an "unbound variable" error. If older Bash versions are a concern, consider using `${extra_build_args[@]+"${extra_build_args[@]}"}` for safe expansion. 
<details> <summary>Portable fix for older Bash</summary> ```diff - "${extra_build_args[@]}" + ${extra_build_args[@]+"${extra_build_args[@]}"} ``` </details> Also applies to: 61-61 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/exec-slurm-compile/scripts/submit_compile.sh at line 46, The script uses the array variable extra_build_args and later expands it as "${extra_build_args[@]}", which can trigger "unbound variable" under set -u on older Bash; update all expansions of the array (references to "${extra_build_args[@]}" in the script) to the portable form ${extra_build_args[@]+"${extra_build_args[@]}"} so empty-array expansion is safe with set -u while keeping the existing initialization extra_build_args=() unchanged. ``` </details> </blockquote></details> <details> <summary>.claude/skills/kernel-triton-writing/references/troubleshooting.md (1)</summary><blockquote> `36-41`: **Add language specifier to fenced code block.** The output example code block should have a language specifier for proper rendering and syntax highlighting. <details> <summary>📝 Proposed fix</summary> ```diff -Output appears in stderr during compilation (not on device): -``` +Output appears in stderr during compilation (not on device): +```text BLOCK_SIZE 1024 x dtype float32 x shape (1024,) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/kernel-triton-writing/references/troubleshooting.md around
lines 36 - 41, Update the fenced code block showing the stderr output (the block
containing "BLOCK_SIZE 1024", "x dtype float32", "x shape (1024,)") to include a
language specifier (e.g., label the opening fence `text`) so it renders with correct
syntax highlighting; edit the code block in
.claude/skills/kernel-triton-writing/references/troubleshooting.md accordingly.</details> </blockquote></details> <details> <summary>.claude/skills/perf-analysis/SKILL.md (1)</summary><blockquote> `86-95`: **Add language specifiers to fenced code blocks.** The example blocks should have language specifiers for proper rendering. Consider using `text` or `markdown` for the delegation examples and report format. <details> <summary>📝 Proposed fix</summary> ```diff ### Good Example -``` +```text Profile the batched GEMM kernel in bmm_workload.py with NCU. ... ``` ### Bad Example -``` +```text Run NCU with --set=full --profile-from-start off --target-processes all. ... ``` ### Example Report -``` +```markdown ## Summary Training at 42% MFU, memory-bound due to large attention tensors. ... ``` ``` </details> Also applies to: 97-104, 132-154 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-analysis/SKILL.md around lines 86 - 95, Update the
fenced code blocks in .claude/skills/perf-analysis/SKILL.md so they include
language specifiers: add `text` for the general/descriptive blocks (e.g. the block starting "Profile the batched GEMM kernel in bmm_workload.py..." and the "Bad Example" blocks) and `markdown` for the formatted report/example report
blocks (e.g. the "Example Report" block that begins "## Summary"). Locate the
unlabeled triple-backtick blocks by their surrounding text (the "Profile the
batched GEMM kernel..." paragraph, the "Bad Example" section, and the "Example
Report" section) and change the opening fence to include the appropriate
language tag (text or markdown) so the examples render correctly.</details> </blockquote></details> <details> <summary>.claude/skills/perf-nsight-systems/references/app-preparation.md (1)</summary><blockquote> `53-53`: **Consider wrapping CUDA profiler calls with `check_error()` for proper error handling.** The `torch.cuda.cudart().cudaProfilerStart()` syntax is correct. However, according to PyTorch documentation, these calls return `cudaError_t` status codes that should be checked. The recommended pattern is: ```python from torch.cuda import cudart, check_error check_error(cudart().cudaProfilerStart()) ``` This ensures error conditions (e.g., profiler already started) are properly handled. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-nsight-systems/references/app-preparation.md at line 53, Wrap the raw CUDA profiler calls so their return status is checked: replace direct uses of torch.cuda.cudart().cudaProfilerStart() (and cudaProfilerStop() if present) with the check_error(...) pattern from torch.cuda (i.e., call check_error(cudart().cudaProfilerStart()) and check_error(cudart().cudaProfilerStop())) to surface CUDA errors; reference the torch.cuda.cudart().cudaProfilerStart, torch.cuda.cudart().cudaProfilerStop, and torch.cuda.check_error symbols when making the change. ``` </details> </blockquote></details> <details> <summary>.claude/agents/perf-profiling-specialist.md (1)</summary><blockquote> `323-361`: **Specify a language on the fenced template block.** Use `text` (or `markdown`) on the opening fence to satisfy markdown lint and improve rendering consistency. <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/agents/perf-profiling-specialist.md around lines 323 - 361, The
fenced code block that contains the "## nsys Profiling Summary" template should declare a language (e.g., use `text` or `markdown`)
to satisfy markdown linting and improve rendering; update the opening fence in
.claude/agents/perf-profiling-specialist.md (the template block that starts with
"## nsys Profiling Summary") to include the language token while leaving the
content inside unchanged.</details> </blockquote></details> <details> <summary>.claude/skills/perf-nsight-compute-analysis/SKILL.md (1)</summary><blockquote> `326-329`: **Add a language identifier to the fenced output block.** The example output fence is untyped; use `text` to satisfy markdown lint and keep formatting consistent. <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-nsight-compute-analysis/SKILL.md around lines 326 - 329,
The fenced code block showing the CSV sample (the triple-backtick block
containing "Kernel Name","Duration","Compute (SM) Throughput","Memory
Throughput" and the following CSV line) is missing a language identifier; update
the opening fence to carry the `text` label so the block is typed as text to satisfy
markdown linting and keep formatting consistent.</details> </blockquote></details> <details> <summary>.claude/skills/perf-host-analysis/SKILL.md (1)</summary><blockquote> `65-67`: **Add language identifiers to untyped fenced code blocks.** At Line 65, Line 85, Line 94, Line 103, Line 227, Line 252, Line 335, Line 341, and Line 352, use `text` for plain output/pseudocode blocks to clear markdown lint warnings. Also applies to: 85-91, 94-100, 103-109, 227-247, 252-257, 335-337, 341-348, 352-354 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/skills/perf-host-analysis/SKILL.md around lines 65 - 67, Several
fenced code blocks in SKILL.md (including the inline diagram showing
"[inter-step gap] -> [_forward_step] -> ..." and blocks around the ranges 65,
85–91, 94–100, 103–109, 227–247, 252–257, 335–337, 341–348, 352–354) are untyped
and trigger markdown-lint warnings; update each triple-backtick fence to specify
the language identifier "text" (e.g., ```text) so plain output/pseudocode blocks
are explicitly marked, leaving the block contents unchanged and only adding the
language token to the opening fence.</details> </blockquote></details> <details> <summary>.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py (3)</summary><blockquote> `486-492`: **Unused loop variable `avg` (second occurrence).** Same issue as above - rename to `_avg`. <details> <summary>♻️ Proposed fix</summary> ```diff - for name, cnt, total, avg in kernels: + for name, cnt, total, _avg in kernels: ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around lines 486 - 492, In the loop "for name, cnt, total, avg in kernels:" inside analyze_host_overhead.py rename the unused fourth tuple element from "avg" to "_avg" (or similar underscore-prefixed name) to signal it's intentionally unused; update that loop header (the second occurrence) where "name, cnt, total, avg" is unpacked and leave all other logic unchanged so only the variable name is changed. ``` </details> --- `468-476`: **Unused loop variable `avg`.** The variable `avg` is unpacked but never used. Rename to `_avg` to indicate it's intentionally unused. <details> <summary>♻️ Proposed fix</summary> ```diff - for text, cnt, total, avg in nvtx: + for text, cnt, total, _avg in nvtx: ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around lines 468 - 476, The loop over nvtx unpacks (text, cnt, total, avg) but never uses avg; change the unpacking to (text, cnt, total, _avg) to mark it intentionally unused (update the for-loop header where nvtx is iterated) so linters stop flagging an unused variable; no other logic changes needed — keep references to nvtx_per_step, tp_size, unique_steady, text, cnt, and total as-is. 
``` </details> --- `742-760`: **SQLite connection may leak on exception between baseline and target analysis.** If `analyze_single_trace` raises for either trace, the connection opened just before it (`conn_b` or `conn_t`) is never closed, because `close()` is only reached on the success path. Consider using context managers for robustness. Note that a `sqlite3.Connection` used directly in a `with` statement only commits or rolls back the transaction and does not close the connection, so wrap it in `contextlib.closing` (this requires `import contextlib`). <details> <summary>♻️ Suggested improvement using context managers</summary>

```diff
 try:
     # Analyze baseline
-    conn_b = sqlite3.connect(args.baseline)
-    baseline_results = analyze_single_trace(
-        conn_b, args.baseline_label, out, tp_size=args.tp_size
-    )
-    conn_b.close()
+    with contextlib.closing(sqlite3.connect(args.baseline)) as conn_b:
+        baseline_results = analyze_single_trace(
+            conn_b, args.baseline_label, out, tp_size=args.tp_size
+        )

     # Analyze target (if provided)
     target_results = None
     if args.target:
-        conn_t = sqlite3.connect(args.target)
-        target_results = analyze_single_trace(
-            conn_t, args.target_label, out, tp_size=args.tp_size
-        )
-        conn_t.close()
+        with contextlib.closing(sqlite3.connect(args.target)) as conn_t:
+            target_results = analyze_single_trace(
+                conn_t, args.target_label, out, tp_size=args.tp_size
+            )

     # Compare
     compare_results(baseline_results, target_results, out)
```

</details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around lines 742 - 760, The current sequence opens sqlite3 connections into conn_b and conn_t and calls analyze_single_trace(conn_b, ...) and analyze_single_trace(conn_t, ...), but if analyze_single_trace raises an exception the connection it was given is not closed; fix by replacing manual connect/close with context managers that close the connection: use "with contextlib.closing(sqlite3.connect(args.baseline)) as conn_b:" and "with contextlib.closing(sqlite3.connect(args.target)) as conn_t:" around the respective analyze_single_trace calls (a bare "with sqlite3.connect(...)" only manages the transaction and does not close the connection), keep the compare_results(baseline_results, target_results, out) after both context blocks so connections are always closed even on exceptions; update references to conn_b/conn_t and ensure baseline_results and target_results are set in the correct scopes. ``` </details> </blockquote></details> <details> <summary>.claude/agents/exec-compile-specialist.md (1)</summary><blockquote> `116-132`: **Add language specifier to fenced code block.** The code block starting at line 116 lacks a language specifier. Since it's a template output format, use `text` or `markdown`. <details> <summary>📝 Proposed fix</summary> ````diff -``` +```text ## Compile Result🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/exec-compile-specialist.md around lines 116 - 132, The fenced code block that renders the compile-result template (the block that begins with the three backticks immediately above the "## Compile Result" heading) lacks a language specifier; update the opening fence from "```" to "```text" (or "```markdown") so the template is explicitly marked as plain text/markdown and renders correctly..claude/skills/kernel-triton-writing/scripts/verify_kernel.py (3)
`223-236`: **Temporary file not cleaned up on subprocess failure before `try` block.** If
`subprocess.run` raises an exception other than `TimeoutExpired` (e.g., `OSError`), the `finally` block handles cleanup. However, if an exception occurs between writing the file (line 223-225) and entering the `try` block (line 229), the temp file would leak. Consider using a context manager pattern or moving file creation inside the `try` block. ♻️ Suggested improvement using context manager approach
- with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as script_file: - script_file.write(script) - script_path = script_file.name - - working_dir = os.path.dirname(os.path.abspath(kernel_path)) - try: + with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as script_file: + script_file.write(script) + script_path = script_file.name + + working_dir = os.path.dirname(os.path.abspath(kernel_path)) + result = subprocess.run(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines 223 - 236, The temp file created via NamedTemporaryFile (script_file / script_path) can leak if an exception occurs after writing but before entering the try block; move the file creation inside the same try/finally that runs subprocess.run (or wrap both creation and subprocess.run in a single try/except/finally) so cleanup of script_path always happens in the finally, or alternatively use a context manager that keeps the temp file open for the duration of subprocess.run; adjust references to script_file/script_path, subprocess.run, working_dir, timeout, and kernel_path accordingly to ensure the finally block always removes the temp file even if errors occur before the subprocess is invoked.
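One way to read this suggestion is a single `try`/`finally` that covers both file creation and the subprocess call. The following is a minimal sketch under that assumption; `run_generated_script` is a hypothetical name and the argument list is simplified compared with what `verify_kernel.py` actually passes.

```python
import os
import subprocess
import sys
import tempfile


def run_generated_script(script: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Write `script` to a temp file, run it, and always delete the file."""
    script_path = None
    try:
        # File creation sits inside the try so the finally below covers it too.
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as script_file:
            # Record the path before writing, so a failed write is still cleaned up.
            script_path = script_file.name
            script_file.write(script)
        return subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    finally:
        # Runs on success, on TimeoutExpired, and on any other exception.
        if script_path is not None and os.path.exists(script_path):
            os.remove(script_path)


result = run_generated_script("print(__file__)")
```

Here the child script prints its own path, so the caller can confirm afterwards that the file no longer exists.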
`121-127`: **Relative difference calculation may be misleading for very small reference values.** The
`safe_ref` guard replaces zeros with ones, but for very small non-zero values (e.g., 1e-10), the relative difference could still be misleadingly large. Consider using a more robust relative error formula like `abs_diff / max(abs(ref), atol)`. 🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines 121 - 127, The relative-difference logic using safe_ref is fragile for tiny non-zero refs; change the calculation to use a robust denominator like max(ref_abs, atol) instead of replacing zeros with ones: compute ref_abs = ref.float().abs(), pick a small absolute tolerance (e.g., atol = 1e-8 or an existing tolerance variable), set denom = torch.maximum(ref_abs, torch.full_like(ref_abs, atol)), then compute max_rel = (abs_diff / denom).max().item() and update _global_max_rel accordingly (leave abs_diff, max_abs, and the _global updates unchanged).
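The effect of the floored denominator is easiest to see on plain numbers. The sketch below is an illustration only: it uses Python floats rather than the torch tensors in `verify_kernel.py`, and `max_abs_and_rel_error` is a hypothetical name; the `atol` default of 1e-8 follows the example value in the prompt above.

```python
from typing import Sequence, Tuple


def max_abs_and_rel_error(
    out: Sequence[float], ref: Sequence[float], atol: float = 1e-8
) -> Tuple[float, float]:
    """Return (max absolute error, max relative error) over paired values."""
    max_abs = 0.0
    max_rel = 0.0
    for o, r in zip(out, ref):
        abs_diff = abs(o - r)
        # Floor the denominator at atol so tiny references cannot inflate the ratio.
        denom = max(abs(r), atol)
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, abs_diff / denom)
    return max_abs, max_rel


# With a plain |ref| denominator, the first pair would report a relative error
# near 1.0; with the atol floor it is measured against 1e-8 instead.
max_abs, max_rel = max_abs_and_rel_error([2e-10, 1.0], [1e-10, 1.0])
```

The trade-off is that differences far below `atol` are effectively judged on an absolute scale, which is usually the intent when comparing kernel outputs against a reference.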
`241-247`: **Use `next()` instead of list indexing for single element extraction.** Creating a full list just to take the first element is inefficient. Use
`next()` with a generator expression. ♻️ Proposed fix
if "RESULT:" in output: try: - result_line = [line for line in output.split("\n") if "RESULT:" in line][0] + result_line = next(line for line in output.split("\n") if "RESULT:" in line) parts_str = result_line.split("RESULT:")[1] parts = parts_str.split(",") passed = "True" in parts[0] max_abs = float(parts[1].split("=")[1]) max_rel = float(parts[2].split("=")[1]) - except (IndexError, ValueError) as exc: + except (StopIteration, IndexError, ValueError) as exc:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines 241 - 247, Replace the list-and-slice used to find the RESULT line with next() over a generator to avoid building an intermediate list: use next((line for line in output.split("\n") if "RESULT:" in line), None) to assign result_line, then check for None and raise or handle an error before continuing to parse parts_str, parts, passed, max_abs, and max_rel; keep the same parsing logic but ensure the new next() usage and the None check are used in the try block (symbols: output, result_line, parts_str, parts, passed, max_abs, max_rel).
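The `next(..., None)` variant described in the prompt can be sketched as a small standalone parser. This is a hedged illustration, not the script's actual code: `parse_result` is a hypothetical name, and the `RESULT:` line format used in the usage example is inferred from the parsing logic in the diff rather than confirmed by the source.

```python
from typing import Optional, Tuple


def parse_result(output: str) -> Optional[Tuple[bool, float, float]]:
    """Extract (passed, max_abs, max_rel) from the first RESULT line, if any."""
    # next() stops at the first match; the None default avoids StopIteration.
    result_line = next(
        (line for line in output.split("\n") if "RESULT:" in line), None
    )
    if result_line is None:
        return None
    try:
        parts = result_line.split("RESULT:")[1].split(",")
        passed = "True" in parts[0]
        max_abs = float(parts[1].split("=")[1])
        max_rel = float(parts[2].split("=")[1])
    except (IndexError, ValueError):
        # Malformed RESULT line: treat it the same as a missing one.
        return None
    return passed, max_abs, max_rel


# Assumed line format, for illustration only.
parsed = parse_result("compiling...\nRESULT:True,max_abs=0.001,max_rel=0.01\n")
```

Passing a default to `next()` keeps the "no RESULT line" case out of the exception path, so the `except` clause only has to cover malformed lines.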
|
PR_Github #42301 [ skip ] completed with state |
Summary
Test plan
🤖 Generated with Claude Code