[None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by kaiyux · Pull Request #12831 · NVIDIA/TensorRT-LLM

kaiyux · 2026-04-08T08:04:07Z

Summary

- Add specialized Claude Code agents for kernel writing (CuTe DSL, Triton, TileIR, CUDA C++), performance profiling, and CUDA graph optimization
- Add skills covering Nsight Systems/Compute analysis, host overhead detection, sync-free patterns, workload profiling, and TRT-LLM compilation (local + SLURM)
- Add code contribution and codebase exploration guide skills
- Update existing AutoDeploy and serve-config skills

Test plan

- Verify agents load correctly in Claude Code CLI
- Verify skills are listed and invocable via slash commands
- Spot-check reference docs for accuracy

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added AI agents for automated GPU kernel compilation (local and SLURM-based).
- Added AI agents for CUDA and Triton kernel development and optimization workflows.
- Added AI agents for GPU performance profiling and analysis (Nsight Systems/Compute).
- Added comprehensive reference documentation for CuTe DSL kernel development.
- Added helper scripts for kernel verification, benchmarking, and host-overhead analysis.
Documentation
- Extensive guides covering kernel optimization patterns, memory analysis, and bottleneck classification.

…, performance analysis, and compilation

Add specialized agents and skills covering:
- Kernel writing: CuTe DSL, Triton, TileIR optimization, CUDA C++
- Performance: Nsight Systems/Compute analysis, host overhead, CUDA graphs, sync-free, workload profiling
- Compilation: local and SLURM-based TRT-LLM builds
- Code contribution and codebase exploration guides
- Updates to existing AD and serve-config skills

Signed-off-by: Kaiyu Xie <[email protected]>

kaiyux · 2026-04-08T08:13:34Z

/bot skip --comment "currently no CI/CD coverage for skills and agents"

QiJune

LGTM

tensorrt-cicd · 2026-04-08T08:20:13Z

PR_Github #42301 [ skip ] triggered by Bot. Commit: 37df0c0 Link to invocation

coderabbitai · 2026-04-08T08:22:47Z

📝 Walkthrough

This pull request introduces a comprehensive agent and skill framework for TensorRT-LLM development and optimization. It adds new agent definitions for model compilation, kernel development across multiple frameworks (CUDA, Triton, CuTe, TileIR), and performance profiling/optimization. Accompanying skill files define detailed workflows for each agent, supported by Python utility scripts for kernel verification and benchmarking, plus extensive reference documentation covering kernel APIs, patterns, and performance analysis tooling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Agent Definition Updates<br>`.claude/agents/ad-debug-agent.md`, `ad-onboard-reviewer.md`, `ad-run-agent.md` | Added Apache-2.0 license and NVIDIA Corporation author metadata to existing agent front-matter. Renamed onboard-reviewer to ad-onboard-reviewer. |
| New Compilation & Execution Agents<br>`.claude/agents/exec-compile-specialist.md`, `exec-slurm-compile` (referenced) | Added exec-compile-specialist agent defining automated environment detection for local Docker container vs SLURM-based compilation with monitoring, build-mode rules, and result reporting. |
| New Kernel Development Agents<br>`.claude/agents/kernel-cuda-specialist.md`, `kernel-cute-specialist.md`, `kernel-tileir-specialist.md`, `kernel-triton-specialist.md` | Added four specialized kernel development agents with scope, constraints, and workflows for CUDA/pybind11, CUTLASS CuTe DSL, Triton, and TileIR backend optimization respectively. |
| New Performance Analysis Agents<br>`.claude/agents/perf-profiling-specialist.md`, `perf-torch-cuda-graph-specialist.md` | Added agents for system-level profiling (nsys/ncu) and CUDA Graph optimization with routing logic and verification workflows. |
| Skill Metadata Updates<br>`.claude/skills/ad-model-onboard/SKILL.md`, `ad-pipeline-failure-pr/SKILL.md`, `ci-failure-retrieval/SKILL.md` | Added Apache-2.0 license and NVIDIA Corporation author metadata to skill front-matter. |
| Compilation Skills<br>`.claude/skills/exec-local-compile/SKILL.md`, `exec-slurm-compile/SKILL.md` + scripts | Added skills and support scripts for local Docker compilation and SLURM/enroot-based cluster compilation with environment setup, container image handling, and job submission logic. |
| Triton Kernel Development<br>`.claude/skills/kernel-triton-writing/SKILL.md` + scripts + references | Added comprehensive Triton kernel writing workflow with 342 lines of core skill, ~1900 lines of API/pattern/troubleshooting references, and verify/benchmark Python scripts. |
| CuTe DSL Kernel Development<br>`.claude/skills/kernel-cute-writing/SKILL.md` + scripts + references | Added complete CuTe DSL skill with 368 lines of core workflow, ~2000 lines of architecture/API/pattern documentation, and verify/benchmark utilities. |
| TileIR Optimization Skill<br>`.claude/skills/kernel-tileir-optimization/SKILL.md` + scripts + references | Added TileIR-focused optimization skill for existing Triton kernels with classification, autotuning, and validation workflows; includes Python scripts for compatibility checking and kernel classification. |
| Performance Profiling Skills<br>`.claude/skills/perf-profiling-specialist.md`, `perf-nsight-systems/SKILL.md`, `perf-nsight-compute-analysis/SKILL.md` | Added specialized skills for system-level profiling (nsys), kernel-level analysis (ncu), and host-overhead detection/optimization with extensive CLI/report/analysis references. |
| Performance Analysis & Optimization<br>`.claude/skills/perf-analysis/SKILL.md`, `perf-host-analysis/SKILL.md`, `perf-host-optimization/SKILL.md`, `perf-optimization/SKILL.md` | Added higher-level coordination skills for bottleneck analysis, host-overhead detection, iterative optimization workflows; includes Python analysis scripts (iterate detection, NVTX breakdown, diff reporting). |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

coderabbitai

Actionable comments posted: 20

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (24)

.claude/skills/perf-host-analysis/references/iteration-isolation-techniques.md-85-87 (1)

85-87: ⚠️ Potential issue | 🟡 Minor

Label this fenced block to keep markdownlint happy.

The new fence is unlabeled and trips MD040. A `text` label is enough here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
@.claude/skills/perf-host-analysis/references/iteration-isolation-techniques.md
around lines 85 - 87, The fenced code block containing "[Executor] _forward_step
N: X ctx reqs, Y gen reqs" is unlabeled and triggers markdownlint MD040; update
the fence delimiter from ``` to ```text so the block is labeled (e.g., replace
the opening and closing backticks around that line in
iteration-isolation-techniques.md with ```text and ``` respectively) to satisfy
the linter while preserving the content.
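
The MD040 finding recurs throughout this review, so a quick local check may help. The following is a minimal sketch, not markdownlint's actual implementation: `unlabeled_fences` is a hypothetical helper, and it only understands three-backtick fences.

```python
FENCE = "`" * 3  # built at runtime so the example can live inside a fenced block


def unlabeled_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    hits = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if in_fence:
            # Only a bare fence closes a block; anything else is block content.
            if stripped == FENCE:
                in_fence = False
            continue
        in_fence = True
        if stripped == FENCE:  # opening fence with an empty info string
            hits.append(lineno)
    return hits


sample = "\n".join([
    "Intro",
    FENCE,
    "[Executor] _forward_step N: X ctx reqs, Y gen reqs",
    FENCE,
    FENCE + "text",
    "labeled block",
    FENCE,
])
print(unlabeled_fences(sample))
```

Running the real linter remains the authoritative check; this only shows which opening fences would need a `text` tag.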

.claude/skills/exec-slurm-compile/scripts/enroot-import-27-35 (1)

27-35: ⚠️ Potential issue | 🟡 Minor

Fix the stale help text.

The usage block says the default time is 15 minutes, but the code defaults to 1 hour, and the dependency example is missing the required afterok: separator. Both are copy-paste footguns for a brand-new CLI.

Also applies to: 72-72

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/exec-slurm-compile/scripts/enroot-import around lines 27 -
35, Update the help/usage text in the enroot-import script to reflect the actual
defaults and correct dependency syntax: change the default time note from "15
minutes" to "1 hour" (or "60 minutes") wherever the usage block or the
"[--time=...|-t]" description appears, and fix the dependency example to include
the required "afterok:" separator (e.g., "--dependency=afterok:<jobid>") in the
example sentence that currently shows "'--dependency=afterok<jobid>'"; ensure
these edits are made in the script's usage/help string (the top-of-script usage
block and the example line) so they match the actual sbatch behavior.

.claude/skills/perf-host-analysis/references/trtllm-nvtx-ranges.md-53-66 (1)

53-66: ⚠️ Potential issue | 🟡 Minor

Add language tags to the new fenced blocks.

Both unlabeled fences trip MD040. Mark them as `text` (or `python` for the gap formula) so this reference stays lint-clean.

Also applies to: 176-178

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/trtllm-nvtx-ranges.md around
lines 53 - 66, The unlabeled fenced code blocks showing the Executor
_forward_step list and the separate gap formula block should be given language
tags to satisfy MD040: add ```text before the executor step list block and
```python before the gap-formula block (or use ```text if the formula is
non-executable), and update the other similar unlabeled fenced block later in
the file the reviewer called out the same way so all unlabeled blocks are
labeled.

.claude/skills/kernel-cute-writing/references/concepts-layouts.md-117-123 (1)

117-123: ⚠️ Potential issue | 🟡 Minor

Correct the complement example.

6:4 denotes six positions at stride 4, but the listed complement set has only five elements and does not match that layout. Please fix either the layout expression or the enumerated indices.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-cute-writing/references/concepts-layouts.md around
lines 117 - 123, The complement example for complement(4:1, 24) is inconsistent:
the layout expression "6:4" implies six positions at stride 4 but the enumerated
set shows only five elements; to fix, make the two representations match by
either changing "6:4" to "5:4" to match the listed indices {4,8,12,16,20}, or
else change the enumerated indices and total size so "6:4" is valid (e.g.,
include {4,8,12,16,20,24} and adjust the second argument from 24 to 25); update
the example so complement(4:1, 24) and the layout token ("5:4" or "6:4") and the
explicit index list are all consistent.
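
The mismatch can be checked by hand. The sketch below is plain Python rather than the CuTe API: `layout_indices` is a hypothetical helper that lists the offsets touched by a 1-D `shape:stride` layout, assuming offsets start at 0.

```python
def layout_indices(shape: int, stride: int) -> list[int]:
    """Offsets produced by a 1-D layout written as shape:stride."""
    return [i * stride for i in range(shape)]


six_by_four = layout_indices(6, 4)
five_by_four = layout_indices(5, 4)
print(six_by_four, len(six_by_four))
print(five_by_four, len(five_by_four))

# Combining 4:1 with 6:4 should touch every offset below 24 exactly once.
covered = sorted(a + b for a in layout_indices(4, 1) for b in six_by_four)
print(covered == list(range(24)))
```

Under that assumption `6:4` begins at offset 0, so an enumerated list that starts at 4 would need a leading 0 to agree with it.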

.claude/skills/perf-host-analysis/references/examples.md-22-32 (1)

22-32: ⚠️ Potential issue | 🟡 Minor

Add a language tag to these fenced blocks.

markdownlint is already flagging every example fence here. Using `text` for these threshold tables will clear the warnings and improve editor rendering.

<details>
<summary>🧹 Proposed fix</summary>

```diff
-```
+```text
 GPU idle ratio: 42.1%           → >30% threshold  → CROSSED
 Launch overhead: 12.0%          → >10% threshold  → CROSSED
@@
 Crossed: 6/6 → Verdict: YES (host overhead IS the bottleneck)
 Host prep confirmed: YES (3b=15% AND 3c=85% both crossed)
```

Apply the same `text` fence to the other five example blocks in this file.
</details>


Also applies to: 36-45, 51-66, 72-95, 101-116, 122-137

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/examples.md around lines 22 -
32, Change the six fenced code blocks in
.claude/skills/perf-host-analysis/references/examples.md from plain ``` to
language-tagged ```text; specifically, for each example block that begins with
lines like "GPU idle ratio: 42.1% ..." (and the other five similar threshold
tables), replace the opening triple backticks with ```text so markdownlint stops
flagging them and editor rendering improves.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/exec-slurm-compile/scripts/compile.sh-21-22 (1)</summary><blockquote>

`21-22`: _⚠️ Potential issue_ | _🟡 Minor_

**Documentation inconsistency: `--no-venv` mentioned in comment but not in default command.**

The comment on line 22 lists `--no-venv` as a default flag, but the actual default command (lines 38-42) does not include it. Either add `--no-venv` to the default command or remove it from the comment.



<details>
<summary>Proposed fix (if `--no-venv` should be a default)</summary>

```diff
     python3 ./scripts/build_wheel.py \
         --trt_root /usr/local/tensorrt \
         --benchmarks \
         -a "100-real" \
-        --nvtx
+        --nvtx \
+        --no-venv
 fi
```

</details>


Also applies to: 36-42

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/exec-slurm-compile/scripts/compile.sh around lines 21 - 22,
The comment lists --no-venv as a default build_wheel.py flag but the script's
actual default invocation of build_wheel.py (the block that constructs the
default flags/command around the build_wheel.py call) does not include
--no-venv; either add --no-venv to that default flags list in the default
command invocation or remove it from the comment so they match; update the
build_wheel.py invocation (the default FLAGS/args used when calling
build_wheel.py) to include --no-venv if you want it to be a true default, or
edit the comment text to drop --no-venv if it should not be default.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-host-optimization/references/examples.md-84-136 (1)</summary><blockquote>

`84-136`: _⚠️ Potential issue_ | _🟡 Minor_

**Label the optimization transcript code fence.**

Line 84 starts an unlabeled fenced block. Add `text` to satisfy MD040 and improve renderer behavior.

<details>
<summary>Suggested fix</summary>

````diff
-```
+```text
 Round 0 (Baseline):
   -> Profile with default functions
   -> Identify _forward_step spends 98.7% in model_engine.forward()
@@
   _prepare_tp_inputs: 57.7s -> 40.1s (-30.5% cumulative)
   Mean TPOT:          37.4ms -> 28.9ms (-22.6%)
   Output throughput:  ~3200 -> 3799 tok/s (+18.7%)
 ```
````
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-optimization/references/examples.md around lines 84
- 136, The fenced code block that begins with ``` (the unlabeled triple-backtick
block containing the optimization transcript) must be changed to a labeled fence
by replacing the opening ``` with ```text to satisfy MD040 and ensure correct
rendering; locate the block that starts with "Round 0 (Baseline):" and update
only the opening fence to ```text (leave the contents and closing ``` unchanged).


</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/concepts-semantics.md-65-69 (1)</summary><blockquote>

`65-69`: _⚠️ Potential issue_ | _🟡 Minor_

**Add a language tag to the fenced block.**

Line 65 uses an unlabeled fenced code block, which triggers MD040. Use `text` (or `plaintext`) for this hierarchy snippet.

<details>
<summary>Suggested fix</summary>

````diff
-```
+```text
 {bool} < {int8, int16, int32, int64, uint8, uint16, uint32, uint64} < {fp8, fp16, bf16, fp32, fp64}
   ^               ^  (integral types)                                       ^  (floating types)
   kind 0                  kind 1                                           kind 2
 ```
````
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/concepts-semantics.md around
lines 65 - 69, The fenced code block containing the type hierarchy starting with
"{bool} < {int8, int16, int32, int64, uint8, uint16, uint32, uint64} < {fp8,
fp16, bf16, fp32, fp64}" should include a language tag to satisfy MD040; change
the opening triple-backticks to use "text" (or "plaintext") so the block becomes
a labeled plaintext code fence and keep the inner lines unchanged.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-host-analysis/references/output-format.md-82-82 (1)</summary><blockquote>

`82-82`: _⚠️ Potential issue_ | _🟡 Minor_

**Use consistent skill name for handoff command.**

Line 82 says `host-perf-optimization`, but the rest of this file uses `perf-host-optimization`. Please standardize to one canonical skill name.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/output-format.md at line 82,
The handoff skill name is inconsistent: replace the string
"host-perf-optimization" with the canonical "perf-host-optimization" so the
command matches the rest of the document; search for and update any occurrences
of "host-perf-optimization" in this file to "perf-host-optimization" to ensure
uniformity.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-host-analysis/references/output-format.md-22-22 (1)</summary><blockquote>

`22-22`: _⚠️ Potential issue_ | _🟡 Minor_

**Add a language identifier to the fenced block.**

The fenced code block at Line 22 should specify a language (e.g., `text`) to satisfy markdown lint rules.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/output-format.md at line 22,
The fenced code block currently uses plain ```; update that fence to include a
language identifier (for example change ``` to ```text) so the block declares
its language and satisfies the markdown lint rule for fenced code blocks.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/patterns-gemm.md-242-242 (1)</summary><blockquote>

`242-242`: _⚠️ Potential issue_ | _🟡 Minor_

**Correct typo in key pattern bullet.**

“subtiling” should be “sub-tiling” (or “tiling”) for clarity.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/patterns-gemm.md at line
242, Fix the typo in the key pattern bullet that reads "Epilogue subtiling:" by
changing it to "Epilogue sub-tiling:" (or simply "Epilogue tiling:") so the
heading is clear; update the bullet text in the patterns-gemm.md entry that
contains the phrase "Slice accumulator along N for the store phase to cut
register pressure." to use the corrected label.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-cute-writing/references/api-runtime-utils.md-128-141 (1)</summary><blockquote>

`128-141`: _⚠️ Potential issue_ | _🟡 Minor_

**Add missing import for `ClcDynamicPersistentTileSchedulerParams`.**

The dynamic scheduler snippet uses `ClcDynamicPersistentTileSchedulerParams` but doesn't import it. Add to the import block:

```python
from cutlass.utils import (
    StaticPersistentTileScheduler,
    ClcDynamicPersistentTileScheduler,
    PersistentTileSchedulerParams,
    ClcDynamicPersistentTileSchedulerParams,
)
```

Without this, the snippet will fail with a `NameError` on line 138.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-cute-writing/references/api-runtime-utils.md around
lines 128 - 141, The import block is missing
ClcDynamicPersistentTileSchedulerParams which causes a NameError when using
ClcDynamicPersistentTileScheduler.create; update the import tuple that currently
lists StaticPersistentTileScheduler, ClcDynamicPersistentTileScheduler, and
PersistentTileSchedulerParams to also include
ClcDynamicPersistentTileSchedulerParams so the dynamic scheduler snippet can
reference ClcDynamicPersistentTileSchedulerParams without error.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-optimization/SKILL.md-122-129 (1)</summary><blockquote>

`122-129`: _⚠️ Potential issue_ | _🟡 Minor_

**Add language identifiers to fenced code blocks**

These fenced blocks are missing language tags, which triggers markdownlint (`MD040`) and reduces render/tooling quality.



<details>
<summary>Suggested fix</summary>

````diff
-```
+```text
 Primary bottleneck: memory-bound
 Evidence: Memory bandwidth at 89% of peak, compute at 35%
 Recommendations:
 1. [High] Enable FlashAttention for self-attention layers
 2. [Medium] Apply memory pooling for attention buffers
 3. [Low] Consider gradient checkpointing for memory reduction
 ```

-```
+```text
 # Before delegating to specialist
 backup_file("train.py")
 ...
 revert_file("train.py")
 ```

-```
+```markdown
 ## Optimization Applied: <optimization_name>
 ...
 SUCCESS: Achieved X% improvement
 ```

-```
+```markdown
 ## Optimization Summary
 ...
 - <reason for not applying>
 ```
````
</details>


Also applies to: 158-170, 293-312, 316-347

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-optimization/SKILL.md around lines 122 - 129, The fenced
code blocks in .claude/skills/perf-optimization/SKILL.md are missing language
identifiers (triggering MD040); update each triple-backtick block that contains
the snippets starting with "Primary bottleneck: memory-bound", the block that
begins with "Before delegating to specialist", the "## Optimization Applied:
<optimization_name>" block, and the "## Optimization Summary" block (and any
similar blocks in the other ranges noted) by adding appropriate language tags
such as text or markdown (e.g., ```text or ```markdown) to each opening fence so
markdownlint passes and rendering/tooling improve.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-nsight-compute-analysis/references/sections-guide.md-223-229 (1)</summary><blockquote>

`223-229`: _⚠️ Potential issue_ | _🟡 Minor_

**Annotate metric counts as version-specific or use qualitative sizing**

The `~Metrics` values in this table are version-dependent and will become stale as Nsight Compute releases add or modify section files and metrics. Mark these counts with the Nsight Compute version they were collected with (e.g., "As of Nsight Compute 2024.3: 213"), or replace exact counts with relative labels (low/medium/high) that remain accurate across versions.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-nsight-compute-analysis/references/sections-guide.md
around lines 223 - 229, The "~Metrics" column contains version-dependent counts
that will go stale; update the table rows for `basic`, `detailed`, `full`,
`roofline`, and `nvlink` so each cell either prefixes the count with the Nsight
Compute version/date (e.g., "As of Nsight Compute 2024.3: 213") or replaces the
numeric count with a qualitative size label ("low/medium/high"); also add or
update a short footnote above/below the table explaining that counts are
versioned and how to interpret qualitative labels so future readers know which
approach was used.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-tileir-optimization/references/tma-conversion.md-56-57 (1)</summary><blockquote>

`56-57`: _⚠️ Potential issue_ | _🟡 Minor_

**Add safety guard for NUM_CTAS division.**

The code divides `NUM_SMS` by `NUM_CTAS` without checking if `NUM_CTAS` exists in `nargs` or is non-zero. This could cause `KeyError` or `ZeroDivisionError` at runtime.



<details>
<summary>🛡️ Proposed fix with guard condition</summary>

```diff
     # Prevent oversubscription with 2CTA
-    if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
+    if "NUM_SMS" in nargs and "NUM_CTAS" in nargs and nargs["NUM_CTAS"] > 0:
         nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-tileir-optimization/references/tma-conversion.md
around lines 56 - 57, The code performs integer division of nargs["NUM_SMS"] by
nargs["NUM_CTAS"] without ensuring nargs contains "NUM_CTAS" and that it's
non-zero, which can raise KeyError or ZeroDivisionError; update the block around
the check for "NUM_SMS" and "NUM_CTAS" so you validate that "NUM_CTAS" is
present in nargs and that int(nargs["NUM_CTAS"]) != 0 (or truthy) before doing
nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"], and if the guard
fails, either skip the division or set a safe default/raise a clear error.
Ensure you reference the same keys ("NUM_SMS", "NUM_CTAS") and the nargs dict
when applying the guard.
```

</details>
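
A minimal sketch of the guarded division, assuming `nargs` is an ordinary dict. The function name `adjust_num_sms` and the value 148 are illustrative only and do not come from the file under review.

```python
def adjust_num_sms(nargs: dict) -> dict:
    """Divide NUM_SMS by NUM_CTAS only when both keys exist and NUM_CTAS is positive."""
    num_ctas = nargs.get("NUM_CTAS", 0)
    if "NUM_SMS" in nargs and num_ctas > 0:
        nargs["NUM_SMS"] = nargs["NUM_SMS"] // num_ctas
    return nargs


print(adjust_num_sms({"NUM_SMS": 148, "NUM_CTAS": 2}))  # divided
print(adjust_num_sms({"NUM_SMS": 148, "NUM_CTAS": 0}))  # left unchanged
print(adjust_num_sms({"NUM_SMS": 148}))                 # left unchanged
```

Skipping the division silently is one option; raising a clear error for a non-positive `NUM_CTAS` would be the stricter alternative mentioned in the prompt above.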

</blockquote></details>
<details>
<summary>.claude/skills/kernel-cute-writing/references/patterns-getting-started.md-96-113 (1)</summary><blockquote>

`96-113`: _⚠️ Potential issue_ | _🟡 Minor_

**Import `cutlass` in this kernel example.**

This block uses `cutlass.dynamic_expr(...)` on line 109 but never imports `cutlass`, so a standalone copy-paste run fails with `NameError`.

<details>
<summary>💡 Suggested fix</summary>

```diff
+import cutlass
+
 `@cute.kernel`
 def my_kernel(gA: cute.Tensor, gC: cute.Tensor):
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-cute-writing/references/patterns-getting-started.md
around lines 96 - 113, The kernel example uses cutlass.dynamic_expr inside the
my_kernel function but never imports the cutlass module; add a top-level import
for cutlass (e.g., import cutlass) so cutlass.dynamic_expr is defined when
my_kernel runs, ensuring the reference in the my_kernel body (where
cutlass.dynamic_expr(thread_idx < total_tiles) is called) no longer raises
NameError.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-cute-writing/references/patterns-gemm.md-241-247 (1)</summary><blockquote>

`241-247`: _⚠️ Potential issue_ | _🟡 Minor_

**Fix the broken multiline `torch.empty(...).permute(...)` examples in the code snippet.**

Lines 242-246 start chained calls on new lines without wrapping the full expression, which is invalid Python syntax and raises `SyntaxError` when copied as written.

<details>
<summary>Suggested fix</summary>

```diff
-fake_A = torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")
-    .permute(2, 1, 0)                   # M-major physical layout
-fake_B = torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")
-    .permute(2, 1, 0)                   # N-major physical layout
+fake_A = (
+    torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")
+    .permute(2, 1, 0)  # M-major physical layout
+)
+fake_B = (
+    torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")
+    .permute(2, 1, 0)  # N-major physical layout
+)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-cute-writing/references/patterns-gemm.md around lines
241 - 247, The multiline examples for creating fake tensors (fake_A, fake_B,
fake_C) break Python syntax because the chained .permute(...) calls are placed
on a new line; fix by putting the full expression on one logical line or by
enclosing the torch.empty(...) call in parentheses so the subsequent
.permute(...) remains attached (e.g., ensure fake_A =
(torch.empty(...).permute(...)) or keep fake_A = torch.empty(...).permute(...)
on one line), and do the same for fake_B and fake_C to restore valid chaining
for permute.
```

</details>
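
The syntax claim can be confirmed without a GPU or a PyTorch install, because `compile()` only parses the source and never executes it. This is a small sketch; the `parses` helper is hypothetical.

```python
BROKEN = (
    'fake_A = torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")\n'
    "    .permute(2, 1, 0)\n"
)
FIXED = (
    "fake_A = (\n"
    '    torch.empty(8, 8, 1, dtype=torch.float16, device="cuda")\n'
    "    .permute(2, 1, 0)\n"
    ")\n"
)


def parses(source: str) -> bool:
    """Return True if the source compiles; nothing is executed, so torch is not needed."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(parses(BROKEN), parses(FIXED))
```

The parenthesized form works because line breaks inside brackets are implicit continuations in Python.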

</blockquote></details>
<details>
<summary>.claude/skills/kernel-tileir-optimization/scripts/tileir_check.py-145-154 (1)</summary><blockquote>

`145-154`: _⚠️ Potential issue_ | _🟡 Minor_

**Mock scenario is internally inconsistent with recommendation logic.**

At Line 148 and Line 153, mock values indicate nvtriton is not installed, but recommendation assumes nvtriton is present and only TileIR activation is missing.

<details>
<summary>Proposed fix</summary>

```diff
 def _mock_data() -> dict:
     """Return realistic mock data for testing without GPU/triton."""
     return {
-        "nvtriton_installed": False,
+        "nvtriton_installed": True,
         "tileir_active": False,
         "blackwell_gpu": True,
         "triton_installed": True,
         "gpu_capability": [10, 0],
         "recommendation": ("Set TRITON_PTXAS_PATH and run with ENABLE_TILE=1 to activate TileIR"),
     }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-tileir-optimization/scripts/tileir_check.py around
lines 145 - 154, The mock in _mock_data returns nvtriton_installed: False but
the recommendation string assumes nvtriton is present; update the mock to be
internally consistent by either setting "nvtriton_installed": True if you want
the scenario where only TileIR activation is missing, or change the
"recommendation" text to instruct installing nvtriton when "nvtriton_installed"
is False; adjust the dictionary returned by _mock_data (keys:
"nvtriton_installed", "tileir_active", "recommendation") accordingly so the
boolean flags and the recommendation align.
```

</details>
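
One way to keep the mock from drifting is to derive the recommendation from the flags instead of hard-coding it. The sketch below is an assumption about how that could look, not the script's actual logic: `recommend` and `mock_data` are hypothetical names, and only the `ENABLE_TILE=1` message is taken from the snippet above.

```python
def recommend(nvtriton_installed: bool, tileir_active: bool) -> str:
    """Derive the recommendation from the flags so the two cannot disagree."""
    if not nvtriton_installed:
        # Hypothetical wording; the script's real message may differ.
        return "Install nvtriton before enabling TileIR"
    if not tileir_active:
        return "Set TRITON_PTXAS_PATH and run with ENABLE_TILE=1 to activate TileIR"
    # Hypothetical wording for the fully configured case.
    return "TileIR is active"


def mock_data(nvtriton_installed: bool = True, tileir_active: bool = False) -> dict:
    """Build a mock result whose recommendation always matches its flags."""
    return {
        "nvtriton_installed": nvtriton_installed,
        "tileir_active": tileir_active,
        "recommendation": recommend(nvtriton_installed, tileir_active),
    }


print(mock_data())
print(mock_data(nvtriton_installed=False))
```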

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/SKILL.md-43-46 (1)</summary><blockquote>

`43-46`: _⚠️ Potential issue_ | _🟡 Minor_

**Inconsistency: `tl.sigmoid()` warning may conflict with code examples.**

Line 43 warns that `tl.sigmoid()` is unavailable in some Triton versions and recommends using `1.0 / (1.0 + tl.exp(-x_fp32))`. However, the SiLU + Multiply pattern in `references/patterns-fusion.md` (line 103) uses `tl.sigmoid(gate)` directly. Consider adding a note to the pattern file about this compatibility concern, or update the pattern to use the workaround.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/SKILL.md around lines 43 - 46, Update
the SiLU + Multiply pattern in references/patterns-fusion.md to avoid calling
tl.sigmoid(gate) directly because tl.sigmoid() is missing in some Triton
versions; instead, replace that usage with the recommended portable expression
1.0 / (1.0 + tl.exp(-gate_fp32)) (ensure gate is cast to fp32 for the exp and
then cast back to the original dtype before tl.store), or add an explicit
compatibility note next to the SiLU pattern that instructs readers to use the
exp-based fallback when targeting older Triton versions and to perform proper
dtype casts.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/SKILL.md-245-253 (1)</summary><blockquote>

`245-253`: _⚠️ Potential issue_ | _🟡 Minor_

**Reference function randomness won't match kernel output.**

The `reference_fn` uses `torch.manual_seed` for dropout, but the kernel uses `tl.rand` with a different PRNG algorithm. Verification will fail due to different random masks even with the same seed. For dropout kernels, either:
1. Skip dropout comparison and verify GELU only, or
2. Document that dropout verification requires comparing statistical properties rather than exact values.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/SKILL.md around lines 245 - 253, The
reference_fn uses PyTorch dropout with torch.manual_seed which won't match the
kernel's tl.rand PRNG, so update the test to avoid byte-for-byte dropout
comparison: either change reference_fn to return only the GELU output
(remove/dropout in reference_fn) and compare kernel output to GELU-only results,
or modify the verification logic that calls get_inputs to skip exact-mask checks
and instead compare statistical properties (mean/variance or dropout rate)
between the kernel output and a sampled PyTorch dropout across many seeds;
reference the functions reference_fn and get_inputs and the kernel's use of
tl.rand when making the change.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-99-106 (1)</summary><blockquote>

`99-106`: _⚠️ Potential issue_ | _🟡 Minor_

**`tl.sigmoid()` usage conflicts with SKILL.md compatibility warning.**

The SKILL.md (line 43) warns that `tl.sigmoid()` is unavailable in some Triton versions and recommends `1.0 / (1.0 + tl.exp(-x_fp32))`. This pattern uses `tl.sigmoid(gate)` directly. Consider using the workaround for broader compatibility.
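
The arithmetic behind the workaround can be sanity-checked outside Triton. The following is a plain-Python sketch, assuming scalar inputs and using `math.exp` in place of `tl.exp`; the function names are illustrative and not part of the PR.

```python
import math


def sigmoid_exp(x):
    """Sigmoid written with exp only, mirroring 1.0 / (1.0 + tl.exp(-x_fp32))."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(gate):
    """SiLU(gate) = gate * sigmoid(gate)."""
    return gate * sigmoid_exp(gate)


def silu_mul(gate, x):
    """Fused SiLU(gate) * x for a single element."""
    return silu(gate) * x
```

The exp-based form agrees with the tanh identity `0.5 * (1 + tanh(x / 2))`, so swapping it in for a built-in sigmoid should not change results beyond rounding.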



<details>
<summary>♻️ Proposed fix for compatibility</summary>

```diff
     # SiLU(gate) * x = gate * sigmoid(gate) * x
-    silu_gate = gate * tl.sigmoid(gate)
+    gate_fp32 = gate.to(tl.float32)
+    sigmoid_gate = 1.0 / (1.0 + tl.exp(-gate_fp32))
+    silu_gate = gate * sigmoid_gate.to(gate.dtype)
     out = silu_gate * x
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around
lines 99 - 106, Replace the direct call to tl.sigmoid in the SiLU computation
with the compatible expression using tl.exp: compute sig = 1.0 / (1.0 +
tl.exp(-gate_fp32)) (ensuring gate is cast to fp32 if necessary), then calculate
silu_gate = gate * sig and out = silu_gate * x before storing; update the
expressions around tl.sigmoid(gate), silu_gate and out to use this exp-based
sigmoid to maintain Triton version compatibility.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-308-316 (1)</summary><blockquote>

`308-316`: _⚠️ Potential issue_ | _🟡 Minor_

**Missing fp32 cast for LayerNorm computation.**

Same precision concern as RMSNorm - the variance computation and `tl.sqrt` should use fp32 intermediates for fp16/bf16 inputs.



<details>
<summary>♻️ Proposed fix</summary>

```diff
     # LayerNorm
-    mean = tl.sum(x, axis=0) / n_cols
-    x_centered = x - mean
-    var = tl.sum(x_centered * x_centered, axis=0) / n_cols
-    x_norm = x_centered / tl.sqrt(var + eps)
+    x_fp32 = x.to(tl.float32)
+    mean = tl.sum(x_fp32, axis=0) / n_cols
+    x_centered = x_fp32 - mean
+    var = tl.sum(x_centered * x_centered, axis=0) / n_cols
+    x_norm = x_centered / tl.sqrt(var + eps)
 
     weight = tl.load(weight_ptr + col_offsets, mask=mask, other=1.0)
     bias = tl.load(bias_ptr + col_offsets, mask=mask, other=0.0)
-    out = x_norm * weight + bias
+    out = (x_norm * weight + bias).to(x.dtype)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around
lines 308 - 316, LayerNorm uses lower-precision intermediates for variance/sqrt
which can underflow for fp16/bf16 inputs; in the LayerNorm block (mean, var,
tl.sqrt, x_norm) force fp32 intermediates: cast x (or x_centered) to float32
before summing and variance computation, perform tl.sqrt on the fp32 var+eps,
then cast the normalized result back to the input dtype before applying
weight/bias; ensure tl.load for weight_ptr/bias_ptr with mask remains correct
and only the arithmetic around mean, var and tl.sqrt uses fp32 temporaries.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-tileir-optimization/SKILL.md-81-85 (1)</summary><blockquote>

`81-85`: _⚠️ Potential issue_ | _🟡 Minor_

**Inconsistent `verify_kernel.py` invocation syntax.**

The command shown here uses `--kernel`, `--reference`, `--shapes`, and `--dtypes` flags, but the actual `verify_kernel.py` script in this PR uses a positional `kernel_path` argument and `--rtol`/`--atol` flags. The script expects the kernel module to export `reference_fn` and `get_inputs()` internally, not via CLI flags.

Update to match the actual script interface:



<details>
<summary>📝 Suggested fix</summary>

```diff
-python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
+python scripts/verify_kernel.py path/to/kernel.py --rtol 1e-3 --atol 1e-3
```
</details>
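
To make the module contract concrete, here is a minimal sketch of a check that a kernel file exports `reference_fn` and `get_inputs()`. How the real `verify_kernel.py` loads the module is not shown in this review, so `load_kernel_module`, `missing_exports`, and the example kernel source are assumptions for illustration only.

```python
import importlib.util
import os
import tempfile

REQUIRED_EXPORTS = ("reference_fn", "get_inputs")  # names taken from the review comment


def load_kernel_module(kernel_path):
    """Import a Python file by path and return the module object."""
    spec = importlib.util.spec_from_file_location("kernel_under_test", kernel_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def missing_exports(module):
    """List the required callables that the module does not define."""
    return [name for name in REQUIRED_EXPORTS
            if not callable(getattr(module, name, None))]


# Hypothetical kernel file used only to exercise the check.
EXAMPLE_SOURCE = '''
def get_inputs():
    return [[1.0, 2.0, 3.0]]

def reference_fn(x):
    return [v * 2.0 for v in x]
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_kernel.py")
    with open(path, "w") as handle:
        handle.write(EXAMPLE_SOURCE)
    module = load_kernel_module(path)
    missing = missing_exports(module)
    expected = module.reference_fn(*module.get_inputs())
```

With this shape, shapes and dtypes live in the kernel file itself, which is why the CLI only needs the path and the tolerances.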

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-tileir-optimization/SKILL.md around lines 81 - 85, The
README example uses flags that don't match the current verify_kernel.py
interface; update the doc to call verify_kernel.py with the positional
kernel_path argument and optional numeric tolerance flags (--rtol/--atol)
instead of --reference/--shapes/--dtypes, and note that the kernel module must
export reference_fn and get_inputs() to supply shapes/dtypes and reference data;
also keep the instruction to run with ENABLE_TILE=0 when verifying the kernel.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/patterns-fusion.md-190-197 (1)</summary><blockquote>

`190-197`: _⚠️ Potential issue_ | _🟡 Minor_

**Missing fp32 cast for RMS computation with fp16/bf16 inputs.**

Per the SKILL.md precision rules, `tl.sqrt` requires fp32 input for correct results. The computation here operates on the input dtype directly, which could cause precision issues or incorrect results with fp16/bf16.
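
The risk can be reproduced without a GPU. The sketch below emulates half-precision rounding with the standard library's `struct` format `"e"`; it is not Triton code, Python floats stand in for the wide intermediates, and the helper names are hypothetical.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("e", struct.pack("e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def rms_fp16(row, eps=1e-6):
    """RMS with every intermediate rounded to half precision."""
    total = 0.0
    for v in row:
        total = to_fp16(total + to_fp16(to_fp16(v) * to_fp16(v)))
    return math.sqrt(to_fp16(total / len(row) + eps))


def rms_wide(row, eps=1e-6):
    """RMS with wide intermediates (Python floats stand in for fp32 here)."""
    return math.sqrt(sum(v * v for v in row) / len(row) + eps)


row = [256.0] * 4          # 256 * 256 = 65536 exceeds the fp16 maximum of 65504
narrow = rms_fp16(row)     # the squared term overflows, so the result is inf
wide = rms_wide(row)       # stays close to 256
```

Squaring is the step that overflows first, which is why the proposed fix casts before computing `x_sq` and not only before `tl.sqrt`.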



<details>
<summary>♻️ Proposed fix to add fp32 casts</summary>

```diff
     # Compute RMS
-    x_sq = x * x
-    rms = tl.sqrt(tl.sum(x_sq, axis=0) / n_cols + eps)
+    x_fp32 = x.to(tl.float32)
+    x_sq = x_fp32 * x_fp32
+    rms = tl.sqrt(tl.sum(x_sq, axis=0) / n_cols + eps)
 
     # Normalize and scale
-    x_norm = x / rms
+    x_norm = x_fp32 / rms
     weight = tl.load(weight_ptr + col_offsets, mask=mask, other=1.0)
-    out = x_norm * weight
+    out = (x_norm * weight).to(x.dtype)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/patterns-fusion.md around
lines 190 - 197, The RMS computation uses tl.sqrt(tl.sum(...)) directly on x
which can be fp16/bf16; cast the squared terms and the eps divisor to tl.float32
before sum/sqrt and perform sqrt in fp32, then convert the result back as needed
so subsequent normalization uses correct precision. Concretely, in the RMS block
(symbols: x, x_sq, rms, tl.sum, tl.sqrt, eps) compute x_sq in fp32 (e.g., cast x
to tl.float32 before squaring or cast the product), ensure eps and the division
by n_cols are fp32, call tl.sqrt on that fp32 value to get rms_fp32, then use
rms_fp32 appropriately when computing x_norm (cast rms_fp32 to the input dtype
or cast x to fp32 for the division and cast the final out back to the original
dtype); keep weight loading (weight, weight_ptr, col_offsets, mask) unchanged
but ensure dtype alignment when multiplying out.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (19)</summary><blockquote>

<details>
<summary>.claude/skills/perf-host-analysis/references/metrics.md (3)</summary><blockquote>

`26-28`: **Add language specifier to formula block.**

The fenced code block is missing a language identifier. Consider adding a specifier like `text` or `python` after the opening backticks for better rendering.


<details>
<summary>📝 Proposed fix</summary>

```diff
-```
+```text
 gpu_idle_ratio = gpu_idle_time_us / total_time_us
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/metrics.md around lines 26 -
28, The fenced code block containing the formula "gpu_idle_ratio =
gpu_idle_time_us / total_time_us" lacks a language specifier; update the opening
backticks for that block to include a language tag (for example `text` or
`python`) so the renderer applies proper formatting. Locate the block with the
symbol gpu_idle_ratio and add the chosen specifier immediately after the opening
backticks.

`99-101`: **Add language specifier to formula block.**

The fenced code block is missing a language identifier. Consider adding a specifier like `text` or `python` after the opening backticks for better rendering.

📝 Proposed fix

-```
+```text
 host_prep_perf_impact = host_prep_exposed_us / total_time_us

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/metrics.md around lines 99 -
101, The fenced code block containing the formula "host_prep_perf_impact =
host_prep_exposed_us / total_time_us" lacks a language specifier; update the
opening backticks to include a language (e.g., add `text` or `python`) so the
block renders correctly. Locate the block with the variables
host_prep_perf_impact, host_prep_exposed_us, and total_time_us and add the
language identifier after the opening ``` only.


</details>

---

`85-87`: **Add language specifier to formula block.**

The fenced code block is missing a language identifier. Consider adding a specifier like `text` or `python` after the opening backticks for better rendering.


<details>
<summary>📝 Proposed fix</summary>

```diff
-```
+```text
 host_prep_exposed_ratio = host_prep_exposed_us / host_prep_total_us
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/references/metrics.md around lines 85 -
87, The fenced code block containing the formula "host_prep_exposed_ratio =
host_prep_exposed_us / host_prep_total_us" lacks a language specifier; update
the opening backticks to include a language such as `text` or `python`
so the block renders properly, leaving the formula and
variable names host_prep_exposed_ratio, host_prep_exposed_us, and
host_prep_total_us unchanged.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-cute-writing/references/api-core.md (1)</summary><blockquote>

`232-239`: **Add language specifier to code block.**

The fenced code block is missing a language identifier. Adding `python` after the opening backticks will enable syntax highlighting.


<details>
<summary>📝 Proposed fix</summary>

```diff
-```
+```python
 cutlass.dynamic_expr(cond)      # Runtime conditional guard
 cutlass.const_expr(cond)        # Compile-time conditional
 cutlass.Constexpr               # Type annotation for compile-time args
 cutlass.range(n)                # IR loop with optional attributes
 cutlass.range_constexpr(n)      # Compile-time unrolled loop
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-cute-writing/references/api-core.md around lines 232 -
239, The fenced code block containing cutlass.dynamic_expr, cutlass.const_expr,
cutlass.Constexpr, cutlass.range, and cutlass.range_constexpr is missing a
language identifier; update the opening fence to include the language specifier
(python) so the block becomes ```python, preserving the existing contents and
closing fence to enable syntax highlighting for those symbols.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/exec-slurm-compile/scripts/compile.slurm (1)</summary><blockquote>

`27-31`: **Consider validating required positional arguments.**

The script accepts 4 required positional arguments without validation. If fewer arguments are provided, the script will fail with unclear errors or use empty values.



<details>
<summary>Proposed fix</summary>

```diff
+if [[ $# -lt 4 ]]; then
+    echo "Usage: compile.slurm <container_image> <mount_dir> <scripts_dir> <repo_dir> [build_wheel_args...]" >&2
+    exit 1
+fi
+
 container_image=$1
 mount_dir=$2
 scripts_dir=$3
 repo_dir=$4
 shift 4  # remaining args forwarded to compile.sh
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/exec-slurm-compile/scripts/compile.slurm around lines 27 -
31, The script uses positional variables container_image, mount_dir,
scripts_dir, repo_dir (and then shift 4) but does not validate that four
arguments were provided; add a guard at the top that checks $# >= 4 and, if not,
prints a concise usage/help message showing required args (container_image
mount_dir scripts_dir repo_dir) and exits non‑zero. Reference the variables
container_image, mount_dir, scripts_dir, repo_dir and the existing shift 4 so
the check runs before shift and before any use of those variables.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/exec-slurm-compile/scripts/submit_compile.sh (1)</summary><blockquote>

`46-46`: **Potential Bash compatibility issue: empty array expansion with `set -u`.**

In Bash versions prior to 4.4, expanding an empty array (`"${extra_build_args[@]}"`) with `set -u` enabled causes an "unbound variable" error. If older Bash versions are a concern, consider using `${extra_build_args[@]+"${extra_build_args[@]}"}` for safe expansion.



<details>
<summary>Portable fix for older Bash</summary>

```diff
-    "${extra_build_args[@]}"
+    ${extra_build_args[@]+"${extra_build_args[@]}"}
```

</details>


Also applies to: 61-61

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/exec-slurm-compile/scripts/submit_compile.sh at line 46, The
script uses the array variable extra_build_args and later expands it as
"${extra_build_args[@]}", which can trigger "unbound variable" under set -u on
older Bash; update all expansions of the array (references to
"${extra_build_args[@]}" in the script) to the portable form
${extra_build_args[@]+"${extra_build_args[@]}"} so empty-array expansion is safe
with set -u while keeping the existing initialization extra_build_args=()
unchanged.
```

</details>

</blockquote></details>
<details>
<summary>.claude/skills/kernel-triton-writing/references/troubleshooting.md (1)</summary><blockquote>

`36-41`: **Add language specifier to fenced code block.**

The output example code block should have a language specifier for proper rendering and syntax highlighting.



<details>
<summary>📝 Proposed fix</summary>

```diff
-Output appears in stderr during compilation (not on device):
-```
+Output appears in stderr during compilation (not on device):
+```text
 BLOCK_SIZE 1024
 x dtype float32
 x shape (1024,)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/references/troubleshooting.md around
lines 36 - 41, Update the fenced code block showing the stderr output (the block
containing "BLOCK_SIZE 1024", "x dtype float32", "x shape (1024,)") to include a
language specifier (e.g., `text` on the opening fence) so it renders with correct
syntax highlighting; edit the code block in
.claude/skills/kernel-triton-writing/references/troubleshooting.md accordingly.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-analysis/SKILL.md (1)</summary><blockquote>

`86-95`: **Add language specifiers to fenced code blocks.**

The example blocks should have language specifiers for proper rendering. Consider using `text` or `markdown` for the delegation examples and report format.



<details>
<summary>📝 Proposed fix</summary>

```diff
 ### Good Example
 
-```
+```text
 Profile the batched GEMM kernel in bmm_workload.py with NCU.
 ...
 ```
 
 ### Bad Example
 
-```
+```text
 Run NCU with --set=full --profile-from-start off --target-processes all.
 ...
 ```

 ### Example Report
 
-```
+```markdown
 ## Summary
 Training at 42% MFU, memory-bound due to large attention tensors.
 ...
 ```
```
</details>


Also applies to: 97-104, 132-154

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-analysis/SKILL.md around lines 86 - 95, Update the
fenced code blocks in .claude/skills/perf-analysis/SKILL.md so they include
language specifiers: add `text` for the general/descriptive blocks (e.g. the
block starting "Profile the batched GEMM kernel in bmm_workload.py..." and the
"Bad Example" blocks) and `markdown` for the formatted report/example report
blocks (e.g. the "Example Report" block that begins "## Summary"). Locate the
unlabeled triple-backtick blocks by their surrounding text (the "Profile the
batched GEMM kernel..." paragraph, the "Bad Example" section, and the "Example
Report" section) and change the opening fence to include the appropriate
language tag (`text` or `markdown`) so the examples render correctly.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-nsight-systems/references/app-preparation.md (1)</summary><blockquote>

`53-53`: **Consider wrapping CUDA profiler calls with `check_error()` for proper error handling.**

The `torch.cuda.cudart().cudaProfilerStart()` syntax is correct. However, according to PyTorch documentation, these calls return `cudaError_t` status codes that should be checked. The recommended pattern is:

```python
from torch.cuda import cudart, check_error
check_error(cudart().cudaProfilerStart())
```

This ensures error conditions (e.g., profiler already started) are properly handled.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-nsight-systems/references/app-preparation.md at line 53,
Wrap the raw CUDA profiler calls so their return status is checked: replace
direct uses of torch.cuda.cudart().cudaProfilerStart() (and cudaProfilerStop()
if present) with the check_error(...) pattern from torch.cuda (i.e., call
check_error(cudart().cudaProfilerStart()) and
check_error(cudart().cudaProfilerStop())) to surface CUDA errors; reference the
torch.cuda.cudart().cudaProfilerStart, torch.cuda.cudart().cudaProfilerStop, and
torch.cuda.check_error symbols when making the change.
```

</details>

</blockquote></details>
<details>
<summary>.claude/agents/perf-profiling-specialist.md (1)</summary><blockquote>

`323-361`: **Specify a language on the fenced template block.**

Use `text` (or `markdown`) on the opening fence to satisfy markdown lint and improve rendering consistency.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/agents/perf-profiling-specialist.md around lines 323 - 361, The
fenced code block that contains the "## nsys Profiling Summary" template
should declare a language (e.g., use `text` or `markdown` on the opening fence)
to satisfy markdown linting and improve rendering; update the opening fence in
.claude/agents/perf-profiling-specialist.md (the template block that starts with
"## nsys Profiling Summary") to include the language token while leaving the
content inside unchanged.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-nsight-compute-analysis/SKILL.md (1)</summary><blockquote>

`326-329`: **Add a language identifier to the fenced output block.**

The example output fence is untyped; use `text` to satisfy markdown lint and keep formatting consistent.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-nsight-compute-analysis/SKILL.md around lines 326 - 329,
The fenced code block showing the CSV sample (the triple-backtick block
containing "Kernel Name","Duration","Compute (SM) Throughput","Memory
Throughput" and the following CSV line) is missing a language identifier; update
the opening fence to use `text` so the block is typed as text to satisfy
markdown linting and keep formatting consistent.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-host-analysis/SKILL.md (1)</summary><blockquote>

`65-67`: **Add language identifiers to untyped fenced code blocks.**

At Line 65, Line 85, Line 94, Line 103, Line 227, Line 252, Line 335, Line 341, and Line 352, use `text` for plain output/pseudocode blocks to clear markdown lint warnings.



Also applies to: 85-91, 94-100, 103-109, 227-247, 252-257, 335-337, 341-348, 352-354

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/SKILL.md around lines 65 - 67, Several
fenced code blocks in SKILL.md (including the inline diagram showing
"[inter-step gap] -> [_forward_step] -> ..." and blocks around the ranges 65,
85–91, 94–100, 103–109, 227–247, 252–257, 335–337, 341–348, 352–354) are untyped
and trigger markdown-lint warnings; update each triple-backtick fence to specify
the language identifier "text" (e.g., ```text) so plain output/pseudocode blocks
are explicitly marked, leaving the block contents unchanged and only adding the
language token to the opening fence.


</details>

</blockquote></details>
<details>
<summary>.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py (3)</summary><blockquote>

`486-492`: **Unused loop variable `avg` (second occurrence).**

Same issue as above - rename to `_avg`.



<details>
<summary>♻️ Proposed fix</summary>

```diff
-            for name, cnt, total, avg in kernels:
+            for name, cnt, total, _avg in kernels:
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around
lines 486 - 492, In the loop "for name, cnt, total, avg in kernels:" inside
analyze_host_overhead.py rename the unused fourth tuple element from "avg" to
"_avg" (or similar underscore-prefixed name) to signal it's intentionally
unused; update that loop header (the second occurrence) where "name, cnt, total,
avg" is unpacked and leave all other logic unchanged so only the variable name
is changed.
```

</details>

---

`468-476`: **Unused loop variable `avg`.**

The variable `avg` is unpacked but never used. Rename to `_avg` to indicate it's intentionally unused.



<details>
<summary>♻️ Proposed fix</summary>

```diff
-            for text, cnt, total, avg in nvtx:
+            for text, cnt, total, _avg in nvtx:
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around
lines 468 - 476, The loop over nvtx unpacks (text, cnt, total, avg) but never
uses avg; change the unpacking to (text, cnt, total, _avg) to mark it
intentionally unused (update the for-loop header where nvtx is iterated) so
linters stop flagging an unused variable; no other logic changes needed — keep
references to nvtx_per_step, tp_size, unique_steady, text, cnt, and total as-is.
```

</details>

---

`742-760`: **SQLite connection may leak on exception between baseline and target analysis.**

If `analyze_single_trace` raises for either trace, the matching `close()` call is skipped and that connection (`conn_b` or `conn_t`) is never closed. A failure inside `sqlite3.connect(...)` itself leaks nothing, since no connection exists yet. Consider a context manager for robustness, and note that `sqlite3.Connection` used directly in a `with` statement only commits or rolls back the transaction; wrapping it in `contextlib.closing` is what guarantees `close()`.



<details>
<summary>♻️ Suggested improvement using context managers</summary>

```diff
     try:
         # Analyze baseline
-        conn_b = sqlite3.connect(args.baseline)
-        baseline_results = analyze_single_trace(
-            conn_b, args.baseline_label, out, tp_size=args.tp_size
-        )
-        conn_b.close()
+        # contextlib.closing (needs `from contextlib import closing`) calls close();
+        # a bare `with sqlite3.connect(...)` only commits or rolls back.
+        with closing(sqlite3.connect(args.baseline)) as conn_b:
+            baseline_results = analyze_single_trace(
+                conn_b, args.baseline_label, out, tp_size=args.tp_size
+            )

         # Analyze target (if provided)
         target_results = None
         if args.target:
-            conn_t = sqlite3.connect(args.target)
-            target_results = analyze_single_trace(
-                conn_t, args.target_label, out, tp_size=args.tp_size
-            )
-            conn_t.close()
+            with closing(sqlite3.connect(args.target)) as conn_t:
+                target_results = analyze_single_trace(
+                    conn_t, args.target_label, out, tp_size=args.tp_size
+                )

             # Compare
             compare_results(baseline_results, target_results, out)
```
</details>
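
The difference between a bare `with sqlite3.connect(...)` and `contextlib.closing` can be checked with an in-memory database. This is a small standalone sketch; `is_open` is a hypothetical helper, not something from `analyze_host_overhead.py`.

```python
import sqlite3
from contextlib import closing


def is_open(conn):
    """Return True if the connection still accepts statements."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        return False


# A bare `with` manages the transaction only; the connection stays open afterwards.
with sqlite3.connect(":memory:") as plain:
    pass
plain_still_open = is_open(plain)
plain.close()

# closing() calls conn.close() on exit, including when the body raises.
with closing(sqlite3.connect(":memory:")) as managed:
    pass
managed_still_open = is_open(managed)
```

For a read-only analysis script the transaction handling is irrelevant, so `closing` alone is enough to prevent the leak.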

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/perf-host-analysis/scripts/analyze_host_overhead.py around
lines 742 - 760, The current sequence opens sqlite3 connections into conn_b and
conn_t and closes them manually, so if analyze_single_trace raises for either
trace the matching close() is skipped and the connection leaks; fix by wrapping
each connection in contextlib.closing: use "with
closing(sqlite3.connect(args.baseline)) as conn_b:" and "with
closing(sqlite3.connect(args.target)) as conn_t:" around the respective
analyze_single_trace calls (a bare "with sqlite3.connect(...)" only commits or
rolls back and does not close), keep the compare_results(baseline_results,
target_results, out) call after both blocks, and ensure baseline_results and
target_results are set in the correct scopes.
```

</details>

</blockquote></details>
<details>
<summary>.claude/agents/exec-compile-specialist.md (1)</summary><blockquote>

`116-132`: **Add language specifier to fenced code block.**

The code block starting at line 116 lacks a language specifier. Since it's a template output format, use `text` or `markdown`.



<details>
<summary>📝 Proposed fix</summary>

````diff
-```
+```text
 ## Compile Result
````

</details>

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/agents/exec-compile-specialist.md around lines 116 - 132, The fenced
code block that renders the compile-result template (the block that begins with
the three backticks immediately above the "## Compile Result" heading) lacks a
language specifier; update the opening fence from "```" to "```text" (or
"```markdown") so the template is explicitly marked as plain text/markdown and
renders correctly.

.claude/skills/kernel-triton-writing/scripts/verify_kernel.py (3)

`223-236`: **Temporary file not cleaned up on subprocess failure before try block.**

If `subprocess.run` raises an exception other than `TimeoutExpired` (e.g., `OSError`), the `finally` block handles cleanup. However, if an exception occurs between writing the file (lines 223-225) and entering the `try` block (line 229), the temp file would leak. Consider using a context manager pattern or moving file creation inside the `try` block.

♻️ Suggested improvement using context manager approach

-    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as script_file:
-        script_file.write(script)
-        script_path = script_file.name
-
-    working_dir = os.path.dirname(os.path.abspath(kernel_path))
-
     try:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as script_file:
+            script_file.write(script)
+            script_path = script_file.name
+
+        working_dir = os.path.dirname(os.path.abspath(kernel_path))
+
         result = subprocess.run(
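
One way to arrange this so cleanup never depends on where the failure happens is sketched below. It is a standalone illustration: `run_script` is a hypothetical name, and initialising `script_path` to `None` before the `try` is an added safeguard that the suggested diff does not show.

```python
import os
import subprocess
import sys
import tempfile


def run_script(source, timeout=30):
    """Write source to a temp file, run it, and always remove the file afterwards."""
    script_path = None  # defined up front so the finally block can test it safely
    try:
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as handle:
            handle.write(source)
            script_path = handle.name
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout, script_path
    finally:
        # Runs on success, on timeout, and on any other exception.
        if script_path is not None and os.path.exists(script_path):
            os.unlink(script_path)


stdout, used_path = run_script("print('ok')")
```

Returning the path is only for demonstration, so a caller can confirm the file is gone.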

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines
223 - 236, The temp file created via NamedTemporaryFile (script_file /
script_path) can leak if an exception occurs after writing but before entering
the try block; move the file creation inside the same try/finally that runs
subprocess.run (or wrap both creation and subprocess.run in a single
try/except/finally) so cleanup of script_path always happens in the finally, or
alternatively use a context manager that keeps the temp file open for the
duration of subprocess.run; adjust references to script_file/script_path,
subprocess.run, working_dir, timeout, and kernel_path accordingly to ensure the
finally block always removes the temp file even if errors occur before the
subprocess is invoked.

`121-127`: **Relative difference calculation may be misleading for very small reference values.**

The `safe_ref` guard replaces zeros with ones, but for very small non-zero values (e.g., 1e-10), the relative difference could still be misleadingly large. Consider using a more robust relative error formula like `abs_diff / max(abs(ref), atol)`.
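
The effect of the suggested denominator can be seen on plain Python lists. This sketch assumes scalar floats instead of tensors, and the `atol` default of `1e-8` is an illustrative choice, not the script's actual setting.

```python
def naive_relative_error(actual, reference):
    """Largest |a - r| / |r|, with zeros in the reference replaced by one."""
    worst = 0.0
    for a, r in zip(actual, reference):
        denom = abs(r) if r != 0 else 1.0
        worst = max(worst, abs(a - r) / denom)
    return worst


def robust_relative_error(actual, reference, atol=1e-8):
    """Largest |a - r| / max(|r|, atol), so tiny references cannot inflate the ratio."""
    worst = 0.0
    for a, r in zip(actual, reference):
        denom = max(abs(r), atol)
        worst = max(worst, abs(a - r) / denom)
    return worst


actual = [1.0, 2e-10]
reference = [1.0, 1e-10]

naive = naive_relative_error(actual, reference)     # dominated by the 1e-10 entry
robust = robust_relative_error(actual, reference)   # same entry scaled by atol
```

An absolute difference of 1e-10 reads as a 100% error under the naive formula but only about 1% once the denominator is floored at `atol`.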

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines
121 - 127, The relative-difference logic using safe_ref is fragile for tiny
non-zero refs; change the calculation to use a robust denominator like
max(ref_abs, atol) instead of replacing zeros with ones: compute ref_abs =
ref.float().abs(), pick a small absolute tolerance (e.g., atol = 1e-8 or an
existing tolerance variable), set denom = torch.maximum(ref_abs,
torch.full_like(ref_abs, atol)), then compute max_rel = (abs_diff /
denom).max().item() and update _global_max_rel accordingly (leave abs_diff,
max_abs, and the _global updates unchanged).

`241-247`: **Use `next()` instead of list slicing for single element extraction.**

Creating a full list just to take the first element is inefficient. Use `next()` with a generator expression.

♻️ Proposed fix

         if "RESULT:" in output:
             try:
-                result_line = [line for line in output.split("\n") if "RESULT:" in line][0]
+                result_line = next(line for line in output.split("\n") if "RESULT:" in line)
                 parts_str = result_line.split("RESULT:")[1]
                 parts = parts_str.split(",")
                 passed = "True" in parts[0]
                 max_abs = float(parts[1].split("=")[1])
                 max_rel = float(parts[2].split("=")[1])
-            except (IndexError, ValueError) as exc:
+            except (StopIteration, IndexError, ValueError) as exc:
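
A self-contained version of this parsing step might look like the sketch below. The `RESULT:` line layout is inferred from the parsing code in the diff, the sample output is made up, and `parse_result` is a hypothetical helper that returns `None` where the real script would report an error.

```python
def parse_result(output):
    """Extract (passed, max_abs, max_rel) from the first RESULT: line, or None."""
    # next() stops at the first match and avoids building an intermediate list.
    result_line = next((line for line in output.split("\n") if "RESULT:" in line), None)
    if result_line is None:
        return None
    parts = result_line.split("RESULT:")[1].split(",")
    try:
        passed = "True" in parts[0]
        max_abs = float(parts[1].split("=")[1])
        max_rel = float(parts[2].split("=")[1])
    except (IndexError, ValueError):
        return None
    return passed, max_abs, max_rel


sample = "compiling...\nRESULT:True,max_abs=0.001,max_rel=0.002\ndone"
parsed = parse_result(sample)
```

Passing a default to `next()` avoids having to catch `StopIteration`, which is the alternative the diff takes.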

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.claude/skills/kernel-triton-writing/scripts/verify_kernel.py around lines
241 - 247, Replace the list-and-slice used to find the RESULT line with next()
over a generator to avoid building an intermediate list: use next((line for line
in output.split("\n") if "RESULT:" in line), None) to assign result_line, then
check for None and raise or handle an error before continuing to parse
parts_str, parts, passed, max_abs, and max_rel; keep the same parsing logic but
ensure the new next() usage and the None check are used in the try block
(symbols: output, result_line, parts_str, parts, passed, max_abs, max_rel).

tensorrt-cicd · 2026-04-08T08:30:04Z

PR_Github #42301 [ skip ] completed with state SUCCESS. Commit: 37df0c0
Skipping testing for commit 37df0c0

Link to invocation

github-actions Bot assigned kaiyux Apr 8, 2026

QiJune approved these changes Apr 8, 2026


coderabbitai Bot reviewed Apr 8, 2026


kaiyux enabled auto-merge (squash) April 8, 2026 08:26

kaiyux merged commit 04cf885 into NVIDIA:main Apr 8, 2026
11 of 14 checks passed

kaiyux deleted the user/kaiyu/skills branch April 8, 2026 08:37
