Skip to content

[None][feat] serve-config-guide skill for basic aggregate single-node serving configs#12054

Merged
venkywonka merged 19 commits into
NVIDIA:mainfrom
venkywonka:venky/add-basic-agg-config-skills
Apr 1, 2026
Merged

[None][feat] serve-config-guide skill for basic aggregate single-node serving configs#12054
venkywonka merged 19 commits into
NVIDIA:mainfrom
venkywonka:venky/add-basic-agg-config-skills

Conversation

@venkywonka
Copy link
Copy Markdown
Collaborator

@venkywonka venkywonka commented Mar 9, 2026

Summary

Claude Code skill (2 files, ~127 lines) that generates source-backed starting trtllm-serve --config YAMLs for basic aggregate (IFB, single-node, PyTorch backend) serving. Iteratively refined through 4 eval iterations against golden holdout configs from IBDB.

What it does:

  • Database-first lookup: searches examples/configs/database/lookup.yaml then examples/configs/curated/lookup.yaml before interpolating
  • Preserves user's stated performance objective (Min Latency / Balanced / Max Throughput) through config selection
  • Excludes speculative, disaggregated, and multi-node configs (DeepSeek-R1 MTP is the one exception — copied verbatim from checked-in source)
  • For interpolated configs: cites starting-point source + single "What to benchmark" knob list

Key design choices validated by eval:

  • max_batch_size: scheduler ceiling, NOT proportional to concurrency — copy from nearest source config
  • cuda_graph_config.max_batch_size: default to max_batch_size (not concurrency-proportional — eval showed the old heuristic was wrong in 3/3 golden configs)
  • KV cache estimation formulas for GQA and MLA (covers MHA as special case where num_kv_heads == num_attention_heads)
  • Chunked prefill MLA hardware constraint (SM90+ only)
  • Excluded sources guardrail: docs/source/legacy/ values don't transfer to PyTorch backend

Files

File Lines Role
SKILL.md 74 Scope, constraints, 5-step decision flow (lock objective → exact match → nearest config → read docs → adjust fields), validation checklist
references/knob-heuristics.md 53 Tuning table for 7 commonly adjusted fields, KV cache estimation formulas, chunked prefill MLA guidance

references/architecture.md was deleted — its model-to-source mapping table was derivable by the agent via ls/grep; the 2 non-obvious pieces (legacy exclusion, DeepSeek-V3 caveat) were inlined into Step 3.

Test plan

  • Exact match: DeepSeek-R1-0528 on 8x B200, 1k1k, conc=1, min latency → copies from database
  • Interpolation: Llama-3.3-70B on 4x H100, 4k/2k, conc=128, max throughput → cites curated source + What to benchmark
  • Hallucination guard: Llama-4-Ultra-1T (nonexistent) → refuses to fabricate paths
  • Golden holdout: 3 IBDB configs not in checked-in database → 90% field accuracy
  • No hardcoded numeric thresholds in heuristics (all guidance says "copy from source" or "benchmark")

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

This pull request adds a comprehensive skill guide for configuring basic aggregate single-node serving with PyTorch backend. It includes step-by-step instructions, architecture classification logic, tuning heuristics for critical configuration knobs, and template configurations organized across a main guide file and three reference documents.

Changes

Cohort / File(s) Summary
Skill Guide and References
.claude/skills/basic-agg-singlenode-config-guide/SKILL.md, .claude/skills/basic-agg-singlenode-config-guide/references/architecture.md, .claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md, .claude/skills/basic-agg-singlenode-config-guide/references/templates.md
Adds comprehensive documentation for manual tuning workflow of basic aggregate single-node configurations, including model classification rules, decision tables for four critical knobs (moe_config.backend, enable_attention_dp, kv_cache_config.free_gpu_memory_fraction, max_num_tokens), YAML config templates for MoE/Dense architectures, and known model mappings with failure mode guidance.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The PR title clearly identifies the main change: adding a configuration guide skill for basic aggregate single-node serving, accurately reflecting the added documentation files.
Description check ✅ Passed The PR description is comprehensive, explaining the skill's purpose, design choices, test validation, and key files with clear structure and rationale.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md (1)

44-56: Clarify relationship between formula-based and fixed-value guidance.

Lines 50-52 recommend "ISL to 2x ISL" as a sweet spot, but line 54 mentions "some models use fixed values (e.g., 1152 for DeepSeek-R1)." This creates ambiguity: when should users apply the formula versus using a fixed value? Consider adding explicit precedence: "Check database configs first for model-specific fixed values; if not found, use the ISL-based formula."

📋 Proposed clarification
 **Rules**:
+- **Check database configs first**: some models use validated fixed values (e.g., 1152 for DeepSeek-R1).
+- If no database config exists, use the following formula:
 - Must be >= ISL + 64 (chat template overhead) when chunked prefill is disabled, or requests get HTTP 400.
 - Sweet spot: ISL to 2x ISL.
 - Low concurrency (<=32): `max(ISL, 2048)`.
 - High concurrency (>=64): `max(ISL * 2, 4096)`.
 - Never > ISL * 4 — wastes activation buffer, counterintuitively shrinks KV cache.
-- Check database configs first: some models use fixed values (e.g., 1152 for DeepSeek-R1).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
@.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md
around lines 44 - 56, The guidance for max_num_tokens is ambiguous about when to
use model-specific fixed values versus the ISL-based formula; update the text
for max_num_tokens to explicitly state precedence: first check database configs
or model-specific fixed values (e.g., DeepSeek-R1 = 1152) and use those if
present, otherwise compute the scheduler token budget using the ISL-based
heuristics (sweet spot ISL–2x ISL, low/high concurrency rules, never > ISL*4),
and note that fixed values override the formula.
.claude/skills/basic-agg-singlenode-config-guide/SKILL.md (1)

23-29: Consider using approximate counts to reduce maintenance.

Lines 27-28 cite specific config counts ("169 benchmark-optimal configs", "10 production configs") that may become stale as the repository evolves. Consider rephrasing to "100+ benchmark configs" and "~10 production configs" for resilience.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/basic-agg-singlenode-config-guide/SKILL.md around lines 23 -
29, The document names exact counts ("169 benchmark-optimal configs", "10
production configs") which will become stale; update the wording in SKILL.md
near the Database and Curated bullets to use approximate counts like "100+
benchmark configs" and "~10 production configs" (or similar) while preserving
the pointers to examples/configs/database/lookup.yaml, RecipeList.from_yaml(),
and examples/configs/curated/ and the note about stripping speculative_config;
keep references to references/architecture.md and references/templates.md
unchanged.
.claude/skills/basic-agg-singlenode-config-guide/references/templates.md (1)

7-24: Clarify the relationship between example values.

The template shows max_num_tokens: 2048 and max_seq_len: 2068, but the relationship isn't immediately clear. The comment states >= ISL + OSL + 68, which suggests ISL+OSL=2000 in this example. Consider adding a comment like # example assumes ISL+OSL=2000 to help users understand how to adapt these values.

📝 Proposed clarification
 ```yaml
 backend: pytorch
-max_num_tokens: 2048          # adjust: see knob-heuristics.md §max_num_tokens
-max_seq_len: 2068             # >= ISL + OSL + 68
+max_num_tokens: 2048          # adjust: see knob-heuristics.md §max_num_tokens (example: ISL=1k, OSL=1k)
+max_seq_len: 2068             # >= ISL + OSL + 68 (2000 + 68 in this example)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md
around lines 7 - 24, Add a clarifying comment that links the example values for
max_num_tokens and max_seq_len to an assumed ISL+OSL total so readers know how
the numbers were derived: update the comment next to max_num_tokens and
max_seq_len (the YAML keys) to note the example uses ISL=1k and OSL=1k
(ISL+OSL=2000) and show that max_seq_len = ISL+OSL+68 (2000+68=2068) so users
can adjust those fields accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md:
- Around line 26-45: The MoE+GQA template currently sets max_batch_size: 256 but
conflicts with architecture.md which caps MoE models at 512; update the template
so max_batch_size: 512 for the MoE + GQA block (and remove/adjust the incorrect
Dense comment that says "(cap 512 for MoE)" in the Dense section), or
alternatively add an explicit inline note in the MoE+GQA template explaining
that 256 is an intentional conservative value for MoE_GQA — be explicit about
which variant the cap applies to (MoE_MLA vs MoE_GQA) and reference the
max_batch_size key and the Dense comment so readers are not confused.

---

Nitpick comments:
In
@.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md:
- Around line 44-56: The guidance for max_num_tokens is ambiguous about when to
use model-specific fixed values versus the ISL-based formula; update the text
for max_num_tokens to explicitly state precedence: first check database configs
or model-specific fixed values (e.g., DeepSeek-R1 = 1152) and use those if
present, otherwise compute the scheduler token budget using the ISL-based
heuristics (sweet spot ISL–2x ISL, low/high concurrency rules, never > ISL*4),
and note that fixed values override the formula.

In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md:
- Around line 7-24: Add a clarifying comment that links the example values for
max_num_tokens and max_seq_len to an assumed ISL+OSL total so readers know how
the numbers were derived: update the comment next to max_num_tokens and
max_seq_len (the YAML keys) to note the example uses ISL=1k and OSL=1k
(ISL+OSL=2000) and show that max_seq_len = ISL+OSL+68 (2000+68=2068) so users
can adjust those fields accordingly.

In @.claude/skills/basic-agg-singlenode-config-guide/SKILL.md:
- Around line 23-29: The document names exact counts ("169 benchmark-optimal
configs", "10 production configs") which will become stale; update the wording
in SKILL.md near the Database and Curated bullets to use approximate counts like
"100+ benchmark configs" and "~10 production configs" (or similar) while
preserving the pointers to examples/configs/database/lookup.yaml,
RecipeList.from_yaml(), and examples/configs/curated/ and the note about
stripping speculative_config; keep references to references/architecture.md and
references/templates.md unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbcddd23-5df0-403c-ad4c-2fb9fb744118

📥 Commits

Reviewing files that changed from the base of the PR and between 35ccedd and 65d2ada.

📒 Files selected for processing (4)
  • .claude/skills/basic-agg-singlenode-config-guide/SKILL.md
  • .claude/skills/basic-agg-singlenode-config-guide/references/architecture.md
  • .claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md
  • .claude/skills/basic-agg-singlenode-config-guide/references/templates.md

Comment thread .claude/skills/basic-agg-singlenode-config-guide/references/templates.md Outdated
Comment thread .claude/skills/basic-agg-singlenode-config-guide/SKILL.md Outdated
@venkywonka venkywonka marked this pull request as draft March 11, 2026 16:11
Comment thread .claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md Outdated
@venkywonka venkywonka marked this pull request as ready for review March 20, 2026 05:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Claude skill that guides users to generate source-backed trtllm-serve --config YAMLs for basic aggregate, single-node (PyTorch backend) serving, prioritizing checked-in config databases and deployment docs.

Changes:

  • Introduces the basic-agg-singlenode-config-guide skill with a step-by-step decision flow, constraints (incl. MTP carveout), and validation checklist.
  • Adds a reference mapping of model families to the repo’s primary config/doc sources.
  • Adds a reference doc of tuning knobs with “source-backed” guidance plus clearly-labeled fallback heuristics derived from tensorrt_llm/bench/.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
.claude/skills/basic-agg-singlenode-config-guide/SKILL.md Main skill prompt defining scope, constraints, workflow, and validation checklist for config selection/tuning.
.claude/skills/basic-agg-singlenode-config-guide/references/architecture.md Model-family → primary checked-in sources mapping used by the skill’s doc-reading step.
.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md Guidance on commonly tuned config fields and “fallback only” bench-derived heuristics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .claude/skills/basic-agg-singlenode-config-guide/references/architecture.md Outdated
Comment thread .claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md Outdated
Comment thread .claude/skills/basic-agg-singlenode-config-guide/SKILL.md Outdated
@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39772 [ run ] triggered by Bot. Commit: 59347a8 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39772 [ run ] completed with state FAILURE. Commit: 59347a8
/LLM/main/L0_MergeRequest_PR pipeline #30966 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread .claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md Outdated
@venkywonka venkywonka requested a review from a team as a code owner March 25, 2026 20:46
@venkywonka venkywonka requested a review from chang-l March 25, 2026 20:46
@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot kill

@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40370 [ run ] triggered by Bot. Commit: 52ae5e5 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40371 [ kill ] triggered by Bot. Commit: 52ae5e5 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40370 [ run ] completed with state ABORTED. Commit: 52ae5e5

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40371 [ kill ] completed with state SUCCESS. Commit: 52ae5e5
Successfully killed previous jobs for commit 52ae5e5

Link to invocation

@venkywonka
Copy link
Copy Markdown
Collaborator Author

Skill evaluation summary

Ran a structured eval loop (3 test dimensions + hold-out golden configs) comparing with-skill vs without-skill (baseline) outputs.

Where the skill clearly helps:

  • Exact match retrieval — finds the right checked-in config from the database. This is the primary use case for supported models with existing configs.
  • Guardrails — catches max_num_tokens < ISL (would silently reject requests), excludes legacy TRT-engine docs, avoids hallucinating config paths or claiming interpolated values are source-backed.
  • Starting point for interpolation — when no exact match exists, produces a safe config that won't OOM or reject on first launch, with a clear "what to benchmark" list.

Where both skill and baseline perform equally:

  • Structural decisions (TP, EP, max_seq_len) — both get these right
  • Both refuse to fabricate configs for entirely unsupported models

What the skill does NOT do:

  • Produce production-optimal configs for unseen models. Comparison against a hold-out set of known-good unpublished configs showed the skill's interpolated values are safe starting points but not optimal — real sweep data beats heuristics every time.
  • Replace benchmarking. The response format explicitly frames interpolated configs as "starting point + what to benchmark."

Key improvements driven by eval:

  • Legacy docs exclusion (baseline cited docs/source/legacy/ TRT-engine tuning data as PyTorch backend guidance)
  • max_batch_size is a scheduler ceiling, not a memory reservation — don't reduce it from source
  • moe_config.backend depends on the specific model, not just GPU+precision — mark as unverified when interpolating
  • cuda_graph_config.max_batch_size should reflect expected peak concurrency, not server max_batch_size
  • Simplified response format: exact match gives config directly; interpolation gives starting point + what to benchmark

@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot kill

@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40976 [ kill ] triggered by Bot. Commit: a75d419 Link to invocation

venkywonka and others added 19 commits March 31, 2026 21:37
Light skill (~250 lines, 4 files) for tuning basic aggregate (IFB,
single node, PyTorch, non-speculative) TRT-LLM serving configs.
Covers architecture-class taxonomy (MoE_MLA, MoE_GQA, Dense_GQA),
4 critical knobs with empirical heuristics from 72K+ IBDB records,
and starter templates with validation checklists.

Signed-off-by: Venkatesh Gururajan <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Light skill (~250 lines, 4 files) for tuning basic aggregate (IFB,
single node, PyTorch, non-speculative) TRT-LLM serving configs.
Covers architecture-class taxonomy (MoE_MLA, MoE_GQA, Dense_GQA),
4 critical knobs with empirical heuristics from 72K+ benchmark
records, and starter templates with validation checklists.

Signed-off-by: Venkatesh Gururajan <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Add negative triggers (multi-node, speculative, disaggregated, TensorRT
engine) and differentiation from trtllm-config-generator to the
frontmatter description. This ensures Claude picks the right skill when
both are available.

Signed-off-by: Venkatesh Gururajan <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venkatesh Gururajan <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Preserve workload intent when selecting checked-in configs and keep the guidance source-backed for model-specific exceptions like Kimi-K2.

Signed-off-by: Venky <[email protected]>
Keep the basic aggregate config skill aligned with current checked-in DeepSeek-R1 serve recipes while preserving its conservative non-spec default.

Signed-off-by: Venky <[email protected]>
Keep exact source-backed config selection for in-scope requests while making adjacent out-of-scope answers more useful. This preserves trust by clearly separating verified guidance from assembled best guesses instead of failing closed.

Signed-off-by: Venky <[email protected]>
…ation

Define constraints once in a canonical section and reference by label
throughout steps and checklist, reducing MTP exception from ~22
restatements to 5 touchpoints. Delete dead templates.md, make SKILL.md
delegate to architecture.md for model mapping, trim out-of-scope
handling and knob-heuristics.md duplicated sections. Add Llama-3.3-70B
coverage and Qwen3 disagg filter note. Total: 323 -> 182 lines (-44%).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: venkywonka <[email protected]>
Add a "Generally Observed Patterns" section with architecture-level
directional hints for the five most impactful knobs (enable_attention_dp,
moe_config.backend, kv_cache_config, max_num_tokens, max_batch_size).
Enrich the Commonly Tuned Fields table with brief directional context.
All hints are directional only (no hard numbers) and defer to checked-in
configs and deployment guides.

Signed-off-by: venkywonka <[email protected]>
…ss definitions

- Expand IFB, ISL, OSL acronyms on first mention in SKILL.md (Copilot review)
- Expand ADP abbreviation on first use in knob-heuristics.md (Copilot review)
- Strengthen enable_attention_dp as high-throughput-only knob (venkywonka review)
- Add Architecture Classes section with MoE/Dense and MLA/GQA definitions
  to architecture.md (Copilot review)

Signed-off-by: venkywonka <[email protected]>
…canonicalize MTP, cut prompt

Signed-off-by: venkywonka <[email protected]>
…tch_size as scheduler ceiling, MoE backend uncertainty, max_seq_len and cuda_graph guidance

Signed-off-by: venkywonka <[email protected]>
…irectly, interpolation gives starting point + what to benchmark

Signed-off-by: venkywonka <[email protected]>
… fix harmful heuristics

Delete architecture.md (model-to-source mapping table — agent can ls/grep the
repo). Fuse its 2 non-obvious pieces (legacy exclusion, V3 caveat) into
SKILL.md Step 3. Remove redundant checklist items, Repo Resources table,
low-value knob-heuristics rows, OOM Triage section, and chunked prefill
duplication.

Fix 2 heuristics that golden holdout eval showed were actively harmful:
- cuda_graph_config.max_batch_size: "next power of 2 above concurrency" →
  "default to max_batch_size" (wrong in 3/3 golden configs)
- max_batch_size: clarify NOT proportional to concurrency, copy from source
  (misinterpreted in 2/3 golden configs)

Net: -55 lines across 3 files (182 → 127), 0 knowledge lost.
Signed-off-by: venkywonka <[email protected]>
@venkywonka venkywonka force-pushed the venky/add-basic-agg-config-skills branch from a75d419 to 4f78207 Compare April 1, 2026 04:37
@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41103 [ run ] triggered by Bot. Commit: 4f78207 Link to invocation

@venkywonka venkywonka enabled auto-merge (squash) April 1, 2026 05:14
@venkywonka
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "no L0 CI coverage for skill changes"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41124 [ skip ] triggered by Bot. Commit: 4f78207 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41103 [ run ] completed with state ABORTED. Commit: 4f78207
/LLM/main/L0_MergeRequest_PR pipeline #32077 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41124 [ skip ] completed with state SUCCESS. Commit: 4f78207
Skipping testing for commit 4f78207

Link to invocation

@venkywonka venkywonka merged commit bb06399 into NVIDIA:main Apr 1, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
… serving configs (NVIDIA#12054)

Signed-off-by: venkywonka <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants