[None][feat] serve-config-guide skill for basic aggregate single-node serving configs#12054
Conversation
📝 WalkthroughWalkthroughThis pull request adds a comprehensive skill guide for configuring basic aggregate single-node serving with PyTorch backend. It includes step-by-step instructions, architecture classification logic, tuning heuristics for critical configuration knobs, and template configurations organized across a main guide file and three reference documents. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md (1)
44-56: Clarify relationship between formula-based and fixed-value guidance.Lines 50-52 recommend "ISL to 2x ISL" as a sweet spot, but line 54 mentions "some models use fixed values (e.g., 1152 for DeepSeek-R1)." This creates ambiguity: when should users apply the formula versus using a fixed value? Consider adding explicit precedence: "Check database configs first for model-specific fixed values; if not found, use the ISL-based formula."
📋 Proposed clarification
**Rules**: +- **Check database configs first**: some models use validated fixed values (e.g., 1152 for DeepSeek-R1). +- If no database config exists, use the following formula: - Must be >= ISL + 64 (chat template overhead) when chunked prefill is disabled, or requests get HTTP 400. - Sweet spot: ISL to 2x ISL. - Low concurrency (<=32): `max(ISL, 2048)`. - High concurrency (>=64): `max(ISL * 2, 4096)`. - Never > ISL * 4 — wastes activation buffer, counterintuitively shrinks KV cache. -- Check database configs first: some models use fixed values (e.g., 1152 for DeepSeek-R1).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md around lines 44 - 56, The guidance for max_num_tokens is ambiguous about when to use model-specific fixed values versus the ISL-based formula; update the text for max_num_tokens to explicitly state precedence: first check database configs or model-specific fixed values (e.g., DeepSeek-R1 = 1152) and use those if present, otherwise compute the scheduler token budget using the ISL-based heuristics (sweet spot ISL–2x ISL, low/high concurrency rules, never > ISL*4), and note that fixed values override the formula..claude/skills/basic-agg-singlenode-config-guide/SKILL.md (1)
23-29: Consider using approximate counts to reduce maintenance.Lines 27-28 cite specific config counts ("169 benchmark-optimal configs", "10 production configs") that may become stale as the repository evolves. Consider rephrasing to "100+ benchmark configs" and "~10 production configs" for resilience.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/basic-agg-singlenode-config-guide/SKILL.md around lines 23 - 29, The document names exact counts ("169 benchmark-optimal configs", "10 production configs") which will become stale; update the wording in SKILL.md near the Database and Curated bullets to use approximate counts like "100+ benchmark configs" and "~10 production configs" (or similar) while preserving the pointers to examples/configs/database/lookup.yaml, RecipeList.from_yaml(), and examples/configs/curated/ and the note about stripping speculative_config; keep references to references/architecture.md and references/templates.md unchanged..claude/skills/basic-agg-singlenode-config-guide/references/templates.md (1)
7-24: Clarify the relationship between example values.The template shows
max_num_tokens: 2048andmax_seq_len: 2068, but the relationship isn't immediately clear. The comment states>= ISL + OSL + 68, which suggests ISL+OSL=2000 in this example. Consider adding a comment like# example assumes ISL+OSL=2000to help users understand how to adapt these values.📝 Proposed clarification
```yaml backend: pytorch -max_num_tokens: 2048 # adjust: see knob-heuristics.md §max_num_tokens -max_seq_len: 2068 # >= ISL + OSL + 68 +max_num_tokens: 2048 # adjust: see knob-heuristics.md §max_num_tokens (example: ISL=1k, OSL=1k) +max_seq_len: 2068 # >= ISL + OSL + 68 (2000 + 68 in this example)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md around lines 7 - 24, Add a clarifying comment that links the example values for max_num_tokens and max_seq_len to an assumed ISL+OSL total so readers know how the numbers were derived: update the comment next to max_num_tokens and max_seq_len (the YAML keys) to note the example uses ISL=1k and OSL=1k (ISL+OSL=2000) and show that max_seq_len = ISL+OSL+68 (2000+68=2068) so users can adjust those fields accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md:
- Around line 26-45: The MoE+GQA template currently sets max_batch_size: 256 but
conflicts with architecture.md which caps MoE models at 512; update the template
so max_batch_size: 512 for the MoE + GQA block (and remove/adjust the incorrect
Dense comment that says "(cap 512 for MoE)" in the Dense section), or
alternatively add an explicit inline note in the MoE+GQA template explaining
that 256 is an intentional conservative value for MoE_GQA — be explicit about
which variant the cap applies to (MoE_MLA vs MoE_GQA) and reference the
max_batch_size key and the Dense comment so readers are not confused.
---
Nitpick comments:
In
@.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md:
- Around line 44-56: The guidance for max_num_tokens is ambiguous about when to
use model-specific fixed values versus the ISL-based formula; update the text
for max_num_tokens to explicitly state precedence: first check database configs
or model-specific fixed values (e.g., DeepSeek-R1 = 1152) and use those if
present, otherwise compute the scheduler token budget using the ISL-based
heuristics (sweet spot ISL–2x ISL, low/high concurrency rules, never > ISL*4),
and note that fixed values override the formula.
In @.claude/skills/basic-agg-singlenode-config-guide/references/templates.md:
- Around line 7-24: Add a clarifying comment that links the example values for
max_num_tokens and max_seq_len to an assumed ISL+OSL total so readers know how
the numbers were derived: update the comment next to max_num_tokens and
max_seq_len (the YAML keys) to note the example uses ISL=1k and OSL=1k
(ISL+OSL=2000) and show that max_seq_len = ISL+OSL+68 (2000+68=2068) so users
can adjust those fields accordingly.
In @.claude/skills/basic-agg-singlenode-config-guide/SKILL.md:
- Around line 23-29: The document names exact counts ("169 benchmark-optimal
configs", "10 production configs") which will become stale; update the wording
in SKILL.md near the Database and Curated bullets to use approximate counts like
"100+ benchmark configs" and "~10 production configs" (or similar) while
preserving the pointers to examples/configs/database/lookup.yaml,
RecipeList.from_yaml(), and examples/configs/curated/ and the note about
stripping speculative_config; keep references to references/architecture.md and
references/templates.md unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bbcddd23-5df0-403c-ad4c-2fb9fb744118
📒 Files selected for processing (4)
.claude/skills/basic-agg-singlenode-config-guide/SKILL.md.claude/skills/basic-agg-singlenode-config-guide/references/architecture.md.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md.claude/skills/basic-agg-singlenode-config-guide/references/templates.md
There was a problem hiding this comment.
Pull request overview
Adds a new Claude skill that guides users to generate source-backed trtllm-serve --config YAMLs for basic aggregate, single-node (PyTorch backend) serving, prioritizing checked-in config databases and deployment docs.
Changes:
- Introduces the
basic-agg-singlenode-config-guideskill with a step-by-step decision flow, constraints (incl. MTP carveout), and validation checklist. - Adds a reference mapping of model families to the repo’s primary config/doc sources.
- Adds a reference doc of tuning knobs with “source-backed” guidance plus clearly-labeled fallback heuristics derived from
tensorrt_llm/bench/.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
.claude/skills/basic-agg-singlenode-config-guide/SKILL.md |
Main skill prompt defining scope, constraints, workflow, and validation checklist for config selection/tuning. |
.claude/skills/basic-agg-singlenode-config-guide/references/architecture.md |
Model-family → primary checked-in sources mapping used by the skill’s doc-reading step. |
.claude/skills/basic-agg-singlenode-config-guide/references/knob-heuristics.md |
Guidance on commonly tuned config fields and “fallback only” bench-derived heuristics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/bot run |
|
PR_Github #39772 [ run ] triggered by Bot. Commit: |
|
PR_Github #39772 [ run ] completed with state
|
|
/bot kill |
|
/bot run |
|
PR_Github #40370 [ run ] triggered by Bot. Commit: |
|
PR_Github #40371 [ kill ] triggered by Bot. Commit: |
|
PR_Github #40370 [ run ] completed with state |
|
PR_Github #40371 [ kill ] completed with state |
Skill evaluation summaryRan a structured eval loop (3 test dimensions + hold-out golden configs) comparing with-skill vs without-skill (baseline) outputs. Where the skill clearly helps:
Where both skill and baseline perform equally:
What the skill does NOT do:
Key improvements driven by eval:
|
|
/bot kill |
|
/bot run |
|
PR_Github #40976 [ kill ] triggered by Bot. Commit: |
Light skill (~250 lines, 4 files) for tuning basic aggregate (IFB, single node, PyTorch, non-speculative) TRT-LLM serving configs. Covers architecture-class taxonomy (MoE_MLA, MoE_GQA, Dense_GQA), 4 critical knobs with empirical heuristics from 72K+ IBDB records, and starter templates with validation checklists. Signed-off-by: Venkatesh Gururajan <[email protected]> Signed-off-by: Venky Ganesh <[email protected]>
Light skill (~250 lines, 4 files) for tuning basic aggregate (IFB, single node, PyTorch, non-speculative) TRT-LLM serving configs. Covers architecture-class taxonomy (MoE_MLA, MoE_GQA, Dense_GQA), 4 critical knobs with empirical heuristics from 72K+ benchmark records, and starter templates with validation checklists. Signed-off-by: Venkatesh Gururajan <[email protected]> Signed-off-by: Venky Ganesh <[email protected]>
Add negative triggers (multi-node, speculative, disaggregated, TensorRT engine) and differentiation from trtllm-config-generator to the frontmatter description. This ensures Claude picks the right skill when both are available. Signed-off-by: Venkatesh Gururajan <[email protected]> Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venkatesh Gururajan <[email protected]> Signed-off-by: Venky Ganesh <[email protected]>
Preserve workload intent when selecting checked-in configs and keep the guidance source-backed for model-specific exceptions like Kimi-K2. Signed-off-by: Venky <[email protected]>
Keep the basic aggregate config skill aligned with current checked-in DeepSeek-R1 serve recipes while preserving its conservative non-spec default. Signed-off-by: Venky <[email protected]>
Keep exact source-backed config selection for in-scope requests while making adjacent out-of-scope answers more useful. This preserves trust by clearly separating verified guidance from assembled best guesses instead of failing closed. Signed-off-by: Venky <[email protected]>
…ation Define constraints once in a canonical section and reference by label throughout steps and checklist, reducing MTP exception from ~22 restatements to 5 touchpoints. Delete dead templates.md, make SKILL.md delegate to architecture.md for model mapping, trim out-of-scope handling and knob-heuristics.md duplicated sections. Add Llama-3.3-70B coverage and Qwen3 disagg filter note. Total: 323 -> 182 lines (-44%). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: venkywonka <[email protected]>
Add a "Generally Observed Patterns" section with architecture-level directional hints for the five most impactful knobs (enable_attention_dp, moe_config.backend, kv_cache_config, max_num_tokens, max_batch_size). Enrich the Commonly Tuned Fields table with brief directional context. All hints are directional only (no hard numbers) and defer to checked-in configs and deployment guides. Signed-off-by: venkywonka <[email protected]>
Signed-off-by: venkywonka <[email protected]>
…ss definitions - Expand IFB, ISL, OSL acronyms on first mention in SKILL.md (Copilot review) - Expand ADP abbreviation on first use in knob-heuristics.md (Copilot review) - Strengthen enable_attention_dp as high-throughput-only knob (venkywonka review) - Add Architecture Classes section with MoE/Dense and MLA/GQA definitions to architecture.md (Copilot review) Signed-off-by: venkywonka <[email protected]>
…rence Signed-off-by: venkywonka <[email protected]>
…rchitecture reference Signed-off-by: venkywonka <[email protected]>
…les to knob-heuristics Signed-off-by: venkywonka <[email protected]>
…to serve-config-guide Signed-off-by: venkywonka <[email protected]>
…canonicalize MTP, cut prompt Signed-off-by: venkywonka <[email protected]>
…tch_size as scheduler ceiling, MoE backend uncertainty, max_seq_len and cuda_graph guidance Signed-off-by: venkywonka <[email protected]>
…irectly, interpolation gives starting point + what to benchmark Signed-off-by: venkywonka <[email protected]>
… fix harmful heuristics Delete architecture.md (model-to-source mapping table — agent can ls/grep the repo). Fuse its 2 non-obvious pieces (legacy exclusion, V3 caveat) into SKILL.md Step 3. Remove redundant checklist items, Repo Resources table, low-value knob-heuristics rows, OOM Triage section, and chunked prefill duplication. Fix 2 heuristics that golden holdout eval showed were actively harmful: - cuda_graph_config.max_batch_size: "next power of 2 above concurrency" → "default to max_batch_size" (wrong in 3/3 golden configs) - max_batch_size: clarify NOT proportional to concurrency, copy from source (misinterpreted in 2/3 golden configs) Net: -55 lines across 3 files (182 → 127), 0 knowledge lost. Signed-off-by: venkywonka <[email protected]>
a75d419 to
4f78207
Compare
|
/bot run |
|
PR_Github #41103 [ run ] triggered by Bot. Commit: |
|
/bot skip --comment "no L0 CI coverage for skill changes" |
|
PR_Github #41124 [ skip ] triggered by Bot. Commit: |
|
PR_Github #41103 [ run ] completed with state
|
|
PR_Github #41124 [ skip ] completed with state |
… serving configs (NVIDIA#12054) Signed-off-by: venkywonka <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Summary
Claude Code skill (2 files, ~127 lines) that generates source-backed starting
trtllm-serve --configYAMLs for basic aggregate (IFB, single-node, PyTorch backend) serving. Iteratively refined through 4 eval iterations against golden holdout configs from IBDB.What it does:
examples/configs/database/lookup.yamlthenexamples/configs/curated/lookup.yamlbefore interpolatingKey design choices validated by eval:
max_batch_size: scheduler ceiling, NOT proportional to concurrency — copy from nearest source configcuda_graph_config.max_batch_size: default tomax_batch_size(not concurrency-proportional — eval showed the old heuristic was wrong in 3/3 golden configs)num_kv_heads == num_attention_heads)docs/source/legacy/values don't transfer to PyTorch backendFiles
SKILL.mdreferences/knob-heuristics.mdreferences/architecture.mdwas deleted — its model-to-source mapping table was derivable by the agent vials/grep; the 2 non-obvious pieces (legacy exclusion, DeepSeek-V3 caveat) were inlined into Step 3.Test plan