fix(html): align walkHtmlTokens with WHATWG spec and swc_html_parser #21000

Merged
alexander-akait merged 14 commits into main from claude/align-html-lexer-jo2kB
May 21, 2026

Conversation

@alexander-akait
Member

- Drop the length cap on the script-data-double-escape-end tempBuffer
  so end tags like `</scripts>` no longer falsely match `"script"` and
  prematurely exit double-escaped mode.
- Update `tagStart` when entering the less-than-sign state from any
  script-data escaped state so a matching `</script>` close tag reached
  via SCRIPT_DATA_ESCAPED preserves the preceding script body in the
  emitted text span.
- Flush pending text when a content-mode close tag is detected via the
  attribute path (`</title foo>`, `</script bar=baz>`), not only via
  the direct `>` path.
- Track `commentStart` from the markup-declaration-open transition so
  EOF inside an incomplete `<!…` emits the correct byte range.
- Emit the missing-attribute-value form (`<a foo=>`) so the open-tag
  byte range still includes the `>`.
- Implement spec-correct named character reference matching by adding
  the full WHATWG named character references table (generated via
  `tooling/generate-html-entities.js`) and restoring the
  STATE_AMBIGUOUS_AMPERSAND fallthrough. `decodeHtmlEntities` now
  handles the full table, including legacy bare forms (`&AMP`,
  `&copy`) and multi-code-point entities (`&NotEqualTilde;`), and
  applies WHATWG longest-prefix backtrack (`&notpre;` -> `¬pre;`).

Adds 199 unit tests covering every state-machine branch in the
tokenizer and reaches 100% statements/branches/functions/lines coverage
on lib/html/walkHtmlTokens.js.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
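
The longest-prefix backtrack mentioned in the last bullet can be illustrated with a small sketch. This is not the PR's implementation: `SAMPLE_ENTITIES` and `decodeNamedSketch` are hypothetical names, and the table holds only a handful of entries instead of the full WHATWG set.

```javascript
// Tiny stand-in for the full WHATWG table (assumption: only a few entries,
// chosen to show legacy bare forms and the backtrack).
const SAMPLE_ENTITIES = Object.assign(Object.create(null), {
  "amp;": "&",
  amp: "&",
  "not;": "\u00AC",
  not: "\u00AC",
  "notin;": "\u2209",
  "copy;": "\u00A9",
  copy: "\u00A9"
});

/**
 * Hypothetical decoder: tries the longest candidate name first and drops one
 * trailing character at a time until a table entry matches.
 */
function decodeNamedSketch(str) {
  return str.replace(/&([a-zA-Z][0-9a-zA-Z]*;?)/g, (match, name) => {
    for (let i = name.length; i > 0; i--) {
      const value = SAMPLE_ENTITIES[name.slice(0, i)];
      // Keep whatever followed the matched prefix as literal text.
      if (value !== undefined) return value + name.slice(i);
    }
    return match; // no prefix matched: leave the text untouched
  });
}
```

With this sample table, `&notpre;` backtracks from `notpre;` down to the legacy bare form `not`, so the result is `¬pre;`, while `&notin;` matches in full.
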
Switch lib/html/htmlEntities.js from a 40 KB frozen object literal to
a 28 KB delta-encoded string decoded lazily on first use.

- Each record is `<sharedByte><suffix>\t<valueLenByte><value>` with
  1-byte length prefixes for shared-prefix count and value length.
- Names sorted alphabetically so adjacent entries share long prefixes.
- Length-prefixed values mean records can contain raw `\t` or `\n`
  (as in `&Tab;` and `&NewLine;`) without escaping.
- Module exports a getter function; callers cache the table locally so
  the decode runs at most once per `decodeHtmlEntities` invocation.

Also exclude the generated file from the eslint config so prettier
doesn't try to wrap the long source string.

199/199 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
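
A minimal sketch of the record layout described above, assuming each length prefix is stored as a single character whose char code is the count. `encodeEntities` and `decodeEntities` are hypothetical names, not the generator's actual functions.

```javascript
/**
 * Hypothetical encoder: entries is an array of [name, value] pairs, ideally
 * sorted by name so neighbours share long prefixes.
 */
function encodeEntities(entries) {
  let out = "";
  let prev = "";
  for (const [name, value] of entries) {
    let shared = 0;
    while (
      shared < prev.length &&
      shared < name.length &&
      prev[shared] === name[shared]
    ) {
      shared++;
    }
    // <sharedByte><suffix>\t<valueLenByte><value>
    out +=
      String.fromCharCode(shared) +
      name.slice(shared) +
      "\t" +
      String.fromCharCode(value.length) +
      value;
    prev = name;
  }
  return out;
}

/** Hypothetical decoder: rebuilds the table from the encoded string. */
function decodeEntities(encoded) {
  const table = new Map();
  let prev = "";
  let pos = 0;
  while (pos < encoded.length) {
    const shared = encoded.charCodeAt(pos++);
    // Names never contain a tab, so the first tab after the prefix byte
    // terminates the suffix.
    const tab = encoded.indexOf("\t", pos);
    const name = prev.slice(0, shared) + encoded.slice(pos, tab);
    const valueLen = encoded.charCodeAt(tab + 1);
    const valueStart = tab + 2;
    // The value is read by length, so it may itself be "\t" or "\n".
    table.set(name, encoded.slice(valueStart, valueStart + valueLen));
    pos = valueStart + valueLen;
    prev = name;
  }
  return table;
}
```

Because the value is length-prefixed, a record such as `Tab;` whose value is a raw tab round-trips without any escaping.
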
…ties generator

Surface tokenizer parse errors via a new optional `parseError` callback on
`HtmlTokenCallbacks`. Severity is `"warning"` when the tokenizer recovers
and the emitted token is still well-formed (missing-attribute-value,
unexpected-equals-sign-before-attribute-name, abrupt-doctype-*, etc.) and
`"error"` when the emitted token byte range is incomplete (eof-in-tag,
eof-in-comment, eof-in-doctype, eof-in-cdata,
eof-in-script-html-comment-like-text).

Fix the EOF-in-tag deviation: previously a half-finished tag at EOF
(`<div class="x`) was silently emitted as text. The tokenizer now emits
the partial open/close tag at EOF and reports `eof-in-tag` so consumers
still see a tag callback.

Fix two small spec deviations: `STATE_COMMENT_START` and
`STATE_COMMENT_START_DASH` now correctly reconsume on the anything-else
branch (they previously consumed the character, which prevented the
comment-less-than-sign chain from firing). Also correct a misleading
code comment in `STATE_COMMENT_END`.

Make `emitAttribute` advance `pos` past the closing quote by default when
no `attribute` callback is provided, so callers that only need the error
stream don't have to wire an attribute handler just to keep parsing.

Restructure the entities generator to follow the same pattern as
`tooling/generate-runtime-code.js`: vendor the WHATWG entities.json
locally at `tooling/html-entities.json`, support `--write` for in-place
generation, and wire into `lint:special` / `fix:special` so the generated
`lib/html/htmlEntities.js` is verified by CI. A separate `--fetch` flag
refreshes the vendored JSON from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
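
The severity split described above could be mirrored on the consumer side roughly as follows. This is only a sketch: the tokenizer assigns severity internally, the real `parseError` callback signature is not shown in this conversation, and `severityFor` / `collectParseErrors` are hypothetical helpers.

```javascript
// Codes the commit message lists as leaving an incomplete token range.
const INCOMPLETE_TOKEN_CODES = new Set([
  "eof-in-tag",
  "eof-in-comment",
  "eof-in-doctype",
  "eof-in-cdata",
  "eof-in-script-html-comment-like-text"
]);

/** Hypothetical helper: "error" for incomplete tokens, "warning" otherwise. */
function severityFor(code) {
  return INCOMPLETE_TOKEN_CODES.has(code) ? "error" : "warning";
}

/** Hypothetical consumer: buckets reported parse-error codes by severity. */
function collectParseErrors(codes) {
  const report = { warning: [], error: [] };
  for (const code of codes) report[severityFor(code)].push(code);
  return report;
}
```
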
Drop `lib/html/htmlEntities.js` and inline the delta-encoded entity table
plus the lazy `getHtmlEntities` getter into a `// #region html entities`
block at the top of `lib/html/walkHtmlTokens.js`.

The generator (`tooling/generate-html-entities.js`) now splices the
region into `walkHtmlTokens.js` directly — mirroring how
`tooling/generate-runtime-code.js` operates on `lib/util/semver.js`.
`yarn lint:special` verifies the inlined block is in sync;
`yarn fix:special` writes it in place. `--fetch` still refreshes the
vendored `tooling/html-entities.json` from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
… literal

The lazy `getHtmlEntities()` getter was building a 2231-entry Map on
first use by parsing a delta-encoded string with one `indexOf` and one
`slice` per entry. Replace it with a single `Object.freeze({...})`
literal so the JS engine builds the table once at module-load time and
all subsequent lookups are direct property access.

The literal lives inside the existing `// #region html entities` block
and is preceded by `// prettier-ignore` / `// cspell:disable-next-line`
so the long line is not wrapped on regeneration. File size grows from
~140 KB to ~143 KB (Object literal vs delta-encoded string), in
exchange for removing the per-first-call decode work and the
`htmlEntityCache` / `getHtmlEntities` scaffolding.

Update the two callers in `walkHtmlTokens.js` (NAMED_CHARACTER_REFERENCE
state and `decodeHtmlEntities`) to read from `HTML_ENTITIES` directly.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
…gus-comment

Three EOF-path fixes surfaced by an independent spec audit:

- `STATE_ATTRIBUTE_NAME` at EOF previously emitted an attribute with
  stale or `-1` `attrNameEnd` because the state machine only assigns
  `attrNameEnd` on the transitions out of attribute-name. Set it to
  `len` in the EOF block so `<div data-x` at EOF emits attribute
  `data-x` with the correct byte range.

- `STATE_CHARACTER_REFERENCE`-family (states 71–79) at EOF previously
  fell straight to the trailing-text branch, skipping the `returnState`.
  When the return state was inside an attribute value (e.g.
  `<a href="x&amp` at EOF), neither the partial tag nor `eof-in-tag`
  was emitted. Now the EOF block unwinds `state` to `returnState`
  before the existing in-tag check, so the partial open tag is emitted
  and `eof-in-tag` is reported.

- `STATE_BOGUS_COMMENT` at EOF previously reported `eof-in-comment` as
  an error, but the spec defines no parse error for this case (the
  comment is emitted cleanly with the EOF token). Drop the error report
  for bogus-comment-at-EOF while keeping it for the real comment states.

229/229 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
Three remaining spec-alignment items from the recent audit:

- decodeHtmlEntities now applies the WHATWG numeric-character-reference
  remap rules: 0x00, surrogates (0xD800-0xDFFF), and code points above
  0x10FFFF return U+FFFD; the 0x80-0x9F C1 range remaps through the
  Windows-1252 table per spec (e.g. `&#x80;` -> euro, `&#x99;` -> tm).
- decodeHtmlEntities accepts a new optional `isAttribute` flag that
  applies the WHATWG consumed-as-part-of-an-attribute rule. With
  `isAttribute` set to `true`, a named entity without a trailing `;` whose
  next character is `=` (or whose longest-prefix backtrack leaves
  trailing alphanumerics) stays literal:
  `decodeHtmlEntities("&amp=foo", true)` returns `&amp=foo`; without
  the flag it decodes to `&=foo`.
- Tokenizer now fires the `end-tag-with-attributes` parse error
  (severity `"warning"`) when an end tag like `</div foo>` emits.
  Tracked via a `tagHasAttributes` flag set in `emitAttribute` and
  cleared in `emitOpenTag` / `emitCloseTag`.

235/235 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
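
The numeric-reference remap rules in the first bullet can be sketched as below. `decodeNumericSketch` is a hypothetical name rather than the function in `walkHtmlTokens.js`; the C1 table follows the WHATWG numeric character reference end state.

```javascript
// Windows-1252 remap for the C1 range, per the WHATWG spec table.
// 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no entry and pass through unchanged.
const C1_REMAP = new Map([
  [0x80, 0x20ac], [0x82, 0x201a], [0x83, 0x0192], [0x84, 0x201e],
  [0x85, 0x2026], [0x86, 0x2020], [0x87, 0x2021], [0x88, 0x02c6],
  [0x89, 0x2030], [0x8a, 0x0160], [0x8b, 0x2039], [0x8c, 0x0152],
  [0x8e, 0x017d], [0x91, 0x2018], [0x92, 0x2019], [0x93, 0x201c],
  [0x94, 0x201d], [0x95, 0x2022], [0x96, 0x2013], [0x97, 0x2014],
  [0x98, 0x02dc], [0x99, 0x2122], [0x9a, 0x0161], [0x9b, 0x203a],
  [0x9c, 0x0153], [0x9e, 0x017e], [0x9f, 0x0178]
]);

/** Hypothetical sketch: maps a parsed numeric reference to its character. */
function decodeNumericSketch(code) {
  // Null, out-of-range and surrogate code points become U+FFFD.
  if (code === 0 || code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff)) {
    return "\uFFFD";
  }
  const remapped = C1_REMAP.get(code);
  return String.fromCodePoint(remapped !== undefined ? remapped : code);
}
```
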
Copilot AI review requested due to automatic review settings May 20, 2026 21:26
@changeset-bot

changeset-bot Bot commented May 20, 2026

🦋 Changeset detected

Latest commit: 3effe59

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| webpack | Patch |

@github-actions
Contributor

github-actions Bot commented May 20, 2026

This PR is packaged and the instant preview is available (1097a7f).

Install it locally:

  • npm
npm i -D webpack@https://pkg.pr.new/webpack@1097a7f
  • yarn
yarn add -D webpack@https://pkg.pr.new/webpack@1097a7f
  • pnpm
pnpm add -D webpack@https://pkg.pr.new/webpack@1097a7f

@codecov

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.60%. Comparing base (a39f2d3) to head (3effe59).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21000      +/-   ##
==========================================
+ Coverage   90.94%   91.60%   +0.66%     
==========================================
  Files         573      573              
  Lines       58940    59120     +180     
  Branches    15888    15948      +60     
==========================================
+ Hits        53601    54159     +558     
+ Misses       5339     4961     -378     
| Flag | Coverage Δ |
| --- | --- |
| integration | 89.62% <43.93%> (-0.10%) ⬇️ |
| test262 | 45.37% <ø> (+<0.01%) ⬆️ |
| unit | 37.89% <100.00%> (+1.30%) ⬆️ |

Contributor

Copilot AI left a comment

Pull request overview

This PR updates webpack’s experimental HTML tokenizer (walkHtmlTokens) to more closely follow the WHATWG HTML tokenization spec and match behavior of swc_html_parser, including fixing multiple byte-range edge cases and expanding entity decoding to the full named character reference table.

Changes:

  • Align tokenizer state-machine behavior for script-data escaping, content-mode close tags via attribute paths, markup-declaration EOF handling, and missing attribute values; add a new parseError callback with "warning"/"error" severities.
  • Implement spec-correct named character reference matching (full WHATWG entities table + longest-prefix backtrack) and upgrade decodeHtmlEntities accordingly.
  • Add a generator (tooling/generate-html-entities.js) plus extensive unit tests (targeting full branch coverage).

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `lib/html/walkHtmlTokens.js` | Main tokenizer updates: state-machine fixes, new `parseError` callback, full WHATWG entity table + improved entity decoding, and expanded EOF handling. |
| `tooling/generate-html-entities.js` | Generates the inlined `HTML_ENTITIES` region in `walkHtmlTokens.js` from `tooling/html-entities.json`. |
| `package.json` | Hooks the generator into `lint:special` / `fix:special` workflows. |
| `tooling/html-entities.json` | Vendored WHATWG named character references table used by the generator. |
| `test/walkHtmlTokens.unittest.js` | Adds regression tests and broad state-machine branch coverage + `parseError`/`decodeHtmlEntities` tests. |
| `cspell.json` | Excludes `tooling/html-entities.json` from spellchecking. |
| `.changeset/align-html-lexer-script-data.md` | Patch changeset documenting tokenizer alignment, new callback, and full entities support. |

Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines 2694 to 2700
tempBuffer += String.fromCharCode(cc + 0x20);
}
tempBuffer += String.fromCharCode(cc + 0x20);
pos++;
Comment thread lib/html/walkHtmlTokens.js Outdated
if (namedEntityConsumed > 33) break;
runLen++;
// Safety cap — the longest entity name is 32 chars (without `&`).
if (runLen > 32) break;
Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines +3177 to +3184
for (let i = name.length; i > 0; i--) {
const prefix = name.slice(0, i);
if (HTML_ENTITIES[prefix] !== undefined) {
// Attribute-context longest-prefix guard: if the matched
// prefix doesn't end with `;` and the leftover starts with
// an alphanumeric character, leave literal per WHATWG.
// (The regex greedy-consumes alphanumerics, so any leftover
// within `name` is itself alphanumeric — we only need to
- Fix tsc lint failure: cast the inlined `HTML_ENTITIES` to
  `Readonly<Record<string, string>>` so the two `HTML_ENTITIES[name]`
  call sites type-check (the frozen-literal type was too narrow).

- Copilot #1: replace the `tempBuffer` string in
  `STATE_SCRIPT_DATA_DOUBLE_ESCAPE_START` / `_END` with a small
  `scriptMatch` counter against the literal `"script"`. Worst-case
  inputs with very long ASCII-alpha runs after `</` no longer grow a
  buffer or do quadratic string concatenation. Vestigial
  `tempBuffer = ""` resets in the four content-mode less-than-sign
  states (which never read the buffer) are removed; the
  `tempBuffer` variable is gone.

- Copilot #2: introduce a `MAX_ENTITY_NAME_LEN = 32` constant
  (longest WHATWG entity name including the trailing `;`) and cap
  the named-character-reference alphanumeric run at
  `MAX_ENTITY_NAME_LEN - 1`. Replaces the off-by-one
  `if (runLen > 32) break` with a clearer in-loop bound.

- Copilot #3: cap the `decodeHtmlEntities` longest-prefix backtrack
  at `MAX_ENTITY_NAME_LEN`. Inputs like `&` followed by thousands
  of alphanumerics stay linear-time; anything past the cap is
  appended verbatim.

Adds a regression test that decodes `&` + 1000 chars to confirm the
decoder doesn't go quadratic.

236/236 tests pass, 100% coverage on `walkHtmlTokens.js`,
`yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
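
The counter-based matching from the "Copilot #1" item might look roughly like this. It is a sketch under assumptions: `isScriptTagName` is a hypothetical standalone function, whereas the PR keeps a `scriptMatch` counter inside the state machine.

```javascript
const SCRIPT = "script";

/**
 * Hypothetical sketch: scans the ASCII-alpha run starting at `pos` and
 * reports whether it is exactly "script" (case-insensitive). No buffer is
 * built, so a very long run costs O(n) time and O(1) memory.
 */
function isScriptTagName(input, pos) {
  let matched = 0; // how many leading characters matched "script" so far
  let diverged = false; // set once the run stops matching
  for (; pos < input.length; pos++) {
    let cc = input.charCodeAt(pos);
    if (cc >= 0x41 && cc <= 0x5a) cc += 0x20; // fold A-Z to a-z
    if (cc < 0x61 || cc > 0x7a) break; // end of the ASCII-alpha run
    if (!diverged && matched < SCRIPT.length && cc === SCRIPT.charCodeAt(matched)) {
      matched++;
    } else {
      diverged = true; // extra or mismatching letter, e.g. the "s" in "scripts"
    }
  }
  return !diverged && matched === SCRIPT.length;
}
```

An input such as `scripts>` is rejected because the seventh letter marks the run as diverged, which is the `</scripts>` case from the first commit.
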
@codspeed-hq

codspeed-hq Bot commented May 20, 2026

Merging this PR will improve performance by 21.08%

⚡ 6 improved benchmarks
❌ 4 regressed benchmarks
✅ 134 untouched benchmarks
⏩ 72 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Memory | benchmark "side-effects-reexport", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 859.4 KB | 406.4 KB | ×2.1 |
| Memory | benchmark "context-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 152.6 KB | 663.5 KB | -77% |
| Memory | benchmark "many-modules-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 288.8 KB | 142.8 KB | ×2 |
| Memory | benchmark "side-effects-reexport", scenario '{"name":"mode-development","mode":"development"}' | 3.9 MB | 4.9 MB | -21.65% |
| Memory | benchmark "cache-filesystem", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 843.9 KB | 164.7 KB | ×5.1 |
| Memory | benchmark "many-chunks-esm", scenario '{"name":"mode-production","mode":"production"}' | 10.8 MB | 7.8 MB | +37.82% |
| Memory | benchmark "asset-modules-bytes", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 323.2 KB | 134.5 KB | ×2.4 |
| Memory | benchmark "many-chunks-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 177 KB | 248.5 KB | -28.79% |
| Memory | benchmark "concatenate-modules", scenario '{"name":"mode-development","mode":"development"}' | 1,113.1 KB | 781.7 KB | +42.4% |
| Memory | benchmark "future-defaults", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 145.2 KB | 284.3 KB | -48.92% |

Comparing claude/align-html-lexer-jo2kB (3effe59) with main (294197c)

Footnotes

  1. 72 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines +3146 to +3161
// Match either `&name;` / `&name` (named refs may be legacy-bare per the
// WHATWG entities table) or a numeric reference `&#\u2026;?`.
return str.replace(
/&(#[xX]?[0-9a-fA-F]+|#?[0-9a-zA-Z]+);?/g,
(match, _group, offset, source) => {
// Numeric reference: &#65; or &#x41;
if (match.charCodeAt(1) === 0x23 /* # */) {
const lastChar = match.charAt(match.length - 1);
const body = lastChar === ";" ? match.slice(2, -1) : match.slice(2);
const isHex =
body.charCodeAt(0) === 0x78 || body.charCodeAt(0) === 0x58;
const code = isHex
? Number.parseInt(body.slice(1), 16)
: Number.parseInt(body, 10);
if (!Number.isNaN(code)) return decodeNumericReference(code);
return match; // Invalid numeric (e.g. &#;)
Comment thread test/walkHtmlTokens.unittest.js Outdated
Comment on lines +1938 to +1941
it("nAMED_CHARACTER_REFERENCE: caps consumption at 33 chars", () => {
// Entity names in the WHATWG table are at most ~33 chars; the
// scanner has a safety cap that breaks out of the consume loop
// past 33 alphanumeric chars even without a closing semicolon.
…ment

Copilot review follow-ups on commit eb2fc4d.

- `decodeHtmlEntities` previously matched numeric refs with
  `#[xX]?[0-9a-fA-F]+`, which let a decimal reference like `&#65b`
  greedily consume the trailing `b` (it's a hex digit) and incorrectly
  decode to `A` while dropping the `b`. Split the regex into three
  clear alternatives:
    `&#x<hex>` — hex (requires `x`/`X`),
    `&#<dec>`  — decimal (digits only),
    `&<name>`  — named (letter then alphanumerics).
  With the stricter regex the parse-int call always has a non-empty
  digit body, so the `Number.isNaN(code)` defensive return is removed.
  Adds a regression test: `&#65b` -> `Ab`, `&#1f` -> U+0001 followed by `f`.

- Reword the "named-character-reference cap" test comment so it
  matches the actual `MAX_ENTITY_NAME_LEN = 32` bound (the previous
  copy said "33 chars" from the pre-constant era).

237/237 tests pass, 100% coverage on `walkHtmlTokens.js`,
`yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
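
The three-alternative split described in the first item can be sketched as follows. The exact regex in the PR is not shown here, so `REFERENCE_RE` and `decodeReferencesSketch` are assumptions. The sketch only handles the numeric alternatives, leaves named references untouched, and omits the full remap rules apart from an out-of-range guard.

```javascript
// Three explicit alternatives: hex (requires x/X), decimal (digits only),
// and named (a letter followed by alphanumerics). Optional trailing ";".
const REFERENCE_RE =
  /&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][0-9a-zA-Z]*));?/g;

/** Guard so String.fromCodePoint never throws on out-of-range input. */
function toChar(code) {
  return code > 0x10ffff ? "\uFFFD" : String.fromCodePoint(code);
}

/** Hypothetical sketch: decodes numeric references only. */
function decodeReferencesSketch(str) {
  return str.replace(REFERENCE_RE, (match, hex, dec) => {
    if (hex !== undefined) return toChar(Number.parseInt(hex, 16));
    if (dec !== undefined) return toChar(Number.parseInt(dec, 10));
    return match; // named reference: out of scope for this sketch
  });
}
```

Because the decimal alternative accepts digits only, the `b` in `&#65b` is never consumed and survives as literal text after the decoded `A`.
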
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Comment thread tooling/generate-html-entities.js Outdated
"walkHtmlTokens.js"
);

const REGION_REGEXP = /\/\/ #region html entities\n[\s\S]+?\/\/ #endregion\n/;
Comment thread tooling/generate-html-entities.js Outdated
Comment on lines +58 to +61
return fetchUrl(/** @type {string} */ (res.headers.location)).then(
resolve,
reject
);
Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines +240 to +246
* Reports a tokenizer parse error to the consumer. The byte range and
* severity follow the WHATWG spec naming. Severity is `"error"` for
* cases where the emitted token is incomplete (EOF inside a tag or
* comment); everything else is a `"warning"`.
* @param {string} code WHATWG parse-error code (kebab-case)
* @param {number} start byte offset where the error starts
* @param {number} end byte offset where the error ends
Copilot review follow-ups on commit 3e4fe10.

- The entities generator's REGION_REGEXP hard-coded `\n`, so it
  false-reported "need to be updated" on Windows checkouts where git
  normalized line endings to CRLF. Match `\r?\n` and preserve the
  file's existing EOL style (CRLF on Windows, LF elsewhere) when
  writing the regenerated region so we don't introduce mixed line
  endings.

- HTTP redirect handling in `fetchUrl` passed `res.headers.location`
  through unchanged, which broke when the server returned a relative
  `Location`. Resolve via `new URL(location, url).toString()` and
  guard against a missing `Location` header.

- `parseError` doc strings called offsets "byte offsets" but the
  values are actually JS string indices (UTF-16 code units), which
  matters for inputs containing non-BMP code points. Reword the
  `ParseErrorSeverity` typedef, the `reportError` JSDoc, and the
  surrounding internal comments to use "offset" / "string offset"
  / "offset range" instead of "byte".

237/237 tests pass, lint/prettier/cspell/tsc clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
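
The CRLF-tolerant region replacement from the first item could be sketched like this. `spliceRegion` is a hypothetical helper, and the EOL detection shown (any `\r\n` in the file means CRLF) is an assumption about how "the file's existing EOL style" is determined.

```javascript
// Accept both LF and CRLF after the region markers.
const REGION_RE =
  /\/\/ #region html entities\r?\n[\s\S]+?\/\/ #endregion\r?\n/;

/**
 * Hypothetical sketch: replaces the region body while keeping the file's
 * existing line-ending style, so no mixed EOLs are introduced.
 */
function spliceRegion(source, regionBody) {
  if (!REGION_RE.test(source)) throw new Error("region markers not found");
  const eol = source.includes("\r\n") ? "\r\n" : "\n";
  const region = [
    "// #region html entities",
    ...regionBody.split("\n"),
    "// #endregion",
    "" // trailing EOL after the end marker
  ].join(eol);
  // A function replacement avoids "$" patterns in the body being interpreted.
  return source.replace(REGION_RE, () => region);
}
```
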
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Comment on lines +2787 to +2793
if (HTML_ENTITIES[withSemi] !== undefined) {
namedEntityConsumed = n + 1;
break;
}
}
const bare = input.slice(pos, pos + n);
if (HTML_ENTITIES[bare] !== undefined) {
Comment on lines +2792 to +2795
const bare = input.slice(pos, pos + n);
if (HTML_ENTITIES[bare] !== undefined) {
namedEntityConsumed = n;
break;
Comment on lines +3203 to +3206
for (let i = searchLen; i > 0; i--) {
const prefix = name.slice(0, i);
if (HTML_ENTITIES[prefix] !== undefined) {
// Attribute-context longest-prefix guard: if the matched
Comment thread tooling/generate-html-entities.js Outdated
// strings (1–2 UTF-16 code units).
// prettier-ignore
// cspell:disable-next-line
const HTML_ENTITIES = /** @type {Readonly<Record<string, string>>} */ (Object.freeze(${JSON.stringify(map)}));
Comment thread tooling/generate-html-entities.js Outdated
);
} else {
console.error(
`${path.relative(process.cwd(), TARGET_PATH)} need to be updated`
"webpack": patch
---

Align the experimental HTML tokenizer with the WHATWG spec: fix byte-range bugs in the script-data, content-mode end-tag, attribute-value, and EOF states; surface tokenizer parse errors to consumers via a new `parseError` callback (`"warning"` when the tokenizer recovers and the emitted token is still well-formed, `"error"` when the byte range is incomplete — e.g. `eof-in-tag`); and add the full WHATWG named character references table so `decodeHtmlEntities` handles all named entities (including legacy bare forms like `&AMP` and multi-code-point entities like `&NotEqualTilde;`) with proper longest-prefix backtracking.
Copilot review follow-ups on commit 1c233be.

- Real bug: `HTML_ENTITIES[name] !== undefined` lookups against a
  regular Object hit `Object.prototype` keys (`toString`,
  `constructor`, `hasOwnProperty`, `__proto__`, …), so inputs like
  `&toString;` would falsely decode to the source of
  `Object.prototype.toString`. The generator now emits the table on
  a null prototype via `Object.assign(Object.create(null), {…})`, so
  bracket lookups are safe everywhere they're used (both
  `decodeHtmlEntities` and the tokenizer's NAMED_CHARACTER_REFERENCE
  state). Adds a regression test for `&toString;`, `&constructor;`,
  `&hasOwnProperty;`.

- Grammar: change "<path> need to be updated" to
  "<path> needs to be updated" in the generator's drift message.

- Reword the changeset to use "offset range" instead of "byte range"
  (the tokenizer uses JS string indices, not bytes).

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`,
`yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
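
The prototype-key hazard from the first item is easy to reproduce in isolation. In this sketch `plainTable`, `safeTable`, and `hasEntity` are hypothetical names; only the `Object.assign(Object.create(null), {...})` construction comes from the commit message.

```javascript
// A regular object literal inherits from Object.prototype.
const plainTable = { "amp;": "&" };

// A null-prototype table has no inherited keys at all.
const safeTable = Object.assign(Object.create(null), { "amp;": "&" });

/** Same style of check as the lookups quoted in the review comments. */
function hasEntity(table, name) {
  return table[name] !== undefined;
}

// Inherited keys look like entities on the plain object...
const plainHit = hasEntity(plainTable, "toString"); // true
// ...but not on the null-prototype table.
const safeHit = hasEntity(safeTable, "toString"); // false
```

The same holds for `constructor`, `hasOwnProperty`, and `__proto__`, while real entries such as `amp;` resolve identically on both tables.
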
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Comment on lines +55 to +80
if (
res.statusCode === 301 ||
res.statusCode === 302 ||
res.statusCode === 307 ||
res.statusCode === 308
) {
const location = res.headers.location;
if (!location) {
return reject(
new Error(
`Redirect from ${url} with no Location header (HTTP ${res.statusCode})`
)
);
}
// `Location` may be relative; resolve against the current
// request URL so `https.get` receives an absolute URL.
return fetchUrl(new URL(location, url).toString()).then(
resolve,
reject
);
}
if (res.statusCode !== 200) {
return reject(
new Error(`Failed to fetch ${url}: HTTP ${res.statusCode}`)
);
}
Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines +3121 to +3123
// - 0x00, > 0x10FFFF, or surrogate (0xD800\u20130xDFFF) \u2192 U+FFFD.
// - 0x80\u20130x9F \u2192 Windows-1252 remap (above).
// - Anything else (including noncharacters and C0 controls) \u2192 the
…e in source

Copilot review follow-ups on commit a17a150.

- `fetchUrl` in `tooling/generate-html-entities.js` previously rejected
  or recursed without consuming the response body. On Node's http(s)
  client this leaves the socket unread and can hang/leak. Call
  `res.resume()` on both the redirect and the non-200 paths so the
  body is drained before we release the response.

- The C1 remap table and surrounding comments in
  `lib/html/walkHtmlTokens.js` had Unicode characters encoded as
  literal `\uXXXX` escape sequences (likely re-escaped by an earlier
  tooling pass), which made the source hard to read. Replaced the
  six-byte escapes with the actual Unicode glyphs for the table
  values, and ASCII-fied the comment punctuation (en-dash to `-`,
  arrow to `->`) so the prose stays clear without relying on
  non-ASCII in source comments.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`,
`yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

Comment thread lib/html/walkHtmlTokens.js Outdated
Comment on lines +3021 to +3022
// completed prior tag).
if (state === STATE_TAG_NAME) tagNameEnd = len;
Copilot review follow-up on commit 083ac67.

Real bug: EOF inside a content-mode end-tag-name state
(RCDATA/RAWTEXT/SCRIPT_DATA/SCRIPT_DATA_ESCAPED end-tag-name) left
`tagNameEnd` carrying the value from the matching open tag, so
`emitCloseTag(len)` sliced a wrong (or empty) range. Example:
`<title>x</tit` at EOF emitted a close tag with name `""` because
`tagNameEnd` was still `6` from the prior `<title>` open.

Fix: reset `tagNameEnd = len` whenever it's missing or stale (less
than `tagNameStart`) instead of only doing so for `STATE_TAG_NAME`.

Also extend the eof-in-script-html-comment-like-text branch to cover
`SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN`, `SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN`,
and `SCRIPT_DATA_DOUBLE_ESCAPE_END` — per spec each of these
reconsumes back into the (double-)escaped state at EOF, which then
fires the same parse error.

Regression test covers the three content-mode partial-close-tag
cases: `<title>x</tit` -> close "tit", `<style>x</sty` -> close "sty",
`<script>x</scr` -> close "scr".

239/239 tests pass, 100% coverage.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
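
The stale-`tagNameEnd` case can be reproduced with plain string offsets. This is a sketch, not the tokenizer code: `closeTagNameAtEof` is a hypothetical helper, and the offsets are worked out by hand for the `<title>x</tit` example.

```javascript
/**
 * Hypothetical sketch: returns the close-tag name at EOF, resetting a stale
 * end offset (one that lies before the start) to the input length.
 */
function closeTagNameAtEof(input, tagNameStart, tagNameEnd) {
  if (tagNameEnd < tagNameStart) tagNameEnd = input.length;
  return input.slice(tagNameStart, tagNameEnd);
}

const input = "<title>x</tit";
// The close-tag name starts at offset 10; 6 is left over from "<title>".
const staleName = input.slice(10, 6); // "" (start > end yields empty string)
const fixedName = closeTagNameAtEof(input, 10, 6); // "tit"
```
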
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +3026 to +3031
flushText(tagStart);
pos =
input.charCodeAt(tagStart + 1) === CC_SOLIDUS
? emitCloseTag(len)
: emitOpenTag(len, false);
} else if (
Copilot review follow-up on commit 132a4cd.

Real bug verified by `walkHtmlTokens("a<div", 0, { text: ... })` —
the `"a"` text span was emitted twice:
  1) `STATE_TAG_OPEN`'s alpha branch flushed pending text on tag entry,
  2) the EOF mid-tag handler called `flushText(tagStart)` again,
     re-emitting the same span because `flushText` never advanced
     `textStart`.

Fix: `flushText` now advances `textStart = endPos` after emitting, so
repeated `flushText` calls for the same span are no-ops. `emitOpenTag`
/ `emitCloseTag` overwrite `textStart` with their own `nextPos` after
the tag emits, so the new advance doesn't shift any subsequent ranges.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`,
`yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
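
The double-emit and its fix can be modelled with a small closure. This is a sketch under assumptions: `createTextFlusher` is a hypothetical wrapper, and the real `flushText` lives inside the tokenizer with a different shape.

```javascript
/**
 * Hypothetical sketch: tracks the start of pending text and emits it once.
 * Advancing textStart after each emit makes repeated flushes to the same
 * offset no-ops.
 */
function createTextFlusher(input, onText) {
  let textStart = 0;
  return {
    flushText(endPos) {
      if (endPos > textStart) {
        onText(input.slice(textStart, endPos), textStart, endPos);
        textStart = endPos; // the fix: without this, the span is re-emitted
      }
    }
  };
}

const spans = [];
const flusher = createTextFlusher("a<div", (text) => spans.push(text));
flusher.flushText(1); // tag-open path flushes "a"
flusher.flushText(1); // EOF mid-tag path flushes again: nothing new to emit
```
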
@github-actions
Contributor

Types Coverage

Coverage after merging claude/align-html-lexer-jo2kB into main will be
98.97%
Coverage Report
| File | Stmts | Branches | Funcs | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| **bin** | | | | | |
| webpack.js | 98.77% | 100% | 100% | 98.77% | 91 |
| **examples** | | | | | |
| build-common.js | 100% | 100% | 100% | 100% | |
| buildAll.js | 100% | 100% | 100% | 100% | |
| examples.js | 100% | 100% | 100% | 100% | |
| template-common.js | 98.21% | 100% | 100% | 98.21% | 72 |
| **examples/custom-javascript-parser** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/custom-javascript-parser/internals** | | | | | |
| acorn-parse.js | 100% | 100% | 100% | 100% | |
| meriyah-parse.js | 100% | 100% | 100% | 100% | |
| oxc-parse.js | 91.30% | 100% | 100% | 91.30% | 140, 142–143, 145, 147, 153–154, 161, 168, 90 |
| **examples/markdown** | | | | | |
| webpack.config.mjs | 100% | 100% | 100% | 100% | |
| **examples/typescript** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/typescript-non-erasable** | | | | | |
| test.filter.js | 50% | 100% | 100% | 50% | 5 |
| **examples/virtual-modules** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/wasm-bindgen-esm** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/wasm-complex** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/wasm-simple** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **examples/wasm-simple-source-phase** | | | | | |
| test.filter.js | 100% | 100% | 100% | 100% | |
| **lib** | | | | | |
| APIPlugin.js | 100% | 100% | 100% | 100% | |
| AsyncDependenciesBlock.js | 100% | 100% | 100% | 100% | |
| AutomaticPrefetchPlugin.js | 100% | 100% | 100% | 100% | |
| BannerPlugin.js | 100% | 100% | 100% | 100% | |
| Cache.js | 98.21% | 100% | 100% | 98.21% | 101 |
| CacheFacade.js | 100% | 100% | 100% | 100% | |
| Chunk.js | 99.72% | 100% | 100% | 99.72% | 39 |
| ChunkGraph.js | 100% | 100% | 100% | 100% | |
| ChunkGroup.js | 100% | 100% | 100% | 100% | |
| ChunkTemplate.js | 100% | 100% | 100% | 100% | |
| CleanPlugin.js | 99.15% | 100% | 100% | 99.15% | 206, 226 |
| CodeGenerationResults.js | 100% | 100% | 100% | 100% | |
| CompatibilityPlugin.js | 100% | 100% | 100% | 100% | |
| Compilation.js | 98.45% | 100% | 100% | 98.45% | 1572, 1868, 1875, 1883, 1905, 2801, 3226, 3888, 3917, 3970–3971, 3975, 3980, 3996–3997, 4011–4012, 4017–4018, 4495, 4521, 511, 516, 5229, 5261, 5278, 5294, 5310, 5325, 5350–5351, 5353, 5681, 5686, 5692, 5695, 5707, 5709, 5713, 5729, 5744, 5776, 5830, 5854, 5968, 730–731 |
| Compiler.js | 99.55% | 100% | 100% | 99.55% | 1116–1117, 1125 |
| ConcatenationScope.js | 98.59% | 100% | 100% | 98.59% | 189 |
| ConditionalInitFragment.js | 100% | 100% | 100% | 100% | |
| ConstPlugin.js | 100% | 100% | 100% | 100% | |
| ContextExclusionPlugin.js | 100% | 100% | 100% | 100% | |
| ContextModule.js | 100% | 100% | 100% | 100% | |
| ContextModuleFactory.js | 97.75% | 100% | 100% | 97.75% | 258, 393, 418, 443, 447, 458 |
| ContextReplacementPlugin.js | 100% | 100% | 100% | 100% | |
| DefinePlugin.js | 98.92% | 100% | 100% | 98.92% | 158–159, 175, 194, 268 |
| DependenciesBlock.js | 100% | 100% | 100% | 100% | |
| Dependency.js | 98.20% | 100% | 100% | 98.20% | 379, 425 |
| DependencyTemplate.js | 100% | 100% | 100% | 100% | |
| DependencyTemplates.js | 100% | 100% | 100% | 100% | |
| DotenvPlugin.js | 98.41% | 100% | 100% | 98.41% | 378, 391–392 |
| DynamicEntryPlugin.js | 100% | 100% | 100% | 100% | |
| EntryOptionPlugin.js | 100% | 100% | 100% | 100% | |
| EntryPlugin.js | 100% | 100% | 100% | 100% | |
| Entrypoint.js | 100% | 100% | 100% | 100% | |
| EnvironmentPlugin.js | 97.14% | 100% | 100% | 97.14% | 49 |
| ErrorHelpers.js | 100% | 100% | 100% | 100% | |
| EvalDevToolModulePlugin.js | 100% | 100% | 100% | 100% | |
| EvalSourceMapDevToolPlugin.js | 100% | 100% | 100% | 100% | |
| ExportsInfo.js | 100% | 100% | 100% | 100% | |
| ExportsInfoApiPlugin.js | 100% | 100% | 100% | 100% | |
| ExternalModule.js | 98.97% | 100% | 100% | 98.97% | 425–429, 577 |
| ExternalModuleFactoryPlugin.js | 100% | 100% | 100% | 100% | |
| ExternalsPlugin.js | 100% | 100% | 100% | 100% | |
| FileSystemInfo.js | 99.50% | 100% | 100% | 99.50% | 182, 2252–2253, 2256, 2267, 2278, 2289, 278, 3694, 3709, 3733 |
| FlagAllModulesAsUsedPlugin.js | 100% | 100% | 100% | 100% | |
| FlagDependencyExportsPlugin.js | 98.74% | 100% | 100% | 98.74% | 399, 401, 405 |
| FlagDependencyUsagePlugin.js | 100% | 100% | 100% | 100% | |
| FlagEntryExportAsUsedPlugin.js | 100% | 100% | 100% | 100% | |
| Generator.js | 100% | 100% | 100% | 100% | |
| HotModuleReplacementPlugin.js | 100% | 100% | 100% | 100% | |
| HotUpdateChunk.js | 100% | 100% | 100% | 100% | |
| IgnorePlugin.js | 100% | 100% | 100% | 100% | |
| IgnoreWarningsPlugin.js | 100% | 100% | 100% | 100% | |
| InitFragment.js | 100% | 100% | 100% | 100% | |
| JavascriptMetaInfoPlugin.js | 100% | 100% | 100% | 100% | |
| LibraryTemplatePlugin.js | 100% | 100% | 100% | 100% | |
| LoaderOptionsPlugin.js | 100% | 100% | 100% | 100% | |
| LoaderTargetPlugin.js | 100% | 100% | 100% | 100% | |
| MainTemplate.js | 100% | 100% | 100% | 100% | |
| ManifestPlugin.js | 100% | 100% | 100% | 100% | |
| Module.js | 98.50% | 100% | 100% | 98.50% | 1305, 1310, 1371, 1385, 1447, 1456 |
| ModuleFactory.js | 100% | 100% | 100% | 100% | |
| ModuleFilenameHelpers.js | 98.85% | 100% | 100% | 98.85% | 106, 108 |
| ModuleGraph.js | 99.73% | 100% | 100% | 99.73% | 1004 |
| ModuleGraphConnection.js | 100% | 100% | 100% | 100% | |
| ModuleInfoHeaderPlugin.js | 100% | 100% | 100% | 100% | |
| ModuleNotFoundError.js | 100% | 100% | 100% | 100% | |
| ModuleProfile.js | 100% | 100% | 100% | 100% | |
| ModuleSourceTypeConstants.js | 100% | 100% | 100% | 100% | |
| ModuleTemplate.js | 100% | 100% | 100% | 100% | |
| ModuleTypeConstants.js | 100% | 100% | 100% | 100% | |
| MultiCompiler.js | 99.69% | 100% | 100% | 99.69% | 645 |
| MultiStats.js | 100% | 100% | 100% | 100% | |
| MultiWatching.js | 100% | 100% | 100% | 100% | |
| NoEmitOnErrorsPlugin.js | 100% | 100% | 100% | 100% | |
| NodeStuffPlugin.js | 100% | 100% | 100% | 100% | |
| NormalModule.js | 97.83% | 100% | 100% | 97.83% | 1072, 1106, 1122, 1209, 1834, 1839–1849, 794, 797, 814, 831 |
| NormalModuleFactory.js | 99.47% | 100% | 100% | 99.47% | 1083, 1392, 486, 498 |
| NormalModuleReplacementPlugin.js | 100% | 100% | 100% | 100% | |
| NullFactory.js | 100% | 100% | 100% | 100% | |
| OptimizationStages.js | 100% | 100% | 100% | 100% | |
| OptionsApply.js | 100% | 100% | 100% | 100% | |
| Parser.js | 100% | 100% | 100% | 100% | |
| PlatformPlugin.js | 100% | 100% | 100% | 100% | |
| PrefetchPlugin.js | 100% | 100% | 100% | 100% | |
| ProgressPlugin.js | 98.85% | 100% | 100% | 98.85% | 519–520, 525, 527, 591 |
| ProvidePlugin.js | 100% | 100% | 100% | 100% | |
| RawModule.js | 100% | 100% | 100% | 100% | |
| RecordIdsPlugin.js | 100% | 100% | 100% | 100% | |
| RequestShortener.js | 100% | 100% | 100% | 100% | |
| ResolverFactory.js | 100% | 100% | 100% | 100% | |
| RuntimeGlobals.js | 100% | 100% | 100% | 100% | |
| RuntimeModule.js | 100% | 100% | 100% | 100% | |
| RuntimePlugin.js | 100% | 100% | 100% | 100% | |
| RuntimeTemplate.js | 100% | 100% | 100% | 100% | |
| SelfModuleFactory.js | 100% | 100% | 100% | 100% | |
| SingleEntryPlugin.js | 100% | 100% | 100% | 100% | |
| SourceMapDevToolModuleOptionsPlugin.js | 100% | 100% | 100% | 100% | |
| SourceMapDevToolPlugin.js | 99.16% | 100% | 100% | 99.16% | 267–268, 610 |
| Stats.js | 100% | 100% | 100% | 100% | |
| Template.js | 100% | 100% | 100% | 100% | |
| TemplatedPathPlugin.js | 98.86% | 100% | 100% | 98.86% | 136–137 |
| UseStrictPlugin.js | 100% | 100% | 100% | 100% | |
| WarnCaseSensitiveModulesPlugin.js | 100% | 100% | 100% | 100% | |
| WarnDeprecatedOptionPlugin.js | 100% | 100% | 100% | 100% | |
| WarnNoModeSetPlugin.js | 100% | 100% | 100% | 100% | |
| WatchIgnorePlugin.js | 100% | 100% | 100% | 100% | |
| Watching.js | 100% | 100% | 100% | 100% | |
| WebpackError.js | 100% | 100% | 100% | 100% | |
| WebpackIsIncludedPlugin.js | 100% | 100% | 100% | 100% | |
| WebpackOptionsApply.js | 100% | 100% | 100% | 100% | |
| WebpackOptionsDefaulter.js | 100% | 100% | 100% | 100% | |
| buildChunkGraph.js | 99.87% | 100% | 100% | 99.87% | 325 |
| cli.js | 98.46% | 100% | 100% | 98.46% | 10, 119, 471, 503, 545, 815 |
| index.js | 99.72% | 100% | 100% | 99.72% | 165 |
| validateSchema.js | 94.67% | 100% | 100% | 94.67% | 100, 87, 89, 98 |
| webpack.js | 96.33% | 100% | 100% | 96.33% | 10, 198, 220, 222 |
| **lib/asset** | | | | | |
| AssetBytesGenerator.js | 100% | 100% | 100% | 100% | |
| AssetBytesParser.js | 100% | 100% | 100% | 100% | |
| AssetGenerator.js | 100% | 100% | 100% | 100% | |
| AssetModulesPlugin.js | 97.32% | 100% | 100% | 97.32% | 283, 307, 310, 36, 362, 41 |
| AssetParser.js | 100% | 100% | 100% | 100% | |
| AssetSourceGenerator.js | 100% | 100% | 100% | 100% | |
| AssetSourceParser.js | 100% | 100% | 100% | 100% | |
| RawDataUrlModule.js | 100% | 100% | 100% | 100% | |
| **lib/async-modules** | | | | | |
| AsyncModuleHelpers.js | 100% | 100% | 100% | 100% | |
| AwaitDependenciesInitFragment.js | 100% | 100% | 100% | 100% | |
| InferAsyncModulesPlugin.js | 100% | 100% | 100% | 100% | |
| **lib/cache** | | | | | |
| AddBuildDependenciesPlugin.js | 100% | 100% | 100% | 100% | |
| AddManagedPathsPlugin.js | 100% | 100% | 100% | 100% | |
| IdleFileCachePlugin.js | 97.92% | 100% | 100% | 97.92% | 71, 83, 91 |
| MemoryCachePlugin.js | 95.83% | 100% | 100% | 95.83% | 33 |
| MemoryWithGcCachePlugin.js | 93.15% | 100% | 100% | 93.15% | 106, 113–114, 122, 89 |
| PackFileCacheStrategy.js | 96.40% | 100% | 100% | 96.40% | 1250, 1350, 1354, 1416, 628, 647, 657–659, 661, 677–678, 683, 686, 688, 693, 698, 722, 728, 762, 768, 774, 779, 790, 799, 804–805, 807, 824, 830–831, 833 |
| ResolverCachePlugin.js | 100% | 100% | 100% | 100% | |
| getLazyHashedEtag.js | 100% | 100% | 100% | 100% | |
| mergeEtags.js | 100% | 100% | 100% | 100% | |
| **lib/config** | | | | | |
| browserslistTargetHandler.js | 100% | 100% | 100% | 100% | |
| defaults.js | 99.29% | 100% | 100% | 99.29% | 1411–1413, 1421, 271, 274, 279, 283 |
| normalization.js | 99% | 100% | 100% | 99% | 191–192, 258, 273 |
| target.js | 100% | 100% | 100% | 100% | |
| **lib/container** | | | | | |
| ContainerEntryDependency.js | 100% | 100% | 100% | 100% | |
| ContainerEntryModule.js | 100% | 100% | 100% | 100% | |
| ContainerEntryModuleFactory.js | 100% | 100% | 100% | 100% | |
| ContainerExposedDependency.js | 100% | 100% | 100% | 100% | |
| ContainerPlugin.js | 100% | 100% | 100% | 100% | |
| ContainerReferencePlugin.js | 100% | 100% | 100% | 100% | |
| FallbackDependency.js | 100% | 100% | 100% | 100% | |
| FallbackItemDependency.js | 100% | 100% | 100% | 100% | |
| FallbackModule.js | 100% | 100% | 100% | 100% | |
| FallbackModuleFactory.js | 100% | 100% | 100% | 100% | |
   

@alexander-akait merged commit 1097a7f into main May 21, 2026
60 of 61 checks passed
@alexander-akait deleted the claude/align-html-lexer-jo2kB branch May 21, 2026 12:43