fix(html): align walkHtmlTokens with WHATWG spec and swc_html_parser by alexander-akait · Pull Request #21000 · webpack/webpack

alexander-akait · 2026-05-20T21:26:52Z

- Drop the length cap on the script-data-double-escape-end `tempBuffer` so end tags like `</scripts>` no longer falsely match `"script"` and prematurely exit double-escaped mode.
- Update `tagStart` when entering the less-than-sign state from any script-data escaped state so a matching `</script>` close tag reached via SCRIPT_DATA_ESCAPED preserves the preceding script body in the emitted text span.
- Flush pending text when a content-mode close tag is detected via the attribute path (`</title foo>`, `</script bar=baz>`), not only via the direct `>` path.
- Track `commentStart` from the markup-declaration-open transition so EOF inside an incomplete `<!…` emits the correct byte range.
- Emit the missing-attribute-value form (`<a foo=>`) so the open-tag byte range still includes the `>`.
- Implement spec-correct named character reference matching by adding the full WHATWG named character references table (generated via `tooling/generate-html-entities.js`) and restoring the STATE_AMBIGUOUS_AMPERSAND fallthrough. `decodeHtmlEntities` now handles the full table, including legacy bare forms (`&AMP`, `&copy`) and multi-code-point entities (`&NotEqualTilde;`), and applies WHATWG longest-prefix backtrack (`&notpre;` -> `¬pre;`).
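
A minimal sketch of the longest-prefix backtrack mentioned in the last item, assuming a tiny hand-written table in place of the generated one. `SAMPLE_ENTITIES` and `decodeNamedSketch` are hypothetical names for illustration, not identifiers from the PR, and the attribute-context rule is not modelled here.

```javascript
// Tiny stand-in for the generated WHATWG table (assumption: keys omit the
// leading `&`, and legacy bare forms are stored without the trailing `;`).
const SAMPLE_ENTITIES = Object.assign(Object.create(null), {
	"amp;": "&",
	amp: "&",
	"not;": "\u00AC",
	not: "\u00AC",
	"copy;": "\u00A9",
	copy: "\u00A9"
});

// Hypothetical helper: decodes named references only.
function decodeNamedSketch(str) {
	return str.replace(/&([a-zA-Z][0-9a-zA-Z]*;?)/g, (match, name) => {
		// Try the longest candidate first, then back off one character at a time.
		for (let i = name.length; i > 0; i--) {
			const value = SAMPLE_ENTITIES[name.slice(0, i)];
			// Keep whatever was not part of the matched prefix as literal text.
			if (value !== undefined) return value + name.slice(i);
		}
		return match; // no known prefix: leave the text untouched
	});
}
```

With this table, `&notpre;` has no entry of its own, so the search backs off to the bare `not` entry and the remaining `pre;` stays literal.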

Adds 199 unit tests covering every state-machine branch in the
tokenizer and reaches 100% statements/branches/functions/lines coverage
on lib/html/walkHtmlTokens.js.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh


Switch lib/html/htmlEntities.js from a 40 KB frozen object literal to a 28 KB delta-encoded string decoded lazily on first use.

- Each record is `<sharedByte><suffix>\t<valueLenByte><value>` with 1-byte length prefixes for shared-prefix count and value length.
- Names sorted alphabetically so adjacent entries share long prefixes.
- Length-prefixed values mean records can contain raw `\t` or `\n` (as in `&Tab;` and `&NewLine;`) without escaping.
- Module exports a getter function; callers cache the table locally so the decode runs at most once per `decodeHtmlEntities` invocation.

Also exclude the generated file from the eslint config so prettier doesn't try to wrap the long source string.

199/199 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
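
The record layout described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's generator: `encodeTable` and `decodeTable` are hypothetical names, and each "byte" is modelled as one UTF-16 code unit of a JS string.

```javascript
// Hypothetical encoder: entries is an array of [name, value] pairs.
function encodeTable(entries) {
	// Sort by name so neighbouring records share long prefixes.
	const sorted = [...entries].sort((a, b) =>
		a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0
	);
	let out = "";
	let prev = "";
	for (const [name, value] of sorted) {
		let shared = 0;
		const max = Math.min(prev.length, name.length);
		while (shared < max && prev[shared] === name[shared]) shared++;
		// <sharedByte><suffix>\t<valueLenByte><value>
		out +=
			String.fromCharCode(shared) +
			name.slice(shared) +
			"\t" +
			String.fromCharCode(value.length) +
			value;
		prev = name;
	}
	return out;
}

// Hypothetical decoder: rebuilds a Map from the delta-encoded string.
function decodeTable(data) {
	const map = new Map();
	let pos = 0;
	let prev = "";
	while (pos < data.length) {
		const shared = data.charCodeAt(pos++);
		// Names never contain a tab, so the first tab ends the suffix.
		const tab = data.indexOf("\t", pos);
		const name = prev.slice(0, shared) + data.slice(pos, tab);
		pos = tab + 1;
		// The value is read by length, so it may itself contain `\t` or `\n`.
		const valueLen = data.charCodeAt(pos++);
		map.set(name, data.slice(pos, pos + valueLen));
		pos += valueLen;
		prev = name;
	}
	return map;
}
```

Reading the value by its length prefix rather than scanning for a terminator is what lets entries such as `&Tab;` round-trip without escaping.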

…ties generator

Surface tokenizer parse errors via a new optional `parseError` callback on `HtmlTokenCallbacks`. Severity is `"warning"` when the tokenizer recovers and the emitted token is still well-formed (missing-attribute-value, unexpected-equals-sign-before-attribute-name, abrupt-doctype-*, etc.) and `"error"` when the emitted token byte range is incomplete (eof-in-tag, eof-in-comment, eof-in-doctype, eof-in-cdata, eof-in-script-html-comment-like-text).

Fix the EOF-in-tag deviation: previously a half-finished tag at EOF (`<div class="x`) was silently emitted as text. The tokenizer now emits the partial open/close tag at EOF and reports `eof-in-tag` so consumers still see a tag callback.

Fix two small spec deviations: `STATE_COMMENT_START` and `STATE_COMMENT_START_DASH` now correctly reconsume on the anything-else branch (they previously consumed the character, which prevented the comment-less-than-sign chain from firing). Also correct a misleading code comment in `STATE_COMMENT_END`.

Make `emitAttribute` advance `pos` past the closing quote by default when no `attribute` callback is provided, so callers that only need the error stream don't have to wire an attribute handler just to keep parsing.

Restructure the entities generator to follow the same pattern as `tooling/generate-runtime-code.js`: vendor the WHATWG entities.json locally at `tooling/html-entities.json`, support `--write` for in-place generation, and wire into `lint:special` / `fix:special` so the generated `lib/html/htmlEntities.js` is verified by CI. A separate `--fetch` flag refreshes the vendored JSON from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
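
The severity split described in this commit could be modelled along these lines. The sketch is an assumption-laden illustration: `severityFor` and `createCollector` are hypothetical helpers, and the `(code, start, end)` parameter order is assumed rather than taken from the PR's actual `parseError` signature.

```javascript
// Codes the commit message lists as leaving the emitted token incomplete.
const ERROR_CODES = new Set([
	"eof-in-tag",
	"eof-in-comment",
	"eof-in-doctype",
	"eof-in-cdata",
	"eof-in-script-html-comment-like-text"
]);

// Hypothetical helper: everything the tokenizer recovers from is a warning.
function severityFor(code) {
	return ERROR_CODES.has(code) ? "error" : "warning";
}

// Hypothetical consumer that records every report it receives.
function createCollector() {
	const reports = [];
	return {
		reports,
		// Assumed parameter order; the real callback may differ.
		parseError(code, start, end) {
			reports.push({ code, start, end, severity: severityFor(code) });
		}
	};
}
```

A consumer that only cares about truncated tokens could then filter the collected reports on `severity === "error"`.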

Drop `lib/html/htmlEntities.js` and inline the delta-encoded entity table plus the lazy `getHtmlEntities` getter into a `// #region html entities` block at the top of `lib/html/walkHtmlTokens.js`.

The generator (`tooling/generate-html-entities.js`) now splices the region into `walkHtmlTokens.js` directly, mirroring how `tooling/generate-runtime-code.js` operates on `lib/util/semver.js`. `yarn lint:special` verifies the inlined block is in sync; `yarn fix:special` writes it in place. `--fetch` still refreshes the vendored `tooling/html-entities.json` from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh

… literal

The lazy `getHtmlEntities()` getter was building a 2231-entry Map on first use by parsing a delta-encoded string with one `indexOf` and one `slice` per entry. Replace it with a single `Object.freeze({...})` literal so the JS engine builds the table once at module-load time and all subsequent lookups are direct property access.

The literal lives inside the existing `// #region html entities` block and is preceded by `// prettier-ignore` / `// cspell:disable-next-line` so the long line is not wrapped on regeneration. File size grows from ~140 KB to ~143 KB (Object literal vs delta-encoded string), in exchange for removing the per-first-call decode work and the `htmlEntityCache` / `getHtmlEntities` scaffolding.

Update the two callers in `walkHtmlTokens.js` (NAMED_CHARACTER_REFERENCE state and `decodeHtmlEntities`) to read from `HTML_ENTITIES` directly.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh

…gus-comment

Three EOF-path fixes surfaced by an independent spec audit:

- `STATE_ATTRIBUTE_NAME` at EOF previously emitted an attribute with stale or `-1` `attrNameEnd` because the state machine only assigns `attrNameEnd` on the transitions out of attribute-name. Set it to `len` in the EOF block so `<div data-x` at EOF emits attribute `data-x` with the correct byte range.
- `STATE_CHARACTER_REFERENCE`-family (states 71–79) at EOF previously fell straight to the trailing-text branch, skipping the `returnState`. When the return state was inside an attribute value (e.g. `<a href="x&amp` at EOF), neither the partial tag nor `eof-in-tag` was emitted. Now the EOF block unwinds `state` to `returnState` before the existing in-tag check, so the partial open tag is emitted and `eof-in-tag` is reported.
- `STATE_BOGUS_COMMENT` at EOF previously reported `eof-in-comment` as an error, but the spec defines no parse error for this case (the comment is emitted cleanly with the EOF token). Drop the error report for bogus-comment-at-EOF while keeping it for the real comment states.

229/229 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh

Three remaining spec-alignment items from the recent audit:

- `decodeHtmlEntities` now applies the WHATWG numeric-character-reference remap rules: 0x00, surrogates (0xD800-0xDFFF), and code points above 0x10FFFF return U+FFFD; the 0x80-0x9F C1 range remaps through the Windows-1252 table per spec (e.g. `&#x80;` -> euro, `&#x99;` -> tm).
- `decodeHtmlEntities` accepts a new optional `isAttribute` flag that applies the WHATWG consumed-as-part-of-an-attribute rule. With `isAttribute: true`, a named entity without a trailing `;` whose next character is `=` (or whose longest-prefix backtrack leaves trailing alphanumerics) stays literal: `decodeHtmlEntities("&amp=foo", true)` returns `&amp=foo`; without the flag it decodes to `&=foo`.
- Tokenizer now fires the `end-tag-with-attributes` parse error (severity `"warning"`) when an end tag like `</div foo>` emits. Tracked via a `tagHasAttributes` flag set in `emitAttribute` and cleared in `emitOpenTag` / `emitCloseTag`.

235/235 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
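
The numeric remap rules in the first item can be sketched as below. This is a simplified illustration, not the PR's code: `decodeNumericSketch` is a hypothetical name, and the table lists the C1 replacements from the WHATWG numeric character reference rules as code points.

```javascript
const REPLACEMENT_CHARACTER = "\uFFFD";

// C1 range replacements (Windows-1252 style), stored as code points.
const C1_REMAP = new Map([
	[0x80, 0x20ac], [0x82, 0x201a], [0x83, 0x0192], [0x84, 0x201e],
	[0x85, 0x2026], [0x86, 0x2020], [0x87, 0x2021], [0x88, 0x02c6],
	[0x89, 0x2030], [0x8a, 0x0160], [0x8b, 0x2039], [0x8c, 0x0152],
	[0x8e, 0x017d], [0x91, 0x2018], [0x92, 0x2019], [0x93, 0x201c],
	[0x94, 0x201d], [0x95, 0x2022], [0x96, 0x2013], [0x97, 0x2014],
	[0x98, 0x02dc], [0x99, 0x2122], [0x9a, 0x0161], [0x9b, 0x203a],
	[0x9c, 0x0153], [0x9e, 0x017e], [0x9f, 0x0178]
]);

// Hypothetical helper: maps an already-parsed numeric reference to a string.
function decodeNumericSketch(code) {
	// Null, out-of-range, and surrogate code points become U+FFFD.
	if (code === 0 || code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff)) {
		return REPLACEMENT_CHARACTER;
	}
	// C1 code points with a table entry are remapped; the rest pass through.
	const remapped = C1_REMAP.get(code);
	return String.fromCodePoint(remapped !== undefined ? remapped : code);
}
```

Code points in the C1 range without a table entry, such as 0x81, are returned unchanged in this sketch.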

changeset-bot · 2026-05-20T21:26:57Z

🦋 Changeset detected

Latest commit: 3effe59

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
webpack	Patch


github-actions · 2026-05-20T21:27:30Z

This PR is packaged and the instant preview is available (1097a7f).

Install it locally:

npm

npm i -D webpack@https://pkg.pr.new/webpack@1097a7f

yarn

yarn add -D webpack@https://pkg.pr.new/webpack@1097a7f

pnpm

pnpm add -D webpack@https://pkg.pr.new/webpack@1097a7f

codecov · 2026-05-20T21:29:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.60%. Comparing base (a39f2d3) to head (3effe59).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #21000      +/-   ##
==========================================
+ Coverage   90.94%   91.60%   +0.66%     
==========================================
  Files         573      573              
  Lines       58940    59120     +180     
  Branches    15888    15948      +60     
==========================================
+ Hits        53601    54159     +558     
+ Misses       5339     4961     -378

Flag	Coverage Δ
integration	`89.62% <43.93%> (-0.10%)`	⬇️
test262	`45.37% <ø> (+<0.01%)`	⬆️
unit	`37.89% <100.00%> (+1.30%)`	⬆️


Copilot

Pull request overview

This PR updates webpack’s experimental HTML tokenizer (walkHtmlTokens) to more closely follow the WHATWG HTML tokenization spec and match behavior of swc_html_parser, including fixing multiple byte-range edge cases and expanding entity decoding to the full named character reference table.

Changes:

- Align tokenizer state-machine behavior for script-data escaping, content-mode close tags via attribute paths, markup-declaration EOF handling, and missing attribute values; add a new parseError callback with "warning"/"error" severities.
- Implement spec-correct named character reference matching (full WHATWG entities table + longest-prefix backtrack) and upgrade decodeHtmlEntities accordingly.
- Add a generator (tooling/generate-html-entities.js) plus extensive unit tests (targeting full branch coverage).

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.


File	Description
`lib/html/walkHtmlTokens.js`	Main tokenizer updates: state-machine fixes, new `parseError` callback, full WHATWG entity table + improved entity decoding, and expanded EOF handling.
`tooling/generate-html-entities.js`	Generates the inlined `HTML_ENTITIES` region in `walkHtmlTokens.js` from `tooling/html-entities.json`.
`package.json`	Hooks the generator into `lint:special` / `fix:special` workflows.
`tooling/html-entities.json`	Vendored WHATWG named character references table used by the generator.
`test/walkHtmlTokens.unittest.js`	Adds regression tests and broad state-machine branch coverage + parseError/decodeHtmlEntities tests.
`cspell.json`	Excludes `tooling/html-entities.json` from spellchecking.
`.changeset/align-html-lexer-script-data.md`	Patch changeset documenting tokenizer alignment, new callback, and full entities support.


-						tempBuffer += String.fromCharCode(cc + 0x20);
-					}
+					tempBuffer += String.fromCharCode(cc + 0x20);
 					pos++;


-					if (namedEntityConsumed > 33) break;
+					runLen++;
+					// Safety cap — the longest entity name is 32 chars (without `&`).
+					if (runLen > 32) break;


+			for (let i = name.length; i > 0; i--) {
+				const prefix = name.slice(0, i);
+				if (HTML_ENTITIES[prefix] !== undefined) {
+					// Attribute-context longest-prefix guard: if the matched
+					// prefix doesn't end with `;` and the leftover starts with
+					// an alphanumeric character, leave literal per WHATWG.
+					// (The regex greedy-consumes alphanumerics, so any leftover
+					// within `name` is itself alphanumeric — we only need to


- Fix tsc lint failure: cast the inlined `HTML_ENTITIES` to `Readonly<Record<string, string>>` so the two `HTML_ENTITIES[name]` call sites type-check (the frozen-literal type was too narrow).
- Copilot #1: replace the `tempBuffer` string in `STATE_SCRIPT_DATA_DOUBLE_ESCAPE_START` / `_END` with a small `scriptMatch` counter against the literal `"script"`. Worst-case inputs with very long ASCII-alpha runs after `</` no longer grow a buffer or do quadratic string concatenation. Vestigial `tempBuffer = ""` resets in the four content-mode less-than-sign states (which never read the buffer) are removed; the `tempBuffer` variable is gone.
- Copilot #2: introduce a `MAX_ENTITY_NAME_LEN = 32` constant (longest WHATWG entity name including the trailing `;`) and cap the named-character-reference alphanumeric run at `MAX_ENTITY_NAME_LEN - 1`. Replaces the off-by-one `if (runLen > 32) break` with a clearer in-loop bound.
- Copilot #3: cap the `decodeHtmlEntities` longest-prefix backtrack at `MAX_ENTITY_NAME_LEN`. Inputs like `&` followed by thousands of alphanumerics stay linear-time; anything past the cap is appended verbatim. Adds a regression test that decodes `&` + 1000 chars to confirm the decoder doesn't go quadratic.

236/236 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
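
The counter-based match from the "Copilot #1" item might look roughly like this. It is a sketch under assumptions: `alphaRunIsScript` is a hypothetical name, and it scans the whole run in one loop, whereas the real tokenizer advances one character per state-machine step.

```javascript
const SCRIPT = "script";

// Returns true when the ASCII-alpha run starting at `pos` is exactly
// "script", ignoring case. No buffer is built, whatever the run length.
function alphaRunIsScript(input, pos) {
	// Number of characters matched so far, or -1 once a match is impossible.
	let matched = 0;
	for (; pos < input.length; pos++) {
		const cc = input.charCodeAt(pos) | 0x20; // fold A-Z onto a-z
		if (cc < 0x61 || cc > 0x7a) break; // end of the ASCII-alpha run
		if (
			matched >= 0 &&
			matched < SCRIPT.length &&
			cc === SCRIPT.charCodeAt(matched)
		) {
			matched++;
		} else {
			matched = -1; // mismatch, or extra letters after "script"
		}
	}
	return matched === SCRIPT.length;
}
```

Because the counter only ever holds a small integer, a run of thousands of letters after `</` costs one comparison per character and no allocation.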

codspeed-hq · 2026-05-20T21:56:10Z

Merging this PR will improve performance by 21.08%

⚡ 6 improved benchmarks
❌ 4 regressed benchmarks
✅ 134 untouched benchmarks
⏩ 72 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Memory	`benchmark "side-effects-reexport", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	859.4 KB	406.4 KB	×2.1
❌	Memory	`benchmark "context-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	152.6 KB	663.5 KB	-77%
⚡	Memory	`benchmark "many-modules-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	288.8 KB	142.8 KB	×2
❌	Memory	`benchmark "side-effects-reexport", scenario '{"name":"mode-development","mode":"development"}'`	3.9 MB	4.9 MB	-21.65%
⚡	Memory	`benchmark "cache-filesystem", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	843.9 KB	164.7 KB	×5.1
⚡	Memory	`benchmark "many-chunks-esm", scenario '{"name":"mode-production","mode":"production"}'`	10.8 MB	7.8 MB	+37.82%
⚡	Memory	`benchmark "asset-modules-bytes", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	323.2 KB	134.5 KB	×2.4
❌	Memory	`benchmark "many-chunks-esm", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	177 KB	248.5 KB	-28.79%
⚡	Memory	`benchmark "concatenate-modules", scenario '{"name":"mode-development","mode":"development"}'`	1,113.1 KB	781.7 KB	+42.4%
❌	Memory	`benchmark "future-defaults", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}'`	145.2 KB	284.3 KB	-48.92%


Comparing claude/align-html-lexer-jo2kB (3effe59) with main (294197c)

¹ 72 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

+	// Match either `&name;` / `&name` (named refs may be legacy-bare per the
+	// WHATWG entities table) or a numeric reference `&#\u2026;?`.
+	return str.replace(
+		/&(#[xX]?[0-9a-fA-F]+|#?[0-9a-zA-Z]+);?/g,
+		(match, _group, offset, source) => {
+			// Numeric reference: &#65; or &#x41;
+			if (match.charCodeAt(1) === 0x23 /* # */) {
+				const lastChar = match.charAt(match.length - 1);
+				const body = lastChar === ";" ? match.slice(2, -1) : match.slice(2);
+				const isHex =
+					body.charCodeAt(0) === 0x78 || body.charCodeAt(0) === 0x58;
+				const code = isHex
+					? Number.parseInt(body.slice(1), 16)
+					: Number.parseInt(body, 10);
+				if (!Number.isNaN(code)) return decodeNumericReference(code);
+				return match; // Invalid numeric (e.g. &#;)


+		it("nAMED_CHARACTER_REFERENCE: caps consumption at 33 chars", () => {
+			// Entity names in the WHATWG table are at most ~33 chars; the
+			// scanner has a safety cap that breaks out of the consume loop
+			// past 33 alphanumeric chars even without a closing semicolon.


…ment

Copilot review follow-ups on commit eb2fc4d.

- `decodeHtmlEntities` previously matched numeric refs with `#[xX]?[0-9a-fA-F]+`, which let a decimal reference like `&#65b` greedily consume the trailing `b` (it's a hex digit) and incorrectly decode to `A` while dropping the `b`. Split the regex into three clear alternatives: `&#x<hex>` for hex (requires `x`/`X`), `&#<dec>` for decimal (digits only), and `&<name>` for named (letter then alphanumerics). With the stricter regex the parse-int call always has a non-empty digit body, so the `Number.isNaN(code)` defensive return is removed. Adds a regression test: `&#65b` -> `Ab`, `&#1f` -> `�f`.
- Reword the "named-character-reference cap" test comment so it matches the actual `MAX_ENTITY_NAME_LEN = 32` bound (the previous copy said "33 chars" from the pre-constant era).

237/237 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
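
A hedged sketch of the three-alternative regex described in the first item. The pattern and the `decodeNumericOnly` helper are written for illustration and are not copied from the PR; named references are matched but deliberately left untouched, and the range handling is simplified.

```javascript
// Three alternatives: hex (requires x/X), decimal (digits only), named.
const REFERENCE_REGEXP =
	/&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][0-9a-zA-Z]*));?/g;

// Hypothetical helper: decodes numeric references only.
function decodeNumericOnly(str) {
	return str.replace(REFERENCE_REGEXP, (match, hex, dec) => {
		// Named references are out of scope for this sketch.
		if (hex === undefined && dec === undefined) return match;
		const code =
			hex !== undefined ? Number.parseInt(hex, 16) : Number.parseInt(dec, 10);
		// Simplified range check; the remap rules in the PR are more detailed.
		if (code === 0 || code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff)) {
			return "\uFFFD";
		}
		return String.fromCodePoint(code);
	});
}
```

Since the decimal alternative accepts digits only, the `b` in `&#65b` is never consumed and survives as literal text after the decoded `A`.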

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

+	"walkHtmlTokens.js"
+);
+
+const REGION_REGEXP = /\/\/ #region html entities\n[\s\S]+?\/\/ #endregion\n/;


+						return fetchUrl(/** @type {string} */ (res.headers.location)).then(
+							resolve,
+							reject
+						);


+	 * Reports a tokenizer parse error to the consumer. The byte range and
+	 * severity follow the WHATWG spec naming. Severity is `"error"` for
+	 * cases where the emitted token is incomplete (EOF inside a tag or
+	 * comment); everything else is a `"warning"`.
+	 * @param {string} code WHATWG parse-error code (kebab-case)
+	 * @param {number} start byte offset where the error starts
+	 * @param {number} end byte offset where the error ends


Copilot review follow-ups on commit 3e4fe10.

- The entities generator's REGION_REGEXP hard-coded `\n`, so it false-reported "need to be updated" on Windows checkouts where git normalized line endings to CRLF. Match `\r?\n` and preserve the file's existing EOL style (CRLF on Windows, LF elsewhere) when writing the regenerated region so we don't introduce mixed line endings.
- HTTP redirect handling in `fetchUrl` passed `res.headers.location` through unchanged, which broke when the server returned a relative `Location`. Resolve via `new URL(location, url).toString()` and guard against a missing `Location` header.
- `parseError` doc strings called offsets "byte offsets" but the values are actually JS string indices (UTF-16 code units), which matters for inputs containing non-BMP code points. Reword the `ParseErrorSeverity` typedef, the `reportError` JSDoc, and the surrounding internal comments to use "offset" / "string offset" / "offset range" instead of "byte".

237/237 tests pass, lint/prettier/cspell/tsc clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
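
One way to make the region replacement tolerant of CRLF, as the first item describes, is sketched below. `replaceRegion` is a hypothetical helper, and the EOL detection (look at the matched region itself) is an assumption about how "preserve the existing EOL style" could be done, not necessarily what the generator does.

```javascript
// Accept both LF and CRLF after the region markers.
const REGION_REGEXP =
	/\/\/ #region html entities\r?\n[\s\S]+?\/\/ #endregion\r?\n/;

// Hypothetical helper: swaps the region body, keeping the file's EOL style.
function replaceRegion(source, bodyLines) {
	const match = REGION_REGEXP.exec(source);
	if (!match) throw new Error("html entities region not found");
	// Reuse whichever line ending the existing region already has.
	const eol = match[0].includes("\r\n") ? "\r\n" : "\n";
	const region = [
		"// #region html entities",
		...bodyLines,
		"// #endregion",
		""
	].join(eol);
	// A function replacer avoids interpreting `$` sequences in the body.
	return source.replace(REGION_REGEXP, () => region);
}
```

Regenerating with an unchanged body then yields a byte-identical file on both LF and CRLF checkouts, so a drift check comparing old and new content does not misfire.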

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

+						if (HTML_ENTITIES[withSemi] !== undefined) {
+							namedEntityConsumed = n + 1;
+							break;
+						}
+					}
+					const bare = input.slice(pos, pos + n);
+					if (HTML_ENTITIES[bare] !== undefined) {


+					const bare = input.slice(pos, pos + n);
+					if (HTML_ENTITIES[bare] !== undefined) {
+						namedEntityConsumed = n;
+						break;


+			for (let i = searchLen; i > 0; i--) {
+				const prefix = name.slice(0, i);
+				if (HTML_ENTITIES[prefix] !== undefined) {
+					// Attribute-context longest-prefix guard: if the matched


+// strings (1–2 UTF-16 code units).
+// prettier-ignore
+// cspell:disable-next-line
+const HTML_ENTITIES = /** @type {Readonly<Record<string, string>>} */ (Object.freeze(${JSON.stringify(map)}));


+			);
+		} else {
+			console.error(
+				`${path.relative(process.cwd(), TARGET_PATH)} need to be updated`


+"webpack": patch
+---
+
+Align the experimental HTML tokenizer with the WHATWG spec: fix byte-range bugs in the script-data, content-mode end-tag, attribute-value, and EOF states; surface tokenizer parse errors to consumers via a new `parseError` callback (`"warning"` when the tokenizer recovers and the emitted token is still well-formed, `"error"` when the byte range is incomplete — e.g. `eof-in-tag`); and add the full WHATWG named character references table so `decodeHtmlEntities` handles all named entities (including legacy bare forms like `&AMP` and multi-code-point entities like `&NotEqualTilde;`) with proper longest-prefix backtracking.


Copilot review follow-ups on commit 1c233be.

- Real bug: `HTML_ENTITIES[name] !== undefined` lookups against a regular Object hit `Object.prototype` keys (`toString`, `constructor`, `hasOwnProperty`, `__proto__`, …), so inputs like `&toString;` would falsely decode to the source of `Object.prototype.toString`. The generator now emits the table on a null prototype via `Object.assign(Object.create(null), {…})`, so bracket lookups are safe everywhere they're used (both `decodeHtmlEntities` and the tokenizer's NAMED_CHARACTER_REFERENCE state). Adds a regression test for `&toString;`, `&constructor;`, `&hasOwnProperty;`.
- Grammar: change "<path> need to be updated" to "<path> needs to be updated" in the generator's drift message.
- Reword the changeset to use "offset range" instead of "byte range" (the tokenizer uses JS string indices, not bytes).

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
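
The prototype-key problem in the first item is easy to reproduce in isolation. The two-entry tables and the `lookup` helper below are hypothetical stand-ins for the generated `HTML_ENTITIES` table.

```javascript
// A regular object inherits keys such as `toString` from Object.prototype.
const plainTable = { "amp;": "&", amp: "&" };

// A null-prototype object has only the keys that were assigned to it.
const safeTable = Object.assign(Object.create(null), { "amp;": "&", amp: "&" });

// Hypothetical helper mirroring the `table[name] !== undefined` check.
function lookup(table, name) {
	const value = table[name];
	return value !== undefined ? value : null;
}
```

With `plainTable`, `lookup(plainTable, "toString")` returns the inherited function rather than `null`, which is the false match the commit describes; the same call against `safeTable` returns `null`.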

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

+					if (
+						res.statusCode === 301 ||
+						res.statusCode === 302 ||
+						res.statusCode === 307 ||
+						res.statusCode === 308
+					) {
+						const location = res.headers.location;
+						if (!location) {
+							return reject(
+								new Error(
+									`Redirect from ${url} with no Location header (HTTP ${res.statusCode})`
+								)
+							);
+						}
+						// `Location` may be relative; resolve against the current
+						// request URL so `https.get` receives an absolute URL.
+						return fetchUrl(new URL(location, url).toString()).then(
+							resolve,
+							reject
+						);
+					}
+					if (res.statusCode !== 200) {
+						return reject(
+							new Error(`Failed to fetch ${url}: HTTP ${res.statusCode}`)
+						);
+					}
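
The `Location` resolution step in the excerpt above relies on the two-argument form of the WHATWG `URL` constructor. A small standalone sketch, with `resolveRedirect` as a hypothetical helper and `example.com` URLs chosen purely for illustration:

```javascript
// Hypothetical helper: turns a possibly relative Location header into an
// absolute URL, resolved against the URL that produced the redirect.
function resolveRedirect(location, requestUrl) {
	if (!location) {
		throw new Error(`Redirect from ${requestUrl} with no Location header`);
	}
	return new URL(location, requestUrl).toString();
}
```

An absolute `Location` passes through unchanged, while path-relative and root-relative values are resolved against the request URL.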


+	//   - 0x00, > 0x10FFFF, or surrogate (0xD800\u20130xDFFF) \u2192 U+FFFD.
+	//   - 0x80\u20130x9F \u2192 Windows-1252 remap (above).
+	//   - Anything else (including noncharacters and C0 controls) \u2192 the


…e in source

Copilot review follow-ups on commit a17a150.

- `fetchUrl` in `tooling/generate-html-entities.js` previously rejected or recursed without consuming the response body. On Node's http(s) client this leaves the socket unread and can hang/leak. Call `res.resume()` on both the redirect and the non-200 paths so the body is drained before we release the response.
- The C1 remap table and surrounding comments in `lib/html/walkHtmlTokens.js` had Unicode characters encoded as literal `\uXXXX` escape sequences (likely re-escaped by an earlier tooling pass), which made the source hard to read. Replaced the six-byte escapes with the actual Unicode glyphs for the table values, and ASCII-fied the comment punctuation (en-dash to `-`, arrow to `->`) so the prose stays clear without relying on non-ASCII in source comments.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

+		// completed prior tag).
+		if (state === STATE_TAG_NAME) tagNameEnd = len;


Copilot review follow-up on commit 083ac67.

Real bug: EOF inside a content-mode end-tag-name state (RCDATA/RAWTEXT/SCRIPT_DATA/SCRIPT_DATA_ESCAPED end-tag-name) left `tagNameEnd` carrying the value from the matching open tag, so `emitCloseTag(len)` sliced a wrong (or empty) range. Example: `<title>x</tit` at EOF emitted a close tag with name `""` because `tagNameEnd` was still `6` from the prior `<title>` open.

Fix: reset `tagNameEnd = len` whenever it's missing or stale (less than `tagNameStart`) instead of only doing so for `STATE_TAG_NAME`.

Also extend the eof-in-script-html-comment-like-text branch to cover `SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN`, `SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN`, and `SCRIPT_DATA_DOUBLE_ESCAPE_END`. Per spec, each of these reconsumes back into the (double-)escaped state at EOF, which then fires the same parse error.

Regression test covers the three content-mode partial-close-tag cases: `<title>x</tit` -> close "tit", `<style>x</sty` -> close "sty", `<script>x</scr` -> close "scr".

239/239 tests pass, 100% coverage.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

+		flushText(tagStart);
+		pos =
+			input.charCodeAt(tagStart + 1) === CC_SOLIDUS
+				? emitCloseTag(len)
+				: emitOpenTag(len, false);
+	} else if (


Copilot review follow-up on commit 132a4cd.

Real bug verified by `walkHtmlTokens("a<div", 0, { text: ... })`: the `"a"` text span was emitted twice:

1) `STATE_TAG_OPEN`'s alpha branch flushed pending text on tag entry,
2) the EOF mid-tag handler called `flushText(tagStart)` again, re-emitting the same span because `flushText` never advanced `textStart`.

Fix: `flushText` now advances `textStart = endPos` after emitting, so repeated `flushText` calls for the same span are no-ops. `emitOpenTag` / `emitCloseTag` overwrite `textStart` with their own `nextPos` after the tag emits, so the new advance doesn't shift any subsequent ranges.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
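
The fix can be modelled in isolation with a much-simplified flusher. `createTextFlusher` is a hypothetical stand-in for the tokenizer's internal state; only the `textStart` advance is taken from the commit message.

```javascript
// Hypothetical, simplified model of the tokenizer's pending-text tracking.
function createTextFlusher(input, onText) {
	let textStart = 0;
	return function flushText(endPos) {
		// Nothing pending up to endPos: a repeated call lands here.
		if (endPos <= textStart) return;
		onText(input.slice(textStart, endPos), textStart, endPos);
		// Advance so the same span cannot be emitted a second time.
		textStart = endPos;
	};
}
```

For the input `a<div`, calling `flushText(1)` on tag entry and again from an EOF handler reports the `"a"` span once in this model.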

github-actions · 2026-05-21T01:19:12Z

Types Coverage

Coverage after merging claude/align-html-lexer-jo2kB into main will be

98.97%

Coverage Report

File	Stmts	Branches	Funcs	Lines	Uncovered Lines
bin
webpack.js	98.77%	100%	100%	98.77%	91
examples
build-common.js	100%	100%	100%	100%
buildAll.js	100%	100%	100%	100%
examples.js	100%	100%	100%	100%
template-common.js	98.21%	100%	100%	98.21%	72
examples/custom-javascript-parser
test.filter.js	100%	100%	100%	100%
examples/custom-javascript-parser/internals
acorn-parse.js	100%	100%	100%	100%
meriyah-parse.js	100%	100%	100%	100%
oxc-parse.js	91.30%	100%	100%	91.30%	140, 142–143, 145, 147, 153–154, 161, 168, 90
examples/markdown
webpack.config.mjs	100%	100%	100%	100%
examples/typescript
test.filter.js	100%	100%	100%	100%
examples/typescript-non-erasable
test.filter.js	50%	100%	100%	50%	5
examples/virtual-modules
test.filter.js	100%	100%	100%	100%
examples/wasm-bindgen-esm
test.filter.js	100%	100%	100%	100%
examples/wasm-complex
test.filter.js	100%	100%	100%	100%
examples/wasm-simple
test.filter.js	100%	100%	100%	100%
examples/wasm-simple-source-phase
test.filter.js	100%	100%	100%	100%
lib
APIPlugin.js	100%	100%	100%	100%
AsyncDependenciesBlock.js	100%	100%	100%	100%
AutomaticPrefetchPlugin.js	100%	100%	100%	100%
BannerPlugin.js	100%	100%	100%	100%
Cache.js	98.21%	100%	100%	98.21%	101
CacheFacade.js	100%	100%	100%	100%
Chunk.js	99.72%	100%	100%	99.72%	39
ChunkGraph.js	100%	100%	100%	100%
ChunkGroup.js	100%	100%	100%	100%
ChunkTemplate.js	100%	100%	100%	100%
CleanPlugin.js	99.15%	100%	100%	99.15%	206, 226
CodeGenerationResults.js	100%	100%	100%	100%
CompatibilityPlugin.js	100%	100%	100%	100%
Compilation.js	98.45%	100%	100%	98.45%	1572, 1868, 1875, 1883, 1905, 2801, 3226, 3888, 3917, 3970–3971, 3975, 3980, 3996–3997, 4011–4012, 4017–4018, 4495, 4521, 511, 516, 5229, 5261, 5278, 5294, 5310, 5325, 5350–5351, 5353, 5681, 5686, 5692, 5695, 5707, 5709, 5713, 5729, 5744, 5776, 5830, 5854, 5968, 730–731
Compiler.js	99.55%	100%	100%	99.55%	1116–1117, 1125
ConcatenationScope.js	98.59%	100%	100%	98.59%	189
ConditionalInitFragment.js	100%	100%	100%	100%
ConstPlugin.js	100%	100%	100%	100%
ContextExclusionPlugin.js	100%	100%	100%	100%
ContextModule.js	100%	100%	100%	100%
ContextModuleFactory.js	97.75%	100%	100%	97.75%	258, 393, 418, 443, 447, 458
ContextReplacementPlugin.js	100%	100%	100%	100%
DefinePlugin.js	98.92%	100%	100%	98.92%	158–159, 175, 194, 268
DependenciesBlock.js	100%	100%	100%	100%
Dependency.js	98.20%	100%	100%	98.20%	379, 425
DependencyTemplate.js	100%	100%	100%	100%
DependencyTemplates.js	100%	100%	100%	100%
DotenvPlugin.js	98.41%	100%	100%	98.41%	378, 391–392
DynamicEntryPlugin.js	100%	100%	100%	100%
EntryOptionPlugin.js	100%	100%	100%	100%
EntryPlugin.js	100%	100%	100%	100%
Entrypoint.js	100%	100%	100%	100%
EnvironmentPlugin.js	97.14%	100%	100%	97.14%	49
ErrorHelpers.js	100%	100%	100%	100%
EvalDevToolModulePlugin.js	100%	100%	100%	100%
EvalSourceMapDevToolPlugin.js	100%	100%	100%	100%
ExportsInfo.js	100%	100%	100%	100%
ExportsInfoApiPlugin.js	100%	100%	100%	100%
ExternalModule.js	98.97%	100%	100%	98.97%	425–429, 577
ExternalModuleFactoryPlugin.js	100%	100%	100%	100%
ExternalsPlugin.js	100%	100%	100%	100%
FileSystemInfo.js	99.50%	100%	100%	99.50%	182, 2252–2253, 2256, 2267, 2278, 2289, 278, 3694, 3709, 3733
FlagAllModulesAsUsedPlugin.js	100%	100%	100%	100%
FlagDependencyExportsPlugin.js	98.74%	100%	100%	98.74%	399, 401, 405
FlagDependencyUsagePlugin.js	100%	100%	100%	100%
FlagEntryExportAsUsedPlugin.js	100%	100%	100%	100%
Generator.js	100%	100%	100%	100%
HotModuleReplacementPlugin.js	100%	100%	100%	100%
HotUpdateChunk.js	100%	100%	100%	100%
IgnorePlugin.js	100%	100%	100%	100%
IgnoreWarningsPlugin.js	100%	100%	100%	100%
InitFragment.js	100%	100%	100%	100%
JavascriptMetaInfoPlugin.js	100%	100%	100%	100%
LibraryTemplatePlugin.js	100%	100%	100%	100%
LoaderOptionsPlugin.js	100%	100%	100%	100%
LoaderTargetPlugin.js	100%	100%	100%	100%
MainTemplate.js	100%	100%	100%	100%
ManifestPlugin.js	100%	100%	100%	100%
Module.js	98.50%	100%	100%	98.50%	1305, 1310, 1371, 1385, 1447, 1456
ModuleFactory.js	100%	100%	100%	100%
ModuleFilenameHelpers.js	98.85%	100%	100%	98.85%	106, 108
ModuleGraph.js	99.73%	100%	100%	99.73%	1004
ModuleGraphConnection.js	100%	100%	100%	100%
ModuleInfoHeaderPlugin.js	100%	100%	100%	100%
ModuleNotFoundError.js	100%	100%	100%	100%
ModuleProfile.js	100%	100%	100%	100%
ModuleSourceTypeConstants.js	100%	100%	100%	100%
ModuleTemplate.js	100%	100%	100%	100%
ModuleTypeConstants.js	100%	100%	100%	100%
MultiCompiler.js	99.69%	100%	100%	99.69%	645
MultiStats.js	100%	100%	100%	100%
MultiWatching.js	100%	100%	100%	100%
NoEmitOnErrorsPlugin.js	100%	100%	100%	100%
NodeStuffPlugin.js	100%	100%	100%	100%
NormalModule.js	97.83%	100%	100%	97.83%	1072, 1106, 1122, 1209, 1834, 1839–1849, 794, 797, 814, 831
NormalModuleFactory.js	99.47%	100%	100%	99.47%	1083, 1392, 486, 498
NormalModuleReplacementPlugin.js	100%	100%	100%	100%
NullFactory.js	100%	100%	100%	100%
OptimizationStages.js	100%	100%	100%	100%
OptionsApply.js	100%	100%	100%	100%
Parser.js	100%	100%	100%	100%
PlatformPlugin.js	100%	100%	100%	100%
PrefetchPlugin.js	100%	100%	100%	100%
ProgressPlugin.js	98.85%	100%	100%	98.85%	519–520, 525, 527, 591
ProvidePlugin.js	100%	100%	100%	100%
RawModule.js	100%	100%	100%	100%
RecordIdsPlugin.js	100%	100%	100%	100%
RequestShortener.js	100%	100%	100%	100%
ResolverFactory.js	100%	100%	100%	100%
RuntimeGlobals.js	100%	100%	100%	100%
RuntimeModule.js	100%	100%	100%	100%
RuntimePlugin.js	100%	100%	100%	100%
RuntimeTemplate.js	100%	100%	100%	100%
SelfModuleFactory.js	100%	100%	100%	100%
SingleEntryPlugin.js	100%	100%	100%	100%
SourceMapDevToolModuleOptionsPlugin.js	100%	100%	100%	100%
SourceMapDevToolPlugin.js	99.16%	100%	100%	99.16%	267–268, 610
Stats.js	100%	100%	100%	100%
Template.js	100%	100%	100%	100%
TemplatedPathPlugin.js	98.86%	100%	100%	98.86%	136–137
UseStrictPlugin.js	100%	100%	100%	100%
WarnCaseSensitiveModulesPlugin.js	100%	100%	100%	100%
WarnDeprecatedOptionPlugin.js	100%	100%	100%	100%
WarnNoModeSetPlugin.js	100%	100%	100%	100%
WatchIgnorePlugin.js	100%	100%	100%	100%
Watching.js	100%	100%	100%	100%
WebpackError.js	100%	100%	100%	100%
WebpackIsIncludedPlugin.js	100%	100%	100%	100%
WebpackOptionsApply.js	100%	100%	100%	100%
WebpackOptionsDefaulter.js	100%	100%	100%	100%
buildChunkGraph.js	99.87%	100%	100%	99.87%	325
cli.js	98.46%	100%	100%	98.46%	10, 119, 471, 503, 545, 815
index.js	99.72%	100%	100%	99.72%	165
validateSchema.js	94.67%	100%	100%	94.67%	100, 87, 89, 98
webpack.js	96.33%	100%	100%	96.33%	10, 198, 220, 222
lib/asset
AssetBytesGenerator.js	100%	100%	100%	100%
AssetBytesParser.js	100%	100%	100%	100%
AssetGenerator.js	100%	100%	100%	100%
AssetModulesPlugin.js	97.32%	100%	100%	97.32%	283, 307, 310, 36, 362, 41
AssetParser.js	100%	100%	100%	100%
AssetSourceGenerator.js	100%	100%	100%	100%
AssetSourceParser.js	100%	100%	100%	100%
RawDataUrlModule.js	100%	100%	100%	100%
lib/async-modules
AsyncModuleHelpers.js	100%	100%	100%	100%
AwaitDependenciesInitFragment.js	100%	100%	100%	100%
InferAsyncModulesPlugin.js	100%	100%	100%	100%
lib/cache
AddBuildDependenciesPlugin.js	100%	100%	100%	100%
AddManagedPathsPlugin.js	100%	100%	100%	100%
IdleFileCachePlugin.js	97.92%	100%	100%	97.92%	71, 83, 91
MemoryCachePlugin.js	95.83%	100%	100%	95.83%	33
MemoryWithGcCachePlugin.js	93.15%	100%	100%	93.15%	106, 113–114, 122, 89
PackFileCacheStrategy.js	96.40%	100%	100%	96.40%	1250, 1350, 1354, 1416, 628, 647, 657–659, 661, 677–678, 683, 686, 688, 693, 698, 722, 728, 762, 768, 774, 779, 790, 799, 804–805, 807, 824, 830–831, 833
ResolverCachePlugin.js	100%	100%	100%	100%
getLazyHashedEtag.js	100%	100%	100%	100%
mergeEtags.js	100%	100%	100%	100%
lib/config
browserslistTargetHandler.js	100%	100%	100%	100%
defaults.js	99.29%	100%	100%	99.29%	1411–1413, 1421, 271, 274, 279, 283
normalization.js	99%	100%	100%	99%	191–192, 258, 273
target.js	100%	100%	100%	100%
lib/container
ContainerEntryDependency.js	100%	100%	100%	100%
ContainerEntryModule.js	100%	100%	100%	100%
ContainerEntryModuleFactory.js	100%	100%	100%	100%
ContainerExposedDependency.js	100%	100%	100%	100%
ContainerPlugin.js	100%	100%	100%	100%
ContainerReferencePlugin.js	100%	100%	100%	100%
FallbackDependency.js	100%	100%	100%	100%
FallbackItemDependency.js	100%	100%	100%	100%
FallbackModule.js	100%	100%	100%	100%
FallbackModuleFactory.js	100%	100%	100%	100%

alexander-akait added 7 commits May 20, 2026 17:12

Copilot AI review requested due to automatic review settings May 20, 2026 21:26

Copilot started reviewing on behalf of alexander-akait May 20, 2026 21:27