fix(html): align walkHtmlTokens with WHATWG spec and swc_html_parser #21000
- Drop the length cap on the script-data-double-escape-end tempBuffer so end tags like `</scripts>` no longer falsely match `"script"` and prematurely exit double-escaped mode.
- Update `tagStart` when entering the less-than-sign state from any script-data escaped state so a matching `</script>` close tag reached via SCRIPT_DATA_ESCAPED preserves the preceding script body in the emitted text span.
- Flush pending text when a content-mode close tag is detected via the attribute path (`</title foo>`, `</script bar=baz>`), not only via the direct `>` path.
- Track `commentStart` from the markup-declaration-open transition so EOF inside an incomplete `<!…` emits the correct byte range.
- Emit the missing-attribute-value form (`<a foo=>`) so the open-tag byte range still includes the `>`.
- Implement spec-correct named character reference matching by adding the full WHATWG named character references table (generated via `tooling/generate-html-entities.js`) and restoring the STATE_AMBIGUOUS_AMPERSAND fallthrough. `decodeHtmlEntities` now handles the full table, including legacy bare forms (`&amp`, `&copy`) and multi-code-point entities (`&NotEqualTilde;`), and applies WHATWG longest-prefix backtrack (`&notpre;` -> `¬pre;`).

Adds 199 unit tests covering every state-machine branch in the tokenizer and reaches 100% statements/branches/functions/lines coverage on lib/html/walkHtmlTokens.js.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
Switch lib/html/htmlEntities.js from a 40 KB frozen object literal to a 28 KB delta-encoded string decoded lazily on first use.

- Each record is `<sharedByte><suffix>\t<valueLenByte><value>` with 1-byte length prefixes for shared-prefix count and value length.
- Names sorted alphabetically so adjacent entries share long prefixes.
- Length-prefixed values mean records can contain raw `\t` or `\n` (as in `&Tab;` and `&NewLine;`) without escaping.
- Module exports a getter function; callers cache the table locally so the decode runs at most once per `decodeHtmlEntities` invocation.

Also exclude the generated file from the eslint config so prettier doesn't try to wrap the long source string.

199/199 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
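The record layout described in this commit can be sketched as a small encode/decode pair. This is only an illustration under stated assumptions: `encodeEntities` and `decodeEntities` are hypothetical names, and the real generator and getter may differ in detail (for example in how names are sorted or how the table is cached).

```javascript
// Sketch of the `<sharedByte><suffix>\t<valueLenByte><value>` layout.
// Helper names are hypothetical; they are not taken from the PR.
const encodeEntities = (table) => {
  const names = Object.keys(table).sort();
  let out = "";
  let prev = "";
  for (const name of names) {
    // Count how many leading characters this name shares with the previous one.
    let shared = 0;
    const max = Math.min(prev.length, name.length);
    while (shared < max && prev.charCodeAt(shared) === name.charCodeAt(shared)) {
      shared++;
    }
    const value = table[name];
    out +=
      String.fromCharCode(shared) +
      name.slice(shared) +
      "\t" +
      String.fromCharCode(value.length) +
      value;
    prev = name;
  }
  return out;
};

const decodeEntities = (encoded) => {
  const table = Object.create(null);
  let pos = 0;
  let prev = "";
  while (pos < encoded.length) {
    const shared = encoded.charCodeAt(pos);
    // Entity names never contain a tab, so the first tab ends the suffix.
    const tab = encoded.indexOf("\t", pos + 1);
    const name = prev.slice(0, shared) + encoded.slice(pos + 1, tab);
    // The value is length-prefixed, so it may itself contain `\t` or `\n`.
    const valueLen = encoded.charCodeAt(tab + 1);
    table[name] = encoded.slice(tab + 2, tab + 2 + valueLen);
    pos = tab + 2 + valueLen;
    prev = name;
  }
  return table;
};
```

Because the value length is read before the value, a record whose value is a raw tab or newline round-trips without any escaping.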
…ties generator

Surface tokenizer parse errors via a new optional `parseError` callback on `HtmlTokenCallbacks`. Severity is `"warning"` when the tokenizer recovers and the emitted token is still well-formed (missing-attribute-value, unexpected-equals-sign-before-attribute-name, abrupt-doctype-*, etc.) and `"error"` when the emitted token byte range is incomplete (eof-in-tag, eof-in-comment, eof-in-doctype, eof-in-cdata, eof-in-script-html-comment-like-text).

Fix the EOF-in-tag deviation: previously a half-finished tag at EOF (`<div class="x`) was silently emitted as text. The tokenizer now emits the partial open/close tag at EOF and reports `eof-in-tag` so consumers still see a tag callback.

Fix two small spec deviations: `STATE_COMMENT_START` and `STATE_COMMENT_START_DASH` now correctly reconsume on the anything-else branch (they previously consumed the character, which prevented the comment-less-than-sign chain from firing). Also correct a misleading code comment in `STATE_COMMENT_END`.

Make `emitAttribute` advance `pos` past the closing quote by default when no `attribute` callback is provided, so callers that only need the error stream don't have to wire an attribute handler just to keep parsing.

Restructure the entities generator to follow the same pattern as `tooling/generate-runtime-code.js`: vendor the WHATWG entities.json locally at `tooling/html-entities.json`, support `--write` for in-place generation, and wire into `lint:special` / `fix:special` so the generated `lib/html/htmlEntities.js` is verified by CI. A separate `--fetch` flag refreshes the vendored JSON from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
Drop `lib/html/htmlEntities.js` and inline the delta-encoded entity table plus the lazy `getHtmlEntities` getter into a `// #region html entities` block at the top of `lib/html/walkHtmlTokens.js`.

The generator (`tooling/generate-html-entities.js`) now splices the region into `walkHtmlTokens.js` directly, mirroring how `tooling/generate-runtime-code.js` operates on `lib/util/semver.js`. `yarn lint:special` verifies the inlined block is in sync; `yarn fix:special` writes it in place. `--fetch` still refreshes the vendored `tooling/html-entities.json` from the spec URL.

226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
… literal
The lazy `getHtmlEntities()` getter was building a 2231-entry Map on
first use by parsing a delta-encoded string with one `indexOf` and one
`slice` per entry. Replace it with a single `Object.freeze({...})`
literal so the JS engine builds the table once at module-load time and
all subsequent lookups are direct property access.
The literal lives inside the existing `// #region html entities` block
and is preceded by `// prettier-ignore` / `// cspell:disable-next-line`
so the long line is not wrapped on regeneration. File size grows from
~140 KB to ~143 KB (Object literal vs delta-encoded string), in
exchange for removing the per-first-call decode work and the
`htmlEntityCache` / `getHtmlEntities` scaffolding.
Update the two callers in `walkHtmlTokens.js` (NAMED_CHARACTER_REFERENCE
state and `decodeHtmlEntities`) to read from `HTML_ENTITIES` directly.
226/226 tests pass, 100% coverage on `walkHtmlTokens.js`.
https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
…gus-comment

Three EOF-path fixes surfaced by an independent spec audit:

- `STATE_ATTRIBUTE_NAME` at EOF previously emitted an attribute with stale or `-1` `attrNameEnd` because the state machine only assigns `attrNameEnd` on the transitions out of attribute-name. Set it to `len` in the EOF block so `<div data-x` at EOF emits attribute `data-x` with the correct byte range.
- `STATE_CHARACTER_REFERENCE`-family (states 71–79) at EOF previously fell straight to the trailing-text branch, skipping the `returnState`. When the return state was inside an attribute value (e.g. `<a href="x&` at EOF), neither the partial tag nor `eof-in-tag` was emitted. Now the EOF block unwinds `state` to `returnState` before the existing in-tag check, so the partial open tag is emitted and `eof-in-tag` is reported.
- `STATE_BOGUS_COMMENT` at EOF previously reported `eof-in-comment` as an error, but the spec defines no parse error for this case (the comment is emitted cleanly with the EOF token). Drop the error report for bogus-comment-at-EOF while keeping it for the real comment states.

229/229 tests pass, 100% coverage on `walkHtmlTokens.js`.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
Three remaining spec-alignment items from the recent audit:
- decodeHtmlEntities now applies the WHATWG numeric-character-reference
remap rules: 0x00, surrogates (0xD800-0xDFFF), and code points above
0x10FFFF return U+FFFD; the 0x80-0x9F C1 range remaps through the
Windows-1252 table per spec (e.g. `&#x80;` -> euro, `&#x99;` -> tm).
- decodeHtmlEntities accepts a new optional `isAttribute` flag that
applies the WHATWG consumed-as-part-of-an-attribute rule. With
`isAttribute: true`, a named entity without a trailing `;` whose
next character is `=` (or whose longest-prefix backtrack leaves
trailing alphanumerics) stays literal:
`decodeHtmlEntities("&amp=foo", true)` returns `&amp=foo`; without
the flag it decodes to `&=foo`.
- Tokenizer now fires the `end-tag-with-attributes` parse error
(severity `"warning"`) when an end tag like `</div foo>` emits.
Tracked via a `tagHasAttributes` flag set in `emitAttribute` and
cleared in `emitOpenTag` / `emitCloseTag`.
235/235 tests pass, 100% coverage on `walkHtmlTokens.js`.
https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
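The numeric-character-reference remap rules in the first bullet can be sketched as follows. This is a minimal illustration, not the code from the PR: the table values are the Windows-1252 mappings listed in the WHATWG "numeric character reference end state", and the body of `decodeNumericReference` here is an assumption about how such a helper might look.

```javascript
// C1 range (0x80-0x9F) remapped through Windows-1252, per the WHATWG table.
// Code points absent from the table (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through.
const C1_REMAP = new Map([
  [0x80, 0x20ac], [0x82, 0x201a], [0x83, 0x0192], [0x84, 0x201e],
  [0x85, 0x2026], [0x86, 0x2020], [0x87, 0x2021], [0x88, 0x02c6],
  [0x89, 0x2030], [0x8a, 0x0160], [0x8b, 0x2039], [0x8c, 0x0152],
  [0x8e, 0x017d], [0x91, 0x2018], [0x92, 0x2019], [0x93, 0x201c],
  [0x94, 0x201d], [0x95, 0x2022], [0x96, 0x2013], [0x97, 0x2014],
  [0x98, 0x02dc], [0x99, 0x2122], [0x9a, 0x0161], [0x9b, 0x203a],
  [0x9c, 0x0153], [0x9e, 0x017e], [0x9f, 0x0178]
]);

// Sketch only: maps a parsed numeric reference to its replacement string.
const decodeNumericReference = (code) => {
  // 0x00, out-of-range values, and surrogates all become U+FFFD.
  if (code === 0 || code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff)) {
    return "\uFFFD";
  }
  const remapped = C1_REMAP.get(code);
  return String.fromCodePoint(remapped === undefined ? code : remapped);
};
```

Checking the range before calling `String.fromCodePoint` matters because that function throws a `RangeError` for values above 0x10FFFF.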
🦋 Changeset detected

Latest commit: 3effe59

The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
This PR is packaged and the instant preview is available (1097a7f). Install it locally:
npm i -D webpack@https://pkg.pr.new/webpack@1097a7f
yarn add -D webpack@https://pkg.pr.new/webpack@1097a7f
pnpm add -D webpack@https://pkg.pr.new/webpack@1097a7f |
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main   #21000      +/-   ##
==========================================
+ Coverage   90.94%   91.60%   +0.66%
==========================================
  Files         573      573
  Lines       58940    59120     +180
  Branches    15888    15948      +60
==========================================
+ Hits        53601    54159     +558
+ Misses       5339     4961     -378
```
Pull request overview
This PR updates webpack’s experimental HTML tokenizer (walkHtmlTokens) to more closely follow the WHATWG HTML tokenization spec and match behavior of swc_html_parser, including fixing multiple byte-range edge cases and expanding entity decoding to the full named character reference table.
Changes:
- Align tokenizer state-machine behavior for script-data escaping, content-mode close tags via attribute paths, markup-declaration EOF handling, and missing attribute values; add a new `parseError` callback with `"warning"`/`"error"` severities.
- Implement spec-correct named character reference matching (full WHATWG entities table + longest-prefix backtrack) and upgrade `decodeHtmlEntities` accordingly.
- Add a generator (`tooling/generate-html-entities.js`) plus extensive unit tests (targeting full branch coverage).
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lib/html/walkHtmlTokens.js` | Main tokenizer updates: state-machine fixes, new parseError callback, full WHATWG entity table + improved entity decoding, and expanded EOF handling. |
| `tooling/generate-html-entities.js` | Generates the inlined HTML_ENTITIES region in walkHtmlTokens.js from tooling/html-entities.json. |
| `package.json` | Hooks the generator into lint:special / fix:special workflows. |
| `tooling/html-entities.json` | Vendored WHATWG named character references table used by the generator. |
| `test/walkHtmlTokens.unittest.js` | Adds regression tests and broad state-machine branch coverage + parseError/decodeHtmlEntities tests. |
| `cspell.json` | Excludes tooling/html-entities.json from spellchecking. |
| `.changeset/align-html-lexer-script-data.md` | Patch changeset documenting tokenizer alignment, new callback, and full entities support. |
```js
tempBuffer += String.fromCharCode(cc + 0x20);
}
tempBuffer += String.fromCharCode(cc + 0x20);
pos++;
```

```js
if (namedEntityConsumed > 33) break;
runLen++;
// Safety cap — the longest entity name is 32 chars (without `&`).
if (runLen > 32) break;
```

```js
for (let i = name.length; i > 0; i--) {
	const prefix = name.slice(0, i);
	if (HTML_ENTITIES[prefix] !== undefined) {
		// Attribute-context longest-prefix guard: if the matched
		// prefix doesn't end with `;` and the leftover starts with
		// an alphanumeric character, leave literal per WHATWG.
		// (The regex greedy-consumes alphanumerics, so any leftover
		// within `name` is itself alphanumeric — we only need to
```
- Fix tsc lint failure: cast the inlined `HTML_ENTITIES` to `Readonly<Record<string, string>>` so the two `HTML_ENTITIES[name]` call sites type-check (the frozen-literal type was too narrow).
- Copilot #1: replace the `tempBuffer` string in `STATE_SCRIPT_DATA_DOUBLE_ESCAPE_START` / `_END` with a small `scriptMatch` counter against the literal `"script"`. Worst-case inputs with very long ASCII-alpha runs after `</` no longer grow a buffer or do quadratic string concatenation. Vestigial `tempBuffer = ""` resets in the four content-mode less-than-sign states (which never read the buffer) are removed; the `tempBuffer` variable is gone.
- Copilot #2: introduce a `MAX_ENTITY_NAME_LEN = 32` constant (longest WHATWG entity name including the trailing `;`) and cap the named-character-reference alphanumeric run at `MAX_ENTITY_NAME_LEN - 1`. Replaces the off-by-one `if (runLen > 32) break` with a clearer in-loop bound.
- Copilot #3: cap the `decodeHtmlEntities` longest-prefix backtrack at `MAX_ENTITY_NAME_LEN`. Inputs like `&` followed by thousands of alphanumerics stay linear-time; anything past the cap is appended verbatim.

Adds a regression test that decodes `&` + 1000 chars to confirm the decoder doesn't go quadratic.

236/236 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
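The capped longest-prefix backtrack from the "Copilot #3" item might look roughly like the sketch below. The tiny `ENTITIES` table and the `decodeNamed` helper are stand-ins invented for illustration; only `MAX_ENTITY_NAME_LEN = 32` and the key shape (name without `&`, with an optional trailing `;`) come from the PR text.

```javascript
const MAX_ENTITY_NAME_LEN = 32;

// Hypothetical miniature table; the real one holds the full WHATWG list.
const ENTITIES = Object.assign(Object.create(null), {
  "amp": "&",
  "amp;": "&",
  "not": "\u00AC",
  "not;": "\u00AC",
  "notin;": "\u2209"
});

// `name` is everything after `&`, including an optional trailing `;`.
const decodeNamed = (name) => {
  // Never try prefixes longer than the longest possible entity name,
  // so the number of lookups per reference is bounded by a constant.
  const searchLen = Math.min(name.length, MAX_ENTITY_NAME_LEN);
  for (let i = searchLen; i > 0; i--) {
    const value = ENTITIES[name.slice(0, i)];
    // Longest matching prefix wins; the rest is appended verbatim.
    if (value !== undefined) return value + name.slice(i);
  }
  return `&${name}`;
};
```

With this shape, `decodeNamed("notpre;")` backtracks to the legacy bare `not` and keeps `pre;` as literal text, which mirrors the `&notpre;` -> `¬pre;` behavior described in the PR.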
```js
// Match either `&name;` / `&name` (named refs may be legacy-bare per the
// WHATWG entities table) or a numeric reference `&#\u2026;?`.
return str.replace(
	/&(#[xX]?[0-9a-fA-F]+|#?[0-9a-zA-Z]+);?/g,
	(match, _group, offset, source) => {
		// Numeric reference: &#65; or &#x41;
		if (match.charCodeAt(1) === 0x23 /* # */) {
			const lastChar = match.charAt(match.length - 1);
			const body = lastChar === ";" ? match.slice(2, -1) : match.slice(2);
			const isHex =
				body.charCodeAt(0) === 0x78 || body.charCodeAt(0) === 0x58;
			const code = isHex
				? Number.parseInt(body.slice(1), 16)
				: Number.parseInt(body, 10);
			if (!Number.isNaN(code)) return decodeNumericReference(code);
			return match; // Invalid numeric (e.g. &#;)
```
```js
it("nAMED_CHARACTER_REFERENCE: caps consumption at 33 chars", () => {
	// Entity names in the WHATWG table are at most ~33 chars; the
	// scanner has a safety cap that breaks out of the consume loop
	// past 33 alphanumeric chars even without a closing semicolon.
```
…ment

Copilot review follow-ups on commit eb2fc4d.

- `decodeHtmlEntities` previously matched numeric refs with `#[xX]?[0-9a-fA-F]+`, which let a decimal reference like `&#65b` greedily consume the trailing `b` (it's a hex digit) and incorrectly decode to `A` while dropping the `b`. Split the regex into three clear alternatives:
  - `&#x<hex>`: hex (requires `x`/`X`)
  - `&#<dec>`: decimal (digits only)
  - `&<name>`: named (letter then alphanumerics)

  With the stricter regex the parse-int call always has a non-empty digit body, so the `Number.isNaN(code)` defensive return is removed. Adds a regression test: `&#65b` -> `Ab`, plus a numeric reference followed by `f` that decodes to `�f`.
- Reword the "named-character-reference cap" test comment so it matches the actual `MAX_ENTITY_NAME_LEN = 32` bound (the previous copy said "33 chars" from the pre-constant era).

237/237 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
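A regex split into the three alternatives named above could be sketched like this. The exact pattern in the PR is not shown in this conversation, so `REFERENCE` and `decodeNumericOnly` are assumptions made for illustration, and the range handling is deliberately reduced to the U+FFFD cases.

```javascript
// One alternative per reference kind: hex, decimal, named.
const REFERENCE =
  /&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][0-9a-zA-Z]*));?/g;

// Simplified replacement: 0 and out-of-range values become U+FFFD.
// (The C1 and surrogate handling from the PR is omitted in this sketch.)
const toChar = (code) =>
  code === 0 || code > 0x10ffff ? "\uFFFD" : String.fromCodePoint(code);

// Hypothetical helper: decodes numeric references, leaves named ones alone.
const decodeNumericOnly = (str) =>
  str.replace(REFERENCE, (match, hex, dec) => {
    if (hex !== undefined) return toChar(Number.parseInt(hex, 16));
    if (dec !== undefined) return toChar(Number.parseInt(dec, 10));
    return match; // named reference: out of scope for this sketch
  });
```

Because the decimal alternative only accepts `[0-9]+`, the `b` in `&#65b` is no longer swallowed by the match and survives as literal text.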
```js
	"walkHtmlTokens.js"
);

const REGION_REGEXP = /\/\/ #region html entities\n[\s\S]+?\/\/ #endregion\n/;
```

```js
return fetchUrl(/** @type {string} */ (res.headers.location)).then(
	resolve,
	reject
);
```

```js
 * Reports a tokenizer parse error to the consumer. The byte range and
 * severity follow the WHATWG spec naming. Severity is `"error"` for
 * cases where the emitted token is incomplete (EOF inside a tag or
 * comment); everything else is a `"warning"`.
 * @param {string} code WHATWG parse-error code (kebab-case)
 * @param {number} start byte offset where the error starts
 * @param {number} end byte offset where the error ends
```
Copilot review follow-ups on commit 3e4fe10.

- The entities generator's REGION_REGEXP hard-coded `\n`, so it false-reported "need to be updated" on Windows checkouts where git normalized line endings to CRLF. Match `\r?\n` and preserve the file's existing EOL style (CRLF on Windows, LF elsewhere) when writing the regenerated region so we don't introduce mixed line endings.
- HTTP redirect handling in `fetchUrl` passed `res.headers.location` through unchanged, which broke when the server returned a relative `Location`. Resolve via `new URL(location, url).toString()` and guard against a missing `Location` header.
- `parseError` doc strings called offsets "byte offsets" but the values are actually JS string indices (UTF-16 code units), which matters for inputs containing non-BMP code points. Reword the `ParseErrorSeverity` typedef, the `reportError` JSDoc, and the surrounding internal comments to use "offset" / "string offset" / "offset range" instead of "byte".

237/237 tests pass, lint/prettier/cspell/tsc clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
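The first two fixes can be illustrated with a short sketch. `resolveRedirect` and `replaceRegion` are hypothetical helper names, and the EOL detection shown here (checking whether the file contains `\r\n`) is one plausible approach rather than the generator's confirmed logic.

```javascript
// Resolve a possibly-relative `Location` header against the request URL.
const resolveRedirect = (location, requestUrl) => {
  if (!location) {
    throw new Error(`Redirect from ${requestUrl} with no Location header`);
  }
  return new URL(location, requestUrl).toString();
};

// Accept both LF and CRLF around the region markers.
const REGION = /\/\/ #region html entities\r?\n[\s\S]+?\/\/ #endregion\r?\n/;

// Replace the region body while keeping the file's existing EOL style.
const replaceRegion = (source, bodyLines) => {
  const eol = source.includes("\r\n") ? "\r\n" : "\n";
  const region = [
    "// #region html entities",
    ...bodyLines,
    "// #endregion",
    ""
  ].join(eol);
  // A function replacer avoids `$` in the body being read as a pattern.
  return source.replace(REGION, () => region);
};
```

Passing a function to `String.prototype.replace` is a small safety measure here, since a generated table could contain `$&` or similar sequences that a string replacement would interpret.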
```js
		if (HTML_ENTITIES[withSemi] !== undefined) {
			namedEntityConsumed = n + 1;
			break;
		}
	}
	const bare = input.slice(pos, pos + n);
	if (HTML_ENTITIES[bare] !== undefined) {
```

```js
	const bare = input.slice(pos, pos + n);
	if (HTML_ENTITIES[bare] !== undefined) {
		namedEntityConsumed = n;
		break;
```

```js
for (let i = searchLen; i > 0; i--) {
	const prefix = name.slice(0, i);
	if (HTML_ENTITIES[prefix] !== undefined) {
		// Attribute-context longest-prefix guard: if the matched
```

```js
// strings (1–2 UTF-16 code units).
// prettier-ignore
// cspell:disable-next-line
const HTML_ENTITIES = /** @type {Readonly<Record<string, string>>} */ (Object.freeze(${JSON.stringify(map)}));
```

```js
	);
} else {
	console.error(
		`${path.relative(process.cwd(), TARGET_PATH)} need to be updated`
```

```md
"webpack": patch
---

Align the experimental HTML tokenizer with the WHATWG spec: fix byte-range bugs in the script-data, content-mode end-tag, attribute-value, and EOF states; surface tokenizer parse errors to consumers via a new `parseError` callback (`"warning"` when the tokenizer recovers and the emitted token is still well-formed, `"error"` when the byte range is incomplete — e.g. `eof-in-tag`); and add the full WHATWG named character references table so `decodeHtmlEntities` handles all named entities (including legacy bare forms like `&amp` and multi-code-point entities like `&NotEqualTilde;`) with proper longest-prefix backtracking.
```
Copilot review follow-ups on commit 1c233be.

- Real bug: `HTML_ENTITIES[name] !== undefined` lookups against a regular Object hit `Object.prototype` keys (`toString`, `constructor`, `hasOwnProperty`, `__proto__`, …), so inputs like `&toString;` would falsely decode to the source of `Object.prototype.toString`. The generator now emits the table on a null prototype via `Object.assign(Object.create(null), {…})`, so bracket lookups are safe everywhere they're used (both `decodeHtmlEntities` and the tokenizer's NAMED_CHARACTER_REFERENCE state). Adds a regression test for `&toString;`, `&constructor;`, `&hasOwnProperty;`.
- Grammar: change "<path> need to be updated" to "<path> needs to be updated" in the generator's drift message.
- Reword the changeset to use "offset range" instead of "byte range" (the tokenizer uses JS string indices, not bytes).

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` (including tsc) clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
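The prototype-key problem is easy to reproduce in isolation. In the sketch below, `plainTable`, `safeTable`, and `isKnownEntity` are illustrative names; only the `!== undefined` lookup style and the `Object.assign(Object.create(null), {…})` construction come from the commit message.

```javascript
// A regular object literal inherits from Object.prototype.
const plainTable = { "amp;": "&" };

// A null-prototype table has no inherited keys at all.
const safeTable = Object.assign(Object.create(null), { "amp;": "&" });

// Same lookup style as described in the commit: bracket access + undefined check.
const isKnownEntity = (table, name) => table[name] !== undefined;

// On the plain object, inherited members look like entities:
//   isKnownEntity(plainTable, "toString")    is true (inherited function)
//   isKnownEntity(plainTable, "constructor") is true
// On the null-prototype table they are simply absent:
//   isKnownEntity(safeTable, "toString")     is false
//   isKnownEntity(safeTable, "__proto__")    is false
```

An alternative would be to guard every lookup with `Object.prototype.hasOwnProperty.call`, but building the table on a null prototype keeps the hot-path lookups as plain bracket access.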
```js
if (
	res.statusCode === 301 ||
	res.statusCode === 302 ||
	res.statusCode === 307 ||
	res.statusCode === 308
) {
	const location = res.headers.location;
	if (!location) {
		return reject(
			new Error(
				`Redirect from ${url} with no Location header (HTTP ${res.statusCode})`
			)
		);
	}
	// `Location` may be relative; resolve against the current
	// request URL so `https.get` receives an absolute URL.
	return fetchUrl(new URL(location, url).toString()).then(
		resolve,
		reject
	);
}
if (res.statusCode !== 200) {
	return reject(
		new Error(`Failed to fetch ${url}: HTTP ${res.statusCode}`)
	);
}
```
```js
// - 0x00, > 0x10FFFF, or surrogate (0xD800\u20130xDFFF) \u2192 U+FFFD.
// - 0x80\u20130x9F \u2192 Windows-1252 remap (above).
// - Anything else (including noncharacters and C0 controls) \u2192 the
```
…e in source

Copilot review follow-ups on commit a17a150.

- `fetchUrl` in `tooling/generate-html-entities.js` previously rejected or recursed without consuming the response body. On Node's http(s) client this leaves the socket unread and can hang/leak. Call `res.resume()` on both the redirect and the non-200 paths so the body is drained before we release the response.
- The C1 remap table and surrounding comments in `lib/html/walkHtmlTokens.js` had Unicode characters encoded as literal `\uXXXX` escape sequences (likely re-escaped by an earlier tooling pass), which made the source hard to read. Replaced the six-byte escapes with the actual Unicode glyphs for the table values, and ASCII-fied the comment punctuation (en-dash to `-`, arrow to `->`) so the prose stays clear without relying on non-ASCII in source comments.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
```js
// completed prior tag).
if (state === STATE_TAG_NAME) tagNameEnd = len;
```
Copilot review follow-up on commit 083ac67.

Real bug: EOF inside a content-mode end-tag-name state (RCDATA/RAWTEXT/SCRIPT_DATA/SCRIPT_DATA_ESCAPED end-tag-name) left `tagNameEnd` carrying the value from the matching open tag, so `emitCloseTag(len)` sliced a wrong (or empty) range. Example: `<title>x</tit` at EOF emitted a close tag with name `""` because `tagNameEnd` was still `6` from the prior `<title>` open.

Fix: reset `tagNameEnd = len` whenever it's missing or stale (less than `tagNameStart`) instead of only doing so for `STATE_TAG_NAME`.

Also extend the eof-in-script-html-comment-like-text branch to cover `SCRIPT_DATA_ESCAPED_LESS_THAN_SIGN`, `SCRIPT_DATA_DOUBLE_ESCAPED_LESS_THAN_SIGN`, and `SCRIPT_DATA_DOUBLE_ESCAPE_END`. Per spec, each of these reconsumes back into the (double-)escaped state at EOF, which then fires the same parse error.

Regression test covers the three content-mode partial-close-tag cases: `<title>x</tit` -> close "tit", `<style>x</sty` -> close "sty", `<script>x</scr` -> close "scr".

239/239 tests pass, 100% coverage.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
```js
	flushText(tagStart);
	pos =
		input.charCodeAt(tagStart + 1) === CC_SOLIDUS
			? emitCloseTag(len)
			: emitOpenTag(len, false);
} else if (
```
Copilot review follow-up on commit 132a4cd.

Real bug verified by `walkHtmlTokens("a<div", 0, { text: ... })`: the `"a"` text span was emitted twice.

1. `STATE_TAG_OPEN`'s alpha branch flushed pending text on tag entry.
2. The EOF mid-tag handler called `flushText(tagStart)` again, re-emitting the same span because `flushText` never advanced `textStart`.

Fix: `flushText` now advances `textStart = endPos` after emitting, so repeated `flushText` calls for the same span are no-ops. `emitOpenTag` / `emitCloseTag` overwrite `textStart` with their own `nextPos` after the tag emits, so the new advance doesn't shift any subsequent ranges.

238/238 tests pass, 100% coverage on `walkHtmlTokens.js`, `yarn lint` clean.

https://claude.ai/code/session_01N4rd8xuv5oRaHWFL8dFpwh
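The idempotent `flushText` behavior can be shown with a stripped-down model. `createTextFlusher` is a hypothetical wrapper written for this illustration; the real tokenizer keeps `textStart` as a local variable inside `walkHtmlTokens` rather than in a closure like this.

```javascript
// Minimal model of the fix: emitting a span advances `textStart`,
// so a second flush up to the same position finds nothing left to emit.
const createTextFlusher = (input, onText) => {
  let textStart = 0;
  return {
    flushText(endPos) {
      if (endPos > textStart) {
        onText(input.slice(textStart, endPos), textStart, endPos);
        textStart = endPos;
      }
    }
  };
};

// Mirrors the "a<div" scenario: text is flushed on tag entry and again at EOF.
const emitted = [];
const flusher = createTextFlusher("a<div", (text) => emitted.push(text));
flusher.flushText(1); // tag-open path
flusher.flushText(1); // EOF mid-tag path, now a no-op
```

After both calls, `emitted` holds a single `"a"` entry, whereas a version that never advanced `textStart` would have pushed it twice.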