HTMLScanner.scanName: ASCII fast-path for the per-char inner loop#171

Merged: rbri merged 1 commit into HtmlUnit:master from shapiroronny:perf/scanname-ascii-fastpath on Apr 27, 2026

@shapiroronny
Contributor

scanName runs once per element and once per attribute name and is one of the hottest frames in a parser-only profile (~5% self-time on a typical HTML page). The inner loop calls Character.isLetterOrDigit, Character.isLowerCase / isUpperCase, and Character.toLowerCase / toUpperCase per character, none of which are JIT-inlineable -- the existing comment in the same method already hand-tunes the whitespace check around this exact concern.

HTML element/attribute names are virtually always lowercase ASCII, so:

  • For the strict-mode "is name char?" check: explicit ASCII branch (a-z / A-Z / 0-9 / - . : _) before falling through to Character.isLetterOrDigit only for c >= 0x80.
  • For NAMES_LOWERCASE: if c is uppercase ASCII, lowercase via c | 0x20 (no JDK call); only Character.toLowerCase for non-ASCII chars that aren't already lowercase. Symmetric for NAMES_UPPERCASE.
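
A minimal sketch of the two fast paths listed above, not the code from this PR. The class and method names (`AsciiNameFastPath`, `isNameChar`, `toLowerFast`, `toUpperFast`) are hypothetical, and the real scanName presumably does these checks inline in its loop rather than through helpers:

```java
final class AsciiNameFastPath {

    /** Strict-mode "is name char?" check: ASCII handled with plain comparisons. */
    public static boolean isNameChar(final char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == ':' || c == '_';
        }
        // Non-ASCII: fall through to the JDK classification.
        return Character.isLetterOrDigit(c);
    }

    /** NAMES_LOWERCASE: uppercase ASCII is folded with a single OR. */
    public static char toLowerFast(final char c) {
        if (c < 0x80) {
            return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
        }
        // Non-ASCII: call the JDK only if the char is not already lowercase.
        return Character.isLowerCase(c) ? c : Character.toLowerCase(c);
    }

    /** NAMES_UPPERCASE: the symmetric case, clearing bit 0x20 for a-z. */
    public static char toUpperFast(final char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z') ? (char) (c & ~0x20) : c;
        }
        return Character.isUpperCase(c) ? c : Character.toUpperCase(c);
    }

    public static void main(final String[] args) {
        final StringBuilder sb = new StringBuilder();
        for (final char c : "DIV".toCharArray()) {
            sb.append(toLowerFast(c));
        }
        System.out.println(sb); // prints "div"
    }
}
```

The `c | 0x20` trick works because each ASCII uppercase letter differs from its lowercase form only in bit 0x20; the range check before it keeps characters such as `@` or `[` from being altered.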

Behavior is preserved for all inputs, including non-ASCII names that fall through to the JDK methods.
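
One way to sanity-check this claim outside the test suite is an exhaustive comparison over every `char` value. The sketch below is illustrative only: the "Jdk" reference methods are an assumed shape for the pre-change logic, built from the JDK calls named in the description, and all class and method names are hypothetical.

```java
final class FastPathEquivalenceCheck {

    // Reference forms using only JDK calls (assumed shape of the old logic).
    static boolean isNameCharJdk(final char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '.' || c == ':' || c == '_';
    }

    static char lowerJdk(final char c) {
        return Character.isLowerCase(c) ? c : Character.toLowerCase(c);
    }

    static char upperJdk(final char c) {
        return Character.isUpperCase(c) ? c : Character.toUpperCase(c);
    }

    // Fast-path forms: explicit ASCII branch, JDK only for c >= 0x80.
    static boolean isNameCharFast(final char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == ':' || c == '_';
        }
        return Character.isLetterOrDigit(c);
    }

    static char lowerFast(final char c) {
        if (c < 0x80) {
            return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
        }
        return lowerJdk(c);
    }

    static char upperFast(final char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z') ? (char) (c & ~0x20) : c;
        }
        return upperJdk(c);
    }

    /** Counts the char values for which any fast path disagrees with its reference. */
    public static int countMismatches() {
        int mismatches = 0;
        // int loop variable: a char counter would wrap around and never terminate.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            final char c = (char) i;
            if (isNameCharFast(c) != isNameCharJdk(c)
                    || lowerFast(c) != lowerJdk(c)
                    || upperFast(c) != upperJdk(c)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(final String[] args) {
        System.out.println("mismatches: " + countMismatches()); // prints "mismatches: 0"
    }
}
```

Since the non-ASCII branch delegates to the same JDK calls, the only range where the two forms can differ is 0x00-0x7F, and the loop confirms they agree there.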

JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks, each with 3 warmup and 5 measurement iterations, i.e. 5 x 5 = 25 samples; paired sequentially in one JMH session, on top of a corresponding htmlunit-side change that defers the HtmlPage id/name index build):

before: 593.611 +- 13.110 us/op
after: 564.642 +- 11.833 us/op
delta: -29.0 us, -4.9%
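
For reference, the delta line follows directly from the two means. The short sketch below reproduces the arithmetic and also checks whether the reported error intervals overlap; the class and method names are hypothetical, and the overlap check is a rough reading of the +- figures, not a statistical test.

```java
import java.util.Locale;

final class BenchmarkDelta {

    /** Absolute change in us/op (negative means faster). */
    public static double deltaUs(final double before, final double after) {
        return after - before;
    }

    /** Relative change in percent of the "before" mean. */
    public static double deltaPercent(final double before, final double after) {
        return (after - before) / before * 100.0;
    }

    /** True if [m1 - e1, m1 + e1] and [m2 - e2, m2 + e2] share any point. */
    public static boolean intervalsOverlap(final double m1, final double e1,
                                           final double m2, final double e2) {
        return (m1 - e1) <= (m2 + e2) && (m2 - e2) <= (m1 + e1);
    }

    public static void main(final String[] args) {
        final double before = 593.611;
        final double after = 564.642;
        System.out.println(String.format(Locale.ROOT, "delta: %.1f us, %.1f%%",
                deltaUs(before, after), deltaPercent(before, after)));
        // prints "delta: -29.0 us, -4.9%"
        System.out.println("intervals overlap: "
                + intervalsOverlap(before, 13.110, after, 11.833));
        // prints "intervals overlap: false"
    }
}
```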

Tests: full htmlunit-neko suite (mvn -Dgpg.skip test) -- 8407 tests, 0 failures, 0 errors. CanonicalSAXTest / CanonicalTest / CanonicalDomFragmentTest (1156 cases each, the scanner-correctness suites) all pass.

@rbri
Member

rbri commented Apr 27, 2026

Great, many thanks....

rbri merged commit aae3eb0 into HtmlUnit:master on Apr 27, 2026 (8 checks passed)
@rbri
Member

rbri commented Apr 27, 2026

5.0.0-SNAPSHOT updated
