HTMLScanner.scanName: ASCII fast-path for the per-char inner loop by shapiroronny · Pull Request #171 · HtmlUnit/htmlunit-neko

shapiroronny · 2026-04-27T13:44:47Z

scanName runs once per element and once per attribute name and is one of the hottest frames in a parser-only profile (~5% self-time on a typical HTML page). The inner loop calls Character.isLetterOrDigit, Character.isLowerCase / isUpperCase, and Character.toLowerCase / toUpperCase per character, none of which are JIT-inlineable -- the existing comment in the same method already hand-tunes the whitespace check around this exact concern.

HTML element/attribute names are virtually always lowercase ASCII, so:

- For the strict-mode "is name char?" check: explicit ASCII branch (a-z / A-Z / 0-9 / - . : _) before falling through to Character.isLetterOrDigit only for c >= 0x80.
- For NAMES_LOWERCASE: if c is uppercase ASCII, lowercase via c | 0x20 (no JDK call); only Character.toLowerCase for non-ASCII chars that aren't already lowercase. Symmetric for NAMES_UPPERCASE.
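
For readers who want to see the shape of the two fast paths, here is a minimal sketch. It is not the code from the merged commit: the class and method names (`AsciiNameFastPath`, `isStrictNameChar`, `toLower`, `toUpper`) are hypothetical, and the real scanName works inline on its buffer rather than through helpers.

```java
// Hypothetical helper class; names are illustrative and not taken from HTMLScanner.
final class AsciiNameFastPath {

    /** Strict-mode "is name char?" check: ASCII is decided inline, the JDK is asked only for c >= 0x80. */
    static boolean isStrictNameChar(char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == ':' || c == '_';
        }
        return Character.isLetterOrDigit(c);
    }

    /** NAMES_LOWERCASE handling: set bit 0x20 for A-Z, JDK call only for non-ASCII chars not already lowercase. */
    static char toLower(char c) {
        if (c < 0x80) {
            return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
        }
        return Character.isLowerCase(c) ? c : Character.toLowerCase(c);
    }

    /** NAMES_UPPERCASE handling: clear bit 0x20 for a-z, JDK call only for non-ASCII chars not already uppercase. */
    static char toUpper(char c) {
        if (c < 0x80) {
            return (c >= 'a' && c <= 'z') ? (char) (c & ~0x20) : c;
        }
        return Character.isUpperCase(c) ? c : Character.toUpperCase(c);
    }
}
```

The bit trick works because each ASCII uppercase letter differs from its lowercase counterpart only in bit 0x20 ('A' is 0x41, 'a' is 0x61), which is why the range check has to come first.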

Behavior is preserved for all inputs, including non-ASCII names that fall through to the JDK methods.
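
The ASCII half of that claim can be checked exhaustively, since there are only 128 code points to compare. The sketch below is not part of the PR or its test suite; `AsciiEquivalenceCheck` and `countMismatches` are hypothetical names. It counts the ASCII characters for which the inline range checks and bit operations disagree with the corresponding JDK methods.

```java
// Hypothetical standalone check; not part of the htmlunit-neko test suite.
final class AsciiEquivalenceCheck {

    /** Counts ASCII chars where an inline check or bit trick disagrees with the JDK method it replaces. */
    static int countMismatches() {
        int mismatches = 0;
        for (char c = 0; c < 0x80; c++) {
            boolean upperAscii = c >= 'A' && c <= 'Z';
            boolean lowerAscii = c >= 'a' && c <= 'z';
            boolean digitAscii = c >= '0' && c <= '9';

            // Classification must match the JDK for every ASCII char.
            if ((upperAscii || lowerAscii || digitAscii) != Character.isLetterOrDigit(c)) mismatches++;
            if (upperAscii != Character.isUpperCase(c)) mismatches++;
            if (lowerAscii != Character.isLowerCase(c)) mismatches++;

            // Case conversion via bit 0x20 must match the JDK for every ASCII char.
            char lower = upperAscii ? (char) (c | 0x20) : c;
            char upper = lowerAscii ? (char) (c & ~0x20) : c;
            if (lower != Character.toLowerCase(c)) mismatches++;
            if (upper != Character.toUpperCase(c)) mismatches++;
        }
        return mismatches;
    }
}
```

A result of zero covers the ASCII branch only; characters at or above 0x80 need no separate comparison here because, per the description, they still go through the same JDK calls as before.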

JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks x 3 warmup x 5 measurement = 25 samples, paired sequentially in one JMH session on top of a corresponding htmlunit-side change that defers the HtmlPage id/name index build):

before: 593.611 +- 13.110 us/op
after: 564.642 +- 11.833 us/op
delta: -29.0 us, -4.9%
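
The delta line follows directly from the two means. As a quick arithmetic check (a sketch with hypothetical names, and treating the +- values as symmetric error bounds, which is an assumption about how they were reported):

```java
// Hypothetical helper for re-deriving the reported delta; not part of the benchmark code.
final class BenchmarkDelta {

    /** Absolute change in us/op (negative means faster). */
    static double deltaMicros(double before, double after) {
        return after - before;
    }

    /** Relative change as a percentage of the "before" mean. */
    static double deltaPercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** True if [m1 - e1, m1 + e1] and [m2 - e2, m2 + e2] share any point. */
    static boolean intervalsOverlap(double m1, double e1, double m2, double e2) {
        return Math.abs(m1 - m2) <= e1 + e2;
    }
}
```

With the numbers above, `deltaMicros(593.611, 564.642)` is about -28.97 us and `deltaPercent` about -4.88%, matching the rounded -29.0 us and -4.9%. The two intervals do not overlap, since the means differ by about 28.97 us while the error bounds sum to 24.943 us.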

Tests: full htmlunit-neko suite (mvn -Dgpg.skip test) -- 8407 tests, 0 failures, 0 errors. CanonicalSAXTest / CanonicalTest / CanonicalDomFragmentTest (1156 cases each, the scanner-correctness suites) all pass.


sonarqubecloud · 2026-04-27T13:45:30Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

rbri · 2026-04-27T14:43:21Z

Great, many thanks....

rbri · 2026-04-27T14:46:47Z

5.0.0-SNAPSHOT updated

rbri merged commit aae3eb0 into HtmlUnit:master Apr 27, 2026
8 checks passed

rbri merged 1 commit into HtmlUnit:master from shapiroronny:perf/scanname-ascii-fastpath
