Fix GH-21734: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes by iliaal · Pull Request #21735 · php/php-src

iliaal · 2026-04-12T10:44:54Z

lxb_encoding_decode_valid_utf_8_single() had no UTF-8 validation: no continuation byte range checks, no overlong sequence rejection, no surrogate rejection. It was written to assume the caller already verified the input. The URL parser calls it on untrusted user input at 7 sites, and the IDNA code calls it on percent-decoded hostname bytes at 2 more.

An attacker feeding overlong ASCII characters into a hostname could get them through IDNA processing as their target codepoints, producing valid domains from byte sequences that look nothing like the canonical form. %C1%A5%C1%B6%C1%A9%C1%AC.com resolved to evil.com. Chrome, Firefox, and Safari reject overlong sequences at the UTF-8 decode step.
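To make the example concrete, here is a minimal sketch of a 2-byte decode that combines the payload bits without any overlong check. The helper names are hypothetical and this is not lexbor's code; it only illustrates the bit arithmetic that turns the percent-decoded bytes C1 A5 C1 B6 C1 A9 C1 AC into the ASCII letters of "evil".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unvalidated 2-byte decode: merges the 5 payload bits of the
 * lead byte with the 6 payload bits of the second byte. It never asks
 * whether the result actually needed two bytes. */
static uint32_t naive_decode_2byte(const uint8_t *p)
{
    return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
}

/* Decodes n_pairs consecutive 2-byte sequences into a NUL-terminated
 * string. `out` must have room for n_pairs + 1 chars. */
static void naive_decode_pairs(const uint8_t *in, size_t n_pairs, char *out)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n_pairs; i++) {
        out[i] = (char)naive_decode_2byte(in + 2 * i);
    }
    out[n_pairs] = '\0';
}
```

For instance, 0xC1 0xA5 gives (0x01 << 6) | 0x25 = 0x65, which is 'e'. A strict decoder rejects 0xC1 as a lead byte because every value it can encode fits in a single byte.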

Added the missing validation to decode_valid_utf_8_single:

- 2-byte: reject lead bytes < 0xC2 (overlong), validate the continuation byte range
- 3-byte: validate continuations, reject 0xE0 followed by a byte < 0xA0 (overlong), reject 0xED followed by a byte > 0x9F (surrogates)
- 4-byte: reject lead bytes > 0xF4, validate continuations, reject 0xF0 followed by a byte < 0x90 (overlong), reject 0xF4 followed by a byte > 0x8F (> U+10FFFF)
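The rules above can be sketched as a standalone decoder. This is an illustration under assumptions, not the patch itself: the function name, the `UTF8_INVALID` marker, and the signature are hypothetical and do not match lexbor's API. It narrows the allowed range of the second byte depending on the lead byte, which covers the overlong, surrogate, and > U+10FFFF cases in one comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFFFFFFu /* hypothetical error marker */

static int in_range(uint8_t b, uint8_t lo, uint8_t hi)
{
    return b >= lo && b <= hi;
}

/* Decodes one code point from [p, end). On success stores the sequence
 * length in *len and returns the code point; otherwise returns
 * UTF8_INVALID and leaves *len untouched. */
static uint32_t utf8_decode_checked(const uint8_t *p, const uint8_t *end,
                                    size_t *len)
{
    assert(p < end);

    size_t avail = (size_t)(end - p);
    uint8_t b0 = p[0];
    uint8_t lo = 0x80, hi = 0xBF; /* default range for the second byte */

    if (b0 < 0x80) {
        *len = 1;
        return b0;
    }
    if (b0 >= 0xC2 && b0 <= 0xDF) { /* 0xC0 and 0xC1 are always overlong */
        if (avail < 2 || !in_range(p[1], lo, hi)) return UTF8_INVALID;
        *len = 2;
        return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    }
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (b0 == 0xE0) lo = 0xA0; /* below 0xA0 is overlong */
        if (b0 == 0xED) hi = 0x9F; /* above 0x9F is a surrogate */
        if (avail < 3 || !in_range(p[1], lo, hi)
            || !in_range(p[2], 0x80, 0xBF)) return UTF8_INVALID;
        *len = 3;
        return ((uint32_t)(b0 & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6)
             | (uint32_t)(p[2] & 0x3F);
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) { /* leads above 0xF4 exceed U+10FFFF */
        if (b0 == 0xF0) lo = 0x90; /* below 0x90 is overlong */
        if (b0 == 0xF4) hi = 0x8F; /* above 0x8F exceeds U+10FFFF */
        if (avail < 4 || !in_range(p[1], lo, hi)
            || !in_range(p[2], 0x80, 0xBF)
            || !in_range(p[3], 0x80, 0xBF)) return UTF8_INVALID;
        *len = 4;
        return ((uint32_t)(b0 & 0x07) << 18)
             | ((uint32_t)(p[1] & 0x3F) << 12)
             | ((uint32_t)(p[2] & 0x3F) << 6)
             | (uint32_t)(p[3] & 0x3F);
    }
    return UTF8_INVALID; /* stray continuation, 0xC0, 0xC1, or 0xF5..0xFF */
}
```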

On error, the decoder advances by 1 byte (not the full sequence length) so the next byte gets its own decode attempt, matching browser behavior.
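A rough sketch of what the one-byte advance buys, assuming a hypothetical validity helper rather than the real lexbor function: for the bytes E2 28 A1, skipping the full three-byte length that the lead byte announces would swallow the ASCII "(" at 0x28, while advancing one byte lets it decode on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting
 * at p, or 0 if the bytes at p do not form one. */
static size_t utf8_valid_len(const uint8_t *p, const uint8_t *end)
{
    size_t avail = (size_t)(end - p), need;
    uint8_t b0 = p[0], lo = 0x80, hi = 0xBF;

    if (b0 < 0x80) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 3;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 4;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else {
        return 0;
    }
    if (avail < need || p[1] < lo || p[1] > hi) return 0;
    for (size_t i = 2; i < need; i++) {
        if (p[i] < 0x80 || p[i] > 0xBF) return 0;
    }
    return need;
}

/* Walks the buffer, counting well-formed sequences in *ok and rejected
 * bytes in *bad. After a rejection it advances by exactly one byte so the
 * following byte gets its own decode attempt. */
static void count_with_resync(const uint8_t *in, size_t len,
                              size_t *ok, size_t *bad)
{
    assert(in != NULL || len == 0);

    const uint8_t *p = in, *end = in + len;
    *ok = 0;
    *bad = 0;

    while (p < end) {
        size_t n = utf8_valid_len(p, end);
        if (n == 0) {
            (*bad)++;
            p += 1; /* resynchronize on the very next byte */
        } else {
            (*ok)++;
            p += n;
        }
    }
}
```

With this loop, E2 28 A1 yields one good sequence (the "(") and two rejected bytes, and the eight bytes of the overlong "evil" example are all rejected individually.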

… continuation bytes

lxb_encoding_decode_valid_utf_8_single() skipped all UTF-8 validation (continuation byte range, overlong sequences, surrogates), trusting the caller to pass valid input. The URL parser calls it on untrusted user input at 7 sites, and the IDNA code calls it on percent-decoded hostname bytes at 2 more.

Overlong ASCII characters in hostnames passed through IDNA processing as their target codepoints, producing valid domains from byte sequences that look nothing like the canonical form (e.g., %C1%A5%C1%B6... → "evil.com"). Chrome, Firefox, and Safari reject these at the UTF-8 decode step.

Add the missing validation to decode_valid_utf_8_single:

- 2-byte: reject lead bytes < 0xC2 (overlong), validate continuation
- 3-byte: validate continuations, reject 0xE0 + < 0xA0 (overlong), reject 0xED + > 0x9F (surrogates)
- 4-byte: reject lead > 0xF4, validate continuations, reject 0xF0 + < 0x90 (overlong), reject 0xF4 + > 0x8F (> U+10FFFF)

On error, advance by 1 byte (not the full sequence length) so the next byte gets its own decode attempt, matching browser behavior.

Closes phpGH-21734

TimWolla · 2026-04-12T11:40:09Z

This needs to be fixed in the upstream library at https://github.com/lexbor/lexbor.

iliaal requested review from TimWolla, kocsismate and ndossche as code owners April 12, 2026 10:44

iliaal mentioned this pull request Apr 12, 2026

Fix GH-21734: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes iliaal/php-src#35

Closed

github-actions bot added the Extension: uri label Apr 12, 2026

TimWolla closed this Apr 12, 2026

iliaal wants to merge 1 commit into php:master from iliaal:fix/gh-21734-lexbor-utf8-validation