Add XML entity format parsing support to EntityResolver. by jordikroon · Pull Request #21 · php/docbook-cs

jordikroon · 2026-06-29T19:06:26Z

jordikroon · 2026-06-29T19:12:30Z

@alfsb I’ve attempted to implement XML entities. Could you take a look before this is merged and let me know if you have any concerns or if I missed anything? I know it's kinda unknown terrain for you.

We are using regexes to extract the entities (DTD/XML) because on a mass scale like this it's just way faster to do so.
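For readers following along, here is a minimal sketch of what regex-based extraction could look like. It is not the code from this PR: the function names are hypothetical, and the `<entity name="...">` attribute shape is an assumption, since the thread only states that the XML format has an `<entities>` root with `<entity>` children.

```php
<?php
// Hypothetical helpers for illustration only; not the EntityResolver code from this PR.

/**
 * Collect internal general entities declared as <!ENTITY name "value">.
 * Parameter entities (<!ENTITY % ...>) and SYSTEM entities are not matched.
 */
function extractDtdEntities(string $source): array
{
    $entities = [];
    $pattern = '~<!ENTITY\s+([A-Za-z_][\w.-]*)\s+(["\'])(.*?)\2\s*>~s';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $entities[$m[1]] = $m[3]; // name => replacement text
    }
    return $entities;
}

/**
 * Collect entities from the XML format.
 * ASSUMPTION: each entity looks like <entity name="foo">body</entity>.
 */
function extractXmlEntities(string $source): array
{
    $entities = [];
    $pattern = '~<entity\s+name=(["\'])([^"\']+)\1\s*>(.*?)</entity>~s';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $entities[$m[2]] = $m[3]; // name => element body, markup included
    }
    return $entities;
}

$dtd = '<!ENTITY foo "Foo text"><!ENTITY bar \'<b>Bar</b>\'>';
$xml = '<entities><entity name="foo">Foo <b>text</b></entity>'
     . '<entity name="bar">Bar</entity></entities>';

var_dump(extractDtdEntities($dtd));
var_dump(extractXmlEntities($xml));
```

The lazy `(.*?)` in the second pattern stops at the first `</entity>`, so this sketch only holds as long as entity bodies never contain a nested `<entity>` element.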

alfsb · 2026-06-29T23:35:14Z

> @alfsb I’ve attempted to implement XML entities. Could you take a look before this is merged and let me know if you have any concerns or if I missed anything? I know it's kinda unknown terrain for you.

I think this is ok. The XML format for grouped XML entities is very simple: only two levels, with <entities> as the root element and zero or more <entity> elements as its children.

> We are using regexes to extract the entities (DTD/XML) because on a mass scale like this it's just way faster to do so.

The trick I used here is probably as fast or faster, and has the singular advantage of not using any regex.

It gets you fully valid XML of any depth, with all the bells and whistles such as XML comments, XML PIs, attributes and so on. The only thing it sidesteps is DTD entities, which are re-encoded as normal text and yet can still be searched and replaced without any ambiguity.

Note lines 76-80. There, I'm using a native DOMDocument. On line 81 it becomes text again, but that is not the main point. The interesting part is that the fragment (which is not valid XML by itself, even less so with undefined DTD entities) got loaded, and can be navigated, copied and operated on. Any DTD entity, undefined or not, is re-encoded from &ent; to &amp;ent;, so it's possible to search the text nodes for the pattern &*; to find any original entity, and then replace it accordingly. I think this will suffice for your use case.
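The idea described above can be sketched roughly as follows. This is not the linked code (which is not reproduced in this thread): the function names and the synthetic `<fragment>` wrapper are assumptions for illustration. The lookup step uses a regex only for brevity, so it does not show the regex-free property mentioned earlier.

```php
<?php
// Hypothetical sketch of the "re-encode, then load" trick; not the linked implementation.

function loadFragmentKeepingEntities(string $fragment): DOMDocument
{
    // Re-encode every "&" so that "&ent;" becomes "&amp;ent;", which the
    // parser reads back as the literal text "&ent;" instead of an entity.
    $escaped = str_replace('&', '&amp;', $fragment);

    $doc = new DOMDocument();
    // Wrap in a synthetic root so a fragment with several top-level nodes
    // is still well-formed. ASSUMPTION: the fragment has no XML declaration.
    if (!$doc->loadXML('<fragment>' . $escaped . '</fragment>')) {
        throw new RuntimeException('Fragment could not be loaded.');
    }
    return $doc;
}

/** Return the distinct entity names found in text nodes, in document order. */
function findEntityNames(DOMDocument $doc): array
{
    $names = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $text) {
        // After parsing, text nodes contain the original "&name;" spelling.
        if (preg_match_all('/&([A-Za-z_][A-Za-z0-9._-]*);/', $text->nodeValue, $m)) {
            foreach ($m[1] as $name) {
                $names[$name] = true;
            }
        }
    }
    return array_keys($names);
}

$doc = loadFragmentKeepingEntities(
    '<para>See &link.foo; and <b>&bar;</b>.</para><!-- a comment --><para/>'
);
var_dump(findEntityNames($doc));
```

Because every `&` is escaped, predefined entities such as `&lt;` and numeric character references are re-encoded as well, and entity references inside attribute values are not visited by the text-node search; a real implementation would need to handle both cases.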

In fact, I'm working on a parser that will load any XML, file or fragment, accept undefined DTD entities, and keep them as XML_ENTITY_REF_NODE nodes, all in native libxml... This is a side project, but I think you can probably appreciate what I'm trying to do with all this, and after all DTD entities are migrated to XML entities.

In a not so distant future, it will be possible to fully expand all entities of a file or fragment, in userland, deterministically, and without loading from the root. :)

alfsb

The code mentioned above will work the same whether the text is loaded from the body of an <!ENTITY declaration or from <entity> elements in XML Entity .ent files.

But the regex may very well work, as I do not know of any DocBook element named <entity>.

add XML entity format parsing support

290ee34

jordikroon requested a review from alfsb June 29, 2026 19:06

jordikroon changed the title ~~add XML entity format parsing support~~ Add XML entity format parsing support to EntityResolver. Jun 29, 2026

alfsb approved these changes Jun 29, 2026

jordikroon wants to merge 1 commit into php:main from jordikroon:xml-entities