Add XML entity format parsing support to EntityResolver. #21

Open
jordikroon wants to merge 1 commit into php:main from jordikroon:xml-entities

@jordikroon

Member

In order to support php/doc-en#5634

@jordikroon jordikroon requested a review from alfsb June 29, 2026 19:06
@jordikroon

Member Author

@alfsb I’ve attempted to implement XML entities. Could you take a look before this is merged and let me know if you have any concerns or if I missed anything? I know it's kinda unknown terrain for you.

We are using regexes to extract the entities (DTD/XML) because on a mass scale like this it's just way faster to do so.

@jordikroon jordikroon changed the title from "add XML entity format parsing support" to "Add XML entity format parsing support to EntityResolver." Jun 29, 2026
@alfsb

alfsb commented Jun 29, 2026

Member

> @alfsb I’ve attempted to implement XML entities. Could you take a look before this is merged and let me know if you have any concerns or if I missed anything? I know it's kinda unknown terrain for you.

I think this is OK. The XML format for grouped XML entities is very simple. It has only two main levels: <entities> as the root element, and zero or more <entity> elements as its children.
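To make the two-level layout concrete, here is a minimal sketch of regex-based extraction from such a file. It is written in Python only for illustration and is not the code from this PR. The `name` attribute on `<entity>` and the helper `extract_entities` are assumptions, not something confirmed in this thread.

```python
import re

# Assumption: each <entity> carries its identifier in a name="..." attribute.
ENTITY_RE = re.compile(r'<entity\s+name="([^"]+)"\s*>(.*?)</entity>', re.DOTALL)


def extract_entities(source):
    """Return a {name: raw body} mapping for every <entity> found in source.

    The body is kept as raw text, so it may contain markup and
    entity references that are not defined anywhere.
    """
    return {name: body for name, body in ENTITY_RE.findall(source)}


sample = """<entities>
<entity name="a">Hello <b>&x;</b></entity>
<entity name="b">two
lines</entity>
</entities>"""

print(extract_entities(sample))
```

A regex like this never has to resolve `&x;`, which is one reason it can work on files that a strict XML parser would reject.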

> We are using regexes to extract the entities (DTD/XML) because on a mass scale like this it's just way faster to do so.

The trick I used here is probably as fast or faster, and has the singular advantage of not using any regex.

It gets you fully valid XML, of any depth, with all the bells and whistles like XML comments, XML PIs, attributes and so on. The only thing it sidesteps is DTD entities, which are re-encoded as normal text and yet can be searched and replaced without any ambiguity.

Note lines 76-80. There, I'm using a native DOMDocument. On line 81 it becomes text again, but that is not the main point. The interesting part is that the fragment (which is not valid XML by itself, even less so with undefined DTD entities) gets loaded and can be navigated, copied, and operated on. Any DTD entity, undefined or not, is re-encoded from &ent; to &amp;ent;, so it's possible to search the text nodes for the pattern &amp;*; to find any original entity, and then replace it accordingly. I think this will suffice for your use case.
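The re-encoding idea can be sketched roughly as follows. This is a Python approximation using `xml.dom.minidom`, not the PHP DOMDocument code referenced above. The `<fragment>` wrapper element and all function names are hypothetical. For brevity the escaping step here does use a small regex, unlike the referenced code, and it also re-encodes the five predefined entities such as `&amp;`.

```python
import re
from xml.dom import minidom

# Matches named references such as &ent; but not numeric ones such as &#38;.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def load_fragment(fragment):
    """Parse a fragment that may contain undefined entity references."""
    # Re-encode &ent; as &amp;ent; so the parser sees plain text.
    escaped = ENTITY_REF.sub(r"&amp;\1;", fragment)
    # Hypothetical wrapper root, so multi-root fragments are well-formed.
    return minidom.parseString("<fragment>" + escaped + "</fragment>")


def iter_text_nodes(node):
    """Yield every text node below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from iter_text_nodes(child)


def find_entity_names(doc):
    """List the names of all re-encoded references found in text nodes."""
    names = []
    for text in iter_text_nodes(doc.documentElement):
        names.extend(ENTITY_REF.findall(text.data))
    return names


def replace_entities(doc, values):
    """Replace known references in text nodes; leave unknown ones as-is."""
    for text in iter_text_nodes(doc.documentElement):
        text.data = ENTITY_REF.sub(
            lambda m: values.get(m.group(1), m.group(0)), text.data
        )


doc = load_fragment("<para>See &link.foo; and <b>&bar;</b></para><note>tail</note>")
print(find_entity_names(doc))
replace_entities(doc, {"bar": "BAR"})
```

This sketch only inspects text nodes, so references inside attribute values or CDATA sections would need separate handling.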

In fact, I'm working on a parser that will load any XML, file or fragment, accept undefined DTD entities, and keep them as XML_ENTITY_REF_NODE nodes, all in native libxml... This is a side project, but I think you can probably appreciate what I'm trying to do with all this, once all DTD entities are migrated to XML entities.

In a not so distant future, it will be possible to fully expand all entities of a file or fragment, in userland, deterministically, and without loading from the root. :)

@alfsb alfsb left a comment

The code mentioned above will work the same whether the text is loaded from the body of an <!ENTITY declaration or from <entity> elements in XML Entity .ent files.

But the regex may very well work, as I do not know of any DocBook element named <entity>.
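For comparison with the `<entity>` case, a regex for the DTD side might look like the sketch below. It is a Python illustration under stated assumptions, not the pattern used in this PR. It only handles internal general entities of the form `<!ENTITY name "value">` and deliberately skips `SYSTEM` and parameter entities. The helper name `extract_dtd_entities` is hypothetical.

```python
import re

# Internal general entities only: <!ENTITY name "value"> or <!ENTITY name 'value'>.
DTD_ENTITY_RE = re.compile(
    r"""<!ENTITY\s+([\w.-]+)\s+(["'])(.*?)\2\s*>""", re.DOTALL
)


def extract_dtd_entities(source):
    """Return a {name: raw value} mapping for each internal entity declaration."""
    return {m.group(1): m.group(3) for m in DTD_ENTITY_RE.finditer(source)}


dtd = (
    '<!ENTITY foo.bar "some <b>text</b>">\n'
    "<!ENTITY baz 'single \"quoted\"'>\n"
    '<!ENTITY ext SYSTEM "file.xml">'
)

print(extract_dtd_entities(dtd))
```

Either source yields the same kind of name-to-text mapping, so later processing does not need to care where the text came from.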
