Add LIBXML_NOBLANKS to dom_load() XML parse options #304
As commented on Discord, I know little to nothing about PhD in this respect. Why PhD? Because some Docbook parts specify that whitespace should be preserved in the rendering process, so some whitespace may end up being relevant in PhD, or in the final output. If the goal is better performance and lower memory use, removing XML comments may have a similar effect, and Docbook does not specify any meaning for them.
Philip O gives an example where this causes a problem. The rendering of The problem is more general. Whitespace in Docbook is complicated. In some elements it is completely irrelevant, and could be trimmed, but there are some other contexts, where whitespace should be coalesced (like HTML), and other contexts where it should be fully preserved. There is a hint in old libxml docs:
So... it may be possible to change libxml to the "correct" Docbook behaviour if it is called in validating mode.
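The general problem described above can be reproduced with a small sketch. The fragment below is a hypothetical DocBook-like input (not the example Philip gave), and it assumes PHP's DOM extension with no DTD loaded. As far as I understand it, without a DTD libxml2 has to guess which whitespace-only nodes are ignorable, so the space between two inline elements is a likely casualty.

```php
<?php
// Hypothetical DocBook-like fragment; the element names are for illustration only.
$xml = '<para><emphasis>a</emphasis> <emphasis>b</emphasis></para>';

// Default parse: the separating space is kept as a text node.
$default = new DOMDocument();
$default->loadXML($xml);

// Same input with LIBXML_NOBLANKS: whitespace-only text nodes may be dropped.
$noblanks = new DOMDocument();
$noblanks->loadXML($xml, LIBXML_NOBLANKS);

// Compare the text that a renderer would see in each case.
var_dump($default->documentElement->textContent);
var_dump($noblanks->documentElement->textContent);
```

If the second dump shows the two words run together, that is the kind of rendering change discussed in this thread.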
After some reading, I think that
so The last time Docbook published a DTD schema was on version 5.0.1 AFAIK. And I do not know a way to convert a RNG into a DTD, even imperfectly. And obviously, enabling Maybe it is possible to just fake it. To bruteforce a "one rule per element" DTD schema that contains all pairwise parent vs child-element-or-text pairs. The theory is that libxml only needs to know if a particular element may contain
alfsb left a comment
This cannot be merged as is. As commented above, this changes the rendering of the manual in broken ways.
It may be possible, though. It is only a conjecture, but creating a simplified DTD from the RNGs (preferably) or from a validated manual (for experimental use only), with the format:
<!ELEMENT parent-element (child1|child2|child3)*>
<!ELEMENT parent-element (#PCDATA|child1|child2)*>
and enabling it on doc-base/manual.xml may cause libxml2 to trim spaces only where doing so does not cause rendering changes.
Whether a rule includes #PCDATA should be decided by checking whether any possible direct child of the element is an expected text node (from the RNG) or a text node not composed entirely of whitespace (from the built XML).
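A minimal sketch of the "from a validated manual" variant of this idea, for experimentation only. The function name `sketch_dtd_from_document` is hypothetical and not part of doc-base or PhD. It assumes element names can be used as-is (namespaces are ignored, since DTDs are not namespace-aware), and it emits `EMPTY` for elements seen with no children at all, which is an addition to the two rule shapes above. Whether libxml2 then trims whitespace as hoped remains the untested conjecture.

```php
<?php
// Hypothetical helper: derive one <!ELEMENT> rule per element name
// from the parent/child pairs observed in an already-built document.
function sketch_dtd_from_document(DOMDocument $doc): string
{
    $children = []; // element name => set of child element names seen
    $mixed = [];    // element name => true if non-whitespace text was seen

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*') as $element) {
        $name = $element->nodeName;
        $children[$name] ??= [];
        $mixed[$name] ??= false;
        foreach ($element->childNodes as $child) {
            if ($child instanceof DOMElement) {
                $children[$name][$child->nodeName] = true;
            } elseif ($child instanceof DOMText && trim($child->data) !== '') {
                // DOMText also covers CDATA sections, which count as text here.
                $mixed[$name] = true;
            }
        }
    }

    ksort($children);
    $rules = [];
    foreach ($children as $name => $set) {
        $names = array_keys($set);
        sort($names);
        if ($mixed[$name]) {
            $model = '(' . implode('|', array_merge(['#PCDATA'], $names)) . ')*';
        } elseif ($names !== []) {
            $model = '(' . implode('|', $names) . ')*';
        } else {
            $model = 'EMPTY'; // assumption: never seen with any content
        }
        $rules[] = "<!ELEMENT $name $model>";
    }
    return implode("\n", $rules) . "\n";
}

// Usage on a tiny hypothetical document.
$doc = new DOMDocument();
$doc->loadXML("<chapter>\n<para>Hello <emphasis>x</emphasis></para>\n</chapter>");
echo sketch_dtd_from_document($doc);
```

A rule set built this way only reflects what the sampled document happens to contain, so an element that legitimately allows text but never carries any in the sample would be misclassified. That is the reason the RNG-based variant is preferable.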
Drops whitespace-only text nodes, which shouldn't be harmful. The result is a smaller DOM that is faster to traverse and requires less memory.
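As a rough illustration of the effect on DOM size, here is a sketch that uses plain `DOMDocument::loadXML()` rather than doc-base's `dom_load()`, whose internals are not shown here. The helper `count_children` and the sample input are hypothetical.

```php
<?php
// Hypothetical indented input: the only text between the tags is indentation.
$xml = "<root>\n  <a/>\n  <b/>\n</root>";

// Hypothetical helper: parse with the given libxml options and
// report how many direct child nodes the root element ends up with.
function count_children(string $xml, int $options = 0): int
{
    $doc = new DOMDocument();
    $doc->loadXML($xml, $options);
    return $doc->documentElement->childNodes->length;
}

$plain = count_children($xml);                      // indentation kept as text nodes
$trimmed = count_children($xml, LIBXML_NOBLANKS);   // whitespace-only nodes dropped

printf("default: %d child nodes, LIBXML_NOBLANKS: %d child nodes\n", $plain, $trimmed);
```

On a heavily indented source tree the same difference applies at every nesting level, which is where the traversal and memory savings would come from.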
Benchmarked
configure.php --with-lang=en, 5 runs, PHP 8.5:
Benchmark:
Raw files:
Before patch (baseline)
After patch