Fixing Errant HTML Content
I did try this function before writing my original post, and found that its algorithm basically closes any unclosed tags at the end. In my case, I have an unclosed div right in the middle of the HTML. The semantic effect of this unclosed div is that the main section of the article ends up incorrectly enclosed within an aside block. This is how DOMDocument perceives the flaw, and it is also how clean_html resolves the unclosed tag. (My logic already makes heavy use of DOMDocument.) Since I prune off all aside blocks, this malformed HTML causes the article content (in this case) to be tossed along with the asides.
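For anyone who wants to catch this before pruning, here is a rough sketch of a check. It assumes the main content sits in an article element, which may not match your markup, and the function name countArticlesInsideAside is just something I made up for illustration. It parses the HTML with DOMDocument and counts article elements that have an aside ancestor, since those would be thrown away along with the asides.

```php
<?php
// Hypothetical helper: counts <article> elements that DOMDocument
// places inside an <aside>. Assumes the main content uses <article>.
function countArticlesInsideAside(string $html): int
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Any <article> with an <aside> ancestor would be lost when asides are pruned.
    return $xpath->query('//aside//article')->length;
}

// Made-up sample in which the <div> inside the <aside> is never closed.
$sample = '<html><body><aside><div>Related links</aside>'
        . '<article><p>Main story text.</p></article></body></html>';

if (countArticlesInsideAside($sample) > 0) {
    echo "Warning: article content is nested inside an aside\n";
} else {
    echo "Article content is outside all aside blocks\n";
}
```

Which branch runs for the malformed sample depends on how the underlying libxml2 parser recovers, so treat this as a diagnostic rather than a fix. A nonzero count would be the signal to skip pruning or repair the markup first.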
I can write a routine to fix the HTML as needed, but that seems like reinventing the wheel. Ergo my question about being able to install the Tidy extension to PHP. Tidy claims to have this problem solved.
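If the Tidy extension does become available, something along these lines is what I have in mind. This is a minimal sketch, not tested against my real pages: repairHtml is a hypothetical wrapper name, and the config values are my guesses at sensible options. It falls back to the original HTML when Tidy is not installed.

```php
<?php
// Hypothetical wrapper: repairs markup with Tidy when the extension
// is loaded, otherwise returns the input unchanged.
function repairHtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        // Tidy is not installed on this host; nothing to do.
        return $html;
    }

    // Assumed settings: emit HTML rather than XHTML, and do not wrap lines.
    $config = [
        'output-html' => true,
        'wrap'        => 0,
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    // tidy_repair_string() returns false on failure; keep the original then.
    return $repaired === false ? $html : $repaired;
}

$fixed = repairHtml('<aside><div>Related links</aside><article><p>Main story text.</p></article>');
```

Whether Tidy closes the stray div in the place I actually need is something I would still have to verify on the real markup before relying on it.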
On another note, I will absolutely switch to
since I have experienced the ill effects of being viewed as a mere bot. So that promises to help substantially.
Thanks yet again to you and Michael for the great info.