Fixing Errant HTML Content
Jamroom Developers
Yeah, just use jrCore_clean_html(). It will fix up errant HTML for the most part, e.g.:
$corrected_html = jrCore_clean_html($malformed_html);
It uses DOMDocument (https://www.php.net/DOMDocument) to basically "rewrite" the HTML, which ensures every open tag gets closed.
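To give a rough idea of what "rewriting" through DOMDocument looks like, here is a minimal sketch in plain PHP. It is not the actual jrCore_clean_html() code; the function name sketch_close_open_tags() is made up for illustration, and it assumes the input is a UTF-8 HTML fragment. The parser builds a tree from the broken markup, and serializing that tree back out is what produces the closing tags.

```php
<?php
// Hypothetical helper for illustration only; this is NOT Jamroom's jrCore_clean_html().
// Assumption: $html is a UTF-8 encoded HTML fragment.
function sketch_close_open_tags(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Malformed markup makes libxml raise warnings; collect them quietly instead.
    $previous = libxml_use_internal_errors(true);

    // The XML declaration prefix is a common trick to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // loadHTML() wraps a fragment in <html><body>; return only what ended up inside <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        // Serializing each node writes out a matching close tag for every element.
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the unclosed <b> and <p> come back closed.
echo sketch_close_open_tags('<p>Hello <b>world'), PHP_EOL;
```

The real function may well do more than this (filtering tags, handling encodings and so on), so treat the above only as a picture of the general technique.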
If you are loading offsite URLs and need to get the HTML, I would highly recommend not using cURL directly yourself. Instead, use the jrUrlScan_get_url_with_wget() function provided by the Media URL Scanner module, which does a lot of extra work to look like a "real" browser to the remote site. With plain cURL you will find a lot of sites that will not load properly and/or will just "hang". These sites use 3rd party CDNs and accelerators like Cloudflare that see the cURL call as a "bot" and reject it. The jrUrlScan_get_url_with_wget() function is set up to work like a "real" browser: it fully accepts cookies (including session cookies), uses a real user agent, accepts gzipped content, etc. All the headers it sends out masquerade as a real browser, so you will get the HTML. So I would do this:
if ($possibly_bad_html = jrUrlScan_get_url_with_wget('https://someurl.com')) {
    $cleaned_up_html = jrCore_clean_html($possibly_bad_html);
}
Let me know if that helps.