The Difficulty of Parsing the Web

By Glendon Solsberry on 25 Jan 2013

I read an entry on TheDailyWTF.com that made me pause. It was a screenshot of a Google search result for "Doomworld", showing that Google had mangled the Homepage and Filesize fields in the output. See the image here:

![Image showing mangled results from Google Search](https://img.thedailywtf.com/images/13/q1/e48/Pic-5.png "Interesting Notion")

Knowing that Google does all of this stuff via algorithm, I was curious to see if there was anything strange on the source page. So I visited the Editors page on Doomworld.com, and viewed the source. As a long-time web developer, this page scares me.

It has mixed upper- and lowercase tag names, attributes that have been deprecated since HTML 4.01 in December 1999, image shims, and more. There are things like missing quotes on attribute values:

    <FORM><SELECT language=JavaScript name=SiteSelector onchange=location.href=this.options[this.selectedIndex].value style="font-size : xx-small">>
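
To get a feel for what a forgiving parser has to guess at here, this is a minimal sketch that feeds that exact line to Python's standard-library `html.parser`. It is only an illustration of lenient parsing, not a claim about how Google reads pages; the names `StartTagCollector`, `SNIPPET`, and `collect_start_tags` are my own.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records each start tag and its attributes as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.tags.append((tag, dict(attrs)))


# The line quoted above, unquoted attribute values and trailing ">" included.
SNIPPET = ('<FORM><SELECT language=JavaScript name=SiteSelector '
           'onchange=location.href=this.options[this.selectedIndex].value '
           'style="font-size : xx-small">>')


def collect_start_tags(markup):
    """Return a list of (tag, attributes) pairs found in the markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    for tag, attrs in collect_start_tags(SNIPPET):
        print(tag, attrs)
```

The parser copes by treating everything up to the next whitespace or `>` as an unquoted value, so the whole `onchange` expression survives as one string. That is a guess that happens to work here; an unquoted value containing a space would be split into separate attributes.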

I'm not trying to pick on Doomworld.com. They just happened to be what showed up on TheDailyWTF.com, and piqued my interest. I ran this page through the W3C Markup Validation Service and came out with 1,340 errors. Many of these were for invalid attribute values. Lots of them were for missing closing tags, or for closing tags without opening tags.
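
The last two kinds of error, missing closing tags and closing tags without opening tags, can be spotted with a simple stack. The sketch below is a rough approximation built on Python's `html.parser`, not what the W3C validator actually does: it ignores the HTML rules that make some end tags optional (such as `</p>` or `</td>`), and the sample markup is made up for illustration rather than taken from Doomworld.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of tags still waiting for a closer
        self.unclosed = []       # tags abandoned when an outer tag closed
        self.stray_closers = []  # closing tags with no matching opener

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.open_tags:
            self.stray_closers.append(tag)
            return
        # Pop back to the matching opener; anything above it was never closed.
        while True:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def check_balance(markup):
    """Return (unclosed_tags, stray_closing_tags) for the given markup."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Whatever is still on the stack at the end was never closed either.
    return checker.unclosed + checker.open_tags, checker.stray_closers


if __name__ == "__main__":
    sample = "<DIV><FONT>Editors</B></DIV><SPAN>Done"
    print(check_balance(sample))
```

For the sample string, this reports `font` and `span` as unclosed and `b` as a stray closer. Even this toy version has to decide what a mismatched closer "meant", which is exactly the kind of guess that can go differently from one parser to the next.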

As a long-time programmer, I still can't wrap my head around parsing HTML that isn't "to spec". Even parsing HTML that is "to spec" is hard. It's no wonder to me that Google's algorithms and parsers may have screwed this up. I've built hundreds of web pages, both static and dynamic. I've built them using CMSs, FrontPage, Dreamweaver, etc. But the final step for me is always to clean up the page(s) before I post them.

dp.cx blog