Web Mining Pitfalls

Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.

Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site’s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.

Always Have a Plan B, C, D, …

One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery [...]