Web Mining Pitfalls

Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer is intimately familiar with the kind of data the code will encounter and can perform checks and catch exceptions based on the format of that data.

Web mining requires a lot more sophistication. In many cases the programmer does not know the full format of the data published on a web site, and that format may change over time. Standards do exist for certain types of data on the web, but one cannot rely on web developers to follow them. For example, the RSS Advisory Board developed an autodiscovery convention: a page includes <link> tags in its header so that browsers (and crawlers) can automatically find the site's RSS feeds. In my research, approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.
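To make the convention concrete, here is a rough sketch of extracting autodiscovery links (illustrative only, not the exact code from my project; the regular expressions and function name are just examples):

import re
import urllib2

# Find <link> tags that advertise RSS/Atom feeds, then pull out their hrefs.
AUTODISCOVERY_RE = re.compile(
    r'<link[^>]+type=["\']application/(?:rss|atom)\+xml["\'][^>]*>', re.IGNORECASE)
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def discover_feeds(url):
    """Return the feed URLs advertised in a page's autodiscovery <link> tags."""
    html = urllib2.urlopen(url).read()
    feeds = []
    for tag in AUTODISCOVERY_RE.findall(html):
        match = HREF_RE.search(tag)
        if match:
            feeds.append(match.group(1))   # note: this may be a relative URL
    return feeds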

Always Have a Plan B, C, D, …

One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites on my list. I am also always leery of bias: could there be something special about the sites that do not implement RSS autodiscovery? Clearly, there are exceptions to my Plan A, so it was time to move to Plan B. I found that some of the sites in the remaining 5% used FeedBurner to index their feeds, so Plan B was to use a regular expression to extract FeedBurner URLs. This added another 1% (actually a bit less) to my coverage:

((?:http|feed)://feeds\.feedburner\.com/[\w/.-]+)
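Applied to a page's raw HTML, the pattern can be used along these lines (a sketch; the example URL is a placeholder):

import re
import urllib2

FEEDBURNER_RE = re.compile(r'((?:http|feed)://feeds\.feedburner\.com/[\w/.-]+)')

html = urllib2.urlopen("http://www.example.com/").read()   # placeholder URL
feedburner_urls = FEEDBURNER_RE.findall(html)              # Plan B candidates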

Next, Plan C took the domain name, simply slapped /feed onto it, and hoped it stuck. I called this process “feed probing,” and it added the remaining 3% that I was looking for. If Plans A, B, and C all failed to find a suitable RSS feed, all hope was lost and I just skipped the site (a 1% error).
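A feed-probing sketch might look like the following; the list of candidate paths and the content-type check are illustrative rather than the exact logic I used.

import urllib2

CANDIDATE_PATHS = ["/feed", "/rss", "/atom.xml"]   # assumed probe paths

def probe_feed(domain):
    """Return the first probed URL that responds with something XML-ish, else None."""
    for path in CANDIDATE_PATHS:
        candidate = "http://" + domain + path
        try:
            handle = urllib2.urlopen(candidate)
            if "xml" in handle.info().gettype():   # crude check that it looks like a feed
                return candidate
        except Exception:
            pass   # probe failed; try the next path
    return None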

On the other hand, there are times when it is the HTTP client or server that cannot be trusted…

Common Python Exceptions in Web Mining

It is all too common to encounter an exception while web mining or crawling. Code must handle these errors gracefully by catching exceptions or failing without aborting. One method that works well is to provide a resume mechanism that restarts execution where the code left off, rather than starting a multi-hour/day/week job over again!
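As a rough illustration (not the exact code from my project), a resume mechanism can be as simple as logging each finished record to a file and skipping anything already logged when the job restarts; the file name and the urls list below are placeholders.

import os

DONE_LOG = "completed_urls.log"   # hypothetical checkpoint file

def load_completed():
    """Return the set of URLs finished on previous runs."""
    if not os.path.exists(DONE_LOG):
        return set()
    return set(line.strip() for line in open(DONE_LOG))

def mark_completed(url):
    """Append a finished URL to the checkpoint file."""
    log = open(DONE_LOG, "a")
    log.write(url + "\n")
    log.close()

completed = load_completed()
for url in urls:                  # 'urls' is whatever list drives the crawl
    if url in completed:
        continue                  # already processed on a previous run
    # ... fetch and process url here ...
    mark_completed(url)

With that safety net in place, below is a taxonomy of common problems (and their Python exceptions):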

HTTP Errors. These occur frequently. Some are recoverable; others are worth throwing out a record over. The most common codes are listed below, but for more information, refer to RFC 2616.

  • 404 Not Found: the web page you tried to download could not be found. Skip the record.
  • 400 Bad Request: the server has deemed the client’s HTTP request malformed. Either retry, or double-check your code!
  • 401 Unauthorized: authentication is required before proceeding. Either skip the record, or add authentication to your code.
  • 403 Forbidden: you are trying to access something you are not allowed to access. This is a common HTTP error thrown when your program is being rate limited.
  • 500 Internal Server Error: something on the server end is wrong. Either try again immediately, or wait a while and retry the request.
  • 3xx Redirect: the web page you are trying to access has moved somewhere else. These were rare in my research but are common in practice. I choose to skip these sites; you may wish not to.

In Python, these can be caught as urllib2.HTTPError. It is also possible to specify actions based on the specific HTTP error code returned:

import urllib2

try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    # e.code holds the numeric HTTP status code returned by the server
    if e.code == 404:
        print "Not Found"
    elif e.code == 500:
        print "Internal Server Error"

Server Errors “URLError”. These occur frequently as well and seem to denote some sort of server or connection trouble, such as “Connection refused” or a site that does not exist. Usually, these are resolved by retrying the fetch. In Python, it is very important to note that HTTPError is a subclass of URLError, so when handling both exceptions distinctly, HTTPError must be caught first.

import urllib2

try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    pass   # handle specific status codes as shown above
except urllib2.URLError, f:
    print f.reason   # e.g. "Connection refused"


Other Bizarreness. The web is very chaotic, and sometimes weird stuff happens. The rare, elusive httplib.BadStatusLine exception technically means that the server returned a status line that the client does not understand, but it can also be thrown when the page being fetched is blank. On a recent project, I ran into a new one, httplib.IncompleteRead, which has little documentation. Both of these issues can usually be resolved by retrying the fetch, and both of these pesky errors (and more) can be handled by simply catching their parent exception, httplib.HTTPException:

import httplib
import urllib2

try:
    content = urllib2.urlopen(url).read()
except httplib.HTTPException:
    # You've encountered a rare beast. You win a prize!
    pass   # retry the fetch or skip the record

Everything Deserves a Second Chance

One common reaction to any error is to just throw the record out. URLErrors are so common, though, that this is probably unwise if you are actually using the data for something. Typically, these errors go away if you try again. I use the following loop to catch errors and react appropriately:

import httplib
import urllib2

HTTP_RETRIES = 3                        # maximum attempts per URL; pick a value that suits your job

for url in urls:                        # outer loop over the records/URLs being mined
    attempt = 0
    while attempt < HTTP_RETRIES:
        attempt += 1
        try:
            temp = urllib2.urlopen(url).read()
            break                       # success: stop retrying
        except urllib2.HTTPError:
            break                       # hard HTTP error (404, 403, ...): give up on this record
        except urllib2.URLError:
            continue                    # connection trouble: try again
        except httplib.HTTPException:
            continue                    # other weirdness: try again
    else:
        continue                        # every retry failed: skip to the next record

This code attempts to fetch each URL a maximum of HTTP_RETRIES times. If the fetch is successful, Python breaks out of the retry loop. If a URLError or HTTPException occurs, we move on to another attempt. If we encounter an HTTP error (not found, restricted, etc.), we give up on that record. The while loop’s else clause runs only when the loop finishes without a break, that is, when every retry failed, so we skip to the next record in the outer loop. Depending on the error, we can modify the code to retry on certain errors and abort on others, but for my purposes, I do not care.
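If you do care, a sketch of that variation might look like this (the set of retryable status codes and the delay are illustrative choices):

import time
import httplib
import urllib2

RETRY_CODES = (500, 502, 503, 403)    # assumed to be transient (403 often means rate limiting)
HTTP_RETRIES = 3                      # arbitrary retry budget
RETRY_DELAY = 5                       # seconds to back off before retrying

for url in urls:                      # 'urls' is the list of pages to fetch
    for attempt in range(HTTP_RETRIES):
        try:
            content = urllib2.urlopen(url).read()
            break                                 # success
        except urllib2.HTTPError, e:
            if e.code in RETRY_CODES:
                time.sleep(RETRY_DELAY)           # wait a while, then retry
                continue
            break                                 # 404, 401, ...: give up on this record
        except (urllib2.URLError, httplib.HTTPException):
            continue                              # connection trouble: retry immediately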

The Comatose Crawler

If you have ever done a large-scale crawl of a web site, you are bound to encounter a state where your crawler becomes comatose – it is running, perhaps using system resources, but it is not outputting anything or reporting progress. It looks like an infinite no-op loop. I have encountered this problem since I started doing web mining in 2006 and did not, until just this past weekend, realize exactly why it was happening and how to prevent it.

Your crawler has sunk in a swamp and is essentially trapped. For whatever reason, the HTTP server your code is communicating with keeps the connection open but sends no data. I suppose this could be a deadlock-type situation where the HTTP server is waiting for an additional request (?), and the crawler is waiting for output from the HTTP server. I had mistakenly assumed that the HTTP protocol had a built-in timeout and was relying on it; apparently that is not the case. There is a simple way to avoid this swamp: set a timeout on the socket sending the HTTP request:

import socket
import urllib2
...
HTTP_TIMEOUT = 5                          # seconds to wait on an unresponsive server
socket.setdefaulttimeout(HTTP_TIMEOUT)    # urllib2 uses this module-level socket default
...
handle = urllib2.urlopen("http://www.google.com")
content = handle.read()
...

If a request goes unanswered for HTTP_TIMEOUT seconds, Python throws a urllib2.URLError exception that can be caught. In my code, I just skip these troublemakers. The timeout shows up as a traceback like this:

Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.4/urllib2.py", line 1021, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.4/urllib2.py", line 996, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error timed out>
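Catching and skipping these timed-out fetches is then straightforward, for example (a sketch; the print statement is just illustrative logging):

import socket
import urllib2

socket.setdefaulttimeout(5)           # seconds; same idea as HTTP_TIMEOUT above

try:
    content = urllib2.urlopen(url).read()
except (urllib2.URLError, socket.timeout), e:
    # Timeouts usually surface as URLError, but a raw socket.timeout can
    # slip through on read(); either way, skip the troublemaker.
    print "Skipping %s (%s)" % (url, e)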

With enough experience, dedication, blood, sweat, tears, and caffeine, data mining the jungle known as the World Wide Web becomes both simple and fun. Happy web mining!

Comments


  • Tim

    3xx are pretty common now that every link shared through social media has been shortened.
    I’m surprised you don’t mention HTTP 408; I ran into those quite frequently.
    Great post.

    • 408 eh? I’ve never ever run across that one! lol I would have expected that HTTP would throw that error on a timeout, but instead my crawlers just get stuck in a swamp. Thanks for the tip!!

      Regarding 3xx. I typically mine from one site (say, a social network), so I’ve been lucky to not see that. I’d venture to guess that my 1% error is from redirects, and hidden RSS links somewhere.

  • As mentioned briefly on Twitter, I wrote some code a while ago for scraping HTML for feeds that have been embedded in a useless and hard-to-detect manner. My sample size was admittedly small (a few hundred pages, I think), but roughly the logic was to start with all possible URLs on the page and apply a sequence of narrowing filters. At each stage, if a filter removes everything, we skip that filter and retain the previous step’s results. At the end you (hopefully) end up with only a few links; do a GET on all of them and see which look like feeds based on content type and actual contents.

    It’s a bit hacky (see special cases like “Hey, blogger don’t actually do real redirects. Let’s handle that”), but it worked pretty well.
