Friday, January 29, 2010

The SiteScraper library

As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. Frustration with a previous project had left me wanting a tool to make scraping web pages easier, so that is what I set out to build.

My idea for this tool was that it should be possible to train a program to scrape a website just by giving it the desired outputs for some example webpages. The program would build a model of how to extract this content, and that model could then be applied to scrape other webpages that use the same template.
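To give a flavour of the approach, here is a toy sketch of the training idea using lxml. This is my own simplified illustration, not SiteScraper's actual API or implementation: it finds the element containing the example output, records its XPath, and reuses that XPath on another page built from the same template.

# Toy illustration of training from example outputs (not SiteScraper's real code).
from lxml import html

def learn_xpath(example_html, desired_output):
    # Return an XPath selecting the element whose text matches the desired output.
    tree = html.fromstring(example_html)
    for element in tree.iter():
        if element.text and element.text.strip() == desired_output:
            return tree.getroottree().getpath(element)
    return None

train_page = '<html><body><div id="content"><h1>First article</h1></div></body></html>'
test_page = '<html><body><div id="content"><h1>Second article</h1></div></body></html>'

xpath = learn_xpath(train_page, 'First article')        # e.g. /html/body/div/h1
print(html.fromstring(test_page).xpath(xpath)[0].text)  # Second article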

The tool was eventually called SiteScraper and is available for download on Google Code. For more information have a browse of this paper, which covers the implementation and results in detail.

I use SiteScraper for much of my scraping work and often make updates based on experience gained from a project.

Wednesday, January 20, 2010

Web scraping with regular expressions

Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage. Below is a comparison of doing this with BeautifulSoup, lxml, and regular expressions.
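A minimal sketch of that comparison is below (written with current Python 3 and the bs4 import; 'webpage.html' is just a placeholder for any saved page containing a <title> tag):

# Sketch of the benchmark: each function extracts the <title> from an
# already-downloaded page, and each is timed over many repetitions.
import re
import time
from bs4 import BeautifulSoup   # in 2010 this was `from BeautifulSoup import BeautifulSoup`
from lxml import html

def regex_test(page):
    return re.search('<title>(.*?)</title>', page).group(1)

def lxml_test(page):
    return html.fromstring(page).xpath('//title')[0].text

def bs_test(page):
    return BeautifulSoup(page, 'html.parser').title.string

def benchmark(fn, page, repetitions=1000):
    start = time.time()
    for _ in range(repetitions):
        fn(page)
    print('%s took %.3f ms' % (fn.__name__, (time.time() - start) * 1000))

if __name__ == '__main__':
    page = open('webpage.html').read()   # any saved webpage with a <title> tag
    for fn in (regex_test, lxml_test, bs_test):
        benchmark(fn, page)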

The results are:

regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms


That means for this use case lxml takes over 40 times longer than regular expressions, and BeautifulSoup over 1000 times longer! This is because lxml and BeautifulSoup must parse the entire document into their internal format, when only the title is required.

XPaths are very useful for most web scraping tasks, but there is still a place for regular expressions.

Tuesday, January 5, 2010

How to use XPaths robustly

In an earlier post I referred to XPaths but did not explain how to use them.

Say we have the following HTML document:
<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>

To access the list elements we follow the HTML structure from the root tag down to the li's:
html > body > 2nd div > ul > many li's.

An XPath to represent this traversal is:
/html[1]/body[1]/div[2]/ul[1]/li

If a tag has no index then every tag of that type will be selected:
/html/body/div/ul/li

XPaths can also use attributes to select nodes:
/html/body/div[@id="content"]/ul/li

And instead of using an absolute XPath from the root, the XPath can be made relative to a particular node by using a double slash:
//div[@id="content"]/ul/li
This is more reliable than an absolute XPath because it can still locate the correct content if the surrounding structure changes.

There are other features in the XPath standard but the above are all I use regularly.
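To see these selectors in action, here is a quick check with lxml against the example document above (my own sketch):

# Evaluate the absolute and relative XPaths against the example document.
from lxml import html

tree = html.fromstring("""
<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>""")

print([li.text for li in tree.xpath('/html/body/div[2]/ul/li')])     # ['First item', 'Second item']
print([li.text for li in tree.xpath('//div[@id="content"]/ul/li')])  # same items, via the relative XPath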



A handy way to find the XPath of a tag is with Firefox's Firebug extension. To do this, open the HTML tab in Firebug, right-click the element you are interested in, and select "Copy XPath". (Alternatively, use the "Inspect" button to select the tag.)
This will give you an XPath with indices only where there are multiple tags of the same type, such as:
/html/body/div[2]/ul/li
One thing to keep in mind is that Firefox always inserts a <tbody> tag inside tables, whether or not one existed in the original HTML, so an XPath copied from Firebug may not match the downloaded source. I need to be reminded of this often!

For one-off scrapes the above XPath should be fine. But for long-term, repeated scrapes it is better to use a relative XPath anchored on an ID element, with attributes instead of indices. In my experience such an XPath is more likely to survive minor modifications to the layout. However, for a more robust solution see my SiteScraper library, which I will introduce in a later post.

Saturday, January 2, 2010

Parsing HTML with Python

HTML is a tree structure: at the root is the <html> tag, followed by the <head> and <body> tags, and then further tags before the content itself. However, when a webpage is downloaded, all one gets is a string of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

Unfortunately the HTML of many webpages around the internet is invalid - for example a list element may be missing a closing tag:
<ul>
<li>abc</li>
<li>def
<li>ghi</li>
</ul>
but it still needs to be interpreted as:
  • abc
  • def
  • ghi
This means we can't naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as BeautifulSoup, lxml, html5lib, and libxml2dom.
Seemingly the best known and most widely used of these libraries is BeautifulSoup: a Google search for "Python web scraping module" currently returns BeautifulSoup as the first result.
However, I use lxml instead because I find it more robust when parsing bad HTML. Additionally, Ian Bicking found lxml more efficient than the other parsing libraries, though my priority here is accuracy rather than speed.

You will need version 2 or later of lxml, which includes the html module. On Ubuntu releases up to 8.10, which shipped with an earlier version, this meant compiling lxml myself.
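If you are not sure which version you have, lxml reports it at runtime (a quick check, assuming lxml is already installed):

# Print the installed lxml version; the html module needs lxml 2.0 or later.
import lxml.etree
print(lxml.etree.LXML_VERSION)   # e.g. (2, 0, 0, 0) or later
from lxml import html            # only available from version 2 onwards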

Here is an example of how to parse the previous broken HTML with lxml:
>>> from lxml import html
>>> tree = html.fromstring('<ul><li>abc</li><li>def<li>ghi</li></ul>')
>>> tree.xpath('//ul/li')
[<Element li at 959553c>, <Element li at 95952fc>, <Element li at 959544c>]