Saturday, January 2, 2010

Parsing HTML with Python

HTML is a tree structure: at the root is an <html> tag, which contains the <head> and <body> tags, which in turn contain more tags before reaching the content itself. However, when a webpage is downloaded, all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the text into a tree structure.

Unfortunately the HTML of many webpages around the internet is invalid - for example a list element may be missing a closing tag:
<ul>
<li>abc</li>
<li>def
<li>ghi</li>
</ul>
but it still needs to be interpreted as:
  • abc
  • def
  • ghi
This means we can't naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as BeautifulSoup, lxml, html5lib, and libxml2dom.
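To see why the naive approach fails, here is a minimal sketch that pairs each <li> with the next </li> using a regular expression on the broken list above. The variable names are my own, chosen just for illustration:

```python
import re

# The broken list from above: the second <li> is never closed.
broken = '<ul><li>abc</li><li>def<li>ghi</li></ul>'

# Naive rule: an item runs from <li> to the next </li>.
naive_items = re.findall(r'<li>(.*?)</li>', broken)

# The unclosed item swallows the one after it.
print(naive_items)  # ['abc', 'def<li>ghi']
```

Only two items come back instead of three, and the second one still contains markup.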
The most well known and widely used of these libraries seems to be BeautifulSoup. A Google search for "Python web scraping module" currently returns BeautifulSoup as the first result.
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.

You will need lxml version 2 or later, which includes the html module. On Ubuntu releases up to 8.10, which came with an earlier version, this meant compiling lxml from source.

Here is an example of how to parse the previous broken HTML with lxml:
>>> from lxml import html
>>> tree = html.fromstring('<ul><li>abc</li><li>def<li>ghi</li></ul>')
>>> tree.xpath('//ul/li')
[<Element li at 959553c>, <Element li at 95952fc>, <Element li at 959544c>]
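The element objects on their own do not show what was parsed, so here is a small sketch that pulls out the text of each item. It assumes lxml 2 or later is installed and uses the text_content() method of lxml.html elements. The variable names are my own:

```python
from lxml import html

# Same broken list as before: the second <li> has no closing tag.
broken = '<ul><li>abc</li><li>def<li>ghi</li></ul>'
tree = html.fromstring(broken)

# Collect the text of every list item found by the XPath query.
items = [li.text_content().strip() for li in tree.xpath('//ul/li')]
print(items)
```

If the parser has repaired the missing closing tag, this should print three separate items rather than two.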
