HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.
Unfortunately the HTML of many webpages around the internet is invalid - for example a list element may be missing a closing tag:
Seemingly the most well known and used such library is BeautifulSoup. A Google search for Python web scraping module currently returns BeautifulSoup as the first result.
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.
You will need to use version 2 onwards of lxml, which includes the html module. This meant needing to compile lxml up to Ubuntu 8.10, which came with an earlier version.
Here is an example how to parse the previous broken HTML with lxml:
Unfortunately the HTML of many webpages around the internet is invalid - for example a list element may be missing a closing tag:
<ul>but it still needs to be interpreted as:
<li>abc</li>
<li>def
<li>ghi</li>
</ul>
- abc
- def
- ghi
Seemingly the most well known and used such library is BeautifulSoup. A Google search for Python web scraping module currently returns BeautifulSoup as the first result.
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.
You will need to use version 2 onwards of lxml, which includes the html module. This meant needing to compile lxml up to Ubuntu 8.10, which came with an earlier version.
Here is an example how to parse the previous broken HTML with lxml:
>>> from lxml import html
>>> tree = html.fromstring('<ul><li>abc</li><li>def<li>ghi</li></ul>')
>>> tree.xpath('//ul/li')
[<Element li at 959553c>, <Element li at 95952fc>, <Element li at 959544c>]
No comments:
Post a Comment