Saturday, August 28, 2010

Why reinvent the wheel?

I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

I am aware of these libraries and have used them in the past with good results. However, my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
The best-known Python HTML parser seems to be BeautifulSoup. However, I find it slow, difficult to use (compared to XPath), and often inaccurate when parsing HTML. Significantly, the author has also lost interest in developing it further. So I would not recommend using it - instead go with html5lib.

To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn't find one six months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate it in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.
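To make the two stages concrete, here is a minimal sketch that uses only the Python standard library. It is not the library discussed in this post and it is not html5lib: the names TreeParser, parse and VOID_TAGS and the sample markup are made up for illustration, the parser assumes properly nested tags, and the selection step relies on the limited XPath subset that ElementTree supports.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never take a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeParser(HTMLParser):
    """Stage 1 (hypothetical helper): build an ElementTree from HTML text.

    Assumes tags are properly nested; real-world HTML often is not.
    """

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        # Synthetic root so that stray text and multiple top-level tags fit.
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse(html):
    """Parse an HTML string and return the synthetic root element."""
    parser = TreeParser()
    parser.feed(html)
    parser.close()
    parser.builder.end("document")
    return parser.builder.close()


html = """
<html><body>
<ul id="results">
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
  <li class="ad"><a href="/c">Sponsored</a></li>
</ul>
</body></html>
"""

# Stage 2: select the relevant nodes with an XPath-style expression.
root = parse(html)
links = root.findall(".//li[@class='item']/a")
hrefs = [a.get("href") for a in links]
texts = [a.text for a in links]
print(hrefs, texts)
```

The expression picks out only the links inside list items whose class is "item", which is the kind of one-line selection that makes XPath pleasant compared to walking the tree by hand. A dedicated parser such as html5lib would be the better choice for stage 1 on messy pages, since this sketch does nothing to repair broken markup.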

2 comments:

  1. BeautifulSoup is under development. Here is a quote from the page you linked:

    "This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway."

    I haven't used html5lib yet. Could you show a simple example with BS and html5lib too?

  2. BeautifulSoup is still under development, but it does not do a good job of parsing and the original author has lost interest.

    The html5lib docs have some examples:
    http://code.google.com/p/html5lib/wiki/UserDocumentation
