Saturday, July 10, 2010

Caching crawled webpages

When crawling a website I store the HTML in a local cache, so if I need to rescrape the site later I can load the webpages quickly from my local cache and avoid putting extra load on their server. This is often necessary when a client realizes they require additional data scraped.
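
In code the pattern looks roughly like this (a minimal sketch: urllib2 is just for illustration, and the fetch function and cache dict are hypothetical stand-ins for the real downloader and cache):

import urllib2

cache = {}  # stand-in for a persistent cache (see pdict below)

def fetch(url):
    """Return the HTML for url, downloading it only on a cache miss."""
    if url in cache:
        return cache[url]  # reuse the locally stored copy
    html = urllib2.urlopen(url).read()  # only hit the remote server on a miss
    cache[url] = html  # store for later rescrapes
    return html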

I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 are built into Python (2.5+), so there are no external dependencies.

Here is some example usage of pdict, with example URLs and HTML:
>>> from webscraping.pdict import PersistentDict
>>> cache = PersistentDict('cache.db')
>>> cache['http://example.com/1'] = '<html>page 1</html>'
>>> cache['http://example.com/2'] = '<html>page 2</html>'
>>> 'http://example.com/1' in cache
True
>>> cache['http://example.com/1']
'<html>page 1</html>'
>>> cache.keys()
['http://example.com/1', 'http://example.com/2']
>>> del cache['http://example.com/1']
>>> 'http://example.com/1' in cache
False
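
The general idea behind this kind of cache is straightforward: a dict-like class whose __setitem__ compresses the value with zlib and writes it to a sqlite table, and whose __getitem__ reads and decompresses it. Here is a rough sketch of that idea - this is not pdict's actual code, just a minimal illustration, and the class and table names are made up:

import sqlite3
import zlib

class CompressedCache(object):
    """Toy dict-like store: zlib-compressed values in a sqlite table."""
    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, html):
        data = sqlite3.Binary(zlib.compress(html))  # compress before writing
        self.conn.execute('REPLACE INTO cache (key, value) VALUES (?, ?)', (key, data))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute('SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return zlib.decompress(str(row[0]))  # decompress after reading

    def __contains__(self, key):
        return self.conn.execute('SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None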

8 comments:

  1. I think it's a very good idea that your library has no external dependencies, because it has a low adoption cost. To be honest, I hate mucking around with dependencies. You have to download them, get the right version, compile them (sometimes), and get their dependencies. There is too much room for error.

  2. Yeah exactly. Often I distribute pdict with my solutions and I don't want the client having to go through that.

  3. From a cursory look at your library, it looks like you're using xpath for parsing. Any reason why you aren't using an html parser like BeautifulSoup?

    It's just a single file and does not require dependencies (unless I'm mistaken).

  4. hi jimbojambo,
    I often use lxml's html parser, which in my experience performs better than BeautifulSoup.
    However I have sometimes had problems with parsing the html: one is that, when dealing with really broken webpages, html parsers can obscure content.
    The results have been good - I will write a post about it some time.

  5. Bad html does break BeautifulSoup, but I first clean the html through tidy (using tidylib) and then everything works great. Of course there's a small overhead, but it's negligible.

    The problem with lxml is that its docs are really bad; I never really understood anything from its manual.

  6. Yeah, lxml is not as user-friendly, but it does offer a lot more functionality. It also has good docstrings and an active user group.

    BeautifulSoup is no longer being developed - have you tried html5lib?

  7. hi Richard,

    Sorry for replying so late.

    I haven't tried html5lib and probably won't anytime soon.
    I've found Ruby mechanize (it uses nokogiri, an xpath library, for html parsing); it handles cookies automatically and acts almost like a real browser, and it's made scraping sites a lot easier for me.

    I also use Heroku as a test server; unlike GAE, Heroku allows C extensions to be installed, so mechanize works very well there. It only has 5MB of free storage, but that's plenty for me since I don't actually store anything in a db.

    Btw, Ruby's far nicer than Python ;)

    I highly recommend you take a look at it if/when you have the time.

  8. I often use Python's Mechanize library, which is also based on the original Perl library.

    Yes, Ruby is good - I used to use Rails. But I prefer Python!
