When crawling a website I store the HTML in a local cache, so if I need to rescrape the website later I can load the webpages quickly from my local cache and avoid putting extra load on their server. This is often necessary when a client realizes they need additional data scraped.
I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+), so there are no external dependencies.
Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> cache = PersistentDict(CACHE_FILE)
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
html1
>>> cache.keys()
[url1, url2]
>>> del cache[url1]
>>> url1 in cache
False
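To give an idea of how such a cache can work, here is a minimal sketch (not the actual pdict implementation) of a dictionary-like class that compresses values with zlib and stores them in a sqlite table. The class and column names are just for illustration:

import sqlite3
import zlib

class DiskCache:
    """A simplified dictionary-like cache backed by sqlite, with zlib compression."""
    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, value):
        # compress the value before writing it to disk
        data = sqlite3.Binary(zlib.compress(value.encode('utf-8')))
        self.conn.execute('REPLACE INTO cache (key, value) VALUES (?, ?)', (key, data))
        self.conn.commit()

    def __getitem__(self, key):
        # decompress the value after reading it back
        row = self.conn.execute('SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return zlib.decompress(row[0]).decode('utf-8')

    def __contains__(self, key):
        return self.conn.execute('SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None

    def __delitem__(self, key):
        self.conn.execute('DELETE FROM cache WHERE key=?', (key,))
        self.conn.commit()

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT key FROM cache')]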
I think it's a very good idea that your library has no external dependencies, because it has a low adoption cost. To be honest, I hate mucking around with dependencies. You have to download them, get the right version, sometimes compile them, and get their dependencies. There is too much room for error.
Yeah exactly. Often I distribute pdict with my solutions and I don't want the client to have to go through that.
From a cursory look at your library, it looks like you're using XPath for parsing. Any reason why you aren't using an HTML parser like BeautifulSoup?
It's just a single file and does not require dependencies (unless I'm mistaken).
hi jimbojambo,
I often use lxml's HTML parser, which in my experience performs better than BeautifulSoup.
However I have sometimes faced problems with pre-parsing the HTML. One is that when dealing with really broken webpages, HTML parsers can obscure content.
Results have been good - I will write a post about it some time.
Bad HTML does break BeautifulSoup, but I first clean the HTML through tidy (using tidylib) and then everything works great. Of course there's a small overhead, but it's negligible.
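Roughly, that pipeline looks like the following sketch (assuming the pytidylib and BeautifulSoup packages are installed; the function name and the link extraction are just for illustration):

from tidylib import tidy_document
from BeautifulSoup import BeautifulSoup  # with bs4 this would be: from bs4 import BeautifulSoup

def extract_links(html):
    # tidy repairs the broken markup before BeautifulSoup parses it
    cleaned, errors = tidy_document(html)
    soup = BeautifulSoup(cleaned)
    return [a.get('href') for a in soup.findAll('a')]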
The problem with lxml is that its docs are really bad; I never really understood anything from its manual.
Yeah, lxml is not as user friendly, but it does offer a lot more functionality. It also has good docstrings and an active user group.
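For example, a short snippet like this (assuming lxml is installed) shows its HTML parser tolerating broken markup and answering an XPath query:

import lxml.html

broken_html = '<div><a href="/page1">First</a><a href="/page2">Second</a>'  # note: unclosed div
tree = lxml.html.fromstring(broken_html)
# lxml repairs the markup and supports XPath queries
print(tree.xpath('//a/@href'))  # ['/page1', '/page2']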
BeautifulSoup is no longer being developed - have you tried html5lib?
hi Richard,
Sorry for replying so late.
I haven't tried html5lib and probably won't anytime soon.
I've found Ruby's mechanize (it uses nokogiri, an XPath library, for HTML parsing); it has automatic handling of cookies and acts almost like a real browser. It's made scraping sites a lot easier for me.
I also use Heroku as a test server; unlike GAE, Heroku allows C extensions to be installed, so mechanize works very well there. It only has 5 MB of free storage, but that's plenty for me since I don't actually store anything in a database.
Btw, Ruby's far nicer than Python ;)
I highly recommend you take a look at it if/when you have the time.
I often use Python's mechanize library, which, like Ruby's, is based on the original Perl library.
Yes, Ruby is good - I used to use Rails. But I prefer Python!
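For reference, a minimal example with Python's mechanize might look like this (assuming the mechanize package is installed; the URL is just a placeholder):

import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)  # optional: do not check robots.txt
# cookies are handled automatically, much like a real browser
response = browser.open('http://example.com')
html = response.read()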