Sunday, July 25, 2010

All your data are belong to us?

Regarding the title of this blog, "All your data are belong to us": I realize not everyone gets the reference, so see this Wikipedia article for an explanation.

Saturday, July 10, 2010

Caching crawled webpages

When crawling a website I store the HTML in a local cache, so that if I need to rescrape the site later I can load the pages quickly from disk and avoid putting extra load on the website's server. This is often necessary when a client realizes they require additional features scraped.
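The pattern is roughly this (just a sketch, assuming Python 2's urllib2 and some dict-like cache object; the download() helper name is only for illustration):

import urllib2

def download(url, cache):
    # illustrative helper: return the cached copy if this URL was already
    # crawled, otherwise download it and store the HTML for next time
    if url in cache:
        return cache[url]
    html = urllib2.urlopen(url).read()
    cache[url] = html
    return html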

I built the pdict library to manage my cache. pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed with zlib before writing and decompressed after reading. Both zlib and sqlite3 are bundled with Python (2.5+), so there are no external dependencies.

Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> url1, url2 = 'http://example.com/1', 'http://example.com/2'
>>> html1, html2 = '<html>page 1</html>', '<html>page 2</html>'
>>> cache = PersistentDict('cache.db')
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
'<html>page 1</html>'
>>> cache.keys()
['http://example.com/1', 'http://example.com/2']
>>> del cache[url1]
>>> url1 in cache
False
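
For the curious, here is a simplified sketch of the idea behind pdict (not its actual code): a dictionary-like class whose values are zlib-compressed and stored in a sqlite table, assuming Python 2 byte strings as in the example above.

import sqlite3
import zlib

class CompressedCache:
    """Simplified illustration only: dict interface, values zlib-compressed and stored in sqlite."""
    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, html):
        # compress before writing to disk
        data = sqlite3.Binary(zlib.compress(html))
        self.conn.execute('REPLACE INTO cache (key, value) VALUES (?, ?)', (key, data))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute('SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress after reading
        return zlib.decompress(str(row[0]))

    def __contains__(self, key):
        return self.conn.execute('SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None

    def __delitem__(self, key):
        # delete silently; the real library may raise KeyError for missing keys
        self.conn.execute('DELETE FROM cache WHERE key=?', (key,))
        self.conn.commit()

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT key FROM cache')]

Keeping the values compressed matters when caching thousands of pages, since HTML compresses well with zlib.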

Thursday, July 1, 2010

Fixed fee or hourly?

I prefer to quote per project rather than per hour for my web scraping work because it:
  • gives me an incentive to increase my efficiency (by improving my infrastructure)
  • gives the client certainty about the total cost
  • avoids distrust about the number of hours actually worked
  • makes me look more competitive against the hourly rates available in Asia and Eastern Europe
  • avoids the difficulty of tracking time fairly when working on two or more projects simultaneously
  • is easier to estimate from past experience, at least compared to building websites
  • involves less administration