Sunday, July 25, 2010

All your data are belong to us?

Regarding the title of this blog, "All your data are belong to us": I realize not everyone gets the reference, so see this Wikipedia article for an explanation.

Saturday, July 10, 2010

Caching crawled webpages

When crawling a website I store the HTML in a local cache, so that if I need to rescrape the site later I can load the pages quickly from disk and avoid putting extra load on the website's server. This is often necessary when a client realizes they require additional features scraped.
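The pattern is roughly this (just a sketch, assuming Python 2's urllib2 and some dict-like cache object; the download() helper name is only for illustration):

import urllib2

def download(url, cache):
    # illustrative helper: return the cached copy if this URL was already
    # crawled, otherwise download it and store the HTML for next time
    if url in cache:
        return cache[url]
    html = urllib2.urlopen(url).read()
    cache[url] = html
    return html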

I built the pdict library to manage my cache. pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed with zlib before writing and decompressed after reading. Both zlib and sqlite3 are bundled with Python (2.5+), so there are no external dependencies.

Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> url1, url2 = 'http://example.com/1', 'http://example.com/2'
>>> html1, html2 = '<html>page 1</html>', '<html>page 2</html>'
>>> cache = PersistentDict('cache.db')
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
'<html>page 1</html>'
>>> cache.keys()
['http://example.com/1', 'http://example.com/2']
>>> del cache[url1]
>>> url1 in cache
False
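
For the curious, here is a simplified sketch of the idea behind pdict (not its actual code): a dictionary-like class whose values are zlib-compressed and stored in a sqlite table, assuming Python 2 byte strings as in the example above.

import sqlite3
import zlib

class CompressedCache:
    """Simplified illustration only: dict interface, values zlib-compressed and stored in sqlite."""
    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __setitem__(self, key, html):
        # compress before writing to disk
        data = sqlite3.Binary(zlib.compress(html))
        self.conn.execute('REPLACE INTO cache (key, value) VALUES (?, ?)', (key, data))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute('SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress after reading
        return zlib.decompress(str(row[0]))

    def __contains__(self, key):
        return self.conn.execute('SELECT 1 FROM cache WHERE key=?', (key,)).fetchone() is not None

    def __delitem__(self, key):
        # delete silently; the real library may raise KeyError for missing keys
        self.conn.execute('DELETE FROM cache WHERE key=?', (key,))
        self.conn.commit()

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT key FROM cache')]

Keeping the values compressed matters when caching thousands of pages, since HTML compresses well with zlib.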

Thursday, July 1, 2010

Fixed fee or hourly?

I prefer to quote per project rather than per hour for my web scraping work because it:
  • gives me an incentive to increase my efficiency (by improving my infrastructure)
  • gives the client certainty about the total cost
  • avoids distrust about the number of hours actually worked
  • makes me look more competitive against the hourly rates available in Asia and Eastern Europe
  • avoids the difficulty of tracking time fairly when working on two or more projects simultaneously
  • is easier to estimate from past experience, at least compared to building websites
  • involves less administration