Regarding the title of this blog, "All your data are belong to us" - I realized not everyone gets the reference. See this Wikipedia article for an explanation.
Saturday, July 10, 2010
Caching crawled webpages
When crawling a website I store the HTML in a local cache, so if I need to rescrape the website later I can load the webpages quickly from the cache and avoid putting extra load on the website's server. This is often necessary when a client realizes they require additional features scraped.
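To make the pattern concrete, here is a minimal sketch of caching downloads. The fetch helper and the plain in-memory dict are illustrative only; a plain dict is lost when the process exits, which is what motivates the persistent cache below:

import urllib2

cache = {}  # toy in-memory cache; contents are lost when the process exits

def fetch(url):
    # return cached HTML when available, else download and store it
    if url in cache:
        return cache[url]
    html = urllib2.urlopen(url).read()
    cache[url] = html
    return html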
I built the pdict library to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+) so there are no external dependencies.
Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> cache = PersistentDict(CACHE_FILE)
>>> cache[url1] = html1
>>> cache[url2] = html2
>>> url1 in cache
True
>>> cache[url1]
html1
>>> cache.keys()
[url1, url2]
>>> del cache[url1]
>>> url1 in cache
False
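For intuition, here is a simplified sketch (Python 2, not the actual pdict source) of how a dictionary-like wrapper over sqlite3 with zlib compression can support the operations above:

import sqlite3
import zlib

class PersistentDict(object):
    """Simplified sketch of a dict-like interface over sqlite;
    the real pdict library has more features."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')

    def __contains__(self, key):
        row = self.conn.execute('SELECT 1 FROM cache WHERE key=?', (key,)).fetchone()
        return row is not None

    def __getitem__(self, key):
        row = self.conn.execute('SELECT value FROM cache WHERE key=?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress on the way out
        return zlib.decompress(str(row[0]))

    def __setitem__(self, key, value):
        # compress on the way in
        data = sqlite3.Binary(zlib.compress(value))
        self.conn.execute('INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)', (key, data))
        self.conn.commit()

    def __delitem__(self, key):
        self.conn.execute('DELETE FROM cache WHERE key=?', (key,))
        self.conn.commit()

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT key FROM cache')]

Compressing each value before writing keeps the cache file small, at the cost of a little CPU on each read and write.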
Thursday, July 1, 2010
Fixed fee or hourly?
I prefer to quote per project rather than per hour for my web scraping work because it:
- gives me an incentive to increase my efficiency (by improving my infrastructure)
- gives the client certainty about the total cost
- avoids distrust about the number of hours actually worked
- makes me look more competitive compared to the hourly rates available in Asia and Eastern Europe
- avoids the difficulty of tracking time fairly when working on two or more projects simultaneously
- is workable because the complexity of scraping jobs is easy to estimate from past experience, at least compared to building websites
- involves less administration