Thursday, April 15, 2010

Why Google App Engine

In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However, hosting the web application on either my own or the client's server has problems.

My solution is to host the application on a neutral third-party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:

Pros:
  • provides a stable and consistent platform that I can use for multiple applications
  • both the customer and I can log in and manage it, so we do not need to expose our servers
  • has generous free quotas, which I rarely exhaust
Cons:
  • only supports pure Python (or Java), so libraries that rely on C extensions, such as lxml, are not supported (yet)
  • limits on how long a job can run and on how the application can interact with the database
  • have to trust Google with storing our scraped data
Deploying on GAE often works well for both the client and me, but it is not always practical or possible. I'm still looking for a silver bullet!
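
To make this concrete, here is a minimal sketch of how a scheduled scrape can be wired up on GAE (Python 2, as was standard at the time). The schedule, URL path, and data model are illustrative assumptions, not code from a real project: cron.yaml tells App Engine when to call the handler, and the handler fetches a page and stores it in the datastore.

    # cron.yaml - the schedule below is an illustrative assumption
    cron:
    - description: daily scrape of the target site
      url: /tasks/scrape
      schedule: every 24 hours

    # scrape.py - a hypothetical handler registered at /tasks/scrape
    from google.appengine.ext import webapp, db
    from google.appengine.api import urlfetch
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Record(db.Model):
        # hypothetical model storing one scraped page per run
        content = db.TextProperty()
        created = db.DateTimeProperty(auto_now_add=True)

    class ScrapeHandler(webapp.RequestHandler):
        def get(self):
            # urlfetch is App Engine's API for outbound HTTP requests
            result = urlfetch.fetch('http://example.com/data')  # placeholder URL
            if result.status_code == 200:
                # assuming a UTF-8 page here
                Record(content=db.Text(result.content, encoding='utf-8')).put()

    application = webapp.WSGIApplication([('/tasks/scrape', ScrapeHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()

Each request, including cron requests, has to finish within GAE's request deadline, which is why big scraping jobs need to be broken into smaller pieces.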

Monday, April 12, 2010

Scraping dynamic data

Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However, sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics; of the second, stock prices.
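
For the one-off case the deliverable is typically just a script and its CSV output. Here is a minimal sketch of that pattern (Python 2); the URL, the regular expression, and the column names are all made up for illustration:

    import csv
    import re
    import urllib2

    # download the page (placeholder URL)
    html = urllib2.urlopen('http://example.com/stats').read()

    # extract pairs of table cells; a real job would use a proper HTML parser
    rows = re.findall(r'<td>(.*?)</td>\s*<td>(.*?)</td>', html)

    # write the results out as CSV for the client
    writer = csv.writer(open('output.csv', 'wb'))
    writer.writerow(['name', 'value'])  # hypothetical column headers
    writer.writerows(rows)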

I have three solutions for periodically scraping a website:
  1. I provide the client with my web scraping code, which they can then execute regularly
  2. The client pays me a small fee whenever they want the data scraped again in the future
  3. I build a web application that scrapes regularly and provides the data in a useful form

The first option is not always practical if the client does not have a technical background. Additionally, my solutions are developed and tested on Linux and may not work on Windows.

The second option is generally not attractive to the client because it puts them in a weak position, dependent on me being contactable and cooperative in the future. It also involves ongoing costs for them.

So usually I end up building a basic web application that consists of a cron job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server; however, most clients prefer the security of hosting it on their own server in case the app breaks down.
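
On a standard Linux server the scheduling side of this is little more than a crontab entry that runs the scraping script. The schedule and paths below are illustrative placeholders:

    # run the scrape daily at 2am and append output to a log
    0 2 * * * /usr/bin/python /home/client/scraper/scrape.py >> /var/log/scrape.log 2>&1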

Unfortunately I find hosting on their server does not work well, because they will have different versions of libraries or use a platform I am not familiar with. Additionally, I prefer to build my web applications in Python (using web2py), and though Python is great for development, it cannot compare to PHP for ease of deployment.
I can usually figure all this out, but it takes time, and it requires the client to trust me with root privileges on their server. Given that these web applications are generally low cost (~$1000), ease of deployment is important.

All this is far from ideal. The solution? See my next post.