Thursday, April 15, 2010

Why Google App Engine

In the previous post I covered three alternative approaches to regularly scraping a website for a client, the most common being a web application. However, hosting that web application on either my own server or the client's has problems.

My solution is to host the application on a neutral third-party platform: Google App Engine (GAE). Here is my overview of deploying on GAE:

Pros:
  • provides a stable and consistent platform that I can use for multiple applications
  • both the client and I can log in and manage it, so we do not need to expose our servers
  • has generous free quotas, which I rarely exhaust
Cons:
  • only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (yet)
  • limitations on maximum job time and on database interaction
  • have to trust Google with storing our scraped data
Often deploying on GAE works well for both the client and me, but it is not always practical or possible. I'm still looking for a silver bullet!
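
To give a feel for what deploying on GAE involves, here is a minimal sketch of a cron-triggered scrape handler on the Python runtime. The target URL, the ScrapedRecord model and the /tasks/scrape route are all made up for illustration; a matching cron.yaml entry would point at that route.

# A minimal sketch of a scheduled scrape on GAE; names and URL are hypothetical.
from google.appengine.api import urlfetch
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app


class ScrapedRecord(db.Model):
  # one scraped snapshot stored in the datastore
  content = db.TextProperty()
  fetched = db.DateTimeProperty(auto_now_add=True)


class ScrapeHandler(webapp.RequestHandler):
  # requested by cron on a schedule; must finish within the request time limit
  def get(self):
    result = urlfetch.fetch('http://example.com/prices')  # hypothetical target
    if result.status_code == 200:
      html = result.content.decode('utf-8', 'ignore')  # assume UTF-8 pages
      ScrapedRecord(content=html).put()


application = webapp.WSGIApplication([('/tasks/scrape', ScrapeHandler)])

def main():
  run_wsgi_app(application)

if __name__ == '__main__':
  main()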

Monday, April 12, 2010

Scraping dynamic data

Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However, sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics; of the second, stock prices.

I have three solutions for periodically scraping a website:
  1. I provide the client with my web scraping code, which they can then execute regularly 
  2. The client pays me a small fee in the future whenever they want the data rescraped
  3. I build a web application that scrapes regularly and provides the data in a useful form

The first option is not always practical if the client does not have a technical background. Additionally, my solutions are developed and tested on Linux and may not work on Windows.

The second option is generally not attractive to the client because it puts them in a weak position, dependent on me being contactable and cooperative in the future. It also involves ongoing costs for them.

So usually I end up building a basic web application that consists of a cron job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server; however, most clients prefer the security of hosting it on their own server in case the application breaks down.

Unfortunately I find hosting on their server does not work well, because they will have different versions of libraries or use a platform I am not familiar with. Additionally, I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment.
I can usually figure all this out, but it takes time and also requires the client to trust me with root privileges on their server. And given that these web applications are generally low cost (~$1000), ease of deployment is important.

All this is far from ideal. The solution? See my next post.

Saturday, March 27, 2010

Scraping Flash based websites

Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple's criticism of Flash are good news for me because they encourage developers to try non-Flash solutions.

The reality, though, is that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

  1. Check for AJAX requests between the Flash app and the server that may carry the data you are after
  2. Extract text with the Macromedia Flash Search Engine SDK
  3. Use OCR to extract the text directly

Most Flash apps are self-contained and so don't make AJAX requests, which rules out (1), and I have had poor results with (2) and (3).
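
For what it's worth, here is a rough sketch of what approach (3) can look like, assuming the Flash content has already been captured to a screenshot and that the pytesseract wrapper around the Tesseract OCR engine is installed (the filename is a placeholder):

# A rough sketch of OCR'ing text from a captured Flash screenshot.
# Assumes Tesseract is installed along with the pytesseract and PIL libraries,
# and that 'flash_screenshot.png' has already been captured by other means.
from PIL import Image
import pytesseract

image = Image.open('flash_screenshot.png')
text = pytesseract.image_to_string(image)
print(text)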

Still no silver bullet...

Tuesday, March 16, 2010

I love AJAX!

AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example, Gmail uses AJAX to load new messages. The old way to do this was to reload the webpage and embed the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather than just the updated data.
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX-powered websites are easier to scrape.

The trouble with scraping websites is that they obscure the data I am after within a layer of presentation. AJAX calls, however, typically return just the data in an easy-to-parse format like JSON or XML, so effectively they provide an API to the backend database.

These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.
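
To make that concrete, here is a minimal sketch of paging through such an endpoint from Python. The URL and the page parameter are invented; substitute whatever Firebug shows the real application requesting.

# Call a (hypothetical) AJAX endpoint directly and page through its records.
import json
import urllib
import urllib2

def fetch_records(page):
  params = urllib.urlencode({'page': page, 'format': 'json'})
  response = urllib2.urlopen('http://example.com/ajax/search?' + params)
  return json.loads(response.read())

for page in range(1, 4):
  print(fetch_records(page))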

Friday, March 12, 2010

Scraping JavaScript webpages with webkit

In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
  1. requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
  2. is slow, because it has to wait for Firefox to render the entire webpage
  3. is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is WebKit, the open-source browser engine used most famously in Apple's Safari browser. WebKit has been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing its JavaScript) and then makes the final HTML available:


import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  """Render a webpage with QtWebKit, executing any JavaScript it contains."""
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    # call _loadFinished() once the page (and its JavaScript) has loaded
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # block here until _loadFinished() quits the event loop
    self.app.exec_()

  def _loadFinished(self, result):
    # keep a reference to the rendered frame, then stop the event loop
    self.frame = self.mainFrame()
    self.app.quit()


url = 'http://sitescraper.net'
r = Render(url)
# the rendered HTML, after any JavaScript has executed
html = r.frame.toHtml()


I can then analyze this resulting HTML with my standard Python tools like the webscraping module.

Tuesday, March 2, 2010

Scraping JavaScript based web pages with Chickenfoot

The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so that the desired data is not found in the original HTML. This is usually done for legitimate reasons, such as loading the page faster, but in some cases it is designed solely to inhibit scrapers.
This can make scraping a little tougher, but not impossible.

The easiest case is where the content is stored in JavaScript structures which are then inserted into the DOM at page load. This means the content is still embedded in the HTML, but needs to be scraped from the JavaScript code rather than the HTML tags.
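
For example, if the page embeds something like var products = [...] in a script tag, a regular expression plus a JSON parser is often all that is needed. The variable name, file name and structure below are invented, and this only works when the JavaScript literal happens to be valid JSON:

# Pull records out of a (hypothetical) JavaScript variable embedded in the HTML.
import json
import re

html = open('downloaded_page.html').read()  # placeholder for the downloaded page
match = re.search(r'var\s+products\s*=\s*(\[.*?\]);', html, re.DOTALL)
if match:
  for product in json.loads(match.group(1)):
    print(product)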

A trickier case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and then run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool for doing this is the Chickenfoot Firefox extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high-level functions to make interaction and navigation easier.

To get a feel for Chickenfoot, here is an example that crawls a website:


// crawl the given website URL recursively to the given depth
function crawl(website, max_depth, links) {
  if(!links) {
    // first call: load the home page and mark it as visited
    links = {};
    go(website);
    links[website] = 1;
  }

  // TODO: insert code to act on the current webpage here

  if(max_depth > 0) {
    // iterate over the links on the current page
    for(var link=find("link"); link.hasMatch; link=link.next) {
      var url = link.element.href;
      if(!links[url]) {
        if(url.indexOf(website) == 0) {
          // an unvisited link on the same domain - follow it
          go(url);
          links[url] = 1;
          crawl(website, max_depth - 1, links);
        }
      }
    }
  }
  // return to the previous page
  back(); wait();
}

This is part of a script I built on my Linux machine for a client on Windows and it worked fine for both of us.
To find out more about Chickenfoot check out their video.

Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.

Monday, February 8, 2010

How to crawl websites without being blocked

Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so that users can find them; however, they don't (generally) want to be crawled by anyone else. Ironically, Google is one such company.

Some websites will actively try to stop scrapers, so here are some suggestions to help you crawl beneath their radar.


Speed
If you download one webpage a day then you will not be blocked, but your crawl will take too long to be useful. If you instead use threading to crawl multiple URLs asynchronously then they might mistake you for a DoS attack and blacklist your IP. So what is the happy medium? The Wikipedia article on web crawlers currently states: "Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes." This is a little slow, and I have found one download every 5 seconds is usually fine. If you don't need the data quickly then use a longer delay to reduce your risk and be kinder to their server.
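
As a minimal sketch, throttling is just a sleep between downloads (the URLs are placeholders):

# Download a list of (placeholder) URLs at roughly one request every 5 seconds.
import time
import urllib2

urls = ['http://example.com/page%d' % i for i in range(1, 6)]
for url in urls:
  html = urllib2.urlopen(url).read()
  # ... parse and store html here ...
  time.sleep(5)  # pause between downloads to be kind to the server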

Identity
Websites do not want to block genuine users, so you should try to look like one. Set your user-agent to that of a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be Googlebot (only for the brave): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

If you have access to multiple IP addresses (for example via proxies) then distribute your requests among them so that your downloading appears to come from multiple users.
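
Here is a sketch of both ideas with urllib2. The user-agent string is just one example of a common browser, and the proxy addresses are placeholders:

# Set a browser-like user-agent and rotate requests through a list of proxies.
import itertools
import urllib2

USER_AGENT = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'
proxies = itertools.cycle(['http://1.2.3.4:8080', 'http://5.6.7.8:8080'])  # placeholders

def download(url):
  opener = urllib2.build_opener(urllib2.ProxyHandler({'http': next(proxies)}))
  request = urllib2.Request(url, headers={'User-Agent': USER_AGENT})
  return opener.open(request).read()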

Consistency
Avoid accessing webpages sequentially (/product/1, /product/2, etc.) and don't download a new webpage exactly every N seconds.
Both of these patterns can attract attention to your downloading because a real user browses more randomly. So crawl webpages in an unordered manner and add a random offset to the delay between downloads.
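
A quick sketch of both suggestions (placeholder URLs again):

# Crawl in a random order with a randomized delay between downloads.
import random
import time
import urllib2

urls = ['http://example.com/product/%d' % i for i in range(1, 101)]
random.shuffle(urls)  # avoid hitting /product/1, /product/2, ... in order

for url in urls:
  html = urllib2.urlopen(url).read()
  # ... parse and store html here ...
  time.sleep(5 + random.uniform(-2, 2))  # base delay plus a random offset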


Following these recommendations will allow you to crawl most websites without being detected.