Saturday, March 27, 2010

Scraping Flash based websites

Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple's criticism of Flash are good news for me because they encourage developers to try non-Flash solutions.

The reality is though that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

  1. Check for AJAX requests between the Flash app and the server that may carry the data you are after
  2. Extract text with the Macromedia Flash Search Engine SDK
  3. Use OCR to extract the text directly

Most Flash apps are self-contained and so don't use AJAX, which rules out (1), and I have had poor results with (2) and (3).
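
To give a flavour of (3), here is a minimal sketch. It assumes the Flash region has already been captured to a screenshot (flash.png is a placeholder name) and that the open source tesseract OCR engine is installed, which is not necessarily the exact setup I tried:

import subprocess

# run tesseract over a screenshot of the Flash app;
# it writes the recognised text to output.txt
subprocess.check_call(['tesseract', 'flash.png', 'output'])
text = open('output.txt').read()
print text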

Still no silver bullet...

Tuesday, March 16, 2010

I love AJAX!

AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example, Gmail uses AJAX to load new messages. The old way to do this was to reload the webpage and embed the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather than just the updated data.
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are easier to scrape.

The trouble with scraping websites is that they obscure the data I am after within a layer of presentation. However AJAX calls typically return just the data in an easy-to-parse format like JSON or XML. So effectively they provide an API to their backend database.

These AJAX calls can be monitored through tools such as Firebug to see what URLs are requested and what the server returns. Then I can call those URLs directly myself from outside the application and change the query parameters to fetch other records.
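
To make this concrete, here is a minimal sketch of calling such a URL directly, assuming a JSON endpoint discovered with Firebug (the URL, query parameters, and field names below are made up for illustration):

import json
import urllib2

# hypothetical AJAX endpoint found by watching the site in Firebug;
# tweaking the query parameters fetches other records
url = 'http://example.com/ajax/search.json?q=web+scraping&page=2'
data = json.loads(urllib2.urlopen(url).read())
for record in data['results']:  # 'results' and 'title' are assumed field names
    print record['title']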

Friday, March 12, 2010

Scraping JavaScript webpages with webkit

In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
  1. requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
  2. is slow because I have to wait for Firefox to render the entire webpage
  3. is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is webkit, which is an open source browser engine used most famously in Apple's Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing any JavaScript) and then saves the final HTML to a file:


import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # block here until the page has finished loading
    self.app.exec_()

  def _loadFinished(self, result):
    # keep a reference to the main frame, then exit the event loop
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
# the HTML after any JavaScript has been executed
html = r.frame.toHtml()


I can then analyze the resulting HTML with my standard Python tools like the webscraping module.
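
For instance, here is a minimal sketch using lxml, another of those standard tools (shown here instead of the webscraping module to keep the example self-contained):

import lxml.html

# toHtml() returns a QString, so convert it before parsing
tree = lxml.html.fromstring(unicode(html))
title = tree.xpath('//title/text()')
links = tree.xpath('//a/@href')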

Tuesday, March 2, 2010

Scraping JavaScript based web pages with Chickenfoot

The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However, some webpages load their content dynamically with JavaScript after the page loads, so the desired data is not found in the original HTML. This is usually done for legitimate reasons such as loading the page faster, but in some cases it is designed solely to inhibit scrapers.
This can make scraping a little tougher, but not impossible.

The easiest case is where the content is stored in JavaScript structures which are then inserted into the DOM at page load. This means the content is still embedded in the HTML but needs to be scraped from the JavaScript code rather than the HTML tags.
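
For example, here is a minimal sketch of that case (the URL, the variable name records, and the field layout are invented for illustration):

import json
import re
import urllib2

html = urllib2.urlopen('http://example.com/page').read()
# hypothetical: the page embeds its data in a script block like
#   var records = [{"title": "..."}, ...];
match = re.search(r'var\s+records\s*=\s*(\[.*?\]);', html, re.DOTALL)
if match:
    # this works when the JavaScript literal happens to be valid JSON
    records = json.loads(match.group(1))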

A trickier case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool for doing this is Chickenfoot, a Firefox extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high level functions to make interaction and navigation easier.

To get a feel for Chickenfoot, here is an example that crawls a website:


// crawl the given website url recursively to the given depth;
// go(), find(), back() and wait() are Chickenfoot built-ins
function crawl(website, max_depth, links) {
  if(!links) {
    // first call - start at the front page
    links = {};
    go(website);
    links[website] = 1;
  }

  // TODO: insert code to act on current webpage here
  if(max_depth > 0) {
    // iterate over the links on the current page
    for(var link=find("link"); link.hasMatch; link=link.next) {
      var url = link.element.href;
      if(!links[url]) {
        if(url.indexOf(website) == 0) {
          // same domain
          go(url);
          links[url] = 1;
          crawl(website, max_depth - 1, links);
        }
      }
    }
  }
  // return to the page we came from
  back(); wait();
}
// e.g. crawl("http://example.com/", 2);

This is part of a script I built on my Linux machine for a client on Windows, and it worked fine for both of us.
To find out more about Chickenfoot, check out their video.

Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.