All your data are belong to us: webkit

Tuesday, December 6, 2011

Scraping multiple JavaScript webpages with Python

I made an earlier post here about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, urls):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.urls = urls
    self.data = {} # store downloaded HTML in a dict
    self.crawl()
    self.app.exec_()
    
  def crawl(self):
    if self.urls:
      url = self.urls.pop(0)
      print 'Downloading', url
      self.mainFrame().load(QUrl(url))
    else:
      self.app.quit()
      
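  def _loadFinished(self, result):
    # called by Qt when the current page finishes loading (result is False if the load failed)
    frame = self.mainFrame()
    url = str(frame.url().toString())
    html = frame.toHtml()
    self.data[url] = html
    self.crawl() # move on to the next URL, or quit when none are left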
    

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
r = Render(urls)
print r.data.keys()

This simple solution keeps all the HTML in memory, so it is not practical for large crawls. For large crawls you should save the resulting HTML to disk instead. I use the pdict module for this.
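As a rough sketch of the save-to-disk idea, the in-memory dict could be swapped for a small dict-like class backed by the standard library's sqlite3 module. This is not the pdict module mentioned above: the class name DiskCache, the table layout, and the default file name cache.db are all made up for illustration.

```python
import sqlite3


class DiskCache(object):
    """Hypothetical dict-like store that keeps downloaded HTML in a SQLite file."""

    def __init__(self, filename='cache.db'):
        # filename is an assumed default; ':memory:' also works for experiments
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)')
        self.conn.commit()

    def __setitem__(self, url, html):
        # INSERT OR REPLACE overwrites any earlier copy stored for the same URL
        self.conn.execute(
            'INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)', (url, html))
        self.conn.commit()

    def __getitem__(self, url):
        row = self.conn.execute(
            'SELECT html FROM pages WHERE url = ?', (url,)).fetchone()
        if row is None:
            raise KeyError(url)
        return row[0]

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT url FROM pages')]
```

With something like this, self.data = {} in the Render class could become self.data = DiskCache(). The value stored in _loadFinished would probably need converting first, for example unicode(html), because sqlite3 is unlikely to accept a QString directly.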

Friday, March 12, 2010

Scraping JavaScript webpages with webkit

In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
is slow because I have to wait for Firefox to render the entire webpage
is somewhat buggy and has a small user/developer community, mostly at MIT

An alternative solution that addresses all these points is webkit, an open source browser engine used most famously in Apple's Safari browser. Webkit has been ported to the Qt framework and can be used from Python through Qt's Python bindings (PyQt).

Here is a simple class that renders a webpage (including executing any JavaScript) and then saves the final HTML to a file:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()

I can then analyze this resulting HTML with my standard Python tools like the webscraping module.
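For readers without the webscraping module, here is a minimal sketch of the analysis step that uses only the standard library's HTML parser to collect link targets. The names LinkCollector and extract_links are invented for this example and are not part of any module mentioned above. The input is assumed to be an ordinary Python string, so the rendered page would first need converting, for example unicode(r.frame.toHtml()) on Python 2.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links on the converted HTML returns a plain list of strings, which could then be fed back into a crawl as further URLs.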