Tuesday, December 6, 2011

Scraping multiple JavaScript webpages with Python

In an earlier post here I showed how to use WebKit to process the JavaScript in a webpage so that you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:


import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import * # QApplication lives in QtGui
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, urls):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.urls = urls
    self.data = {} # store downloaded HTML in a dict
    self.crawl()
    self.app.exec_()
    
  def crawl(self):
    if self.urls:
      url = self.urls.pop(0)
      print 'Downloading', url
      self.mainFrame().load(QUrl(url))
    else:
      self.app.quit()
      
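  def _loadFinished(self, result):
    # result is False if the page failed to load; the HTML is stored either way
    frame = self.mainFrame()
    url = str(frame.url().toString())
    html = unicode(frame.toHtml()) # convert QString to a Python unicode string
    self.data[url] = html
    self.crawl() # move on to the next URL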
    

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
r = Render(urls)
print r.data.keys()


This is a simple solution that keeps all the HTML in memory, so it is not practical for large crawls. For large crawls you should save the resulting HTML to disk instead. I use the pdict module for this.
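To give an idea of what saving to disk could look like, here is a minimal sketch of a dict-like store backed by SQLite from the standard library. This is not the pdict module and does not claim to match its interface: the class name DiskCache, the table layout and the file name are all assumptions made for illustration.

```python
import sqlite3
import zlib


class DiskCache(object):
    """Hypothetical dict-like store that keeps compressed HTML in an SQLite file."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html BLOB)')
        self.conn.commit()

    def __setitem__(self, url, html):
        # compress the HTML so large crawls take less disk space
        data = sqlite3.Binary(zlib.compress(html.encode('utf-8')))
        self.conn.execute(
            'INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)', (url, data))
        self.conn.commit()

    def __getitem__(self, url):
        row = self.conn.execute(
            'SELECT html FROM pages WHERE url = ?', (url,)).fetchone()
        if row is None:
            raise KeyError(url)
        return zlib.decompress(bytes(row[0])).decode('utf-8')

    def __contains__(self, url):
        row = self.conn.execute(
            'SELECT 1 FROM pages WHERE url = ?', (url,)).fetchone()
        return row is not None

    def keys(self):
        return [row[0] for row in self.conn.execute('SELECT url FROM pages')]
```

With something like this, the Render class above could probably assign a DiskCache('cache.db') to self.data in place of the plain dict, as long as the HTML is converted to a Python unicode string before it is stored.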

Saturday, December 3, 2011

How to teach yourself web scraping

I often get asked how to learn about web scraping. Here is my advice.

First learn a popular high-level scripting language. A higher-level language will allow you to work and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
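As a starting point with urllib2, here is a rough sketch of a download function that sets a user agent and retries on server errors. The function names, the default user agent string and the retry policy are assumptions for illustration, not a recommended standard. The import fallback is only there so the same sketch also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent


def build_request(url, user_agent='my-scraper'):
    """Create a request that identifies the scraper with a User-Agent header."""
    return urllib2.Request(url, headers={'User-Agent': user_agent})


def download(url, user_agent='my-scraper', num_retries=2, timeout=30):
    """Return the body of url, or None if the download fails."""
    request = build_request(url, user_agent)
    try:
        return urllib2.urlopen(request, timeout=timeout).read()
    except urllib2.HTTPError as e:
        # 5xx errors are often temporary, so try again a limited number of times
        if num_retries > 0 and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1, timeout)
        return None
    except urllib2.URLError:
        # DNS failure, refused connection and similar network problems
        return None
```

HTTPError is caught before URLError because it is a subclass of it. Calling download('http://sitescraper.net') would then return the raw page body, or None on failure.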



Learn about the HTTP protocol, which is how you will interact with websites.


Learn about regular expressions:
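For example, a regular expression can pull the links out of a page. This is only a sketch: the pattern and the function name extract_links are my own, and the pattern assumes quoted href attributes, so it will miss unquoted ones.

```python
import re

# match the href value of <a ...> tags, with single or double quotes
LINK_RE = re.compile(r'<a\s[^>]*href=["\'](.*?)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href values of all links found in html."""
    return LINK_RE.findall(html)


html = '<a href="/a">A</a> <A class="x" HREF=\'http://example.com/b\'>B</A>'
links = extract_links(html)
```

Regular expressions are quick to write for simple cases like this but become fragile when the markup changes, which is one reason to also learn XPath.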



Learn about XPath:
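Here is a small taste of XPath using the ElementTree module from the standard library. Keep in mind that ElementTree supports only a subset of XPath and needs well-formed markup, so for real webpages you would probably want a fuller library such as lxml. The sample HTML and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# made-up, well-formed sample markup
HTML = """<html><body>
<div id="results">
  <a class="result" href="/page1">First</a>
  <a class="result" href="/page2">Second</a>
  <a class="ad" href="/buy">Advert</a>
</div>
</body></html>"""

root = ET.fromstring(HTML)
# select only the result links inside the results div, skipping the advert
xpath = ".//div[@id='results']/a[@class='result']"
links = [(a.text, a.get('href')) for a in root.findall(xpath)]
```

The expression describes where the data sits in the document tree, so it keeps working when unrelated parts of the page change.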



If necessary learn about JavaScript:



These Firefox extensions can make web scraping easier:



Some libraries that can make web scraping easier:

Some other resources: