Tuesday, December 6, 2011

Scraping multiple JavaScript webpages with Python

I made an earlier post here about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:


from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, urls):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.urls = urls
    self.data = {} # store downloaded HTML in a dict
    self.crawl()
    self.app.exec_()
    
  def crawl(self):
    if self.urls:
      url = self.urls.pop(0)
      print 'Downloading', url
      self.mainFrame().load(QUrl(url))
    else:
      self.app.quit()
      
  def _loadFinished(self, result):
    frame = self.mainFrame()
    url = str(frame.url().toString())
    html = frame.toHtml()
    self.data[url] = html
    self.crawl()
    

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
r = Render(urls)
print r.data.keys()


This is a simple solution that will keep all HTML in memory, so is not practical for large crawls. For large crawls you should save the resulting HTML to disk. I use the pdict module for this.

2 comments:

  1. I was getting "QSocket Notifier: invalid socket.."
    problem while using this snippet.What this error does?

    Also,Do I need to do append some non-blocking statements to this?How this is happening?Does this make problem for doing for n urls?

    ReplyDelete
  2. This script uses the loadFinished() signal to load the next page.

    Not sure about the socket error - may be related to the particular website loaded.

    ReplyDelete