Friday, March 12, 2010

Scraping JavaScript webpages with webkit

In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
  1. requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
  2. is slow because I have to wait for Firefox to render the entire webpage
  3. is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is WebKit, an open source browser engine used most famously in Apple's Safari browser. WebKit has been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing any JavaScript) and then saves the final HTML to a file:


import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


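class Render(QWebPage):
  def __init__(self, url):
    # a QApplication must exist before any QWebPage is created
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # run the event loop; blocks until _loadFinished() calls quit()
    self.app.exec_()

  def _loadFinished(self, result):
    # keep a reference to the loaded frame, then stop the event loop
    self.frame = self.mainFrame()
    self.app.quit()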

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()
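
The class above leaves the rendered page in html without writing it anywhere, so the saving step might look like the sketch below. It assumes the result has already been converted to a Python unicode string (under PyQt4 on Python 2, toHtml() typically returns a QString, so something like unicode(html) would be needed first). The sample string, the save_html helper and the rendered.html filename are placeholders for illustration and are not part of the original example. Encoding explicitly as UTF-8 avoids relying on the default ASCII codec.

```python
import io
import os
import tempfile

def save_html(html, path):
    """Write a unicode HTML string to `path`, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# Placeholder standing in for the rendered page; in the example above this
# would be the text returned by r.frame.toHtml().
html = u'<html><body>caf\u00e9 \u2013 menu</body></html>'

# Hypothetical output location, chosen only for this sketch.
path = os.path.join(tempfile.gettempdir(), 'rendered.html')
save_html(html, path)
```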


I can then analyze this resulting HTML with my standard Python tools like the webscraping module.
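
As a rough illustration of that analysis step, here is a sketch that pulls the links out of a page using only the standard library's HTML parser. This is a stand-in technique rather than the webscraping module mentioned above; the LinkCollector class and the sample markup are made up for the example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Placeholder markup standing in for the HTML returned by the renderer.
html = (u'<html><body><a href="/about">About</a> '
        u'<a href="http://example.com/">Ext</a></body></html>')

collector = LinkCollector()
collector.feed(html)
links = collector.links
```

Because the parser only sees the final HTML string, it works the same whether the markup came from a plain download or from a JavaScript-rendered page.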

32 comments:

  1. Hi Richard.

    I have been looking for something like this. From all the solutions, yours seems the best one.
    The problem is that I couldn't make it work :)
    Could you give me a hand?

    I have installed Pyside on Ubuntu.
    Now my script looks like:

    #!/usr/bin/python

    # Import PySide classes
    import sys
    from PySide.QtCore import *
    from PySide.QtGui import *


    class Render(QWebPage):
      def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

      def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

    r = Render("www.google.com")
    html = r.frame.toHtml()


    But I have an error:
    Traceback (most recent call last):
    File "test1.py", line 9, in <module>
    class Render(QWebPage):
    NameError: name 'QWebPage' is not defined


    Am I missing something?
    Thank you in advance.

  2. Glad you find it useful.

    I have updated my example to show the required imports - you will also need:
    from PySide.QtWebKit import *

  3. Thanks a lot Richard.
    It works now.

  4. Hi,

    Is there any way I can send mouse events, I mean, to simulate user actions?

    Or maybe a way to use it with selenium?

    Thanks,
    Gabriel

  5. Yes, you can simulate mouse events through JavaScript:

    e.evaluateJavaScript("var evObj = document.createEvent('MouseEvents'); evObj.initEvent('click', true, true); this.dispatchEvent(evObj);")

  6. Would it be possible to add authentication? This is almost exactly what I need, but the page I want to access is on our company intranet and requires a login to view.

    Thanks for the example! It's a huge help.

    --Mike

  7. webkit supports cookies like a normal browser, so you could make it submit the login form before accessing the content.

  8. You can also interface xdotool for mouse events.

  9. I prefer solutions that are cross-platform with minimal dependencies, so they are easier to deploy to clients.
    Isn't xdotool only for X11?

  10. Hi Richard, your example code is really what I have been looking for online for so long. I am new to QtWebKit and would like to ask a few questions regarding the example code.

    Q1) I replaced the url link in your example code with:
    r = Render("http://quote.morningstar.com/stock/s.aspx?t=wmt")

    The code seems to work, but I get the following error messages:
    QSslSocket: cannot call unresolved function SSLv3_client_method
    QSslSocket: cannot call unresolved function SSL_CTX_new
    QSslSocket: cannot call unresolved function SSL_library_init
    QSslSocket: cannot call unresolved function ERR_get_error

    Q2) I added the line "print html" at the end and got the following error:

    Traceback (most recent call last):
    File "D:\MyStuffs\Hobbies\AlienProjects\Stocks\Scrape\webK.py", line 25, in <module>
    print html
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 585: ordinal not in range(128)

    Is html an object or a reference to the captured web page text? How do I print page out?

    Thanks,

    -- Wei

  11. I see a followup to your login question was made here:
    http://stackoverflow.com/questions/5356948/scraping-javascript-driven-web-pages-with-pyqt4-how-to-access-pages-that-need-a/

  12. Thanks, I was looking for something like this. However, if the page takes a long time to load (for instance 10 seconds in a browser), how can I get the full content? This script terminates quickly and doesn't fetch everything. I guess it can be done with QTimer but I don't know how to integrate a timeout limit.

  13. The loadFinished signal won't be emitted until the page and its resources are loaded. If you need additional time you could call something like this before exiting the app:

    import time

    def wait(self, secs=10):
      # keep processing Qt events until the deadline passes
      deadline = time.time() + secs
      while time.time() < deadline:
        time.sleep(0.1)
        self.app.processEvents()
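
A possible variation on this wait loop, for the timeout asked about in the previous comment, is to stop as soon as some condition holds and otherwise give up at a deadline. The sketch below is not Qt-specific: wait_until, condition and process_events are hypothetical names, and the fake event pump only stands in for self.app.processEvents so the snippet can run on its own.

```python
import time

def wait_until(condition, process_events, timeout=10.0, interval=0.1):
    """Pump events until condition() is true or `timeout` seconds pass.

    Returns True if the condition was met, False if the deadline expired.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        process_events()
        if condition():
            return True
        time.sleep(interval)
    return False

# Stand-in for self.app.processEvents: each call simulates one unit of work.
state = {'ticks': 0}

def fake_process_events():
    state['ticks'] += 1

# Condition becomes true after three pumps, well before the timeout.
loaded = wait_until(lambda: state['ticks'] >= 3, fake_process_events,
                    timeout=2.0, interval=0.01)

# Condition never becomes true, so the call gives up at the deadline.
timed_out = wait_until(lambda: False, fake_process_events,
                       timeout=0.05, interval=0.01)
```

In a real script the condition might check for an element or some text that the page's JavaScript is expected to add.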

  14. Hi Richard,
    I am quite new to all of this so pardon me if my questions are fundamental. I am trying to scrape information from a webpage that is practically entirely encoded in Javascript, and listed in a dynamic table on multiple pages. The site also happens to have a 'Download' button that conveniently puts all the data into a csv file. I don't know whether it would be easier to automate the clicking of this button, or the scraping of the code itself--if the former, is there a way to do this with WebKit or something else that doesn't require too many downloads? If the latter, how can I view the saved HTML from the rendered webpage? Any insight would be appreciated. Thanks!

  15. Yes, you could do this with webkit, but it is probably easier to just replicate the download request. There are Firefox extensions that can help you with this, such as Firebug.

  16. Thanks! I got the URL of the 'download' button using firebug, but it consists of a (dynamically generated based on which checkboxes are selected) javascript call to a regularly updated database, rather than a direct file path:

    https://www.quantcast.com/download/plannerCSV?&d0Id=10&sc=1&mr=10000

    Do you have any ideas on how to replicate the download request using that? Thanks so much; I really appreciate your time :)

  17. Don't worry about the particular javascript used. Instead analyze the download request this triggers and then replicate this yourself.
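
One way to picture replicating the request: take the URL captured in Firebug, treat its query string as plain parameters, and rebuild it with whatever values are needed. The sketch below only constructs the URL and does not send anything. The with_params helper is hypothetical, the thread does not say what the mr parameter controls, and the value 500 is arbitrary. A real download may also need the session cookies from a logged-in browser.

```python
try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit, parse_qsl
    from urllib import urlencode

def with_params(url, **overrides):
    """Return `url` with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update((key, str(value)) for key, value in overrides.items())
    # Sort the parameters so the rebuilt URL is deterministic.
    query = urlencode(sorted(params.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       query, parts.fragment))

# URL captured from the download button, as quoted in the thread.
captured = 'https://www.quantcast.com/download/plannerCSV?&d0Id=10&sc=1&mr=10000'

# Arbitrary override, purely to show the mechanics.
request_url = with_params(captured, mr=500)
```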

  18. This is fantastic! It actually took some work to get PySide installed, but now I'm having this problem:

    from PySide.QtGui import *


    Traceback (most recent call last):
    File "", line 1, in
    ImportError: dlopen(/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/PySide/QtGui.so, 2): Library not loaded: /opt/local/lib/libpng12.0.dylib
    Referenced from: /opt/local/lib/libQtGui.4.dylib
    Reason: Incompatible library version: libQtGui.4.dylib requires version 45.0.0 or later, but libpng12.0.dylib provides version 44.0.0


    Any ideas?

  19. It seems that you installed PySide manually and now have a version dependency problem. Can you install a more recent version of libpng?
    Or, if you use a package manager to install it, the dependencies will be taken care of.

    This is what I used on Ubuntu to install PyQt:

    sudo apt-get install python-qt4

  20. Thanks a lot for this useful information, it helped me a lot.

    However, when I try to connect to this website, https[://]www[dot]securitygarden[dot]com, I fail to establish a connection. I'm one of the contributors to this project and I would like to test the efficiency of the content obtained with PySide/PyQt.

    I don't know the origin of the problem, because I can connect to that URL with urllib2. If you have an idea, thanks in advance, and thanks again for the articles on this blog, which I find useful and interesting.

  21. Is it possible to load multiple URLs without touching the Render class?

  22. Currently the loading code is in the constructor, so the class is not efficient for loading multiple URLs. You should refactor and put the loading code in a method.

  23. I would like to call a method multiple times, each time with a different URL. I see that the load and app.exec_ can be moved to a loadURL method, but I don't see how to support multiple calls. Please enlighten me, I have been stuck on this for a while now.

  24. Hi, is there a way to track the progress of page loading, something like getting the amount of data loaded from url?

    Thanks!

  25. Yes, you can track downloading progress via the loadProgress(int) signal: http://www.riverbankcomputing.co.uk/static/Docs/PyQt4/html/qwebview.html

  26. Hello. I need help creating a web scraping application to capture information from a dynamic web page. The page employs periodic XMLHttpRequest requests, once per second and the objective is to capture and log all responses. The server sets cookies both through http and javascript methods and requires these cookies in the request headers. It appears that a Webkit hack could accomplish this. Is anyone able to help with this?

  27. Yes, webkit could manage this. You can catch the network access manager's finished() signal to check AJAX responses.

  28. I want to load a list of URLs and scrape some values from each page.
    But using the example above with some modifications like this:
    urllist = ['https://market.android.com/details?id=com.tencent.mtt','https://market.android.com/details?id=com.tencent.qqpimsecure']
    p = re.compile(r'num\d">(\d+)<')
    for detailurl in urllist:
      r = Render(QUrl(detailurl))
      html = r.frame.toHtml()
      matched = p.findall(html)
      print matched

    I got the error: A QApplication instance already exists.
    How can I reload the frame using a new URL and get the content? Thanks.

  29. Yes, you are only allowed to define a single QApplication instance. Here is a modified example for crawling multiple URLs:
    http://blog.sitescraper.net/2011/12/scraping-multiple-javascript-webpages.html

    Replies
    1. Hi Richard,

      This has been a big help, however, I have the same problem as the last commenter: you can only have a single QApplication instance. I saw your solution for multiple URLs, but it assumes that all of the URLs are known in advance. In my case, I'm batch processing URLs and I only get the URLs of the AJAX pages as I'm parsing another page, so it's all in a big loop. I need to be able to call Render(..) inside a loop:

      for page in pages_with_embedded_urls:
        url = get_url(page)
        render = Render(url)
        render.get_html()

      The second iteration of the loop results in: RuntimeError: A QApplication instance already exists. Do you know how I can do this?

      Thanks in advance.

  30. Hi Jonathan, I didn't realize this blog was still active.
    I have reposted here: http://sitescraper.net/blog/Scraping-multiple-JavaScript-webpages-with-webkit/
    And I will disable this blog.
