All your data are belong to us: Scraping JavaScript webpages with webkit

Friday, March 12, 2010

In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
is slow because I have to wait for Firefox to render the entire webpage
is somewhat buggy and has a small user/developer community, mostly at MIT

An alternative solution that addresses all these points is webkit, which is an open source browser engine used most famously in Apple's Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing any JavaScript) and then makes the final HTML available as a string:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()

I can then analyze this resulting HTML with my standard Python tools like the webscraping module.

32 comments:

Unknown, March 8, 2011 at 10:58 PM
Hi Richard.

I have been looking for something like this. From all the solutions, yours seems the best one.
The problem is that I couldn't make it work :)
Could you give me a hand?

I have installed PySide on Ubuntu.
Now my script looks like:

#!/usr/bin/python

# Import PySide classes
import sys
from PySide.QtCore import *
from PySide.QtGui import *

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

r = Render("www.google.com")
html = r.frame.toHtml()

But I have an error:
Traceback (most recent call last):
  File "test1.py", line 9, in <module>
    class Render(QWebPage):
NameError: name 'QWebPage' is not defined

Am I missing something?
Thank you in advance.

Richard, March 8, 2011 at 11:10 PM
Glad you find it useful.

I have updated my example to show the required imports - you will also need:
from PySide.QtWebKit import *

Bruno, March 9, 2011 at 12:56 AM
Thanks a lot Richard.
It works now.

Unknown, March 10, 2011 at 5:39 AM
Hi,

Is there any way I can send mouse events, I mean, to simulate user actions?

Or maybe a way to use it with selenium?

Thanks,
Gabriel

Richard, March 10, 2011 at 9:16 AM
Yes, you can simulate mouse events through JavaScript:

e.evaluateJavaScript("var evObj = document.createEvent('MouseEvents'); evObj.initEvent('click', true, true); this.dispatchEvent(evObj);")
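
A minimal sketch of how the click snippet above might be wrapped for reuse. It assumes that `element` is a QWebElement and `frame` is a QWebFrame (for example `r.frame` from the post). The helper names `click` and `click_first` are hypothetical, and the objects are duck-typed, so nothing here imports Qt.

```python
# JavaScript run with `this` bound to the target element, as in the reply above.
CLICK_JS = (
    "var evObj = document.createEvent('MouseEvents');"
    "evObj.initEvent('click', true, true);"
    "this.dispatchEvent(evObj);"
)


def click(element):
    """Dispatch a synthetic click on a QWebElement-like object."""
    element.evaluateJavaScript(CLICK_JS)


def click_first(frame, selector):
    """Click the first element matching a CSS selector.

    Returns False when nothing matches, True otherwise.
    """
    element = frame.findFirstElement(selector)
    if element.isNull():
        return False
    click(element)
    return True
```

Usage would look something like `click_first(r.frame, '#download')`, where `#download` is a made-up selector for the button of interest.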

Anonymous, March 19, 2011 at 5:50 AM
Would it be possible to add authentication? This is almost exactly what I need, but the page I want to access is on our company intranet and requires a login to view.

Thanks for the example! It's a huge help.

--Mike

Richard, March 21, 2011 at 1:20 PM
webkit supports cookies like a normal browser, so you could make it submit the login form before accessing the content.
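
One way the login step could look, as a rough sketch: build a small piece of JavaScript that fills in the form and submits it, then run it in the loaded login page. The form selector and field names are hypothetical and depend entirely on the site. `frame` is assumed to be a QWebFrame, and `login_js` and `submit_login` are illustrative names. The Render class from the post quits on the first loadFinished, so it would need adapting to keep the event loop running while the form submission loads.

```python
import json


def login_js(form_selector, fields):
    """Build JavaScript that fills in a form's inputs by name and submits it."""
    return (
        "var form = document.querySelector(%s);"
        "var fields = %s;"
        "for (var name in fields) { form.elements[name].value = fields[name]; }"
        "form.submit();"
    ) % (json.dumps(form_selector), json.dumps(fields))


def submit_login(frame, form_selector, fields):
    """Run the generated script in a QWebFrame-like object."""
    frame.evaluateJavaScript(login_js(form_selector, fields))
```

After the submission finishes loading, later pages loaded through the same QWebPage should carry the session cookie, which is what makes the protected content reachable.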

kristopolous, March 27, 2011 at 12:38 PM
You can also interface with xdotool for mouse events.

Richard, March 28, 2011 at 2:59 PM
I prefer cross-platform solutions with minimal dependencies, since they are easier to deploy to clients.
Isn't xdotool only for X11?

Anonymous, April 5, 2011 at 3:12 AM
Hi Richard, your example code is really what I have been looking for online for so long. I am new to QtWebKit and would like to ask a few questions regarding the example code.

Q1) I replaced the url link in your example code with:
r = Render("http://quote.morningstar.com/stock/s.aspx?t=wmt")

The code seems to work, but I get the following error messages:
QSslSocket: cannot call unresolved function SSLv3_client_method
QSslSocket: cannot call unresolved function SSL_CTX_new
QSslSocket: cannot call unresolved function SSL_library_init
QSslSocket: cannot call unresolved function ERR_get_error

Q2) I added the line "print html" at the end and get the following error:

Traceback (most recent call last):
  File "D:\MyStuffs\Hobbies\AlienProjects\Stocks\Scrape\webK.py", line 25, in <module>
    print html
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 585: ordinal not in range(128)

Is html an object or a reference to the captured web page text? How do I print the page out?

Thanks,

-- Wei
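
On Q2: the value returned by toHtml() is the page text itself, not a reference to the page. The error arises because the text contains a non-ASCII character (U+2013, an en dash) and is being printed to a console that only accepts ASCII. A small sketch of one workaround is to encode the text explicitly as UTF-8 and write it to a file. `save_html` is a hypothetical helper, and it assumes `html` is already a unicode string; under PyQt4 on Python 2, toHtml() may return a QString, in which case convert it first with `unicode(html)`.

```python
import io


def save_html(html, path):
    """Write the rendered page to `path` as UTF-8 and return the byte count."""
    data = html.encode('utf-8')  # explicit encoding avoids the ASCII default
    with io.open(path, 'wb') as f:
        f.write(data)
    return len(data)
```

The same explicit encoding step works for printing, for example by writing `html.encode('utf-8')` to standard output instead of printing the unicode string directly.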

Richard, April 12, 2011 at 11:27 AM
I see a followup to your login question was made here:
http://stackoverflow.com/questions/5356948/scraping-javascript-driven-web-pages-with-pyqt4-how-to-access-pages-that-need-a/

Anonymous, April 14, 2011 at 7:52 AM
Thanks, I was looking for something like this. However, if the page takes a long time to load (for instance 10 seconds in a browser), how can I capture all of it? This script terminates quickly and doesn't fetch everything. I guess it can be done with QTimer but I don't know how to integrate a timeout limit.

Richard, April 14, 2011 at 12:41 PM
The loadFinished signal won't be emitted until the page and its resources are loaded. If you need additional time, you could call something like this before exiting the app:

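# requires "import time" at the top of the script
def wait(self, secs=10):
  # keep processing Qt events until the deadline passes
  deadline = time.time() + secs
  while time.time() < deadline:
    time.sleep(0.1)
    self.app.processEvents()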
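
A variation on the wait method above, sketched under the assumption that there is some observable condition signalling that the page is ready (for instance an element that only appears after the JavaScript has run). Instead of always sleeping for the full period, it returns as soon as the condition holds. `wait_for` and its parameters are hypothetical names; `process_events` would typically be `app.processEvents`.

```python
import time


def wait_for(condition, process_events, timeout=10, poll=0.1):
    """Pump events until condition() is true or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        process_events()      # let the event loop make progress
        if condition():
            return True
        time.sleep(poll)
    return False
```

For example, `wait_for(lambda: not frame.findFirstElement('#results').isNull(), app.processEvents)` would wait up to ten seconds for an element matching the made-up selector `#results`.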

Jen, June 8, 2011 at 6:43 AM
Hi Richard,
I am quite new to all of this so pardon me if my questions are fundamental. I am trying to scrape information from a webpage that is practically entirely encoded in Javascript, and listed in a dynamic table on multiple pages. The site also happens to have a 'Download' button that conveniently puts all the data into a csv file. I don't know whether it would be easier to automate the clicking of this button, or the scraping of the code itself--if the former, is there a way to do this with WebKit or something else that doesn't require too many downloads? If the latter, how can I view the saved HTML from the rendered webpage? Any insight would be appreciated. Thanks!

Richard, June 9, 2011 at 1:25 AM
Yes, you could do this with webkit, but it is probably easier to just replicate the download request. There are Firefox extensions that can help you with this, such as Firebug.

Jen, June 11, 2011 at 6:45 AM
Great, thanks!!

Jen, June 14, 2011 at 4:26 AM
Thanks! I got the URL of the 'download' button using firebug, but it consists of a (dynamically generated based on which checkboxes are selected) javascript call to a regularly updated database, rather than a direct file path:

https://www.quantcast.com/download/plannerCSV?&d0Id=10&sc=1&mr=10000

Do you have any ideas on how to replicate the download request using that? Thanks so much; I really appreciate your time :)

Richard, June 16, 2011 at 12:19 AM
Don't worry about the particular javascript used. Instead analyze the download request this triggers and then replicate this yourself.
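
Replicating the request can be as simple as rebuilding the URL that Firebug shows and fetching it directly. The sketch below uses the Python 3 standard library (under Python 2 the equivalents are urllib.urlencode and urllib2.urlopen). The parameter names are taken from the URL in the question; their meaning is not known here, and the site may also require session cookies or headers that this sketch does not handle. `build_download_url` and `fetch` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_download_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + '?' + urlencode(params)


def fetch(url, timeout=30):
    """Download the response body as bytes (performs a real network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Parameter values copied from the question; change them to vary the export.
url = build_download_url(
    'https://www.quantcast.com/download/plannerCSV',
    [('d0Id', 10), ('sc', 1), ('mr', 10000)],
)
```

Calling `fetch(url)` would then return the raw CSV bytes, provided the server accepts the request without further session state.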

Thomas, June 17, 2011 at 4:08 AM
This is fantastic! It actually took some work to get PySide installed, but now I'm having this problem:

from PySide.QtGui import *

Traceback (most recent call last):
  File "", line 1, in <module>
ImportError: dlopen(/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/PySide/QtGui.so, 2): Library not loaded: /opt/local/lib/libpng12.0.dylib
  Referenced from: /opt/local/lib/libQtGui.4.dylib
  Reason: Incompatible library version: libQtGui.4.dylib requires version 45.0.0 or later, but libpng12.0.dylib provides version 44.0.0

Any ideas?

Richard, June 17, 2011 at 2:10 PM
It seems that you installed PySide manually and now have a version dependency problem. Can you install a more recent version of libpng?
Or, if you use package management to install it, the dependencies will be taken care of.

This is what I used on Ubuntu to install PyQt:

sudo apt-get install python-qt4

Aurelien, July 12, 2011 at 2:15 AM
Thanks a lot for this useful information, it helped me a lot.

However, when I try to connect to this website, https[://]www[dot]securitygarden[dot]com, I fail to establish a connection. I'm one of the contributors to this project and I would like to test the efficiency of content obtained with PySide/PyQt.

I don't know the origin of the problem, because I can connect to that URL with urllib2. If you have any ideas, thanks in advance, and thanks again for the articles on this blog, which I find useful and interesting.

Anonymous, August 5, 2011 at 6:44 AM
Is it possible to load multiple URLs without touching the Render class?

Richard, August 5, 2011 at 6:52 AM
Currently the loading code is in the constructor, so the class is not efficient for loading multiple URLs. You should refactor and put the loading code in a method.

Anonymous, August 5, 2011 at 7:02 AM
I would like to call a method multiple times, each time with a different URL. I see that the load and app.exec_ calls can be moved to a loadURL method, but I don't see how to support multiple calls. Please enlighten me; I have been stuck on this for a while now.

Unknown, November 16, 2011 at 8:15 PM
Hi, is there a way to track the progress of page loading, something like getting the amount of data loaded from the URL?

Thanks!

Richard, November 18, 2011 at 2:17 AM
Yes, you can track downloading progress via the loadProgress(int) signal: http://www.riverbankcomputing.co.uk/static/Docs/PyQt4/html/qwebview.html

joe, November 30, 2011 at 12:49 AM
Hello. I need help creating a web scraping application to capture information from a dynamic web page. The page employs periodic XMLHttpRequest requests, once per second and the objective is to capture and log all responses. The server sets cookies both through http and javascript methods and requires these cookies in the request headers. It appears that a Webkit hack could accomplish this. Is anyone able to help with this?

Richard, November 30, 2011 at 3:09 AM
Yes, webkit could manage this: you can catch the network access manager's finished() signal to check AJAX responses.

Unknown, December 6, 2011 at 2:32 PM
I want to load a list of URLs and scrape some value from each page.
I modified the example above like this:
urllist = ['https://market.android.com/details?id=com.tencent.mtt','https://market.android.com/details?id=com.tencent.qqpimsecure']
p = re.compile(r'num\d">(\d+)<')
for detailurl in urllist:
  r = Render(QUrl(detailurl))
  html = r.frame.toHtml()
  matched = p.findall(html)
  print matched

Then I got the error: A QApplication instance already exists.
How can I reload the frame with a new URL and get the content? Thanks.

Richard, December 6, 2011 at 7:36 PM
Yes, you are only allowed to define a single QApplication instance. Here is a modified example for crawling multiple URLs:
http://blog.sitescraper.net/2011/12/scraping-multiple-javascript-webpages.html
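
The linked post is not reproduced here, but the general idea can be sketched as follows: keep one QApplication and one page, and start the next load from the loadFinished handler rather than creating a new Render per URL. The `Crawler` class below is a hypothetical illustration of that control flow only. The Qt calls are injected as plain callables so the logic does not depend on Qt, and the wiring shown in the comments is an assumption about how it would be connected.

```python
class Crawler(object):
    """Feeds URLs one at a time to a single page object."""

    def __init__(self, urls, load, get_html, done):
        self.pending = list(urls)
        self.load = load          # e.g. lambda url: page.mainFrame().load(QUrl(url))
        self.get_html = get_html  # e.g. page.mainFrame().toHtml
        self.done = done          # e.g. app.quit
        self.results = {}
        self.current = None

    def start(self):
        self._next()

    def _next(self):
        # Load the next URL, or signal completion when none are left.
        if self.pending:
            self.current = self.pending.pop(0)
            self.load(self.current)
        else:
            self.done()

    def on_load_finished(self, ok):
        # Intended as the slot for the page's loadFinished signal.
        self.results[self.current] = self.get_html() if ok else None
        self._next()

# Assumed wiring with PyQt4 (not executed here):
#   page.loadFinished.connect(crawler.on_load_finished)
#   crawler.start()
#   app.exec_()
```

When the event loop exits, `crawler.results` maps each URL to its rendered HTML, or to None where the load failed. With an empty URL list, `done` would be called before the event loop starts, so that case is best checked for separately.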

Richard, May 21, 2012 at 9:33 PM
hi Jonathan, didn't realize this blog was still active.
Have reposted here: http://sitescraper.net/blog/Scraping-multiple-JavaScript-webpages-with-webkit/
And will disable this blog.