In the previous post I covered how to tackle JavaScript-based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
- requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
- is slow because I have to wait for Firefox to render the entire webpage
- is somewhat buggy and has a small user/developer community, mostly at MIT
Here is a simple class that renders a webpage (including executing any JavaScript) and then makes the final HTML available as a string:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Block here until _loadFinished() quits the event loop
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep a reference to the rendered frame, then stop the event loop
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()
I can then analyze this resulting HTML with my standard Python tools like the webscraping module.
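As a rough illustration of that last step, here is a minimal sketch that saves a rendered page to disk and pulls the link targets out of it. It uses only the standard library's html.parser (Python 3) rather than the webscraping module, and the names save_html, extract_links, LinkExtractor and the sample markup are made up for the example. In practice the string would come from r.frame.toHtml().

```python
import io
import os
import tempfile
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def save_html(html, path):
    """Write the rendered HTML to path as UTF-8 text."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(html)


def extract_links(html):
    """Return the link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Stand-in for the string returned by r.frame.toHtml()
sample_html = (u'<html><body><a href="/about">About</a> '
               u'<a href="http://sitescraper.net">Home</a></body></html>')

path = os.path.join(tempfile.mkdtemp(), 'rendered.html')
save_html(sample_html, path)
links = extract_links(sample_html)
```

Saving the HTML first is optional, but it makes it easy to inspect what WebKit actually produced when an extraction rule does not match.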
Hi Richard.
I have been looking for something like this. Of all the solutions, yours seems the best one.
The problem is that I couldn't make it work :)
Could you give me a hand?
I have installed PySide on Ubuntu.
Now my script looks like:
#!/usr/bin/python
# Import PySide classes
import sys
from PySide.QtCore import *
from PySide.QtGui import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

r = Render("www.google.com")
html = r.frame.toHtml()
But I have an error:
Traceback (most recent call last):
  File "test1.py", line 9, in <module>
    class Render(QWebPage):
NameError: name 'QWebPage' is not defined
Am I missing something?
Thank you in advance.
Glad you find it useful.
I have updated my example to show the required imports - you will also need:
from PySide.QtWebKit import *
Thanks a lot Richard.
It works now.
Hi,
Is there any way I can send mouse events, I mean, to simulate user actions?
Or maybe a way to use it with Selenium?
Thanks,
Gabriel
Yes, you can simulate mouse events through JavaScript:
e.evaluateJavaScript("var evObj = document.createEvent('MouseEvents'); evObj.initEvent('click', true, true); this.dispatchEvent(evObj);")
Would it be possible to add authentication? This is almost exactly what I need, but the page I want to access is on our company intranet and requires a login to view.
Thanks for the example! It's a huge help.
--Mike
WebKit supports cookies like a normal browser, so you could have it submit the login form before accessing the content.
You can also interface with xdotool for mouse events.
I prefer solutions that are cross-platform with minimal dependencies, so they are easier to deploy to clients.
Isn't xdotool only for X11?
Hi Richard, your example code is exactly what I have been looking for online for so long. I am new to QtWebKit and would like to ask a few questions regarding the example code.
Q1) I replaced the URL in your example code with:
r = Render("http://quote.morningstar.com/stock/s.aspx?t=wmt")
The code seems to work, but I am getting the following error messages:
QSslSocket: cannot call unresolved function SSLv3_client_method
QSslSocket: cannot call unresolved function SSL_CTX_new
QSslSocket: cannot call unresolved function SSL_library_init
QSslSocket: cannot call unresolved function ERR_get_error
Q2) I added the line "print html" at the end, and I am getting the following error:
Traceback (most recent call last):
  File "D:\MyStuffs\Hobbies\AlienProjects\Stocks\Scrape\webK.py", line 25, in <module>
    print html
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 585: ordinal not in range(128)
Is html an object or a reference to the captured web page text? How do I print the page out?
Thanks,
-- Wei
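On the second question: toHtml() returns the page as Unicode text rather than raw bytes (under PyQt4 on Python 2 this is typically a QString), and the error appears when that text is pushed through the ASCII codec on its way to the console. A common workaround, sketched here under the assumption that UTF-8 output is acceptable, is to encode explicitly before printing or writing. The sample string is a stand-in for the rendered page, not real output.

```python
# Stand-in for the rendered page text; u'\u2013' is the en dash
# named in the traceback above.
html = u'Price range: 10 \u2013 20'

# Encode the text to UTF-8 bytes explicitly instead of relying on
# the console's default (ASCII) codec.
encoded = html.encode('utf-8')

# For an ASCII-only console, non-ASCII characters can be replaced
# with numeric character references instead.
ascii_safe = html.encode('ascii', 'xmlcharrefreplace')

# The escaped form contains only ASCII, so printing it cannot raise
# UnicodeEncodeError.
print(ascii_safe.decode('ascii'))
```

In the Python 2 script from the question, the equivalent of the first approach would be along the lines of print unicode(html).encode('utf-8'), assuming html is the value returned by toHtml().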
I see a followup to your login question was made here:
http://stackoverflow.com/questions/5356948/scraping-javascript-driven-web-pages-with-pyqt4-how-to-access-pages-that-need-a/
Thanks, I was looking for something like this. However, if the page takes a long time to load (for instance 10 seconds in a browser), how do I get the full content? This script terminates quickly and doesn't fetch everything. I guess it can be done with QTimer, but I don't know how to integrate a timeout limit.
The _loadFinished callback won't be called until the page and its resources are loaded. If you need additional time you could call something like this before exiting the app:
import time

def wait(self, secs=10):
    # Keep processing Qt events until the deadline passes
    deadline = time.time() + secs
    while time.time() < deadline:
        time.sleep(0.1)
        self.app.processEvents()
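A variation on the wait helper above, for pages where a fixed delay is wasteful: poll until some condition holds or a timeout expires. This is only a sketch. The names wait_until, condition and process_events are hypothetical; inside the Render class, process_events would be self.app.processEvents and condition might check whether an expected element has appeared in the HTML.

```python
import time


def wait_until(condition, process_events, timeout=10, interval=0.1):
    """Pump events until condition() is true or timeout seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        process_events()
        if condition():
            return True
        time.sleep(interval)
    return False


# Usage with stand-in callables (no Qt needed to try it out)
state = {'ticks': 0}


def fake_process_events():
    # Pretend that each pass through the event loop makes progress
    state['ticks'] += 1


loaded = wait_until(lambda: state['ticks'] >= 3, fake_process_events,
                    timeout=2, interval=0.01)
timed_out = wait_until(lambda: False, fake_process_events,
                       timeout=0.05, interval=0.01)
```

Returning a boolean lets the caller tell a page that finished early from one that hit the timeout, which a fixed sleep cannot do.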
Hi Richard,
I am quite new to all of this, so pardon me if my questions are fundamental. I am trying to scrape information from a webpage that is generated almost entirely by JavaScript and listed in a dynamic table across multiple pages. The site also happens to have a 'Download' button that conveniently puts all the data into a CSV file. I don't know whether it would be easier to automate the clicking of this button or to scrape the code itself. If the former, is there a way to do this with WebKit or something else that doesn't require too many downloads? If the latter, how can I view the saved HTML from the rendered webpage? Any insight would be appreciated. Thanks!
Yes, you could do this with WebKit, but it is probably easier to just replicate the download request. There are Firefox extensions that can help you with this, such as Firebug.
Great, thanks!!
Thanks! I got the URL of the 'download' button using Firebug, but it is a JavaScript call to a regularly updated database (generated dynamically based on which checkboxes are selected) rather than a direct file path:
https://www.quantcast.com/download/plannerCSV?&d0Id=10&sc=1&mr=10000
Do you have any ideas on how to replicate the download request using that? Thanks so much; I really appreciate your time :)
Don't worry about the particular javascript used. Instead analyze the download request this triggers and then replicate this yourself.
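To make the suggestion concrete, here is a rough sketch of replicating such a request with Python 3's standard library. The base URL and parameters are the ones quoted above; build_download_url and fetch_csv are hypothetical helpers, and whether the server also expects cookies or particular headers is an assumption you would need to confirm by inspecting the real request in Firebug.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_download_url(base, params):
    """Append the query parameters seen in the captured request."""
    return base + '?' + urlencode(params)


def fetch_csv(url, headers=None):
    """Issue the GET request and return the response body as bytes."""
    request = Request(url, headers=headers or {})
    response = urlopen(request)
    try:
        return response.read()
    finally:
        response.close()


# Parameters copied from the URL found with Firebug
url = build_download_url(
    'https://www.quantcast.com/download/plannerCSV',
    [('d0Id', 10), ('sc', 1), ('mr', 10000)],
)

# Not executed here because it needs network access (and possibly a login):
# data = fetch_csv(url, headers={'Cookie': '...'})
```

Once the request works from a script, changing the parameter values should be enough to reproduce what the checkboxes do in the browser, assuming the server does not require anything else.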
This is fantastic! It actually took some work to get PySide installed, but now I'm having this problem:
from PySide.QtGui import *
Traceback (most recent call last):
File "", line 1, in
ImportError: dlopen(/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/PySide/QtGui.so, 2): Library not loaded: /opt/local/lib/libpng12.0.dylib
Referenced from: /opt/local/lib/libQtGui.4.dylib
Reason: Incompatible library version: libQtGui.4.dylib requires version 45.0.0 or later, but libpng12.0.dylib provides version 44.0.0
Any ideas?
It seems that you installed PySide manually and now you have a version dependency problem. Can you install a more recent version of libpng?
Or, if you use a package manager to install it, the dependencies will be taken care of.
This is what I used on Ubuntu to install PyQt:
sudo apt-get install python-qt4
Thanks a lot for this useful information; it helped me a lot.
However, when I try to connect to this website, https[://]www[dot]securitygarden[dot]com, I fail to establish a connection. I'm one of the contributors to this project and I would like to test the efficiency of content obtained with PySide/PyQt.
I don't know the origin of the problem, because I can connect to that URL with urllib2. If you have an idea, thanks in advance, and thanks again for the articles on this blog, which I find useful and interesting.
Is it possible to load multiple URLs without touching the Render class?
Currently the loading code is in the constructor, so the class is not efficient for loading multiple URLs. You should refactor and put the loading code in a method.
I would like to call a method multiple times, each time with a different URL. I see that the load and app.exec_ calls can be moved to a loadURL method, but I don't see how to support multiple calls. Please enlighten me; I have been stuck on this for a while now.
Hi, is there a way to track the progress of page loading, something like getting the amount of data loaded from the URL?
Thanks!
Yes, you can track loading progress via the loadProgress(int) signal: http://www.riverbankcomputing.co.uk/static/Docs/PyQt4/html/qwebview.html
Hello. I need help creating a web scraping application to capture information from a dynamic web page. The page issues periodic XMLHttpRequest requests, once per second, and the objective is to capture and log all responses. The server sets cookies through both HTTP and JavaScript methods and requires these cookies in the request headers. It appears that a WebKit hack could accomplish this. Is anyone able to help with this?
Yes, WebKit could manage this. You can catch the finished() signal to check AJAX responses.
I want to load a list of URLs and scrape some values from each page.
But when I use the example above with some modifications like this:
urllist = [
    'https://market.android.com/details?id=com.tencent.mtt',
    'https://market.android.com/details?id=com.tencent.qqpimsecure',
]
p = re.compile(r'num\d">(\d+)<')
for detailurl in urllist:
    r = Render(QUrl(detailurl))
    html = r.frame.toHtml()
    matched = p.findall(html)
    print matched
then I get the error: A QApplication instance already exists.
How can I reload the frame using a new URL and get the content? Thanks.
Yes, you are only allowed to create a single QApplication instance. Here is a modified example for crawling multiple URLs:
http://blog.sitescraper.net/2011/12/scraping-multiple-javascript-webpages.html
Hi Richard,
This has been a big help. However, I have the same problem as the last commenter: you can only have a single QApplication instance. I saw your solution for multiple URLs, but it assumes that all of the URLs are known in advance. In my case, I'm batch processing URLs and I only get the URLs of the AJAX pages as I'm parsing another page, so it's all in a big loop. I need to be able to call Render(..) inside a loop:
for page in pages_with_embedded_urls:
    url = get_url(page)
    render = Render(url)
    render.get_html()
The second iteration of the loop results in: RuntimeError: A QApplication instance already exists. Do you know how I can do this?
Thanks in advance.
Hi Jonathan, I didn't realize this blog was still active.
I have reposted here: http://sitescraper.net/blog/Scraping-multiple-JavaScript-webpages-with-webkit/
And I will disable this blog.