I made an earlier post here about using WebKit to execute the JavaScript in a webpage so that you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:
This is a simple solution that keeps all of the HTML in memory, so it is not practical for large crawls. For large crawls you should save the resulting HTML to disk; I use the pdict module for this (a sketch of that approach follows the code below).
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, urls):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.urls = urls
        self.data = {} # store the downloaded HTML in a dict keyed by URL
        self.crawl()
        self.app.exec_()

    def crawl(self):
        # load the next URL, or quit the event loop when none remain
        if self.urls:
            url = self.urls.pop(0)
            print 'Downloading', url
            self.mainFrame().load(QUrl(url))
        else:
            self.app.quit()

    def _loadFinished(self, result):
        # called when WebKit has finished loading and rendering the page
        frame = self.mainFrame()
        url = str(frame.url().toString())
        html = frame.toHtml()
        self.data[url] = html
        self.crawl()

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
r = Render(urls)
print r.data.keys()
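To adapt this for larger crawls, one option is to write each page to disk as it arrives rather than keeping everything in self.data. Here is a minimal sketch using Python's built-in shelve module as a stand-in for pdict (whose API is not shown in this post); the DiskRender class name and the cache filename are just illustrative:

import shelve

class DiskRender(Render):
    def __init__(self, urls, cache_file='crawl_cache.db'):
        # open a disk-backed dict before the crawl starts
        # (shelve used here as a stand-in for pdict)
        self.cache = shelve.open(cache_file)
        Render.__init__(self, urls) # runs the crawl; returns when all URLs are done
        self.cache.close()

    def _loadFinished(self, result):
        frame = self.mainFrame()
        url = str(frame.url().toString())
        # persist the rendered HTML to disk instead of keeping it in memory
        self.cache[url] = unicode(frame.toHtml())
        self.crawl()

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
DiskRender(urls)

The downloaded pages can then be reopened later with shelve.open('crawl_cache.db'), so nothing needs to stay in memory during the crawl.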