Tuesday, December 6, 2011

Scraping multiple JavaScript webpages with Python

I made an earlier post here about using WebKit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:


import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, urls):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.urls = urls
    self.data = {} # store downloaded HTML in a dict
    self.crawl()
    self.app.exec_()
    
  def crawl(self):
    if self.urls:
      # load the next URL; _loadFinished will be called when it completes
      url = self.urls.pop(0)
      print 'Downloading', url
      self.mainFrame().load(QUrl(url))
    else:
      # no more URLs to process, so stop the Qt event loop
      self.app.quit()
      
  def _loadFinished(self, result):
    # store the rendered HTML for this URL, then move on to the next one
    frame = self.mainFrame()
    url = str(frame.url().toString())
    html = frame.toHtml()
    self.data[url] = html
    self.crawl()
    

urls = ['http://sitescraper.net', 'http://blog.sitescraper.net']
r = Render(urls)
print r.data.keys()


This is a simple solution that keeps all the HTML in memory, so it is not practical for large crawls. For large crawls you should save the resulting HTML to disk as you go; I use the pdict module for this.
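
pdict is not the only option here; the same idea works with anything that behaves like a dict but is backed by disk. Below is a minimal sketch using Python's standard shelve module - the DiskRender class and crawl_cache.db filename are just illustrative, not part of the original code:

import shelve

class DiskRender(Render):
  """Variation of the Render class above that writes each page to disk
  instead of keeping everything in memory"""
  def __init__(self, urls, cache_file='crawl_cache.db'):
    # cache_file must be set before Render.__init__ starts the crawl
    self.cache_file = cache_file
    Render.__init__(self, urls)

  def _loadFinished(self, result):
    frame = self.mainFrame()
    url = str(frame.url().toString())
    # shelve provides a persistent, dict-like store backed by a file
    cache = shelve.open(self.cache_file)
    cache[url] = unicode(frame.toHtml())
    cache.close()
    self.crawl()

You would then run DiskRender(urls) in place of Render(urls).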

Saturday, December 3, 2011

How to teach yourself web scraping

I often get asked how to learn about web scraping. Here is my advice.

First learn a popular high-level scripting language. A higher level language will allow you to work and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck when web scraping is bandwidth rather than code execution. And learn a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:


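To give a rough idea of what you will be writing with urllib2, here is a minimal download with a custom User-Agent header (the URL and User-Agent string are just placeholders):

import urllib2

request = urllib2.Request('http://example.com', headers={'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()
print len(html), 'bytes downloaded'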

Learn about the HTTP protocol, which is how you will interact with websites.


Learn about regular expressions:


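Most scraping regular expressions just pull small pieces of text out of downloaded HTML. A quick illustration, with a made-up snippet of HTML:

import re

html = '<div id="price">$19.99</div>'
# extract the text between the price div tags
match = re.search(r'<div id="price">(.*?)</div>', html)
if match:
    print match.group(1) # $19.99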

Learn about XPath:


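XPath gives a more structured way to select elements once the HTML has been parsed. Here is a minimal sketch using lxml (the HTML snippet is again made up for illustration):

import lxml.html

html = '<ul><li class="result">foo</li><li class="result">bar</li></ul>'
tree = lxml.html.fromstring(html)
# select the text of every list item with class "result"
print tree.xpath('//li[@class="result"]/text()') # ['foo', 'bar']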

If necessary learn about JavaScript:



These Firefox extensions can make web scraping easier:



Some libraries that can make web scraping easier:

Some other resources:

Tuesday, November 29, 2011

How to use proxies

First you need some working proxies. You can try to collect them from the various free lists such as http://hidemyass.com/proxy-list/, but so many people use these lists that the proxies won't be reliable.
If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like packetflip or proxybonanza.

Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080 

With the webscraping library you can then use the proxies like this:
from webscraping import download

proxies = ['bob:eakej34@66.12.121.140:8000', '219.66.12.12', '219.66.12.14:8080']
D = download.Download(proxies=proxies, user_agent='Mozilla/5.0')
html = D.get('http://sitescraper.net')

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO


def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https' : proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http' : proxy}))
    
    # submit these headers with the request
    headers =  {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it          
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
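
For example, to download a page through one of the example proxies listed earlier (purely illustrative values):

html = fetch('http://sitescraper.net', proxies=['219.66.12.14:8080'])
if html:
    print html[:100]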

Sunday, November 6, 2011

How to automatically find contact details

I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

This wastes my time so I use this snippet to automate extracting the available emails:


import sys
from webscraping import download


def get_emails(website, max_depth):
    """Returns a list of emails found at this website
    
    max_depth is how deep to follow links
    """
    D = download.Download()
    return D.get_emails(website, max_depth=max_depth)
    

if __name__ == '__main__':
    try:
        website = sys.argv[1]
        max_depth = int(sys.argv[2])
    except (IndexError, ValueError):
        print 'Usage: %s <URL> <max depth>' % sys.argv[0]
    else:
        print get_emails(website, max_depth)

Example use:
>>> get_emails('http://www.sitescraper.net', 1)
['richard@sitescraper.net']
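
The get_emails() method of the webscraping library essentially crawls links within the website up to max_depth and collects anything that looks like an email address, presumably with a regular expression. Here is a rough standalone sketch of that idea - simplified, and not the library's actual implementation:

import re
import urllib2
import urlparse

def find_emails(website, max_depth=1):
    """Simplified sketch: breadth-first crawl of internal links,
    collecting anything that looks like an email address"""
    email_re = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
    link_re = re.compile(r'href="([^"]+)"')
    seen = set([website])
    emails = set()
    queue = [(website, 0)]
    while queue:
        url, depth = queue.pop(0)
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue
        emails.update(email_re.findall(html))
        if depth < max_depth:
            for link in link_re.findall(html):
                link = urlparse.urljoin(url, link)
                # only follow links within this website
                if link.startswith(website) and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return list(emails)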

Wednesday, October 12, 2011

Free service to extract article from webpage

In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.

Here is my blog article: http://blog.sitescraper.net/2010/06/how-to-stop-scraper.html


And here are the results when submitted to Instapaper:


And here is a BBC article:




Instapaper has not made this service part of their official API, so hopefully they will add it in future.

Tuesday, September 6, 2011

Google interview

Recently I was invited to Sydney to interview for a developer position.

I waited in the reception for a 1pm start. Apart from a tire swing, which everyone ignored anyway, this could have been any office. Nothing like the Googleplex in Mountain View.

I was led to a small room for the interviews. Along the way I passed rows of coders at work and an effigy of Sarah Palin. Poor taste.

I then had back-to-back technical interviews until past 5pm. Lunch was not provided, as I had assumed it would be, so I was feeling weak by the final interview.

By this time most people had already left for the day.

After the last interview I was escorted out. No tour. No snacks. No schwag.

So in conclusion I don't feel at all tempted to leave my freelancing + travel lifestyle.


Wednesday, July 20, 2011

User agents

Your web browser will send what is known as a "User Agent" for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:
  • Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • Chrome on Linux: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
  • Internet Explorer on Windows XP: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
  • Opera on Windows XP: Opera/9.00 (Windows NT 5.1; U; en)
  • Android: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
  • iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
  • BlackBerry: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
  • Python urllib: Python-urllib/2.1
  • Old Google Bot: Googlebot/2.1 ( http://www.googlebot.com/bot.html)
  • New Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • MSN Bot: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  • Yahoo Bot: Yahoo! Slurp/Site Explorer
You can find your own current User Agent here.

Some webpages will use the User Agent to display content that is customized to your particular browser. For example if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.

Fortunately it is easy to set your User Agent to whatever you like.
For Firefox you can use the User Agent Switcher extension.
For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
For Internet Explorer you can use the UAPick extension.

And in a Python script you can set the User-Agent header like this:

import urllib2

request = urllib2.Request('http://www.google.com', headers={'User-agent': 'my custom user agent'})
html = urllib2.urlopen(request).read()

Using the default User Agent for your scraper is a common reason to be blocked, so don't forget to change it.



Tuesday, July 5, 2011

Taking advantage of mobile interfaces

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don't support JavaScript, and a simplified version for mobile users.

For example Gmail has:
  • the standard AJAX interface at gmail.com
  • a basic HTML interface for browsers without JavaScript
  • a mobile interface for phones

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website check for its HTML or mobile version, which should be easier to scrape.

To find the HTML version try disabling JavaScript in your browser and see what happens.
To find the mobile version try adding the "m" subdomain (domain.com -> m.domain.com) or using a mobile user-agent.
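
For example, here is a quick way to see what a website serves to mobile clients, using the iPhone User Agent string from the user agents post above (the URL is just a placeholder):

import urllib2

mobile_ua = 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3'
request = urllib2.Request('http://example.com', headers={'User-agent': mobile_ua})
mobile_html = urllib2.urlopen(request).read()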

Thursday, June 30, 2011

Parsing Flash with Swiffy

Google has released a tool called Swiffy for parsing Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000, so there is still a lot of work to do.

Sunday, June 19, 2011

Google App Engine limitations

Most of the discussion about Google App Engine focuses on how it lets you scale your app; however, I find it most useful for small client apps that need a reliable platform without any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:
  • Slow - if your app has not been accessed recently (within the last minute) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota easily gets exhausted when uploading data
  • Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB but often over half of this is taken up by indexes
  • Maximum 1000 records per query - no longer a limitation!
  • 20 second request limit, so often need the overhead of using Task Queues

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.

Sunday, May 29, 2011

Using Google Translate to crawl a website

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage so it is good to have backup options.

One option is Google Translate, which lets you translate a webpage into another language. If you select a source language that you know the page is not written in (e.g. Dutch for an English page), then no translation takes place and you just get back the original content.


I added a function to download a URL via Google Translate and Google Cache to my webscraping library. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = 'http://sitescraper.net/faq'
html1 = D.get(url) # download directly
html2 = D.gcache_get(url) # download via Google Cache
html3 = D.gtrans_get(url) # download via Google Translate
for html in (html1, html2, html3):
    print xpath.get(html, '//title')

This example downloads the same webpage directly, via Google Cache, and via Google Translate, then parses the title to show that the same page was retrieved each time. The output when run is:
Frequently asked questions | SiteScraper 
Frequently asked questions | SiteScraper 
Frequently asked questions | SiteScraper

Sunday, May 15, 2011

Using Google Cache to crawl a website

Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn't exist in Google's search results then for most people it doesn't exist at all. Websites want visitors so will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading Google makes much of the content available through their cache.

So instead of downloading a URL directly we can download it indirectly via Google Cache: http://www.google.com/search?&q=cache%3Ahttp%3A//sitescraper.net
This way the source website cannot block you and does not even know you are crawling its content.
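
Based on the URL format above, here is a minimal sketch of fetching a page via Google Cache with urllib2 (in practice you would also throttle your requests so that Google itself does not block you):

import urllib
import urllib2

def cache_get(url):
    """Download a URL indirectly via the Google Cache URL format shown above"""
    cache_url = 'http://www.google.com/search?&q=cache%3A' + urllib.quote(url)
    request = urllib2.Request(cache_url, headers={'User-agent': 'Mozilla/5.0'})
    return urllib2.urlopen(request).read()

html = cache_get('http://sitescraper.net')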

Thursday, March 31, 2011

Google Storage

Often the datasets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example session to create a bucket on Google Storage, upload a file, and then download a copy:

$ gsutil mb gs://bucket_name
$ gsutil ls
gs://bucket_name
$ gsutil cp path/to/file.ext gs://bucket_name
$ gsutil ls gs://bucket_name
file.ext
$ gsutil cp gs://bucket_name/file.ext file_copy.ext

Tuesday, March 1, 2011

The SiteScraper module

A few years ago I developed the sitescraper library for automatically scraping website data based on example cases:


>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition", 
     "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", 
     "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2) 
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", "Linux Pocket Guide", "Linux in a Nutshell (In a Nutshell (O'Reilly))", 'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]

See this paper for more info.

It was designed for scraping websites repeatedly over time, when their layout may change. Unfortunately I don't use it much these days because most of my projects are one-off scrapes.

Monday, February 21, 2011

Automating CAPTCHAs

By now you would be used to entering the text for an image like this:
The idea is this will prevent bots because only a real user can interpret the image.

However this is not an obstacle for a determined scraper because of services like deathbycaptcha that will solve the CAPTCHA for you. These services use cheap labor to manually interpret the images and send the result back through an API.

CAPTCHAs are still useful because they deter most bots. However they cannot prevent a determined scraper and they are annoying to genuine users.