
Tuesday, November 29, 2011

How to use proxies

First you need some working proxies. You can try collecting them from the various free lists such as http://hidemyass.com/proxy-list/, but because so many people use these lists the proxies tend to be slow, unreliable, or already blocked.
If this is more than a hobby then it is a better use of your time to rent proxies from a provider like packetflip or proxybonanza.

Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080 
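
For illustration, here is a small sketch (not part of any library) that splits one of these proxy strings into its parts. It assumes simple alphanumeric login details, as in the examples above:

import re

def parse_proxy(proxy):
    """Split a proxy string of the form [login:password@]IP[:port] into its parts"""
    # assumes alphanumeric login/password; adjust the pattern for other characters
    match = re.match(r'^(?:(\w+):(\w+)@)?([\d.]+)(?::(\d+))?$', proxy)
    return match.groups()  # (login, password, IP, port), with None for any missing part

print parse_proxy('bob:eakej34@66.12.121.140:8000')  # ('bob', 'eakej34', '66.12.121.140', '8000')
print parse_proxy('219.66.12.12')  # (None, None, '219.66.12.12', None)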

With the webscraping library you can then use the proxies like this:
from webscraping import download
proxies = ['bob:eakej34@66.12.121.140:8000', '219.66.12.14:8080']
D = download.Download(proxies=proxies, user_agent='Mozilla/5.0')
html = D.get(url)

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO


def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https' : proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http' : proxy}))
    
    # submit these headers with the request
    headers =  {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it          
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception as e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
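
You would call it like this (the URL is just a placeholder, and the proxy is one of the examples above):

if __name__ == '__main__':
    html = fetch('http://example.com', proxies=['219.66.12.14:8080'])
    if html:
        print len(html)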

Monday, February 8, 2010

How to crawl websites without being blocked

Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so that users can find them, but they generally don't want to be crawled by anyone else. Ironically, Google is one such company.

Some websites will actively try to stop scrapers, so here are some suggestions to help you crawl beneath their radar.


Speed
If you download one webpage a day then you will not be blocked, but your crawl would take too long to be useful. If you instead use threading to crawl multiple URLs asynchronously then the website might mistake you for a DoS attack and blacklist your IP. So what is the happy medium? The Wikipedia article on web crawlers currently states: "Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes." This is a little slow and I have found one download every 5 seconds is usually fine. If you don't need the data quickly then use a longer delay to reduce your risk and be kinder to their server.
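
For example, a minimal sketch of that pacing with urllib2 (the URLs are placeholders and the 5 second figure is just the rule of thumb above):

import time
import urllib2

DELAY = 5  # one download every 5 seconds
urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder URLs

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... process html here ...
    time.sleep(DELAY)  # pause before requesting the next page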

Identity
Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot (only for the brave): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

If you have access to multiple IP addresses (for example via proxies) then distribute your requests among them so that your downloads appear to come from multiple users.
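
Here is a rough sketch of rotating both the user-agent and the proxy per request; the user-agent strings, proxy and URL below are only examples:

import random
import urllib2

# example browser user-agent strings - substitute current ones for real use
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
]
proxies = ['219.66.12.14:8080']  # example proxy

def build_opener():
    # choose a random proxy and user-agent for each opener
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': random.choice(proxies)}))
    opener.addheaders = [('User-agent', random.choice(user_agents))]
    return opener

html = build_opener().open('http://example.com').read()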

Consistency
Avoid accessing webpages sequentially (/product/1, /product/2, etc.) and don't download a new webpage exactly every N seconds.
Both of these patterns can attract attention to your downloading because a real user browses more randomly. So crawl webpages in an unordered manner and add a random offset to the delay between downloads.
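
A minimal sketch of both ideas (the URLs are placeholders):

import random
import time
import urllib2

urls = ['http://example.com/product/%d' % i for i in range(1, 101)]  # placeholder URLs
random.shuffle(urls)  # crawl in an unordered manner rather than sequentially

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... process html here ...
    time.sleep(5 + random.uniform(0, 5))  # base delay plus a random offset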


Following these recommendations will allow you to crawl most websites without being detected.