First you need some working proxies. You can collect them from free lists such as http://hidemyass.com/proxy-list/, but these lists are heavily used, so the proxies on them are rarely reliable.
If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like packetflip or proxybonanza.
Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
- bob:eakej34@66.12.121.140:8000
- 219.66.12.12
- 219.66.12.14:8080
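Any of the forms above can be broken into its parts by splitting on `@` and `:`. Here is a minimal sketch (`parse_proxy` is just an illustrative helper, not part of the webscraping library):

```python
def parse_proxy(proxy):
    """Split a proxy string into (login, password, ip, port).
    Missing parts come back as None."""
    login = password = port = None
    if '@' in proxy:
        # login details present before the @
        auth, proxy = proxy.split('@', 1)
        login, password = auth.split(':', 1)
    if ':' in proxy:
        # port present after the IP
        ip, port = proxy.split(':', 1)
    else:
        ip = proxy
    return login, password, ip, port

print(parse_proxy('bob:eakej34@66.12.121.140:8000'))
# → ('bob', 'eakej34', '66.12.121.140', '8000')
```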
With the webscraping library you can then use the proxies like this:
from webscraping import download
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)
The above script will download content through a random proxy from the given list. Here is a standalone version:
import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https': proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http': proxy}))
    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
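The script above targets Python 2, where urllib2 and StringIO still exist. On Python 3 the same approach translates roughly as below; this is a sketch using only the stdlib (`decompress_if_gzip` is just a helper name introduced here so the gzip handling can be exercised on its own):

```python
import gzip
import io
import random
import urllib.parse
import urllib.request

def decompress_if_gzip(content, encoding):
    """Decompress the response body when the server sent it gzipped."""
    if encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    return content

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download url, optionally through a random proxy from the list."""
    handlers = []
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        scheme = 'https' if url.lower().startswith('https://') else 'http'
        handlers.append(urllib.request.ProxyHandler({scheme: proxy}))
    opener = urllib.request.build_opener(*handlers)
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # POST data must be urlencoded bytes in Python 3
        data = urllib.parse.urlencode(data).encode('utf-8')
    try:
        response = opener.open(urllib.request.Request(url, data, headers))
        content = decompress_if_gzip(response.read(),
                                     response.headers.get('Content-Encoding'))
    except Exception as e:
        # so many kinds of errors are possible here so just catch them all
        print('Error: %s %s' % (url, e))
        content = None
    return content
```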