Tuesday, November 29, 2011

How to use proxies

First you need some working proxies. You can try collecting them from free lists such as http://hidemyass.com/proxy-list/, but so many people use these public lists that the proxies on them tend to be slow and unreliable.
If this is more than a hobby then it is a better use of your time to rent proxies from a provider like packetflip or proxybonanza.

Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples, followed by a sketch for loading a list of them from a file:
  • bob:eakej34@66.12.121.140:8000
  • 219.66.12.12
  • 219.66.12.14:8080 
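However you source them, it is convenient to keep the proxies in a plain text file, one per line, and load them into a list. Here is a minimal sketch, assuming a file called proxies.txt (the filename is just an assumption):

def load_proxies(filename='proxies.txt'):
    """Load proxies from a text file with one login:password@IP:port entry per line"""
    proxies = []
    for line in open(filename):
        line = line.strip()
        # skip blank lines and comments
        if line and not line.startswith('#'):
            proxies.append(line)
    return proxies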

With the webscraping library you can then use the proxies like this:
from webscraping import download
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)

The above script will download content through a random proxy from the given list. Here is a standalone version:

import urllib
import urllib2
import gzip
import random
import StringIO


def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https' : proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http' : proxy}))
    
    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it          
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
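For example, you could call it like this (the proxy addresses are placeholders taken from the examples above, so substitute your own working proxies):

proxies = ['bob:eakej34@66.12.121.140:8000', '219.66.12.14:8080']
html = fetch('http://example.com', proxies=proxies)
if html:
    print html[:250]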

Sunday, November 6, 2011

How to automatically find contact details

I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.

This wastes my time so I use this snippet to automate extracting the available emails:


import sys
from webscraping import common, download


def get_emails(website, max_depth):
    """Returns a list of emails found at this website
    
    max_depth is how deep to follow links
    """
    D = download.Download()
    return D.get_emails(website, max_depth=max_depth)
    

if __name__ == '__main__':
    try:
        website = sys.argv[1]
        max_depth = int(sys.argv[2])
    except (IndexError, ValueError):
        print 'Usage: %s <URL> <max depth>' % sys.argv[0]
    else:
        print get_emails(website, max_depth)

Example use:
>>> get_emails('http://www.sitescraper.net', 1)
['richard@sitescraper.net']
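The get_emails() method does the crawling and extraction for you. If you are curious what is involved (or do not want the dependency), here is a rough standalone sketch of the same idea using only the standard library: download a page, pull out anything that looks like an email with a regular expression, and follow links on the same domain up to max_depth. This is a simplification of what the library does, so expect it to miss obfuscated addresses.

import re
import urllib2
import urlparse


def find_emails(website, max_depth=1):
    """Crawl this website up to max_depth links deep and return the emails found"""
    email_re = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
    link_re = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)
    domain = urlparse.urlsplit(website).netloc
    found, seen = set(), set()
    queue = [(website, 0)]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue  # skip pages that fail to download
        found.update(email_re.findall(html))
        if depth < max_depth:
            # queue links that stay on the same domain
            for link in link_re.findall(html):
                link = urlparse.urljoin(url, link)
                if urlparse.urlsplit(link).netloc == domain:
                    queue.append((link, depth + 1))
    return sorted(found)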