First you need some working proxies. You can collect them from free lists such as http://hidemyass.com/proxy-list/, but these lists are heavily used, so the proxies on them are rarely reliable.
If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like packetflip or proxybonanza.
Each proxy will have the format login:password@IP:port
The login details and port are optional. Here are some examples:
- bob:eakej34@66.12.121.140:8000
- 219.66.12.12
- 219.66.12.14:8080
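Any of the forms above can be broken into its parts by splitting on `@` and `:`. Here is a minimal sketch (`parse_proxy` is just an illustrative helper, not part of the webscraping library):

```python
def parse_proxy(proxy):
    """Split a proxy string into (login, password, ip, port).
    Missing parts come back as None."""
    login = password = port = None
    if '@' in proxy:
        # login details present before the @
        auth, proxy = proxy.split('@', 1)
        login, password = auth.split(':', 1)
    if ':' in proxy:
        # port present after the IP
        ip, port = proxy.split(':', 1)
    else:
        ip = proxy
    return login, password, ip, port

print(parse_proxy('bob:eakej34@66.12.121.140:8000'))
# → ('bob', 'eakej34', '66.12.121.140', '8000')
```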
With the webscraping library you can then use the proxies like this:
from webscraping import download
D = download.Download(proxies=proxies, user_agent=user_agent)
html = D.get(url)
The above script will download content through a random proxy from the given list. Here is a standalone version:
import urllib
import urllib2
import gzip
import random
import StringIO

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download the content at this url and return the content
    """
    opener = urllib2.build_opener()
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        if url.lower().startswith('https://'):
            opener.add_handler(urllib2.ProxyHandler({'https': proxy}))
        else:
            opener.add_handler(urllib2.ProxyHandler({'http': proxy}))
    # submit these headers with the request
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # need to post this data
        data = urllib.urlencode(data)
    try:
        response = opener.open(urllib2.Request(url, data, headers))
        content = response.read()
        if response.headers.get('content-encoding') == 'gzip':
            # data came back gzip-compressed so decompress it
            content = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
    except Exception, e:
        # so many kinds of errors are possible here so just catch them all
        print 'Error: %s %s' % (url, e)
        content = None
    return content
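The script above targets Python 2, where urllib2 and StringIO still exist. On Python 3 the same approach translates roughly as below; this is a sketch using only the stdlib (`decompress_if_gzip` is just a helper name introduced here so the gzip handling can be exercised on its own):

```python
import gzip
import io
import random
import urllib.parse
import urllib.request

def decompress_if_gzip(content, encoding):
    """Decompress the response body when the server sent it gzipped."""
    if encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    return content

def fetch(url, data=None, proxies=None, user_agent='Mozilla/5.0'):
    """Download url, optionally through a random proxy from the list."""
    handlers = []
    if proxies:
        # download through a random proxy from the list
        proxy = random.choice(proxies)
        scheme = 'https' if url.lower().startswith('https://') else 'http'
        handlers.append(urllib.request.ProxyHandler({scheme: proxy}))
    opener = urllib.request.build_opener(*handlers)
    headers = {'User-agent': user_agent, 'Accept-encoding': 'gzip', 'Referer': url}
    if isinstance(data, dict):
        # POST data must be urlencoded bytes in Python 3
        data = urllib.parse.urlencode(data).encode('utf-8')
    try:
        response = opener.open(urllib.request.Request(url, data, headers))
        content = decompress_if_gzip(response.read(),
                                     response.headers.get('Content-Encoding'))
    except Exception as e:
        # so many kinds of errors are possible here so just catch them all
        print('Error: %s %s' % (url, e))
        content = None
    return content
```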