Wednesday, July 20, 2011

User agents

Your web browser sends what is known as a "User Agent" with every request it makes. This is a string in the request headers that tells the server what browser and operating system you are accessing the page with. Here are some common User Agent strings:
Browser | User Agent
Firefox on Windows XP | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Chrome on Linux | Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
Internet Explorer on Windows XP | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Opera on Windows XP | Opera/9.00 (Windows NT 5.1; U; en)
Android | Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
iPhone | Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
BlackBerry | Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
Python urllib | Python-urllib/2.1
Old Google Bot | Googlebot/2.1 (+http://www.googlebot.com/bot.html)
New Google Bot | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
MSN Bot | msnbot/1.1 (+http://search.msn.com/msnbot.htm)
Yahoo Bot | Yahoo! Slurp/Site Explorer
You can find your own current User Agent here.
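
For a script, you can also check what it announces by default. Here is a minimal sketch, assuming Python 3, where urllib2 has become urllib.request. The helper name default_user_agent is made up for this example.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent string urllib sends when none is set."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get("user-agent")


if __name__ == "__main__":
    # Prints something of the form Python-urllib/<version>
    print(default_user_agent())
```

This is the same kind of string as the "Python urllib" entry listed above, so a server can easily tell it apart from a regular browser.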

Some webpages use the User Agent to display content that is customized to your particular browser. For example, if your User Agent indicates you are using an old browser, the website may return a plain HTML version without any AJAX features, which may be easier to scrape.

Some websites automatically block certain User Agents. This can happen, for example, when your User Agent indicates you are accessing their server with a script rather than a regular web browser.
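
To give a rough idea of what such a check might look like on the server side, here is a small sketch. The marker list and the function name is_script_user_agent are assumptions made for illustration. Real websites use their own rules, which are usually not public.

```python
# Hypothetical list of substrings a server might associate with scripts.
SCRIPT_MARKERS = ("python-urllib", "curl", "wget", "libwww-perl")


def is_script_user_agent(user_agent):
    """Return True if the User-Agent looks like a script rather than a browser."""
    if not user_agent:
        # In this sketch a missing User-Agent is treated as suspicious
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in SCRIPT_MARKERS)


if __name__ == "__main__":
    print(is_script_user_agent("Python-urllib/2.1"))
    print(is_script_user_agent(
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) "
        "Gecko/20070725 Firefox/2.0.0.6"))
```

A check like this only looks at the string the client chooses to send, which is why changing the User Agent is often enough to get past it.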

Fortunately it is easy to set your User Agent to whatever you like.
For Firefox you can use the User Agent Switcher extension.
For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
For Internet Explorer you can use the UAPick extension.

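And for Python scripts you can set the User Agent header with:

import urllib2
# Send a custom User-Agent header with every request made through this opener
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'my custom user agent')]
response = opener.open('http://www.google.com')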
Using the default User Agent for your scraper is a common reason to be blocked, so don't forget to change it.
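
If you are on Python 3, where urllib2 has become urllib.request, a minimal sketch of setting the User Agent per request might look like the following. The build_request helper and the USER_AGENT constant are names invented for this example, and the string is just the Firefox entry from the list above.

```python
import urllib.request

# Placeholder value: substitute whichever browser string you want to present.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) "
              "Gecko/20070725 Firefox/2.0.0.6")


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("http://www.google.com")
    # urllib stores header names in capitalized form, i.e. "User-agent"
    print(request.get_header("User-agent"))
    # Passing the request to urllib.request.urlopen() would send it
    # over the network with this header.
```

Setting the header on the Request keeps the choice explicit for each URL. Setting addheaders on an opener instead applies it to everything that opener fetches.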


