Wednesday, July 20, 2011

User agents

Your web browser sends what is known as a "User Agent" with every page request, in the User-Agent HTTP header. This is a string that tells the server which browser, operating system, and device you are accessing the page with. Here are some common User Agent strings:
Browser: User Agent
Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Chrome on Linux: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
Internet Explorer on Windows XP: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Opera on Windows XP: Opera/9.00 (Windows NT 5.1; U; en)
Android: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
BlackBerry: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
Python urllib: Python-urllib/2.1
Old Google Bot: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
New Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
MSN Bot: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
Yahoo Bot: Yahoo! Slurp/Site Explorer
You can find your own current User Agent here.

Some webpages use the User Agent to serve content customized to your particular browser. For example, if your User Agent indicates you are using an old browser, the website may return a plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.
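
To make the two server behaviours above concrete, here is a rough sketch of how a site might decide what to return for a given User Agent. The marker lists and the respond function are invented for illustration; real websites use their own (usually more elaborate) rules.

```python
# Hypothetical substrings a site might treat as "this is a script".
SCRIPT_MARKERS = ("python-urllib", "curl", "wget")
# Hypothetical substrings a site might treat as "this is an old browser".
OLD_BROWSER_MARKERS = ("msie 6.0", "msie 7.0")


def respond(user_agent):
    """Return which kind of response a (hypothetical) site would send."""
    ua = user_agent.lower()
    if any(marker in ua for marker in SCRIPT_MARKERS):
        return "blocked"       # scripts are refused outright
    if any(marker in ua for marker in OLD_BROWSER_MARKERS):
        return "plain-html"    # old browsers get the version without AJAX
    return "ajax"              # everyone else gets the full interface


print(respond("Python-urllib/2.1"))
print(respond("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"))
```

Under these assumed rules the default urllib string is blocked, while the Internet Explorer 7 string is served the plain HTML version.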

Fortunately it is easy to set your User Agent to whatever you like.
For Firefox you can use the User Agent Switcher extension.
For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
For Internet Explorer you can use the UAPick extension.

And for Python scripts you can set the User-Agent header with:

import urllib2
opener = urllib2.build_opener()
# replace the default "Python-urllib/x.y" User Agent
opener.addheaders = [('User-Agent', 'my custom user agent')]
opener.open('http://www.google.com')
Using the default User Agent for your scraper is a common reason to be blocked, so don't forget to set it.
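
In Python 3 the urllib2 module no longer exists and its functionality lives in urllib.request. The following is a minimal sketch of setting the User Agent there. The string "my custom user agent" is a placeholder and the helper names are made up for this example. The request is only built, not sent, so the snippet runs without network access.

```python
from urllib.request import Request, build_opener

CUSTOM_UA = "my custom user agent"  # placeholder; substitute a real browser string


def default_user_agent():
    """Return the User Agent that urllib sends when none is set."""
    return dict(build_opener().addheaders)["User-agent"]


def build_request(url, user_agent=CUSTOM_UA):
    """Build (but do not send) a request carrying a custom User Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.google.com")
print(default_user_agent())              # starts with "Python-urllib/"
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))  # my custom user agent
```

To actually fetch the page, pass the request to urllib.request.urlopen(request).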



Tuesday, July 5, 2011

Taking advantage of mobile interfaces

Sometimes a website will have multiple versions: one for regular users with a modern browser, an HTML version for browsers that don't support JavaScript, and a simplified version for mobile users.

For example Gmail has a standard AJAX interface, a static HTML interface, and a mobile interface.

All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.

On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.

So before scraping a website check for its HTML or mobile version, which should be easier to scrape.

To find the HTML version try disabling JavaScript in your browser and see what happens.
To find the mobile version try adding the "m" subdomain (domain.com -> m.domain.com) or using a mobile user-agent.
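
The mobile tip above can be sketched in Python 3 as follows. This assumes the site follows the m.domain.com convention, which not every site does. Stripping a leading "www." is an extra assumption of this sketch, and the helper names are made up. The User Agent is the iPhone string listed in the User agents post. Nothing is sent over the network here; the request is only constructed.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

IPHONE_UA = (
    "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ "
    "(KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3"
)


def mobile_url(url):
    """Guess the mobile URL by putting an 'm.' subdomain on the host."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("m."):
        return url                      # already a mobile address
    if host.startswith("www."):
        host = host[len("www."):]       # assumption: www.x.com -> m.x.com
    return urlunsplit(
        (parts.scheme, "m." + host, parts.path, parts.query, parts.fragment)
    )


def mobile_request(url):
    """Build (but do not send) a request for the guessed mobile page."""
    return Request(mobile_url(url), headers={"User-Agent": IPHONE_UA})


print(mobile_url("http://domain.com"))           # http://m.domain.com
print(mobile_url("http://www.domain.com/page"))  # http://m.domain.com/page
```

If the guessed address does not exist, requesting the original URL with the mobile User Agent is the other option mentioned above.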