Sunday, May 29, 2011

Using Google Translate to crawl a website

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is good to have backup options.

One option is Google Translate, which lets you translate a webpage into another language. If you select a source language you know is incorrect (e.g., Dutch for an English page) then no translation will take place and you will just get back the original content.
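The trick above can be sketched as building a translate URL with a deliberately wrong source language. This is a minimal illustration only: the endpoint and parameter names (`sl`, `tl`, `u`) are assumptions based on the query-string interface Google Translate exposed at the time, not part of the webscraping library.

```python
import urllib.parse

def gtrans_url(url, from_lang='nl', to_lang='en'):
    """Build a Google Translate URL that proxies the given page.

    Setting the claimed source language (`sl`) to one the page is NOT
    written in means no translation is applied, so the original content
    comes back unchanged. The parameter names here are assumptions.
    """
    params = urllib.parse.urlencode({
        'sl': from_lang,  # claimed source language (deliberately wrong)
        'tl': to_lang,    # target language
        'u': url,         # the page we actually want
    })
    return 'http://translate.google.com/translate?' + params

print(gtrans_url('http://sitescraper.net/faq'))
```

Requesting the resulting URL fetches the page through Google's servers rather than directly from the source website.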
I added a function to download a URL via Google Translate and Google Cache to my webscraping library. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = 'http://sitescraper.net/faq'
html1 = D.get(url) # download directly
html2 = D.gcache_get(url) # download via Google Cache
html3 = D.gtrans_get(url) # download via Google Translate
for html in (html1, html2, html3):
    print(xpath.get(html, '//title'))
This example downloads the same webpage directly, via Google Cache, and via Google Translate, then extracts the title to confirm the same webpage was downloaded each time. The output when run is:
Frequently asked questions | SiteScraper 
Frequently asked questions | SiteScraper 
Frequently asked questions | SiteScraper

Sunday, May 15, 2011

Using Google Cache to crawl a website

Occasionally I come across a website that blocks an IP address after only a few requests. If the website contains a lot of data then downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn't exist in Google's search results then for most people it doesn't exist at all. Websites want visitors, so they are usually happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and after downloading, Google makes much of the content available through its cache.

So instead of downloading a URL directly we can download it indirectly via Google Cache: http://www.google.com/search?&q=cache%3Ahttp%3A//sitescraper.net
This way the source website cannot block you and does not even know you are crawling its content.
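Constructing that cache lookup URL just means percent-encoding the target URL behind the `cache:` search operator. A minimal sketch of that (this is an illustration of the URL form shown above, not the webscraping library's own implementation):

```python
import urllib.parse

def gcache_url(url):
    """Build a Google Cache lookup URL for the given page.

    Google serves cached copies through an ordinary search query using
    the `cache:` operator, so we only need to percent-encode the target
    URL and append it to the search endpoint.
    """
    # quote() encodes ':' as %3A but leaves '/' intact by default,
    # matching the URL form shown above
    return 'http://www.google.com/search?q=cache%3A' + urllib.parse.quote(url)

print(gcache_url('http://sitescraper.net'))
```

Requesting the resulting URL returns Google's cached copy of the page instead of hitting the source website.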