All your data are belong to us: Using Google Translate to crawl a website

I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is good to have backup options.

One option is using Google Translate, which lets you translate a webpage into another language. If you set the source language to one you know the page is not written in (e.g. Dutch for an English page), then no translation takes place and you just get back the original content.
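To make the trick concrete, here is a minimal sketch of how such a request URL could be built. The helper name `gtrans_url`, the endpoint `http://translate.google.com/translate`, and the query parameters `sl` (source language), `tl` (target language) and `u` (page URL) are assumptions for illustration. They are not taken from the webscraping library, and Google may change them at any time.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def gtrans_url(url, source='nl', target='en'):
    """Build a Google Translate URL that wraps `url` (hypothetical helper).

    Assumed parameters: sl = source language, tl = target language,
    u = the page to fetch. Choosing a source language the page is not
    written in (here Dutch) is meant to leave the content untranslated.
    """
    query = urlencode([('sl', source), ('tl', target), ('u', url)])
    return 'http://translate.google.com/translate?' + query


print(gtrans_url('http://sitescraper.net/faq'))
```

The page URL is percent-encoded by `urlencode`, so it can safely contain its own query string.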

I added a function to download a URL via Google Translate and Google Cache to my webscraping library. Here is an example:

from webscraping import download, xpath

D = download.Download()
url = 'http://sitescraper.net/faq'
html1 = D.get(url) # download directly
html2 = D.gcache_get(url) # download via Google Cache
html3 = D.gtrans_get(url) # download via Google Translate
for html in (html1, html2, html3):
    # print the page title from each copy; they should all match
    print(xpath.get(html, '//title'))

This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:

Frequently asked questions | SiteScraper
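Since the point of these alternatives is to have backup options, the three download methods can be chained so that the next one is tried only when the previous one fails. The following is a rough sketch under some assumptions: `first_success` is a hypothetical helper, not part of the webscraping library, and it assumes a failed download either raises an exception or returns an empty value. The stub functions stand in for real downloads so the example runs without network access.

```python
def first_success(url, fetchers):
    """Return the first non-empty result from a list of download functions.

    Hypothetical helper: each fetcher takes a URL and returns HTML.
    A fetcher that raises or returns an empty value counts as a failure.
    """
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            html = None
        if html:
            return html
    return None


# Stub fetchers standing in for direct, cached and translated downloads
def direct(url):
    raise IOError('blocked')  # simulate a failed direct download


def cached(url):
    return ''  # simulate a page missing from the cache


def translated(url):
    return '<title>Frequently asked questions | SiteScraper</title>'


print(first_success('http://sitescraper.net/faq', [direct, cached, translated]))
```

With the library from the example above, the call would be `first_success(url, [D.get, D.gcache_get, D.gtrans_get])`.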

Sunday, May 29, 2011
