I wrote previously about using Google Cache to crawl a website. Sometimes, for whatever reason, Google Cache does not include a webpage, so it is good to have backup options.
One option is Google Translate, which lets you translate a webpage into another language. If you set the source language to one you know the page is not written in (e.g. Dutch), then no translation takes place and you just get back the original content.
I added a function to download a URL via Google Translate and Google Cache to my webscraping library. Here is an example:
from webscraping import download, xpath

D = download.Download()
url = 'http://sitescraper.net/faq'
html1 = D.get(url)         # download directly
html2 = D.gcache_get(url)  # download via Google Cache
html3 = D.gtrans_get(url)  # download via Google Translate
for html in (html1, html2, html3):
    print(xpath.get(html, '//title'))
This example downloads the same webpage directly, via Google Cache, and via Google Translate. It then extracts the title from each copy to show that the same webpage was downloaded every time. The output when run is:
Frequently asked questions | SiteScraper
Frequently asked questions | SiteScraper
Frequently asked questions | SiteScraper
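To make the Google Translate trick more concrete, here is a minimal sketch of how such a proxy URL might be built. This is not the library's code: the helper name gtrans_url, the endpoint http://translate.google.com/translate, and the parameter names sl, tl and u are all assumptions for illustration, and the endpoint may not behave this way today. The sketch only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode


def gtrans_url(url, source_lang="nl", target_lang="en"):
    """Build a Google Translate proxy URL for `url`.

    Hypothetical helper: the endpoint and the parameter names
    (sl, tl, u) are assumptions, not taken from the webscraping library.
    """
    # Deliberately claim a source language the page is NOT written in
    # (Dutch by default), so that no translation should be applied.
    params = {"sl": source_lang, "tl": target_lang, "u": url}
    # urlencode percent-escapes the target URL so it is safe in a query string.
    return "http://translate.google.com/translate?" + urlencode(params)


proxy_url = gtrans_url("http://sitescraper.net/faq")
print(proxy_url)
```

The resulting URL could then be passed to an ordinary download call, for example D.get(proxy_url) from the example above, although the response may need extra processing before it matches the original page.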