Sunday, May 15, 2011

Using Google Cache to crawl a website

Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data, downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn't exist in Google's search results then for most people it doesn't exist at all. Websites want visitors, so they are usually happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and after downloading, Google makes much of that content available through its cache.

So instead of downloading a URL we want directly, we can download it indirectly via Google Cache: http://www.google.com/search?&q=cache%3Ahttp%3A//sitescraper.net
This way the source website cannot block you and does not even know you are crawling its content.
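
Here is a minimal sketch of this in Python with urllib2 (the function names and the browser-like user agent string are my own, and I am assuming Google's cache query format matches the example above):

import urllib
import urllib2

def google_cache_url(url):
    # build a Google search query for the cached copy of the given URL
    return 'http://www.google.com/search?q=' + urllib.quote('cache:' + url)

def download_cached(url):
    # Google rejects urllib2's default user agent, so send a browser-like one
    request = urllib2.Request(google_cache_url(url),
                              headers={'User-Agent': 'Mozilla/5.0'})
    return urllib2.urlopen(request).read()

html = download_cached('http://sitescraper.net')

Note this fetches Google's cached copy, so the HTML may include a cache banner and be slightly stale compared to the live page.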

8 comments:

  1. This comment has been removed by the author.

  2. Hi Richard,

    How frequently can you make requests to Google Cache before getting the boot?

    Suppose I wanted to scrape all of the news articles from a particular newspaper using Google Cache.

    Bill

  3. Hi Bill,
    I have found no problem using a ~10 second delay between requests. Make sure to add some randomness so you don't use exactly the same interval each time (there is a sketch after the comments).
    Richard

  4. Hey Richard,
    I am trying to crawl Google cached results and I get banned when using random 2-5 second intervals. 10 seconds seems a bit slow. Is there anything else you can suggest? I thought of using proxies, but I don't know how to implement them in Python, and I also don't know where to find working proxies.
    Thank you, it is a very useful post.

  5. Hello,

    Thanks for your comment.

    2-5 seconds is a bit fast, so I am not surprised you were blocked. Think about how a regular user uses Google.

    There are many free proxy lists around, but they are not reliable. Search Google for "buy proxies" and you will find many providers.

    And you can use this class to support proxies (a sketch appears after the comments): http://docs.python.org/library/urllib2.html#urllib2.ProxyHandler

  6. This module supports proxies, so it may help you:
    http://code.google.com/p/webscraping/source/browse/download.py

  7. Thank you so much for your replies.
    I tried using a random 10-20 second interval but got blocked again. Did they change something?

  8. I have not had a problem with a ~10 second interval, so the problem must be something else, for example a suspicious user agent.

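
As referenced in comment 3, a minimal sketch of a randomized delay between requests (the 8-12 second range is my own reading of "~10 seconds with some randomness"):

import random
import time

# sleep roughly 10 seconds, varying the interval so requests are not evenly spaced
time.sleep(random.uniform(8, 12))

And as referenced in comment 5, a minimal sketch of downloading through a proxy with urllib2.ProxyHandler (the proxy address is a placeholder you would replace with one from your provider):

import urllib2

# route all HTTP requests through the given proxy
proxy_handler = urllib2.ProxyHandler({'http': 'http://user:password@proxyhost:8080'})
opener = urllib2.build_opener(proxy_handler)
html = opener.open('http://www.google.com/search?q=cache%3Ahttp%3A//sitescraper.net').read()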
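
Combining the two, you would build one opener per proxy and sleep a random interval between requests; the webscraping download.py module linked in comment 6 wraps this same pattern.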