Sunday, May 15, 2011

Using Google Cache to crawl a website

Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data, downloading it quickly would require an expensive number of proxies.

Fortunately there is an alternative - Google.

If a website doesn't exist in Google's search results then for most people it doesn't exist at all. Websites want visitors, so they are usually happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want, and after downloading, Google makes much of that content available through its cache.

So instead of downloading a URL we want directly, we can download it indirectly via Google Cache: http://www.google.com/search?&q=cache%3Ahttp%3A//sitescraper.net
This way the source website cannot block you and does not even know you are crawling its content.
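
Here is a minimal sketch of this in Python with urllib2 (the function names and the browser-like user agent string are my own, and I am assuming Google's cache query format matches the example above):

import urllib
import urllib2

def google_cache_url(url):
    # build a Google search query for the cached copy of the given URL
    return 'http://www.google.com/search?q=' + urllib.quote('cache:' + url)

def download_cached(url):
    # Google rejects urllib2's default user agent, so send a browser-like one
    request = urllib2.Request(google_cache_url(url),
                              headers={'User-Agent': 'Mozilla/5.0'})
    return urllib2.urlopen(request).read()

html = download_cached('http://sitescraper.net')

Note this fetches Google's cached copy, so the HTML may include a cache banner and be slightly stale compared to the live page.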

8 comments:

  1. This comment has been removed by the author.

  2. Hi Richard,

    How frequently can you make requests to Google Cache before getting the boot?

    Suppose I wanted to scrape all of the news articles from a particular newspaper using Google Cache.

    Bill

  3. Hi Bill,
    I have found no problem using a ~10 second delay between requests. Make sure to add some randomness so you don't use exactly the same interval each time (there is a sketch after the comments).
    Richard

  4. Hey Richard,
    I am trying to crawl Google cached results and I get banned when using random 2-5 second intervals. 10 seconds seems a bit slow. Is there anything else you can suggest? I thought of using proxies, but I don't know how to implement them in Python, and I also don't know where to find working proxies.
    Thank you, it is a very useful post.

  5. Hello,

    Thanks for your comment.

    2-5 seconds is a bit fast, so I am not surprised you were blocked. Think about how a regular user uses Google.

    There are many free proxy lists around, but they are not reliable. Search Google for "buy proxies" and you will find many providers.

    And you can use this class to support proxies (a sketch appears after the comments): http://docs.python.org/library/urllib2.html#urllib2.ProxyHandler

  6. This module supports proxies, so it may help you:
    http://code.google.com/p/webscraping/source/browse/download.py

  7. Thank you so much for your replies.
    I tried using a random 10-20 second interval but got blocked again. Did they change something?

  8. I have not had a problem with a ~10 second interval, so the problem must be something else, for example a suspicious user agent.

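
As referenced in comment 3, a minimal sketch of a randomized delay between requests (the 8-12 second range is my own reading of "~10 seconds with some randomness"):

import random
import time

# sleep roughly 10 seconds, varying the interval so requests are not evenly spaced
time.sleep(random.uniform(8, 12))

And as referenced in comment 5, a minimal sketch of downloading through a proxy with urllib2.ProxyHandler (the proxy address is a placeholder you would replace with one from your provider):

import urllib2

# route all HTTP requests through the given proxy
proxy_handler = urllib2.ProxyHandler({'http': 'http://user:password@proxyhost:8080'})
opener = urllib2.build_opener(proxy_handler)
html = opener.open('http://www.google.com/search?q=cache%3Ahttp%3A//sitescraper.net').read()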
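
Combining the two, you would build one opener per proxy and sleep a random interval between requests; the webscraping download.py module linked in comment 6 wraps this same pattern.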