You spent time and money collecting the data on your website, so you want to prevent someone else from downloading and reusing it. However, you still want Google to index your website so that people can find you.
This is a common problem. Below I will outline some strategies to protect your data.
Restrict
Firstly, if your data really is valuable then perhaps it shouldn't all be publicly available. Often websites will display the basic data to standard users and search engines, and the more valuable data (such as emails) only to logged-in users. Then the website can easily track and control how much valuable data each account is accessing.
If requiring accounts isn't practical and you want search engines to crawl your site, then realistically you can't prevent it from being scraped, but you can discourage scrapers by setting a high enough barrier.
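The per-account tracking described in this section could look roughly like the sketch below. It is only an illustration: the class name `AccessQuota`, the `daily_limit` parameter, and the in-memory counter are all assumptions, and a real site would keep the counts in its database.

```python
from collections import defaultdict


class AccessQuota:
    """Track how many protected records each account has viewed per day."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        # (account_id, day) -> number of protected records served so far
        self.counts = defaultdict(int)

    def allow(self, account_id, day):
        """Return True and count the view if the account is under its limit."""
        key = (account_id, day)
        if self.counts[key] >= self.daily_limit:
            return False
        self.counts[key] += 1
        return True


quota = AccessQuota(daily_limit=2)
results = [quota.allow("alice", "day-1") for _ in range(3)]
print(results)  # [True, True, False]
```

Once an account hits its limit the site can stop serving the valuable fields, or flag the account for review.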
Obfuscate
Scrapers typically work by downloading the HTML for a URL and then extracting the desired content. To make this process harder you can obfuscate your valuable data. The simplest way to obfuscate your data is to have it encoded on the server and then dynamically decoded with JavaScript in the client's browser. The scraper would then need to decode this JavaScript to extract the original data. This is not difficult for an experienced scraper, but would at least provide a small barrier.
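As a rough sketch of this encode-on-the-server, decode-in-the-browser idea, the Python below reverses a value and base64-encodes it, then emits an HTML fragment whose inline JavaScript undoes both steps. The encoding scheme, the function names, and the `data-v` attribute are all assumptions chosen for illustration. The sketch only handles ASCII text and assumes `element_id` is a plain identifier.

```python
import base64
import html


def encode_value(text):
    """Reverse the text and base64-encode it so it is not readable in the raw HTML."""
    return base64.b64encode(text[::-1].encode("ascii")).decode("ascii")


def decode_value(encoded):
    """Python mirror of the JavaScript decoder, used here only for checking."""
    return base64.b64decode(encoded).decode("ascii")[::-1]


def render_span(text, element_id):
    """Return an HTML fragment whose visible text is filled in by JavaScript."""
    encoded = encode_value(text)
    return (
        f'<span id="{html.escape(element_id)}" data-v="{encoded}"></span>\n'
        "<script>\n"
        f'  var el = document.getElementById("{element_id}");\n'
        '  el.textContent = atob(el.dataset.v).split("").reverse().join("");\n'
        "</script>"
    )


print(render_span("price: 42", "p1"))
```

A scraper that only reads the downloaded HTML sees the encoded attribute, not the text. One that runs the JavaScript, or reimplements the two decoding steps, recovers it easily, which is why this is only a small barrier.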
A better way is to encapsulate the key data within images or Flash. Optical Character Recognition (OCR) would then be needed to extract the original data, which takes a lot of effort to do accurately. (Make sure the URL of the image does not reveal the original data, as one website did!) The free OCR tools that I have tested are at best 80% accurate, which makes the resulting data useless.
The tradeoff with encoding data in images is that there is more data for the client to download, and genuine users can no longer conveniently copy the text. For example, people often display their email address within an image to combat spammers, which then forces everyone else to type it out manually.
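On the earlier warning about image URLs revealing the data: one possible approach, sketched below, is to derive the image filename from a keyed hash of the record so the URL carries neither the text nor a guessable ID. The `SECRET_KEY` value, the `/img/` prefix, and the `image_path` function are assumptions made for this example.

```python
import hashlib
import hmac

# Assumption: a private random key that never leaves the server.
SECRET_KEY = b"replace-with-a-private-random-key"


def image_path(record_id):
    """Derive an opaque image path that reveals neither the text nor the ID."""
    digest = hmac.new(SECRET_KEY, str(record_id).encode("utf-8"), hashlib.sha256)
    return "/img/" + digest.hexdigest()[:32] + ".png"


print(image_path("jane@example.com"))
```

The same input always maps to the same path, so the image can be cached, but without the key a scraper cannot work backwards from the URL to the data.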
Challenge
A popular way to prevent automated scraping is to force users to pass a CAPTCHA. For example, Google does this when it gets too many search requests from the same IP within a timeframe. To avoid the CAPTCHA the scraper could proceed slowly, but they probably can't afford to wait. To speed things up they may purchase multiple anonymous proxies to provide multiple IP addresses, but that is expensive: 10 anonymous proxies will cost ~$30 / month to rent. The CAPTCHA can also be solved automatically by a service like deathbycaptcha.com. This takes some effort to set up, so it would only be implemented by experienced scrapers for valuable data.
CAPTCHAs are not a good solution for protecting your content: they annoy genuine users, can be bypassed by a determined scraper, and make it difficult for Googlebot to index your website properly. They are only a good solution when being indexed by Google is not a priority and you want to stop most scrapers.
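If you do decide to challenge only heavy users, the "too many requests from the same IP within a timeframe" rule could be sketched as a sliding-window counter like the one below. The class name `ChallengeGate` and the thresholds are assumptions, and the timestamps are passed in explicitly to keep the example easy to follow.

```python
from collections import defaultdict, deque


class ChallengeGate:
    """Decide when an IP has made enough recent requests to be shown a CAPTCHA."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> timestamps of recent requests

    def needs_captcha(self, ip, now):
        """Record a request at time `now` (seconds) and report if the limit is exceeded."""
        recent = self.history[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        return len(recent) > self.max_requests


gate = ChallengeGate(max_requests=3, window_seconds=60)
burst = [gate.needs_captcha("203.0.113.5", t) for t in (0, 1, 2, 3)]
print(burst)  # [False, False, False, True]
```

A scraper that spreads its requests across several proxies stays under the limit on each IP, which is the workaround described above.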
Corrupt
If you are suspicious of an IP that is accessing your website you could block it, but then the scraper would know they have been detected and try a different approach. Instead you could allow the IP to continue downloading but return incorrect text or figures. This should be done subtly so that it is not clear which data is correct and their entire data set is corrupted. Perhaps they won't notice, and you will be able to track them down later by searching for "purple monkey dishwasher" or whatever other content was inserted!
Structure
Another factor that makes sites easy to scrape is a URL structure like:
domain/product/product_title/product_id
For example these two URLs point to the same content on Amazon:
http://www.amazon.com/Lets-Go-Australia-10th-Inc/dp/0312385757
http://www.amazon.com/FAKE_TITLE/dp/0312385757
The title is just there to make the URL look pretty. This makes the site easy to crawl because the scraper can just iterate through all the IDs (in this case ISBNs). If the title here had to be consistent with the product ID then it would take more work to scrape.
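A minimal sketch of enforcing that consistency is shown below: the request is only resolved when the title part of the URL matches the slug generated from the stored title. The `PRODUCTS` catalogue, its single entry, and the `slugify` rules are assumptions for illustration and are not how Amazon actually works.

```python
import re

# Hypothetical catalogue: product ID -> stored title.
PRODUCTS = {"0312385757": "Let's Go Australia 10th Inc"}


def slugify(title):
    """Drop apostrophes, then collapse every other non-alphanumeric run to '-'."""
    return re.sub(r"[^A-Za-z0-9]+", "-", title.replace("'", "")).strip("-")


def resolve(path):
    """Return the product ID for /<slug>/dp/<id>, or None if the slug does not match."""
    match = re.fullmatch(r"/([^/]+)/dp/([^/]+)", path)
    if not match:
        return None
    slug, product_id = match.groups()
    title = PRODUCTS.get(product_id)
    if title is None or slug != slugify(title):
        return None  # the caller would answer 404 here
    return product_id


print(resolve("/Lets-Go-Australia-10th-Inc/dp/0312385757"))  # 0312385757
print(resolve("/FAKE_TITLE/dp/0312385757"))  # None
```

Note that a mismatch should return a plain 404 rather than redirect to the correct URL, because a redirect would hand the scraper the right title.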
Google
All of the above strategies could be skipped for Googlebot to ensure your website is properly indexed. Be aware that anyone could pretend to be Googlebot by setting their user-agent to "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", so to be confident you should also verify the IP address via a reverse DNS lookup. Be warned that Google has been known to punish websites that display different content to its bot than to regular users.
In the next post I will take the opposite point of view: that of someone trying to scrape a website.
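For the Googlebot check described in the Google section, a sketch of the reverse DNS verification might look like this. It reverse-resolves the IP, checks that the hostname belongs to a Google crawler domain, then forward-resolves that hostname to confirm it maps back to the same IP. The function name and the accepted domain suffixes (`.googlebot.com`, `.google.com`) are assumptions here, so check Google's current guidance before relying on them. The lookups are passed in as parameters so the logic can be exercised without network access.

```python
import socket


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Assumed list of domains used by Google's crawlers.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP that made the request.
    return ip in addresses
```

The forward lookup matters because whoever controls an IP range also controls its reverse DNS records and could make them claim any hostname. Note that `socket.gethostbyname_ex` only handles IPv4, so IPv6 clients would need a different lookup.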
Web scraping is a computer software technique of extracting information from websites. Thank you very much....
"Firstly if your data really is valuable then perhaps it shouldn't be all publicly available" or even placed on the net. Your important documents should be placed on a secure server or even offline to ensure the privacy and security of the said papers.
Good point. I currently use Google Storage and Dropbox for storing my data.