Saturday, December 3, 2011

How to teach yourself web scraping

I often get asked how to learn about web scraping. Here is my advice.

First, learn a popular high-level scripting language. A higher-level language lets you write and test ideas faster. You don't need a more efficient compiled language like C, because the bottleneck in web scraping is usually bandwidth rather than code execution. Pick a popular language so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.

The following advice will assume you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:

Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
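As a starting point, here is a minimal sketch of downloading a page. Note that urllib2 is the Python 2 name; in Python 3 the same functionality lives in urllib.request, which is what this sketch uses. The function name fetch and the User-Agent string are illustrative assumptions, not part of any library.

```python
import urllib.request


def fetch(url, user_agent="example-scraper/0.1", timeout=10):
    """Download a URL and return the body as text.

    `fetch` and the default User-Agent value are hypothetical names
    chosen for this sketch.
    """
    # Many sites treat the default Python User-Agent differently,
    # so set one explicitly.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch("http://example.com/") should return the page's HTML as a string. A real scraper would also want to handle urllib.error.URLError and retry or skip failed downloads.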



Learn about the HTTP protocol, which is how you will interact with websites.
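To get a feel for what travels over the wire, the sketch below builds a plain GET request and parses a raw response into its status, headers, and body. It is a simplified illustration that assumes a complete, unchunked response held in a string; the function names are made up for this example, and in practice a library does this work for you.

```python
def build_get_request(host, path="/", user_agent="example-scraper/0.1"):
    """Return the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,          # required in HTTP/1.1
        "User-Agent: %s" % user_agent,
        "Connection: close",
    ]
    # Header lines end with CRLF; a blank line ends the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_response(raw):
    """Split a raw HTTP response string into (status, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    parts = status_line.split(" ", 2)
    status = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise to lower case.
        headers[name.strip().lower()] = value.strip()
    return status, reason, headers, body
```

Knowing the status code and headers is what lets you tell a redirect, a missing page, or a blocked request apart from a successful download.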


Learn about regular expressions:
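For example, a regular expression can pull link targets out of a page. This is a rough sketch using Python's re module; the pattern and the extract_links name are assumptions made for illustration, and a pattern this simple will miss unquoted attributes and other unusual markup.

```python
import re

# Match an <a ...> tag and capture the quoted href value.
# Group 1 is the quote character, group 2 is the URL.
LINK_RE = re.compile(r'<a\s[^>]*?href=(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return the href values of all <a> tags found in the HTML string."""
    return [match.group(2) for match in LINK_RE.finditer(html)]
```

Regular expressions work well for quick extraction from pages with a predictable layout, but they become fragile when the markup is nested or inconsistent.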



Learn about XPath:
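The sketch below shows the idea using the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup. Real web pages are often not well-formed, so a more forgiving parser would usually be needed. The sample markup and the extract_prices name are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """<html><body>
<table id="prices">
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>"""


def extract_prices(markup):
    """Return (name, price) tuples from the table with id 'prices'."""
    root = ET.fromstring(markup)
    # Select every row of the table whose id attribute is 'prices'.
    rows = root.findall(".//table[@id='prices']/tr")
    results = []
    for row in rows:
        # Paths here are relative to each row element.
        name = row.findtext("td[@class='name']")
        price = float(row.findtext("td[@class='price']"))
        results.append((name, price))
    return results
```

Compared with a regular expression, the XPath expression describes where the data sits in the document tree, which tends to survive small changes in whitespace or attribute order.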



If necessary learn about JavaScript:



These Firefox extensions can make web scraping easier:



Some libraries that can make web scraping easier:

Some other resources:
