I often get asked how to learn about web scraping. Here is my advice.
First learn a popular high-level scripting language. A higher-level language lets you write and test ideas faster. You don't need a more efficient compiled language like C, because the bottleneck in web scraping is bandwidth rather than code execution. Choose a popular one so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.
The following advice assumes you want to use Python for web scraping.
If you have some programming experience then I recommend working through the Dive Into Python book:
Make sure you learn all the details of the urllib2 module. Here are some additional good resources:
- http://www.doughellmann.com/PyMOTW/urllib2/
- http://www.voidspace.org.uk/python/articles/urllib2.shtml
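To give a feel for what this module does, here is a minimal download sketch. It assumes Python 3, where urllib2 was split into urllib.request and urllib.error. The function names and the User-Agent string are made up for illustration and do not come from the resources above.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="example-scraper/0.1"):
    """Create a GET request that identifies itself with a custom User-Agent."""
    # The User-Agent value is a hypothetical placeholder.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the decoded body of url, or None on an HTTP or URL error."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error", error.code, "for", url)
    except urllib.error.URLError as error:
        # No valid response arrived (DNS failure, refused connection, ...).
        print("Failed to reach", url, ":", error.reason)
    return None
```

Calling `fetch("http://example.com/")` should return the page source as a string. HTTPError is caught before URLError because it is a subclass of it.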
Learn about the HTTP protocol, which is how you will interact with websites.
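To make the protocol less abstract, here is a rough sketch of what a simple GET request looks like on the wire and how the start of a response can be picked apart. The helper names, the User-Agent string and the sample response are all invented for illustration. In practice a library builds and parses these messages for you.

```python
def build_get_request(host, path="/", user_agent="example-scraper/0.1"):
    """Return the raw bytes of a simple HTTP/1.1 GET request."""
    lines = [
        "GET {} HTTP/1.1".format(path),   # request line: method, path, version
        "Host: {}".format(host),          # required header in HTTP/1.1
        "User-Agent: {}".format(user_agent),
        "Connection: close",
        "",                               # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Split a raw HTTP response into (status_code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return int(status_code), reason, headers, body


# A hand-written sample response, not captured from a real server.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>Hello</p>\n"
)

status, reason, headers, body = parse_response_head(SAMPLE_RESPONSE)
print(status, reason, headers["content-type"])
```

The status code and headers are the parts a scraper most often needs to inspect, for example to detect redirects, errors, or the character encoding of the body.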
Learn about regular expressions:
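As a small taste of how regular expressions get used in scraping, the sketch below pulls link targets and the page title out of a snippet of HTML. The pattern is deliberately simple and the sample markup is made up. Real pages are messier, so treat this as a starting point rather than a robust parser.

```python
import re

# Matches href="..." or href='...' inside an <a> tag, ignoring case.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\'](.*?)["\']', re.IGNORECASE)
# DOTALL lets the title span several lines.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return every link target found in the HTML, in document order."""
    return LINK_RE.findall(html)


def extract_title(html):
    """Return the stripped page title, or None if there is no <title> tag."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Invented sample markup for illustration.
SAMPLE_HTML = """
<html><head><title> Example page </title></head>
<ul>
  <li><a href="/page1.html">First</a></li>
  <li><A class="x" HREF='/page2.html'>Second</A></li>
</ul>
</html>
"""

print(extract_links(SAMPLE_HTML))
print(extract_title(SAMPLE_HTML))
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run on to the last quote on the line instead of stopping at the end of the attribute.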
Learn about XPath:
- http://www.w3schools.com/xpath/
- http://www.learn-xslt-tutorial.com/XPath.cfm
- http://lxml.de/dev/xpathxslt.html
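Here is a minimal sketch of XPath-style selection, using the standard library's xml.etree.ElementTree so that it runs without extra installs. ElementTree supports only a limited subset of XPath and needs well-formed markup, so for real web pages a more forgiving parser with full XPath support, such as lxml, is likely the better fit. The sample document and its class names are invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
SAMPLE_XML = """
<html>
  <body>
    <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
    <div class="product"><span class="name">Gadget</span><span class="price">19.50</span></div>
    <div class="footer"><span class="name">ignored</span></div>
  </body>
</html>
"""


def extract_products(markup):
    """Return a list of (name, price) tuples, one per product div."""
    root = ET.fromstring(markup.strip())
    products = []
    # .//div[@class='product'] selects product divs at any depth.
    for div in root.findall(".//div[@class='product']"):
        # Relative paths are evaluated from the current div.
        name = div.find("span[@class='name']").text
        price = float(div.find("span[@class='price']").text)
        products.append((name, price))
    return products


print(extract_products(SAMPLE_XML))
```

Compared with a regular expression, the path expression describes where the data sits in the document tree, which tends to survive small changes in whitespace and attribute order.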
If necessary learn about JavaScript:
- https://developer.mozilla.org/en/A_re-introduction_to_JavaScript
- http://eloquentjavascript.net/contents.html
- http://www.yuiblog.com/crockford/
These Firefox extensions can make web scraping easier:
Some libraries that can make web scraping easier:
Some other resources: