Saturday, June 12, 2010

Open sourced web scraping code

For most scraping jobs I use the same general approach: crawl the site, select the appropriate nodes, and save the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is now available as open source on Google Code.
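The crawl, select, and save steps can be sketched in a few lines of standard-library Python. This is only an illustration of the general approach, not the library's actual API: the names `HeadingParser`, `download`, `crawl`, and `save` are hypothetical, the choice of `<h1>` as the node to select is arbitrary, and the crawler makes no attempt at rate limiting or staying on one domain.

```python
import csv
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadingParser(HTMLParser):
    """Collect link targets and the text of <h1> nodes from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles[-1] += data


def download(url):
    """Fetch a page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=download, max_pages=10):
    """Breadth-first crawl returning (url, heading) rows."""
    queue = deque([start_url])
    seen = {start_url}
    rows = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        parser = HeadingParser()
        parser.feed(fetch(url))
        fetched += 1
        # Select: keep the text of every <h1> node on the page.
        rows.extend((url, title.strip()) for title in parser.titles)
        # Crawl: queue every link that has not been seen yet.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return rows


def save(rows, out):
    """Save: write the rows as CSV to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    writer.writerows(rows)
```

Passing `fetch` as a parameter keeps the crawler testable without network access: any function that maps a URL to HTML text will do.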

The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library itself when you distribute it. This is different from the more popular GPL, which would require applications built on the library to be released under the GPL too, making the library unusable in most commercial projects. It is also different from BSD and WTFPL style licenses, which let people do whatever they want with the library, including making changes and not releasing them.

I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.

4 comments:

  1. I like the way you explain some of the differences between those licenses.

    LGPL sounds like the license I have always been looking for. A balance between openness and commercial usage.

  2. I just saw this URL on the aicookbook, and have written a brief description and link to your code on the MetaOptimize forum.

    Would you be interested in posting an answer describing the relative merits of your code compared to, say, Scrapy? That would be very informative.

  3. Can you explain what your code does that is different from or better than existing software like BeautifulSoup?

  4. Sorry for the delay - have been traveling!

    Hopefully this post addresses your questions:
    http://blog.sitescraper.net/2010/08/why-not-just-use-scrapy.html
