A few years ago I developed the sitescraper library for automatically scraping website data based on example cases:
See this paper for more info.
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition", "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", "Linux Pocket Guide", "Linux in a Nutshell (In a Nutshell (O'Reilly))", 'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
It was designed for scraping websites over time, where their layout may change. Unfortunately I don't use it much these days because most of my projects are one-off scrapes.
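The core idea — learn where the example data lives in the page, then reuse that location on similar pages — can be sketched with the standard library alone. This is a toy illustration of the approach, not sitescraper's actual algorithm (the paper describes the real method); the pages and titles here are made up:

```python
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Record the tag path (e.g. 'html/body/ul/li') to every text node."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.texts = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back past the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(('/'.join(self.stack), text))

def learn_path(html, example):
    """Find the tag path that leads to the example text."""
    parser = PathRecorder()
    parser.feed(html)
    for path, text in parser.texts:
        if text == example:
            return path
    return None

def scrape(html, path):
    """Extract all text nodes at the learned path from a new page."""
    parser = PathRecorder()
    parser.feed(html)
    return [text for text_path, text in parser.texts if text_path == path]

page1 = '<html><body><h1>Store</h1><ul><li>Learning Python</li></ul></body></html>'
page2 = ('<html><body><h1>Store</h1>'
         '<ul><li>Linux Bible</li><li>Linux Pocket Guide</li></ul></body></html>')

path = learn_path(page1, 'Learning Python')  # 'html/body/ul/li'
print(scrape(page2, path))                   # ['Linux Bible', 'Linux Pocket Guide']
```

A bare tag path like this is far too brittle for real pages, which is why sitescraper takes multiple example cases and why robustness to layout changes matters.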
Would you consider moving the source to github, so that other people could contribute?
The source code is available on Google Code. If someone wishes to contribute there they are welcome. If someone wishes to fork it elsewhere they are welcome.