A few years ago I developed the sitescraper library for automatically scraping website data based on example cases:
See this paper for more info.
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition", "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", "Linux Pocket Guide", "Linux in a Nutshell (In a Nutshell (O'Reilly))", 'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
It was designed for scraping websites over time, where their layout may change. Unfortunately I don't use it much these days because most of my projects are one-off scrapes.
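The core idea — learn where the example data lives in the page, then reuse that location on similar pages — can be sketched with the standard library alone. This is a toy illustration of the approach, not sitescraper's actual algorithm (the paper describes the real method); the pages and titles here are made up:

```python
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Record the tag path (e.g. 'html/body/ul/li') to every text node."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.texts = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back past the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(('/'.join(self.stack), text))

def learn_path(html, example):
    """Find the tag path that leads to the example text."""
    parser = PathRecorder()
    parser.feed(html)
    for path, text in parser.texts:
        if text == example:
            return path
    return None

def scrape(html, path):
    """Extract all text nodes at the learned path from a new page."""
    parser = PathRecorder()
    parser.feed(html)
    return [text for text_path, text in parser.texts if text_path == path]

page1 = '<html><body><h1>Store</h1><ul><li>Learning Python</li></ul></body></html>'
page2 = ('<html><body><h1>Store</h1>'
         '<ul><li>Linux Bible</li><li>Linux Pocket Guide</li></ul></body></html>')

path = learn_path(page1, 'Learning Python')  # 'html/body/ul/li'
print(scrape(page2, path))                   # ['Linux Bible', 'Linux Pocket Guide']
```

A bare tag path like this is far too brittle for real pages, which is why sitescraper takes multiple example cases and why robustness to layout changes matters.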
Would you consider moving the source to github, so that other people could contribute?
The source code is available on Google Code. If someone wishes to contribute there they are welcome. If someone wishes to fork it elsewhere they are welcome.