Thursday, March 31, 2011

Google Storage

Often the datasets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to Google Storage.
Here is an example of using gsutil to create a bucket on Google Storage, upload a file, and then download it again:

$ gsutil mb gs://bucket_name
$ gsutil ls
gs://bucket_name
$ gsutil cp path/to/file.ext gs://bucket_name
$ gsutil ls gs://bucket_name
file.ext
$ gsutil cp gs://bucket_name/file.ext file_copy.ext
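
The same commands can also be run from the end of a scraping script. Here is a minimal sketch, assuming gsutil is installed and on the PATH, that shells out from Python with subprocess (check_call raises an error if the copy fails, so a broken upload is not missed):

>>> import subprocess
>>> # same as: gsutil cp path/to/file.ext gs://bucket_name
>>> subprocess.check_call(['gsutil', 'cp', 'path/to/file.ext', 'gs://bucket_name'])
0
>>> # and later, to fetch a copy back down
>>> subprocess.check_call(['gsutil', 'cp', 'gs://bucket_name/file.ext', 'file_copy.ext'])
0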

Tuesday, March 1, 2011

The SiteScraper module

A few years ago I developed the sitescraper library for automatically scraping website data based on example cases:


>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition", 
     "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", 
     "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2) 
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", "Linux Pocket Guide", "Linux in a Nutshell (In a Nutshell (O'Reilly))", 'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]

See this paper for more info.

It was designed for scraping websites over time, where the layout may change. Unfortunately I don't use it much these days because most of my projects are one-off scrapes.
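
As noted in the comment above, the scraper is more robust when trained on several example pages rather than one. Below is a rough sketch of that workflow with hypothetical URLs and expected data, using only the add() and scrape() calls shown earlier:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> # train on a few example pages, each paired with the data expected on that page
>>> # (the URLs and titles here are placeholders, not real training data)
>>> ss.add('http://example.com/search?q=python', ['python search', ['Title A', 'Title B']])
>>> ss.add('http://example.com/search?q=linux', ['linux search', ['Title C', 'Title D']])
>>> ss.add('http://example.com/search?q=ruby', ['ruby search', ['Title E', 'Title F']])
>>> # then scrape a page that was not part of the training examples
>>> ss.scrape('http://example.com/search?q=perl')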