As a student I was fortunate to have the opportunity to learn about web scraping under the guidance of Professor Timothy Baldwin. Frustrated by a previous project, I set out to build a tool that would make scraping web pages easier.
My idea for this tool was that it should be possible to train a program to scrape a website by just giving the desired outputs for some example webpages. The program would build a model of how to extract this content and then this model could be applied to scrape other webpages that used the same template.
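To make the idea concrete, here is a minimal sketch of training by example. It is not SiteScraper's actual implementation, just an illustration under simple assumptions: each desired output appears verbatim as a text node, and pages built from the same template keep that node at the same position in the tag tree. The names `TextPathParser`, `train` and `scrape` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPathParser(HTMLParser):
    """Record a positional path such as /html[1]/body[1]/h1[1] for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open elements as (tag, step, child_counts)
        self.root_counts = {}  # tag counts at the top level
        self.nodes = []        # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.stack[-1][2] if self.stack else self.root_counts
        counts[tag] = counts.get(tag, 0) + 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, f"{tag}[{counts[tag]}]", {}))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(step for _, step, _ in self.stack)
            self.nodes.append((path, text))


def text_paths(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def train(examples):
    """examples: list of (html, {field: desired_text}). Returns {field: path}."""
    candidates = {}
    for html, desired in examples:
        nodes = text_paths(html)
        for field, value in desired.items():
            paths = {path for path, text in nodes if text == value}
            # Keep only the paths that are consistent with every example so far.
            candidates[field] = candidates[field] & paths if field in candidates else paths
    return {field: sorted(paths)[0] for field, paths in candidates.items() if paths}


def scrape(model, html):
    """Apply a trained model to another page that uses the same template."""
    by_path = {}
    for path, text in text_paths(html):
        by_path.setdefault(path, text)
    return {field: by_path.get(path) for field, path in model.items()}


# Two training pages with their desired outputs, and one unseen page.
page_a = "<html><body><h1>Red Kettle</h1><div><span>Price</span><span>$20</span></div></body></html>"
page_b = "<html><body><h1>Blue Teapot</h1><div><span>Price</span><span>$35</span></div></body></html>"
page_c = "<html><body><h1>Green Mug</h1><div><span>Price</span><span>$8</span></div></body></html>"

model = train([
    (page_a, {"title": "Red Kettle", "price": "$20"}),
    (page_b, {"title": "Blue Teapot", "price": "$35"}),
])
result = scrape(model, page_c)
print(result)  # {'title': 'Green Mug', 'price': '$8'}
```

Using more than one training page matters when a desired value happens to appear in several places on a page: intersecting the candidate paths across examples discards the coincidental matches. A real tool has to cope with much messier input than this, such as values that span several nodes or templates that shift between pages.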
The tool was eventually named SiteScraper and is available for download on Google Code. For more information, have a browse of this paper, which covers the implementation and results in detail.
I use SiteScraper for much of my scraping work and often update it based on the experience gained from each project.