Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions.
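The post doesn't reproduce the code itself, so here is a minimal sketch of what the three versions might look like; the sample document, the function names, and the exact patterns are my own assumptions, not the original implementation. lxml and BeautifulSoup are imported lazily inside their functions so the regex version runs even without them installed.

```python
import re

# A small sample document stands in for the downloaded page.
SAMPLE = '<html><head><title>Example Domain</title></head><body>...</body></html>'

def regex_title(html):
    # Scans only until the first match; no parse tree is ever built.
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

def lxml_title(html):
    # Parses the whole document into an element tree before reading the title.
    import lxml.html
    return lxml.html.fromstring(html).findtext('.//title').strip()

def bs_title(html):
    # Also builds a full parse tree before the title can be read.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, 'html.parser').title.string.strip()

print(regex_title(SAMPLE))  # -> Example Domain
```

The asymmetry is visible in the code: the regex function touches only as much of the string as it needs, while the other two must construct an in-memory tree for the entire document first.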
The results are:
regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
That means for this use case lxml takes over 40x longer than regular expressions, and BeautifulSoup over 1000x longer! This is because lxml and BeautifulSoup parse the entire document into their internal representation, even though only the title is required.
XPaths are very useful for most web scraping tasks, but there still is a use case for regular expressions.
Sure, as long as you _never_ use markup inside your divs... The speed comparison is ridiculous - these two examples aren't even doing close to the same thing: one is getting a div, the other is getting only the first text node inside the div. Unless you know and control the format, the re-based parser will fail. Trust SO - as a collective we know that in the long run it is a Bad Idea(TM) to parse HTML with regexes... just give up, please, for all our sakes!
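The commenter's nesting objection is easy to demonstrate concretely. A short sketch (the sample HTML is my own invention):

```python
import re

# Nested markup is exactly where a naive non-greedy regex breaks down:
nested = '<div class="post"><div>inner</div> and a tail</div>'

pattern = re.compile(r'<div[^>]*>(.*?)</div>', re.DOTALL)
print(pattern.search(nested).group(1))
# -> '<div>inner'  (the match stops at the *inner* </div>, silently
#    dropping ' and a tail')
```

A real parser tracks open and close tags as a stack, so it returns the full contents of the outer div; a regex has no such memory, which is why it only stays safe on markup whose shape you already know.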
As mentioned, my use case is a one-off scrape. I don't need something reliable over time, just a quick scrape of a known template. I have been involved in web scraping full time for 2 years and have never been bitten by this.
It's interesting the passion this topic provokes!
I'm sure somewhere across the vast user base of SO, you can find more than 2 years' experience with web scraping.
Of course. So?
I have to completely agree with Richard. As someone who currently does a lot of one-off web scraping, there are times when a regex or two gives you exactly what you need!
In my experience, you can sometimes use a regex to pull out chunks of the file, and then turn something like Beautiful Soup loose on just that little chunk. Beautiful Soup can be such a drain on runtime and on the programmer's time (it takes a lot of effort to drill down to the one set of nodes you need, and if you're scraping information from hundreds of pages, parse time does matter) that a couple of regexes can work wonders in getting at the information you need!
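The hybrid approach this commenter describes might look like the sketch below; the sample page and the `results` table are hypothetical, and it assumes the chunk of interest can be located with a simple anchor pattern.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

page = """
<html><body>
<div id="header">lots of navigation markup...</div>
<table class="results">
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
<div id="footer">lots of footer markup...</div>
</body></html>
"""

# Step 1: a regex carves out only the fragment of interest...
chunk = re.search(r'<table class="results">.*?</table>', page, re.DOTALL).group(0)

# Step 2: ...and BeautifulSoup parses just that small chunk.
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in BeautifulSoup(chunk, 'html.parser').find_all('tr')]
print(rows)  # -> [['Alice', '30'], ['Bob', '25']]
```

The regex does the cheap coarse-grained filtering, so the parser's cost is paid only on the few hundred bytes that actually matter rather than on the whole page.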