Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions.
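The post doesn't reproduce the code itself, so here is a minimal sketch of what the three versions might look like; the sample document, the function names, and the exact patterns are my own assumptions, not the original implementation. lxml and BeautifulSoup are imported lazily inside their functions so the regex version runs even without them installed.

```python
import re

# A small sample document stands in for the downloaded page.
SAMPLE = '<html><head><title>Example Domain</title></head><body>...</body></html>'

def regex_title(html):
    # Scans only until the first match; no parse tree is ever built.
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

def lxml_title(html):
    # Parses the whole document into an element tree before reading the title.
    import lxml.html
    return lxml.html.fromstring(html).findtext('.//title').strip()

def bs_title(html):
    # Also builds a full parse tree before the title can be read.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, 'html.parser').title.string.strip()

print(regex_title(SAMPLE))  # -> Example Domain
```

The asymmetry is visible in the code: the regex function touches only as much of the string as it needs, while the other two must construct an in-memory tree for the entire document first.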
The results are:
regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
That means for this use case lxml takes over 40x longer than regular expressions, and BeautifulSoup over 1000x longer! This is because lxml and BeautifulSoup parse the entire document into their internal representation, even though only the title is required.
XPaths are very useful for most web scraping tasks, but there still is a use case for regular expressions.
Sure, as long as you _never_ use markup inside your divs... The speed comparison is ridiculous - these two examples aren't even doing close to the same thing: one is getting a div, the other is getting only the first text node inside the div. Unless you know and control the format, the re-based parser will fail. Trust SO - as a collective we know that in the long run it is a Bad Idea(TM) to parse HTML with regexes... just give up, please, for all our sakes!
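The commenter's nesting objection is easy to demonstrate concretely. A short sketch (the sample HTML is my own invention):

```python
import re

# Nested markup is exactly where a naive non-greedy regex breaks down:
nested = '<div class="post"><div>inner</div> and a tail</div>'

pattern = re.compile(r'<div[^>]*>(.*?)</div>', re.DOTALL)
print(pattern.search(nested).group(1))
# -> '<div>inner'  (the match stops at the *inner* </div>, silently
#    dropping ' and a tail')
```

A real parser tracks open and close tags as a stack, so it returns the full contents of the outer div; a regex has no such memory, which is why it only stays safe on markup whose shape you already know.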
As mentioned, my use case is a one-off scrape. I don't need something reliable over time, just a quick scrape of a known template. I have been involved in web scraping full time for 2 years and have never been bitten by this.
It's interesting the passion this topic provokes!
I'm sure somewhere across the vast user base of SO, you can find more than 2 years' experience with web scraping.
Of course. So?
I have to completely agree with Richard. As someone who currently does a lot of one-off web scraping, there are times when a regex or two gives you exactly what you need!
In my experience, you can sometimes use a regex to pull out chunks of the file, and then turn something like Beautiful Soup loose on just that little chunk. Beautiful Soup can be such a drain on runtime and on the programmer's time (it takes a lot of effort to drill down to the one set of nodes you need, and if you're scraping information from hundreds of pages, parse time does matter) that a couple of regexes can work wonders in getting at the information you need!
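The hybrid approach this commenter describes might look like the sketch below; the sample page and the `results` table are hypothetical, and it assumes the chunk of interest can be located with a simple anchor pattern.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

page = """
<html><body>
<div id="header">lots of navigation markup...</div>
<table class="results">
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
<div id="footer">lots of footer markup...</div>
</body></html>
"""

# Step 1: a regex carves out only the fragment of interest...
chunk = re.search(r'<table class="results">.*?</table>', page, re.DOTALL).group(0)

# Step 2: ...and BeautifulSoup parses just that small chunk.
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in BeautifulSoup(chunk, 'html.parser').find_all('tr')]
print(rows)  # -> [['Alice', '30'], ['Bob', '25']]
```

The regex does the cheap coarse-grained filtering, so the parser's cost is paid only on the few hundred bytes that actually matter rather than on the whole page.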