Saturday, March 27, 2010

Scraping Flash-based websites

Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple's criticism of Flash are good news for me because they encourage developers to try non-Flash solutions.

The reality, though, is that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

  1. Check for AJAX requests between the Flash app and the server that may carry the data you are after
  2. Extract text with the Macromedia Flash Search Engine SDK
  3. Use OCR to extract the text directly

Most Flash apps are self-contained and so don't use AJAX, which rules out (1). I have also had poor results with (2) and (3).
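A cheap way to see what a Flash app holds internally is to look inside the SWF file itself. The following is a minimal sketch, not one of the tools listed above, and it assumes the file starts with the standard "FWS" (uncompressed) or "CWS" (zlib-compressed) signature followed by a version byte and a 4-byte length. The function names are made up for illustration. It only finds text stored as plain ASCII, such as URLs or labels. Text rendered as font glyphs will generally not show up.

```python
import re
import zlib


def swf_body(data: bytes) -> bytes:
    """Return the SWF bytes after the 8-byte header, decompressed if needed."""
    signature = data[:3]
    if signature == b"FWS":  # uncompressed SWF
        return data[8:]
    if signature == b"CWS":  # zlib-compressed SWF
        return zlib.decompress(data[8:])
    raise ValueError("not a SWF file, or unsupported compression: %r" % signature)


def printable_strings(body: bytes, min_length: int = 6) -> list:
    """Find runs of printable ASCII, similar in spirit to the Unix strings tool."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [m.group().decode("ascii") for m in pattern.finditer(body)]


def candidate_urls(body: bytes) -> list:
    """Keep only the strings that look like URLs the app might request."""
    return [s for s in printable_strings(body) if re.search(r"https?://", s)]
```

To try it, read a downloaded file with open("app.swf", "rb").read() (a hypothetical filename) and pass the result through swf_body and then candidate_urls. Any URLs that turn up are worth requesting directly, since that would bring approach (1) back into play.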

Still no silver bullet...
