Thursday, June 30, 2011

Parsing Flash with Swiffy

Google has released a tool called Swiffy for parsing Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I wrote about earlier.

I tried some test files and found the results no more useful for parsing text content than the output produced by swf2html (Linux version). Some neat example conversions are available here. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000 so there is still a lot of work to do.

Sunday, June 19, 2011

Google App Engine limitations

Most of the discussion about Google App Engine seems to focus on how it allows you to scale your app, however I find it most useful for small client apps where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.

These are some of the downsides I have found using Google App Engine:
  • Slow - if your app has not been accessed recently (last minute) then it can take up to 10 seconds to load for the user
  • Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml
  • CPU quota easily gets exhausted when uploading data
  • Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.
  • Blocked in some countries, such as Turkey
  • Indexes - the free quota is 1 GB but often over half of this is taken up by indexes
  • Maximum 1000 records per query - no longer a limitation!
  • 20 second request limit, so often need the overhead of using Task Queues

Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.