Sunday, November 7, 2010

New scraping quote tool

An ongoing problem for my web scraping work is how much to quote for a job. I prefer fixed fee to hourly rates so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.

Through experience I found the following factors most affected the time required for a job:
  • Website size
  • Login protected
  • IP restrictions
  • HTML quality
  • JavaScript/AJAX
I developed a formula based on these factors and have now built an interface that lets potential clients clarify the costs involved with different kinds of web scraping jobs. Additionally I hope this will reduce the communication overhead by helping clients to provide the necessary information upfront.
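
The formula itself is not published here, but purely as an illustration, an estimate driven by these factors might look like the sketch below (the function and every number in it are hypothetical, not my actual pricing):

def quote_estimate(num_pages, login=False, ip_restricted=False,
                   poor_html=False, javascript=False):
    """Hypothetical fixed-fee estimate based on the factors listed above."""
    hours = 2.0                   # base setup and testing time
    hours += num_pages / 1000.0   # website size
    if login:
        hours += 1                # handling authentication and sessions
    if ip_restricted:
        hours += 2                # sourcing and rotating proxies
    if poor_html:
        hours += 2                # more fragile extraction rules
    if javascript:
        hours += 3                # may need a browser engine to render pages
    return 50 * hours             # hypothetical hourly rate of $50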

Wednesday, October 27, 2010

Increase your Google App Engine quotas for free

Google App Engine provides generous free quotas for your app and additional paid quotas.
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.

Here is a screenshot of the billing panel:

GAE lets you allocate a daily budget to the various resources, with the minimum permitted budget being USD $1. When you exhaust a free quota you will only be charged for the budget allocated to it. In the above screenshot I have allocated all my budget to emailing, but since my app does not use the Mail API I can be confident this free quota will never be exhausted and I will never pay a cent. For another app that does use Mail I have allocated all the budget to Bandwidth Out instead.

Now with billing enabled my app:
  • can access the Blobstore API to store larger amounts of data
  • enjoys much higher free limits for the Mail, Task Queue, and UrlFetch APIs - for example by default an app can make 7,000 Mail API calls, but with billing enabled this limit jumps to 1,700,000 calls
  • has a higher per minute CPU limit, which I find particularly useful when uploading a mass of records to the Datastore

So in summary you can enable billing to extend your free quotas without risk of paying.

Thursday, October 7, 2010

Extracting article summaries

I made my own version of this technique to extract article summaries.
Source code can be found here.

The idea is simple - extract the biggest text block - but performs well.
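
The linked source is the real implementation; the sketch below just shows the shape of the idea (the regexes and thresholds are illustrative only):

import re

def summarize(html, min_length=200):
    """Rough sketch: return the start of the largest plain-text block in a page."""
    # drop scripts and styles, then split the remaining markup on block-level tags
    html = re.sub(r'(?s)<(script|style).*?</\1>', '', html)
    blocks = re.split(r'(?s)</?(?:p|div|td|h\d|li|br)\b[^>]*>', html)
    # strip any remaining inline tags and surrounding whitespace
    blocks = [re.sub(r'<[^>]+>', '', b).strip() for b in blocks]
    candidates = [b for b in blocks if len(b) >= min_length]
    return max(candidates, key=len)[:250] if candidates else ''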
Here are some test results:

http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1
The decision to shut down google.cn will have a limited financial impact on Google, which is based in Mountain View, Calif. China accounted for a small fraction of Google’s $23.6 billion in global revenue last year. Ads that once appeared on google.

http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/
Being able to spin up appliance images for EC2 and spit them out onto the Amazon cloud meshes with Novell's EC2-based SUSE Linux licensing, which was announced back in August. Novell is only selling priority-level (24x7) support contract for SUSE Linux li

http://blog.sitescraper.net/2010/08/best-website-for-freelancers.html
However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people we

Tuesday, September 14, 2010

Image efficiencies

I needed to store a large quantity of images, so I took the following measurements:

Format   Time        Size
bmp      0.715670    1769526
gif      4.184417    501931
jpg      2.507811    22252
png      10.909442   67295
ppm      0.648540    1769488   (sample not shown - Blogger prevented upload)
tiff     1.011216    1769600   (sample not shown - Blogger prevented upload)

Gif is the clear loser - it takes a long time to process but still looks terrible.
For minimal space use jpeg; for speed, ppm.
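
For anyone wanting to reproduce this kind of comparison, a rough sketch with PIL might look like the following (test.jpg stands in for whatever sample image you measure, and the numbers will differ from mine):

import os
import time
from PIL import Image

img = Image.open('test.jpg')  # placeholder sample image
for fmt in ('bmp', 'gif', 'jpeg', 'png', 'ppm', 'tiff'):
    filename = 'sample.' + fmt
    # GIF needs a palette image; the other formats are saved as RGB
    im = img.convert('P') if fmt == 'gif' else img.convert('RGB')
    start = time.time()
    im.save(filename)
    print fmt, round(time.time() - start, 6), os.path.getsize(filename)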

Google's new WebP format looks promising.

Tuesday, September 7, 2010

Feedback

I was concerned about what blind spots I might have with the way I run my business. For example I am Australian and Australians are usually very informal, even in a professional setting - was my communication with international clients too informal?

To try and address these concerns I developed a feedback survey with Google Docs, which I have been (politely) requesting my clients to complete at the end of a job. The results have been helpful, and it also seems to have impressed clients that I wanted their feedback. Wish I had thought of this earlier!

Saturday, August 28, 2010

Why reinvent the wheel?

I have been asked a few times why I chose to reinvent the wheel when libraries such as Scrapy and lxml already exist.

I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on Google App Engine.

To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.
The most well known Python HTML parser seems to be BeautifulSoup, however I find it slow, difficult to use (compared to XPath), and prone to parsing HTML inaccurately. Significantly, the author has lost interest in developing it further. So I would not recommend using it - instead go with html5lib.
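
For example, the parse step with html5lib is a one-liner and returns a standard ElementTree (assuming html5lib is installed; the namespaceHTMLElements flag just keeps plain tag names):

import html5lib

# parse broken HTML into an ElementTree without the XHTML namespace prefix
tree = html5lib.parse('<ul><li>abc<li>def</ul>', namespaceHTMLElements=False)
print [li.text for li in tree.findall('.//li')]  # ['abc', 'def']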

To select HTML content I use XPath. Is there a decent pure Python XPath solution? I didn't find one 6 months ago when I needed it, so I developed this simple version that covers my typical use cases. I would deprecate this in future if a decent solution comes along, but for now I am happy with my pure Python infrastructure.

Friday, August 20, 2010

Best website for freelancers

When I started freelancing I tried competing for as much work as possible by creating accounts on every freelance site I could find: oDesk, guru, scriptlance, and many others. However to my surprise I got almost all my work from just one source: Elance. How is Elance different?

With most freelancing sites you create an account and can then start bidding for jobs straight away. There is generally no cost to bidding, so freelancers tend to bid on projects even if they don't have the skills or time to complete them. This is obviously frustrating for clients, who waste a lot of time sifting through bids.

However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people weren't willing to waste their money bidding for a job they can't do. This barrier serves to weed out some of the less serious workers so that the average bid is of higher quality.

From my experience the clients are different on Elance too. On most freelancing sites the client is trying to get the job done for the smallest amount of money possible, and so is often willing to spend time sifting through dozens of proposals, hoping to get lucky. Elance seems to attract clients who consider their time valuable and are willing to pay a premium for good service.
Often clients contact me directly through Elance because I am a native English speaker and they want to avoid potential communication problems. One client even asked me to double my bid because "we are not cheap!"

After a year of freelancing I now get the majority of work directly through my website, but still get a decent percentage of clients through Elance.
My advice for new freelancers - focus on building your Elance profile and don't waste your time with the others. (Though do let me know if you have had good experience elsewhere.)

Sunday, July 25, 2010

All your data are belong to us?

Regarding the title of this blog, "All your data are belong to us" - I realized not everyone gets the reference. See this Wikipedia article for an explanation.

Saturday, July 10, 2010

Caching crawled webpages

When crawling a website I store the HTML in a local cache so if I need to rescrape the website later I can load the webpages quickly from my local cache and avoid extra load on their website server. This is often necessary when a client realizes they require additional features scraped.

I built the pdict library to manage my cache. Pdict provides a dictionary like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come builtin with Python (2.5+) so there are no external dependencies.

Here is some example usage of pdict:
>>> from webscraping.pdict import PersistentDict
>>> url1, url2 = 'http://example.com/1', 'http://example.com/2'
>>> cache = PersistentDict('cache.db')
>>> cache[url1] = '<html>page 1</html>'
>>> cache[url2] = '<html>page 2</html>'
>>> url1 in cache
True
>>> cache[url1]
'<html>page 1</html>'
>>> cache.keys()
['http://example.com/1', 'http://example.com/2']
>>> del cache[url1]
>>> url1 in cache
False

Thursday, July 1, 2010

Fixed fee or hourly?

I prefer to quote per project rather than per hour for my web scraping work because it:
  • gives me incentive to increase my efficiency (by improving my infrastructure)
  • gives the client security about the total cost
  • avoids distrust about the number of hours actually worked
  • makes me look more competitive compared to the hourly rates available in Asia and Eastern Europe
  • avoids the difficulty of tracking time fairly when working on two or more projects simultaneously
  • lets me estimate complexity from past experience, at least compared to building websites
  • involves less administration

Saturday, June 12, 2010

Open sourced web scraping code

For most scraping jobs I use the same general approach of crawling, selecting the appropriate nodes, and then saving the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is available open sourced on Google Code.

The code in that repository is licensed under the LGPL, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library. This is different from the more popular GPL license, which would make the library unusable in most commercial projects. It is also different from BSD and WTFPL style licenses, which let people do whatever they want with the library, including making changes and not releasing them.

I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.

Monday, May 3, 2010

Why web2py

In a previous post I mentioned that web2py is my weapon of choice for building web applications.
Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but I find myself most productive with web2py.


This is because web2py:
  • uses a pure Python templating system without restrictions - "we're all consenting adults here"
  • supports database migrations
  • has automatic form generation and validation with SQLFORM (see the sketch after this list)
  • runs on Google App Engine without modification
  • has a highly active and friendly user forum
  • develops rapidly - feature requests are often implemented and committed to trunk within a day
  • supports multiple apps for a single install
  • lets you develop apps through the browser admin interface
  • commits to backward compatibility
  • has no configuration files or dependencies - works out of the box
  • has sensible defaults for view templates, imported modules, etc
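
As a small illustration of the SQLFORM and migration points, here is a minimal sketch - the 'person' table is made up, and the DAL, Field, SQLFORM, IS_NOT_EMPTY, request, session and response names are provided by web2py's execution environment rather than imported:

# model: defining the table doubles as a migration - web2py alters the database to match
db = DAL('sqlite://storage.sqlite')
db.define_table('person', Field('name', requires=IS_NOT_EMPTY()))

# controller: SQLFORM builds and validates the insert form automatically
def index():
    form = SQLFORM(db.person)
    if form.accepts(request.vars, session):
        response.flash = 'record inserted'
    return dict(form=form)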

The downsides:
  • highly dependent on Massimo (the project leader)
  • the name web2py is unattractive compared to rails, pylons, web.py, etc
  • few designers, so the example applications look crude
  • inconsistent scattered documentation [online book now available here!]

Thursday, April 15, 2010

Why Google App Engine

In the previous post I covered three alternative approaches to regularly scrape a website for a client, with the most common one being in the form of a web application. However hosting the web application on either my own or the clients server has problems.

My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:

Pros:
  • provides a stable and consistent platform that I can use for multiple applications
  • both the customer and I can log in and manage it, so we do not need to expose our own servers
  • has generous free quotas, which I rarely exhaust
Cons:
  • only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (yet)
  • limitations on maximum job time and interacting with the database
  • have to trust Google with storing our scraped data
Often deploying on GAE works well for both the client and me, but it is not always practical/possible. I'm still looking for a silver bullet!

Monday, April 12, 2010

Scraping dynamic data

Usually my clients ask for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However sometimes a client needs a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second, stock prices.

I have three solutions for periodically scraping a website:
  1. I provide the client with my web scraping code, which they can then execute regularly 
  2. Client pays me a small fee in future whenever they want the data rescraped
  3. I build a web application that scrapes regularly and provides the data in a useful form

The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.

The second option is generally not attractive to the client because it puts them in a weak position where they are dependent on me being contactable and cooperative in future.
Also it involves ongoing costs for them.

So usually I end up building a basic web application that consists of a CRON job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server, however most clients prefer the security of hosting it on their own server in case the app breaks down.

Unfortunately I find hosting on their server does not work well because they will have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment.
I can usually figure this all out but it takes time and also trust from the client to give me root privilege on their server. And given that these web applications are generally low cost (~ $1000) the ease of deployment is important.

All this is far from ideal. The solution? - see my next post.

Saturday, March 27, 2010

Scraping Flash based websites

Flash is a pain. It is flaky on Linux and cannot be scraped like HTML because it uses a binary format. HTML5 and Apple's criticism of Flash are good news for me because they encourage developers to try non-Flash solutions.

The reality is though that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

  1. Check for AJAX requests between the Flash app and the server that may carry the data you are after
  2. Extract text with the Macromedia Flash Search Engine SDK
  3. Use OCR to extract the text directly

Most Flash apps are self-contained and so don't use AJAX, which rules out (1). And I have had poor results with (2) and (3).

Still no silver bullet...

Tuesday, March 16, 2010

I love AJAX!

AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example Gmail uses AJAX to load new messages. The old way to do this was reloading the webpage and then embedding the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather than just the updated data.
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are easier to scrape.

The trouble with scraping websites is they obscure the data I am after within a layer of presentation. However AJAX calls typically return just the data in an easy to parse format like JSON or XML. So effectively they provide an API to their backend database.

These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.
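
As a sketch of that workflow (the endpoint URL, header, and JSON field names here are made up - in practice they come from whatever Firebug shows for the site in question):

import json
import urllib2

# hypothetical AJAX endpoint spotted in Firebug; change the query parameters to fetch other records
url = 'http://example.com/search.json?q=web+scraping&page=2'
request = urllib2.Request(url, headers={'X-Requested-With': 'XMLHttpRequest'})
data = json.loads(urllib2.urlopen(request).read())
for record in data['results']:  # field names depend on the site
    print record['title']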

Friday, March 12, 2010

Scraping JavaScript webpages with webkit

In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:
  1. requires me to program in JavaScript rather than my beloved Python (with all its great libraries)
  2. is slow because I have to wait for Firefox to render the entire webpage
  3. is somewhat buggy and has a small user/developer community, mostly at MIT
An alternative solution that addresses all these points is webkit, which is an open source browser engine used most famously in Apple's Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing any JavaScript) and then gives access to the final HTML:


import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *


class Render(QWebPage):
  def __init__(self, url):
    # the QApplication must be created before any other Qt objects
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()  # block here until the page has finished loading

  def _loadFinished(self, result):
    # the page (and its JavaScript) has finished - keep the frame and stop the event loop
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()  # the rendered HTML, after JavaScript has executed


I can then analyze this resulting HTML with my standard Python tools like the webscraping module.

Tuesday, March 2, 2010

Scraping JavaScript based web pages with Chickenfoot

The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However some webpages load their content dynamically with JavaScript after the page loads so that the desired data is not found in the original HTML. This is usually done for legitimate reasons such as loading the page faster, but in some cases is designed solely to inhibit scrapers.
This can make scraping a little tougher, but not impossible.

The easiest case is where the content is stored in JavaScript structures which are then inserted into the DOM at page load. This means the content is still embedded in the HTML but needs to instead be scraped from the JavaScript code rather than the HTML tags.

A more tricky case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and then run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One such tool to do this is the Firefox Chickenfoot extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of high level functions to make interaction and navigation easier.

To get a feel for Chickenfoot here is an example to crawl a website:


// crawl given website url recursively to given depth
function crawl(website, max_depth, links) {
  if(!links) {
    links = {};
    go(website);
    links[website] = 1;
  }

  // TODO: insert code to act on current webpage here
  if(max_depth > 0) {
    // iterate links
    for(var link=find("link"); link.hasMatch; link=link.next) {  
      url = link.element.href;
      if(!links[url]) {
        if(url.indexOf(website) == 0) {
          // same domain
          go(url);
          links[url] = 1;
          crawl(website, max_depth - 1, links);
        }
      }
    }
  }
  back(); wait();
}

This is part of a script I built on my Linux machine for a client on Windows and it worked fine for both of us.
To find out more about Chickenfoot check out their video.

Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.

Monday, February 8, 2010

How to crawl websites without being blocked

Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them, however they don't (generally) want to be crawled by others. One such company is Google, ironically.

Some websites will actively try to stop scrapers so here are some suggestions to help you crawl beneath their radar.


Speed
If you download 1 webpage a day then you will not be blocked, but your crawl would take too long to be useful. If you instead used threading to crawl multiple URLs asynchronously then they might mistake you for a DoS attack and blacklist your IP. So what is the happy medium? The Wikipedia article on web crawlers currently states "Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes." This is a little slow and I have found 1 download every 5 seconds is usually fine. If you don't need the data quickly then use a longer delay to reduce your risk and be kinder to their server.

Identity
Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot (only for the brave): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

If you have access to multiple IP addresses (for example via proxies) then distribute your requests among them so that it appears your downloading comes from multiple users.

Consistency
Avoid accessing webpages sequentially: /product/1, /product/2, etc. And don't download a new webpage exactly every N seconds.
Both of these mistakes can attract attention to your downloading because a real user browses more randomly. So make sure to crawl webpages in an unordered manner and add a random offset to the delay between downloads.
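
Here is a short sketch pulling these suggestions together - a browser-like user-agent, a randomized delay, and an unordered crawl (the example.com URLs and the user-agent string are placeholders):

import random
import time
import urllib2

def download(url, base_delay=5):
    """Download a URL with a browser-like user-agent and a randomized delay."""
    time.sleep(base_delay + random.uniform(0, base_delay))  # never exactly every N seconds
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0'}
    return urllib2.urlopen(urllib2.Request(url, headers=headers)).read()

# crawl product pages in a random order rather than sequentially
urls = ['http://example.com/product/%d' % i for i in range(1, 11)]
random.shuffle(urls)
for url in urls:
    html = download(url)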


Following these recommendations will allow you to crawl most websites without being detected.

Friday, February 5, 2010

How to protect your data

You spent time and money collecting the data in your website so you want to prevent someone else downloading and reusing it. However you still want Google to index your website so that people can find you.

This is a common problem. Below I will outline some strategies to protect your data.


Restrict
Firstly if your data really is valuable then perhaps it shouldn't be all publicly available. Often websites will display the basic data to standard users / search engines and the more valuable data (such as emails) only to logged in users. Then the website can easily track and control how much valuable data each account is accessing.

If requiring accounts isn't practical and you want search engines to crawl it then realistically you can't prevent it being scraped, but you can discourage scrapers by setting a high enough barrier.

Obfuscate
Scrapers typically work by downloading the HTML for a URL and then extracting out the desired content. To make this process harder you can obfuscate your valuable data.

The simplest way to obfuscate your data is have it encoded on the server and then dynamically decoded with JavaScript in the client's browser. The scraper would then need to decode this JavaScript to extract the original data. This is not difficult for an experienced scraper, but would atleast provide a small barrier.
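
For example, if a site stored email addresses reversed in the HTML and un-reversed them with JavaScript on page load, the scraper-side decode is a one-liner (a made-up example):

import re

html = '<span class="email">moc.elpmaxe@ofni</span>'  # hypothetical obfuscated markup
encoded = re.search(r'<span class="email">(.*?)</span>', html).group(1)
print encoded[::-1]  # info@example.com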

A better way is to encapsulate the key data within images or Flash. Optical Character Recognition (OCR) techniques would then be needed to extract the original data, which requires a lot of effort to do accurately. (Make sure the URL of the image does not reveal the original data, as one website did!) The free OCR tools that I have tested are at best 80% accurate, which makes the resulting data useless.
The tradeoff with encoding data in images is that there is more data for the client to download, and it prevents genuine users from conveniently copying the text. For example people often display their email address within an image to combat spammers, which then forces everyone else to type it out manually.

Challenge
A popular way to prevent automated scrapers is by forcing users to pass a CAPTCHA. For example Google does this when it gets too many search requests from the same IP within a timeframe. To avoid the CAPTCHA the scraper could proceed slowly, but they probably can't afford to wait. To speed up this rate they may purchase multiple anonymous proxies to provide multiple IP addresses, but that is expensive - 10 anonymous proxies will cost ~$30 / month to rent. The CAPTCHA can also be solved automatically by a service like deathbycaptcha.com. This takes some effort to setup so would only be implemented by experienced scrapers for valuable data.

CAPTCHA is not a good solution for protecting your content - they annoy genuine users, can be bypassed by a determined scraper, and additionally make it difficult for the Google Bot to index your website properly. They are only a good solution when being indexed by Google is not a priority and you want to stop most scrapers.

Corrupt
If you are suspicious of an IP that is accessing your website you could block it, but then they would know they have been detected and try a different approach. Instead you could allow the IP to continue downloading but return incorrect text or figures. This should be done subtly so that it is not clear which data is correct, and their entire data set will be corrupted. Perhaps they won't notice and you will be able to track them down later by searching for "purple monkey dishwasher" or whatever other content was inserted!

Structure
Another factor that makes sites easy to scrape is when they use a URL structure like:
domain/product/product_title/product_id
For example these two URLs point to the same content on Amazon:
http://www.amazon.com/Lets-Go-Australia-10th-Inc/dp/0312385757
http://www.amazon.com/FAKE_TITLE/dp/0312385757
The title is just to make the URL look pretty. This makes the site easy to crawl because a scraper can simply iterate through all the IDs (in this case ISBNs). If the title had to be consistent with the product ID then it would take more work to scrape.

Google
All of the above strategies could be relaxed for the Google Bot to ensure your website is properly indexed. Be aware that anyone can pretend to be the Google Bot by setting their user-agent to "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", so to be confident you should also verify the IP address via a reverse DNS lookup. Be warned that Google has been known to punish websites that display different content to their bot than to regular users.
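
A sketch of that verification in Python - reverse lookup, check the domain, then confirm the hostname resolves back to the same IP:

import socket

def is_googlebot(ip):
    """Check a claimed Googlebot request via reverse then forward DNS lookup."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse DNS lookup
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        # forward-confirm: the hostname must resolve back to the original IP
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.error:
        return False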

 
In the next post I will take the opposite point of view of someone trying to scrape a website.

Tuesday, February 2, 2010

Why Python


Sometimes people ask why I use Python instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to be downloaded rather than for program instructions to finish. So it makes sense to use whatever language I can write good code fastest in, which is currently Python because of its high level syntax and wonderful libraries. ESR wrote an article on why he likes Python that I expect resonates with many.

Additionally Python is an interpreted language, so it is easier for me to distribute my solutions to clients than it would be for a compiled language like C. Most of my scraping jobs are relatively small so distribution overhead is important.


A few people have suggested I use Ruby instead. I have used Ruby and like it, but found it lacks the depth of libraries available to Python.
 
However Python is by no means perfect - for example there are limitations with threading, using unicode is awkward, and distributing on Windows can be difficult. And there are also many redundant or poorly designed built-in libraries.
Some of these issues are being addressed in Python 3, some not.



Friday, January 29, 2010

The SiteScraper library

As a student I was fortunate to have the opportunity to learn about web scraping, guided by Professor Timothy Baldwin. I aimed to build a tool to make scraping web pages easier, resulting from frustration with a previous project.

My idea for this tool was that it should be possible to train a program to scrape a website by just giving the desired outputs for some example webpages. The program would build a model of how to extract this content and then this model could be applied to scrape other webpages that used the same template.

The tool was eventually called SiteScraper and is available for download on Google Code. For more information have a browse of this paper, which covers the implementation and results in detail.

I use SiteScraper for much of my scraping work and often make updates based on experience gained from a project.

Wednesday, January 20, 2010

Web scraping with regular expressions

Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions.
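
That implementation is linked above; roughly, the three test functions take the following shape (a sketch only - rerunning it will not reproduce the exact numbers below):

import re
import time
import urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml

def regex_test(page):
    return re.search('<title>(.*?)</title>', page, re.DOTALL).group(1)

def lxml_test(page):
    return lxmlhtml.fromstring(page).xpath('//title/text()')[0]

def bs_test(page):
    return BeautifulSoup(page).title.string

page = urllib2.urlopen('http://sitescraper.net').read()
for test in (regex_test, lxml_test, bs_test):
    start = time.time()
    for _ in range(100):
        test(page)
    print '%s took %.3f ms' % (test.__name__, 1000 * (time.time() - start))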

The results are:

regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms


That means for this use case lxml takes over 40x longer than regular expressions and BeautifulSoup over 1000x! This is because lxml and BeautifulSoup parse the entire document into their internal format, when only the title is required.

XPaths are very useful for most web scraping tasks, but there still is a use case for regular expressions.

Tuesday, January 5, 2010

How to use XPaths robustly

In an earlier post I referred to XPaths but did not explain how to use them.

Say we have the following HTML document:
<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>

To access the list elements we follow the HTML structure from the root tag down to the li's:
html > body > 2nd div > ul > many li's.

An XPath to represent this traversal is:
/html[1]/body[1]/div[2]/ul[1]/li

If a tag has no index then every tag of that type will be selected:
/html/body/div/ul/li

XPaths can also use attributes to select nodes:
/html/body/div[@id="content"]/ul/li

And instead of using an absolute XPath from the root the XPath can be relative to a particular node by using double slash:
//div[@id="content"]/ul/li
This is more reliable than an absolute XPath because it can still locate the correct content after the surrounding structure is changed.

There are other features in the XPath standard but the above are all I use regularly.
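
Both XPaths can be checked quickly with lxml:

from lxml import html

doc = html.fromstring("""
<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>""")
print doc.xpath('/html/body/div[2]/ul/li/text()')     # absolute XPath with indices
print doc.xpath('//div[@id="content"]/ul/li/text()')  # relative XPath on the id attribute
# both print ['First item', 'Second item']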



A handy way to find the XPath of a tag is with Firefox's Firebug extension. To do this open the HTML tab in Firebug, right click the element you are interested in, and select "Copy XPath". (Alternatively use the "Inspect" button to select the tag.)
This will give you an XPath with indices only where there are multiple tags of the same type, such as:
/html/body/div[2]/ul/li
One thing to keep in mind is Firefox will always create a <tbody> tag within tables whether it existed in the original HTML or not. I need to be reminded of this often!

For one-off scrapes the above XPath should be fine. But for long term repeat scrapes it is better to use a relative XPath around an ID element with attributes instead of indices. From my experience such an XPath is more likely to survive minor modifications to the layout. However for a more robust solution see my SiteScraper library, which I will introduce in a later post.

Saturday, January 2, 2010

Parsing HTML with Python

HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

Unfortunately the HTML of many webpages around the internet is invalid - for example a list element may be missing a closing tag:
<ul>
<li>abc</li>
<li>def
<li>ghi</li>
</ul>
but it still needs to be interpreted as:
  • abc
  • def
  • ghi
This means we can't naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as BeautifulSoup, lxml, html5lib, and libxml2dom.
The best known and most widely used of these libraries seems to be BeautifulSoup - a Google search for Python web scraping module currently returns BeautifulSoup as the first result.
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.

You will need version 2 onwards of lxml, which includes the html module. On Ubuntu releases up to 8.10, which shipped with an earlier version, this meant compiling lxml myself.

Here is an example how to parse the previous broken HTML with lxml:
>>> from lxml import html
>>> tree = html.fromstring('<ul><li>abc</li><li>def<li>ghi</li></ul>')
>>> tree.xpath('//ul/li')
[<Element li at 959553c>, <Element li at 95952fc>, <Element li at 959544c>]