Tuesday, January 5, 2010

How to use XPaths robustly

In an earlier post I referred to XPaths but did not explain how to use them.

Say we have the following HTML document:
<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>

To access the list elements we follow the HTML structure from the root tag down to the li's:
html > body > 2nd div > ul > many li's.

An XPath to represent this traversal is:
/html[1]/body[1]/div[2]/ul[1]/li

If a step has no index then every matching tag at that level will be selected:
/html/body/div/ul/li
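
As a rough sketch, the two XPaths above can be tried in Python with the standard library's xml.etree.ElementTree. This is my assumption for illustration, not the tool used in this post. ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so the leading /html step is dropped here. The code uses a well-formed copy of the sample document, since this parser only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample document from this post.
HTML = """<html>
 <body>
  <div></div>
  <div id="content">
   <ul>
    <li>First item</li>
    <li>Second item</li>
   </ul>
  </div>
 </body>
</html>"""

root = ET.fromstring(HTML)  # root is the <html> element

# Counterpart of /html[1]/body[1]/div[2]/ul[1]/li, written relative to <html>.
indexed = root.findall("./body[1]/div[2]/ul[1]/li")

# Counterpart of /html/body/div/ul/li: no indices, so every div is searched.
unindexed = root.findall("./body/div/ul/li")

indexed_text = [li.text for li in indexed]
unindexed_text = [li.text for li in unindexed]
print(indexed_text)
print(unindexed_text)
```

Both queries return the same two items for this document, because the first div contains no list. On a page where several divs hold lists, the unindexed form would return the items from all of them.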

XPaths can also use attributes to select nodes:
/html/body/div[@id="content"]/ul/li

Instead of an absolute XPath from the root, an XPath can begin with a double slash, which matches the node wherever it appears in the document. I will call this a relative XPath:
//div[@id="content"]/ul/li
This is more reliable than an absolute XPath because it can still locate the correct content after the surrounding structure is changed.
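
To illustrate that claim, here is a minimal sketch that runs an index-based path and an ID-based path against the sample document and against a redesigned version of it. The redesigned markup, the items helper, and the use of Python's xml.etree.ElementTree are all assumptions made for the example. ElementTree evaluates paths relative to the root element, so .//div stands in for //div.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<html><body>
<div></div>
<div id="content"><ul><li>First item</li><li>Second item</li></ul></div>
</body></html>"""

# Hypothetical redesign: a banner div is added and the content div is wrapped.
REDESIGNED = """<html><body>
<div id="banner"></div>
<div></div>
<div id="wrapper">
<div id="content"><ul><li>First item</li><li>Second item</li></ul></div>
</div>
</body></html>"""

INDEXED = "./body/div[2]/ul/li"        # position-based, relative to <html>
BY_ID = ".//div[@id='content']/ul/li"  # attribute-based, matches at any depth


def items(markup, path):
    """Return the text of every element that path matches in markup."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


results = {
    "original/indexed": items(ORIGINAL, INDEXED),
    "original/by_id": items(ORIGINAL, BY_ID),
    "redesigned/indexed": items(REDESIGNED, INDEXED),
    "redesigned/by_id": items(REDESIGNED, BY_ID),
}
for name, found in results.items():
    print(name, found)
```

After the redesign the second div is the empty one, so the index-based path finds nothing, while the ID-based path still returns both items.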

There are other features in the XPath standard but the above are all I use regularly.



A handy way to find the XPath of a tag is with Firefox's Firebug extension. To do this open the HTML tab in Firebug, right click the element you are interested in, and select "Copy XPath". (Alternatively use the "Inspect" button to select the tag.)
This will give you an XPath with indices only where there are multiple tags of the same type, such as:
/html/body/div[2]/ul/li
One thing to keep in mind is that Firefox will always create a <tbody> tag within tables, whether it existed in the original HTML or not, so a copied XPath that contains tbody may not match the raw page source. I need to be reminded of this often!
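
The <tbody> problem can be sketched as follows. The table markup and the strip_tbody helper are hypothetical, and the example again assumes Python's xml.etree.ElementTree with paths written relative to <html>. Removing tbody steps only makes sense when the original HTML really has no <tbody> tag.

```python
import re
import xml.etree.ElementTree as ET

# Raw source as the server might send it: the table has no <tbody>.
RAW = "<html><body><table><tr><td>cell</td></tr></table></body></html>"


def strip_tbody(xpath):
    """Hypothetical helper: drop tbody steps (with an optional index)."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


root = ET.fromstring(RAW)

# A path of the kind a browser would produce, rewritten relative to <html>.
copied = "./body/table/tbody/tr/td"
cleaned = strip_tbody(copied)

with_tbody = [td.text for td in root.findall(copied)]
without_tbody = [td.text for td in root.findall(cleaned)]
print(cleaned)
print(with_tbody)
print(without_tbody)
```

The copied path matches nothing in the raw source, while the cleaned path finds the cell.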

For one-off scrapes the above XPath should be fine. But for long-term repeat scrapes it is better to use a relative XPath anchored on an element with an ID, using attributes instead of indices. In my experience such an XPath is more likely to survive minor modifications to the layout. However, for a more robust solution see my SiteScraper library, which I will introduce in a later post.
