Tuesday, April 28, 2009

stringIDs in DB2 pureXML: What and Why

Earlier this month I had asked you about stringIDs in DB2 pureXML. Answer B) from the provided options is correct. DB2 replaces structural information such as element names, attribute names, namespace prefixes and URIs with stringIDs. But what are stringIDs and why are they used?

When we look at a simple XML document like
<department>
<employee>
<firstname>Henrik</firstname>
<lastname>Loeser</lastname>
</employee>
</department>

and an XPath expression like "/department/employee/lastname", then - first of all - we see a lot of strings. The strings are of different length and the tags (the markup) make up most of the document. And we haven't even introduced namespaces here.

How do you efficiently store such documents? XML can be very verbose compared to relational data. How do you quickly as possible navigate within such documents, i.e., compare the different steps of your XPath expression to the different levels of the XML document?

The key to compactness and speed is the use of stringIDs. For DB2 pureXML every element name, attribute name, namespace URI, and namespace prefix is substituted by a 32 bit integer value when an XML document gets parsed. Each string is mapped to a unique number, a so-called stringID.

In the above example, all "department" could be replaced by 1, all "employee" by 2, etc. When the DB2 engine compiles a query and generates an executable package it also uses the stringIDs. This way, when at runtime the XPath expression is evaluated on the data, only integer values need to be compared. First we need to match the root element, i.e., look for the element name with value 1 ("department"). If we found one, the child needs to be a 2 ("employee"). Comparing integer values is of course much faster than comparing strings of variable length.

How fast the XQuery execution is in DB2 pureXML can be seen when you look at the TPoX benchmark results or by reading some of the customer success stories collected at the pureXML wiki.