When we look at a simple XML document like
<department>
<employee>
<firstname>Henrik</firstname>
<lastname>Loeser</lastname>
</employee>
</department>
and an XPath expression like "/department/employee/lastname", then - first of all - we see a lot of strings. The strings are of different length and the tags (the markup) make up most of the document. And we haven't even introduced namespaces here.
How do you efficiently store such documents? XML can be very verbose compared to relational data. How do you quickly as possible navigate within such documents, i.e., compare the different steps of your XPath expression to the different levels of the XML document?
The key to compactness and speed is the use of stringIDs. For DB2 pureXML every element name, attribute name, namespace URI, and namespace prefix is substituted by a 32 bit integer value when an XML document gets parsed. Each string is mapped to a unique number, a so-called stringID.
In the above example, all "department" could be replaced by 1, all "employee" by 2, etc. When the DB2 engine compiles a query and generates an executable package it also uses the stringIDs. This way, when at runtime the XPath expression is evaluated on the data, only integer values need to be compared. First we need to match the root element, i.e., look for the element name with value 1 ("department"). If we found one, the child needs to be a 2 ("employee"). Comparing integer values is of course much faster than comparing strings of variable length.
How fast the XQuery execution is in DB2 pureXML can be seen when you look at the TPoX benchmark results or by reading some of the customer success stories collected at the pureXML wiki.