Data Henrik: pureXML Indexing - Behind the scenes

Tuesday, August 10, 2010

pureXML Indexing - Behind the scenes

I was asked whether I could explain a little bit how the different XML indexes are used for performing an index scan. Well, this probably requires several blog entries, but today I at least try to cover the basics.

When you create a table with an XML column, DB2 automatically creates two different XML indexes. One is the XML Path index, the other the XML Region Index. The INDEXTYPE in SYSCAT.INDEXES lists them as XPTH and XRGN. If you create a table with more than one XML column, you will notice that there is a XPTH for each XML column, but only one XRGN for the entire table. The path index maps an encoded path (e.g., /root/department/employee) to a unique pathID, an integer value. The pathID is unique within the entire database. The region index is used to manage parts (regions) of an XML document, so that documents can be larger than a single data page.

Similar to indexes on non-XML columns, users can create indexes on XML columns utilizing some features of the CREATE INDEX statement. Thereafter, the system catalog might look like shown in this article in the Information Center. But how are those indexes used?

One of the XML-related operators is XISCAN. When you perform an index scan using a non-XML index, an IXSCAN operator shows up in the access plan. For XML data, because some more effort is needed, the operator is named XISCAN, XML Index Scan. Once the optimizer decides that an index scan should be used, some auxiliary information to set up the scan is needed, including of course the start or stop keys for the values in the predicate (similar to regular b-tree access). What is needed in addition is the pathID of the path to search for. In the XML world we are not comparing "id=4711", but the value at "/root/department/employee/id" needs to be 4711.

If the entire absolute path is known when the query is compiled, the path can be turned into a pathID. However, this is not the case when wildcards (e.g., "//employee/id") are used. That's when the XML path index (XPTH) comes in handy. Because it stores the encoded paths in reverse order (our example would be "id/employee/department/root" in clear text) it is possible to find all paths ending (now starting) with "id" or "employee/id". To retrieve the pathIDs, at runtime an index scan on the path index is performed and the range is X-locked. With the locking DB2 can prevent any new path to be created during the actual index scan for the query predicate.

Once the pathIDs are known, either from the compile phase or the index look-up, the actual scan on the XML index (XVIP type in the catalog) can be performed and docIDs, nodeIDs (to pinpoint the node having that desired value) and RIDs (record identifiers) are obtained from the b-tree. The docIDs identify the qualifying XML documents, the RIDs point to the records (rows) in the table which the XML documents are part of. Now, based on the docID and nodeID, the part of the document can be located using the XML region index. Then, regular XML processing kicks in, e.g., the XSCAN operator or others could be applied.

Note that the above description covers the standard case and that optimizations for some cases are possible. There is also the XANDOR operator which speeds up index evaluation when two or more indexes for the same column can be used. But that's another blog post. Stay tuned...

Pages

Tuesday, August 10, 2010

pureXML Indexing - Behind the scenes