Monday, July 20, 2009

Relational vs. pureXML: Some indexing differences

You probably know how relational indexing works. When you create an index on a column, then for every row in the table there is an index entry. There is a 1:1 relationship between the rows and the index entries.

In the XML world things can be quite different. Let's imagine that we have XML documents where as part of other personal information the names of our siblings are listed ("CREATE TABLE persons(id INT, doc XML)"). The XML fragment could like the following for a single sibling (2 kids for the parents):
<siblings>
<name>Sarah</name>
</siblings>

If we would like to index the sibling names, we could do it in DB2 the following way:
CREATE INDEX sibNamesIdx ON persons(doc)
GENERATE KEY USING XMLPATTERN
'/siblings/name' as SQL VARCHAR(40)

Because we have one sibling in the above example, there would be a single entry in the index - the same as in relational. But what happens if you are an only child (<siblings/>)? Or what if you have multiple brothers and sisters?

<siblings>
<name>Sam</name>
<name>Sarah</name>
<name>Sean</name>
<name>Siegfried</name>
<name>Simon</name>
<name>Susie</name>
<name>Sydney</name>
</siblings>

For an only child there wouldn't be a sibling name and hence no index entry. For multiple siblings there would be an index entry each, resulting in multiple index entries for a single row. This is quite different from the relational world.

What may look strange or at least unusual to you has some benefits to you. In the relational world you would need to model such data with joined-in side tables or with sparse columns. In the latter case you would manage NULL values in your index. In the XML world you only have data that is really present represented in your indexes, thus keeping them small.