Thursday, April 16, 2009

Lessons from the field: Remember the trees and your relational data!

Some days before my vacation I was called to assist with a possible performance problem at a customer. To get a first impression I asked for the table definition, the indexes, and the query in question.

The table had several relational and one XML column. No problem, typical scenario. The indexes presented to me were a couple XML indexes. All looked OK to support the query in question which was similar to:

select col1, col2, ..., xmlCol
from tableInQuestion
where col1='someString'
and xmlexists('$XMLCOL/department/employee/name[first="Henrik"')
and xmlexists('$XMLCOL/department/employee/name[last="Loeser"')


When people with relational background learn to write queries with XML expressions, they repeatedly hear to watch out for XMLEXISTS and to write the comparison correctly (do not produce a boolean value!). This was done correctly in the query above. But I noticed a semantic issue. The problem is that you are not searching for XML documents that has one employee named "Henrik" and one that has a lastname of "Loeser", but for an employee named "Henrik Loeser". Because the predicates should be on the same employee, the query needs to be rewritten to something like the one below. Both comparisons are within a single XMLEXISTS, both on the same employee.

select col1, col2, ..., xmlCol
from tableInQuestion
where col1='someString'
and xmlexists('$XMLCOL/department/employee/name[first="Henrik" and last="Loeser"')



When writing queries against databases with XML data, keep in mind that the data is not normalized, i.e., you have an XML document with possibly multiple instances of a data object (such as employees).

The above did not cause the performance problem. Some questions later from my side we found that the team had focused too much on getting the XML part right and forgot to create indexes on the relational columns...