Tuesday, March 10, 2009

Hundreds of indexes on a single XML column - why not?

Last week at the German Database Conference "BTW" (see here for a history) I presented some work on how to address XML Index Challenges (a paper written with two colleagues). More and more enterprises face the challenge of logging all of their activities. Since SOA and Web Services are becoming the standard for new applications, there is a trend toward XML-based logs as well.

The scenario in the paper is based on what was observed at several banking customers. The internet banking system is required to log every event in any of its internet banking applications - we are talking about 10 to 20 million inserts a day. Events include clicks that take a user to a new web page or dialog, entry of user data, and every click that initiates a banking transaction. This "logging" happens across a set of diverse applications such as checking accounts, loans, investment management, and others. The log information can vary widely, so very flexible storage is required. So far, the data has been stored in a relational table with a few columns and the log record itself as a VARCHAR.
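To picture the current setup, a relational-only log table might look roughly like this (the table and column names are made up for illustration, not taken from the customer schema):

CREATE TABLE banking_log_old (
   logged_at   TIMESTAMP     NOT NULL,
   app_name    VARCHAR(32)   NOT NULL,
   log_record  VARCHAR(4000) NOT NULL   -- the variable part of the log entry, as one big string
);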

The key problem with a relational-only approach is that the variable part of the log entries is hard to query with adequate performance. Current relational database technology does not allow easy indexing of individual pieces of a string in a VARCHAR column. The applications that read this data typically use SQL "LIKE" predicates on the VARCHAR values, which results in limited queryability, limited index usage, and sub-optimal performance. The solution is to use XML as the log format and to make use of DB2 pureXML's hybrid capabilities, i.e., to store the logs in an XML column and some other information (like a timestamp) in relational columns.
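With pureXML, the hybrid version of the same table could be sketched like this (again with hypothetical names; the point is simply the mix of relational columns and an XML column):

CREATE TABLE banking_log (
   logged_at  TIMESTAMP   NOT NULL,
   app_name   VARCHAR(32) NOT NULL,
   event      XML         NOT NULL    -- the complete XML log record
);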

What is the challenge?
The challenge is not the heavy insert workload, as was demonstrated with the 1 Terabyte XML benchmark; it is the indexing. Given that there are different log record formats (different XML document types) for the different applications, and that a heavy volume of data is coming in, you want indexes on all important elements and attributes to find the needle in the haystack. If we think of only a few dozen application types and typically 10 to 20 indexes per type, we are already in the hundreds of indexes.
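Just to illustrate what such path-specific indexes could look like, here are a few made-up definitions against the hypothetical table from above; the document types and paths are invented for the example:

-- indexes for a (hypothetical) login event document type
CREATE INDEX idx_login_user ON banking_log(event)
   GENERATE KEY USING XMLPATTERN '/loginEvent/userId' AS SQL VARCHAR(40);

-- indexes for a (hypothetical) transfer event document type
CREATE INDEX idx_transfer_amount ON banking_log(event)
   GENERATE KEY USING XMLPATTERN '/transferEvent/amount' AS SQL DOUBLE;
CREATE INDEX idx_transfer_account ON banking_log(event)
   GENERATE KEY USING XMLPATTERN '/transferEvent/targetAccount/@iban' AS SQL VARCHAR(34);

Documents that do not contain a given path simply contribute no entries to that index; multiply this by a few dozen application types and you quickly reach hundreds of index definitions on the one XML column.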

Does it work?
The good news is that DB2 pureXML is able to handle it without problems. First of all, DB2 allows users to index only what is needed (see the Information Center and Tip 8 in the DB2 pureXML 15 best practices). Other systems require a primary index, a secondary index, side tables, or whatever it may be called, and this overhead would slow down the system. DB2 directly extracts the data to index, without any such prerequisite.
Secondly, even if there are hundreds of indexes, only a dozen or so match a given document. Still, taking care of the hundreds of non-qualifying indexes could be an issue. Fortunately, it is not in DB2; how this is achieved was part of my presentation and is described in the conference paper. As the number of non-matching indexes increases, there is only a small performance penalty: the DB2 server, on a machine with four dual-core 1.9 GHz POWER5 processors, was able to sustain close to 100,000 inserts per minute.
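To give an idea of the kind of query these indexes are meant to speed up, here is a sketch using the hypothetical table and indexes from above (the account number is just a placeholder):

SELECT logged_at, event
FROM   banking_log
WHERE  logged_at >= '2009-03-09-00.00.00'
  AND  logged_at <  '2009-03-10-00.00.00'
  AND  XMLEXISTS('$e/transferEvent[targetAccount/@iban = "DE00123456780000000000"]'
                 PASSING event AS "e");

The timestamp range is handled by the relational part of the hybrid table, while the XMLEXISTS predicate can be evaluated using the idx_transfer_account index instead of scanning the stored documents.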

How many indexes are ok?
Earlier I had asked about the right number of indexes. The answer can be (over-)simplified by saying that you should create as many as are needed to help query performance. As we have learned, we don't need to worry too much about non-matching indexes. With that, we are back to the relational index maintenance characteristics most of us already know.