Tuesday, February 10, 2009

XML Database Benchmarks, TPoX, and DB2 pureXML

I recently got asked why DB2 pureXML is using the TPoX benchmark and whether TPoX is an official benchmark similar to those from TPC or SPEC. I will try to answer that question today.

TPoX stands for "Transaction Processing over XML" and is an open source database benchmark which is available at SourceForge.net. It originated from IBM, but other sources, most significantly Intel, have been contributing to the benchmark. TPoX is an application-level XML database benchmark based on a (real) financial application scenario. The goal of TPoX is to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.

Why is it important to mention that long list of features? This is because several other XML database benchmarks (e.g., XMach-1 , XMark, XPathMark, XOO7, XBench, MBench, Michigan Benchmark, and MemBeR) already existed before TPoX was born. All but one or two of these focus mostly on XQuery performance or on specific database aspects, not on the entire system. For a company that plans to buy an XML database system it is not good enough to know that the XPath evaluation of a system is outstanding when insert processing or bufferpool management are not worth a penny. In other words, being good in one aspect of what makes up a database system is not good enough to produce a well-rounded, reliable, and performant (XML) database system, a system database user are really looking for.

Because both TPC and SPEC were not interested in developing an XML database benchmark, because of the lack of an adequate database benchmark, and because of not much interest from other database vendors IBM eventually proposed TPoX to the database and XML community (see SIGMOD 2007 paper and 2006 Dagstuhl seminar on XQuery Implementation Paradigms) and made it open source. Why open source? It allows open discussions, contributions, and usage of the benchmark and its code.

Since TPoX has been made available, many companies, universities, business partners, other database vendors, and of course IBM have used TPoX to evaluate XML database performance. Some results have been posted at http://tpox.sourceforge.net/tpoxresults.htm, including results on a 1 TB database (the latter also has some nice overview slides). Note that many database vendor do not allow disclosure of benchmark results without their agreement.

Coming back to the original question whether TPoX is an official TPC or SPEC benchmark the answer is no, because there are no such XML database benchmarks. But TPoX is a well-adopted benchmark that allows to compare XML database systems by taking a well-balanced approach to cover most aspects of what makes up a (commercial) database system.