Data Henrik: February 2009

Friday, February 27, 2009

I am not allowed to tell you

Do you know this feeling when you already know what is coming and you are not supposed to talk about? You already found THE perfect birthday gift for your loved one and are eager to tell her or him without telling details. Or you know about what's inside the wrapping and want to actually see it and have to wait for another two hours.

IBM is offering the DB2 Technology Sandbox, a preview of what is most likely coming in the next release of DB2. On the website there are only vague hints of what is coming, including the following ones:

“Fast Enable” - 90% less effort to move to DB2:
Leverage your existing staff and skills.
Save even more with deep compression:
Multiple algorithms for automatic index compression.
Automatic compression for temporary tables.
Intelligent compression of large objects and XML.
Gain business intelligence from your XML data.

In my role as DB2 developer I have been working on some of the features and as advocate I am already working with some customers and I even gave briefings about it. Anyway I am not allowed to tell you more about it. It's the same feeling as described above. The good thing is you don't have to wait for the gift to be unwrapped. You can apply to take a sneak peek at it. What's even better is that you can gain a competitive advantage because you are prepared and know how to leverage the new technology once it's generally available.

Monday, February 23, 2009

DB2 pureXML And Bufferpools - XDA Object

One of the questions I answer a lot to both customers and IBMers alike is whether XML data is buffered, i.e., whether XML data utilizes the bufferpools. Before I explain it in more detail, let me answer the question briefly. Yes, XML data goes through the DB2 bufferpool and the entire navigation within documents happens on XML data in the bufferpool. This is for performance, performance, and performance.

Let's take a look at a database that we create with system managed space (SMS). Note that this is NOT the recommended way of creating a database and you should just use the default. The reason we use SMS is that it makes the different database objects visible.


create db testx user tablespace managed by system using ('testx')

Once the database is created, we connect and then create five tables like this:

create table t1(id int, doc xml, text clob(10k));
create table t2(id int, doc xml);
create table t3(id int, text clob(10k));
create table t4(id int);
create table t5(id int unique not null);

After the tables are created, we try to locate the directory or folder where the objects are placed. Because I am using DB2 Express-C on a Windows laptop, I am seeing something like shown in the picture below.

Now we take a look at the DB2 Information Center to have the file types explained.

Each subdirectory or container has a file created in it called SQLTAG.NAM. This file marks the subdirectory as being in use so that subsequent table space creation does not attempt to use these subdirectories.

In addition, a file called SQL*.DAT stores information about each table that the subdirectory or container contains. The asterisk (*) is replaced by a unique set of digits that identifies each table. For each SQL*.DAT file there might be one or more of the following files, depending on the table type, the reorganization status of the table, or whether indexes, LOB, or LONG fields exist for the table:
SQL*.BKM (contains block allocation information if it is an MDC table)
SQL*.LF (contains LONG VARCHAR or LONG VARGRAPHIC data)
SQL*.LB (contains BLOB, CLOB, or DBCLOB data)
SQL*.XDA (contains XML data)
SQL*.LBA (contains allocation and free space information about SQL*.LB files)
SQL*.INX (contains index table data)
SQL*.IN1 (contains index table data)
SQL*.DTR (contains temporary data for a reorganization of an SQL*.DAT file)
SQL*.LFR (contains temporary data for a reorganization of an SQL*.LF file)
SQL*.RLB (contains temporary data for a reorganization of an SQL*.LB file)
SQL*.RBA (contains temporary data for a reorganization of an SQL*.LBA file)

When we look at the files, we see that the objects for table "t1" have the prefix "SQL00002" and for "t5" the prefix "SQL00006". Now why do we have the different object types?

t1/SQL00002 has a DAT object for the table data, LB and LBA objects for the CLOB column, and XDA for the XML data. Why do we have an INX index object? The DB2 Information Center has the answer again (look for the section on the XML data type).
When a column of type XML is created, an XML path index is created on that column. A table-level XML region index is also created when the first column of type XML is created.
The INX object is needed for the two system indexes that are created for the XML column.
t2/SQL00003 and t3/SQL00004 demonstrate what we just said. If we only have a CLOB column, no index/INX is needed, with XML only, we don't have LOB objects.
t4/SQL00005 only has an INTEGER column and the only object is of type DAT. No index is present and hence there isn't any INX file.
Finally, to make the reverse probe, we created t5/SQL00006 with a unique constraint which is implemented as a unique index. With an index present, we need an INX object.

Long story short, XML and LOB data are different. XML data is in its own storage object, even if you usually can't see it. DB2 caches XML data in the bufferpools for performance. I didn't explain today how XML data is organized and why XML data can be buffered even if documents are larger than a database page. The latter has to do with the puzzle I gave you some weeks ago. Stay tuned for the details...

Thursday, February 19, 2009

How many trees does a forest need? (DB2 pureXML Indexing)

The above question was the key topic in a public debate here in town. The background is the small forest next to us. Right now there are too many trees per acre and they are competing for sun light and space. The proposed solution by the administration was to cut some trees, so that those trees which are important for the forest, for animals and insects can grow and prosper. The envisioned result is a healthy forest.

Do you see any parallels to our database world? One of the more frequently asked questions about DB2 pureXML is how many XML indexes do make sense for a table. In the relational world, the number of possible indexes is limited by the number of combinations and permutations of columns. For XML it would be possible to store an entire database inside a single document. But how do you index the data? How do you determine which indexes are needed? Too few indexes could mean that queries do not perform. Too many indexes could mean that a hefty penality is paid during IUD operations for index maintenance and the system health is impacted. The index maintenance costs are known for the relational world, but little is known for XML indexes.

In the coming days and weeks I plan to address XML indexing questions. In the meantime, please let me know if you have indexing topics that should be part of this series.

Passive House & "Arctic" Wheather / Ein Passivhaus bei den arktischen Temperaturen

Right now it is freezing cold in Germany (up to -18 Celsius last night), in our region we had "moderate" -12 degrees Celsius last night. How are the temperatures inside a passive house at this time, given that we only have a small heating system?
The thick insulation and walls fight off most of the cold and let the warm air stay inside. To have fresh air inside and keep it warm inside, our ventilation system follows the following steps:

Fresh air is received through an earth tube (a 40 m long pipe buried 2 m deep) which causes the air to warm up from the -12 C to around 7 C.
A heat exchanger takes the energy off the out-flowing air and warms up the in-flowing air. This brings the in-flowing air to already 19 C.
Over night the 400W heat pump bumped up the in-flowing air temperature further to around 30 C.
Throughout the day we had clear skies and the sun helped to keep the inside temperature at 23 C (kids were wearing T-shirts only), without the heating system on.

To summarize, it was nice and warm inside, really cold outside, and we had minimal energy consumption.

Passivhaus-Temperaturen bei "arktischem" Wetter
Derzeit ist es richtig eisig draussen, bei uns war es "nur" -12 C letzte Nacht. Wie sind die Temperaturen in unserem Passivhaus? Recht angenehm kann man sagen. Dies haengt damit zusammen, dass die dicken Waende die Kaelte draussen und die Waerme drinlassen. Unsere Fenster haben wir Dank der Lueftungsanlage das letzte Mal im Oktober geoeffnet. Jetzt beim kalten Wetter sind folgende Temperaturen beim Prozess des Lueftens festzustellen:

Draussen sind es -12 C. Die Luft wird durch einen Erdwaermetauscher, ein ca. 40 m langes und 2m tief vergrabenes Rohr, angesaugt und erwaermt sich bis auf ca. 6-7 C.
Die Luft wird dann durch einen Waermetauscher geleitet, der mit der Fortluft die einstroemendes Frischluft erwaermt. Dies bringt die Frischluft schon fast auf Raumtemperatur.
Unsere Mini-Waermepumpe (Leistung cal 400W) erwaermt die Luft weiter auf 29-32 C.
Am gestrigen Tag hatten wir einen klaren, wolkenlosen Himmel mit viel Sonne. Dadurch war die Heizung nicht notwendig und die Sonne half die Temperaturen im Haus auf 23 C zu bringen.

Insgesamt ist es derzeit sehr angenehm im Haus, auch was den Stromverbrauch betrifft.

Wednesday, February 18, 2009

Virtual Here, Virtual There

Virtual One
In a recent discussion with a database professor the following problem was pointed out. Students were using their own machines for homework which was based on DB2. There are different operating systems and DB2 supports many of them. Few of the students ran into issues in an unsupported W*****s environment. My proposed solution is to look into so-called virtual appliances. Basically this is a virtual machine that runs a specific pre-installed application, in our case DB2. The good news is that IBM offers a DB2 virtual appliance here, and a smaller image is also available at VMWare here.
Please note that you still need real hardware to run it.

Virtual Two
There has been some blogging (e.g., Susan Visser or Conor O'Mahony) about a virtual IBM event called "Data in Action". You may ask, what is a virtual conference? Will the food be better than at a regular conference? Will people still watch me after a fun intro when the regular (IBM) talk sets in and I start snoring? The answers mostly depend on where and how you attend this virtual event. The good news is that the choice is yours. One thing does not change though:
You still need real coffee to run such an event.

Friday, February 13, 2009

Valentine's Day, DB2 pureXML, and XQuery - How To Impress Your Spouse

Tomorrow is Valentine's Day and there are different ways to collect important Brownie Points at your significant other. Today, I want to show you how you can do it as a DBA knowing your DB2 system catalog and some XQuery (mental note: DB2 was the right infrastructure decision, version 9.5 required).

The following XQuery will produce a string based on a procedure name, then within the query we use a simple regular expression to get rid of characters, and finally make use of string processing functions.


xquery for $i in
db2-fn:sqlquery("select xmltext(procname) from syscat.procedures where procname like 'WLM_%ESS'")
let $k:=replace($i,'(_)|(S|W|T|M|P|R|N|C|A|G)',''),
$j:=string-join((substring($k,4,1),concat(substring($k,1,1),
substring($k,string-length($k)-1,1),
substring($k,5,1),substring($k,3,1)),concat(substring($k,7,1),
substring($k,string-length($k)-1,1),substring($k,2,1))),' ')
return $j

The result should look like shown here:

There are probably simpler ways to send the same message, but sometimes the effort counts. When I tested it with my wife this morning, she was especially thankful for the "1 record selected" - something that is important to most relationships...

===
Valentinstag
Wer am morgigen Valentinstag seinem/seiner Liebsten tief beeindrucken und die Liebe bekunden moechte, hat viele Moeglichkeiten. DB2 stellt (wie immer) die passende Infrastruktur zur Verfuegung. Mit obiger Anfrage kann man seine Gefuehle ausdruecken und gleichzeitig ein wenig XQuery lernen. Wichtig ist nichts anbrennen zu lassen und "1 record selected" als Teil des Ergebnisses zu haben. Nichts ist schlimmer als wenn Fragen aufkommen koennen...

Wednesday, February 11, 2009

Organic Food and pureXML (now included in core DB2 license)

When I go shopping for groceries I have many choices. More and more often I choose organic products because of their better ingredients and their quality. It's not always an easy decision because organic food usually costs (significantly) more than regular food. But later, at the table, I am rewarded and it was worth the additional cost. What would I pick if the product with the better quality and better ingredients is offered at the same price? Easy answer.

That's my food talk today, now to something entirely different (really?). I am happy to tell you that the so far separately priced pureXML feature will now be included in the core DB2 for Linux, UNIX, and Windows. I would guess that for some companies it was hard to decide between an "all-included pricing" from another database vendor and DB2 which offered some premium technology as "additional cost features" (even though DB2 is priced differently) or to make the step from all relational to adding XML columns (and thereby requiring an additional license). The announcement is great news for customers, business partners, and IBMers alike as it becomes simpler to implement XML-based solutions. The announcement itself has the rationale behind this move:

[...] The packaging for DB2 V9.5 for Linux®, UNIX®, and Windows® is being updated to offer greater value to our users and customers. With the expanded usage of Web 2.0, service oriented architecture and the increasingly ubiquitous usage of XML throughout software tools and applications customers are finding XML being used in all parts of their IT infrastructure. To meet the needs of our customers' evolving needs the pureXML feature will now be included in all editions of DB2 for Linux, UNIX and Windows. Customers will now be able to leverage the new pureXML features in DB2 V9.5 such as; parameter passing from SQL to XQuery, trigger support on XML columns, schema evolution, and sub-document update shorten development timelines, helping faster delivery of business applications. [...]

My colleague Conor adds the following to it:

[...] Strong levels of XML adoption combined with strong levels of pureXML adoption are the driving force behind this development. Essentially, pureXML has emerged to become a core data type in much the same way as the traditional relational data types.
This reflects the large amount of information in XML format in organizations around the world. XML has emerged to become the ubiquitous data format for many applications and environments, driven by the adoption of industry formats like HL7 in the healthcare industry, ACORD in the insurance industry, FIXML and FpML in the financial services industries, SEPA in European Banking, NIEM in Government, XBRL for financial reporting, and so on. [...]

For those that haven't tried pureXML yet, now is the time - no more excuses. pureXML is now included in the core DB2 product for all editions. Download the (entirely) free DB2 Express-C and start exploring the pureXML functionality.

Tuesday, February 10, 2009

XML Database Benchmarks, TPoX, and DB2 pureXML

I recently got asked why DB2 pureXML is using the TPoX benchmark and whether TPoX is an official benchmark similar to those from TPC or SPEC. I will try to answer that question today.

TPoX stands for "Transaction Processing over XML" and is an open source database benchmark which is available at SourceForge.net. It originated from IBM, but other sources, most significantly Intel, have been contributing to the benchmark. TPoX is an application-level XML database benchmark based on a (real) financial application scenario. The goal of TPoX is to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.

Why is it important to mention that long list of features? This is because several other XML database benchmarks (e.g., XMach-1 , XMark, XPathMark, XOO7, XBench, MBench, Michigan Benchmark, and MemBeR) already existed before TPoX was born. All but one or two of these focus mostly on XQuery performance or on specific database aspects, not on the entire system. For a company that plans to buy an XML database system it is not good enough to know that the XPath evaluation of a system is outstanding when insert processing or bufferpool management are not worth a penny. In other words, being good in one aspect of what makes up a database system is not good enough to produce a well-rounded, reliable, and performant (XML) database system, a system database user are really looking for.

Because both TPC and SPEC were not interested in developing an XML database benchmark, because of the lack of an adequate database benchmark, and because of not much interest from other database vendors IBM eventually proposed TPoX to the database and XML community (see SIGMOD 2007 paper and 2006 Dagstuhl seminar on XQuery Implementation Paradigms) and made it open source. Why open source? It allows open discussions, contributions, and usage of the benchmark and its code.

Since TPoX has been made available, many companies, universities, business partners, other database vendors, and of course IBM have used TPoX to evaluate XML database performance. Some results have been posted at http://tpox.sourceforge.net/tpoxresults.htm, including results on a 1 TB database (the latter also has some nice overview slides). Note that many database vendor do not allow disclosure of benchmark results without their agreement.

Coming back to the original question whether TPoX is an official TPC or SPEC benchmark the answer is no, because there are no such XML database benchmarks. But TPoX is a well-adopted benchmark that allows to compare XML database systems by taking a well-balanced approach to cover most aspects of what makes up a (commercial) database system.

Monday, February 9, 2009

Some San Francisco fun / Mit Skiern durch San Francisco

Over the weekend I saw a funny commercial for a German car. Being a German who has lived in the San Francisco Bay Area it is nice to see how they make use of the unique landscape.

The original video and a "Making Of" are available at the Audi TV website.
==

Mit Skiern durch San Francisco
Wenn man mal im Grossraum San Francisco gelebt hat und sich, wie ich, in die ungewoehnliche und ganz besondere Landschaft - insbesondere die Huegel - von San Francisco verliebt hat, ist es immer etwas Schoenes, diese auf Fotos oder Videos zu sehen. Am Wochenende bin ich durch Zufall ueber das oben eingebettete Werbevideo von Audio gestolpert. Auf der Webseite von Audi TV gibt es noch einen interessanten Blick hinter die Kulissen des Videos.

Friday, February 6, 2009

DB2 jack-of-all-trades: Hybrid, native, bilingual, pureXML

Today I was pointed to the FAQ for XML:DB again and asked whether DB2 is a hybrid system or an XML-enabled database. In those FAQs they distinguish between a native XML database (DB), an XML-enabled DB, and a hybrid XML DB. So what is DB2?

In the FAQ they use "hybrid" in a different meaning than IBM is doing for DB2 as "hybrid database system". XML:DB is defining a hybrid XML database as one that can be both native and XML-enabled. DB2 is called a hybrid system because it is both a (native) relational database system as well as a (native) XML database system. What does native mean? It indicates that the data, either relational or XML, is processed and stored in its own data model, with its specific semantics. Relational data is stored in an optimized row format, relational operators work on the data, and the output are result sets. XML data is stored in its (native) hierarchical format, as optimized, easy-to-navigate trees on disk. The XQuery Data Model (XDM) is an inherent part of the storage structure, the processing of XML data - sequences of nodes and atomic values are core to the processing. Based on this infrastructure DB2 is a native XML database and a hybrid database (relational, XML) - not a hybrid XML database.

Furthermore, DB2 is also bilingual as it understands both SQL statements and XQuery statements. If you write a regular "SELECT ... FROM ... WHERE ..." you are by default in the relational world, using SQL. Thanks to part 14 of the SQL standard, XML is a "relational" data type and we have XML-specific functionality and defined semantics. Users can embed XQuery statements into SQL.
By using the keyword "xquery" in front of a query, users can switch to the XQuery mode and directly issue an XQuery statement. Something like "xquery for $i in .... where ... return ..." is understood by DB2, users who are coming from the XML and XQuery world do not need to learn SQL and can immediately start leveraging their experience.
BTW: Both SQL statements and XQuery statements end up in a single compiler and optimizer since everything is deeply integrated. It's similar to speaking two languages and having only one (!!!) brain.

As shown, DB2 is a bilingual, hybrid database. To top it off, you can download and use it for free as DB2 Express-C.

Thursday, February 5, 2009

Thermal Imaging - QA work for your house

This morning we were supposed to have a thermal imaging company take pictures of our house (Original source Passivhaus Institut, Germany – http://www.passiv.de, for more examples see here). The idea is to be sure the house was built according to specs and there are no thermal bridges. This is similar to following your quality plan in software development. A thermal bridge can be compared to a memory leak - there are small ones and big ones. In a passive house you usually can find big "memory leaks" yourself because the inside wall would be cold in the winter. Smaller leaks are hard to find and many of them can add up to some energy loss.

Anyway, the sun was already out and had warmed up two sides of the house, so the imaging didn't make sense. Now it's back to waiting for another appointment to come up. On the bright side the thermal imaging guy let me look at our house and I couldn't spot anything concerning. It's also interesting to see your house, mostly colored in blue (good sign).

BTW: It would be nice to have a thermal camera for software products. Take a snapshot and you have most (all?) leaks.

Tuesday, February 3, 2009

Would Harry Potter use pureXML?

Last night I woke up around 3am and had this question in mind, strange but true. It is an interesting question and it probably came up because end of last year I read all Harry books start to finish.

To some, pureXML certainly can do magic by storing information that was not in the original database schema (there is always something unforeseen in your exciting life).
For others pureXML allows to trace back "dark magic" to its originator because (XML-based) messages can be stored as is without breaking them up and loosing information (application logging, compliance). In addition, just indexing critical parts of the XML documents and nothing more gives the performance needed when every millisecond counts (did you ever duel with someone?), for inserts and for queries.
Storing data as XML can also allow for rapid application development and deployment, giving competitive advantages (dueling again, my friends).

Looking at some of our customers, it seems likely that at least the Ministry of Magic (after the reforms :) would have used pureXML. Can you guess which departments and why? I will leave that to you to comment on.

Sunday, February 1, 2009

An Even More Secret Advice About db2-fn:sqlquery() - Casting in XQuery

Fred Sobotka has a blog posting about how to make parameter passing from XQuery to SQL (and back) work. It details on the conditions to be met:

For DB2 9.5, Fix Pack 3 or newer is a must, due to a problem with SQL and XMLCAST. This means that folks running DB2 Express-C without the FTL (which is currently the only way to bring Express-C up to Fix Pack 3) will be thwarted by that APAR until IBM issues a (long overdue) refresh.
XML data being passed into a SQL UDF must be cast into the appropriate SQL data type for that UDF.
The final output of the SQL UDF being called within db2-fn:sqlquery must be cast back to XML, either by surrounding it with an XMLTEXT function, or by passing it into XMLCAST and specifying a type of XML.

Let me comment more on it and show you how you can work around prerequisite 1) a little bit. DB2 9.5 is still a must because then the parameter passing from XQuery to SQL was introduced. FP3 is NOT required in my experience as the query below shows. The trick is to additionally cast in XQuery (see the use of xs:string()).

xquery
<people>
{for $a in db2-fn:xmlcolumn('XMLTBL.XMLCOL')
return <person
name="{$a/name/text()}"
soundex="{db2-fn:sqlquery('SELECT XMLTEXT(SOUNDEX(CAST(parameter(1) AS CHAR(128))) ) FROM sysibm.sysdummy1', xs:string($a/name/text()))}" />
}
<people>

The other two requirements are due to the typing information that is (not) available. The CAST in SQL (bullet point 2) is necessary to tell SQL how to handle the value in the SQL world. The string could be a VARCHAR(500) or a CHAR(10). The latter would obviously fail given the values passed in. So you better tell SQL what values to expect. Making XML from the output (requirement #3) is also a must because db2-fn:sqlquery() by definition returns a sequence of XML values. I hope this gives some more insight and helps those on DB2 9.5 prior to FP3.

BTW: I tested this on my DB2 Express-C 9.5 (shows as "9.5.0" when connecting).

Pages