Data Henrik: August 2010

Tuesday, August 31, 2010

Food: Where business and life meet

Working at home has advantages. There is the "lunch is ready in 5 minutes" call to remind me to get ready to make the few meters from my desk to the table on time. Sometimes, those instant messaging windows pop up (US East Coast wakes up or urgent European business) and it becomes a struggle to be at the table when food is served. And I usually don't help to prepare lunch during the work week. But all that changed today.

I can point to the following video, actually a recent IBM commercial, and claim: "Darling, see how I helped with the food?".

Monday, August 30, 2010

Beer or wine, elements or attributes? Designing XML schemas

Image via Wikipedia

One of the questions asked repeatedly during DB2 pureXML workshops is about the use of elements and attributes when designing XML documents/schemas. What is better, using elements or attributes?

The answer given is that it depends on the data, the requirements, and other factors. Instead of going deeper into the reasoning for elements and attributes, I am referring to the summary at Cover Pages. Michael Kay has already in 2000 put the entire discussion into the following wise words found here:

Beginners always ask this question.
Those with a little experience express their opinions passionately.
Experts tell you there is no right answer.

Friday, August 27, 2010

Yes or no: What is better for your data?

In two posts in the past, I explained how to say no to invalid data and how to find out what data was wrong and causing errors. With the REJECT INVALID VALUES clause during CREATE INDEX it is possible to reject XML documents that have values in the specified XPath location not compliant with the specified data type. Now the question I got is: What is better, to reject invalid values or to ignore them?

Well, let me answer with THE answer: "It depends." I will try to be more specific in a minute. Using XML is often for flexibility reasons and so you may not know what the future will hold. However, in most cases the XML schemas, i.e., the structure of the documents and the data types to expect, are known. And then you want to be sure the data matches what you expect. Because the data at the path specified in the CREATE INDEX statement needs to be casted to the target data type anyway, regardless of whether the value ends up in the index (valid) or not (invalid), the performance is the same. Moreover, with the approach to reject you can be sure that IF the path is present, its value is in the index. That leaves room for optimizations at runtime.

So my recommendation is that if you don't need to be as flexible as possible, then use the REJECT INVALID VALUES clause. If you should receive an error because of a casting issue, see my post on how to find out why that I mentioned above.

Wednesday, August 25, 2010

To APPEND or not to APPEND: Insert and Import Performance

When you create a table, by default, DB2 is resource-conscious in terms of space. So-called FSCRs (free space control records) are used to keep track of free space in the pages. When you delete a row or a row has to move because of an update statement expanding the row size, the new gap in the data page is added as free space information to a FSCR. During insert or import operations the FSCRs are searched to fill up the pages.

The DB2 registry variable DB2MAXFSCRSEARCH controls how many of those space management records are searched when a row is added to a table. The default is 5 FSCRs, a value of "-1" means that in the worst case all records are searched for the best fit. However, searching for free space takes some time, even with efficient algorithms, and it is faster to not search for free space when adding data. This behavior can be turned on by executing ALTER TABLE with the APPEND ON clause. If large quantities of data are inserted (or imported), a performance gain can be realized by omitting the free space search and directly append the data to the table, i.e., to add it to the end of the table.

The drawback of the approach of using APPEND ON is that even when data gets deleted the free space is not reused, unless the table is REORGanized. A REORG is also required if the table is altered again to switch to APPEND OFF to update the free space information.

Of course, if bulk data insert needs to be performed, using LOAD is a good (or most often better) option. All the above applies to relational as well as to XML data.

Tuesday, August 24, 2010

"BY REF" in XML processing

Several XML-related functions that DB2 offers allow to specify the "BY REF" keyword. Is it worth using it? And what does it mean anyway? Let's take a look at it.

DB2 functions like XMLQUERY, XMLEXISTS, and XMLTABLE have a PASSING clause to provide input values, some of the functions also return either a sequence or column values. For all of those values it is possible to specify the option BY REF (see the linked documentation for the functions above). What the BY REF option tells DB2 is to not create copies of the input or output values, but to only move a reference to those values around. This reduces temporary data, uses less of the valuable memory on the database servers, and it is much faster.

SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING BY REF doc AS "d" BY REF)

But what happens when the option BY REF is not specified? DB2 takes the default which in all places is BY REF. So sit back, relax, and enjoy a resource-conscious, highly efficient DB2 - by design.

SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING doc AS "d")

SELECT doc FROM myDocs WHERE XMLEXISTS('$DOC/name')

BTW: The latter, omitting the PASSING CLAUSE, only works in DB2 for Linux, UNIX, and Windows.

Real life: May all your transactions commit!

Image via Wikipedia

About two weeks ago I had to buy train tickets for a trip to Paris. Because I had a special promotion code that could only be applied by a sales agent I went to the train station. There, a loooong journey into the inner workings of the sales and reservation system started. One hour later, I left with tickets, a long line of waiting customers, and a frustrated sales agent.

Recently I received a promotion letter (20 EUR off) by Deutsche Bahn, the state-owned German train operator. In the fineprint it stated that it could only applied when booking through a sales office (instead of online), was good till September, had minimum requirements on the ticket price and the type of train, and most importantly, once applied was non-refundable. Because I wouldn't use it for private travel, I thought about saving some bucks for IBM and planned to use it for a train-based business travel for which I needed to apply the IBM corporate client tariff. And I had so-called BahnCard, a frequent traveler card for 50% discount. Looking back, combining all this, it already called for disaster.

Once I was next in line, I mentioned the corporate client tariff, showed my identification for the corporate client tariff, the BahnCard, and the promotion letter. Few minutes later, I had my ticket, the promotion was applied and the credit card was swiped. Unfortunately, when I held the ticket in my hands, I couldn't see the corporate client tariff. No problem, the sales agent said, we'll look into it. After consulting with another sales agent, the "application support" was called in on how to apply the corporate client tariff. Next, the old ticket was canceled and a new one generated, all looking good. The idea was to apply the refund for the old to the new ticket, putting the difference back to the corporate credit card. However, because of the non-refundable promotion, the system went into a loop. It could not fully cancel the old ticket because parts of it were paid for by a non-refundable payment. The sales agent went into different menu options for payment, trying to charge a negative amount to the credit card, trying to refund the difference as cash, etc. All actions produced error messages because of incompatible offer conditions.

After several long minutes and an increasingly long line behind my (now sweaty back), the sales agent decided to call in help from the application support center again. The support engineer was able to remotely log into the transaction and see details, but was neither able to cancel the transaction nor to complete it. Together, they discussed options on how to convince the system to "get the job done". Finally, with my OK my credit card was charged again - paying the ticket a second time. Then a cash refund outside the system was initiated after performing and documenting a full cash inventory. Eventually, I left the sales office after about one hour, smiling to the "next in line" call.

When you design promotions, make sure the system is not only able to handle them, but also all the error cases. Those are best practices of engineering.

Wednesday, August 18, 2010

Smaller, but way faster and cheaper: IBM sets new TPC-C record

I am traveling right now, but wanted to point you to a new TPC-C benchmark result. IBM set a new record running DB2 9.7 on a POWER7-based system. The IBM system is more than 35% faster than the so-far top Oracle result, providing 41% better price/performance and 35% better energy efficiency per transaction. This should set an end to the misleading and silly Oracle advertisements (I hope, but am not sure).

Tuesday, August 10, 2010

pureXML Indexing - Behind the scenes

I was asked whether I could explain a little bit how the different XML indexes are used for performing an index scan. Well, this probably requires several blog entries, but today I at least try to cover the basics.

When you create a table with an XML column, DB2 automatically creates two different XML indexes. One is the XML Path index, the other the XML Region Index. The INDEXTYPE in SYSCAT.INDEXES lists them as XPTH and XRGN. If you create a table with more than one XML column, you will notice that there is a XPTH for each XML column, but only one XRGN for the entire table. The path index maps an encoded path (e.g., /root/department/employee) to a unique pathID, an integer value. The pathID is unique within the entire database. The region index is used to manage parts (regions) of an XML document, so that documents can be larger than a single data page.

Similar to indexes on non-XML columns, users can create indexes on XML columns utilizing some features of the CREATE INDEX statement. Thereafter, the system catalog might look like shown in this article in the Information Center. But how are those indexes used?

One of the XML-related operators is XISCAN. When you perform an index scan using a non-XML index, an IXSCAN operator shows up in the access plan. For XML data, because some more effort is needed, the operator is named XISCAN, XML Index Scan. Once the optimizer decides that an index scan should be used, some auxiliary information to set up the scan is needed, including of course the start or stop keys for the values in the predicate (similar to regular b-tree access). What is needed in addition is the pathID of the path to search for. In the XML world we are not comparing "id=4711", but the value at "/root/department/employee/id" needs to be 4711.

If the entire absolute path is known when the query is compiled, the path can be turned into a pathID. However, this is not the case when wildcards (e.g., "//employee/id") are used. That's when the XML path index (XPTH) comes in handy. Because it stores the encoded paths in reverse order (our example would be "id/employee/department/root" in clear text) it is possible to find all paths ending (now starting) with "id" or "employee/id". To retrieve the pathIDs, at runtime an index scan on the path index is performed and the range is X-locked. With the locking DB2 can prevent any new path to be created during the actual index scan for the query predicate.

Once the pathIDs are known, either from the compile phase or the index look-up, the actual scan on the XML index (XVIP type in the catalog) can be performed and docIDs, nodeIDs (to pinpoint the node having that desired value) and RIDs (record identifiers) are obtained from the b-tree. The docIDs identify the qualifying XML documents, the RIDs point to the records (rows) in the table which the XML documents are part of. Now, based on the docID and nodeID, the part of the document can be located using the XML region index. Then, regular XML processing kicks in, e.g., the XSCAN operator or others could be applied.

Note that the above description covers the standard case and that optimizations for some cases are possible. There is also the XANDOR operator which speeds up index evaluation when two or more indexes for the same column can be used. But that's another blog post. Stay tuned...

Friday, August 6, 2010

Performance scale-out and continous availability: DB2 pureScale now available on System x

One of the most heard questions after the DB2 pureScale announcement last year and the initial "Wow!!!" was "Will it be available on System x?". Now that question can be answered officially: Yes, it is. Yesterday, IBM announced DB2 pureScale on System x. Now you have a choice of building your system on either POWER hardware or on select IBM System x servers.

Now I hear you say "Wow!" and ask: "When will there be an Express-C edition?"....

Tuesday, August 3, 2010

Fat-free and now in a light edition: DB2 Express-C download image

I am one of the gazillion users of DB2 Express-C. And that's why I am really happy that (from now on I hope) there is a "light edition" of the DB2 Express-C download image. You can still download the full edition which features text search, the Control Center, languages other than English, and some other things I consider minor. For the most of us it means less to download with the next (or this) release, thus an even faster upgrade cycle.

And when your spouse asks what you are doing, remember to point out: It's the light edition now (and fat free).

I wish they had...

Image via Wikipedia

The past few days we drove from the South of Germany to Westphalia and back. In most cases this is more or less some relaxed driving on a long stretch of Autobahn (our navigation system: "follow the road for 480 km"). This time however, we had to deal with strong downpours and related accidents as well as many construction zones. That's when I had to think about DB2 - naturally....

Some minutes on the Autobahn, a heavy downpour from almost black clouds started. Still, some drivers continued without headlights and with Bleifuss. The result was a complete stop of traffic because of an accident. One driver in a sports car had lost control of his Ultimate Driving Machine and slammed it into the divider and, with some rotations, back onto the highway which then was blocked for traffic. After one lane was mostly cleared in heavy rain, we continued for not even 5 minutes until we reached another similar accident scene. Later, we were passed again by some cars without headlight and in (too) high speed.

How does that relate to DB2 other than that DB2 can be a sports car and fun to use? I was thinking about the automatic and autonomic features. They allow you to drive your (database) engine on autopilot. Sure, some cars already switch on headlights automatically, adjust the speed of wipers, and have enough electronic to control the brakes and keep the car stabilized. But they don't seem to have a health monitor or utility throttling. Simple rules to be implemented could be "if too much rain and no sight, slow down" or "lift right foot from accelerator when no sight". Even a display to show alerts about, e.g., no grip because of 10 cm of water on the road is missing. So my hope is that with more heavy rain, people will finally learn and adjust. Which brings me to learning optimizers...

BTW: Just based on the information from the navigation system and the map of the Autobahn system, which route did we take?