Data Henrik: March 2009

Tuesday, March 31, 2009

pureXML Devotee call today

Susan Malaika is hosting a "pureXML Devotee" call today. One of our pureXML performance experts, Susanne Englert, will walk through EXPLAINs with an emphasis on XML. See this wiki page for details.

If you are interested in forming a local pureXML interest group, please let Susan Malaika know (contact info on the linked page).

Thursday, March 26, 2009

One year ago: Moving in

One year ago on a cold Wednesday morning "our" container arrived and we moved in our new passive house. This was after a 7 week journey by truck and train from San Jose, CA, to Houston, TX, then on with the Sealand Motivator to Bremerhaven, Germany, by train to the Ulm (Germany) container terminal and on by truck to Friedrichshafen.

Later that morning during the unloading, a customs inspector arrived for a 30 minute/40 EUR visit to make sure we didn't bring large quantities of Weber grills. We only had US junk bonds... :)

Our kids enjoyed having their toys back and all of us, after 7 weeks in hotels and with relatives, to sleep in our own beds again.

Wednesday, March 25, 2009

Puzzled about the XML Regions Index?

In February I wrote about the XDA Object and that XML data is buffered, i.e., it is using the bufferpool for performance. Earlier I had given the puzzle below.

The triangle usually depicts an index (balanced tree). In the picture the index is over NA, EMEA, AP - three geographic regions. The solution to the puzzle is the so-called "XML Regions Index" in DB2. You may be puzzled and ask: What is the XML Regions Index?

The DB2 Information Center states the following:

The XML regions index captures how an XML document is divided up internally into regions, which are sets of nodes within a page. When an XML document is represented as nodes, each node is a record in a page. Since regions are sets of nodes within a page, the number of regions index entries can be reduced, and performance may be improved, if a larger page size that can store more nodes within a page is used.

DB2 formats XML documents into data pages which then fit into the bufferpool. In order to handle documents that are larger than a data page, documents are split up into so-called regions which again fit into a page. To be able to quickly find a document for a given document ID or the region for a given node inside such a document and to be able to recompose the XML document from its parts, the XML region index is used. The index holds an entry for every region. The larger the page size (DB2 offers 4kb, 8kb, 16kb, and 32kb), the fewer pages are needed to store a document. The fewer pages, the fewer regions and hence entries in the regions index. And the smaller an index, the faster the access.

That's why in the "15 best practices for pureXML performance in DB2 9" we recommend to use large page sizes where possible:
As a rule of thumb, choose a page size for XML data which is not smaller than two times your average expected document size, subject to the maximum of 32 KB. If you use a single page size for relational and XML data, or for data and indexes, a 32 KB page size may be beneficial for XML data but somewhat detrimental for relational data and index access. In such cases, 16 KB or 8 KB pages may be a better choice that works well for both.
Coming back to the initial puzzle, the rule of thumb is true regardless of the geographic region you are living in...

Monday, March 23, 2009

About Tecology, Lamp Mounts, and Empty Spots

Recently I wrote about Smart Meters and on Saturday I could see one in action - unfortunately not in my house. Over, at the other house, the owner had already detected one unneeded consumer he hadn't really thought of. The reaction to that finding was to plug it off.

The ON/OFF is a very natural way of thinking, especially for Computer Scientists. Many automated things around our homes and offices use that principle to control electrical consumers. Infrared sensors switch lights on and off, timers control when and for how long things are turned on, lawns are watered from x to y, and so on.

What is needed is a more adapted approach where we are moving away from the ON/OFF for many of the simple things. In many cases we already have technology for ecological use ("tecology"?), but don't apply it (for many reasons). To give an example: The CPU clock speed is reduced in our computers when we don't need that much CPU power to reduce energy consumption - you are probably using that feature on a daily basis. Similarly, we have sensors and controllers to adjust light output to automatically keep rooms at constant brightness with a mix of natural and artificial light. Does your home or your office have such a feature? Probably not. And you would not have it at home even if it would be affordable. The reason is that it is way more complex and time consuming to install a new ceiling lamp than to install or upgrade new hardware such as hard disks or CPU modules.

I am still waiting for an universal lamp mount where I can plug a lamp and thereafter the wires are connected and the lamp itself is safely mounted. That also explains why one year after moving into our new home many lamps are still missing...

Thursday, March 19, 2009

Everybody hates DB2, and pureXML is crap!

DB2 pureXML has only one XML type, can you believe it? There are no choices. To be more flexible I want one XML type for financial applications, one for report-generating applications, one for financial applications that are used for generating reports, and one for report-generating applications in a more financial context. Where are my choices in DB2? Only one XML data type, no options!
DB2 pureXML is following those quasi standards from W3C for updating XML data. It's not standard yet. Can't they do anything proprietary that will stay proprietary after the standardisation is finished? I want something different in my database system, not standards. It would be too simple to learn and too simple to port to or from other systems. I want to show that I have special skills that I can use only with one special system. I hate DB2, I hate standards.
DB2 has all this autonomic and automatic stuff which is too simple to use. Where are the knobs? I love configurations, optimizer hints and month-long fine tuning, why autonomics? I don't want a life. Autonomics does not work. Look at the DB2 pureXML benchmark results! They did only use autonomics during their 1 TB TPoX benchmark, no manual tuning. Look at the pureXML results, the numbers cannot be good.
Speaking of benchmark results, I hate them. Why can't the DB2 guys leave testing to enterprises alone? Why do they make benchmark results public? We all know that benchmarks are crap and that's why so far no really serious database vendor has published any benchmark results, ever! This is all so much crap!

==
Are you wondering whether I had too much coffee this morning or was trying substances used in professional "sports" or my wife made me sleep outside in the cold? Nice guess, but I am only testing something for the upcoming April Fools' Day. The above may have been a bit too obvious, may have been missing any "supportive" facts and links you may have asked for. I hope you enjoyed it nonetheless. If you are in for more subtleties you should try reading what Oracle lables "Oracle 11g XML DB vs. DB2 9.5 pureXML". It can be downloaded here. I enjoyed all the subtle humor and how they argue. Nice, a well done paper for April Fools' Day!

BTW: You can learn more about TPoX in one of my previous posts and of course at the official TPoX website. I also wrote about XML storage options before. If you have never touched DB2, I encourage you to try out DB2. Don't be fooled. DB2 Express-C can even be used in production systems free of charge and pureXML is included.

Tuesday, March 17, 2009

Spring has arrived (no Java inside!)

Spring is finally here! For those of us who are too deep in software I want to emphasize that it is NOT Java-related, it is the season and Wikipedia has this:

The phenological definition of spring relates to indicators, the blossoming of a range of plant species, and the activities of animals, or the special smell of soil that has reached the temperature for micro flora to flourish. The first swallow to arrive for the flowering of lilac may be the indicator of spring. It therefore varies according to the climate and according to the specific weather of a particular year.

What is not mentioned in the Wikipedia article is that it is a very nice part of a year when living in a passive house. The Winter has been cloudy and foggy here in the Lake of Constance region, but now the sun is shining and both warming our house and the water. The solar panel on our roof is getting enough hours in sun light to heat up the entire water tank, and also inside the house it is nice and warm without a heating system.

What else is missing in the Wikipedia article? Both our local Zeppelin here in Friedrichshafen as well as the one in the San Francisco area are starting passenger flights again. It's nice to see the Zeppelin overhead, tourist season is coming.

Finally, Spring is also linked to Spring cleaning. There seems to be a complete checklist of what to do during Spring cleaning. I just checked and I couldn't find upgrading to DB2 9.5 on it. One more important thing to consider for the coming days...

Friday, March 13, 2009

Smart Metering / DB2 Performance Expert

Since last Fall I am trying to become one of the beta testers for the Smart Metering offering of our local utility company. I was too late for the first batch of customers admitted to the program, then in January I learned that I am using the "wrong" Internet provider and to call back in March. Today I retried, but no luck and another month of waiting as they are still working on a solution.

Why do I want to have smart metering? There are two aspects of it. First of all, I am lazy and the idea of not having to run into the basement to read the meter is nice. Right now I am reading our two meters (one for our heat pump, solar pump, and related consumers, the other for the rest of our house including the computers) once a week to get an idea of how much energy we use. The second reason for getting a so-called e-meter is to have ongoing reporting, to get more data points. With a smart meter, the data about the current consumption is transferred to the reporting server every couple minutes. I would be able to see in graphs and tables how much energy is used and when it is used throughout the days and weeks. Moreover, comparative data is provided to see how we are faring compared to similar energy customers. The overall idea is to get insight into usage, then being able to detect unusual situations and to remove or reduce hidden consumers in our household, or to shift some consumption (washer, dryer, dishwasher, etc.) to the (possibly) cheaper evening hours.

So much for the theory, I have to wait another month to get an update on when I can start with Smart Metering.

BTW: Products like DB2 Performance Expert give you insight into your database server, your operating system and much more. They give you the frequent data points and historic and comparative data, so you can detect unusual situations before they are real problems. It's similar to smart meters: Less running, more insight, smoother system.

Have a nice weekend.

Tuesday, March 10, 2009

Hundreds of indexes on a single XML column - why not?

Last week I presented at the German Database Conference "BTW" (see here for a history) some work on how to address XML Index Challenges (paper written with two colleagues). More and more enterprises face the challenge of logging all their activities. Given that today SOA and Web Services become standard for new applications, there is a trend that the logs are XML-based as well.

The scenario in the paper is based on what was observed at several banking customers. The internet banking system is required to log every event in any of their internet banking applications - we are talking 10 to 20 million inserts a day. Events include clicks that take a user to a new web page or dialog, entry of user data, as well as every click that initiates a banking transaction. This “logging” happens across a set of diverse applications such as checking accounts, loans, investment management, and others. The log information can vary widely, thus very flexible storage is required. So far the data is stored in a relational table with few columns and the log record itself as VARCHAR.

The key problem with a relational-only approach is that the variable part of the log entries is hard to query with adequate performance. Current relational database technology does not allow easy indexing of individual pieces of strings in a VARCHAR column. The applications that read this data typically use SQL "LIKE" predicates on these VARCHAR values. This results in limited queryability, limited index usage, and sub-optimal performance. The solution is to use XML as log format and to make use of DB2 pureXML's hybrid capabilities, i.e., to store the logs in an XML column and some other information (like a timestamp) as relational fields.

What is the challenge?
The challenge is not the heavy insert workload as was demonstrated with the 1 Terabyte XML benchmark, it is the indexing. Given that there are different log record formats (different XML document types) for the different applications, and there is a heavy volume of data coming in, you want to have indexes on all important elements and attributes to find the needle in the haystack. If we think of only few dozen application types and typically 10 to 20 indexes per type, we are already in the hundreds of indexes.

Does it work?
The good news is that DB2 pureXML is able to handle it without problems. First of all, DB2 allows users to only index what is needed (see the Information Center and Tip 8 in the DB2 pureXML 15 best practices). Other systems require primary index, secondary index, side tables or whatever it is called and this would be overhead slowing down the system. DB2 directly extracts - without any prerequisite - the data to index.
Secondly, even if there are hundreds of indexes, only dozen or so match a given document. Still, taking care of the hundreds of non-qualifying indexes could be an issue. Fortunately not in DB2, how this is done was part of my presentation and is described in the conference paper. With the number of non-matching indexes increasing, there is only a small performance penalty and the DB2 server on a machine with 4 dual-core 1.9 GHz POWER5 processors was able to maintain close to 100,000 inserts a minute.

How many indexes are ok?
Earlier I had asked about the right number of indexes. The answer to the question can be (over-)simplified by saying that you should create as many as are needed to help query performance. As we have learned we don't need to worry too much about non-matching indexes. With that we are back to relational index maintenance characteristics most of us know already.

Friday, March 6, 2009

Does you car need different engines? (XML Storage)

Today I plan to talk about cars and XML storage, two seemingly different topics.

Cars have an engine, the engine is used to move the car with slow, middle, or high speeds, on flat roads, up and down hills or mountains, with light or heavy load. You don't need a different engine for different terrains. It's great because you can go where you need to go without too much pre-planning - flexibility you love. You don't need to select and put in a different engine for the next task ahead. There are more benefits because service technicians know how to deal with that single engine, car manufacturers can put their development budget into improving that single engine, and you know how the engine is behaving in different situations. It's not always optimal, but you are prepared to take the roads ahead because you don't need to change engines. You are prepared for unexpected detours ahead. Watch out if someone tries to sell you a car where you need to change engines!

BTW: With automatic transmissions you don't need to shift gears, with newer automatic transmissions there isn't even any gear shifting. Cars have evolved to turn on lights automatically, even to watch out for and assist in critical or complex situations (ESP, ABS, etc.). It takes stress from you and let's you reach your destinations safely.

Coming to the topic of XML storage, I want to emphasize that XML is about flexibility. DB2 pureXML only has a single XML type, a single XML storage engine. You don't need to choose a different XML storage option if your requirements change. You don't need to decide (and know) upfront which road to take, no issue dealing with an unexpected detour because your DB2 engine is suited to handle it. This is flexibility you love (or will learn to love)! Combine that with automatic storage, self-tuning memory management, and other autonomics and life is less stressful.

What a perspective for the weekend. And remember to watch out...

Tuesday, March 3, 2009

The DB2 Information Center has: Information!

I am always surprised how often I use the DB2 Information Center. I still remember my days as research assistant when I was the Informix DBA in our research group. We only had few printed manuals for each of the database products (Informix, DB2, ObjectStore, Illustra, Ingres, ...) and hence the electronic version of the manuals was the standard way of accessing or trying to access information. I was happy because the Informix manuals were pretty good. On the other hand I always found it hard to find the needed information quickly in the electronic DB2 manuals (version 5 I believe). When I remember correctly, then at that time it mainly consisted of some larger HTML documents, search was provided by a dedicated search server (explain that to the sysadmin).

Times have changed and as DB2 developer and DB2 user I found it interesting how - what is called today - the DB2 Information Center has evolved. Take a look at what I could find online:

Version 7: Not that much content yet, but the first version of how the Information Center looks today. Under "What's New" you can find the following section:
The Information Center, accessible by selecting Help --> Information Center from the Control Center, is now easier to use. In the Information Center you can now view information by subject or by category. The Information Center now contains a link to search the entire DB2 library.
You can also access help with the db2ic command on UNIX, or from the Start menu on Windows operating systems.
Version 8 added lots of content and it is organized by activities.
Version 9 is the first version with pureXML, the navigation bar on the left got extended and of course the documentation significantly expanded.
Version 9.5 changes the navigation bar and information is organized differently.

The above is from a quick glimpse on the front page, there are far more changes when you dig deeper.

One of the most frequently used features (at least for me) is the search box, but browsing can lead you to the needed information quickly as well. In the DB2 Version 9.5 Information Center click on the "Database fundamentals" topic. The last subsection should now be labelled "pureXML". Expand that section by clicking on the + sign and you should see something like shown here.

It provides hours of interesting reading material and can even, at least sometimes, beat a good novel. Follow the links to related topics and you are up to expanding knowledge and skills. Related material is not limited to the Information Center alone, our ID people include other articles, such those on developerWorks, as well. Thank you!

After all the words from my side, try it out yourself. I am back to reading, in the DB2 Information Center of course...

Zeppelin NT Review in "Road & Track" (no DB2)

I wrote in an earlier post that I follow the adventures of the Zeppelin NT airships. They are built where I live now and can be seen in the Lake Constance region, in the SF Bay Area, and in Tokyo.

The US magazine "Road & Track" has a nicely written Road Test of the Zeppelin NT in its newest edition (April) with pictures and a video. Enjoy!

Pages