IBM announced this week that a new supercomputer named "Watson" will compete on Jeopardy! The computer will have an advanced Question Answering (QA) system. One of the challenges is that the question is not really a question, but a description and contestants have to make up a question as their answer.
In many of my pureXML talks I mention something along the following lines as part of the introduction:
Goldfarb, Mosher, and Lorie created in the 1960s the predecessor of what is now used worldwide for information sharing and exchange and for greater flexibility in information management.
Now you press the buzzer and say: What is...?
Henrik's thoughts on life in IT, data and information management, cloud computing, cognitive computing, covering IBM Db2, IBM Cloud, Watson, Amazon Web Services, Microsoft Azure and more.
Wednesday, April 29, 2009
Tuesday, April 28, 2009
stringIDs in DB2 pureXML: What and Why
Earlier this month I asked you about stringIDs in DB2 pureXML. Answer B) from the provided options is correct: DB2 replaces structural information such as element names, attribute names, namespace prefixes, and URIs with stringIDs. But what are stringIDs and why are they used?
When we look at a simple XML document like
<department>
<employee>
<firstname>Henrik</firstname>
<lastname>Loeser</lastname>
</employee>
</department>
and an XPath expression like "/department/employee/lastname", then - first of all - we see a lot of strings. The strings are of different length and the tags (the markup) make up most of the document. And we haven't even introduced namespaces here.
How do you efficiently store such documents? XML can be very verbose compared to relational data. How do you navigate within such documents as quickly as possible, i.e., compare the different steps of your XPath expression to the different levels of the XML document?
The key to compactness and speed is the use of stringIDs. In DB2 pureXML every element name, attribute name, namespace URI, and namespace prefix is replaced by a 32-bit integer value when an XML document is parsed. Each string is mapped to a unique number, a so-called stringID.
In the above example, every "department" could be replaced by 1, every "employee" by 2, etc. When the DB2 engine compiles a query and generates an executable package, it also uses the stringIDs. This way, when the XPath expression is evaluated on the data at runtime, only integer values need to be compared. First we need to match the root element, i.e., look for the element name with value 1 ("department"). If we find one, its child needs to be a 2 ("employee"). Comparing integer values is of course much faster than comparing strings of variable length.
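To make the idea more tangible, here is a minimal Python sketch of the principle. It is only an illustration and not how DB2 implements it: the StringTable class, the tuple layout, and the function names are all made up for this example. Names are mapped to integers once at parse time, the path is translated once at "compile time", and the evaluation then compares integers only.

```python
import xml.etree.ElementTree as ET

DOC = """<department>
  <employee>
    <firstname>Henrik</firstname>
    <lastname>Loeser</lastname>
  </employee>
</department>"""


class StringTable:
    """Hypothetical dictionary that maps each distinct name to an integer."""

    def __init__(self):
        self._ids = {}

    def id_for(self, name):
        # Hand out the next free integer the first time a name is seen.
        return self._ids.setdefault(name, len(self._ids) + 1)


def encode(elem, table):
    """Turn an element tree into nested (stringID, text, children) tuples."""
    return (table.id_for(elem.tag), elem.text, [encode(c, table) for c in elem])


def compile_path(path, table):
    """Translate '/a/b/c' into a list of stringIDs, once, before evaluation."""
    return [table.id_for(step) for step in path.strip("/").split("/")]


def evaluate(node, steps):
    """Collect the text of all nodes matching the path; integer compares only."""
    name_id, text, children = node
    if name_id != steps[0]:
        return []
    if len(steps) == 1:
        return [text]
    matches = []
    for child in children:
        matches.extend(evaluate(child, steps[1:]))
    return matches


table = StringTable()
tree = encode(ET.fromstring(DOC), table)
steps = compile_path("/department/employee/lastname", table)
result = evaluate(tree, steps)
print(steps, result)
```

In this sketch "department" becomes 1 and "employee" becomes 2, as in the text above; the remaining numbers simply follow the order in which names are first seen.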
How fast the XQuery execution is in DB2 pureXML can be seen when you look at the TPoX benchmark results or by reading some of the customer success stories collected at the pureXML wiki.
Monday, April 27, 2009
Passive House, Electric Cars, Noise, and DB2 diagnostics
We have been living in our passive house for a little over a year now. Thanks to 48cm thick walls and 4 pane windows it is mostly quiet inside the house, even with the flight path to/from FDH being very close.
What we are not missing from the US is the sound of Harleys, sometimes large groups of them. A lot of noise, you can hear and feel them approaching.
Putting that into perspective, I read about US lawmakers considering adding a requirement for non-visual alerts for motor vehicles (Pedestrian Safety Enhancement Act of 2009), i.e., to make so far quiet electric and hybrid cars louder. Worldwide everybody else seems to work towards making things quieter (reducing noise emissions). So this looks strange.
What I love about DB2 is that the many autonomic features help you forget that a database is running. The diagnostic file has information about how DB2 was/is doing. Using diaglevel I can set the "noise level" I prefer; the default (3) is to log all errors and warnings. If you set it to the highest level (most noise), it gives you all kinds of informational messages. Did you know that you can analyze the diagnostic messages using the tool db2diag? It was introduced in DB2 a couple of years ago. If you haven't done so, try it out and look at the "noise"...
Friday, April 24, 2009
A rabbit in your life?
Today, I stumbled over Nabaztag, a so-called smart companion (no jokes here because my wife would deliberately misinterpret it). It's been on the market for a while now and is a WLAN-connected rabbit that can alert you of emails/news/events, read you the weather forecast, play music, and much more. The system can even be extended with RFID tags and it has its set of APIs to write custom applications.
At first sight, it looks more like a "nice" gadget and programming toy than something really useful.
Are you (or your kids) already using this rabbit? I am asking because my oldest son wants a rabbit now that other kids in the neighborhood recently got a pet...
XML arrives in the warehouse
More and more data is generated, sent, processed, (even!!!) stored as XML. Application developers, IT architects, and DBAs are getting used to XML as part of their life. Let's get ready for the next step, XML in the enterprise warehouse. Why? To bring the higher flexibility and faster time-to-availability (shorter development/deployment time) to the core information management systems, the base for critical business decisions: the enterprise warehouses. Greater flexibility and shorter reaction times are especially important during economic phases like the current one.
DB2 9.7 ("Cobra") adds support for XML in the warehouse. What is even better is that XML compression and index compression can bring cost savings. Right now not many of the analytic and business intelligence tools support XML data. Fortunately, both the SQL standard and DB2 feature a function named XMLTABLE that allows you to map data from XML documents to a regular table format, thereby enabling BI tools to work with XML data.
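To illustrate what "mapping XML to a regular table format" means, here is a small Python sketch of the concept. It is not XMLTABLE itself and does not use DB2; the function xml_to_rows, its parameters, and the second employee in the sample document are invented for this example. One path picks the nodes that become rows, and one path per column picks the values.

```python
import xml.etree.ElementTree as ET

# Sample data; the second employee is made up for illustration.
DOC = """<department>
  <employee><firstname>Henrik</firstname><lastname>Loeser</lastname></employee>
  <employee><firstname>Jane</firstname><lastname>Doe</lastname></employee>
</department>"""


def xml_to_rows(xml_text, row_path, columns):
    """Return one dict per node matched by row_path, one key per column path."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        # Each column value is looked up relative to the current row node.
        rows.append({name: node.findtext(path) for name, path in columns.items()})
    return rows


rows = xml_to_rows(DOC, "employee", {"first": "firstname", "last": "lastname"})
for row in rows:
    print(row)
```

The result is a flat list of rows with named columns, which is the shape of data that relational tools expect.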
My colleagues Cindy Saracco and Matthias Nicola have produced a nice summary and introduction of warehousing-related enhancements of pureXML in the upcoming DB2 9.7 release. The developerWorks article is titled "Enhance business insight and scalability of XML data with new DB2 V9.7 pureXML features". Among other things the paper explains how XML can now be used with range partitioning, hash (or database) partitioning, and multi-dimensional clustering. Enjoy!
Wednesday, April 22, 2009
Q4U: R U Cmprsng UR DTA?
DAMHIKT, but it seems stuffing more in less space is popular (try going on vacation with a family). Fortunately, the next release of DB2 adds a lot of compression features. On the page for the Technology Sandbox (a.k.a. "beta") the following compression functionality is mentioned in addition to DB2's deep table compression:
- Multiple algorithms for automatic index compression.
- Automatic compression for temporary tables.
- Intelligent compression of large objects and XML.
Index compression saves you big because you usually have plenty of indexes, and compressed indexes stay compressed when in the bufferpool. Compression of temporary tables is especially useful for analytics/warehousing. Sign up and test it yourself!
SIG2R.
CU, Henrik (DARFC)
BTW: I got some help coding parts of my blog using this abbreviation list.
Monday, April 20, 2009
Now it's final: Oracle buys SUN
In case you haven't read it yet: Oracle continues to add more competitors by buying SUN, see here.
Home Automation, SOA, DB2, and an awning
Over the weekend I looked into a control for a patio awning. Right now we are planning a patio cover (like this one (English / Deutsch)) with an automated awning. The idea is that the awning opens automatically when the sun gets too intense and closes automagically when the sun has retreated for a longer time, the wind gets too strong, or it starts raining. You don't want the house to heat up too much when you are away, but most importantly you don't want the awning to be damaged because of the weather.
There are standalone versions of the control, some with wireless remotes. More interesting for the technology-oriented person are controls which can be integrated into or are part of a home automation system (like the European Installation Bus (EIB/KNX) protocol). The idea is that you can control/automate your heating system, your shades, your lamps, your coffee maker, etc. The different devices offer data on the installation bus, and central processors can read and process the data, react, and send out instructions. If it is getting dark outside, let down the shades. If you lock your front door, switch off all lamps. If it starts raining, send the kids outside for a bath...
How does that compare to SOA, Web Services, or DB2? All devices offer their data and services on the installation bus, requests are sent on that bus. It's simple to build new applications or to integrate new devices/services. The same idea is behind SOA and Web Services.
DB2 offers great automation features, autonomics. If the sortheap is too small, increase it by moving resources around - if more storage space is required for a tablespace, let the system add it. You don't want to deal with such issues while you are away (for the night or on the weekend). Let the software do it, so you can focus on other more important stuff.
With the same reasoning I would like some processor/software to take care of our patio shade, of the awning. I know it from my database system, I want it for my home.
Thursday, April 16, 2009
Lessons from the field: Remember the trees and your relational data!
Some days before my vacation I was called to assist with a possible performance problem at a customer. To get a first impression I asked for the table definition, the indexes, and the query in question.
The table had several relational columns and one XML column. No problem, typical scenario. The indexes presented to me were a couple of XML indexes. All looked OK to support the query in question, which was similar to:
select col1, col2, ..., xmlCol
from tableInQuestion
where col1='someString'
and xmlexists('$XMLCOL/department/employee/name[first="Henrik"]')
and xmlexists('$XMLCOL/department/employee/name[last="Loeser"]')
When people with a relational background learn to write queries with XML expressions, they repeatedly hear that they should watch out for XMLEXISTS and write the comparison correctly (do not produce a boolean value!). This was done correctly in the query above. But I noticed a semantic issue. The intent is not to find XML documents that have one employee with the first name "Henrik" and possibly another one with the last name "Loeser", but documents with an employee named "Henrik Loeser". Because the predicates should be on the same employee, the query needs to be rewritten to something like the one below. Both comparisons are within a single XMLEXISTS, both on the same employee.
select col1, col2, ..., xmlCol
from tableInQuestion
where col1='someString'
and xmlexists('$XMLCOL/department/employee/name[first="Henrik" and last="Loeser"]')
When writing queries against databases with XML data, keep in mind that the data is not normalized, i.e., you have an XML document with possibly multiple instances of a data object (such as employees).
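The difference between the two query variants can be reproduced outside the database. The following Python sketch uses a made-up document in which no single employee is named "Henrik Loeser"; the sample names and variable names are assumptions for illustration only. Two independent existence checks still succeed, while the combined check on the same employee does not.

```python
import xml.etree.ElementTree as ET

# Made-up sample: "Henrik" and "Loeser" belong to two different employees.
DOC = """<department>
  <employee><name><first>Henrik</first><last>Smith</last></name></employee>
  <employee><name><first>Jane</first><last>Loeser</last></name></employee>
</department>"""

root = ET.fromstring(DOC)
names = root.findall("employee/name")

# Two independent existence checks: each may be satisfied by a different employee.
separate = (any(n.findtext("first") == "Henrik" for n in names)
            and any(n.findtext("last") == "Loeser" for n in names))

# One combined check: both conditions must hold for the same employee.
combined = any(n.findtext("first") == "Henrik" and n.findtext("last") == "Loeser"
               for n in names)

print(separate, combined)
```

The first variant would return this document even though nobody in it is called "Henrik Loeser"; the second one would not.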
The above did not cause the performance problem. A few questions later we found that the team had focused so much on getting the XML part right that they forgot to create indexes on the relational columns...
Wednesday, April 15, 2009
XML, WebSphere DataPower, and Data Studio
Michael Schenker, a fellow German and IBMer at the Silicon Valley Lab, has a nice blog entry about IBM WebSphere DataPower and XML processing. The article gives an overview and has links to a tutorial and other developerWorks articles describing how to process XML with both DataPower and DB2 pureXML.
Monday, April 6, 2009
A day at AERO 2009: About aviation and databases
In my defense, let me start this post by stating that to a database guy everything looks like a database.
On Saturday I took my oldest son to AERO 2009, the biggest European expo for General Aviation that included an air show in the afternoons. We went there half on foot and half by shuttle bus, but many visitors actually flew there with their own planes and Friedrichshafen's airport area and airspace was crowded.
I (and to some degree my son) can be labeled "experienced passenger", nothing more. In the database world this would compare to "having used an ATM" (this is not Air Traffic Management) or "received database-generated report in mail". At the expo were many commercial and private pilots, aircraft mechanics, aerobatic pilots, some flight attendants, air traffic controllers, some government agencies, and many more. In my (our?) world this would compare to DBAs and sysadmins, performance specialists, maybe application users, your management, auditors, etc. Of course there were finance and insurance companies present (and possibly lawyers...).
From strolling around I learned how much software is now used even by private pilots. Simulation, flight planning, navigation, in-flight monitoring and control, the electronic flight bag (EFB), air traffic control/management (ATC/ATM), and many more require special software. The electronic flight bag can even be a collection of XML data. Statistical data such as that from ATADS is nothing more than your typical database application. A very nice air traffic visualization uses XML, web services, Google Maps, MashUps, and a database.
Thus, even if you were only an IT guy, you would have gotten your share of information. I won't write about the special deal I could have gotten for a private jet, how the goodies are different from an IT expo, and why I am lucky my wife is not a wing walking lady (or here)...
Thursday, April 2, 2009
DB2 pureXML and stringIDs
You may have heard that DB2 pureXML uses so-called stringIDs instead of strings for internal processing. I plan to discuss them in an upcoming post. Until then, please guess which answer is correct:
A) DB2 replaces all strings with stringIDs.
B) DB2 replaces structural information such as element names, attribute names, namespace information, etc. with stringIDs.
C) DB2 replaces all attribute values with stringIDs for data compression (deep compression).
Which is correct (and why)? Feel free to comment.
Wednesday, April 1, 2009
Thermal Images have arrived
In a February post I wrote about Thermal Imaging and that it can be considered quality assurance work for your (new) house. Yesterday evening we finally received a binder with several infrared shots of our house (paper only).
The scanned image below shows the West side of our house with the forest in the backdrop. The picture was taken around 10pm when it was around 1 degree Celsius outside. Clearly visible are the roof, the windows, and on the left our porch. You might wonder about the one whitish vertical and the two horizontal lines. We have a prefabricated house and those lines indicate where the large panels have been joined. Energy-wise those joints are the weaker spots. However, in our case the entire house fares very well. QA passed!