Some years ago, ThinkFridays were still popular at IBM. They could be used to work special projects and to extend skills. Here is something in that spirit:
Henrik's thoughts on life in IT, data and information management, cloud computing, cognitive computing, covering IBM Db2, IBM Cloud, Watson, Amazon Web Services, Microsoft Azure and more.
Friday, March 26, 2010
Accessing and processing embedded textual XML documents
Yesterday, I wrote about how and why embedded textual XML documents can create trouble. Today, I will look at different ways I found (so far) on how to deal with those documents in DB2 pureXML. The documents which are embedded as text are, as mentioned, a long text value (text node) and not element and attribute nodes. Because the transformation process of turning textual XML into an instance of the XQuery Data Model (XDM) and hence making the XML document queryable is not part of the XQuery language itself, we have to leave the XQuery environment to deal with those embedded documents. More on this later, let's take a look again at the sample document from yesterday:
Inside the "b" elements we have embedded documents in text form. To deal with those documents, we first need to decide how we want them. If we are only interested in their text, we could just navigate down to the "b"s and extract the text values (assuming we have a table CD with a DOC column of type XML where we inserted the above document):
select x.* from cd,xmltable('$DOC/a/b' columns edoc varchar(2000) path '.') as x
The query could directly be used in an insert statement where the embedded documents are stored back into an XML columns. If however all the XML fragments should be used to make up a bigger document, the following statement may work:
select ''||x.edoc||' ' from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b} ') as x
Within the XMLTABLE, we first navigate to all the "a"s. For the returned column we return all the concatenated text values of b wrapped into a "dummy" element. This is necessary because casting to a string is only possible for a single item, not for a sequence of items (which the different text values would be). Later we utilize string concatenation to attach an "outer" element to make the XML fragment single-rooted.
The same could have been done with a statement like the initial one:
With XMLTABLE however you have more options on what to do because usually more data needs to be extracted and post-processed.
Wrapping an XMLPARSE around the constructed textual XML data turns this back into instances of the XDM which can be queried using XQuery.
Enough of embedded documents for today. In my next post, I will show you an embedded video to demonstrate how to deal with a rainy Friday afternoon...
<a>
<b>
<![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
</b>
<b>
<![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
</b>
<c>oh, even a different element<c>
</a>
Inside the "b" elements we have embedded documents in text form. To deal with those documents, we first need to decide how we want them. If we are only interested in their text, we could just navigate down to the "b"s and extract the text values (assuming we have a table CD with a DOC column of type XML where we inserted the above document):
select xmlquery('$DOC/a/b/text()') from cd
The above query would return all embedded documents within the parent document as a single string. This is usually not what is needed. Using XMLTABLE, we can return each embedded document individually as string: select x.* from cd,xmltable('$DOC/a/b' columns edoc varchar(2000) path '.') as x
The query could directly be used in an insert statement where the embedded documents are stored back into an XML columns. If however all the XML fragments should be used to make up a bigger document, the following statement may work:
select '
Within the XMLTABLE, we first navigate to all the "a"s. For the returned column we return all the concatenated text values of b wrapped into a "dummy" element. This is necessary because casting to a string is only possible for a single item, not for a sequence of items (which the different text values would be). Later we utilize string concatenation to attach an "outer" element to make the XML fragment single-rooted.
The same could have been done with a statement like the initial one:
select ''||xmlcast(xmlquery('{$DOC/a/b/text()} ') as varchar(2000))||' ' from cd
With XMLTABLE however you have more options on what to do because usually more data needs to be extracted and post-processed.
Wrapping an XMLPARSE around the constructed textual XML data turns this back into instances of the XDM which can be queried using XQuery.
select xmlquery('$OUTPUT/outer/e1/e2') from (
select xmlparse(document ''||x.edoc||' ') as output from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b} ') as x)
select xmlparse(document '
Enough of embedded documents for today. In my next post, I will show you an embedded video to demonstrate how to deal with a rainy Friday afternoon...
Thursday, March 25, 2010
The trouble with processing embedded textual XML (CDATA)
You understand the basics of XML and started to master processing XML. You successfully started using XPath and XQuery in database systems like DB2. Now that, some other guys are introducing the next level of complexity: Embedding XML documents into other XML documents. This is really mean and I show you why.
XML documents like the following can be processed with DB2 by using its pureXML feature.
When a document is inserted into a column of type XML, it is parsed and transformed into the internal, "native" representation. An instance of the so-called XQuery Data Model (XDM) is created (see the Processing Model in the XQuery specification). This step or process is not part of the XQuery processing, XQuery processing assumes that you (or your system) managed to provide instances of the XDM it then can operate on. If you inserted the above document into DB2, you have an instance of the XDM and you can process the document using XPath or XQuery. That's fine and everybody is happy.
Now, humans have never rested and sought new challenges. When you learn a foreign language, you first learn how to speak and understand simple sentences. Eventually, you have to deal with more complex sentences like subordinate clauses, relative clauses, appositions, and whatever the stuff is named (think of subselects, common table expressions, case statements, etc.). Where was I? Ah, back to XML. Similar to sentences, people tend to make data more complicated and to seek new challenges. What they do is to embed XML data into other XML data.
In the above example the previously text values like "first value" are replaced with XML fragments on their own. The entire document and the embedded parts can still be easily processed because all "tags" are element nodes in the XDM instance. XPath and XQuery can directly answer queries on e1 and e2, e.g., all instances of "e2" can quickly be found by searching "//e2".
However, the above way of embedding XML is not the only one and some organizations, standards, and data providers embed entire XML documents as text. This can be done by escaping directly, i.e., to by replacing < and > by "<" and ">". A different, but equivalent way is to utilize CDATA sections (see the XML specification). Let's take a look at the following example:
The embedded document is only text data and XQuery or XPath cannot easily work on the data as e1 and e2 are not element nodes, but only characters in a longer string value. As mentioned earlier, turning XML in its textual representation into an instance of the XDM is not part of the XQuery language. And this is where the trouble begins.
I plan to look at options on how to process the data in a future post.
XML documents like the following can be processed with DB2 by using its pureXML feature.
<a>
<b>first value</b>
<b>second value</b>
<c>oh, even a different element<c>
</a>
When a document is inserted into a column of type XML, it is parsed and transformed into the internal, "native" representation. An instance of the so-called XQuery Data Model (XDM) is created (see the Processing Model in the XQuery specification). This step or process is not part of the XQuery processing, XQuery processing assumes that you (or your system) managed to provide instances of the XDM it then can operate on. If you inserted the above document into DB2, you have an instance of the XDM and you can process the document using XPath or XQuery. That's fine and everybody is happy.
Now, humans have never rested and sought new challenges. When you learn a foreign language, you first learn how to speak and understand simple sentences. Eventually, you have to deal with more complex sentences like subordinate clauses, relative clauses, appositions, and whatever the stuff is named (think of subselects, common table expressions, case statements, etc.). Where was I? Ah, back to XML. Similar to sentences, people tend to make data more complicated and to seek new challenges. What they do is to embed XML data into other XML data.
<a>
<b>
<e1><e2>embedded data 1</e2></e1>
</b>
<b>
<e1><e2>embedded data 2</e2></e1>
</b>
<c>oh, even a different element<c>
</a>
In the above example the previously text values like "first value" are replaced with XML fragments on their own. The entire document and the embedded parts can still be easily processed because all "tags" are element nodes in the XDM instance. XPath and XQuery can directly answer queries on e1 and e2, e.g., all instances of "e2" can quickly be found by searching "//e2".
However, the above way of embedding XML is not the only one and some organizations, standards, and data providers embed entire XML documents as text. This can be done by escaping directly, i.e., to by replacing < and > by "<" and ">". A different, but equivalent way is to utilize CDATA sections (see the XML specification). Let's take a look at the following example:
<a>
<b>
<![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
</b>
<b>
<![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
</b>
<c>oh, even a different element<c>
</a>
I plan to look at options on how to process the data in a future post.
Tuesday, March 23, 2010
My SYSDUMMY1 has turned DUAL
For ages I have used the SYSDUMMY1 view to evaluate simple expressions or retrieve registry variables (some of that could have been done using the VALUES expression). It seems that I will switch to a shorter version of it and in the future will most of the times use DUAL (SYSIBM.DUAL).
Once the DB2_COMPATIBILITY_VECTOR includes 0x02, you can even leave out the SYSIBM schema and just use DUAL:
Once the DB2_COMPATIBILITY_VECTOR includes 0x02, you can even leave out the SYSIBM schema and just use DUAL:
db2 => select current_date from dual
1
----------
03/23/2010
1 record(s) selected.
1
----------
03/23/2010
1 record(s) selected.
Wednesday, March 17, 2010
General Availability!!!
Today's topic is not about DB2 vs. Oracle, not about cost savings, TCOs, time to value, or truth in advertising. I won't talk about innovation that matters or smarter cities, not about how "advanced" or "deep" stuff like analytics, insight, or just intelligence is moving towards something eye opening, innovative, breathtaking, standards-based, health-conscious, patent pending, next generation, world dominating (now how to conclude this awful sentence?) and yet to come.
Today is about simplicity, about something beautiful, fundamental, long awaited, and still always surprising. It is about nature. Spring has finally arrived in Germany after a too long and too cold Winter. Blossoms are popping up, green is taking over the landscape. General Availability of Spring. What an important event, something to enjoy.
Today is about simplicity, about something beautiful, fundamental, long awaited, and still always surprising. It is about nature. Spring has finally arrived in Germany after a too long and too cold Winter. Blossoms are popping up, green is taking over the landscape. General Availability of Spring. What an important event, something to enjoy.
Friday, March 12, 2010
Deadlock and Hot Spot in Real Life (Police Report)
Being a database guy and lecturer, it is always nice to see how easy it is to explain database terms with real life analogies. Today's feature is "deadlock" and "hot spot". The situation is taken from this police report (look for "Unglaublich aber wahr" from March 9th, it's in German):
We see: Mutual dependencies create a deadlock in this situation. Depending on what resources are involved, hot spots can be created. And as usual: Don't blame the DBA.
In English: A 88 year old woman went outside to the trash bin forgetting the keys and the door fell shut. Eventually, the police was called and together they figured that the daughter had the only spare key. Police went to the daughter's house and while the police talked with the daughter outside the house the door felt shut. The only spare key was in her mother's house (the 88 year old woman). So they came up with the plan to order a locksmith to the mother's house and meet there. However, to make it more interesting, after some time the daughter remembered that the lunch was still cooking on the stove and, thus, the fire department was called in (fortunately nothing was damaged). Lunch was also on the old woman's stove (no damage here, again).Unglaublich aber wahr
(9. März 2010) Die Geschichte fing damit an, dass gestern Mittag eine ältere Dame im Buntentorsteinweg ihren Abfall aus dem Haus bringen wollte. Nach Erledigung musste sie aber feststellen, dass ihre Haustür zugefallen und sie keinen Haustürschlüssel mitgenommen hatte. Die Frau wandte sich daraufhin hilfesuchend an ihren Nachbarn, der seine Schutzmannskollegen informierte. Die sehr aufgeregte 88 Jahre alte Frau konnte den uniformierten Helfern lediglich mitteilen, dass ihre Tochter im Besitz eines Ersatzschlüssels sei. Deren Adresse und Telefonnummer fielen ihr in der Aufregung nicht mehr ein. Nachdem diese Lücke schnell durch die Polizeibeamten geschlossen werden konnte, wurde ein Einsatzfahrzeug zur Adresse der Tochter entsandt. Die 55-Jährige wurde auch angetroffen und um Hilfe gebeten. Nach einigen Minuten mussten die Beamten allerdings über Funk ihren Kollegen bei der Mutter mitteilen, dass es mit der Hilfe noch dauern wird, weil der Tochter bei dem Gespräch mit ihnen die Haustür zugefallen sei. Einen Ersatzschlüssel hätte nur die Mutter! Daraufhin orderten die Beamten einen Schlüsseldienst zum Buntentorsteinweg. Als die Tochter sich jetzt auf den Weg machen wollte, um ihren Ersatzschlüssel bei der Mutter abzuholen, fiel ihr siedendheiß ein, dass sie das Mittagessen auf dem Herd hatte. Logische Konsequenz - ihre Haustür wurde jetzt von der eilig informierten Feuerwehr geöffnet. Außer einem leichten Brandgeruch wurden keine weiteren Schäden festgestellt. Nachdem der Schlüsseldienst die Haustür der Mutter geöffnet hatte, wurde auch hier leichter Brandgeruch wahrgenommen. Auch die Mutter hatte ihr Essen auf dem Herd gehabt. Die Mittagessen bei Mutter und Tochter waren nach Angaben der Einsatzkräfte gut durchgekocht.
We see: Mutual dependencies create a deadlock in this situation. Depending on what resources are involved, hot spots can be created. And as usual: Don't blame the DBA.
Thursday, March 11, 2010
IBM XML Index Advisor for DB2
DB2 has an index advisor (part of the DB2 Design Advisor) which helps to identify missing indexes to speed up a given workload. Unfortunately, it lacks pureXML support. The good news is that there is a tool on IBM alphaWorks, called the "IBM XML Index Advisor for DB2 for Linux, UNIX, and Windows".
Give the tool a try and provide feedback to the team (I already did and, e.g., requested more examples).
Give the tool a try and provide feedback to the team (I already did and, e.g., requested more examples).
Tuesday, March 9, 2010
Is it me or the database topic? Simple reading guaranteed
Today I stumbled over an ad or a link for SEO (search engine optimization) and website grading. I put in the URL for this blog (http://blog.4loeser.net) and was surprised to see the following as part of the report:
Basically, what the tool found out is that what I have been writing about in my blog requires only Primary or Elementary School education to understand my blog. Could that mean I am an excellent writer and I am able to explain complex topics in simple terms? Or is DB2 easy to understand (and administrate)? Or is it the database topic in general? Maybe I should be more technical to raise the level and make reading this blog more demanding. On the other hand, what is the audience I am trying to cater to...?
Basically, what the tool found out is that what I have been writing about in my blog requires only Primary or Elementary School education to understand my blog. Could that mean I am an excellent writer and I am able to explain complex topics in simple terms? Or is DB2 easy to understand (and administrate)? Or is it the database topic in general? Maybe I should be more technical to raise the level and make reading this blog more demanding. On the other hand, what is the audience I am trying to cater to...?
Monday, March 8, 2010
DB2 Performance and cost savings tip: Don't maintain unused stuff (LASTUSED)
When you enjoying a regular life, over time you usually accumulate a lot of stuff. Some (most?) of that stuff is rarely used, but you have to dust off or clean it anyway, you somehow pay for keeping it through shelves or cabinets you need, through rent or paying off credits. That's one of the reasons why Spring cleaning has been invented (and my wife repeatedly asks me to clean up my home office). Anyway, the point I wanted to make is that it is a good idea to look for unused stuff to make housekeeping simpler and cheaper. The same applies to database systems.
It seems that the DB2 team dusted off an old feature requests and shipped it in DB2 9.7 Fixpack 1: a LASTUSED column in some catalog tables. With the help of these columns and the new infrastructure in the background it is possible to determine the date of when a database object was last used. If you don't use an index, why maintain it, slow down IUD operations and pay for storage? If an MQT (Materialized Query Table) is not used, it either means you can drop it or you have query execution plans to look at and to figure out why it's not used anymore.
The LASTUSED column of type DATE is available for tables (SYSCAT.TABLES), for indexes (SYSCAT.INDEXES), data partitions (SYSCAT.DATAPARTITIONS), and packages. One thing to note is that the information is updated asynchronously, i.e., if a table is used only once a week the LASTUSED value for it isn't updated the same second or minute the table has been accessed. That's done to avoid turning the catalog into a hotspot.
So much for today, I have to clean up my desk now and sort through old stuff...
It seems that the DB2 team dusted off an old feature requests and shipped it in DB2 9.7 Fixpack 1: a LASTUSED column in some catalog tables. With the help of these columns and the new infrastructure in the background it is possible to determine the date of when a database object was last used. If you don't use an index, why maintain it, slow down IUD operations and pay for storage? If an MQT (Materialized Query Table) is not used, it either means you can drop it or you have query execution plans to look at and to figure out why it's not used anymore.
The LASTUSED column of type DATE is available for tables (SYSCAT.TABLES), for indexes (SYSCAT.INDEXES), data partitions (SYSCAT.DATAPARTITIONS), and packages. One thing to note is that the information is updated asynchronously, i.e., if a table is used only once a week the LASTUSED value for it isn't updated the same second or minute the table has been accessed. That's done to avoid turning the catalog into a hotspot.
So much for today, I have to clean up my desk now and sort through old stuff...
Tuesday, March 2, 2010
Understanding Processor Value Units (PVUs)
When you are a software developer, you don't necessarily have insight into the pricing and sales process. From coding days I know how to hook up functionality to a license server and how to make sure features are only used when money was paid (that's the part regarding job security). So recently I tried to learn more about how DB2 software (distributed software in general) is priced or what forms the basis for pricing. The magic term in IBM lingo is PVUs or Processor Value Units.
PVUs are not based on how much you actually paid for the machine, but how valuable in processing power (from IBM's view) your machine is. A number is first determined for a processor core, then the numbers are added up to account for all cores in a machine. A table with up-to-date PVUs for the different processor types is maintained at http://www-01.ibm.com/software/lotus/passportadvantage/pvu_licensing_for_customers.html. The new POWER7 processors, e.g., have 120 or 100 PVUs per core, depending on the server model. Based on the table you can do the math and can come up with the total PVUs your machine has, e.g., 800 PVUs. A different (and I can't say simpler or more comfortable) way is to use the "processor value unit calculator" where you click through a decision tree.
An introduction to the DB2 pricing with some background on why PVUs were introduced is this article by Paul Z. and Deb Jenson.
If you have read all this, you can determine how valuable your machine is. To no surprise, there are other licensing option like per server or per socket which are only available for certain DB2 editions (usually only the "smaller" editions). If you don't (want to) care about PVUs, maybe why not just start with the free DB2 Express-C?
PVUs are not based on how much you actually paid for the machine, but how valuable in processing power (from IBM's view) your machine is. A number is first determined for a processor core, then the numbers are added up to account for all cores in a machine. A table with up-to-date PVUs for the different processor types is maintained at http://www-01.ibm.com/software/lotus/passportadvantage/pvu_licensing_for_customers.html. The new POWER7 processors, e.g., have 120 or 100 PVUs per core, depending on the server model. Based on the table you can do the math and can come up with the total PVUs your machine has, e.g., 800 PVUs. A different (and I can't say simpler or more comfortable) way is to use the "processor value unit calculator" where you click through a decision tree.
An introduction to the DB2 pricing with some background on why PVUs were introduced is this article by Paul Z. and Deb Jenson.
If you have read all this, you can determine how valuable your machine is. To no surprise, there are other licensing option like per server or per socket which are only available for certain DB2 editions (usually only the "smaller" editions). If you don't (want to) care about PVUs, maybe why not just start with the free DB2 Express-C?
Traveling too often? Dealing with not so easy customers? Try this...
If you are traveling too often or, of course because of the economy, have to cut down your budget for roses and chocolate, here is something to try. It doesn't say whether you can change the message using a USB plug or whether it is useful to deal with customers that need that little extra attention. Anyway, try it and let me know...
Subscribe to:
Posts (Atom)