Data Henrik: XML

Friday, August 17, 2018

Db2: Some Friday Fun with XML and SQL recursion

Recursion is fun!?

Right now, Michael Tiefenbacher and I have to prepare our joint talk "Some iterations over recursion" for the IDUG Db2 Tech Conference in Malta 2018. The title of our talk promises some fun, and coding up some of the sample SQL already is. Yesterday and this morning I tested a little query I wrote and brought down my machine (not Db2). In April, I wrote about two different kinds of recursive queries, standard SQL and Oracle syntax. Today, let me give you some insight into a nasty recursive query that features SQL, SQL/XML and XQuery.

Bluemix: Where Python and Watson are in a Dialog

Right now I am working on a side project to hook up the Watson Dialog Service on Bluemix with dashDB and DB2. The idea is to dynamically feed data from DB2 into a conversation bot. To register and manage dialogs with the Watson Dialog Service, there is a web-based dialog tool available. But there is also a dialog API and a Python SDK for the Watson services available. So why not manage the dialogs from the command line...?

Converse with Watson from the Command Line

Here is a small overview of my Python client that helps to register, update, delete and list dialogs and that can even drive a dialog (converse with Watson) from the shell window on your machine. The code and, for now, some short documentation are available on GitHub as watson-dialog-client.

In order to use the tool, you need to have the Watson Dialog Service provisioned on IBM Bluemix. The service credentials need to be stored in a file config.json in the same directory as the tool "henriksDialog". The credentials look like this:

{
    "credentials": {
        "url": "https://gateway.watsonplatform.net/dialog/api",
        "password": "yourServicePassword",
        "username": "yourUserIDwhichIsALongString"
    }
}
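As a rough illustration of how such a file could be read, here is a minimal Python sketch. The function name load_credentials and the validation step are assumptions made for this example; the actual tool on GitHub may handle the file differently.

```python
import json
from pathlib import Path


def load_credentials(config_path="config.json"):
    """Return (url, username, password) from a config file shaped like the one above.

    Hypothetical helper for illustration only, not taken from the tool itself.
    """
    with Path(config_path).open(encoding="utf-8") as fh:
        config = json.load(fh)

    creds = config["credentials"]

    # Fail early with a readable message if one of the three keys is absent.
    missing = {"url", "username", "password"} - creds.keys()
    if missing:
        raise KeyError(f"config file is missing: {sorted(missing)}")

    return creds["url"], creds["username"], creds["password"]
```

The three returned values are what a client would need to authenticate against the service URL.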

The credentials are read by the tool to "chat" with the dialog service. The following commands are available:

register a dialog by providing a new dialog name and the XML definition file
"henriksDialog -r -dn dialogName -f definitionFile"
update a dialog by identifying it by its ID and providing a definition file
"henriksDialog -u -id dialogID -f definitionFile"
delete a dialog identified by its ID
"henriksDialog -d -id dialogID"
list all registered dialogs
"henriksDialog -l"
converse, i.e., test out a registered dialog which is identified by its ID
"henriksDialog -c -id dialogID"
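The command-line interface listed above could be modeled with Python's argparse roughly as follows. This is only a sketch of the option layout. The destination names (dialog_name, dialog_id, definition_file) are assumptions, and the real tool may parse its arguments in another way.

```python
import argparse


def build_parser():
    """Build a parser that accepts the five commands and their options."""
    parser = argparse.ArgumentParser(
        prog="henriksDialog",
        description="Manage and test Watson dialogs from the command line.",
    )

    # Exactly one of the five actions has to be chosen per invocation.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("-r", dest="register", action="store_true", help="register a dialog")
    action.add_argument("-u", dest="update", action="store_true", help="update a dialog")
    action.add_argument("-d", dest="delete", action="store_true", help="delete a dialog")
    action.add_argument("-l", dest="list", action="store_true", help="list all dialogs")
    action.add_argument("-c", dest="converse", action="store_true", help="converse with a dialog")

    # Options that qualify the chosen action.
    parser.add_argument("-dn", dest="dialog_name", help="name for a new dialog")
    parser.add_argument("-id", dest="dialog_id", help="ID of an existing dialog")
    parser.add_argument("-f", dest="definition_file", help="XML dialog definition file")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

The mutually exclusive group keeps conflicting invocations such as "-r -d" from being accepted.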

Sample invocations and their output are available in the GitHub repository for this dialog tool. Let me know if something is missing or if you had success chatting with Watson from the command line.

Friday, March 28, 2014

XML or JSON: When to choose what for your database

There is a new article on IBM developerWorks titled "XML or JSON: Guidelines for what to choose for DB2 for z/OS". It has been written by two very experienced technologists with a DB2 background. Though the comparison and the many code examples for how to use either XML or JSON are for DB2 for z/OS, most of it also applies to DB2 LUW. So I recommend reading the article regardless of which "camp" you are in.

At the end of the XML vs. JSON comparison is a resource section with a list of good papers and documentation. In my blog's page on DB2 & pureXML Resources you will also find additional material on that topic.

With that to read, enjoy the weekend!

Tuesday, May 14, 2013

Performance Tuning for XML-based OLTP Systems (Temenos T24)

From time to time my colleagues and I get contacted on how to set up and tune DB2 for use with Temenos T24. The latter is a core banking system that is in use worldwide. It uses XML as its internal data storage format, which makes it an ideal fit for DB2 pureXML (both on z/OS and on Linux, UNIX, and Windows). To make our answers easily findable, why not blog about it? Today, I will cover setting up tables for good performance.

A typical T24 table consists of only two columns, a RECID (as primary key) and an XMLRECORD. This is similar to other systems built on XML data, whether they are OLTP systems or have more of an OLAP/BI character.

Inlining of data

For this kind of table it is a good idea to INLINE the XML column. Inlining is good for several reasons. Because XML data is normally stored in a special XML Data Area (XDA), accessing it requires an indirection from the row data: the descriptor stored in the row is used to look up the actual storage location via the XML Regions Index. With inlining there is no descriptor, so the XML data is fetched directly with the row and the XML Regions Index has fewer entries.

If there is test data, the function ADMIN_EST_INLINE_LENGTH can be used to find the typical inline length that should be set. The maximum inline length depends on the page size and hence the maximum row size. Both are documented together with that function. For T24 the recommendation is to use an inline length of 32000 bytes or even more in 32 KB pages.

Volatile Tables for Index Use

Another configuration option for tables is to declare them as VOLATILE. The background is explained on this page with "Relational index performance tips". Basically, the optimizer will prefer index-based access to the table even if the cardinality changes frequently.

Enable static compression

Many systems benefit from data compression. The same goes for T24 on DB2. However, the question is whether to use static (table) compression or even the newer adaptive compression (page-level compression on top). Based on different benchmarks and experience with production-oriented testing, the recommendation is to use static compression. With adaptive compression, the additional CPU overhead needed for the higher degree of space and I/O savings impacts system throughput. When in doubt, test it on your own. Another effect: Indexes on compressed tables will also be compressed by default.

Consider APPEND ON or DB2MAXFSCRSEARCH=1

With APPEND ON as an option to CREATE/ALTER TABLE, DB2 will insert "at the end" of the table without searching for free space elsewhere. It results in slightly higher space consumption, but benefits insert performance. An alternative is to set DB2MAXFSCRSEARCH to a low value, e.g., to 1. It determines, on a global level, how many free space control records (FSCRs) are searched when placing a new record. Thus, all tables would be impacted, not just the ones where APPEND ON is specified.

Separate tablespaces for table, index, and long data

It is always a good idea to have separate tablespaces for the different page types. They can be specified during CREATE TABLE.

PCTFREE 99 for hotspot tables

Depending on the application and usage type of T24 there can be hotspot tables. For these it could make sense to specify a high value for PCTFREE during CREATE/ALTER TABLE. This determines how much space is left free (read: unused) in a page, i.e., how many or how few records are stored in a single page. The fewer records - the extreme would be a single record - in a page, the less likely that page becomes a hotspot.

Consider table/range partitioning for bigger tables

Last, but not least, it is a good idea to apply range partitioning on bigger tables. This benefits performance in several ways: Old data can be rolled out (DETACHed) quickly, queries perform faster due to range elimination and rebalancing of work, and maintenance can also be done in parallel on the ranges.
The partitioning can either be done on the RECID (the primary key) or a new hidden column could be introduced to evenly split the data into ranges, depending on needs.

Summary

Tuning basic properties for XML-based tables is not (that) different from other tables. The above gives an introduction to what should be considered when setting up Temenos T24 or similar XML-based systems on DB2. I didn't provide examples with the full syntax. If you need them, leave a comment... ;-)

Special thanks to Juergen Goetz for sharing his experience for use with other customers and this blog entry.

BTW: There are also a few links for Temenos T24 on DB2 on my resources page. General DB2 system setup for HA for Temenos is discussed in the Redpaper "High Availability and Disaster Recovery for Temenos T24 with DB2 and AIX".

Tuesday, July 10, 2012

Index something non-existing - functional indexes in DB2

After discussing hidden (non-visible) columns last week, indexing something non-existing fits into the same theme. Let's shed some light on this seemingly mysterious topic...

XML documents and XML-based data modeling have been around for quite some years now. DB2 started its XML support with the XML extender in the late 1990s (version 7), and DB2 pureXML shipped initially with version 9.1. Advantages of using XML include flexibility of what is included in a document (or set of documents) and how sparse data can be represented. In relational tables, NULL values or other "empty" values need to be stored even if data for a property (column) does not exist. In XML documents that particular property can simply be left out, i.e., nothing is stored. However, this "nothing is stored", the non-existing data, introduces a problem for searching efficiently in some use cases. How do you find all customers that do not have a certain property, e.g., have not received the marketing letter yet or do not own a car?

To address this kind of problem, DB2 10.1 now allows certain functional indexes over XML data (when does it come for relational data - any guesses...?). One of the functions allowed in such an index definition is fn:upper-case(). That way an index supports case-insensitive searches over strings:

CREATE INDEX customers_email_idx ON customers(cinfo) 
   GENERATE KEYS USING XMLPATTERN '/customer/contact/email/fn:upper-case(.)'
   AS SQL VARCHAR(80); 

SELECT * FROM customers WHERE
   XMLEXISTS('$CINFO/customer/contact/email[fn:upper-case(.)="HENRIK.LOESER AT GMAIL.COM"]');

The other supported function is fn:exists(), which allows indexing the existence or non-existence of an element or attribute. Thus, an index can be utilized to search even for something that is not (directly) stored in the database, i.e., implied information.

CREATE INDEX customers_db2vers_idx ON customers(cinfo)
   GENERATE KEYS USING XMLPATTERN '/customer/sw/fn:exists(db2version)'
   AS SQL VARCHAR(1);

SELECT * FROM customers WHERE
   XMLEXISTS('$CINFO/customer/sw[not(fn:exists(db2version))]');

The above query would return all customers who don't have any DB2 version of software installed. Maybe they are reading this article and then will install DB2 soon...?!
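To make the predicate more tangible outside the database, the following Python sketch mimics what not(fn:exists(db2version)) selects, using the standard library's ElementTree. The sample documents and the name element are made up for this illustration. The sketch does not show how DB2 evaluates the query or uses the index.

```python
import xml.etree.ElementTree as ET

# Two made-up customer documents: A has a db2version element, B does not.
DOCS = [
    "<customer><name>A</name><sw><db2version>10.1</db2version></sw></customer>",
    "<customer><name>B</name><sw><os>Linux</os></sw></customer>",
]


def lacks_db2version(doc):
    """True if the customer has an sw element without a db2version child."""
    root = ET.fromstring(doc)
    return any(sw.find("db2version") is None for sw in root.findall("sw"))


# Keep the names of all customers matched by the "non-existence" predicate.
without_db2 = [ET.fromstring(doc).findtext("name") for doc in DOCS if lacks_db2version(doc)]
print(without_db2)
```

Note that, just like the path expression, the check only matches customers that have an sw element at all.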

Tuesday, July 19, 2011

A small update on updating XML data in DB2

Once you get started with processing XML data within a database, such as DB2, the next question usually is: How can I update XML documents? Well, (relational) database systems usually have an UPDATE statement for modifying data. In DB2, the same UPDATE can be used to change XML documents. This is done by providing the new XML document, which can be based on the previous version and then modified by applying expressions based on the XQuery Update Facility.

DB2 uses the so-called "transform expression" of that standard to let you describe (both SQL and XQuery are declarative query languages) what the new XML document should look like. Instead of providing an example here, I will give you the link to "Transform expression and updating expressions" in the DB2 Information Center. There you find plenty of examples to get you started.

The interesting aspect of the transform expression is that because it is an expression like any other in XQuery, you can combine it with the rest of XQuery and modify XML documents on the fly (e.g., in XMLQUERY or XMLTABLE), use it in "if-then-else" expressions to update documents conditionally, or come up with new ideas of how to use it.

Tuesday, November 16, 2010

What "pureXMLness" do you have...?

The longer and the more intensely you work with XML in a database system of your choice (this is DB2, right?), the more often the question comes up: What pureXMLness do I have?

To what degree am I using pureXML? How much XML does my system have? Is our system mostly relational or XML (remember DB2 is hybrid)? How do we measure how XML is used? Give me some numbers for our CIO, it's Q4.

To give a scientifically-proven, valuable answer, something that stands the test of time and holds up to all questions from the business side, I came up with THE pureXMLness factor. How is it computed? The following is the simple query that needs to be run against the DB2 database in question. It's an XQuery of course.

xquery
let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;
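For those who want to check the arithmetic without a database at hand, here is the same formula as a small Python function. The parameter names and the sample numbers are made up for illustration. In the query above the inputs come from the catalog and the buffer pool hit ratio view.

```python
def purexmlness(index_patterns, xml_columns, xml_strings, xml_paths, xsr_objects, xda_hit_ratio):
    """Average of the three terms used in the XQuery above ($i, $j, $k, $l, $m, $n)."""
    parts = [
        (index_patterns + xml_columns + xsr_objects) / 20,  # ($i + $j + $m) div 20
        (xml_strings + xml_paths) / 300,                    # ($k + $l) div 300
        xda_hit_ratio,                                      # $n
    ]
    return sum(parts) / len(parts)


# Made-up sample values: the first two terms come out as 1.0 each.
print(purexmlness(10, 5, 200, 100, 5, 98.0))
```

With these sample values the three terms are 1, 1, and 98, so the factor is one third of 100.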

Let me know what pureXMLness you have or if you think you have a better formula...

Monday, August 30, 2010

Beer or wine, elements or attributes? Designing XML schemas

One of the questions asked repeatedly during DB2 pureXML workshops is about the use of elements and attributes when designing XML documents/schemas. What is better, using elements or attributes?

The answer given is that it depends on the data, the requirements, and other factors. Instead of going deeper into the reasoning for elements and attributes, I am referring to the summary at Cover Pages. Back in 2000, Michael Kay already summed up the entire discussion in the following wise words:

Beginners always ask this question.
Those with a little experience express their opinions passionately.
Experts tell you there is no right answer.

Thursday, March 25, 2010

The trouble with processing embedded textual XML (CDATA)

You understand the basics of XML and started to master processing XML. You successfully started using XPath and XQuery in database systems like DB2. Now some other guys are introducing the next level of complexity: Embedding XML documents into other XML documents. This is really mean and I will show you why.

XML documents like the following can be processed with DB2 by using its pureXML feature.

<a>
  <b>first value</b>
  <b>second value</b>
  <c>oh, even a different element</c>
</a>

When a document is inserted into a column of type XML, it is parsed and transformed into the internal, "native" representation. An instance of the so-called XQuery Data Model (XDM) is created (see the Processing Model in the XQuery specification). This step is not part of XQuery processing itself. XQuery assumes that you (or your system) managed to provide instances of the XDM that it can then operate on. If you inserted the above document into DB2, you have an instance of the XDM and you can process the document using XPath or XQuery. That's fine and everybody is happy.

Now, humans have never rested and sought new challenges. When you learn a foreign language, you first learn how to speak and understand simple sentences. Eventually, you have to deal with more complex sentences like subordinate clauses, relative clauses, appositions, and whatever the stuff is named (think of subselects, common table expressions, case statements, etc.). Where was I? Ah, back to XML. Similar to sentences, people tend to make data more complicated and to seek new challenges. What they do is to embed XML data into other XML data.

<a>
  <b>
    <e1><e2>embedded data 1</e2></e1>
  </b>
  <b>
    <e1><e2>embedded data 2</e2></e1>
  </b>
  <c>oh, even a different element</c>
</a>

In the above example the previous text values like "first value" are replaced with XML fragments of their own. The entire document and the embedded parts can still be easily processed because all "tags" are element nodes in the XDM instance. XPath and XQuery can directly answer queries on e1 and e2, e.g., all instances of "e2" can quickly be found by searching "//e2".

However, the above way of embedding XML is not the only one and some organizations, standards, and data providers embed entire XML documents as text. This can be done by escaping directly, i.e., by replacing < and > with "&lt;" and "&gt;". A different, but equivalent way is to utilize CDATA sections (see the XML specification). Let's take a look at the following example:

<a>
  <b>
    <![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
  </b>
  <b>
    <![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
  </b>
  <c>oh, even a different element</c>
</a>

The embedded document is only text data and XQuery or XPath cannot easily work on the data as e1 and e2 are not element nodes, but only characters in a longer string value. As mentioned earlier, turning XML in its textual representation into an instance of the XDM is not part of the XQuery language. And this is where the trouble begins.
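The effect is easy to reproduce outside the database. The following Python sketch uses the standard library's ElementTree on the CDATA example: a search for e2 in the outer document finds nothing, and the values only become reachable after the text of each b element is parsed a second time. This is an illustration of the problem under the assumption that the embedded text is well-formed XML, not necessarily one of the options for processing such data inside DB2.

```python
import xml.etree.ElementTree as ET

OUTER = """<a>
  <b><![CDATA[<e1><e2>embedded data 1</e2></e1>]]></b>
  <b><![CDATA[<e1><e2>embedded data 2</e2></e1>]]></b>
  <c>oh, even a different element</c>
</a>"""

root = ET.fromstring(OUTER)

# The outer document contains no e2 element nodes, only text inside b.
direct_hits = root.findall(".//e2")


def embedded_e2_values(outer_root):
    """Parse the text of every b element as its own document and collect e2 values."""
    values = []
    for b in outer_root.findall("b"):
        inner = ET.fromstring(b.text.strip())  # second parsing step
        values.extend(e2.text for e2 in inner.iter("e2"))
    return values


values = embedded_e2_values(root)
print(direct_hits, values)
```

The second call to ET.fromstring is exactly the step that XQuery itself does not define: turning text into a data model instance.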

I plan to look at options on how to process the data in a future post.

Friday, December 18, 2009

XML Processing on z/OS

A new Redbook has just been published that discusses XML Processing on z/OS. The book starts out with XML Essentials, so that even beginners to the XML world can make use of this book. After the introduction the different options of XML processing, ranging from COBOL, PL/I over CICS, IMS, and DB2 pureXML to Rational Developer and Java-based solutions are discussed in an overview section. The rest of the book then deals in detail with the different options and also includes a chapter about performance and cost considerations.

Wednesday, September 23, 2009

A look at free database systems (from the XML perspective)

On my laptop I have most (not all, because something is always missing) software I need for my day job and some add-on tasks, including a database system. Right now it is DB2 9.7 Express-C, a free database system where I very often use the included pureXML feature to test out XPath or XQuery expressions or quickly manipulate XML data. The latter can be done by importing the XML files, applying some XQuery magic, maybe even combining it with relational data, and then being done. Other people use it to find the closest ATM - always good to have DB2 pureXML handy.

I also took a look at other free database systems. First, there was Oracle XE. While it seems to offer XML support, the software is based on an old release level (10g R2). Support is through a forum, but requires registration to even look inside to see how the support is. Nothing for me.

Next on my list was MySQL, which I use in some LAMP environments and which, being open source, has lots of community support. However, MySQL's XML support is limited in the sense that, e.g., XQuery expressions are not supported and that XPath evaluation is embedded into SQL in a non-standard way. So nothing for me again.

Last on my list during my evaluation was PostgreSQL. Here the picture is basically the same as for MySQL. PostgreSQL's XML support is limited again in terms of functionality and how, e.g., XPath expressions are embedded into SQL.

DB2 Express-C is free, is based on the latest DB2 release, has an open support forum (no tricks), and offers the broad XML support that the pureXML feature has. So it is DB2 Express-C on my laptop.

[Don't get me wrong when I talk about PostgreSQL and MySQL. I especially like that they added XML support over the years because it widens the common functionality available across products and leads to more XML-related skills among developers and DBAs.]

Thursday, August 27, 2009

XML and Big Data - And why DB2 is hybrid

I could have titled this post as "lesson from the field". The blog post "How XML Threatens Big Data" reports on experience gained from using XML in some data projects, such as biotech, and then continues with "Three Reasons Why XML Fails for Big Data" and "Three Rules for XML Rebels".

I was pulled into some IT projects where the focus was on doing everything in XML. People went overboard in adopting new technology. However, DB2 didn't scrap its relational capabilities for a reason and supports both relational and XML data side-by-side. This is because there are projects where relational format (a.k.a. tables with columns and rows) is the right storage format and there are projects where it is XML. In many cases both are needed and it is good that DB2 is a hybrid system, supporting relational data and XML data very well.

Wednesday, June 3, 2009

What makes a great XML schema?

When you start using XML within your company there is a good chance that at one point in time you want to define your own XML schema. Then, the next question is "What makes a great XML schema?"

Well, there are many design philosophies, and best practices and recommendations for XML schema design exist (utilize your favorite search engine for that). Another way of approaching the design of new XML schemas is to look into how existing XML schemas have been designed (similar to learning from existing code libraries). To get you started, here are some resources:

Susan Malaika's "Get started with Industry Formats and Services with pureXML" has a nice overview of standards and provides links to them.
Wikipedia has a list of XML schemas.
If this is still not enough, Altova also has a Schema Library.
Finally, Paul Kiel's article "Profiling XML Schema" gives an overview of some industry schemas, which XML Schema features they use, and which they avoid. This can give you an indication of what to watch for when you start with your own XML schemas.

BTW: You can see some of the industry formats in action.

Wednesday, April 29, 2009

Jeopardy, Watson, Goldfarb, Mosher, Lorie

IBM announced this week that a new supercomputer named "Watson" will compete on Jeopardy! The computer will have an advanced Question Answering (QA) system. One of the challenges is that the question is not really a question, but a description and contestants have to make up a question as their answer.

In many of my pureXML talks I mention something along the following lines as part of the introduction:
Goldfarb, Mosher, and Lorie created in the 1960s the predecessor of what is now used worldwide for information sharing and exchange and for greater flexibility in information management.

Now you press the buzzer and say: What is...?

Friday, April 24, 2009

XML arrives in the warehouse

More and more data is generated, sent, processed, (even!!!) stored as XML. Application developers, IT architects, and DBAs get used to XML as part of their life. Let's get ready for the next step, XML in the enterprise warehouse. Why? To bring the higher flexibility and faster time-to-availability (shorter development/deployment time) to the core information management systems, the base for critical business decisions, the enterprise warehouses. Greater flexibility and shorter reaction times are especially important during economic phases like the current one.

DB2 9.7 ("Cobra") adds support for XML in the warehouse. What is even better is that cost savings can be realized with XML compression and index compression. Right now not many of the analytic and business intelligence tools support XML data. Fortunately, both the SQL standard and DB2 feature a function named XMLTABLE that allows mapping data from XML documents to a regular table format, thereby enabling BI tools to work with XML data.
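To illustrate the idea of mapping XML to a table format, here is a small Python sketch that shreds a made-up order document into rows of (id, customer, total). It only mirrors the concept behind XMLTABLE, a row-generating path plus one path per column. The document structure and the function name shred_orders are assumptions for this example and it is not DB2 syntax.

```python
import xml.etree.ElementTree as ET

# Made-up sample document with one order element per future table row.
ORDERS = """<orders>
  <order id="1"><customer>Alice</customer><total>19.99</total></order>
  <order id="2"><customer>Bob</customer><total>5.00</total></order>
</orders>"""


def shred_orders(xml_text):
    """Turn each order element into one (id, customer, total) row."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("order"):  # row-generating path
        rows.append((
            int(order.get("id")),               # column 1: attribute id
            order.findtext("customer"),         # column 2: element customer
            float(order.findtext("total")),     # column 3: element total
        ))
    return rows


rows = shred_orders(ORDERS)
print(rows)
```

Once the data has this shape, any tool that understands rows and columns can work with it.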

My colleagues Cindy Saracco and Matthias Nicola have produced a nice summary and introduction of warehousing-related enhancements of pureXML in the upcoming DB2 9.7 release. The developerWorks article is titled "Enhance business insight and scalability of XML data with new DB2 V9.7 pureXML features". Among other things the paper explains how XML can now be used with range partitioning, hash (or database) partitioning, and multi-dimensional clustering. Enjoy!
