Showing posts with label pureXML. Show all posts

Friday, July 1, 2016

Store and Query XML Data with dashDB on Bluemix

XML Column in dashDB
I recently got asked whether it is possible to process XML data with dashDB on IBM Bluemix. The answer is: yes, it is possible. dashDB is based on DB2 with its industry-leading pureXML support, about which I have written many blog entries. In the following I give you a quick start on preparing dashDB to store XML data and on how to query it.

If you are using the regular dashDB service plans, which are tailored to analytics, then by default all tables use columnar storage. That format provides deep compression and high-performance query processing for analytic workloads, but it is not suited for natively storing XML data. That is why tables need to be created with an explicit ORGANIZE BY ROW clause in the "Run SQL" dialog (see screenshot above):

CREATE TABLE myTable(id INT, doc XML) ORGANIZE BY ROW

The above statement creates the table "myTable" with two columns, the second of type XML, and in the classic row-oriented table format.

SQL/XML Query with dashDB
Once the table is created, data can be inserted. This can be done by using INSERT statements in the "Run SQL" dialog or by connecting other tools to dashDB. The "Load Hub" is designed for analytic data sets and does not support XML-typed columns. An introduction to inserting XML data can be found in the pureXML tutorial in the DB2 documentation.
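
As a sketch, assuming the table "myTable" from above, an insert can pass the XML either as a string literal (parsed implicitly) or through an explicit XMLPARSE call; the document content here is made up:

```sql
-- hypothetical documents; the XML column accepts a string literal...
INSERT INTO myTable VALUES (1, '<customer><name>Henrik</name></customer>');

-- ...or an explicit XMLPARSE, which also controls whitespace handling
INSERT INTO myTable
  VALUES (2, XMLPARSE(DOCUMENT '<customer><name>Jane</name></customer>'
             PRESERVE WHITESPACE));
```
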
After the XML data is in, the "Run SQL" dialog can be used again to query the documents. Queries can be written either in SQL (SQL/XML) or in XQuery; see the screenshots for examples.
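
For illustration, again assuming the table "myTable" from above, a minimal SQL/XML query and its XQuery counterpart could look like this (element names are made up):

```sql
-- SQL/XML: return the ids of documents with a matching name element
SELECT id FROM myTable
  WHERE XMLEXISTS('$DOC/customer[name="Henrik"]' PASSING doc AS "DOC");

-- XQuery: return the name elements from all documents in the column
XQUERY for $d in db2-fn:xmlcolumn('MYTABLE.DOC')/customer return $d/name;
```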

I hope that gives you a rough idea of how to utilize the pureXML feature in dashDB, even though its main focus is analytics.
XQuery with dashDB

Friday, March 28, 2014

XML or JSON: When to choose what for your database

There is a new article on IBM developerWorks titled "XML or JSON: Guidelines for what to choose for DB2 for z/OS". It has been written by two very experienced technologists with a DB2 background. Though the comparison and the many code examples for how to use either XML or JSON are for DB2 for z/OS, most of it also applies to DB2 LUW. So I recommend reading the article regardless of which "camp" you are in.

At the end of the XML vs. JSON comparison is a resource section with a list of good papers and documentation. In my blog's page on DB2 & pureXML Resources you will also find additional material on that topic.

With that to read, enjoy the weekend!

Tuesday, June 4, 2013

Some typical DB2 registry settings for Temenos T24 and other OLTP systems

In one of my previous articles I described what table properties could be used (CREATE or ALTER TABLE) for XML-based OLTP systems like Temenos T24. Today I will show and explain some DB2 registry settings that are typically in use for such (and other) systems.

A list with almost all (documented) registry and environment variables can be found in the DB2 Information Center. As you can see those variables are grouped into different categories, e.g., compiler, communication, performance, or general settings. As some settings require a restart of DB2, the configuration is typically applied once when setting up a system.

DB2_PARALLEL_IO=*
This tells DB2 how to handle prefetch requests and what to assume about the number of disks per container. Prefetching brings in data ahead of its use, i.e., it reduces I/O wait. The above setting is simple and a good compromise: all tablespaces are covered, and it is assumed that all have the same layout of 6 disks per container, hence multiple smaller prefetch requests are issued in parallel. There are different design philosophies "out there" with sometimes contradicting recommendations. Also take a look at "Configuring for good performance" and "Optimizing table space performance when data is on RAID devices".

DB2_USE_ALTERNATE_PAGE_CLEANING=ON
Page cleaning is the reverse of prefetching. It makes sure that the bufferpool (data cache) always has enough space available to bring in additional data pages from disk. This alternate page cleaning is more aggressive in writing out changed data, makes better use of I/O bandwidth, and helps to keep slots available in the bufferpool for our OLTP system.

DB2_EVALUNCOMMITTED=ON
DB2_SKIPDELETED=ON
DB2_SKIPINSERTED=ON
All three registry settings are typically used together to improve concurrency for the Cursor Stability and Read Stability isolation levels. When scanning data and evaluating predicates, a slightly different, "less patient" approach is taken, which helps throughput in OLTP systems.

DB2MAXFSCRSEARCH=1
When inserting new data, DB2 has to search for a data page with free space. That space is managed with free space control records (FSCRs) that are located every 500 pages or so (see here for details on table organization and FSCRs). By default DB2 searches up to 5 such records in order to find a page to squeeze the new record in. This helps to fill up pages and to efficiently reuse space left from delete or some update operations. However, it does not help insert performance, which is critical in OLTP systems. Hence a setting of 1 is a nod toward insert performance while still trying to reuse space. A more drastic setting would be APPEND ON for a table where new data is always inserted at the end.

DB2_FORCE_NLS_CACHE=TRUE
This setting only applies on AIX, not on Linux, and is used to cache the codepage and territory information for a connection. As T24 and other OLTP systems do not change "national language support" (NLS, i.e., the codepage and territory information), this helps performance slightly and also avoids possible lock contention while accessing that information.
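
Put together, the settings discussed above would typically be applied once with db2set, followed by an instance restart so they take effect (a sketch, to be adapted to your system):

```
db2set DB2_PARALLEL_IO=*
db2set DB2_USE_ALTERNATE_PAGE_CLEANING=ON
db2set DB2_EVALUNCOMMITTED=ON
db2set DB2_SKIPDELETED=ON
db2set DB2_SKIPINSERTED=ON
db2set DB2MAXFSCRSEARCH=1
db2set DB2_FORCE_NLS_CACHE=TRUE
db2stop
db2start
```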

That was my quick overview of a few of the settings that should be considered for OLTP systems like Temenos T24 on DB2. What is left is to take a look at typical DB2 database manager and database configurations. But that is another article...

Tuesday, May 14, 2013

Performance Tuning for XML-based OLTP Systems (Temenos T24)

From time to time my colleagues and I get contacted on how to set up and tune DB2 for use with Temenos T24. The latter is a core banking system in use worldwide. It uses XML as its internal data storage format, which makes it an ideal fit for DB2 pureXML (both on z/OS and on Linux, UNIX, and Windows). To make our answers easily findable, why not blog about them? Today, I will cover setting up tables for good performance.

A typical T24 table consists only of two columns, a RECID (as primary key) and an XMLRECORD. This is similar to other systems implemented based on XML data, both OLTP and more of OLAP/BI character.

Inlining of data

For these kinds of tables it is a good idea to INLINE the XML column, for several reasons. By default, XML data is stored in a special XML Data Area (XDA), so accessing it requires an indirection from the row data: a descriptor stored in the row is used to look up the actual storage location via the XML Regions Index. Inlining eliminates the descriptor, which means the XML data is fetched directly with the row, and it results in fewer entries in the XML Regions Index.


If there is test data, the function ADMIN_EST_INLINE_LENGTH can be used to find the typical inline length that should be set. The maximum inline length depends on the page size and hence the maximum row size; both are documented with the just-mentioned function. For T24 the recommendation is to use an inline length of 32000 bytes or even more with 32 kb pages.
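
As a sketch, with a hypothetical T24-style table (all names are examples), this could look as follows; note that a 32000-byte inline length requires a tablespace with 32 kb pages:

```sql
-- estimate the inline length that would cover the existing documents
SELECT MAX(ADMIN_EST_INLINE_LENGTH(xmlrecord)) FROM t24sample;

-- create the table with the XML column inlined
CREATE TABLE t24table (
  recid     VARCHAR(255) NOT NULL PRIMARY KEY,
  xmlrecord XML INLINE LENGTH 32000
);
```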

Volatile Tables for Index Use

Another configuration option for tables is to declare them as VOLATILE. The background is explained on this page with "Relational index performance tips". Basically, the optimizer will prefer index-based access to the table even if the cardinality changes frequently.
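
The declaration itself is a one-liner, shown here against the hypothetical table from above:

```sql
ALTER TABLE t24table VOLATILE CARDINALITY;
```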

Enable static compression 

Many systems benefit from data compression, and the same goes for T24 on DB2. The question, however, is whether to use static (table) compression or the newer adaptive compression (page-level compression on top). Based on different benchmarks and experience with production-oriented testing, the recommendation is to use static compression: the additional CPU overhead that adaptive compression incurs for a higher degree of space and I/O savings impacts system throughput. When in doubt, test it on your own. Another effect: indexes on compressed tables will also be compressed by default.
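
A sketch for the hypothetical table from above; on releases that offer adaptive compression, STATIC needs to be stated explicitly:

```sql
ALTER TABLE t24table COMPRESS YES STATIC;
-- for a table that already holds data, follow up with a classic REORG
-- (e.g., the REORG TABLE command) to build the compression dictionary
```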

Consider APPEND ON or DB2MAXFSCRSEARCH=1

With APPEND ON as an option to CREATE/ALTER TABLE, DB2 will insert "at the end" of the table without searching for free space elsewhere. It results in slightly higher space consumption, but benefits insert performance. An alternative is to set DB2MAXFSCRSEARCH to a low value, e.g., to 1. It determines, at a global level, how many free space control records (FSCRs) to search when placing a new record. Thus, all tables would be impacted, not just the one for which APPEND ON is specified.

Separate tablespaces for table, index, and long data

It is always a good idea to have separate tablespaces for the different page types. They can be specified during CREATE TABLE.
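
For illustration, assuming three existing tablespaces (names are examples), the clauses look like this:

```sql
-- the tablespaces must already exist with suitable page sizes
CREATE TABLE t24table (
  recid     VARCHAR(255) NOT NULL PRIMARY KEY,
  xmlrecord XML
) IN data_ts INDEX IN index_ts LONG IN long_ts;
```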

PCTFREE 99 for hotspot tables

Depending on the application and usage type of T24 there can be hotspot tables. For these it could make sense to specify a high value for PCTFREE during CREATE/ALTER TABLE. This determines how much space is left free (read: unused) in a page, i.e., how many or how few records are stored in a single page. The fewer records - the extreme would be a single record - in a page, the less likely that page becomes a hotspot.
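
As a sketch, for a hypothetical hotspot table:

```sql
-- leave almost every page free of additional records; takes effect for
-- newly filled pages (a REORG applies it to existing ones)
ALTER TABLE hotaccount PCTFREE 99;
```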

Consider table/range partitioning for bigger tables

Last, but not least, it is a good idea to apply range partitioning on bigger tables. This benefits performance in several ways: Old data can be rolled out (DETACHed) quickly, queries perform faster due to range elimination and rebalancing of work, and maintenance can also be done in parallel on the ranges.
The partitioning can either be done on the RECID (the primary key) or a new hidden column could be introduced to evenly split the data into ranges, depending on needs.
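
A minimal sketch of range partitioning on the RECID, with made-up boundaries:

```sql
CREATE TABLE t24hist (
  recid     VARCHAR(255) NOT NULL,
  xmlrecord XML
)
PARTITION BY RANGE (recid)
  (STARTING ('A') ENDING ('M') EXCLUSIVE,
   STARTING ('M') ENDING (MAXVALUE));
```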

Summary

Tuning basic properties for XML-based tables is not (that) different from other tables. The above gives an introduction to what should be considered when setting up Temenos T24 or similar XML-based systems on DB2. I didn't spell out the complete syntax for every option. If you need it, leave a comment... ;-)

Special thanks to Juergen Goetz for sharing his experience for use with other customers and this blog entry.

BTW: There are also a few links for Temenos T24 on DB2 on my resources page. General DB2 system setup for HA with Temenos is discussed in the Redpaper "High Availability and Disaster Recovery for Temenos T24 with DB2 and AIX".

Tuesday, March 12, 2013

SQL Quiz: How to get this XML output?

Here is a small SQL quiz for you. I start with the following statements against a DB2 database:

drop table contacts;

create table contacts(
   id int unique not null,
   fname varchar(40),
   lname varchar(40),
   cc char(2));


create index cc_idx on contacts(cc);

insert into contacts values(1,'Klaus','Störtebeker','DE'),(2,'Bugs','Bunny','US'),(3,'Robin','Hood','GB'),(4,'Mickey','Mouse','US');




Next, I execute a query and the output is as follows:
<data><contact><first>Klaus</first><last>Störtebeker</last><country>DE</country></contact><contact><first>Robin</first><last>Hood</last><country>GB</country></contact></data> 

And the question for you: What is an efficient way to produce the output above? Post your proposals as comments...

Tuesday, February 19, 2013

Some fun with DB2 Maintenance Policies and pureXML (Part 2)

In my post yesterday I started to look into retrieving and processing the automated maintenance policies using pureXML. The article ended with the conclusion to use the stored procedure AUTOMAINT_GET_POLICY to fetch the policy document. As stated, the SP returns the policy as a BLOB in an output parameter. So, as a first step, we need to convert the BLOB to XML and make it available for further processing. My choice is an SQL-based table function.

CREATE OR REPLACE FUNCTION MaintPolicy (poltype varchar(50))
    RETURNS TABLE(poltype varchar(50), poldoc xml)
    NO EXTERNAL ACTION
    MODIFIES SQL DATA
    BEGIN ATOMIC
    declare tpoldocblob blob(2m);
    call sysproc.automaint_get_policy(poltype,tpoldocblob);
    return values(poltype,xmlparse(document tpoldocblob));
END


To choose which kind of policy should be fetched and returned, the policy type is passed in. The prototype above doesn't check for valid parameters; it could be extended for error handling and for returning all policy documents if null is passed in. The MODIFIES SQL DATA clause is needed because the AUTOMAINT_GET_POLICY stored procedure seems to modify data, and we ran into an error without it. Within the function we call the SP, fetch the policy as a BLOB, and then return a table row with the policy type and the policy document. The document is converted to the XML type using XMLPARSE.

By having a table function, we can now "simply" include the policy documents in a regular SELECT statement.

select poltype, poldoc from table(MaintPolicy('MAINTENANCE_WINDOW'))

POLTYPE                          POLDOC

-------------------------------- -----------------------------------------------
MAINTENANCE_WINDOW
<DB2MaintenanceWindows xmlns="http://www.ibm.com/xmlns/prod/db2/autonomic/config">
<!-- Online Maintenance Window  -->
<OnlineWindow Occurrence="During" startTime="22:00:00" duration="07">
  <DaysOfWeek>Fri Sat</DaysOfWeek>
  <DaysOfMonth>All</DaysOfMonth>
  <MonthsOfYear>All</MonthsOfYear>
</OnlineWindow>
</DB2MaintenanceWindows>    

  1 record(s) selected.


The next step is to look into the XML document. This can be done using either XMLQUERY or XMLTABLE. We continue our quest with XMLTABLE, so that we can turn the individual pieces of information into relational columns. The reason is that most tools and administrators prefer relational data.

The obvious way of using XMLTABLE is the following; the result, however, is not what we hoped for:

select t.*
from table(MaintPolicy('MAINTENANCE_WINDOW')) as p, xmltable(
   XMLNAMESPACES( DEFAULT 'http://www.ibm.com/xmlns/prod/db2/autonomic/config'),
   '$POLDOC/DB2MaintenanceWindows' passing p.poldoc as "POLDOC"
   COLUMNS
     ONL_DAYS VARCHAR(30) PATH 'OnlineWindow/DaysOfWeek',
     ONL_DOM  VARCHAR(60) PATH 'OnlineWindow/DaysOfMonth',
     ONL_MOY  varchar(50) PATH 'OnlineWindow/MonthsOfYear' ) as t
 

SQL20267N  The function "HLOESER.MAINTPOLICY" (specific  "SQL130219093012500")
modifies SQL data and is invoked  in an illegal context. Reason code =  "1". 
SQLSTATE=429BL


The error SQL20267N indicates that our table function is not the last reference in the FROM clause, which seems to be a limitation of calling functions that have MODIFIES SQL DATA. The explanation and user response sections of the error message advise rewriting the statement using a common table expression. Let's try a CTE:

with p(poltype,poldoc) as
     (select poltype, poldoc from table(MaintPolicy('MAINTENANCE_WINDOW'))) 
select t.* 
from p, xmltable(
   XMLNAMESPACES( DEFAULT 'http://www.ibm.com/xmlns/prod/db2/autonomic/config'),

   '$POLDOC/DB2MaintenanceWindows' passing p.poldoc as "POLDOC" 
   COLUMNS
     ONL_DAYS VARCHAR(30) PATH 'OnlineWindow/DaysOfWeek',
     ONL_DOM  VARCHAR(60) PATH 'OnlineWindow/DaysOfMonth',
     ONL_MOY  varchar(50) PATH 'OnlineWindow/MonthsOfYear' ) as t

ONL_DAYS             ONL_DOM                 ONL_MOY                           

-------------------- ----------------------- --------------------
Fri Sat              All                     All           

  1 record(s) selected.


Finally! The function and the query could now be extended and adapted. The WITH clause, a simple form of a common table expression, retrieves the policy type and document using our own table function. In the main SELECT we first reference the data of the CTE, then call XMLTABLE for the XML processing. XMLNAMESPACES is the first function parameter to set the default namespace, i.e., to declare the context for the XML processing. That way we can avoid further namespace declarations within the following clauses. Next, for all the column-related processing, we use "DB2MaintenanceWindows" as the starting point within the XML document. Three columns should be returned: online days, days of month, and months of year. Because of the earlier namespace declaration, all we need to do is navigate down to the respective elements.

What is left is to adapt our prototype to the individual requirements and extend it to other policy documents. Do you have questions or comments?

Monday, February 18, 2013

Some fun with DB2 Maintenance Policies and pureXML (Part 1)

To help reduce administrative costs, DB2 supports automatic collection of statistics, automatic database backups, automatic reorganization of tables and indexes, and some more. This so-called automatic maintenance is optional; it is not necessary to use it with DB2. The specific autonomics can be configured on the database level. If automatic maintenance is switched on, it is a good idea to tell DB2 when it may perform those tasks. You don't want to take indexes, tables, or the entire database offline during peak hours or have I/O-intensive maintenance operations during the day. Your preferences are configured using automated maintenance policies as described in "Scheduled maintenance for high availability".

Policies are specified using XML documents (samples are provided as a starter), then registered with DB2 via either AUTOMAINT_SET_POLICY or AUTOMAINT_SET_POLICYFILE stored procedures. Now that the policies are set, how do you know which are implemented or active? Let's have some fun with DB2, policies, and XML functionality...

To retrieve the policy information, DB2 offers two stored procedures analogous to the SET procedures: AUTOMAINT_GET_POLICY and AUTOMAINT_GET_POLICYFILE. The former returns a BLOB which includes the XML document with the policy, the latter stores the policy document in the file system (only specify the filename and it is stored under ~/sqllib/tmp for most installations). The XML-based information somewhere in the file system is not what we wanted. Looking at AUTOMAINT_GET_POLICY doesn't make us happier either:

 db2 "call automaint_get_policy('MAINTENANCE_WINDOW',null)"

  Value of output parameters
  --------------------------
  Parameter Name  : POLICY
  Parameter Value : x'3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D225554462D38223F3E200A3C4442324D61696E74656E616E636557696E646F7773200A786D6C6E733D22687474703A2F2F7777772E69626D2E636F6D2F786D6C6E732F70726F642F6462322F6175746F6E6F6D69632F636F6E66696722203E0A0A203C212D2D204F6E6C696E65204D61696E74656E616E63652057696E646F7720202D2D3E0A203C4F6E6C696E6557696E646F77204F6363757272656E63653D22447572696E672220737461727454696D653D2232323A30303A303022206475726174696F6E3D22303722203E0A20203C446179734F665765656B3E467269205361743C2F446179734F665765656B3E0A20203C446179734F664D6F6E74683E416C6C3C2F446179734F664D6F6E74683E0A20203C4D6F6E7468734F66596561723E416C6C3C2F4D6F6E7468734F66596561723E0A203C2F4F6E6C696E6557696E646F773E0A3C2F4442324D61696E74656E616E636557696E646F77733E200A'

  Return Status = 0


We cannot directly query the output parameter, and it is a BLOB. How about the DB2 system catalog as a source for lots of information? SYSTOOLS.POLICY looks promising, and it shows up in the package cache with queries related to maintenance policies.

db2 describe table systools.policy

                      Data type                     Column
Column name           schema    Data type name      Length     Scale Nulls
--------------------- --------- ------------------- ---------- ----- ------
MED                   SYSIBM    VARCHAR                    128     0 No   
DECISION              SYSIBM    VARCHAR                    128     0 No   
NAME                  SYSIBM    VARCHAR                    128     0 No   
UPDATE_TIME           SYSIBM    TIMESTAMP                   10     6 No   
POLICY                SYSIBM    BLOB                   2097152     0 Yes  

  5 record(s) selected.


db2 "select med,decision,name from systools.policy"

MED                      DECISION                    NAME
------------------------ --------------------------- ---------------------------
DB2CommonMED             NOP                         CommonPolicy
DB2DatabaseRecoveryMED   DBBackupDecision            DBBackupPolicy
DB2TableMaintenanceMED   StatsProfileDecision        StatsProfilePolicy
DB2TableMaintenanceMED   TableReorgDecision          TableReorgPolicy
DB2TableMaintenanceMED   TableRunstatsDecision       TableRunstatsPolicy

  5 record(s) selected.

POLICY seems to hold the critical information, but it is a BLOB again. We can change it to XML by parsing the value:

db2 "select xmlparse(document policy) from systools.policy where name='CommonPolicy'"

1

---------------------------------------------------------------------------------
<PolicyDocument xmlns:db2="http://www.ibm.com/xmlns/prod/db2/autonomic/policy" xmlns="http://www.ibm.com/xmlns/prod/db2/autonomic/policylang" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/xmlns/prod/db2/autonomic/policy ../schema/DB2TableMaintenanceMED.xsd" medScope="DefaultMaintWindow"><PreconditionSection name="CommonMaintWindowPolicySection"><Precondition policyElementId="online_mw"><db2:MaintenanceWindowCondition timeZone="local" windowType="online"><DayOfWeekMask>0000011</DayOfWeekMask><MonthOfYearMask>111111111111</MonthOfYearMask><DayOfMonthMask>11111111111111111111111111111111111111111111111111111111111111</DayOfMonthMask><TimeOfDayMask>T220000/T050000</TimeOfDayMask></db2:MaintenanceWindowCondition></Precondition></PreconditionSection><PolicySection name="f"/></PolicyDocument>    


  1 record(s) selected.



Unfortunately, the output doesn't look like the official version of the policy document. We could extract information from it using XMLTABLE, but it would be based on an internal structure (which could change over time). Thus, we will bite the bullet and work with AUTOMAINT_GET_POLICY. Read the second part here.

Thursday, January 31, 2013

My money, SEPA payments, DB2 z/OS, and pureXML

A new article about how DB2 pureXML is used to process SEPA payments has been published on IBM developerWorks. SEPA is short for Single Euro Payments Area, a unification and simplification of cross-border payment processes within the European Union/Euro region and associated countries.

In the article, Jane Man shows some tricks for handling documents, extracting information, and updating XML documents. Some of the problems the team ran into are also discussed.

Thursday, September 20, 2012

DB2 10.1 - the first fixpack is out

FP1 for DB2 10.1 for Linux, UNIX, and Windows is now available. Here are the two important links:


Some of the new features or enhancements for DB2 10.1 were already included in DB2 9.7 FP6 and needed to be ported. An example of this is the support for XML type for global variables and in compiled SQL functions.

Monday, September 10, 2012

XML Devotee session with z/OS topics

A session of the so-called XML Devotee Community will be held this Wednesday, September 12th, with two topics: 1) using COBOL with pureXML on DB2 z/OS and 2) XML processing on z/OS. Details of the session, including phone numbers, speaker info, and more, are available at the community website.

Tuesday, July 10, 2012

Index something non-existing - functional indexes in DB2

After discussing hidden (non-visible) columns last week, indexing something non-existing fits into the same theme. Let's shed some light on this seemingly mysterious topic...

XML documents and XML-based data modeling have been around for quite some years now. DB2 started its XML support with the XML Extender in the late 1990s (version 7); DB2 pureXML shipped initially with version 9.1. Advantages of using XML include flexibility in what is included in a document (or set of documents) and how sparse data can be represented. In the relational model, NULL values or other "empty" values need to be stored even if data for a property (column) does not exist; in XML documents that particular property can simply be left out, i.e., nothing is stored. However, this "nothing is stored", the non-existing data, introduced a problem for searching efficiently in some use cases. How do you find all customers that do not have a certain property, e.g., have not received the marketing letter yet or do not own a car?

To address this kind of problem, DB2 10.1 now allows certain functional indexes over XML data (when does it come for relational data - any guesses...?). One of the functions allowed in such an index definition is fn:upper-case(). That way an index supports case-insensitive searches over strings:

CREATE INDEX customers_email_idx ON customers(cinfo) GENERATE KEYS USING XMLPATTERN '/customer/contact/email/fn:upper-case(.)' AS SQL VARCHAR(80); 

SELECT * FROM customers WHERE 
XMLEXISTS('$CINFO/customer/contact/email[fn:upper-case(.)="HENRIK.LOESER AT GMAIL.COM"]');  

The other supported function is fn:exists(), which allows indexing the existence or non-existence of an element or attribute. Thus, an index can be utilized to search even for something that is not (directly) stored in the database, i.e., implied information.

CREATE INDEX customers_db2vers_idx ON customers(cinfo)
  GENERATE KEYS USING XMLPATTERN '/customer/sw/fn:exists(db2version)'
  AS SQL VARCHAR(1);

SELECT * FROM customers WHERE
XMLEXISTS('$CINFO/customer/sw[not(fn:exists(db2version))]');

The above query would return all customers who don't have any DB2 version of software installed. Maybe they are reading this article and then will install DB2 soon...?!


Wednesday, June 20, 2012

NoSQL, SPARQL, RDF, Graph Store, and more buzzwords - all new in DB2 10

One of the smaller headlines during the DB2 10 announcement was the "NoSQL Graph Store". In the "What is new" section of the DB2 Information Center it is listed as "RDF application development support has been added". However, if you work in/with the Semantic Web, it is an important addition as it provides some robust and performant infrastructure. In the following, I will point you to some resources around SPARQL and RDF stores in DB2.


As the data can be represented in XML, a question about support for graph stores is actually a question that pops up from time to time in DB2 pureXML discussions. For those searching for a combination of the buzzwords and DB2, the search engines should now point you to the right pages... ;-)

Wednesday, June 13, 2012

DB2 pureXML and bufferpools - revisited

Some years ago I wrote that XML data is buffered for performance reasons and looked at the XDA object. Now I found out that I didn't answer the "buffered or not" question entirely: the question of which bufferpools are utilized, depending on inlined or regular XML storage, remained. So let me answer it.

CREATE TABLE myX (id INT, x1 XML INLINE LENGTH 500, x2 XML) IN myT1 LONG IN myT2;

Considering the above CREATE TABLE statement, data for table myX would be stored in two tablespaces, myT1 and myT2, and use the associated bufferpool(s). If the internal representation of the XML data for column x1 would be up to 500 bytes, the row data consisting of an integer value for id, the XML data for x1, and a descriptor pointing to the XML data of x2 would be stored in the DAT object and hence in tablespace myT1. When accessed, the data would go to the bufferpool associated with myT1. If the XML data for x1 would be larger than the 500 bytes, the row would hold an integer and two descriptors (one each for x1 and x2).
The descriptors would point to entries in data pages in the XDA object which is stored in tablespace myT2. All XML data for x2 and the data too long to be inlined for x1 would be stored in myT2. When accessed, the bufferpool associated with tablespace myT2 would hold the pages.

In short again: XML data is buffered and it goes through the bufferpool associated with the tablespace it is stored in. This is the same as with regular relational data or index pages.
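
To make the association explicit, here is a sketch of how the tablespace myT2 from the example could have been set up with its own bufferpool beforehand (the bufferpool name and sizes are made up):

```sql
-- a dedicated bufferpool that will cache the XDA pages of myT2
CREATE BUFFERPOOL xmlbp SIZE 10000 PAGESIZE 32K;
CREATE LARGE TABLESPACE myT2 PAGESIZE 32K BUFFERPOOL xmlbp;
```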

Sunday, March 11, 2012

Turn XML data into columnar, relational format using SQL (lots of useful links inside)

The SQL standard offers many functions; some very useful ones are defined in part 14 of the SQL standard, "ISO/IEC 9075-14:2011 Part 14: XML-Related Specifications". That part 14 was first published in 2003, so it is rather old. However, one of the gems in SQL/XML (the name for that XML addition to SQL) and in DB2, the function XMLTABLE and its capabilities, seems to be hidden, as a recent email to me indicates. Or is it how Google, Bing, and others are used...? Anyway, in this article I will provide pointers to some useful resources, mostly older articles, and tag them with the right buzzwords. All in the hope that it is one email less in my inbox...

On IBM's developerWorks there is a two-part article giving a detailed overview of XMLTABLE. In part 1 we learn about all the ways XML data can be turned into relational format, including best practices. The 2nd part deals with advanced processing, such as splitting XML documents into smaller pieces (including parameter passing), shredding XML documents during insert processing (turning XML into tables during insert), relational views over XML data, and creating field/value pairs out of XML data (is this already NoSQL...?). Parameter passing for XML processing is also a topic of one of my older posts. XMLTABLE can also be used to extract XML data embedded into another XML document via CDATA. And a useful webcast that discusses XMLTABLE in DB2 for both the mainframe and the distributed platform can be found at the pureXML devotees. That group, now driven by IBM customers using pureXML, has upcoming meetings/webcasts. Last but not least, Matthias Nicola has written many articles about XMLTABLE in his blog.
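
As a tiny taste of XMLTABLE, assuming a table customers with an XML column cinfo holding documents like <customer><name>...</name><city>...</city></customer> (all names made up), documents can be turned into rows like this:

```sql
SELECT t.name, t.city
FROM customers c,
     XMLTABLE('$D/customer' PASSING c.cinfo AS "D"
       COLUMNS
         name VARCHAR(40) PATH 'name',
         city VARCHAR(40) PATH 'city') AS t;
```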

Let's see what the search engines do with this article. Have a nice week...

Friday, September 9, 2011

DB2 pureXML for the Protein Data Bank - Managing Atoms and Molecules with XML

Yesterday a new interesting article was published on developerWorks, "Managing the Protein Data Bank with DB2 pureXML". It describes how scientists working with highly complex data (the Protein Data Bank), the atoms and molecules that make up proteins, can benefit from a highly sophisticated database management system, DB2.

At the core of the solution is pureXML to reduce the schema complexity (see the graphic for a relational design in the article). Compression is used to keep the database small and keep I/O to a minimum. Now add database partitioning across multiple nodes, range partitioning and multi-dimensional clustering to further help organize the data and to cut complexity and improve performance.

What a good example of combining many of DB2's features for the benefit of science.

BTW: This work was possible through the IBM Academic Initiative which as part of the benefits allows free use of DB2.

Friday, April 15, 2011

My resource page has been updated

I keep a DB2 and pureXML Resources page as part of this blog. I added some more links and will continue with this as I find time.

Tuesday, April 12, 2011

Pure and free: Test pureXML on pureScale

You probably already know that there is a feature for XML data processing in DB2 which is called pureXML. You also know there is a continuous availability and scale-out feature for DB2 called pureScale. And I won't talk about pureQuery today. But did you know that pureXML is supported on pureScale? Did you know that you can request access to the so-called pureScale Acceleration Kits and try it out yourself (for free)?

I recently was an instructor for a pureScale workshop, and one of the attending companies reported that they had built their own small pureScale system. Such a mini system - we often refer to them as nanoClusters (or here) - has all the features of a real cluster, but costs only a few hundred dollars/euros in hardware, and you have your own pureScale system. If you want to "go high-end", you could request a proof of concept/proof of technology at one of the IBM locations. But in either case, test drive pureXML on pureScale...

Tuesday, February 1, 2011

Redbook published: Extremely pureXML in DB2 10 for z/OS

Now the IBM Redbook "Extremely pureXML in DB2 10 for z/OS" has made its way to the public. On more than 300 pages it covers all important topics for XML data processing in DB2 on the mainframe.

Friday, January 28, 2011

World Economic Forum, Parking Problems, and DB2 pureXML

(Image: official logo of the World Economic Forum, via Wikipedia)
The World Economic Forum in Davos certainly is interesting, at least from a logistics and data perspective. A lot of the big and famous try to be there. And who would have thought that parking is a problem?

When you want to fly into Davos, you usually fly into Zurich airport (ZRH) or the airport in Friedrichshafen (FDH). The latter is close to where I live, and this morning it was still serving as a parking lot for small and big private jets from around the world. Some of the VIPs flew directly into FDH and changed to helicopters or cars, but a good chunk opted for the "fly to Zurich, park your jet in Friedrichshafen" option. For the average person it seems funny that you can run into parking problems even with plenty of money behind you. Well, not the regular kind...

How do I get from here to database systems? You cannot just fly to an airport and hope for the best. There is trip planning and logistics involved, and a lot of communication. These days, parking your vehicle needs some extra communication, too. The more efficient, the better.

Now to DB2 and pureXML: A lot of applications send XML to the database or retrieve it from the database. So far this meant that the sender had to produce text-based XML data from its internal structures, and the receiver had to parse it and rebuild its own internal structures. Both take time and cost money (we are living in a fast world). Starting with DB2 10 for z/OS, both the DB2 client and server use binary XML, a special optimized format for encoding XML documents. This allows the "on the wire" representation to be produced faster, and it is also simpler to process on the receiving side. Communication made faster and more efficient: a better chance to find a parking space for your vehicle of choice...?

Monday, January 24, 2011

Why XML columns have a length of zero

One interesting question when looking at the metadata of tables is why XML columns don't have a length.

db2 "create table thebigq(id int, name varchar(40), doc xml inline length 1000)"
DB20000I  The SQL command completed successfully.

hloeser@BR857D67:~/1Q11$ db2 describe table thebigq

              Data type                Column
Column name   schema    Data type name Length     Scale Nulls
------------- --------- -------------- ---------- ----- ------
ID            SYSIBM    INTEGER                 4     0 Yes  
NAME          SYSIBM    VARCHAR                40     0 Yes  
DOC           SYSIBM    XML                     0     0 Yes  

  3 record(s) selected.

Well, for an INTEGER type the size is well known because of the bits allocated to store the value. For VARCHAR strings, the specified maximum length (in bytes) is what is shown. For XML columns, however, there is no maximum size that you can specify. This is in contrast to LOBs (BLOBs, CLOBs). There is also no maximum size of an XML document on disk; it is determined only by available storage and possibly the document structure.
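Since the catalog cannot report a length, the size of each stored document can only be determined from the data itself, for example by serializing it. A small sketch against the table created above; note that the serialized byte count is not the same as the on-disk size, which depends on the internal pureXML storage format:

```sql
-- Measure the serialized size of each XML document in thebigq.doc.
-- XMLSERIALIZE converts the XML value to a BLOB, whose LENGTH (in bytes)
-- is then reported per row; BLOB(2G) is the largest possible target.
SELECT id,
       LENGTH(XMLSERIALIZE(doc AS BLOB(2G))) AS doc_size_bytes
FROM   thebigq;
```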

When pureXML was designed, there was a long discussion on what the length as shown above should be. Should we just show a really big number, like 2GB, or decide that it is -1 or -2?

Inserting an XML document of 2 GB, the maximum that can be transferred into DB2, can take up more or less than 2 GB of storage space. Hence, 2 GB would not be a good value to show. Negative values such as "-1" or "-2" usually carry a special meaning and therefore were also not chosen. So it became zero (0): XML values have the size they have. Something mystic...!?!
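The same zero shows up when querying the catalog directly instead of using DESCRIBE. A quick sketch against SYSCAT.COLUMNS for the table created above (in a shared database you may want to add a TABSCHEMA predicate):

```sql
-- The LENGTH column in SYSCAT.COLUMNS reports 4 for the INTEGER,
-- 40 for the VARCHAR(40), and 0 for the XML column.
SELECT colname, typename, length
FROM   syscat.columns
WHERE  tabname = 'THEBIGQ'
ORDER  BY colno;
```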