Friday, March 26, 2010

Accessing and processing embedded textual XML documents

Yesterday, I wrote about how and why embedded textual XML documents can create trouble. Today, I will look at different ways I found (so far) on how to deal with those documents in DB2 pureXML. The documents which are embedded as text are, as mentioned, a long text value (text node) and not element and attribute nodes. Because the transformation process of turning textual XML into an instance of the XQuery Data Model (XDM) and hence making the XML document queryable is not part of the XQuery language itself, we have to leave the XQuery environment to deal with those embedded documents. More on this later, let's take a look again at the sample document from yesterday:


<a>
  <b>
    <![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
  </b>
  <b>
    <![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
  </b>
  <c>oh, even a different element<c>
</a>

Inside the "b" elements we have embedded documents in text form. To deal with those documents, we first need to decide how we want them. If we are only interested in their text, we could just navigate down to the "b"s and extract the text values (assuming we have a table CD with a DOC column of type XML where we inserted the above document):

select xmlquery('$DOC/a/b/text()') from cd
The above query would return all embedded documents within the parent document as a single string. This is usually not what is needed. Using XMLTABLE, we can return each embedded document individually as string:

select x.* from cd,xmltable('$DOC/a/b' columns edoc varchar(2000) path '.') as x
The query could directly be used in an insert statement where the embedded documents are stored back into an XML columns. If however all the XML fragments should be used to make up a bigger document, the following statement may work:

select ''||x.edoc||'' from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b}') as x

Within the XMLTABLE, we first navigate to all the "a"s. For the returned column we return all the concatenated text values of b wrapped into a "dummy" element. This is necessary because casting to a string is only possible for a single item, not for a sequence of items (which the different text values would be). Later we utilize string concatenation to attach an "outer" element to make the XML fragment single-rooted.

The same could have been done with a statement like the initial one:
select ''||xmlcast(xmlquery('{$DOC/a/b/text()}') as varchar(2000))||'' from cd

With XMLTABLE however you have more options on what to do because usually more data needs to be extracted and post-processed.

Wrapping an XMLPARSE around the constructed textual XML data turns this back into instances of the XDM which can be queried using XQuery.

select xmlquery('$OUTPUT/outer/e1/e2') from (
select xmlparse(document ''||x.edoc||'') as output from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b}') as x)

Enough of embedded documents for today. In my next post, I will show you an embedded video to demonstrate how to deal with a rainy Friday afternoon...