Thursday, March 25, 2010

The trouble with processing embedded textual XML (CDATA)

You understand the basics of XML and have started to master processing it. You have successfully used XPath and XQuery in database systems like DB2. Now some other guys are introducing the next level of complexity: embedding XML documents into other XML documents. This is really mean, and I will show you why.

XML documents like the following can be processed with DB2 by using its pureXML feature.
<a>
  <b>first value</b>
  <b>second value</b>
  <c>oh, even a different element</c>
</a>

When a document is inserted into a column of type XML, it is parsed and transformed into the internal, "native" representation. An instance of the so-called XQuery Data Model (XDM) is created (see the Processing Model in the XQuery specification). This step is not part of XQuery processing itself: XQuery assumes that you (or your system) have managed to provide instances of the XDM that it can then operate on. If you inserted the above document into DB2, you have an instance of the XDM and you can process the document using XPath or XQuery. That's fine and everybody is happy.

Now, humans have never rested and have always sought new challenges. When you learn a foreign language, you first learn how to speak and understand simple sentences. Eventually, you have to deal with more complex sentences with subordinate clauses, relative clauses, appositions, and whatever the stuff is called (think of subselects, common table expressions, case statements, etc.). Where was I? Ah, back to XML. Similar to sentences, people tend to make data more complicated and to seek new challenges. What they do is embed XML data into other XML data.


<a>
  <b>
    <e1><e2>embedded data 1</e2></e1>
  </b>
  <b>
    <e1><e2>embedded data 2</e2></e1>
  </b>
  <c>oh, even a different element</c>
</a>

In the above example, the former text values like "first value" are replaced with XML fragments of their own. The entire document and the embedded parts can still be processed easily because all "tags" are element nodes in the XDM instance. XPath and XQuery can directly answer queries on e1 and e2, e.g., all instances of "e2" can quickly be found by searching for "//e2".
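
To make this concrete outside of DB2, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module as a stand-in for an XQuery engine. This is an assumption for illustration only: ElementTree supports just a small XPath subset and builds its own tree rather than a real XDM instance, but the point carries over. Because e1 and e2 are real element nodes, a descendant search finds them.

```python
import xml.etree.ElementTree as ET

# Same document as above: the embedded fragments are ordinary element nodes.
doc = """<a>
  <b>
    <e1><e2>embedded data 1</e2></e1>
  </b>
  <b>
    <e1><e2>embedded data 2</e2></e1>
  </b>
  <c>oh, even a different element</c>
</a>"""

root = ET.fromstring(doc)

# ".//e2" is ElementTree's spelling of the descendant search "//e2".
e2_values = [e2.text for e2 in root.findall(".//e2")]
print(e2_values)
```

Both e2 elements are returned, each with its text value.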

However, the above way of embedding XML is not the only one: some organizations, standards, and data providers embed entire XML documents as text. This can be done by escaping the markup directly, i.e., by replacing < and > with "&lt;" and "&gt;". A different but equivalent way is to use CDATA sections (see the XML specification). Let's take a look at the following example:

<a>
  <b>
    <![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
  </b>
  <b>
    <![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
  </b>
  <c>oh, even a different element</c>
</a>

The embedded document is only text data, and XQuery or XPath cannot easily work on it because e1 and e2 are not element nodes, but only characters in a longer string value. As mentioned earlier, turning XML from its textual representation into an instance of the XDM is not part of the XQuery language. And this is where the trouble begins.
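
The effect can be reproduced with a small sketch, again using Python's xml.etree.ElementTree as a stand-in for DB2 (an assumption for illustration, not how DB2 works internally). To keep the string comparison simple, the whitespace around the CDATA section is dropped and only one b element is used. The sketch also checks the claim that the escaped form and the CDATA form are equivalent.

```python
import xml.etree.ElementTree as ET

# Embedded document wrapped in a CDATA section.
cdata_doc = """<a>
  <b><![CDATA[<e1><e2>embedded data 1</e2></e1>]]></b>
  <c>oh, even a different element</c>
</a>"""

# Same content, but with < and > escaped directly.
escaped_doc = """<a>
  <b>&lt;e1&gt;&lt;e2&gt;embedded data 1&lt;/e2&gt;&lt;/e1&gt;</b>
  <c>oh, even a different element</c>
</a>"""

cdata_root = ET.fromstring(cdata_doc)
escaped_root = ET.fromstring(escaped_doc)

# The search that worked before now finds nothing: e2 is not an element node.
e2_nodes = cdata_root.findall(".//e2")
print(len(e2_nodes))

# The b element has no child elements, only one long string value.
cdata_text = cdata_root.find("b").text
escaped_text = escaped_root.find("b").text
print(cdata_text)

# After parsing, both ways of embedding yield the same text.
print(cdata_text == escaped_text)
```

The parser hands over the embedded markup as plain characters, so path expressions have nothing to navigate.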

I plan to look at options on how to process the data in a future post.