![]() |
Byte length of (Unicode) strings |
Showing posts with label XQuery. Show all posts
Showing posts with label XQuery. Show all posts
Wednesday, November 22, 2023
Unicode string length, code points, and Db2
Friday, August 17, 2018
Db2: Some Friday Fun with XML and SQL recursion
![]() |
Recursion is fun!? |
Monday, March 25, 2013
Passing string sequences to SQL/XML functions: Tokenize to the rescue
This morning I answered an interesting DB2 XQuery/SQL question about parameter passing. The core issue was related to the different data models and bridging them when data is passed from the SQL world to XQuery or back. After checking this blog's archive I found that I hadn't written about this common issue before. So let's start by showing you what worked and what not:
create table tok(id int, doc xml);
insert into tok values(1,'<a><b>IBM</b></a>');
insert into tok values(2,'<a><b>BLA</b></a>');
First I create table and insert two simple documents. Now we are ready to query the table:
select * from tok
where xmlexists('$DOC/a[b = ("IBM","whatever")]');
ID DOC ----------- -------------------------------
1 <a><b>IBM</b></a>
1 record(s) selected.
select * from tok where xmlexists('$DOC/a[b = $para]' passing cast('IBM,whatever' as varchar(200)) as "para");
ID DOC ----------- -------------------------------
0 record(s) selected.
In the first query above we directly evalute a comparison against a list of string values (IBM, BLA). It matches the first XML document. The second query passes the same comma-separated value in, but none of the documents is found. What happened?
In XQuery the comparisons above are called "general comparison" (there is also value comparison and node comparison). The general comparison returns true above if some value in the first sequence is equal to some value in the second sequence. The important piece of information is sequence. In the first query, the '("IBM", "BLA")' constructs a sequence with two string values as items . The "(" and ")" are used as sequence constructor. In the second query, a single string value is passed in. There are probably different ways of composing the string, but it remains a string.
What can be done? For the example above XQuery and XPath have the function fn:tokenize(). That function splits up a string into a sequence of substrings. Thus, we would have a sequence again that could be used in the comparison. Let's see whether it works:
select * from tok where xmlexists('$DOC/a[b = fn:tokenize($para,",")]' passing cast('IBM,whatever' as varchar(200)) as "para");
ID DOC ----------- -------------------------------
1 <a><b>IBM</b></a>
1 record(s) selected.
As it can be seen, the tokenize helps to split up the parameter into a sequence and the query result is as expected again.
create table tok(id int, doc xml);
insert into tok values(1,'<a><b>IBM</b></a>');
insert into tok values(2,'<a><b>BLA</b></a>');
First I create table and insert two simple documents. Now we are ready to query the table:
select * from tok
where xmlexists('$DOC/a[b = ("IBM","whatever")]');
ID DOC ----------- -------------------------------
1 <a><b>IBM</b></a>
1 record(s) selected.
select * from tok where xmlexists('$DOC/a[b = $para]' passing cast('IBM,whatever' as varchar(200)) as "para");
ID DOC ----------- -------------------------------
0 record(s) selected.
In the first query above we directly evalute a comparison against a list of string values (IBM, BLA). It matches the first XML document. The second query passes the same comma-separated value in, but none of the documents is found. What happened?
In XQuery the comparisons above are called "general comparison" (there is also value comparison and node comparison). The general comparison returns true above if some value in the first sequence is equal to some value in the second sequence. The important piece of information is sequence. In the first query, the '("IBM", "BLA")' constructs a sequence with two string values as items . The "(" and ")" are used as sequence constructor. In the second query, a single string value is passed in. There are probably different ways of composing the string, but it remains a string.
What can be done? For the example above XQuery and XPath have the function fn:tokenize(). That function splits up a string into a sequence of substrings. Thus, we would have a sequence again that could be used in the comparison. Let's see whether it works:
select * from tok where xmlexists('$DOC/a[b = fn:tokenize($para,",")]' passing cast('IBM,whatever' as varchar(200)) as "para");
ID DOC ----------- -------------------------------
1 <a><b>IBM</b></a>
1 record(s) selected.
As it can be seen, the tokenize helps to split up the parameter into a sequence and the query result is as expected again.
Tuesday, July 10, 2012
Index something non-existing - functional indexes in DB2
After discussing hidden (non-visible) columns last week, indexing something non-existing fits into the same theme. Let's shed some light into this seemingly mysterious topic...
XML documents and XML-based data modeling have been around for quite some years now. DB2 started its XML support with the XML extender in the late 1990ies (version 7), DB2 pureXML shipped initially with version 9.1. Advantages of using XML include flexibility of what is included in a document (or set of documents) and how sparse data can be represented. In relational NULL values or other "empty" values need to be stored even if data for a property (column) does not exist, in XML documents that particular property could be just left off, i.e., nothing is stored. However, the "nothing is stored", the non-existing data introduced a problem for searching efficiently in some use cases. How do you find all customers that do not have a certain property, e.g., have not received the marketing letter yet or do not own a car?
To address this kind of problem, DB2 10.1 now allows certain functional indexes over XML data (when does it come for relational data - any guesses...?). One of the functions allowed in such an index definition is fn:upper-case(). That way an index supports case insensitive searches over strings:
The other supported function is fn:exists() which allows to index existence or non-existence of an element or attribute. Thus, an index can be utilized to search even for something that is not (directly) stored in the database, i.e., implied information.
SELECT * FROM customers WHERE
XMLEXISTS('$CINFO/customers/sw[not(fn:exists(db2version))]');
The above query would return all customers who don't have any DB2 version of software installed. Maybe they are reading this article and then will install DB2 soon...?!
XML documents and XML-based data modeling have been around for quite some years now. DB2 started its XML support with the XML extender in the late 1990ies (version 7), DB2 pureXML shipped initially with version 9.1. Advantages of using XML include flexibility of what is included in a document (or set of documents) and how sparse data can be represented. In relational NULL values or other "empty" values need to be stored even if data for a property (column) does not exist, in XML documents that particular property could be just left off, i.e., nothing is stored. However, the "nothing is stored", the non-existing data introduced a problem for searching efficiently in some use cases. How do you find all customers that do not have a certain property, e.g., have not received the marketing letter yet or do not own a car?
To address this kind of problem, DB2 10.1 now allows certain functional indexes over XML data (when does it come for relational data - any guesses...?). One of the functions allowed in such an index definition is fn:upper-case(). That way an index supports case insensitive searches over strings:
CREATE INDEX customers_email_idx ON customers(cinfo)
GENERATE KEYS USING XMLPATTERN '/customer/contact/email/fn:upper-case(.)'
AS SQL VARCHAR(80);
SELECT * FROM customers WHERE
XMLEXISTS('$CINFO/customer/contact/email[fn:upper-case(.)="HENRIK.LOESER AT GMAIL.COM"]');
XMLEXISTS('$CINFO/customer/contact/email[fn:upper-case(.)="HENRIK.LOESER AT GMAIL.COM"]');
The other supported function is fn:exists() which allows to index existence or non-existence of an element or attribute. Thus, an index can be utilized to search even for something that is not (directly) stored in the database, i.e., implied information.
CREATE INDEX customers_db2vers_idx ON customers(cinfo)
GENERATE KEYS USING XMLPATTERN '/customer/sw/fn:exists(db2version)'
SQL AS VARCHAR(1);
XMLEXISTS('$CINFO/customers/sw[not(fn:exists(db2version))]');
The above query would return all customers who don't have any DB2 version of software installed. Maybe they are reading this article and then will install DB2 soon...?!
Tuesday, May 8, 2012
Business Travel and DB2 SQL Compatibilty
As part of my job I travel a lot within Europe. Most countries accept the Euro and participate in the Schengen Agreement (forming the Schengen Area) which means borderless travel and few, if any, controls (more that in a future blog post). What remains are the different languages and the different culture, including food. During one of the recent journeys I came up with a nice comparison of the DB2 SQL (Oracle) Compatibility and understanding the different European languages.
For the performance of PL/SQL and the Oracle SQL dialect in DB2 we often have to point out that statements are compiled to regular DB2 runtime code (so-called sections and packages). At runtime DB2 basically doesn't know whether DB2's SQL or the Oracle-like syntax was used. Both perform equally well as there is only one runtime infrastructure, no emulation. The same is true for SQL vs. XQuery, there is the same single runtime infrastructure.
In school I learned English, Latin, and French, later at university Spanish. I don't speak Latin, French, or Spanish fluently, but I understand enough to get around. The same for Dutch, Danish, etc. What that means is that I don't need a translator, but can directly understand local road signs, hotel employees, newspapers. This gives a "performance advantage", again, as I (my "runtime infrastructure") can process the languages directly, not all ("compatibility rate"), but enough for most common situations. Here DB2 and I differ as DB2 has a higher compatibility rate...
For the performance of PL/SQL and the Oracle SQL dialect in DB2 we often have to point out that statements are compiled to regular DB2 runtime code (so-called sections and packages). At runtime DB2 basically doesn't know whether DB2's SQL or the Oracle-like syntax was used. Both perform equally well as there is only one runtime infrastructure, no emulation. The same is true for SQL vs. XQuery, there is the same single runtime infrastructure.
In school I learned English, Latin, and French, later at university Spanish. I don't speak Latin, French, or Spanish fluently, but I understand enough to get around. The same for Dutch, Danish, etc. What that means is that I don't need a translator, but can directly understand local road signs, hotel employees, newspapers. This gives a "performance advantage", again, as I (my "runtime infrastructure") can process the languages directly, not all ("compatibility rate"), but enough for most common situations. Here DB2 and I differ as DB2 has a higher compatibility rate...
Tuesday, January 31, 2012
XMLQuery: The document that never was and the intermediate step error
Recently I got a help request by email stating that a snippet of SQL code didn't work. There wasn't much detail included about what went wrong, but to me it boiled down to a SELECT statement using a XMLQUERY function, all inside a SQL-based function. I had some minutes, so I gave it a try.
Many people start developing code with an easy skeleton, then extending it and adding the more complex stuff. Why not do the same for an XQuery?
The above the scaled down version of the easy skeleton. It seems to work well. Now let's extend it:
db2 => values xmlquery('$FOO/a/b' passing '<a><b>small test</b></a>' as "FOO")
SQL16011N The result of an intermediate step expression in an XQuery path expression contains an atomic value. Error QName=err:XPTY0019. SQLSTATE=10507
Hm, that doesn't look expected, or does it? Looking into the XQuery and DB2 error description doesn't help either if you are dealing with such problems. The only thing that really helps here is experience. And it says: Remember to process well-formed documents!
In the above statements we pass a string in for XQuery processing. The simple skeleton works because we just return the string. However, the XQuery processor refuses to apply step expressions ("a" and "b" in our case) to the value "$FOO" because it is NOT an XML document, but a string. Hence the error message with the hint "contains an atomic value" (the string). What can we do? We can pass in a valid XML document:
To produce an XML document out of a string we just use the XMLPARSE function as shown. And suddenly everything seems clear. Remember to process well-formed documents!
Many people start developing code with an easy skeleton, then extending it and adding the more complex stuff. Why not do the same for an XQuery?
db2 => values xmlquery('$FOO' passing '<a><b>small test</b></a>' as "FOO")
------
<a><b>small test</b></a>
The above the scaled down version of the easy skeleton. It seems to work well. Now let's extend it:
db2 => values xmlquery('$FOO/a/b' passing '<a><b>small test</b></a>' as "FOO")
SQL16011N The result of an intermediate step expression in an XQuery path expression contains an atomic value. Error QName=err:XPTY0019. SQLSTATE=10507
Hm, that doesn't look expected, or does it? Looking into the XQuery and DB2 error description doesn't help either if you are dealing with such problems. The only thing that really helps here is experience. And it says: Remember to process well-formed documents!
In the above statements we pass a string in for XQuery processing. The simple skeleton works because we just return the string. However, the XQuery processor refuses to apply step expressions ("a" and "b" in our case) to the value "$FOO" because it is NOT an XML document, but a string. Hence the error message with the hint "contains an atomic value" (the string). What can we do? We can pass in a valid XML document:
db2 => values xmlquery('$FOO/a/b' passing xmlparse(document '<a><b>small test</b></a>') as "FOO")
------
<b>small test</b>
To produce an XML document out of a string we just use the XMLPARSE function as shown. And suddenly everything seems clear. Remember to process well-formed documents!
Tuesday, November 22, 2011
All the news: DB2 Express-C with PL/SQL, new TPoX version, DS 3.1, and hello to all PMs
I am in between business trips and there is not much time for looking deeper into any specific problems. But I wanted to touch base on few things that are new:
- DB2 Express-C, the free to download, free to use edition of DB2 LUW now includes the Oracle compatibility. That is, you can develop and use PL/SQL packages with DB2, for free, even in production. Although it still says "9.7.4" at the DB2 Express-C download site, when you click through it then offers DB2 9.7.5 for download. I just tried it.
- A new version of TPoX, the open source XML database benchmark, has been released. Most changes are to the workload driver. I know that some of you use the workload driver not just for TPoX and DB2.
BTW: The DB2 Technology Explorer/Management Console includes a so-called Workload Multiuser Driver (WMD) that can be handy, too. - IBM Data Studio 3.1 is out since few weeks and Data Studio will replace the DB2 Control Center in the near future. There is a so-called Administration Client (which is small) and a Full Client. Both have a different download size and a different function set. An overview of what is included and which database servers in addition to DB2 are supported is listed at this Data Studio V3.1 features document.
Tuesday, July 19, 2011
A small update on updating XML data in DB2
Once you get started with processing XML data within a database, such as DB2, the next question usually is: How can I update XML documents? Well, (relational) database systems usually have an UPDATE statement for modifying data. In DB2, the same UPDATE can be used to change XML documents. The way it is done is to provide the new XML document which can be based on the previous version and then modified by applying expressions based on the XQuery Update Facility.
DB2 uses the so-called "transform expression" of that standard to let you describe (both SQL and XQuery are declarative query languages) how the new XML document should look like. Instead of providing an example here, I will give you the link to "Transform expression and updating expressions" in the DB2 Information Center. There you find plenty of examples to get you started.
The interesting aspect of the transform expression is that because it is an expression like any other in XQuery, you can combine it with the rest of XQuery and modify XML documents on the fly (e.g., in XMLQUERY or XMLTABLE), use it in "if-then-else" expressions to update documents conditionally, or come up with new ideas of how to use it.
DB2 uses the so-called "transform expression" of that standard to let you describe (both SQL and XQuery are declarative query languages) how the new XML document should look like. Instead of providing an example here, I will give you the link to "Transform expression and updating expressions" in the DB2 Information Center. There you find plenty of examples to get you started.
The interesting aspect of the transform expression is that because it is an expression like any other in XQuery, you can combine it with the rest of XQuery and modify XML documents on the fly (e.g., in XMLQUERY or XMLTABLE), use it in "if-then-else" expressions to update documents conditionally, or come up with new ideas of how to use it.
Monday, July 11, 2011
XQuery: Variable-based step in XPath expression
Recently I was asked how parts of an XPath expression can be passed into an XQuery in DB2. For performance, it is - of course - best to know all the steps when compiling a query. Let's look into the options by introducing an example:
Imagine a document where we have an element "greeting" inside either "b" or "c":
We want to search for either /a/b/greeting or /a/c/greeting and pass in the "b" or "c" as variable "qt" (query tag).
One option is to compose the entire statement inside the application by concatenating the query string, then executing it as dynamic SQL.
If the variable needs to be processed as part of a stored procedure, the query string could be composed inside the procedure, then prepared and executed using a cursor. The Information Center has a good example for such a procedure and XQuery for this second option.
A third option would be to process the variable as part of the XPath expression itself within XMLQUERY, XMLEXISTS, or XMLTABLE (XMLEXISTS shown):
We are using the XPath function fn:local-name() to access the element name and then comparing it against the passed in variable. Also note how we can use a parameter marker with our XQuery/XPath expression. The above works with namespaces, too. In that case use something like "/nsPrefix:*[local-name() eq $qt]/" in your XPath expression.
Imagine a document where we have an element "greeting" inside either "b" or "c":
<a>
<b><greeting>Hello</greeting></b>
<c><greeting>Moin</greeting></c>
</a>
We want to search for either /a/b/greeting or /a/c/greeting and pass in the "b" or "c" as variable "qt" (query tag).
One option is to compose the entire statement inside the application by concatenating the query string, then executing it as dynamic SQL.
qs="/a/"+qt+"/greeting";
If the variable needs to be processed as part of a stored procedure, the query string could be composed inside the procedure, then prepared and executed using a cursor. The Information Center has a good example for such a procedure and XQuery for this second option.
A third option would be to process the variable as part of the XPath expression itself within XMLQUERY, XMLEXISTS, or XMLTABLE (XMLEXISTS shown):
...
XMLEXISTS('$DOC/a/*[local-name() eq $qt]/greeting' PASSING (cast ? as varchar(60)) as "qt")
...
We are using the XPath function fn:local-name() to access the element name and then comparing it against the passed in variable. Also note how we can use a parameter marker with our XQuery/XPath expression. The above works with namespaces, too. In that case use something like "/nsPrefix:*[local-name() eq $qt]/" in your XPath expression.
Friday, April 8, 2011
Some more time, some XML functions
I had written about some details regarding current time and timezone for the regular DB2 and the Oracle mode. Now I had some time to play with built-in XQuery functions.
There are quite many that deal with time, date, timestamps, and timezones. For the XML and XQuery processing an implicit timezone of UTC (Coordinated Universal Time) is assumed. This is something to keep in mind when processing XML data - different semantics again (who said life is easy?). XQuery defines a good number of functions and operators on durations, dates, and times. DB2 supplements that with some more functions to make life simpler (not easy) in DB2. Most of them provide the local value, i.e., the one related to where your database server is located.
Note the "Z" behind the timestamp, indicating Zulu (UTC) time, not necessarily the one of your location. DB2 adds its own function to deliver that:
Now the "Z" is missing and we have a timestamp without a timezone. How about some experiments with timezones?
With the first call we can obtain the timezone at our place. In my example it is the Central European Time with Daylight Savings being active. Not surprisingly, subtracting the (global) current time from the current local time, we receive the same difference in hours.
XQuery has a function to adjust a timestamp to a specific timezone. In the example above, we use the already known functions as input and the result gives another timestamp. The interesting part about is that now a timezone indicator is returned, "+02:00".
My time is up for today. If you have time, try out the other time-related functions in XQuery...
There are quite many that deal with time, date, timestamps, and timezones. For the XML and XQuery processing an implicit timezone of UTC (Coordinated Universal Time) is assumed. This is something to keep in mind when processing XML data - different semantics again (who said life is easy?). XQuery defines a good number of functions and operators on durations, dates, and times. DB2 supplements that with some more functions to make life simpler (not easy) in DB2. Most of them provide the local value, i.e., the one related to where your database server is located.
xquery current-dateTime()
--------------------------------------
2011-04-08T10:12:48.70082Z
Note the "Z" behind the timestamp, indicating Zulu (UTC) time, not necessarily the one of your location. DB2 adds its own function to deliver that:
xquery db2-fn:current-local-dateTime()
--------------------------------------
2011-04-08T12:13:44.026531Now the "Z" is missing and we have a timestamp without a timezone. How about some experiments with timezones?
xquery db2-fn:local-timezone()
--------------------------------------
PT2H
xquery db2-fn:current-local-dateTime() - current-dateTime()
--------------------------------------
PT2H
With the first call we can obtain the timezone at our place. In my example it is the Central European Time with Daylight Savings being active. Not surprisingly, subtracting the (global) current time from the current local time, we receive the same difference in hours.
xquery adjust-dateTime-to-timezone(current-dateTime(), db2-fn:local-timezone())
--------------------------------------
2011-04-08T12:18:09.078883+02:00
XQuery has a function to adjust a timestamp to a specific timezone. In the example above, we use the already known functions as input and the result gives another timestamp. The interesting part about is that now a timezone indicator is returned, "+02:00".
My time is up for today. If you have time, try out the other time-related functions in XQuery...
Monday, March 14, 2011
Oracle compatibility in DB2: Some strings attached
One of the great features that came with DB2 9.7 is the so-called SQL compatibility. That is, DB2 supports a bigger part Oracle's SQL syntax and the related semantics. This is great when you want to "break free", i.e., move your applications from Oracle to DB2. Many DB2 developers have also looked into what of the compatibility features to use for their systems.
In this post, I want to point out the differences in string semantics depending on whether you are in Oracle or DB2 mode. Let's take a look at the regular first:
When you switch to the Oracle mode by setting the DB2_COMPATIBILITY_VECTOR, you implicitly also switch to VARCHAR2 semantics. By that different semantics are applied to strings, including removal of trailing blanks for comparisons and empty strings converted to NULLs. How does that look like for our example (after setting the vector, restarting DB2, creating a different database)?
db2 => create table foo(id varchar(20) unique not null, num int)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc',1)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc ',2)
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0803N One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.FOO" from having duplicate values for the index key.
SQLSTATE=23505
Quite some different behavior than before and something to consider before enabling some of the compatibility features. For the XML crowd I want to point out, that XQuery string semantics are not impacted by the switch - they are different from both...
In this post, I want to point out the differences in string semantics depending on whether you are in Oracle or DB2 mode. Let's take a look at the regular first:
db2 => create table foo(id varchar(20) unique not null, num int)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc',1)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc ',2)
DB20000I The SQL command completed successfully.
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc',1)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc ',2)
DB20000I The SQL command completed successfully.
When you switch to the Oracle mode by setting the DB2_COMPATIBILITY_VECTOR, you implicitly also switch to VARCHAR2 semantics. By that different semantics are applied to strings, including removal of trailing blanks for comparisons and empty strings converted to NULLs. How does that look like for our example (after setting the vector, restarting DB2, creating a different database)?
db2 => create table foo(id varchar(20) unique not null, num int)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc',1)
DB20000I The SQL command completed successfully.
db2 => insert into foo values('abc ',2)
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0803N One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.FOO" from having duplicate values for the index key.
SQLSTATE=23505
Monday, December 6, 2010
More on full text indexing: XML documents and XPath
Last week I explained how to build a full text index on PDF files stored in DB2. Today, I will look into how to search in XML documents using a full text index and DB2 Text Search. One of the benefits of XML documents is that their content is (semi-) structured. With full text search you cannot find a needle in a haystack, but you can find which of many haystacks a needle is in. With XML documents and information about their structure, search becomes more directed and results are more precise. Let's look into details.
First, we create a table name xfiles which holds XML documents. Then we insert three different rows with similar XML documents. Next in our journey to insight is creating a full text index on column doc and then to update it:
Note that when the column is of type XML, then the index is also of format XML. Now that we are done with the setup, we can actually perform the first search:
Well, we get what we asked for. All documents contain the word "Friedrichshafen" and we know as much as before. But then we remember that XML documents are (semi-) structured and that I mentioned XPath before. The good news is that we can use XPath expressions as part of full text search to cut down on the number of returned documents.
Note that in the above query we are using the Linux command shell and need some escaping for the quote signs. Without them the statement would look like the following:
select id from xfiles where contains(doc,'@xpath:''/r[a contains( "Friedrichshafen")]''')>0
As part of the parameters to the CONTAINS function, we specify that an XPath expression should be used and then the actual expression. In our example we only want documents where the word "Friedrichshafen" is contained in the "a" elements. Hence, only a single document is returned. You can combine predicates inside the CONTAINS using and/or and also make use of of some advanced expressions. For the full syntax see here.
or
As we have seen above, a full text index allows both content and structural search on XML documents. If you don't want to use SQL as primary query language but XQuery, then the DB2 built-in function for XQuery, xmlcolumn-contains, can be used.
And remember, all of this already works in the free DB2 Express-C.
db2 "create table xfiles(id int not null primary key, doc xml)"
db2 "insert into xfiles values (1,'<r><a>Friedrichshafen</a><b>Paris</b><c>San Jose</c></r>')"
db2 "insert into xfiles values (2,'<r><a>Paris</a><b>San Jose</b><c>Friedrichshafen</c></r>')"
db2 "insert into xfiles values (3,'<r><a>San Jose</a><b>Paris</b><c>Friedrichshafen</c></r>')"
First, we create a table name xfiles which holds XML documents. Then we insert three different rows with similar XML documents. Next in our journey to insight is creating a full text index on column doc and then to update it:
db2ts "CREATE INDEX idxDocs FOR TEXT ON xfiles(doc)"
db2ts "UPDATE INDEX idxDocs FOR TEXT"Note that when the column is of type XML, then the index is also of format XML. Now that we are done with the setup, we can actually perform the first search:
db2 "select id from xfiles where contains(doc,'Friedrichshafen')>0"
ID
-----------
1
2
3
3 record(s) selected.
ID
-----------
1
2
3
3 record(s) selected.
Well, we get what we asked for. All documents contain the word "Friedrichshafen" and we know as much as before. But then we remember that XML documents are (semi-) structured and that I mentioned XPath before. The good news is that we can use XPath expressions as part of full text search to cut down on the number of returned documents.
db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\")]''')>0"
ID
-----------
1
1 record(s) selected.
ID
-----------
1
1 record(s) selected.
Note that in the above query we are using the Linux command shell and need some escaping for the quote signs. Without them the statement would look like the following:
As part of the parameters to the CONTAINS function, we specify that an XPath expression should be used and then the actual expression. In our example we only want documents where the word "Friedrichshafen" is contained in the "a" elements. Hence, only a single document is returned. You can combine predicates inside the CONTAINS using and/or and also make use of of some advanced expressions. For the full syntax see here.
db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\") or b contains(\"Paris\")]''')>0"
ID
-----------
1
3
2 record(s) selected.
ID
-----------
1
3
2 record(s) selected.
or
select id from xfiles where contains(doc,'@xpath:''/r[a contains("Friedrichshafen") or b contains("Paris")]''')>0
As we have seen above, a full text index allows both content and structural search on XML documents. If you don't want to use SQL as primary query language but XQuery, then the DB2 built-in function for XQuery, xmlcolumn-contains, can be used.
And remember, all of this already works in the free DB2 Express-C.
Friday, November 19, 2010
pureXMLness - Explained!
Earlier this week I had asked "What pureXMLness do you have?". Let me explain the query that computes the pureXMLness today.
xquery
let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;
First, let me start with the first keyword, "xquery". DB2 is a hybrid database engine and it natively supports both SQL statements and XQuery statements. Prefixing the statement with "xquery" means that a native XQuery follows.
What then follows is a so-called let statement ("let $i [...]"). It is used to define the variables $i to $n. For each of the variables a value is assigned. All of them are computed using the sqlquery function which executes a SQL SELECT statement. One requirement is that the function sqlquery returns an XML sequence. The SQL statements that are executed are simple SELECT statements using the count() or avg() function. To convert the results of count() and avg() to an XML sequence, we use the XMLCAST function (xmlcast(count(*) as xml)). The SQL value is converted into an XML value (which is a XML sequence).
As input to the pureXMLness we take the number of entries in the indexxmlpatterns catalog table into account, i.e., the number of indexes over XML columns. Another variable is the number of XML-typed columns obtained from the columns catalog table. Three other variables are based on the number of stringIDs (sysxmlstrings), pathIDs (sysxmlpaths), and the registered XML schemas (xsrobjects). The last input variable is the average bufferpool hit ratio for XDA objects.
After all the variables have been computed, they can be used in the return clause to compose the pureXMLness which is packed into an XML element.
xquery
let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;
First, let me start with the first keyword, "xquery". DB2 is a hybrid database engine and it natively supports both SQL statements and XQuery statements. Prefixing the statement with "xquery" means that a native XQuery follows.
What then follows is a so-called let statement ("let $i [...]"). It is used to define the variables $i to $n. For each of the variables a value is assigned. All of them are computed using the sqlquery function which executes a SQL SELECT statement. One requirement is that the function sqlquery returns an XML sequence. The SQL statements that are executed are simple SELECT statements using the count() or avg() function. To convert the results of count() and avg() to an XML sequence, we use the XMLCAST function (xmlcast(count(*) as xml)). The SQL value is converted into an XML value (which is a XML sequence).
As input to the pureXMLness we take the number of entries in the indexxmlpatterns catalog table into account, i.e., the number of indexes over XML columns. Another variable is the number of XML-typed columns obtained from the columns catalog table. Three other variables are based on the number of stringIDs (sysxmlstrings), pathIDs (sysxmlpaths), and the registered XML schemas (xsrobjects). The last input variable is the average bufferpool hit ratio for XDA objects.
After all the variables have been computed, they can be used in the return clause to compose the pureXMLness which is packed into an XML element.
Wednesday, November 17, 2010
One editor, many tools: The SQL and XQuery editor has been enhanced
Earlier this month a new article "Creating scripts more efficiently in the SQL and XQuery editor" has been published on developerWorks. Some of us are switching from the deprecated DB2 Control Center to the Eclipse-based tooling. Good to know that the same SQL editor is part of Data Studio, InfoSphere Data Architect, and several Optim tools.
The articles describes how man new editor features can be used to efficiently develop SQL scripts or XQuery scripts. It is also a good introduction into how to use the statement editor. What is often done for editing XML-related queries, namely changing the statement terminator, is described as well.
The articles describes how man new editor features can be used to efficiently develop SQL scripts or XQuery scripts. It is also a good introduction into how to use the statement editor. What is often done for editing XML-related queries, namely changing the statement terminator, is described as well.
Tuesday, November 16, 2010
What "pureXMLness" do you have...?
The longer and the more intense the work with XML in a database system of your choice (this is DB2, right?), the more often comes the question: What pureXMLness do I have?
To what degree am I using pureXML? How much XML does my system have? Is our system mostly relational or XML (remember DB2 is hybrid)? How do we measure how XML is used? Give me some numbers for our CIO, it's Q4.
To give a scientifically-proven, valuable answer, something that stands the test of time and holds up to all questions from the business side, I came up with THE pureXMLness factor. How is it computed? The following is the simple query that needs to be run against the DB2 database in question. It's an XQuery of course.
xquery
let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;
Let me know what pureXMLness you have or if you think you have a better formula...
To what degree am I using pureXML? How much XML does my system have? Is our system mostly relational or XML (remember DB2 is hybrid)? How do we measure how XML is used? Give me some numbers for our CIO, it's Q4.
To give a scientifically-proven, valuable answer, something that stands the test of time and holds up to all questions from the business side, I came up with THE pureXMLness factor. How is it computed? The following is the simple query that needs to be run against the DB2 database in question. It's an XQuery of course.
xquery
let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;
Let me know what pureXMLness you have or if you think you have a better formula...
Wednesday, November 10, 2010
How long is long? Maximum size of an XML text node in DB2
I was recently asked about the maximum size of a single text node when stored in DB2 pureXML. Basically, when you create a table
and insert an XML document with several elements, how long can the text between the open and closing tag be, e.g., the value of myText?
The background to this question is that XML values are formatted to data pages. If the XML document is larger than the page size, the document is broken up into subtrees that fit into the page. However, a text node is already a single node. Can a text value be larger than the page size?
My answer: It can be really, really large. Try it out and let me know the biggest value that you could insert (or produce in an update statement).
CREATE TABLE test(id int, doc xml)
and insert an XML document with several elements, how long can the text between the open and closing tag be, e.g., the value of myText?
INSERT INTO test VALUES(1,'<root><myText>all the text that you want </myText></root>')
The background to this question is that XML values are formatted to data pages. If the XML document is larger than the page size, the document is broken up into subtrees that fit into the page. However, a text node is already a single node. Can a text value be larger than the page size?
My answer: It can be really, really large. Try it out and let me know the biggest value that you could insert (or produce in an update statement).
Monday, October 4, 2010
Using DB2 pureXML and ODF Spreadsheets
If you are using an office suite that supports the Open Document Format (ODF), such as OpenOffice.org or Lotus Symphony, and are a database/XML fan(atic) like I, then the artice "Using DB2 pureXML and ODF Spreadsheets" should be something for you. The most recent e-bulletin of the IDUG Solution Journal shows how you can process ODF-based documents by using XQuery inside DB2. Manipulate spreadsheets or text documents from the "tip of a query editor"...
Friday, August 27, 2010
Yes or no: What is better for your data?
In two posts in the past, I explained how to say no to invalid data and how to find out what data was wrong and causing errors. With the REJECT INVALID VALUES clause during CREATE INDEX it is possible to reject XML documents that have values in the specified XPath location not compliant with the specified data type. Now the question I got is: What is better, to reject invalid values or to ignore them?
Well, let me answer with THE answer: "It depends." I will try to be more specific in a minute. Using XML is often for flexibility reasons and so you may not know what the future will hold. However, in most cases the XML schemas, i.e., the structure of the documents and the data types to expect, are known. And then you want to be sure the data matches what you expect. Because the data at the path specified in the CREATE INDEX statement needs to be casted to the target data type anyway, regardless of whether the value ends up in the index (valid) or not (invalid), the performance is the same. Moreover, with the approach to reject you can be sure that IF the path is present, its value is in the index. That leaves room for optimizations at runtime.
So my recommendation is that if you don't need to be as flexible as possible, then use the REJECT INVALID VALUES clause. If you should receive an error because of a casting issue, see my post on how to find out why that I mentioned above.
Well, let me answer with THE answer: "It depends." I will try to be more specific in a minute. Using XML is often for flexibility reasons and so you may not know what the future will hold. However, in most cases the XML schemas, i.e., the structure of the documents and the data types to expect, are known. And then you want to be sure the data matches what you expect. Because the data at the path specified in the CREATE INDEX statement needs to be casted to the target data type anyway, regardless of whether the value ends up in the index (valid) or not (invalid), the performance is the same. Moreover, with the approach to reject you can be sure that IF the path is present, its value is in the index. That leaves room for optimizations at runtime.
So my recommendation is that if you don't need to be as flexible as possible, then use the REJECT INVALID VALUES clause. If you should receive an error because of a casting issue, see my post on how to find out why that I mentioned above.
Tuesday, August 24, 2010
"BY REF" in XML processing
Several XML-related functions that DB2 offers allow to specify the "BY REF" keyword. Is it worth using it? And what does it mean anyway? Let's take a look at it.
DB2 functions like XMLQUERY, XMLEXISTS, and XMLTABLE have a PASSING clause to provide input values, some of the functions also return either a sequence or column values. For all of those values it is possible to specify the option BY REF (see the linked documentation for the functions above). What the BY REF option tells DB2 is to not create copies of the input or output values, but to only move a reference to those values around. This reduces temporary data, uses less of the valuable memory on the database servers, and it is much faster.
SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING BY REF doc AS "d" BY REF)
But what happens when the option BY REF is not specified? DB2 takes the default which in all places is BY REF. So sit back, relax, and enjoy a resource-conscious, highly efficient DB2 - by design.
SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING doc AS "d")
SELECT doc FROM myDocs WHERE XMLEXISTS('$DOC/name')
BTW: The latter, omitting the PASSING CLAUSE, only works in DB2 for Linux, UNIX, and Windows.
DB2 functions like XMLQUERY, XMLEXISTS, and XMLTABLE have a PASSING clause to provide input values, some of the functions also return either a sequence or column values. For all of those values it is possible to specify the option BY REF (see the linked documentation for the functions above). What the BY REF option tells DB2 is to not create copies of the input or output values, but to only move a reference to those values around. This reduces temporary data, uses less of the valuable memory on the database servers, and it is much faster.
SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING BY REF doc AS "d" BY REF)
But what happens when the option BY REF is not specified? DB2 takes the default which in all places is BY REF. So sit back, relax, and enjoy a resource-conscious, highly efficient DB2 - by design.
SELECT doc FROM myDocs WHERE XMLEXISTS('$d/name' PASSING doc AS "d")
SELECT doc FROM myDocs WHERE XMLEXISTS('$DOC/name')
BTW: The latter, omitting the PASSING CLAUSE, only works in DB2 for Linux, UNIX, and Windows.
Friday, March 26, 2010
Accessing and processing embedded textual XML documents
Yesterday, I wrote about how and why embedded textual XML documents can create trouble. Today, I will look at different ways I found (so far) on how to deal with those documents in DB2 pureXML. The documents which are embedded as text are, as mentioned, a long text value (text node) and not element and attribute nodes. Because the transformation process of turning textual XML into an instance of the XQuery Data Model (XDM) and hence making the XML document queryable is not part of the XQuery language itself, we have to leave the XQuery environment to deal with those embedded documents. More on this later, let's take a look again at the sample document from yesterday:
Inside the "b" elements we have embedded documents in text form. To deal with those documents, we first need to decide how we want them. If we are only interested in their text, we could just navigate down to the "b"s and extract the text values (assuming we have a table CD with a DOC column of type XML where we inserted the above document):
select x.* from cd,xmltable('$DOC/a/b' columns edoc varchar(2000) path '.') as x
The query could directly be used in an insert statement where the embedded documents are stored back into an XML columns. If however all the XML fragments should be used to make up a bigger document, the following statement may work:
select ''||x.edoc||' ' from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b} ') as x
Within the XMLTABLE, we first navigate to all the "a"s. For the returned column we return all the concatenated text values of b wrapped into a "dummy" element. This is necessary because casting to a string is only possible for a single item, not for a sequence of items (which the different text values would be). Later we utilize string concatenation to attach an "outer" element to make the XML fragment single-rooted.
The same could have been done with a statement like the initial one:
With XMLTABLE however you have more options on what to do because usually more data needs to be extracted and post-processed.
Wrapping an XMLPARSE around the constructed textual XML data turns this back into instances of the XDM which can be queried using XQuery.
Enough of embedded documents for today. In my next post, I will show you an embedded video to demonstrate how to deal with a rainy Friday afternoon...
<a>
<b>
<![CDATA[<e1><e2>embedded data 1</e2></e1>]]>
</b>
<b>
<![CDATA[<e1><e2>embedded data 2</e2></e1>]]>
</b>
<c>oh, even a different element<c>
</a>
Inside the "b" elements we have embedded documents in text form. To deal with those documents, we first need to decide how we want them. If we are only interested in their text, we could just navigate down to the "b"s and extract the text values (assuming we have a table CD with a DOC column of type XML where we inserted the above document):
select xmlquery('$DOC/a/b/text()') from cd
The above query would return all embedded documents within the parent document as a single string. This is usually not what is needed. Using XMLTABLE, we can return each embedded document individually as string: select x.* from cd,xmltable('$DOC/a/b' columns edoc varchar(2000) path '.') as x
The query could directly be used in an insert statement where the embedded documents are stored back into an XML columns. If however all the XML fragments should be used to make up a bigger document, the following statement may work:
select '
Within the XMLTABLE, we first navigate to all the "a"s. For the returned column we return all the concatenated text values of b wrapped into a "dummy" element. This is necessary because casting to a string is only possible for a single item, not for a sequence of items (which the different text values would be). Later we utilize string concatenation to attach an "outer" element to make the XML fragment single-rooted.
The same could have been done with a statement like the initial one:
select ''||xmlcast(xmlquery('{$DOC/a/b/text()} ') as varchar(2000))||' ' from cd
With XMLTABLE however you have more options on what to do because usually more data needs to be extracted and post-processed.
Wrapping an XMLPARSE around the constructed textual XML data turns this back into instances of the XDM which can be queried using XQuery.
select xmlquery('$OUTPUT/outer/e1/e2') from (
select xmlparse(document ''||x.edoc||' ') as output from cd,xmltable('$DOC/a' columns edoc varchar(2000) path '{./b} ') as x)
select xmlparse(document '
Enough of embedded documents for today. In my next post, I will show you an embedded video to demonstrate how to deal with a rainy Friday afternoon...
Subscribe to:
Posts (Atom)