Wednesday, November 22, 2023

Unicode string length, code points, and Db2

Byte length of (Unicode) strings
After my recent blog post "🎃 Unicode characters and Db2 🕸️ 🏚️", I had some follow-up discussions. One was around how to determine the UTF-8 byte length of strings in a non-Unicode Db2 database. Some of the proposed solutions required exporting the data for external analysis or implementing extra functions or procedures. I insisted that there is a pure SQL-based solution. Here is my proposal.

My ISO 8859-x test database

In many countries, the local character encodings are based on ISO 8859. You can create a Db2 database and specify "USING CODESET codeset TERRITORY territory". Before Unicode UTF-8 became the default codeset, I would often use ISO8859-1 as codeset and DE or US as territory. All ISO 8859 codesets have in common that the first 128 characters (or code points) align with ASCII (and UTF-8).
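
For example, a database in such a legacy setup might have been created like this (the database name mydb is just a placeholder):

create db mydb using codeset iso8859-1 territory us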

To test my SQL-based proposal, I needed to create a non-Unicode database that supports characters which need more than one byte in the UTF-8 encoding. Oh, what fun. I settled on ISO8859-15, the revision of ISO8859-1. It supports the Euro sign €, which needs three bytes in UTF-8 (see my older blog post linked at the top).

create db uni15 using codeset iso8859-15 territory de
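
To double-check which codeset a database actually uses, you can look at its configuration. A quick sketch for my database; the relevant output lines are "Database code page" and "Database code set", though the exact wording may vary by Db2 version:

db2 get db cfg for uni15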

With that database in place, I could connect and create a simple table:

create table t (id int, s varchar(20))

Then, I inserted some test data, which includes the Euro sign.

insert into t values(1,'€');
insert into t values(2,'€ÄÜÖ12€€');
insert into t values(3,'a');
insert into t values(4,'ab');
insert into t values(5,'abÄ€');
insert into t values(6,'abÄóôÊ');
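
As a baseline, the regular LENGTH function reports the byte length in the database codeset. Because ISO 8859-15 is a single-byte codeset, it simply returns the number of characters here (the column name NATIVE_LEN is my choice):

select id, s, length(s) as native_len from t

The interesting question is how much longer those strings become once converted to UTF-8.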

XQuery to the rescue

First, I looked at the new UNICODE_STR function, but apparently it could not be used in my non-Unicode setup. Thus, I resorted to XQuery. Db2 supports the fn:string-to-codepoints function. It returns the sequence of Unicode code points that corresponds to a string value, exactly what I need.
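
To get a feeling for that function, here is a small sketch that serializes the returned sequence per row. It uses the standard XQuery functions fn:string-join and fn:string; the column name CODEPOINTS is my choice:

select id, s,
   xmlcast(
     xmlquery('fn:string-join(
                 for $i in fn:string-to-codepoints($S)
                 return fn:string($i), " ")'
        passing s as "S")
   as varchar(100)) as codepoints
from t

For the Euro sign, it returns the code point 8364 (U+20AC). The rest is some advanced SQL and XQuery to translate each code point into its UTF-8 byte count and sum everything up: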

select id, s,
   xmlcast(
     xmlquery('let $cp := fn:string-to-codepoints($S)
               return fn:sum(
                 for $i in $cp
                 return 1 + (if ($i > 127) then 1 else 0)
                          + (if ($i > 2047) then 1 else 0))'
        passing s as "S")
   as int) as byte_len
from t

The above query returns both columns of my test table plus the computed column BYTE_LEN. The result is based on XMLCASTing the output of XMLQUERY to an integer. The "magic" is the XQuery expression within. It retrieves the sequence of code points, then uses fn:sum to tally the bytes returned by the inner FOR (FLWOR) expression. Every code point needs at least one byte, code points greater than 127 need a second byte, and code points greater than 2047 a third. Note that for a complete solution you would need to add the check for four-byte characters, i.e., code points greater than 65535. I left that out of the main query for readability; a sketch with the full check follows below.

ID          S                    BYTE_LEN
----------- -------------------- -----------
          1 €                              3
          2 €ÄÜÖ12€€                      17
          3 a                              1
          4 ab                             2
          5 abÄ€                           7
          6 abÄóôÊ                        10

  6 record(s) selected.

The query returns the above output for my sample data. Does that look right? Row 2 checks out, for example: three Euro signs at three bytes each, Ä, Ü, and Ö at two bytes each, plus the two digits make 9 + 6 + 2 = 17 bytes.
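
For completeness, here is the variant with the four-byte check. Code points above 65535 cover characters outside the Basic Multilingual Plane, such as many emojis; for my sample data it returns the same numbers:

select id, s,
   xmlcast(
     xmlquery('let $cp := fn:string-to-codepoints($S)
               return fn:sum(
                 for $i in $cp
                 return 1 + (if ($i > 127) then 1 else 0)
                          + (if ($i > 2047) then 1 else 0)
                          + (if ($i > 65535) then 1 else 0))'
        passing s as "S")
   as int) as byte_len
from t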

Conclusions

Dealing with Unicode conversions or migrations is always fun (or hard work). Glad to see that some SQL (with XQuery support) can be used to determine string sizes in bytes, even in non-Unicode databases. BTW: I tested the solution with some other ISO 8859-x codesets and non-Latin characters, too.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik), Mastodon (@data_henrik@mastodon.social), or LinkedIn.