
Friday, February 27, 2015

DB2 Battle: Optimization Profile vs. Statement Concentrator (Part 1)

DB2 explain output
Today I wanted to try using a DB2 Optimization Profile for a statement impacted by the DB2 Statement Concentrator. It turned out to be a longer exercise than expected, but also more fun, covering a lot of the DB2 tuning and monitoring infrastructure. So set some minutes aside and follow my journey into the world of query optimization, workload monitoring, session actuals, and more. I split it into two parts because of length (Update on 02-Jul-2015: A third part is here).

The original question I tried to answer was related to REOPT and a query like this:
select id, s
from betw
where id between ? and ?



Monday, June 24, 2013

DB2 SQL Compatibility: CREATE INDEX ... EXCLUDE NULL KEYS

One of the frequent problems when migrating databases from Oracle to DB2 was unique indexes. Oracle does not include and consider NULL keys when checking for uniqueness of values; DB2 does - or better: did. Starting with DB2 10.5 it is now possible to specify whether to INCLUDE NULL KEYS (the default) or to EXCLUDE NULL KEYS when CREATE INDEX is executed. This allows you to have either the classic DB2 semantics or to mimic the behavior of other database systems. And thereby it reduces migration effort further.

In the following I tried the new feature by creating two tables, T1 and T2, with a unique index each. On T2 the index uses "exclude null keys". So let's see how they respond to my attempts to insert some values...:


create table t1(id int, s varchar(60))
DB20000I  The SQL command completed successfully.
create table t2(id int, s varchar(60))

DB20000I  The SQL command completed successfully.
create unique index t1_id_u_idx on t1(id)

DB20000I  The SQL command completed successfully.
create unique index t2_id_u_idx on t2(id) exclude null keys
DB20000I  The SQL command completed successfully.
insert into t1 values(1,'one')

DB20000I  The SQL command completed successfully.
insert into t2 values(1,'one')

DB20000I  The SQL command completed successfully.
insert into t1 values(2,'two')

DB20000I  The SQL command completed successfully.
insert into t2 values(2,'two')

DB20000I  The SQL command completed successfully.
insert into t1 values(2,'two again')

DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T1" from having duplicate values for the index key. 
SQLSTATE=23505
insert into t2 values(2,'two again')

DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T2" from having duplicate values for the index key. 
SQLSTATE=23505
insert into t1 values(null,'null')
DB20000I  The SQL command completed successfully.
insert into t2 values(null,'null')
DB20000I  The SQL command completed successfully.
insert into t1 values(null,'null again')
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T1" from having duplicate values for the index key. 
SQLSTATE=23505
insert into t2 values(null,'null again')

DB20000I  The SQL command completed successfully.

As can be seen, both indexes check uniqueness of non-NULL values. When I try to insert the value "2" again, an error is returned. Both indexes accept a NULL value. However, on T1 only a single NULL value can be inserted and then a violation of the uniqueness criteria is reported. On the second table we are able to insert another NULL value without problems because NULL keys are not included in the index.

Try it yourself, best by migrating a database from Oracle to DB2... ;-)

Monday, April 15, 2013

(Updated) Special hybrid, object-capable, all purpose database

Recently I had to deal again with a special database. Let me share my observations.

This special database seems to have excellent compression capabilities as a lot of stuff can be crammed into it. From the outside the database looks small, but it seems to be able to store 10 times its advertised capacity.

This database can handle relational data, large objects, unstructured data, etc. and has several storage areas and storage types. For the experienced user data access is really fast and it seems index-only. For the infrequent user the only way of data access is through repeated table scans.

The database is equipped to deal with "situations" and act in "emergencies". Its capabilities include having candy up for distribution exactly when needed, a first aid kit with the right number and size of bandaids, chapstick, hand lotion, and some hand sanitizer. When fully loaded, the database can be used as a self-defense device and can be thrown as a projectile even over longer distances.

I could go on with more features (you can in the comments), but do you know what database I described above...?

Update: My wife's handbag is one of the described special databases.



Monday, December 6, 2010

More on full text indexing: XML documents and XPath

Last week I explained how to build a full text index on PDF files stored in DB2. Today, I will look into how to search in XML documents using a full text index and DB2 Text Search. One of the benefits of XML documents is that their content is (semi-) structured. With full text search you cannot find a needle in a haystack, but you can find which of many haystacks a needle is in. With XML documents and information about their structure, search becomes more directed and results are more precise. Let's look into details.

db2 "create table xfiles(id int not null primary key, doc xml)"
db2 "insert into xfiles values (1,'<r><a>Friedrichshafen</a><b>Paris</b><c>San Jose</c></r>')"
db2 "insert into xfiles values (2,'<r><a>Paris</a><b>San Jose</b><c>Friedrichshafen</c></r>')"
db2 "insert into xfiles values (3,'<r><a>San Jose</a><b>Paris</b><c>Friedrichshafen</c></r>')"

First, we create a table named xfiles which holds XML documents. Then we insert three different rows with similar XML documents. Next in our journey to insight is creating a full text index on column doc and then updating it:

db2ts "CREATE INDEX idxDocs FOR TEXT ON xfiles(doc)"
db2ts "UPDATE INDEX idxDocs FOR TEXT"

Note that when the column is of type XML, then the index is also of format XML. Now that we are done with the setup, we can actually perform the first search:

db2 "select id from xfiles where contains(doc,'Friedrichshafen')>0"

ID        
-----------
          1
          2
          3

  3 record(s) selected.

Well, we get what we asked for. All documents contain the word "Friedrichshafen" and we know as much as before. But then we remember that XML documents are (semi-) structured and that I mentioned XPath before. The good news is that we can use XPath expressions as part of full text search to cut down on the number of returned documents.

db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\")]''')>0"

ID        
-----------
          1

  1 record(s) selected.

Note that in the above query we are using the Linux command shell and need some escaping for the quote signs. Without them the statement would look like the following:

select id from xfiles where contains(doc,'@xpath:''/r[a contains( "Friedrichshafen")]''')>0

As part of the parameters to the CONTAINS function, we specify that an XPath expression should be used and then the actual expression. In our example we only want documents where the word "Friedrichshafen" is contained in the "a" elements. Hence, only a single document is returned. You can combine predicates inside the CONTAINS using and/or and also make use of some advanced expressions. For the full syntax see here.

db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\") or b contains(\"Paris\")]''')>0"

ID        
-----------
          1
          3

  2 record(s) selected.

or

select id from xfiles where contains(doc,'@xpath:''/r[a contains("Friedrichshafen") or b contains("Paris")]''')>0

As we have seen above, a full text index allows both content and structural search on XML documents. If you don't want to use SQL as your primary query language but XQuery, then the DB2 built-in function for XQuery, xmlcolumn-contains, can be used.
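As a sketch of what the XQuery route might look like (please verify the exact function name and signature against your DB2 version; the fully qualified column name below is an assumption based on my schema):

```sql
-- Hedged sketch: querying the text index from XQuery instead of SQL.
-- db2-fn:xmlcolumn-contains is assumed to take the fully qualified
-- column name and the search argument, returning matching documents.
XQUERY db2-fn:xmlcolumn-contains('HLOESER.XFILES.DOC', 'Friedrichshafen')
```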

And remember, all of this already works in the free DB2 Express-C.

Friday, August 27, 2010

Yes or no: What is better for your data?

In two earlier posts, I explained how to say no to invalid data and how to find out what data was wrong and causing errors. With the REJECT INVALID VALUES clause during CREATE INDEX it is possible to reject XML documents that have values in the specified XPath location not compliant with the specified data type. Now the question I got is: What is better, to reject invalid values or to ignore them?

Well, let me answer with THE answer: "It depends." I will try to be more specific in a minute. XML is often used for flexibility reasons, so you may not know what the future will hold. However, in most cases the XML schemas, i.e., the structure of the documents and the data types to expect, are known. And then you want to be sure the data matches what you expect. Because the data at the path specified in the CREATE INDEX statement needs to be cast to the target data type anyway, regardless of whether the value ends up in the index (valid) or not (invalid), the performance is the same. Moreover, with the approach to reject you can be sure that IF the path is present, its value is in the index. That leaves room for optimizations at runtime.

So my recommendation is that if you don't need to be as flexible as possible, then use the REJECT INVALID VALUES clause. If you should receive an error because of a casting issue, see my post on how to find out why that I mentioned above.
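To make the two choices concrete, here is a minimal sketch (table and index names are made up, and the two indexes are shown as alternatives, not to be created together):

```sql
-- Hypothetical table holding order documents
CREATE TABLE orders(id INT, doc XML);

-- Option 1: reject documents whose /order/date cannot be cast to DATE
CREATE INDEX ordDateIdxStrict ON orders(doc)
   GENERATE KEY USING XMLPATTERN '/order/date' AS SQL DATE
   REJECT INVALID VALUES;

-- Option 2 (the default): silently leave invalid values out of the index
CREATE INDEX ordDateIdxLoose ON orders(doc)
   GENERATE KEY USING XMLPATTERN '/order/date' AS SQL DATE
   IGNORE INVALID VALUES;
```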

Tuesday, August 10, 2010

pureXML Indexing - Behind the scenes

I was asked whether I could explain a little bit how the different XML indexes are used for performing an index scan. Well, this probably requires several blog entries, but today I at least try to cover the basics.

When you create a table with an XML column, DB2 automatically creates two different XML indexes. One is the XML Path index, the other the XML Region Index. The INDEXTYPE in SYSCAT.INDEXES lists them as XPTH and XRGN. If you create a table with more than one XML column, you will notice that there is an XPTH for each XML column, but only one XRGN for the entire table. The path index maps an encoded path (e.g., /root/department/employee) to a unique pathID, an integer value. The pathID is unique within the entire database. The region index is used to manage parts (regions) of an XML document, so that documents can be larger than a single data page.
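You can inspect these system-created indexes yourself with a catalog query along these lines (the table name is just an example):

```sql
-- List index names and types (XPTH, XRGN, and any user-created
-- XML value indexes) for a hypothetical table MYXMLTAB
SELECT indname, indextype
  FROM syscat.indexes
 WHERE tabname = 'MYXMLTAB';
```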

Similar to indexes on non-XML columns, users can create indexes on XML columns utilizing some features of the CREATE INDEX statement. Thereafter, the system catalog might look like shown in this article in the Information Center. But how are those indexes used?

One of the XML-related operators is XISCAN. When you perform an index scan using a non-XML index, an IXSCAN operator shows up in the access plan. For XML data, because some more effort is needed, the operator is named XISCAN, XML Index Scan. Once the optimizer decides that an index scan should be used, some auxiliary information to set up the scan is needed, including of course the start or stop keys for the values in the predicate (similar to regular b-tree access). What is needed in addition is the pathID of the path to search for. In the XML world we are not comparing "id=4711", but the value at "/root/department/employee/id" needs to be 4711.

If the entire absolute path is known when the query is compiled, the path can be turned into a pathID. However, this is not the case when wildcards (e.g., "//employee/id") are used. That's when the XML path index (XPTH) comes in handy. Because it stores the encoded paths in reverse order (our example would be "id/employee/department/root" in clear text), it is possible to find all paths ending (now starting) with "id" or "employee/id". To retrieve the pathIDs, an index scan on the path index is performed at runtime and the range is X-locked. With the locking DB2 can prevent any new path from being created during the actual index scan for the query predicate.

Once the pathIDs are known, either from the compile phase or the index look-up, the actual scan on the XML index (XVIP type in the catalog) can be performed and docIDs, nodeIDs (to pinpoint the node having that desired value) and RIDs (record identifiers) are obtained from the b-tree. The docIDs identify the qualifying XML documents, the RIDs point to the records (rows) in the table which the XML documents are part of. Now, based on the docID and nodeID, the part of the document can be located using the XML region index. Then, regular XML processing kicks in, e.g., the XSCAN operator or others could be applied.


Note that the above description covers the standard case and that optimizations for some cases are possible. There is also the XANDOR operator which speeds up index evaluation when two or more indexes for the same column can be used. But that's another blog post. Stay tuned...

Thursday, March 11, 2010

IBM XML Index Advisor for DB2

DB2 has an index advisor (part of the DB2 Design Advisor) which helps to identify missing indexes to speed up a given workload. Unfortunately, it lacks pureXML support. The good news is that there is a tool on IBM alphaWorks, called the "IBM XML Index Advisor for DB2 for Linux, UNIX, and Windows".

Give the tool a try and provide feedback to the team (I already did and, e.g., requested more examples).

Monday, July 20, 2009

Relational vs. pureXML: Some indexing differences

You probably know how relational indexing works. When you create an index on a column, then for every row in the table there is an index entry. There is a 1:1 relationship between the rows and the index entries.

In the XML world things can be quite different. Let's imagine that we have XML documents where as part of other personal information the names of our siblings are listed ("CREATE TABLE persons(id INT, doc XML)"). The XML fragment could look like the following for a single sibling (2 kids for the parents):
<siblings>
<name>Sarah</name>
</siblings>

If we would like to index the sibling names, we could do it in DB2 the following way:
CREATE INDEX sibNamesIdx ON persons(doc)
GENERATE KEY USING XMLPATTERN
'/siblings/name' as SQL VARCHAR(40)

Because we have one sibling in the above example, there would be a single entry in the index - the same as in relational. But what happens if you are an only child (<siblings/>)? Or what if you have multiple brothers and sisters?

<siblings>
<name>Sam</name>
<name>Sarah</name>
<name>Sean</name>
<name>Siegfried</name>
<name>Simon</name>
<name>Susie</name>
<name>Sydney</name>
</siblings>

For an only child there wouldn't be a sibling name and hence no index entry. For multiple siblings there would be one index entry each, resulting in multiple index entries for a single row. This is quite different from the relational world.

What may look strange or at least unusual to you has its benefits. In the relational world you would need to model such data with joined-in side tables or with sparse columns. In the latter case you would manage NULL values in your index. In the XML world only data that is really present is represented in your indexes, thus keeping them small.

Tuesday, June 16, 2009

XML: Learn how to say NO

When you have young kids you know that once they have learned the power of the word "NO!", your life gets "more interesting". Learning to say "no" in a more polite way and to use its power more wisely is something that is good for private and business life. Now you may ask "but what about database systems?" Ok, I got you...

When dealing with relational data, refusing "stuff" is more or less simple. You cannot store a string in an integer or date column unless you cast it and it has the right format. You can define unique indexes to make sure certain values are unique. You can define constraints or use triggers to make sure the data you store is in sync with your idea of an idyllic world. For XML data the picture is slightly different and some options are listed here.

A straight-forward way of verifying XML data is to validate it. You can do that in DB2 directly using the XMLVALIDATE() function or in special appliances like WebSphere DataPower. Using triggers or constraints you can make sure all data gets validated.

Validation might be more than you want, and then making use of XML indexes is something to consider. When you store XML without validation, which is typical for most customers, DB2 only checks that the document is well-formed. By creating an index and specifying the REJECT INVALID VALUES clause, you make DB2 check that the value of an XML element or attribute conforms to the specified type.

<order><date>sometime</date></order>


CREATE INDEX myIX on myTable(doc) GENERATE KEY USING XMLPATTERN '/order/date' AS SQL DATE REJECT INVALID VALUES;


If you inserted a document like above, then during the index creation DB2 would choke and point out that the value "sometime" is not valid for a DATE type. A value like "09/29/2000" would pass. Once an index has been created, INSERTs/UPDATEs would pass/fail depending on which values are used for the "date" element.

What do you do when you have a larger table and CREATE INDEX fails? Look into your diagnostic log. DB2 generates a nice query statement for you to retrieve the troublemaker.

Now say YES to pureXML and try it out...

Friday, June 12, 2009

pureXML performance tip: A sequence of good indices...

When you have indexes you want your DBMS of choice to use them as efficiently as possible - even with XQuery involved. However, SQL and XQuery can be complicated and sometimes a helping hand is needed to aid the compiler/optimizer in making the right decisions. Today I am going to show you how to make use of XQuery semantics to simplify a query and at the same time get query plans with index ANDing to work. Special thanks to my colleague Matthias Nicola, author of the DB2 pureXML Cookbook, for the base idea.

For the query examples below I will use the DB2 sample database which includes XML data. The table CUSTOMER already has a couple of XML indices defined, including one on the "Cid" attribute and one on the customer "name". You can look into the system catalog by trying something like:

select indname,pattern from syscat.indexxmlpatterns where indname in (select indname from syscat.indexes where tabname='CUSTOMER');

The following might not be an interesting query in itself, but it nicely demonstrates what I wanted to show you.

select * from customer
where
(xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[@Cid=1000]')

or
xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[@Cid=1002]'))

and
xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[name="Kathy Smith"]');


Basically, the query is written in the regular SQL style. We are ORing two predicates on the same path (or column), then ANDing a predicate on another path. Because we have two matching indexes, we would expect index ANDing in the plan. Using EXPLAIN we can see that this is not the case.

select * from customer
where xmlexists('declare default element namespace "http://posample.org"; $INFO/customerinfo[(@Cid=1000 or @Cid=1002) and name="Kathy Smith"]');


Even rewriting the query in a more elegant and shorter way as shown above doesn't result in index ANDing. What we can do now is make use of the fact that XQuery's General Comparisons utilize existential semantics, i.e., if one matching item is found, the whole comparison is true. Instead of ORing comparisons on the "Cid" attribute, we compare that attribute against a constructed sequence of values. Semantically it is the same, but it is simpler for the optimizer.

select * from customer
where
xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[@Cid=(1000,1002)]')

and
xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[name="Kathy Smith"]')
;

select * from customer
where
xmlexists('declare default element namespace "http://posample.org";
$INFO/customerinfo[@Cid=(1000,1002) and name="Kathy Smith"]');


For each of two above queries indexes on "Cid" and "name" are picked and combined using index ANDing (IXAND operator). Simple idea, huge impact.

Tuesday, March 10, 2009

Hundreds of indexes on a single XML column - why not?

Last week I presented at the German Database Conference "BTW" (see here for a history) some work on how to address XML Index Challenges (paper written with two colleagues). More and more enterprises face the challenge of logging all their activities. Given that today SOA and Web Services become standard for new applications, there is a trend that the logs are XML-based as well.

The scenario in the paper is based on what was observed at several banking customers. The internet banking system is required to log every event in any of their internet banking applications - we are talking 10 to 20 million inserts a day. Events include clicks that take a user to a new web page or dialog, entry of user data, as well as every click that initiates a banking transaction. This “logging” happens across a set of diverse applications such as checking accounts, loans, investment management, and others. The log information can vary widely, thus very flexible storage is required. So far the data is stored in a relational table with few columns and the log record itself as VARCHAR.

The key problem with a relational-only approach is that the variable part of the log entries is hard to query with adequate performance. Current relational database technology does not allow easy indexing of individual pieces of strings in a VARCHAR column. The applications that read this data typically use SQL "LIKE" predicates on these VARCHAR values. This results in limited queryability, limited index usage, and sub-optimal performance. The solution is to use XML as log format and to make use of DB2 pureXML's hybrid capabilities, i.e., to store the logs in an XML column and some other information (like a timestamp) as relational fields.
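A hybrid log table along the described lines could be sketched like this (all names and the indexed path are illustrative, not taken from the actual banking system):

```sql
-- Relational columns for the fixed part, XML for the variable log record
CREATE TABLE eventlog (
   id        BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
   logged_at TIMESTAMP NOT NULL,
   app       VARCHAR(32) NOT NULL,
   record    XML
);

-- One of the many possible value indexes on the XML payload
CREATE INDEX evtUserIdx ON eventlog(record)
   GENERATE KEY USING XMLPATTERN '//user/@id' AS SQL VARCHAR(64);
```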

What is the challenge?
The challenge is not the heavy insert workload, as was demonstrated with the 1 Terabyte XML benchmark; it is the indexing. Given that there are different log record formats (different XML document types) for the different applications, and there is a heavy volume of data coming in, you want to have indexes on all important elements and attributes to find the needle in the haystack. If we think of only a few dozen application types and typically 10 to 20 indexes per type, we are already in the hundreds of indexes.

Does it work?
The good news is that DB2 pureXML is able to handle it without problems. First of all, DB2 allows users to only index what is needed (see the Information Center and Tip 8 in the DB2 pureXML 15 best practices). Other systems require primary index, secondary index, side tables or whatever it is called and this would be overhead slowing down the system. DB2 directly extracts - without any prerequisite - the data to index.
Secondly, even if there are hundreds of indexes, only a dozen or so match a given document. Still, taking care of the hundreds of non-qualifying indexes could be an issue. Fortunately not in DB2; how this is done was part of my presentation and is described in the conference paper. With the number of non-matching indexes increasing, there is only a small performance penalty, and the DB2 server on a machine with 4 dual-core 1.9 GHz POWER5 processors was able to sustain close to 100,000 inserts a minute.

How many indexes are ok?
Earlier I had asked about the right number of indexes. The answer to the question can be (over-)simplified by saying that you should create as many as are needed to help query performance. As we have learned we don't need to worry too much about non-matching indexes. With that we are back to relational index maintenance characteristics most of us know already.

Thursday, February 19, 2009

How many trees does a forest need? (DB2 pureXML Indexing)



The above question was the key topic in a public debate here in town. The background is the small forest next to us. Right now there are too many trees per acre and they are competing for sun light and space. The proposed solution by the administration was to cut some trees, so that those trees which are important for the forest, for animals and insects can grow and prosper. The envisioned result is a healthy forest.

Do you see any parallels to our database world? One of the more frequently asked questions about DB2 pureXML is how many XML indexes make sense for a table. In the relational world, the number of possible indexes is limited by the number of combinations and permutations of columns. For XML it would be possible to store an entire database inside a single document. But how do you index the data? How do you determine which indexes are needed? Too few indexes could mean that queries do not perform. Too many indexes could mean that a hefty penalty is paid during IUD operations for index maintenance and that system health is impacted. The index maintenance costs are known for the relational world, but little is known for XML indexes.

In the coming days and weeks I plan to address XML indexing questions. In the meantime, please let me know if you have indexing topics that should be part of this series.