Monday, December 6, 2010

More on full text indexing: XML documents and XPath

Last week I explained how to build a full text index on PDF files stored in DB2. Today, I will look into how to search in XML documents using a full text index and DB2 Text Search. One of the benefits of XML documents is that their content is (semi-) structured. With full text search you cannot find a needle in a haystack, but you can find which of many haystacks a needle is in. With XML documents and information about their structure, search becomes more directed and results are more precise. Let's look into details.

db2 "create table xfiles(id int not null primary key, doc xml)"
db2 "insert into xfiles values (1,'<r><a>Friedrichshafen</a><b>Paris</b><c>San Jose</c></r>')"
db2 "insert into xfiles values (2,'<r><a>Paris</a><b>San Jose</b><c>Friedrichshafen</c></r>')"
db2 "insert into xfiles values (3,'<r><a>San Jose</a><b>Paris</b><c>Friedrichshafen</c></r>')"

First, we create a table name xfiles which holds XML documents. Then we insert three different rows with similar XML documents. Next in our journey to insight is creating a full text index on column doc and then to update it:

db2ts "CREATE INDEX idxDocs FOR TEXT ON xfiles(doc)"
db2ts "UPDATE INDEX idxDocs FOR TEXT"

Note that when the column is of type XML, then the index is also of format XML. Now that we are done with the setup, we can actually perform the first search:

db2 "select id from xfiles where contains(doc,'Friedrichshafen')>0"

ID        
-----------
          1
          2
          3

  3 record(s) selected.

Well, we get what we asked for. All documents contain the word "Friedrichshafen" and we know as much as before. But then we remember that XML documents are (semi-) structured and that I mentioned XPath before. The good news is that we can use XPath expressions as part of full text search to cut down on the number of returned documents.

db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\")]''')>0"

ID        
-----------
          1

  1 record(s) selected.

Note that in the above query we are using the Linux command shell and need some escaping for the quote signs. Without them the statement would look like the following:

select id from xfiles where contains(doc,'@xpath:''/r[a contains( "Friedrichshafen")]''')>0

As part of the parameters to the CONTAINS function, we specify that an XPath expression should be used and then the actual expression. In our example we only want documents where the word "Friedrichshafen" is contained in the "a" elements. Hence, only a single document is returned. You can combine predicates inside the CONTAINS using and/or and also make use of of some advanced expressions. For the full syntax see here.

db2 "select id from xfiles where contains(doc,'@xpath:''/r[a contains(\"Friedrichshafen\") or b contains(\"Paris\")]''')>0"

ID        
-----------
          1
          3

  2 record(s) selected.

or

select id from xfiles where contains(doc,'@xpath:''/r[a contains("Friedrichshafen") or b contains("Paris")]''')>0

As we have seen above, a full text index allows both content and structural search on XML documents. If you don't want to use SQL as primary query language but XQuery, then the DB2 built-in function for XQuery, xmlcolumn-contains, can be used.

And remember, all of this already works in the free DB2 Express-C.

No comments:

LinkWithin

Related Posts with Thumbnails