Tuesday, November 30, 2010

Indexing PDF Files with DB2 Text Search

This week I was asked how to build a full text index on PDF documents in DB2. Because my hard drive is full of them, I created a new DB2 database and set up DB2 Text Search with so-called rich text document filters. After creating a table, loading data and indexing the documents, I eventually used the CONTAINS and SCORE functions to search in my PDF files. But let me show you the steps in some more details...

Because DB2 Text Search support can be installed as part of the DB2 installation, my system already had the base support set up.

Thus my endeavors started with downloading the DB2 Accessory Suite. It contains additional document filters that help to extract full text information out of different file types such as PDF, word processor, or spreadsheet documents. The DB2 Information Center has instructions for downloading and installing the accessory suite. I checked that all necessary files from the regular DB2 installation were already in place and then succeeded with setting up rich text support.

Actually, setup requires two steps. Install the software on the server, then setting it up for an instance. Some confusion, but I succeeded. Next was enabling rich text support, instructions are on Information Center again. Now on to creating the actual test database (on my Linux shell):


DB2 Text Search uses an environment variable DB2DBDFT. This should be set to the database you work with to simplify administration. Hence:

export DB2DBDFT=FT

Finally, we are able to set up our new database for full text search:


The db2ts command is the DB2 Text Search command processor. Now we can create a simple table in the database to hold the documents after connecting to the db:

db2 connect to ft
db2 "create table hlDocs(id int not null primary key, description varchar(200), doc BLOB(10M))"

The second statement above creates a table named "hlDocs" with three columns. The first column is named "id" and of type integer. Note that DB2 Text Search requires a primary key column in the indexed table. The 2nd column will hold a short string describing the document. The last column is a BLOB that we use to store the PDF files. To easily insert PDF documents into the table, I created a small delimited file for import (named "blob.del"):

1,'train schedule Paris',ParisSchedule.pdf
2,'Schedule recycling paper collection',Papier_Abfallplan2010.pdf
3,'New resume',Current_resume.pdf

The above data can now be imported with the PDF files assumed in the same as our current directory:
db2 import from blob.del of del lobs from . insert into hlDocs

It is important to check that all rows were inserted:
Number of rows read         = 3
Number of rows skipped      = 0
Number of rows inserted     = 3
Number of rows updated      = 0
Number of rows rejected     = 0
Number of rows committed    = 3

Now the test data is in, but we don't have an index yet. That can be done using the db2ts command again. We first create the index, then update it:

db2ts "CREATE INDEX idx1 FOR TEXT ON hldocs(doc) FORMAT inso"

CAUTION: Note that the option "FORMAT inso" is not clearly explained in the documentation (yet - hence this post). That option tells DB2 to use the additional document filters that we installed as part of the DB2 Accessory Suite.

Finally, we can test the index using the CONTAINS and the SCORE functions.
db2 "select id,description from hldocs where contains(doc,'Paris')=1"

ID          DESCRIPTION     --------------------------------------------------
          1 'train schedule Paris'

  1 record(s) selected.

db2 "select id,description,cast(score(doc,'Friedri*') as decimal(5,3))  as score from hldocs order by 3 desc"

ID DESCRIPTION                                   SCORE   
----------- ------------------------------------------
          2 'Schedule for paper collection'      0.046
          3 'Resume'                             0.022
          1 'Train schedule Paris'               0.000

  3 record(s) selected.

Most of the steps are dedicated to proper setup of DB2 Text Search with the additional filters for PDF documents. However, once it is set up, inserting data and searching the documents is simple.

Friday, November 26, 2010

Black Friday Special: DB2 for free (no MIR required)!!!

We have THE most important shopping day in the US today: Black Friday. Shops are trying to lure in shoppers with rebates and specials, some require complicated mail-in rebates (MIRs). Let me point you to a very valuable product and it is entirely free and doesn't require a mail-in rebate: DB2 Express-C.

The free product includes free autonomics and free pureXML and much more. Get DB2 Express-C today and if you want some great free books and tools, let me know.

Friday, November 19, 2010

pureXMLness - Explained!

Earlier this week I had asked "What pureXMLness do you have?". Let me explain the query that computes the pureXMLness today.

let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;

First, let me start with the first keyword, "xquery". DB2 is a hybrid database engine and it natively supports both SQL statements and XQuery statements. Prefixing the statement with "xquery" means that a native XQuery follows.

What then follows is a so-called let statement ("let $i [...]"). It is used to define the variables $i to $n. For each of the variables a value is assigned. All of them are computed using the sqlquery function which executes a SQL SELECT statement. One requirement is that the function sqlquery returns an XML sequence. The SQL statements that are executed are simple SELECT statements using the count() or avg() function. To convert the results of count() and avg() to an XML sequence, we use the XMLCAST function (xmlcast(count(*) as xml)). The SQL value is converted into an XML value (which is a XML sequence).

As input to the pureXMLness we take the number of entries in the indexxmlpatterns catalog table into account, i.e., the number of indexes over XML columns. Another variable is the number of XML-typed columns obtained from the columns catalog table. Three other variables are based on the number of stringIDs (sysxmlstrings), pathIDs (sysxmlpaths), and the registered XML schemas (xsrobjects). The last input variable is the average bufferpool hit ratio for XDA objects.

After all the variables have been computed, they can be used in the return clause to compose the pureXMLness which is packed into an XML element.

Wednesday, November 17, 2010

One editor, many tools: The SQL and XQuery editor has been enhanced

Earlier this month a new article "Creating scripts more efficiently in the SQL and XQuery editor" has been published on developerWorks. Some of us are switching from the deprecated DB2 Control Center to the Eclipse-based tooling. Good to know that the same SQL editor is part of Data Studio, InfoSphere Data Architect, and several Optim tools.
The articles describes how man new editor features can be used to efficiently develop SQL scripts or XQuery scripts. It is also a good introduction into how to use the statement editor. What is often done for editing XML-related queries, namely changing the statement terminator, is described as well.

Tuesday, November 16, 2010

What "pureXMLness" do you have...?

The longer and the more intense the work with XML in a database system of your choice (this is DB2, right?), the more often comes the question: What pureXMLness do I have?

To what degree am I using pureXML? How much XML does my system have? Is our system mostly relational or XML (remember DB2 is hybrid)? How do we measure how XML is used? Give me some numbers for our CIO, it's Q4.

To give a scientifically-proven, valuable answer, something that stands the test of time and holds up to all questions from the business side, I came up with THE pureXMLness factor. How is it computed? The following is the simple query that needs to be run against the DB2 database in question. It's an XQuery of course.

let $i:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysindexxmlpatterns"),
$j:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.syscolumns where coltype='XML'"),
$k:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlstrings"),
$l:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxmlpaths"),
$m:=db2-fn:sqlquery("select xmlcast(count(*) as xml) from sysibm.sysxsrobjects"),
$n:=db2-fn:sqlquery("select xmlcast(cast (avg(xda_hit_ratio_percent) as float) as xml) from sysibmadm.bp_hitratio")
return <pureXMLness>{avg((($i+$j+$m) div 20,($k+$l) div 300,$n))}</pureXMLness>;

Let me know what pureXMLness you have or if you think you have a better formula...

SQL and XML Limits in DB2

How many database partitions can a DPF system have (it is 999 partitions)? How many levels deep can an XML document stored in DB2 be (125 levels)? What is the maximum size of an SQL statement I can code (2097152 bytes)?

If you have questions like that around DB2, then the page SQL and XML limits in the DB2 Information Center is the right one for you. And if you have lots of users, rest assured, the maximum number of concurrent users for a system is 64000.

Monday, November 15, 2010

Oracle to DB2 terminology mapping

Sometimes, things you are searching for, are already there and wait to be found. Today, I have such an example. The DB2 Information Center features a page that lists both the Oracle concepts and DB2 concepts, the "DB2-Oracle terminology mapping".
If you need to explain the DB2 architecture to someone with Oracle background (or vice versa), this comes handy.

Wednesday, November 10, 2010

How long is long? Maximum size of an XML text node in DB2

I was recently asked about the maximum size of a single text node when stored in DB2 pureXML. Basically, when you create a table
CREATE TABLE test(id int, doc xml)

and insert an XML document with several elements, how long can the text between the open and closing tag be, e.g., the value of myText?
INSERT INTO test VALUES(1,'<root><myText>all the text that you want</myText></root>')

The background to this question is that XML values are formatted to data pages. If the XML document is larger than the page size, the document is broken up into subtrees that fit into the page. However, a text node is already a single node. Can a text value be larger than the page size?

My answer: It can be really, really large. Try it out and let me know the biggest value that you could insert (or produce in an update statement).

Tuesday, November 9, 2010

Some flights, appliances, and "one size fits all"

As I wrote before, I was traveling within Europe. Similar to the Skymall brochure in North America, some airplanes feature the Worldshop brochure. I used some spare time and read about the new G3Ferrari Pizzamaker. This brought up memories of the Popcorn Maker, the Ice Cream Maker, the Donut Maker, the bread machine, and all the other advertised appliances. How many of them do you own?

One reason I don't own (all of) them is the storage space required when they are not used. Another reason is that they usually can only do what they have been designed for (sometimes not even that!) and it often requires reading additional instructions, getting familiar with the do's and don'ts. Sometimes, some of those "specialty appliances" are something en vogue for few months, then disappear again. I trust my (hightech) oven and my induction stove on helping me out when I need a pizza or some popcorn and they work fine for base regular food. They are my "one size fits all" solution to my culinary requirements.

The same goes for my database system. I trust DB2 to handle all kinds of data: Strings, numbers, dates, timestamps, and - yes - XML documents. DB2 can manage my data for OLTP applications and in BI scenarios. Sometimes, the crust on the pizza is not as crispy as that from a true Pizzamaker, but I can prepare a pizza side by side with a nice plum tart in my convection oven. Hmm, a taste of Fall...

Thursday, November 4, 2010

DB2 9.7 and changes to DBADM authorities

I was traveling the last 2 weeks and met with business partners and customers. One of the topics that was discussed were the security-related changes that happened in DB2 9.7. Some new roles like DATAACCESS and ACCESSCTRL were introduced and authorities for, e.g., DBADM were reduced.

Everybody agreed that a separation of duties means better access control and higher security. However, less authorities for the DBADM can make the life harder or more complex when there are no changes to development and test processes.

What are the changes that you had to make? What are the best practices that were introduced in your company? And last: What don't you like about the security changes?