Tuesday, November 30, 2010

Indexing PDF Files with DB2 Text Search

This week I was asked how to build a full text index on PDF documents in DB2. Because my hard drive is full of them, I created a new DB2 database and set up DB2 Text Search with so-called rich text document filters. After creating a table, loading data, and indexing the documents, I used the CONTAINS and SCORE functions to search in my PDF files. But let me show you the steps in more detail.

Because DB2 Text Search support can be installed as part of the DB2 installation, my system already had the base support set up.

Thus my endeavors started with downloading the DB2 Accessory Suite. It contains additional document filters that help to extract full text information from different file types such as PDF, word processor, or spreadsheet documents. The DB2 Information Center has instructions for downloading and installing the accessory suite. I checked that all necessary files from the regular DB2 installation were already in place and then succeeded in setting up rich text support.

Setup actually requires two steps: installing the software on the server, then setting it up for an instance. That caused some confusion, but I succeeded. Next came enabling rich text support; the instructions are again in the Information Center. Now on to creating the actual test database (on my Linux shell):

db2 create db ft

DB2 Text Search uses an environment variable DB2DBDFT. This should be set to the database you work with to simplify administration. Hence:

export DB2DBDFT=FT

Finally, we are able to set up our new database for full text search:

db2ts "ENABLE DATABASE FOR TEXT"

The db2ts command is the DB2 Text Search command processor. Now we can create a simple table in the database to hold the documents after connecting to the db:

db2 connect to ft
db2 "create table hlDocs(id int not null primary key, description varchar(200), doc BLOB(10M))"

The second statement above creates a table named "hlDocs" with three columns. The first column is named "id" and is of type integer; note that DB2 Text Search requires a primary key on the indexed table. The second column holds a short string describing the document. The last column is a BLOB that stores the PDF files. To easily insert PDF documents into the table, I created a small delimited file for import (named "blob.del"):

1,'train schedule Paris',ParisSchedule.pdf
2,'Schedule recycling paper collection',Papier_Abfallplan2010.pdf
3,'New resume',Current_resume.pdf
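
For more than a handful of documents, a file like this could be generated instead of typed. The following is only a sketch under some assumptions: make_blob_del is a hypothetical helper (not part of DB2), it uses each file name without the .pdf extension as the description, and it wraps descriptions in double quotes, the default character delimiter for DEL files. The demo runs on empty placeholder files in a scratch directory.

```shell
#!/bin/sh
# make_blob_del is a hypothetical helper: it writes one DEL row per PDF
# in the given directory, using the file name (minus .pdf) as description.
make_blob_del() {
    dir=$1
    id=0
    for path in "$dir"/*.pdf; do
        [ -e "$path" ] || continue      # no PDFs: the glob stays unexpanded
        file=${path##*/}                # strip the directory part
        id=$((id + 1))
        printf '%d,"%s",%s\n' "$id" "${file%.pdf}" "$file"
    done
}

# Demo with empty placeholder files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/ParisSchedule.pdf" "$demo_dir/Current_resume.pdf"
make_blob_del "$demo_dir" > "$demo_dir/blob.del"
cat "$demo_dir/blob.del"
```

Real descriptions would still need to be edited in by hand; the file names only serve as placeholders.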

The data in blob.del can now be imported, assuming the PDF files are in the current directory:
db2 import from blob.del of del lobs from . insert into hlDocs

It is important to check that all rows were inserted:
Number of rows read         = 3
Number of rows skipped      = 0
Number of rows inserted     = 3
Number of rows updated      = 0
Number of rows rejected     = 0
Number of rows committed    = 3
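
When importing many files, this check could be scripted rather than done by eye. The sketch below assumes the summary lines look exactly like the ones above; check_import is a hypothetical helper name, not a DB2 tool. It reads the summary on standard input and reports whether every row that was read was also inserted and none was rejected.

```shell
#!/bin/sh
# check_import is a hypothetical helper: it reads an IMPORT summary on
# stdin and fails unless all rows read were inserted and none rejected.
check_import() {
    awk -F'=' '
        /Number of rows read/     { read_n = $2 + 0 }
        /Number of rows inserted/ { ins_n  = $2 + 0 }
        /Number of rows rejected/ { rej_n  = $2 + 0 }
        END {
            if (read_n > 0 && rej_n == 0 && read_n == ins_n) {
                printf "OK: %d of %d rows inserted\n", ins_n, read_n
            } else {
                printf "FAILED: read=%d inserted=%d rejected=%d\n", read_n, ins_n, rej_n
                exit 1
            }
        }'
}

# Summary copied from the import run above.
summary='Number of rows read         = 3
Number of rows skipped      = 0
Number of rows inserted     = 3
Number of rows updated      = 0
Number of rows rejected     = 0
Number of rows committed    = 3'

result=$(printf '%s\n' "$summary" | check_import)
echo "$result"
```

With a live system, the output of the import command could be piped into check_import directly; lines that do not match one of the three patterns are ignored.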

Now the test data is in, but we don't have an index yet. We can build one with the db2ts command again, first creating the index and then updating it:

db2ts "CREATE INDEX idx1 FOR TEXT ON hldocs(doc) FORMAT inso"
db2ts "UPDATE INDEX idx1 FOR TEXT"

CAUTION: The option "FORMAT inso" is not clearly explained in the documentation (yet, hence this post). It tells DB2 to use the additional document filters that we installed as part of the DB2 Accessory Suite.
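
If several tables need a rich text index, the create and update steps could be parameterized. This is a minimal sketch, not an official procedure: index_steps is a hypothetical helper that only prints the two db2ts statements, and they are executed only when db2ts is on the PATH and DB2DBDFT is set, as assumed throughout this post.

```shell
#!/bin/sh
# index_steps is a hypothetical helper: it prints the two db2ts
# statements for a rich text (FORMAT inso) index on one column.
index_steps() {
    idx=$1
    table=$2
    column=$3
    printf 'CREATE INDEX %s FOR TEXT ON %s(%s) FORMAT inso\n' "$idx" "$table" "$column"
    printf 'UPDATE INDEX %s FOR TEXT\n' "$idx"
}

steps=$(index_steps idx1 hldocs doc)
echo "$steps"

# Execute the statements only if db2ts is installed and DB2DBDFT is set;
# the loop ends at the first statement that fails.
if command -v db2ts >/dev/null 2>&1 && [ -n "${DB2DBDFT:-}" ]; then
    echo "$steps" | while IFS= read -r stmt; do
        db2ts "$stmt" || exit 1
    done
fi
```

Printing the statements first makes it easy to review them before anything touches the database.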

Finally, we can test the index using the CONTAINS and the SCORE functions.
db2 "select id,description from hldocs where contains(doc,'Paris')=1"

ID          DESCRIPTION
----------- --------------------------------------------------
          1 'train schedule Paris'

  1 record(s) selected.

db2 "select id,description,cast(score(doc,'Friedri*') as decimal(5,3))  as score from hldocs order by 3 desc"

ID          DESCRIPTION                           SCORE
----------- ------------------------------------- -------
          2 'Schedule recycling paper collection'   0.046
          3 'New resume'                            0.022
          1 'train schedule Paris'                  0.000

  3 record(s) selected.
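
The two functions can also be combined, so that CONTAINS filters the rows and SCORE ranks only the matches. The sketch below assumes the hldocs table and ft database from this post; search_sql is a hypothetical helper that builds the statement and doubles single quotes so the term stays a valid SQL string literal. The query is run only where the db2 command is available.

```shell
#!/bin/sh
# search_sql is a hypothetical helper: it prints a query that filters
# with CONTAINS and ranks the matching rows with SCORE for one term.
search_sql() {
    # Double any single quote so the term is a valid SQL string literal.
    term=$(printf '%s' "$1" | sed "s/'/''/g")
    printf "select id, description, cast(score(doc,'%s') as decimal(5,3)) as score from hldocs where contains(doc,'%s')=1 order by 3 desc\n" "$term" "$term"
}

sql=$(search_sql "Paris")
echo "$sql"

# Run the query only where the db2 command line processor is installed.
if command -v db2 >/dev/null 2>&1; then
    db2 connect to ft && db2 "$sql"
fi
```

The quoting here only covers SQL string literals; characters with a special meaning in the text search query syntax itself are passed through unchanged.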

Most of the steps are dedicated to proper setup of DB2 Text Search with the additional filters for PDF documents. However, once it is set up, inserting data and searching the documents is simple.