Friday, June 28, 2013

Video: say No! No! and No! (NoSQL, Big Data Parody)

It is Friday, the weekend is close by, and this week I started playing with DB2's upcoming NoSQL for JSON support. So why not finish off the hard work week with some insights into the history of NoSQL and Big Data and a really deep technical look at the coolest features...


Thursday, June 27, 2013

(Updated) NoSQL and JSON with DB2

You may have heard that DB2 10.5, which recently became available, includes a technology preview "DB2 NoSQL JSON". JSON is the JavaScript Object Notation. The important stuff for JSON in DB2 can be found in the sqllib/json directory. Once I had added the "lib" directory to the Java CLASSPATH, I could invoke a new command line processor "db2nosql" (found in "bin").
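
Setting the classpath is a one-liner; a minimal sketch, assuming a default instance home and Java 6 style classpath wildcards:

# add all jars in sqllib/json/lib to the Java classpath
export CLASSPATH=$CLASSPATH:$HOME/sqllib/json/lib/*

With the classpath in place, launching the shell and connecting to a database looked like this: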

hloeser@rotach:~/sqllib/json/bin$ sh db2nosql.sh
JSON Command Shell Setup and Launcher.
This batch script assumes your JRE is 1.5 and higher. 1.6 will mask your password.
Type db2nosql.sh -help to see options
Enter DB:jsontest


IBM DB2 NoSQL API 1.1.0.0 build 1.0.169
Licensed Materials - Property of IBM
(c) Copyright IBM Corp. 2013 All Rights Reserved.

Debug mode is off.
nosql>
nosql>Type your JSON query and end it with ;
nosql>Type help() or help for usage information
nosql>
nosql>Setup Tables and Functions seems to have not been created or have been created incorrectly. Please type enable(true) and enter to setup them up. You must have the correct admin privileges.
If you do not, type enable(false) and enter to see the SQL that will be used.


After entering the database name I got connected. The next step was to enable the database for JSON processing, as suggested in the output above.

nosql>enable(true)
Executing SQL...
Database Artifacts created successfully


If you use "enable(false)" the tool just prints the DDL so it can be executed manually. It creates a system table SYSTOOLS.SYSJSON_INDEX and several functions. Next I set a JSON namespace. By default the user schema is taken as the namespace. Because I plan to look into storage details later on, I chose "TEST" (which uses the SQL schema name "TEST" under the covers).

nosql>use test
Switched to schema TEST



Then I was good to go and inserted two documents. Note that it is not necessary to create any table, collection, index, or anything else up front. Finding one of the stored documents via a predicate also worked.

nosql>db.obj.insert({name: "Henrik", country: "DE"})
OK
nosql>db.obj.insert({name: "Jim", country: "US"})
OK
nosql>db.obj.find({country: "US"})
nosql>Row 1:
nosql> {
nosql> "_id":{"$oid":"51c9acf301c690e828779af2"},
nosql> "name":"Jim",
nosql> "country":"US"
nosql> }
nosql>1 row returned in 134 milliseconds.
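
The shell offers more of the familiar MongoDB-style commands beyond insert and find. Two that I tried are sketched below; note that the projection syntax is an assumption on my part, so double-check with help():

nosql>db.obj.count()
nosql>db.obj.find({}, {name: 1})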


Some more commands are described in a recent developerWorks article on JSON in DB2. I also plan to write more about my adventures in this blog.
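
Because the documents end up in a regular DB2 table, you can already take a first peek at the storage from SQL. A sketch, assuming the collection is backed by a table OBJ in schema TEST with a BLOB column named DATA (names I inferred, so verify via SYSCAT.TABLES and SYSCAT.COLUMNS):

-- documents are stored in BSON; convert them back to readable JSON
select systools.bson2json(data) from test.obj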

Added 27.06.2013:
A couple more developerWorks articles are now available. Finally, I have a video introducing the technology.


Wednesday, June 26, 2013

Combining data to find suspect in over 700 highway shootings

In Germany a 57-year-old trucker was arrested over the weekend. He is accused of shooting more than 700 times at trucks (mostly car transporters) and cars over a span of several years. Only a few people were injured, as these were drive-by shootings on highways and he was aiming at the load, not the drivers. As the case couldn't be resolved for some years, it eventually ended up with the German federal police. They used new devices that automatically collect license plate information at central highway locations, plus cell tower data from mobile phones, to drill down on the suspect.

As soon as the arrest was reported, a discussion about data privacy started in Germany (again). Is it ok to store all the license plate information, even if it is deleted after 10 days? Does a case like this warrant obtaining cell tower information (who had their mobile phone turned on around possible shooting locations while traveling on the highway)? Was it ok to combine the data to find the suspect? There are pros and cons, and I don't want to comment in either direction. When I teach data management at university, I tell my students to be aware of the amount of data they produce, e.g., by carrying a mobile phone, by traveling, or by using the Internet. Such data is stored for billing purposes and for legal reasons (laws require companies to keep data for 30 or 90 days or even longer). And when there is data, there is always an opportunity to work with it, legally or illegally.

Are you in favor of using such data, maybe within certain limits? Are you always aware of what data you produce? Any opinions or thoughts?

Monday, June 24, 2013

DB2 SQL Compatibility: CREATE INDEX ... EXCLUDE NULL KEYS

One of the frequent problems when migrating databases from Oracle to DB2 was unique indexes. Oracle does not include NULL keys when checking for uniqueness of values, DB2 does - or better: did. Starting with DB2 10.5 it is possible to specify whether to INCLUDE NULL KEYS (the default) or to EXCLUDE NULL KEYS when CREATE INDEX is executed. This allows either the classic DB2 semantics or mimicking the behavior of other database systems, and thereby it reduces migration effort further.

In the following I tried the new feature by creating two tables, T1 and T2, with a unique index on each. On T2 the index uses "exclude null keys". So let's see how they respond to my attempts to insert some values:


create table t1(id int, s varchar(60))
DB20000I  The SQL command completed successfully.
create table t2(id int, s varchar(60))
DB20000I  The SQL command completed successfully.
create unique index t1_id_u_idx on t1(id)
DB20000I  The SQL command completed successfully.
create unique index t2_id_u_idx on t2(id) exclude null keys
DB20000I  The SQL command completed successfully.
insert into t1 values(1,'one')
DB20000I  The SQL command completed successfully.
insert into t2 values(1,'one')
DB20000I  The SQL command completed successfully.
insert into t1 values(2,'two')
DB20000I  The SQL command completed successfully.
insert into t2 values(2,'two')
DB20000I  The SQL command completed successfully.
insert into t1 values(2,'two again')
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T1" from having duplicate values for the index key.
SQLSTATE=23505
insert into t2 values(2,'two again')
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T2" from having duplicate values for the index key.
SQLSTATE=23505
insert into t1 values(null,'null')
DB20000I  The SQL command completed successfully.
insert into t2 values(null,'null')
DB20000I  The SQL command completed successfully.
insert into t1 values(null,'null again')
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "HLOESER.T1" from having duplicate values for the index key.
SQLSTATE=23505
insert into t2 values(null,'null again')
DB20000I  The SQL command completed successfully.

As can be seen, both indexes check the uniqueness of non-NULL values: when I try to insert the value "2" again, an error is returned. Both indexes also accept a NULL value. However, on T1 only a single NULL value can be inserted before a violation of the uniqueness criteria is reported. On the second table we can insert another NULL value without problems because NULL keys are not included in the index.
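
To verify which semantics an index has, the catalog can be queried. A quick sketch; I am assuming the NULLKEYS column that DB2 10.5 added to SYSCAT.INDEXES, so check the Information Center if it is not there:

select indname, uniquerule, nullkeys
from syscat.indexes
where tabname in ('T1', 'T2')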

Try it yourself, best by migrating a database from Oracle to DB2... ;-)

Thursday, June 20, 2013

WLB, ACR, and Client Affinity in a DB2 pureScale environment

At the end of April a longer article was published on IBM developerWorks describing the concepts and administration of workload balancing (WLB), automatic client reroute (ACR), and client affinities for the DB2 pureScale feature. The mentioned technologies are key to setting up a shared-disk DB2 cluster for 24x7 availability. So if you want to learn about server lists, member priorities, connection timeouts, and other related concepts, this is a good place to start reading.

It is important to note that right now the article is based on DB2 10.1. The new release DB2 10.5 adds so-called member subsetting to the above concepts, so that an application can be configured to use a subset of the available members.
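
On the client side, WLB and ACR are typically switched on in db2dsdriver.cfg. A minimal sketch with made-up host name, port, and database name:

<configuration>
   <databases>
      <database name="SAMPLE" host="member1.example.com" port="50001">
         <!-- spread connections and transactions across all members -->
         <wlb>
            <parameter name="enableWLB" value="true"/>
         </wlb>
         <!-- reroute clients to surviving members on failure -->
         <acr>
            <parameter name="enableACR" value="true"/>
         </acr>
      </database>
   </databases>
</configuration>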

Tuesday, June 18, 2013

developerWorks article on setting up DB2 for encrypted communication with SSL

DB2 can encrypt data both in storage and in transit. Securing stored data makes sure that even if someone gains access to the disk, the actual data cannot be read. Encrypting data in transit prevents sniffing, i.e., someone trying to listen in on the communication channels and spy on the data (Footnote: this is not entirely true). There are different options in DB2 for securing data on disk and during communication. SSL, the Secure Sockets Layer, is a commonly used technology to encrypt communication, e.g., for Web access with the HTTPS protocol.

A new article has been published on developerWorks titled "Secure Sockets Layer (SSL) support in DB2 for Linux, UNIX, and Windows" that provides step-by-step instructions on how to set up SSL for use with DB2. If you haven't looked into the topic before, this is a good starting point.
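
To give an idea of what the article covers, the server-side setup boils down to a handful of commands. A sketch only; the key database path, password, certificate label, and port number are made up:

# create a key database plus stash file with GSKit
gsk8capicmd_64 -keydb -create -db "/home/db2inst1/server.kdb" -pw "myServerPw" -stash
# add a self-signed certificate (fine for testing)
gsk8capicmd_64 -cert -create -db "/home/db2inst1/server.kdb" -pw "myServerPw" -label "mySSLCert" -dn "CN=myhost.example.com"
# point DB2 to the key database and choose an SSL port
db2 update dbm cfg using SSL_SVR_KEYDB /home/db2inst1/server.kdb SSL_SVR_STASH /home/db2inst1/server.sth SSL_SVR_LABEL mySSLCert SSL_SVCENAME 50443
# enable SSL as communication protocol and restart the instance
db2set DB2COMM=TCPIP,SSL
db2stop
db2start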

Monday, June 17, 2013

DB2 10.5 (Kepler) and Information Center available now

The latest release of DB2, version 10.5, code-named "Kepler", is available now. I recommend visiting the DB2 10.5 Information Center and then reading the "What's new" and "What's changed" sections. Significant changes are not only in features and functionality, but also in the packaging.

The DB2 database product editions now include both "regular" and advanced Enterprise Server and Workgroup Server editions (AESE, AWSE, ESE, and WSE as acronyms to remember) plus the Express, Express-C, and Developer editions. As you can see from the features by DB2 edition overview, the database partitioning feature ("shared nothing") is now included in both advanced editions. The InfoSphere Warehouse edition is gone; it is simply named DB2 now, the name of its foundation for all the past years.

Technical highlights of this new DB2 release include BLU Acceleration (column-organized tables) for building in-memory data marts and many features for DB2 pureScale, including rolling fixpack updates and online topology changes to the cluster.
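
As a quick taste of BLU Acceleration: creating a column-organized table only takes one extra clause. A minimal sketch with a made-up table:

create table sales_fact (
   sale_date date,
   store_id  integer,
   amount    decimal(10,2)
) organize by column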

Tuesday, June 4, 2013

Some typical DB2 registry settings for Temenos T24 and other OLTP systems

In one of my previous articles I described which table properties (CREATE or ALTER TABLE) can be used for XML-based OLTP systems like Temenos T24. Today I will show and explain some DB2 registry settings that are typically in use for such (and other) systems.

A list with almost all (documented) registry and environment variables can be found in the DB2 Information Center. As you can see, those variables are grouped into different categories, e.g., compiler, communication, performance, or general settings. Because some settings require a restart of DB2, the configuration is typically applied once when setting up a system.
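
Registry variables are read and changed with the db2set command. A quick sketch, run as the instance owner:

# list all registry variables currently set
db2set -all
# check a single variable
db2set DB2_PARALLEL_IO
# set it (takes effect after the instance is restarted)
db2set DB2_PARALLEL_IO=*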

DB2_PARALLEL_IO=*
This tells DB2 how to handle prefetch requests and what to assume about disks per container. Prefetching is used to bring in data "ahead of its use", i.e., it reduces I/O wait. The above setting is simple and a good compromise: all tablespaces are covered, it is assumed that all have the same layout of 6 disks per container, and hence multiple smaller prefetch requests are issued in parallel. There are different design philosophies "out there" with sometimes contradicting recommendations. Also take a look at Configuring for good performance and Optimizing table space performance when data is on RAID devices.

DB2_USE_ALTERNATE_PAGE_CLEANING=ON
Page cleaning is the reverse of prefetching: it makes sure that the bufferpool (data cache) always has enough space available to bring in additional data pages from disk. The alternate page cleaning is more aggressive in writing out changed data, makes better use of I/O bandwidth, and helps keep slots available in the bufferpool for our OLTP system.

DB2_EVALUNCOMMITTED=ON
DB2_SKIPDELETED=ON
DB2_SKIPINSERTED=ON
All three registry settings are typically used together to improve concurrency for the Cursor Stability and Read Stability isolation levels. When scanning data and evaluating predicates, a slightly different, "less patient" approach is taken, which helps throughput in OLTP systems.

DB2MAXFSCRSEARCH=1
When inserting new data, DB2 has to search for a data page with free space. That space is managed with free space control records (FSCRs) which are located every 500 pages or so (see here for details on table organization and FSCRs). By default DB2 searches up to 5 such records to find a page to squeeze the new record into. This helps fill up pages and efficiently reuse space left over from delete or some update operations. However, it does not help insert performance, which is critical in OLTP systems. Hence a setting of 1 is a nod towards performance while still trying to reuse space. A more drastic setting would be APPEND ON for a table where new data is always inserted at the end, as sketched below.
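
The more drastic alternative in SQL; the table name is just a placeholder:

-- skip the free space search entirely; new rows always go to the end
alter table mytable append on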

DB2_FORCE_NLS_CACHE=TRUE
This setting only applies on AIX, not on Linux, and caches the codepage and territory information for a connection. As T24 and other OLTP systems do not change the "national language support" (NLS) settings, i.e., the codepage and territory information, this helps performance slightly and also avoids possible lock contention while accessing that information.
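
Putting it all together, applying the discussed settings could look like the following sketch. Most of them require an instance restart to take effect:

db2set DB2_PARALLEL_IO=*
db2set DB2_USE_ALTERNATE_PAGE_CLEANING=ON
db2set DB2_EVALUNCOMMITTED=ON
db2set DB2_SKIPDELETED=ON
db2set DB2_SKIPINSERTED=ON
db2set DB2MAXFSCRSEARCH=1
# AIX only:
db2set DB2_FORCE_NLS_CACHE=TRUE
db2stop
db2start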

That was my quick overview of a few of the settings that should be considered for OLTP systems like Temenos T24 on DB2. What is left is a look at typical DB2 database manager and database configurations. But that is another article...