Data Henrik: July 2013

Wednesday, July 31, 2013

Hot out of the lab: DB2 Best Practices: Tuning and monitoring database system performance (Completely Revised!)

Fresh and hot out of the DB2 labs: The existing DB2 Best Practices paper on database system tuning and monitoring has been completed revised and updated. It covers DB2 10.0, DB2 10.5, and PureData for Operational Analytics. On 80 pages it starts with hardware configuration for good performance (CPU, disk, memory, ...), configuration of the operating system and more, then delves into initial DB2 configuration for different environments. The next big section is on monitoring the system performance and which tools and APIs to use. Once you have identified possible problems, they should be addressed. The paper discusses how to deal with disk, CPU, memory and "lazy system" bottlenecks.

This is a very good update to the DB2 Best Practices series on IBM developerWorks. I covered some of the papers in past blog entries.

Friday, July 26, 2013

Time to restructure life: Cognos with DB2 BLU for extreme performance

It is Friday, time for a video. DB2 10.5 comes with BLU Acceleration which combines columnar storage, extreme compression, full parallelization and hardware exploitation as well as in-memory computing for faster analytics (almost all buzzwords mentioned ;-). In the video today you see a comparison of Cognos on top of DB2 10.1 and DB2 10.5, showing the performance differences.

Why did I say "time to restructure life"? Just recently I visited a customer where many users kick off queries after coming into the office, then get some coffee and thereafter pick up the analytic results. With DB2 10.5 there is the fear of not having time for the coffee between tasks. As both an IT AND life coach my advise is to restructure life: First kick off the queries, collect the results, then have a coffee break to discuss business strategies based on current data. How is that? Maybe the results are even a reason that someone pays your tab...?

Epilepsy and computers: Types of Memory

Last months my son and I spent a week in hospital again for a regular post-surgery check up. As you might know from past blog posts, my youngest son had a brain tumor which caused epilepsy. The tumor and some brain tissue including the hippocampus on his right side were removed two years ago. A hippocampus is like the central memory controller in a computer and fortunately we have two of them for redundancy and failover.

When you work with computers, especially database systems like DB2 or Oracle, all day, it is interesting to look deeper into how the human memory works. There is a sensory memory acting like an I/O buffer, the short-term memory (think CPU registers, caches, and main memory), and the long-term memory (disk, tape, etc.). What I learned at the recent hospital visit is about the different types of long-term memory. Typically it is classified as declarative memory and procedural memory. You can think of the declarative memory as your "storage of facts" or the data in a database system. The procedural memory is for processes that can be performed like walking, playing piano, riding a bike, etc. In DB2 you could compare it to the package cache where information about how to execute statements is stored.

Having different types of memory and different ways of how new information is acquired and managed leads to the surprising fact that even though someone might have trouble learning new facts (because a hippocampus is missing), that person could be still excellent in learning and reciting piano pieces.

As written earlier: How does someone cope with memory problems caused by epilepsy? Our son has occupational therapy to develop strategies for more efficient use of his working memory and also to train (build up) his capabilities. The Epilepsy Society has this summary on the memory issue and workarounds that can be used like sticky notes (external memory), task lists, etc.

Thursday, July 25, 2013

DB2 in virtualized environments

I had written about DB2 and virtualization in the past. You can use DB2 in many virtual environments and there are already kind of "dated" redbooks explaining details about DB2 with VMWare, PowerVM, and others. Today I want to point you to two interesting resources:

You can even virtualize a DB2 pureScale cluster on Linux: The "yes, it is supported" can be found in the Information Center when looking for KVM. All the technical details of setting such a cluster up are discussed in this developerWorks article: Virtualize the IBM DB2 pureScale Feature on Linux using Kernel-based Virtual Machine
An overview of all supported DB2 virtualization environments, e.g., which hypervisor, which guest operating systems, etc., is provided on this developerWorks page.

Friday, July 19, 2013

Redbook Solution Guide: Faster DB2 Performance with IBM FlashSystem

A so-called IBM Redbook Solution Guide about using flash memory for DB2 has been published. It gives an overview about how and where performance could be improved by using an IBM FlashSystem and provides some test results. The short article also has many links to background material. It is not a deep article on that topic, but a good introduction into possible performance gains and when to consider investing into flash memory instead of an entire new system.

Monday, July 15, 2013

Articles related to database security (relational, NoSQL, Big Data)

Over the past two weeks there have been some new articles related to database security been published on developerWorks, some have been updated. At IBM, the products in this area are all called InfoSphere Guardium, covering activity monitoring, auditing, encryption and more.

A series of articles describes the use of Guardium for a NoSQL database like MongoDB. Security in the context of "Big Data" is covered in a recently updated developerWorks article, dealing with the Hadoop framework.

Wednesday, July 10, 2013

Extended row size support in DB2: Fitting more data into a single row than the page size would typically allow (and sometimes it makes sense, but that is another story - stay tuned for user experience...)

One of the new features of the new DB2 10.5 release is the support for so-called extended row size. As you might have guessed from the article's title, you can squeeze a lot of data into a single row now. The feature can be used for compatibility reasons with other vendors, to consolidate a table with "side tables" into a single one, or to possibly improve performance by moving tables to smaller page sizes. Let's take a look into the details of "extended row size".

Many of the DB2 size-related limits are documented in the overview of SQL and XML limits, however all the information on column sizes, row sizes, and what can be fitted into data pages of different size is part of the CREATE TABLE documentation. There, way down, we find that for 4k pages the row size is limited to 4005 bytes, for 8k to 8101 bytes, for 16k to 16293 bytes, and for 32k to 32677 bytes. So, how can we insert more data into a row?

A new database configuration parameter, extendend_row_sz, has been added. For new databases the default is ENABLE, for migrated databases DISABLE (for backward compatibility). I created a new database and the extended row size support is enabled. Let's give it a try:

db2 => connect to hl

   Database Connection Information

Database server        = DB2/LINUX 10.5.0
SQL authorization ID   = HLOESER
Local database alias   = HL

db2 => create table ers(id int unique not null, firstname varchar(3000), lastname varchar(3000))
DB20000I The SQL command completed successfully.
db2 => update db cfg using extended_row_sz disable
DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
db2 => create table ers2(id int unique not null, firstname varchar(3000), lastname varchar(3000))
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0286N A table space could not be found with a page size of at least "8192"
that authorization ID "HLOESER" is authorized to use. SQLSTATE=42727
db2 => update db cfg using extended_row_sz enable
DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
db2 => create table ers2(id int unique not null, firstname varchar(3000), lastname varchar(3000))
DB20000I The SQL command completed successfully.
db2 => insert into ers values(1,'Henrik','Loeser')
DB20000I The SQL command completed successfully.

The database configuration parameter can be changed online as can be seen above. Creation of the table ERS2 is blocked with extended row sizes DISABLED, then it works after enabling it again. Inserting some simple test data into the table succeeds. To test this new feature further, I created a data file with 3 rows (ID is 2 to 4). The reason for this type of test is the size limitation of the DB2 CLP and trying to avoid a GUI. The first row has two long strings with about 3000 bytes each, the second row a short and a long string, the last row a long and a short string. Importing the data is no problem:

db2 => import from "data.txt" of del insert into ers
SQL3109N The utility is beginning to load data from file "data.txt".

SQL3110N The utility has completed processing. "3" rows were read from the
input file.

SQL3221W ...Begin COMMIT WORK. Input Record Count = "3".

SQL3222W ...COMMIT of any database changes was successful.

SQL3149N "3" rows were processed from the input file. "3" rows were
successfully inserted into the table. "0" rows were rejected.

Number of rows read         = 3
Number of rows skipped      = 0
Number of rows inserted     = 3
Number of rows updated      = 0
Number of rows rejected     = 0
Number of rows committed    = 3

So how can we find out more about how this feature works? The documentation says that some data might be stored outside the row as LOB if the data is too big for the row. I am going to use db2dart for that. First I obtain the TABLE ID and TABLESPACE ID of our test table ERS:

db2 => select tableid,tbspaceid from syscat.tables where TABNAME='ERS'

TABLEID TBSPACEID
------- ---------
      8         2

1 record(s) selected.

The next step is to invoke db2dart from the shell for the database "hl":
hloeser@rotach:~$ db2dart hl /DD

Please enter
Table ID or name, tablespace ID, first page, num of pages, and y/n for verbose:
(may suffix page number with 'p' for pool relative if working with a pool-relative tablespace)
8 2 0 80 y

         The requested DB2DART processing has completed successfully!
                    Complete DB2DART report found in: HL.RPT

The "/DD" option direct db2dart to dump data pages. It prompts us to enter the IDs for the table and tablespace (which we know), the first page to dump (zero), the number of pages to dump (80), and whether verbose mode is preferred (yes). Based on our input a report file "HL.RPT" is generated. After some general information, free space control records, and a table description record, the first actual data records can be seen:
            Slot 4:

               Offset Location = 2850 (xB22)
               Record Length = 34 (x22)

               Record Type = Table Data Record (FIXEDVAR)

               Record Flags = 0

               Fixed part length value = 14

                  Column 1:
            Fixed offset: 0
                  Type is Long Integer
                  Value = 1

                  Column 2:
            Fixed offset: 4
                  Type is Variable Length Character String
                  Length = 6 Offset = 14
                      48656E72 696B                          Henrik

                  Column 3:
            Fixed offset: 9
                  Type is Variable Length Character String
                  Length = 6 Offset = 20
                      4C6F6573 6572                          Loeser

         Slots Summary: Total=5, In-use=5, Deleted=0.
This is our first row (ID 1) that we inserted, everything is stored in a single record, as expected and as usual. Reading on in the file, the other rows appear. I won't print them here for size reasons, but here is how they are stored.

ID 2, two long strings: first string in the record, second string outside the record
ID 3, short string, long string: both strings within a data page, single record because they fit
ID 4, long string and short string: both strings within a data page, single record because they fit

The column value for row with ID2 is stored outside the record as LOB. The db2dart output looks like this:

                  Column 3:
            Fixed offset: 9
                  Type is Variable Length Character String
                  Length = 24 Offset = 3020

                  Var Descriptor Length = 24
                     lfd_check = 89
                     lfd_version = 0
                     lfd_life_lsn = 0000000000064942
                     lfd_size = 3024
                     lfd_first = 2
                     lfd_dir = 0

         Slots Summary: Total=1, In-use=1, Deleted=0.
What can also be seen is that only a single slot is in use on this data page because of the record size.

I hope you found these details interesting and it encourages you to use db2dart to look into storage structures. When I visit universities, I always tell students to dig deeper into the implementation of database systems...

Friday, July 5, 2013

Friday Film: IBM history in 30 minutes

It is Friday and if you can spare 30 minutes today or over the weekend, I recommend this 30 minute IBM Centennial Film. "They Were There - People who changed how the world works". Show this to your kids and they will finally understand that we were not born with technology many take for granted today, including booking a flight, storing data, or having the UPC/bar code on almost everything...

Wednesday, July 3, 2013

IBM and DB2 analytics accelerators, in-memory computing, Big Data, NoSQL, ...

In recent months I had to answer questions about what technologies and products IBM offers for acceleration of analytic queries, whether any products like DB2 (for z/OS and Linux, UNIX, and Windows) or Informix offer in-memory computing capabilities, why the products on the mainframe and Intel/POWER platforms differ, and much more. Often an additional request is to answer with as few as possible sentences. :)

Here is an attempt to give you some answers and background information from my point of view, not necessarily IBM's. The article is written on a single afternoon with the plan to point to it and say "RTFM" instead of repeating myself...

Let me start this endeavor how I would do it in front of students when I teach. So let's take a brief look at some relevant technologies and terms:

In-memory computing or in-memory processing: Instead of loading data from disk to memory on demand, all or most of the data to be processed is kept in memory for faster access. This could be traditional RAM or even flash memory to allow persistence. If RAM is used, data is initially loaded from disk.
Columnar storage or column-oriented databases: Traditional (relational) database systems use row-oriented storage and are optimized for row by row access. However, for analytic applications typically large amounts of data with similar values or properties is scanned and aggregated. With column-oriented storage of many but the same values, only a single entry would be stored pointing to all related "rows" (rowid - this is similar to RID-list compression for indexes). Columnar storage usually yields excellent compression ratios for analytic data sets.
Massively parallel processing (MPP) and Single Instruction, Multiple Data (SIMD): By processing several data sets or values in parallel, either by means of many CPUs or special CPU instructions, the entire data volume can be processed faster than in a regular, serial way (one CPU core processing the entire query).
Appliances, workload optimized appliance or expert-integrated systems: A combination of hardware and software that typically is "sealed off". Thus, it does not offer many knobs to turn which makes it simple to use. Analytic appliances and data warehouse appliances therefore have a focus on hardware and software for analytic query processing.
Big Data: This term is used to describe the large data sets that typically cannot be stored and processed with traditional database systems. MapReduce or "divide and conquer" algorithms are used for processing, similar to parallel processing as mentioned above. The boundaries between "traditional" database systems and Big Data systems are moving.
NoSQL: For handling big data sets other ways of instructing the database systems than SQL are used, hence "No SQL". However, sometimes "NoSQL" could also mean "not-only SQL", hinting that the boundaries are moving and hybrid systems are available.

Based on specific customer requirements, on what is available in terms of technology, what makes sense cost-wise, and what fits into the strategy, different IBM products are available which use one or more of the mentioned technologies or relate to the terms.

DB2 for Linux, UNIX, and Windows (LUW) has been using parallelism (MPP, see above) for many years for data warehouses. Based on the shared nothing principle for the data, it can parallelize query processing and aggregate data quickly. It has been around as DB2 Parallel Edition, DB2 Database Partitioning Feature, DB2 Balanced Configuration Unit (BCU), InfoSphere Warehouse, and other names.
DB2 LUW with BLU Acceleration, introduced in DB2 10.5, makes use of an appliance-like simplicity. By setting a single switch, data is stored column-oriented with the mentioned benefits. Based on improvements to the traditional bufferpool or data cache of row-oriented database technology, it can hold all or only parts of the data in memory. Thus, in-memory processing is possible. BLU Acceleration utilizes SIMD technology to parallelize and speed up work and data is layed out for the processor architecture. Individual tables can be either stored in a row-oriented or column-oriented way and both types of table can be accessed in a single (complex) query.
IBM Netezza is an analytic or data warehouse appliance. It makes use of specialized hardware and MPP (see above) to speed up query execution.
Informix has the Informix Warehouse Accelerator to serve complex analytic queries. It uses column-oriented storage and parallelism and data is held in memory.
The IBM DB2 Analytics Accelerator (IDAA) can be added to DB2 for z/OS. The main design goal was to keep the System z DB2 attributes like security and the approach to administration. IDAA is based on the Netezza technology (see above) and integrates the appliance into the DB2 for z/OS environment. All access and administration is done through DB2.
InfoSphere BigInsights is IBM's Hadoop-based offering for the Big Data market.

To simplify the purchase process, the administration, to reduce the so-called time to value and for many more marketing reasons ;-), IBM has introduced several appliances that make use of the above listed products. They are all named IBM PureData Systems now, but based on the specific name, serve different markets:

IBM PureData System for Hadoop: An appliance built with InfoSphere BigInsights for Big Data-related problems.
IBM PureData System for Analytics: Based on Netezza technology
IBM PureData System for Operational Analytics: The appliance is built using DB2 LUW with the Database Partitioning Feature
IBM PureData System for Transactions: This appliance is built with DB2 LUW. It does (right now) not serve the analytics or Big Data market, but is designed for OLTP workloads. It uses the DB2 pureScale technology for a shared disk cluster with the goal of continous availability and scale out performance.

With my spare time this afternoon coming to an end, I will end this introduction. Please leave comments for hints of where to add/remove/fix it. And as you could see, I was able to resist the urge to mention SAP HANA, Oracle Exadata, Sybase IQ, Teradata, Greenplum, Microsoft, etc. :)

Pages