Henrik's thoughts on life in IT, data and information management, cloud computing, cognitive computing, covering IBM Db2, IBM Cloud, Watson, Amazon Web Services, Microsoft Azure and more.
Friday, July 27, 2012
Some years back IBM was advertising a very special box. In the commercial they showed what it could be good for. One thing is definitely missing: ice cream. Today is a hot day here in the South of Germany, and I am looking forward to some ice cream after work and then some fun at the Kulturufer ("shore of culture" - a well-known tent festival on the shores of Lake Constance in Friedrichshafen with music, theater, comedy, magic, and much more).
But now back to the magic box:
Tuesday, July 17, 2012
Another year passed by: Energy consumption in a passive house
In the first year of living in the new house I read the energy meters more or less weekly; then it turned into a bi-weekly to monthly activity. This past year I looked at the meters mostly during the winter months and almost forgot to read them last weekend when another year had passed. When you look at the energy consumption, there are some surprising numbers:
| | 2012 | 2011 | 2010 | 2009 |
|---|---|---|---|---|
| Household (kWh) | 2597 | 2605 | 2473 | 2244 |
| Heating/Ventilation/Water (kWh) | 2713 | 2713 | 2858 | 2782 |
Even though the past year was a leap year (and had a leap second!), the heating/ventilation/water consumption stayed exactly the same. I had hoped for a smaller number, but we had some very chilly weeks in February. Household consumption has also stayed roughly constant, with most consumers now in place. Most of it should come from cooking, laundry, and my home office with computer, monitor, phones, etc.
Our house does not have a smart meter that lets you track consumption online and in real time. However, I talked with some "owners", and the bottom line is that after the first few weeks they barely make use of it. It is similar to us reading our old-fashioned meters more frequently in the beginning to learn and to look for exceptions, then leaning back once everything seems to run smoothly. Any similarities to a well-tuned database system and regular workloads...?
Labels: energy saving, Life, passive house, Passivhaus, power consumption, smart metering
Friday, July 13, 2012
Big Data, Big Problems, Right Questions?
The term "Big Data" has been around for a while. Most of my time I have dealt with problems around traditional data (relational data), new data structures, non-traditional data, semi-structured data, in-memory data, on-disk data, hard-to-find data, lost data, object-oriented data, object-relational data, old data, new data, archive data, wrong data, corrupted data, THE right data, and all kinds of other data.
Almost daily I receive emails or letters where my name is misspelled, the address is incorrect, or the email was not even intended for me (long story). A few days ago I listened to a longer discussion about problems with various kinds of applications at several companies, all non-IBM-related. One key problem was that, due to failing components in an overly complex environment, data became outdated, went missing, or was incorrect. The consequences hit the "regular guys", ranging from missing or incorrect parts in an assembly line, to timesheets that were not updated, to employees not being able to "badge in". When I talk with my students at the Cooperative State University (who all have a job in industry), many of them have seen or had to deal with incorrect or outdated data and the "information" based on it.
The issues remind me of one of the first lessons in physics at school: "Garbage in, garbage out". For Big Data the amount of data is in a different dimension, but the same principles apply as with other data. It is important to know that there are, or could be, issues with the input data. How is that dealt with? Does it impact the algorithms and the result? Is the derived "information" really information? How is the data interpreted? Is it OK to base decisions on it, or is it taken as just one of many indicators? Do those dealing with information derived from Big Data know the source, and is everything fully disclosed? There are many more questions, many of them non-technical. One important question is which questions to ask against the data. Is it to "prove" some speculation and guesswork, or to really find out something new that may be validated by further work? And then we are back to a problem we have been facing for years with existing, "small" data.
Now on to a small problem and some statistics:
Over the weekend we will have big fireworks locally here over Lake Constance. My kids want to watch them for sure. I could point out that only 20% of the persons I surveyed will attend. My kids could respond that 70% will attend. They chose people aged 8-80 years, I those aged 0-8 years...
Context counts, and so does asking the right questions, regardless of whether the data is small or big.
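For illustration, here is a small SQL sketch (the survey table and its columns are made up for this example): the same data and the same question yield very different percentages, depending entirely on which sample is chosen.

-- Hypothetical survey data: who plans to attend the fireworks?
CREATE TABLE survey(age INT NOT NULL, will_attend CHAR(1) NOT NULL);

-- The reported "attendance rate" depends on the sample, not just the data
SELECT CASE WHEN age <= 8 THEN 'aged 0-8' ELSE 'aged 8-80' END AS sample,
       SUM(CASE will_attend WHEN 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS attend_pct
FROM survey
GROUP BY CASE WHEN age <= 8 THEN 'aged 0-8' ELSE 'aged 8-80' END;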
Tuesday, July 10, 2012
Index something non-existing - functional indexes in DB2
After discussing hidden (non-visible) columns last week, indexing something non-existing fits the same theme. Let's shed some light on this seemingly mysterious topic...
XML documents and XML-based data modeling have been around for quite some years now. DB2 started its XML support with the XML Extender in the late 1990s (version 7); DB2 pureXML shipped initially with version 9.1. Advantages of using XML include flexibility in what is included in a document (or set of documents) and how sparse data can be represented. In a relational table, NULL values or other "empty" values need to be stored even if data for a property (column) does not exist; in an XML document that particular property can simply be left out, i.e., nothing is stored. However, this "nothing is stored", the non-existing data, introduces a problem for searching efficiently in some use cases. How do you find all customers that do not have a certain property, e.g., have not received the marketing letter yet or do not own a car?
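For illustration, here is what such a document could look like, using the table customers with the XML column cinfo from the index examples below (the document shape and values are assumptions based on the paths used there). A second customer document might simply omit the sw section entirely, so nothing at all is stored for it:

-- Hypothetical sample document matching the paths used below
INSERT INTO customers(cinfo) VALUES(XMLPARSE(DOCUMENT
  '<customer>
     <contact><email>henrik.loeser AT gmail.com</email></contact>
     <sw><db2version>9.7</db2version></sw>
   </customer>'));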
To address this kind of problem, DB2 10.1 now allows certain functional indexes over XML data (when does it come for relational data - any guesses...?). One of the functions allowed in such an index definition is fn:upper-case(). That way, an index can support case-insensitive searches over strings:
CREATE INDEX customers_email_idx ON customers(cinfo)
GENERATE KEYS USING XMLPATTERN '/customer/contact/email/fn:upper-case(.)'
AS SQL VARCHAR(80);
SELECT * FROM customers WHERE
XMLEXISTS('$CINFO/customer/contact/email[fn:upper-case(.)="HENRIK.LOESER AT GMAIL.COM"]');
The other supported function is fn:exists(), which allows indexing the existence or non-existence of an element or attribute. Thus, an index can be utilized to search even for something that is not (directly) stored in the database, i.e., implied information.
CREATE INDEX customers_db2vers_idx ON customers(cinfo)
GENERATE KEYS USING XMLPATTERN '/customer/sw/fn:exists(db2version)'
AS SQL VARCHAR(1);
SELECT * FROM customers WHERE
XMLEXISTS('$CINFO/customer/sw[not(fn:exists(db2version))]');
The above query would return all customers who don't have any DB2 version of software installed. Maybe they are reading this article and will then install DB2 soon...?!
Monday, July 9, 2012
Another quiz: How to index something non-existing?
To start off this week with the right attitude, here is another quiz:
Imagine that I would like to store address data and information about my contacts ("friends"?). I want to keep track of when I met them. In my database I want to efficiently query which of my contacts I met on a specific date. From database theory I know that indexing the data would help. But I would also like to search for persons I have never met. How could I support that with an index in DB2? How could the data be stored?
SQL statements will be in the solution... [Update: The solution has been posted.]
Wednesday, July 4, 2012
Implicitly hidden columns in DB2
On Monday I posted a small quiz about an SQL error reported by DB2. The comments included guesses about the solution, such as generated columns, triggers, or the table actually being a view. However, the reason for the processing problem is a different one, and it was also mentioned in the comments: the presence of an implicitly hidden column.
The concept of implicitly hidden columns was first introduced in DB2 9.5 for the purpose of easier and faster optimistic locking. A column defined as IMPLICITLY HIDDEN is not returned as part of a SELECT statement unless explicitly referenced. In DB2 9.5 and DB2 9.7 this feature was limited to a ROW CHANGE TIMESTAMP column to implement the optimistic locking. With DB2 10.1 this has changed, as you can see in the CREATE TABLE statement and on a special page with an introduction to the concept of hidden columns. Now you can hide all kinds of columns.
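As a quick reminder of that original use case, here is a minimal sketch (the table and column names are made up for this example): a hidden ROW CHANGE TIMESTAMP column lets an application detect whether a row was modified between reading and updating it.

-- Hypothetical table with a hidden row change timestamp (DB2 9.5 style)
CREATE TABLE stock(item INT NOT NULL,
                   quantity INT,
                   rct TIMESTAMP NOT NULL
                       GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
                       IMPLICITLY HIDDEN);

-- Read the row and remember its change timestamp
SELECT item, quantity, ROW CHANGE TIMESTAMP FOR stock FROM stock WHERE item = 1;

-- Optimistic update: only succeeds if nobody modified the row in between
UPDATE stock SET quantity = quantity - 1
 WHERE item = 1
   AND ROW CHANGE TIMESTAMP FOR stock = ?;  -- the timestamp value read above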
Here is how I defined the table that I used in the quiz:
db2 => create table ih(id int unique not null, s varchar(30), s2 varchar(30) not null implicitly hidden)
DB20000I The SQL command completed successfully.
As you can see, the third column is hidden. Then I performed an insert which resulted in the reported error:
db2 => insert into ih values(1,'a')
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0407N Assignment of a NULL value to a NOT NULL column "TBSPACEID=2,
TABLEID=257, COLNO=2" is not allowed. SQLSTATE=23502
What is interesting to note is the following:
- It is a best practice, and strongly advised, to define a default value for hidden NOT NULL columns to avoid running into such errors (see the sketch after this list).
- In computer science many enumerations start with zero, not one. The column numbers, table identifiers, and tablespace identifiers in DB2 are no exception. Hence, the column with COLNO=2 is the THIRD column, which could mislead you. The description of error SQL0407N actually provides an SQL query that helps resolve the mystery.
- Similar issues are possible with the so-called data movement utilities like IMPORT and LOAD, as you might or might not have data for the hidden column.
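Here is a minimal sketch of the first point (the table name ih2 and the default value are made up): with a default defined for the hidden column, the INSERT from the quiz succeeds because DB2 fills in the default.

db2 => create table ih2(id int unique not null, s varchar(30), s2 varchar(30) not null with default 'n/a' implicitly hidden)
DB20000I The SQL command completed successfully.
db2 => insert into ih2 values(1,'a')
DB20000I The SQL command completed successfully.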
Note that a DESCRIBE of "SELECT * FROM ih" only reports the two visible columns:
db2 => describe select * from ih
Column Information
Number of columns: 2
SQL type Type length Column name Name length
-------------------- ----------- ------------------------------ -----------
496 INTEGER 4 ID 2
449 VARCHAR 30 S 1
Performing a DESCRIBE TABLE, however, shows all three columns:
db2 => describe table ih
Data type Column
Column name schema Data type name Length Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
ID SYSIBM INTEGER 4 0 No
S SYSIBM VARCHAR 30 0 Yes
S2 SYSIBM VARCHAR 30 0 No
3 record(s) selected.
Finally, if you are sick of all the hidden stuff, you can act as a magician and let the column reappear:
db2 => alter table ih alter s2 set not hidden
DB20000I The SQL command completed successfully.
db2 => describe select * from ih
Column Information
Number of columns: 3
SQL type Type length Column name Name length
-------------------- ----------- ------------------------------ -----------
496 INTEGER 4 ID 2
449 VARCHAR 30 S 1
448 VARCHAR 30 S2 2
Now I will disappear by altering my state, coffee is waiting...
Monday, July 2, 2012
A small DB2 SQL quiz - why the SQL error?
I have been playing with some new DB2 10.1 features and thought that a small SQL quiz at the beginning of the week would be appropriate. Consider the following SQL statements and the related output:
db2 => insert into ih values(1,'a')
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0407N Assignment of a NULL value to a NOT NULL column "TBSPACEID=2,
TABLEID=257, COLNO=2" is not allowed. SQLSTATE=23502
db2 => select * from ih
ID S
----------- ------------------------------
0 record(s) selected.
db2 => describe select * from ih
Column Information
Number of columns: 2
SQL type Type length Column name Name length
-------------------- ----------- ------------------------------ -----------
496 INTEGER 4 ID 2
449 VARCHAR 30 S 1
How was the table defined, or why is it giving an error? Any guesses? Post your answer and explanation as a comment. I will post the solution in a couple of days. [Edit: done and linked]