Saturday, February 8, 2014
Imagine that you had to search for a cooking pot within your house. Where would you search first? Most people would focus on the kitchen. Where would you look for some toothpaste? Most probably in the bathroom, and maybe in the room where you just put the bag from your shopping trip.
Using context information speeds up the search: you only consider some places and avoid searching the entire house. This is data skipping in normal life. DB2 with BLU Acceleration uses a synopsis table to provide the context information. By avoiding work, fewer resources are needed, less data needs to be processed, and you have the result much faster.
Now imagine that the cabinets are labeled and the kids have cleaned up their room, with clothes nicely folded and small junk sorted into plastic containers. In DB2 BLU this would be called "scan-friendly". Some people use "space bags", plastic wraps that can be vacuumed to reduce the storage size of clothes, pillows, etc. Because you can still see what is inside and handle it like everything else, it is "actionable compression" - the same as in DB2 BLU, which can operate on compressed data.
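By the way, if you want to peek behind the analogy on a real system, here is a minimal sketch, assuming a DB2 10.5 database set up for analytics (DB2_WORKLOAD=ANALYTICS). The SALES table and its columns are invented for illustration, and the SYN... naming of synopsis tables is what I have observed in the catalog, not an official contract:

-- Column-organized table; DB2 BLU maintains a synopsis table for it behind the scenes
CREATE TABLE sales (
  sale_date DATE,
  store_id  INTEGER,
  amount    DECIMAL(10,2)
) ORGANIZE BY COLUMN;

-- A range predicate lets data skipping ignore chunks of data whose
-- min/max values in the synopsis table cannot possibly match
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2014-01-01' AND '2014-01-31';

-- The synopsis table shows up in the catalog with a SYN... name
SELECT tabschema, tabname
FROM syscat.tables
WHERE tabname LIKE 'SYN%SALES%';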
Now if only I could come up with an analogy for how DB2 BLU does the dishes - something I have to do now. Household chores. Enjoy the weekend!
Thursday, August 22, 2013
Bananas, pants, and DB2 with BLU Acceleration
Have you ever fed a banana to a young child? It would have been banana puree, i.e., mashed bananas (there are even instructions for that!). When you think about it, after feeding the banana puree you have a happy child - and a fork, a spoon, a plate and maybe more to wash. As a grownup or older child you would just peel and eat the banana, no dishes to do. Isn't that smart?
Now imagine that the bananas taste good and you eat a lot. Babies grow, you "grow". For the babies and children you have to buy new clothes because eventually, even with tricks like adjustable waistbands, the pants and shirts don't fit anymore. As a grownup you can react to "growing in the middle": keep a healthy diet, do some workouts, and keep your pants. No money to spend. Isn't that smart?
In database systems like DB2, avoiding the dishes translates to avoiding data access, i.e., avoiding I/O operations. DB2 with BLU Acceleration uses data skipping to access only those portions of the data where comparisons or further processing are necessary. No sweat, no dirty dishes. Even if the data needs to be processed, many operations can be done on the compressed, encoded version of it. Think of it as using a set of plates that remain in their original, shrink-wrapped packaging. That makes "clean-up" faster, a smart move. :)
What do you do when you have more data than fits in memory? A baby would need larger clothes; some buy bigger machines. DB2 BLU uses decades-old but highly improved technologies like bufferpools and algorithms to keep the most needed data in memory (or is it in-memory?) and to bring it in from disk with advanced prefetching before it is needed. That would be like telling your extra pounds to stay at home, so that your existing pants fit well. Maybe you could even wear a smaller size or keep your current pants longer. That would be smart and would save many bucks. DB2 BLU goes even further with tailor-made pants, i.e., laying out the data to make the best use of CPU registers and to improve performance even more. Tailor-made pants with expandable waistbands, "haute couture" for your data at no extra cost.
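For the curious, here is a small sketch of how to check whether the "pants still fit", i.e., how much column-organized data is served from the bufferpool instead of disk. I am assuming the MON_GET_BUFFERPOOL monitoring function and its POOL_COL_L_READS/POOL_COL_P_READS columns in DB2 10.5; the -2 asks for all members:

-- Logical vs. physical reads for column-organized data per bufferpool;
-- the closer physical reads are to zero, the more data stays in memory
SELECT bp_name,
       pool_col_l_reads,
       pool_col_p_reads
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t;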
My older BLU-related posts can be found using the blu label.
Thursday, August 1, 2013
CTQ: Come on, Tune my Query (with DB2 BLU)
DB2 10.5 comes with BLU Acceleration. In a previous post I had explained which technologies are used for BLU and other analytic accelerators. For BLU it is a mix of columnar storage with extreme compression, exploitation of parallel computing and new CPU instruction sets, plus enhancements to bufferpool and I/O management for in-memory processing - quite a long list. But how can I find out whether BLU Acceleration is actually used?
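One way to answer it: capture an access plan and look at the operators. A minimal sketch, assuming the explain tables already exist (they can be created with the EXPLAIN.DDL script shipped with DB2) and using a made-up query in the spirit of the plan shown below:

-- Capture the plan for a query against the column-organized tables
EXPLAIN PLAN FOR
SELECT c.c_customer_id, SUM(s.ss_net_paid)
FROM tpcds.store_sales s, tpcds.customer c
WHERE s.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_customer_id;

-- Then format the captured plan from the command line:
-- db2exfmt -d <dbname> -1 -o plan.txt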
The key to performance is looking for the new CTQ operator in the query access plans (explain output). A usage example, as found in the DB2 Information Center, is shown here:
Access Plan:
-----------
        Total Cost:             570246
        Query Degree:           32

                       Rows
                      RETURN
                      (   1)
                       Cost
                        I/O
                        |
                       191
                       LTQ
                      (   2)
                      570246
                      505856
                        |
                       191
                       CTQ
                      (   3)
                      570245
                      505856
                        |
                       191
                      GRPBY
                      (   4)
                      570245
                      505856
                        |
                    2.87997e+08
                     ^HSJOIN
                      (   5)
                      377881
                      505856
                 /-----+------\
          2.87997e+08          2e+06
            TBSCAN            TBSCAN
            (   6)            (   7)
            329484            4080.56
            499008             6848
              |                  |
          2.87997e+08          2e+06
   CO-TABLE: TPCDS      CO-TABLE: TPCDS
      STORE_SALES           CUSTOMER
          Q1                   Q2
But what is the operator CTQ and what does CTQ stand for? Is it "Critical To Quality", "Copy The Query", "Come on, Tune the Query" or "Critical To Quickness"? TQs are so-called Table Queue operators and are used to pass intermediate result sets from A to B. They are often seen in partitioned databases (InfoSphere Warehouse) as either Directed Table Queues (DTQs) or Broadcast Table Queues (BTQs) to move data across nodes in the cluster. For BLU Acceleration a CTQ is the "columnar to row" gatekeeper and means that the world of BLU has been left and regular DB2 technology takes over.
Ideally, there is only a single CTQ in a query plan, and high up, because that would mean (almost) everything is processed with BLU Acceleration. If you store only some tables in columnar format and others in classic row organization (remember: DB2 with BLU Acceleration is hybrid), and you see many CTQs near the bottom of an access plan, rethinking the strategy for the physical database design might be a good idea.
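A quick way to check which tables are organized how, plus a simple offline way to turn a row-organized table into a column-organized one. This is only a sketch with invented names and a simplified column list; for an online conversion, db2convert or ADMIN_MOVE_TABLE would be the tools to look at:

-- 'C' = column-organized, 'R' = row-organized
SELECT tabschema, tabname, tableorg
FROM syscat.tables
WHERE tabschema = 'MYSCHEMA'
ORDER BY tableorg, tabname;

-- Simple offline conversion: copy into a column-organized table, then swap names
CREATE TABLE myschema.orders_col (
  order_id   INTEGER NOT NULL,
  order_date DATE,
  amount     DECIMAL(10,2)
) ORGANIZE BY COLUMN;

INSERT INTO myschema.orders_col
  SELECT order_id, order_date, amount FROM myschema.orders;

RENAME TABLE myschema.orders TO orders_row;
RENAME TABLE myschema.orders_col TO orders;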
Come on, Tune the Queries - of course with DB2 and BLU Acceleration...
Monday, April 15, 2013
(Updated) Special hybrid, object-capable, all-purpose database
Recently I had to deal again with a special database. Let me share my observations.
This special database seems to have excellent compression capabilities, as a lot of stuff can be crammed into it. From the outside the database looks small, but it seems to be able to store 10 times its advertised capacity.
This database can handle relational data, large objects, unstructured data, etc. and has several storage areas and storage types. For the experienced user data access is really fast and it seems index-only. For the infrequent user the only way of data access is through repeated table scans.
The database is equipped to deal with "situations" and act in "emergencies". Its capabilities include having candy ready for distribution exactly when needed, a first aid kit with the right number and size of bandaids, chapstick, hand lotion, and some hand sanitizer. When fully loaded, the database can be used as a self-defense device and can be thrown as a projectile even over longer distances.
I could go on with more features (you can in the comments), but do you know what database I described above...?
Update: My wife's handbag is one of the described special databases.
Friday, September 9, 2011
DB2 pureXML for the Protein Data Bank - Managing Atoms and Molecules with XML
Yesterday an interesting new article was published on developerWorks, "Managing the Protein Data Bank with DB2 pureXML". It describes how scientists working with highly complex data (the Protein Data Bank stores the atoms and molecules that make up proteins) can benefit from a highly sophisticated database management system, DB2.
At the core of the solution is pureXML to reduce the schema complexity (see the graphic for a relational design in the article). Compression is used to keep the database small and keep I/O to a minimum. Now add database partitioning across multiple nodes, range partitioning and multi-dimensional clustering to further help organize the data and to cut complexity and improve performance.
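To give a flavor of how these features combine, here is a hedged DDL sketch. All table, column, and partition names and ranges are invented for illustration; the article describes the actual design:

-- pureXML column, row compression, hash distribution across database
-- partitions, range partitioning by deposit date, and multi-dimensional clustering
CREATE TABLE pdb.entry (
  pdb_id       CHAR(4) NOT NULL,
  deposit_date DATE    NOT NULL,
  exp_method   VARCHAR(30),
  entry_doc    XML
)
COMPRESS YES
DISTRIBUTE BY HASH (pdb_id)
PARTITION BY RANGE (deposit_date)
  (STARTING '2000-01-01' ENDING '2011-12-31' EVERY 12 MONTHS)
ORGANIZE BY DIMENSIONS (exp_method);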
What a good example of combining many of DB2's features for the benefit of science.
BTW: This work was made possible through the IBM Academic Initiative, which, as part of its benefits, allows free use of DB2.
Wednesday, September 29, 2010
Automatic compression in DB2? Good question and some answers
This is an excellent question. DB2 can indeed take control of several performance-related configuration settings once DB2 has been authorized (enabled) for that. Regarding compression it is interesting to see that DB2 compresses temporary tables automatically once a compression license has been applied and DB2 determines that it makes sense performance-wise. This same capability applied to regular data and indexes is what the question is about. Why not autonomous decisions for that?
The answer relates to several aspects: the scope of the decision and the business strategy. Let's start with the scope. For a temporary table DB2 has an idea of what to expect. It knows the query and has (rough) statistics about the data and the expected intermediate result sets. Hence it can estimate the impact, good or bad, more or less precisely and decide whether it makes sense to apply compression. For regular data and indexes, the decision and the background knowledge are different. DB2 does not know which queries to expect, how the data might grow over time, what other concurrent activities might be planned, or what resources would be available to compress and reorganize the data. The impact of the decision has a much bigger scope and is far reaching.
The business side of the question whether to compress or not is also nothing DB2 knows about. Service level agreements, storage costs, costs for CPU cycles, response times, throughput goals, business priorities, and many more are outside its view. Smart humans are needed to decide what is best, not just for a single table or index, but for the overall design and how it fits into the business requirements and (cost) strategy.
And that is where skilled DBAs come into the picture. Excellent question, some answers, and overall good news for DBAs...
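What the DBA's work looks like once the decision has been made is then straightforward. A sketch with invented names, using COMPRESS YES, a classic reorg to build the compression dictionary, and the PCTPAGESSAVED catalog statistic to verify (REORG and RUNSTATS are commands run from the DB2 command line):

-- The DBA, not DB2, decides to enable row compression on this table
ALTER TABLE myschema.orders COMPRESS YES;

-- Rebuild the table so the compression dictionary is (re)created
REORG TABLE myschema.orders RESETDICTIONARY;
RUNSTATS ON TABLE myschema.orders;

-- Verify roughly how many pages compression saved
SELECT tabname, pctpagessaved
FROM syscat.tables
WHERE tabschema = 'MYSCHEMA' AND tabname = 'ORDERS';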
Wednesday, April 22, 2009
Q4U: R U Cmprsng UR DTA?
DAMHIKT, but it seems stuffing more into less space is popular (try going on vacation with a family). Fortunately, the next release of DB2 adds a lot of compression features. On the page for the Technology Sandbox (a.k.a. "beta") the following compression functionality is mentioned in addition to DB2's deep table compression:
- Multiple algorithms for automatic index compression.
- Automatic compression for temporary tables.
- Intelligent compression of large objects and XML.
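For the first item, here is a small sketch of what trying index compression in the sandbox could look like. Table and index names are invented, and the monitoring function call reflects my assumption of the documented interface:

-- Explicitly compressed index; on tables that use COMPRESS YES,
-- new indexes are expected to be compressed by default
CREATE INDEX ix_orders_cust ON myschema.orders (customer_id) COMPRESS YES;

-- Report compression details for all indexes of the table
SELECT *
FROM TABLE(SYSPROC.ADMIN_GET_INDEX_COMPRESS_INFO(
     'T', 'MYSCHEMA', 'ORDERS', NULL, NULL)) AS t;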
Index compression saves you big because you usually have plenty of indexes, and compressed indexes stay compressed in the bufferpool. Compression of temporary tables is especially useful for analytics/warehousing. Sign up and test it yourself!
SIG2R.
CU, Henrik (DARFC)
BTW: I got some help coding parts of my blog using this abbreviation list.