Wednesday, July 3, 2013

IBM and DB2 analytics accelerators, in-memory computing, Big Data, NoSQL, ...

In recent months I have had to answer questions about which technologies and products IBM offers for accelerating analytic queries, whether products like DB2 (for z/OS and for Linux, UNIX, and Windows) or Informix offer in-memory computing capabilities, why the products on the mainframe and Intel/POWER platforms differ, and much more. Often an additional request is to answer in as few sentences as possible. :)

Here is an attempt to give you some answers and background information from my point of view, not necessarily IBM's. The article was written in a single afternoon with the plan to point to it and say "RTFM" instead of repeating myself...

Let me start this endeavor the way I would in front of students when I teach. So let's take a brief look at some relevant technologies and terms:
  • In-memory computing or in-memory processing: Instead of loading data from disk to memory on demand, all or most of the data to be processed is kept in memory for faster access. This could be traditional RAM or even flash memory to allow persistence. If RAM is used, data is initially loaded from disk.
  • Columnar storage or column-oriented databases: Traditional (relational) database systems use row-oriented storage and are optimized for row-by-row access. Analytic applications, however, typically scan and aggregate large amounts of data with similar values or properties. With column-oriented storage, a value that occurs many times in a column needs to be stored only once, with that single entry pointing to all related "rows" (row IDs; this is similar to RID-list compression for indexes). Columnar storage usually yields excellent compression ratios for analytic data sets.
  • Massively parallel processing (MPP) and Single Instruction, Multiple Data (SIMD): By processing several data sets or values in parallel, either by means of many CPUs or special CPU instructions, the entire data volume can be processed faster than in a regular, serial way (one CPU core processing the entire query).
  • Appliances, workload-optimized appliances, or expert-integrated systems: A combination of hardware and software that typically is "sealed off". It does not offer many knobs to turn, which makes it simple to use. Analytic appliances and data warehouse appliances are accordingly built from hardware and software focused on analytic query processing.
  • Big Data: This term is used to describe the large data sets that typically cannot be stored and processed with traditional database systems. MapReduce or "divide and conquer" algorithms are used for processing, similar to parallel processing as mentioned above. The boundaries between "traditional" database systems and Big Data systems are moving.
  • NoSQL: To handle big data sets, the database systems are instructed in ways other than SQL, hence "No SQL". However, sometimes "NoSQL" could also mean "not-only SQL", hinting that the boundaries are moving and hybrid systems are available.
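To make the columnar storage idea from the list above a bit more tangible, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions, not how DB2 or any other product implements it: each distinct value of a column is stored once and points to the row IDs that hold it. The helper names (to_column_index, sum_for_value) and the sample data are made up for this example.

```python
from collections import defaultdict


def to_column_index(rows, column):
    """Store each distinct value of `column` once, pointing to its row IDs."""
    index = defaultdict(list)
    for rowid, row in enumerate(rows):
        index[row[column]].append(rowid)
    return dict(index)


def sum_for_value(index, measure_column, value):
    """Aggregate a measure by touching only the row IDs listed for `value`."""
    return sum(measure_column[rowid] for rowid in index.get(value, []))


# Hypothetical sample data: a tiny "sales" table in row-oriented form.
rows = [
    {"region": "EU", "sales": 10},
    {"region": "US", "sales": 20},
    {"region": "EU", "sales": 5},
    {"region": "EU", "sales": 7},
]

# Column-oriented form: one entry per distinct region, one list of sales.
region_index = to_column_index(rows, "region")
sales_column = [row["sales"] for row in rows]

eu_total = sum_for_value(region_index, sales_column, "EU")
```

In this sketch the value "EU" is stored once although it occurs in three rows, and the aggregation for "EU" never reads the "US" row. Real systems add compression and many other techniques on top of this idea.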
Based on specific customer requirements, on what is available in terms of technology, what makes sense cost-wise, and what fits into the strategy, different IBM products are available which use one or more of the mentioned technologies or relate to the terms.
  • DB2 for Linux, UNIX, and Windows (LUW) has been using parallelism (MPP, see above) for many years for data warehouses. Based on the shared-nothing principle for the data, it can parallelize query processing and aggregate data quickly. It has been around as DB2 Parallel Edition, DB2 Database Partitioning Feature, DB2 Balanced Configuration Unit (BCU), InfoSphere Warehouse, and under other names.
  • DB2 LUW with BLU Acceleration, introduced in DB2 10.5, offers appliance-like simplicity. By setting a single switch, data is stored column-oriented with the mentioned benefits. Based on improvements to the traditional bufferpool or data cache of row-oriented database technology, it can hold all or only parts of the data in memory. Thus, in-memory processing is possible. BLU Acceleration utilizes SIMD technology to parallelize and speed up work, and data is laid out for the processor architecture. Individual tables can be stored in either a row-oriented or a column-oriented way, and both types of tables can be accessed in a single (complex) query.
  • IBM Netezza is an analytic or data warehouse appliance. It makes use of specialized hardware and MPP (see above) to speed up query execution. 
  • Informix has the Informix Warehouse Accelerator to serve complex analytic queries. It uses column-oriented storage and parallelism and data is held in memory.
  • The IBM DB2 Analytics Accelerator (IDAA) can be added to DB2 for z/OS. The main design goal was to keep the System z DB2 attributes like security and the approach to administration. IDAA is based on the Netezza technology (see above) and integrates the appliance into the DB2 for z/OS environment. All access and administration is done through DB2.
  • InfoSphere BigInsights is IBM's Hadoop-based offering for the Big Data market.
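The shared-nothing parallelism mentioned for DB2 LUW in the list above can be sketched as follows. This is a simplified illustration in Python, not the actual DB2 implementation: rows are distributed over partitions by hashing a key, each partition aggregates only its own data, and a coordinator merges the partial results. The function names, the CRC32-based distribution, and the use of threads in place of separate database servers are all assumptions made for this example.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, key, n_partitions):
    """Distribute rows over partitions by hashing the distribution key."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        target = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[target].append(row)
    return partitions


def local_sum(partition, group_by, measure):
    """Each partition aggregates only the rows it owns (shared nothing)."""
    totals = Counter()
    for row in partition:
        totals[row[group_by]] += row[measure]
    return totals


def parallel_sum(rows, key, group_by, measure, n_partitions=4):
    """Run the local aggregations in parallel and merge the partial sums."""
    partitions = partition_rows(rows, key, n_partitions)
    result = Counter()
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = pool.map(
            lambda part: local_sum(part, group_by, measure), partitions
        )
        for partial in partials:
            result.update(partial)  # adds the per-group partial sums
    return dict(result)


# Hypothetical sample data, distributed by order_id and grouped by region.
orders = [
    {"order_id": 1, "region": "EU", "sales": 10},
    {"order_id": 2, "region": "US", "sales": 20},
    {"order_id": 3, "region": "EU", "sales": 5},
    {"order_id": 4, "region": "EU", "sales": 7},
]

totals = parallel_sum(orders, key="order_id", group_by="region", measure="sales")
```

The result does not depend on how the rows happen to be spread over the partitions, because the final merge step only adds up partial sums per group.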
To simplify purchasing and administration, to reduce the so-called time to value, and for many more marketing reasons ;-), IBM has introduced several appliances that make use of the products listed above. They are all named IBM PureData Systems now, but depending on the specific name, serve different markets:
With my spare time this afternoon coming to an end, I will end this introduction. Please leave comments with hints on where to add to, trim, or fix it. And as you could see, I was able to resist the urge to mention SAP HANA, Oracle Exadata, Sybase IQ, Teradata, Greenplum, Microsoft, etc. :)