Data Henrik: Db2 for your data lakehouse

Wednesday, March 12, 2025

Db2 for your data lakehouse

Db2 as engine and data source
for the data lakehouse

A while ago, I was working with IBM watsonx.data to prepare a presentation on data lakehouse solutions. When talking about the (query) engines for a data lakehouse, typically, it is mostly about Presto and Spark. Did you know that Db2 can be used both as data source AND as query engine in watsonx.data (see screenshot)? Let's take a look...

IBM watsonx.data

In my blog post "Lakehouse: Bronze, silver, and gold levels of data", I defined a data lakehouse as a data management solution that combines the concepts and benefits of a data warehouse and a data lake. IBM watsonx.data is such a combination that is capable of integrating various data sources and providing the right query engines to benefit from insights of those federated data sources. IBM describes watsonx.data as an open, hybrid, and governed fit-for-purpose data store optimized to scale all data, analytics and AI workloads.

watsonx.data is available as software for on-premises or cloud deployment and is offered as managed service (SaaS). There is also a containerized Developer Edition available. All version of watsonx.data share the same user interface (UI). Personally, I find the Infrastructure Manager "symbolic" for the lakehouse as it shows the (to me) available resources: Query engines, vector database services, metadata catalogs, database services, and storage buckets. The screenshot below shows the Infrastructure Manager with the UI switched to dark mode.

IBM watsonx.data Infrastructure Manager

Db2 Warehouse as engine for the data lakehouse

As shown in the top screenshot, any Db2 can be added as a data source and Db2 Warehouse can be used as engine. Supporting Db2 as a database (system) as data source is obvious. But why can Db2 Warehouse be used as engine for the lakehouse? To me, the following features come to mind:

Since long, Db2 can be used as coordinator for data federation and acts as federation server. You can register federated data sources using the CREATE SERVER statement, then make objects from that data source known via the CREATE NICKNAME statement. Thereafter, you can run (federated) SQL statements against the registered data sources and objects - a single SELECT possibly running on many data sources with Db2 coordinating the execution and combining the results into a single result set.
Db2 offers external tables as a way to integrate data files (CSV or some internal formts) stored in local and remote storage. See one of my recent blog posts on external tables for more details and links.
Additionally, Db2 Warehouse features so-called datalake tables. They are similar to external tables in that the data is stored outside the database itself. The difference is that more file formats (ORC, PARQUET, AVRO, and more) are supported and the data is managed by either a Hive or Iceberg layer on top. Db2 leverages its Db2 Big SQL architecture with related components to access and integrate the data for processing by Db2.

Coming back to the above screenshot of the lakehouse Infrastructure Manager with the storage and database layers, it should be obvious now that data from those sources can be integrated and processed by Db2 Warehouse. In that sense, Db2 Warehouse as (federated / lakehouse / query) engine is capable of working with data on gold, silver, and bronze levels.

Conclusions

When working with IBM watsonx.data, I first was surprised (a little bit at least :) to find Db2 Warehouse listed not just as data source, but as engine for the data lakehouse. But looking deeper into Db2's query capabilities, it's no surprise at all.

I stumbled over the following interactive demo of watsonx.data while researching links for this blog post. You are guided through many parts of the UI without signing up or installing anything. Even adding Db2 as a data source is shown (see screenshot below):

Configure Db2 as data source in a watsonx.data lakehouse

That's it for today. If you have feedback, suggestions, or questions about this post, please reach out to me on Mastodon (@data_henrik@mastodon.social) or LinkedIn.

Pages

Wednesday, March 12, 2025

Db2 for your data lakehouse

IBM watsonx.data

Db2 Warehouse as engine for the data lakehouse

Conclusions