Monday, July 1, 2024

Lakehouse: Bronze, silver, and gold levels of data

Is this a Data Lakehouse?
While working with customers and IBMers on data processing projects (to keep it as broad as possible), I often hear talk of bronze, silver, and gold standards. These standards sometimes refer to the systems the data is stored in, rated by reliability, availability, performance, bandwidth, and more. The IBM mainframe in a geographically dispersed parallel sysplex configuration may be considered such a gold standard. Lately, bronze / silver / gold standards are more frequently heard in the context of data lakehouse architectures and data sources or data zones. So, what do bronze, silver, and gold mean when discussing data and data lakehouses?

Data Lakehouse

A data lakehouse is a data management solution that combines the concepts and benefits of a data warehouse and a data lake. Data warehouses typically are high-performance data stores with data brought in from various data sources. The data has been transformed to be "fit for purpose" for answering analytic / business intelligence queries. Data lakes are big data solutions that house all kinds of data, typically on "cost-optimized" (cheap) storage.

A data lakehouse combines data sources (federation) with data in different formats and stages of transformation, aiming to provide high-performing query capabilities and to optimize data storage. In that sense, data lakehouses are very flexible in what data is processed, where and how the data is stored, and how it needs to be accessed.

Bronze, silver, and gold data levels

If you have worked with data for a long time, you know that - somehow - all data can be analyzed, but it sometimes (often?) takes some effort to obtain the desired insights. In the context of data lakehouse architectures, the following rough classification of data levels or data quality is used:

  • Bronze: The data is stored in its raw format or only slightly transformed into a queryable form close to its raw representation.
  • Silver: The data has already been transformed and cleansed and brought into a "better" format. It might have been combined with other data sources and filtered down to the needed data sets, but it is still stored in files.
  • Gold: The data has been transformed, refined, optimized, etc. and is stored in dedicated systems like optimized database systems or specialized files.
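
As a rough illustration of the three levels, here is a minimal Python sketch that takes a small raw data set from bronze to silver to gold. Everything in it is made up for this example: the order data, the column names, and the helper functions to_silver and to_gold are hypothetical and not part of any product or API. In a real lakehouse, these steps would typically run in a query engine against files or tables rather than on an in-memory string.

```python
import csv
import io
from collections import defaultdict

# Bronze: raw data kept close to its source form (here, a CSV export as text).
# The data set is invented for illustration only.
BRONZE_CSV = """order_id,region,amount
1,EMEA,100.50
2,emea,20
2,emea,20
3,APAC,not_available
4,APAC,75.25
"""


def to_silver(raw_csv: str) -> list[dict]:
    """Cleanse bronze data: parse types, normalize values, drop bad and duplicate rows."""
    seen_ids = set()
    silver_rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            order_id = int(row["order_id"])
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose values cannot be parsed
        if order_id in seen_ids:
            continue  # skip duplicate orders
        seen_ids.add(order_id)
        silver_rows.append(
            {"order_id": order_id, "region": row["region"].upper(), "amount": amount}
        )
    return silver_rows


def to_gold(silver_rows: list[dict]) -> dict[str, float]:
    """Refine silver data into a query-ready aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(BRONZE_CSV)
gold = to_gold(silver)
print(silver)
print(gold)
```

The bronze level keeps everything, including the duplicate and the unparseable row. The silver level is smaller and typed, and the gold level holds only what a report or dashboard would query directly.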

The screenshot below shows tiles in the infrastructure manager (view) of an IBM watsonx.data instance. I arranged them so that four S3-compatible (cloud) object store services with their storage buckets are shown on the lowest layer (often bronze and silver level data). Also represented are an IBM Db2, an IBM Netezza, and a PostgreSQL database (silver and gold levels for my scenario).

IBM watsonx.data data organization

Conclusions

Data can come in many forms and formats and be stored in many ways. To simplify discussions, data is categorized based on various properties, and labels such as bronze, silver, and gold are attached. Keep in mind that "gold standard" might refer to highly available data, highly optimized data, or any other categorization.

That's it for today. If you have feedback, suggestions, or questions about this post, please reach out to me on Mastodon (@data_henrik@mastodon.social) or LinkedIn.