The Future of the Data Warehouse

We’ve had some lively discussion recently among Eckerson Group bloggers and our readers about the state of the data warehouse. Stephen Smith kicked it off with his observations about The Demise of the Data Warehouse. I felt compelled to offer a counterpoint with my belief that The Data Warehouse is Still Alive. In short order we got perspectives from Ted Hills, David Loshin and Abie Reifer, and Dewayne Washington. Taken together, these points of view and the many comments from readers of Eckerson Group blogs indicate that most people believe the data warehouse is alive, but perhaps not alive and well. Dewayne Washington suggests that it may be on life support.

Most organizations have a data warehouse and many have more than one. But the world has changed since the time those data warehouses were initially implemented. New data sources, new data types, new database technologies, new use cases, and new users of data combine to raise the question:

Where does the data warehouse fit in modern data management?

That’s really an architectural question that challenges the original positioning of the data warehouse. The data warehouse as single version of the truth no longer works. (If it ever did … I’ve always thought that SVOT is unrealistic.) The hub data warehouse as single point of integration for all BI applications doesn’t work. The BI and analytics world sees the data warehouse as just one of many available data resources. So the question of fit becomes more complex. We really need to answer three hard questions about how the data warehouse fits into data management architecture:

What is the purpose of the data warehouse?

How is the data warehouse related to data lakes and other data stores?

What stages of data lifecycle does the data warehouse serve?

Clearly there is no one correct answer to these questions (no SVOT here either) but I’ll offer some thoughts to position the data warehouse as part of modern data management architecture.

Purpose

I believe that the purpose of a data warehouse in modern data management architecture is to provide a repository of enterprise history that is integrated, subject-oriented, non-volatile, and time-variant.

The warehouse characteristics that Inmon described so many years ago are still important. And historical data still matters. It doesn’t meet all analytic needs, and it doesn’t have the “shiny object” appeal of real-time and streaming data. But it is the essence of time-series analysis that is at the core of decision support and performance management.
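To make the non-volatile and time-variant characteristics concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular warehouse product: the names HistoryRow, HistoryStore, and as_of, and the sample customer data, are all hypothetical. The idea it shows is that history is only ever appended, and that a question is always asked as of a point in time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class HistoryRow:
    """One immutable version of a fact about a subject (hypothetical schema)."""
    subject: str      # subject area, e.g. "customer" (subject-oriented)
    key: str          # business key within the subject
    value: str        # the attribute value being tracked
    valid_from: date  # first day this version was in effect (time-variant)


class HistoryStore:
    """Append-only store: rows are added, never updated or deleted (non-volatile)."""

    def __init__(self) -> None:
        self._rows: List[HistoryRow] = []

    def append(self, row: HistoryRow) -> None:
        self._rows.append(row)

    def as_of(self, subject: str, key: str, when: date) -> Optional[str]:
        """Return the value in effect on `when`, or None if no version existed yet."""
        candidates = [
            r for r in self._rows
            if r.subject == subject and r.key == key and r.valid_from <= when
        ]
        if not candidates:
            return None
        # The most recent version that had already started wins.
        return max(candidates, key=lambda r: r.valid_from).value


# Hypothetical sample data: a customer's tier changes over time.
store = HistoryStore()
store.append(HistoryRow("customer", "C-1", "Bronze", date(2015, 1, 1)))
store.append(HistoryRow("customer", "C-1", "Gold", date(2016, 6, 1)))

print(store.as_of("customer", "C-1", date(2015, 12, 31)))  # Bronze
print(store.as_of("customer", "C-1", date(2016, 12, 31)))  # Gold
```

Because the earlier "Bronze" row is never overwritten, both the current state and any past state remain available, which is what time-series analysis depends on.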

Placement

I consider data warehouses to be components of an enterprise data hub. They exist together with MDM, ODS, and portions of the data lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis. (See Figure 1.)

Figure 1. Data Warehouses in Enterprise Data Management

This model shows data warehouses—one or more—as part of an enterprise data hub, where the hub is a subset of what I’ve called the data core. I made up this term because the core set of data created and used by BI and analytics includes more than the hub. Not all data is curated, profiled, and trusted. Analytic sandboxes and some parts of the data lake don’t meet those criteria.

Another aspect of placement considers the relationship of the data warehouse with the data lake. Some consider warehouse and lake to be two different things, while others view the warehouse as a subset of the data lake. Both views have merit, and I have developed architectural models from both perspectives: the data warehouse in parallel with the data lake, as shown in Figure 2, and the data warehouse inside the data lake, as shown in Figure 3.

Figure 2. Data Warehouse in Parallel with the Data Lake

Figure 3. Data Warehouse Inside the Data Lake

In both of these models the data warehouse retains the Inmon characteristics. Optionally, and depending on architectural intent, the warehouse may also include aggregation and dimensionality.

Positioning

I use a six-stage data lifecycle as a basic architectural principle:

  1. sourced data
  2. raw data
  3. refined data
  4. trusted data
  5. prepared data
  6. consumed data

Within this lifecycle I position the data warehouse to support two stages: refined data and trusted data. (See Figure 4.) This positioning works well with either of the placement models—in parallel with the data lake or inside the data lake.

Figure 4. The Data Lifecycle and the Data Warehouse

The data lake may serve as a landing and staging area for raw data and it may also support exploratory and experimental data stores such as sandboxes. Both staging and sandboxes, however, are beyond the scope of what the data warehouse does well.
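The positioning described above can be written down as a small lookup. This is a sketch under stated assumptions rather than a formal model: the Stage enum, WAREHOUSE_STAGES, and warehouse_serves are hypothetical names, and the only facts encoded are the six stages in order and the two stages assigned to the data warehouse.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six-stage data lifecycle, in order (names chosen for illustration)."""
    SOURCED = 1
    RAW = 2
    REFINED = 3
    TRUSTED = 4
    PREPARED = 5
    CONSUMED = 6


# The data warehouse is positioned to support exactly these two stages.
WAREHOUSE_STAGES = frozenset({Stage.REFINED, Stage.TRUSTED})


def warehouse_serves(stage: Stage) -> bool:
    """True if the data warehouse is positioned to support this stage."""
    return stage in WAREHOUSE_STAGES


for stage in Stage:
    role = "data warehouse" if warehouse_serves(stage) else "outside the data warehouse"
    print(f"{stage.value}. {stage.name.lower()} data: {role}")
```

Raw data, for example, falls outside the warehouse here, which matches the idea that landing and staging belong to the data lake.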

The Future of Data Warehousing

Data warehouses are here for the long term. Much has been invested in building them and many people and business functions depend on them. But sustainability demands that we rethink the data warehouse. Data warehouse architecture can no longer stand alone. We must think about the purpose, placement, and positioning of the data warehouse within a broader data management architecture.

Architecture, of course, is only the beginning. The data warehouse is alive but it faces many challenges. It doesn’t scale well, it has performance bottlenecks, it can be difficult to change, and it doesn’t work well for big data. In a future of data warehouse modernization we’ll need to consider cloud data warehousing, data warehousing with Hadoop, and data warehouse automation, as well as architectural modernization.

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...
