The Demise of the Data Warehouse

Read Dave Wells’ counterpoint to this article.

My job, as an industry analyst, is to research and write about what is going on in the industry. But my job is also to predict what will be going on in the industry. Talking about the future gives me some latitude to try out outrageous ideas. So in this post let me do that, and also try to follow the cardinal rule of all public speaking and writing: “Never be boring”.

So here we go: The data warehouse is dead.

OK, so that is pretty blasphemous. Let me explain.

Let’s first think about the problems that the data warehouse was meant to solve. There are three big ones:

  1. Access – just being able to get at the data at all was a bit of a breakthrough (and still is for many organizations).
  2. Speed – since the data is already there in the operational systems, the data warehouse is not about creating the data or even moving it. The point of the data warehouse is to collect it in one place so that it is easy to answer questions quickly. The data warehouse is all about answering business questions at the speed of business.
  3. Truth – a single version of the truth is the great hope. With it, the definitions of “customer” and “last month’s revenue” are the same everywhere in the enterprise.

So the breakthrough promise of the data warehouse was not really about the data. The data was always there, being collected in silos and stored away locally in apps. The breakthrough was:

  1. You could access it
  2. You could access it quickly
  3. It was correct and everyone agreed on it

The problems of the data warehouse were (dramatically overstating these for effect):

  1. No one could actually agree on the ‘truth’ or what a customer was (and if they did, they changed their minds later).
  2. New data was always being discovered, which then broke the data warehouse when it was included (so we sometimes pretended that it didn’t exist).
  3. No data warehouse project was ever finished (ever, anywhere on earth …).
  4. The data warehouse was fast for accessing the stuff that was anticipated, but for other stuff, thought of later, it was slow or inaccessible.
  5. It took forever to incrementally change the data warehouse (and it got worse over time).

So let me cut to the chase. I’ll propose that the combination of a data lake (DL) and great MDM (Master Data Management) is a nimble, flexible and superior substitute for the classical data warehouse.

“How could this be?” you ask. Well, this DL+MDM combination provides the same benefits as the data warehouse while solving many of its problems. And it is made possible by the cloud.

Consider that the data lake pulls in all the data from siloed enterprise apps and elsewhere, just like the data warehouse, but it does so in an “I don’t really care if this is organized, let me just get the data into my system first” kind of way. This is much more like ELT (Extract … Load … and … eventually … Transform) than the typical ETL (Extract, Transform, Load). The cumbersome nature of the transformation gummed up everything downstream and undermined the most basic promise of the data warehouse (i.e., access to the data).
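To make the ELT idea concrete, here is a minimal sketch in Python (the file names and the cleanup rule are hypothetical, purely for illustration): raw files land in the lake untouched, and the transform runs only when someone actually reads the data.

```python
import csv
import shutil
from pathlib import Path

LAKE = Path("lake/raw")  # hypothetical landing zone for raw, untransformed data

def extract_and_load(source_file: str) -> Path:
    """The E and L of ELT: copy the source into the lake exactly as-is.
    No schema checks, no cleansing -- just get the data in first."""
    LAKE.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source_file, LAKE))

def transform_on_read(raw_file: Path):
    """The T of ELT, applied lazily at query time rather than at load time."""
    with raw_file.open(newline="") as f:
        for row in csv.DictReader(f):
            # Illustrative cleanup: normalize column names, trim whitespace.
            yield {k.strip().lower(): v.strip() for k, v in row.items()}

# Usage: load first, ask questions later.
raw = extract_and_load("orders.csv")  # "orders.csv" is a made-up source file
big_orders = [r for r in transform_on_read(raw) if float(r["amount"]) > 100]
```

Nothing about the lake constrains the shape of the incoming data; the cost of imposing structure is paid only by the queries that actually need it.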

I was having dinner last month with Deepa Sankar from Informatica and she was telling me about their AI-based data catalog product, EIC (Enterprise Information Catalog). It intelligently catalogs and classifies all the data sources. End users can then rapidly perform simple search queries in their own words to find data assets. (www.informatica.com/eic)

I asked her where it fit into the overall architecture: “Does it fit in between the data lake and the data warehouse, or is it between the warehouse and the BI?” She looked perplexed: “No, there isn’t any data warehouse… EIC just pulls in the data about data from anywhere – data sources, applications, BI – in cloud, on-premises, big data …”

Then it was my turn to look perplexed. No data warehouse? Was that even possible?

This was a wild idea: to just keep track of the data about the data (metadata) and only move the data when it was absolutely required. I began to think that maybe there was an analog somewhere else in the land of technology.

Perhaps this DL+MDM (data lake plus master data management) strategy is a bit like “lazy evaluation” in computer programming. With lazy evaluation (love that name) you only evaluate an expression (e.g. call a function) when you need the result, not necessarily at your first opportunity. By being lazy, the program becomes more efficient by avoiding calculations whose results are never used.
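Python generators are a textbook example of lazy evaluation, and a few lines show the payoff: the work is deferred until a result is actually demanded, and work nobody demands never happens.

```python
def expensive_scores():
    """A generator: the body below does not run until a value is requested."""
    for n in range(1, 1_000_001):
        yield n * n  # stand-in for an expensive computation

scores = expensive_scores()                     # instant -- nothing computed yet
first_three = [next(scores) for _ in range(3)]  # computes exactly three values
print(first_three)                              # [1, 4, 9]; the other 999,997 never run
```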

The same could be true for data and the needs of the data warehouse. In DL+MDM you only move the data when you need to move the data. Otherwise you leave the data where it is. So maybe this strategy would be the ‘lazy data warehouse’ (I’m all for saving time and effort!).

The second important promise of the data warehouse is that, in order to be useful to the business, it must also provide the ‘truth’. This is where MDM comes in.

I’ve heard from leaders in the finance industry how they can now, with a strong MDM infrastructure, have regular meetings with key stakeholders and incrementally knock down the definitions of KPIs for their business one at a time. Every meeting now provides real value and makes progress, as opposed to the massive, high-pressure, heavily politicized, all-or-nothing design meetings that characterized the data warehouse.

An incremental approach to KPIs and other measures and dimensions is far more civilized and builds teamwork and consensus rather than frustrated compromise. It is made possible with a solid MDM infrastructure.
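As a rough sketch of that incremental process (the registry and its fields are hypothetical, not any particular MDM product’s API), think of the master definitions as a small, versioned dictionary that grows one agreed-upon term at a time:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MasterDefinition:
    name: str        # e.g. "active_customer"
    definition: str  # the business rule everyone signed off on
    owner: str       # the stakeholder accountable for this term
    agreed_on: date
    version: int = 1

class MDMRegistry:
    """Hypothetical sketch: one canonical home for agreed definitions."""
    def __init__(self):
        self._terms: dict[str, MasterDefinition] = {}

    def agree(self, term: MasterDefinition) -> None:
        """Add or revise one definition at a time -- no big-bang meeting."""
        existing = self._terms.get(term.name)
        if existing:
            term.version = existing.version + 1
        self._terms[term.name] = term

    def lookup(self, name: str) -> MasterDefinition:
        return self._terms[name]

# One meeting, one definition knocked down:
registry = MDMRegistry()
registry.agree(MasterDefinition(
    "active_customer",
    "Any account with a purchase in the trailing 90 days",
    "VP Sales", date(2018, 3, 1)))
```

Each meeting adds or revises exactly one entry, and the lake keeps serving data the whole time.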

Thus the data lake provides access to the data and the MDM provides the truth. But the third promise of the data warehouse is speed. In the past, speed had to come from physically moving the data closer together (and that is still a good idea if you can – aka ‘data gravity’). But today, with the elasticity of the cloud and MPP (Massively Parallel Processing), speed and size are much less of an issue. We are certainly in a business reality where spending money on compute resources is almost always a better investment than spending on staff months, given the relative costs of human versus technological capital.

This DL+MDM configuration also solves some of the main problems with the data warehouse:

  1. Agreeing on the truth becomes a process where you can hunt down one field or variable at a time in a thoughtful and incremental way. You no longer need to solve all of your data dictionary problems at once.
  2. It accommodates new data: as soon as new data is generated or collected, it is immediately accessible because it doesn’t need to be transformed first.
  3. There is no ‘finished’ to the data warehouse project. From the very start the data is accessible, and then it is a smooth, incremental journey to better access and better data quality. This approach embraces, rather than is embarrassed by, the fact that the ‘data warehouse’ is never finished. The data warehouse is no longer static; it becomes dynamic – just like business.
  4. The DL+MDM configuration is supremely flexible in that no pre-compiling is required – and compute resources can be provisioned as needed.
  5. DL+MDM is all about change. It is constantly changing, expects change, and embraces it.

Four Years From Now

One thing that drives the need for the ‘lazy data warehouse’ is the explosion in data. For example, old data warehouses couldn’t accommodate text very well, so new databases had to be invented. This dynamism precluded the success of the old monolithic and static data warehouse.

Four years from now, the amount and types of new data sources will be accommodated by advances in disk storage and computation. There may be another new source of data out there, but consider that most of the challenge of ‘big data’ today really comes from unstructured data, and unstructured data really comes from only five sources: text, image, video, audio, and sensors/IoT. Four years from now we will have learned how to ‘structure’ this unstructured data, but we will most assuredly also uncover a new source of valuable data that must be accommodated. Dynamism of the data warehouse will still be a requirement.

In four years the need for pre-planning and shuffling your data into a data warehouse will be eliminated. The DL+MDM approach will confer several advantages on its practitioners while fulfilling the goals of access, speed, and truth:

  1. Those that embrace the DL+MDM approach will be more nimble than their competitors
  2. IT operations will be leaner and more efficient
  3. DL+MDM is a perfect match to the efficiencies of the cloud
  4. Ad hoc queries will be easily accommodated
  5. Quality definitions of the most important KPIs will be routinely and painlessly created

OK, was that controversial enough? Can the data warehouse really be made obsolete by a DL+MDM strategy, or will it always be with us? Let me know what you think.

Expert Insights. Many thanks to Deepa Sankar who provided expert insights on this topic. Deepa is the Director of Product Marketing at Informatica. www.informatica.com

Stephen J. Smith

Stephen Smith is a well-respected expert in the fields of data science, predictive analytics and their application in the education, pharmaceutical, healthcare, telecom and finance...
