Data Lakes Are Cool, But You Still Need A Data Warehouse
In this podcast, Henry Eckerson interviews Dave Wells on the current health and future of the data warehouse. Wells acknowledges that data warehouses are struggling, but argues they are still necessary and cannot be replaced by data lakes. He then explains what the role of the modern data warehouse should be, practical steps forward for evolving the data warehouse, and much more.
More than 40 years of information systems experience combined with over 10 years of business management give Wells a unique perspective about the connections among business, information, data and technology. Knowledge sharing and skill building are Wells’ passions, carried out through consulting, speaking, teaching, and writing. Currently, Wells is the practice director of data management at Eckerson Group, cofounder and director of education at eLearningCurve, and a faculty member at The Data Warehousing Institute.
Key Findings
- The concept of data warehousing will live on, even if a physical data store called a data warehouse does not
- Data warehouses and data lakes are complementary
- Every time people shortcut modeling, they end up shooting themselves in the foot
- Data warehouses are supposed to connect people with data, and careful data modeling provides the best connections
- Data lakes combined with metadata management cannot replace a data warehouse
- If you only have a data warehouse, you can't take advantage of unstructured, big data
The following is an edited transcript of the podcast between Henry H. Eckerson and Dave Wells.
Henry: What is the status of the data warehouse? What is your thesis? Where does it lie on a scale from dying to thriving?
Dave: In most cases I’d put the data warehouse in the space of ‘struggling, but not dying’. I think the reality is most of today’s data warehouses would fit under the label of legacy data warehouse; they were state-of-the-art 1990s stuff and we can’t get rid of them. People depend on them. So it’s really time to rethink them.
Henry: So why do so many thought leaders in this space declare that the data warehouse is dead? What are their primary reasons?
Dave: I think they’re looking toward data lake technology and the ability to use Hadoop and NoSQL technologies and those sorts of things as providing a replacement for the data warehouse, and I think that’s flawed thinking.
Henry: How so?
Dave: Data lake and data warehouse serve different user groups and different purposes. The data lake is very powerful for analytics, for people, data scientists, and others who want to work with raw data and do their own data preparation, do their own data blending and so on, and who want to do on-demand profiling of data and searching of data to discover insights.
The opposite side of the coin and the one that the data warehouse addresses is those people who have a need for regularly published information that is pre-aggregated, pre-integrated and ready for them to consume. The publish and subscribe model still works and people still want to have information published for them.
The other aspect of the data warehouse that’s important is that it collects a body of enterprise history that is time-variant, using uniform time intervals, and that meets all of the criteria necessary to do real, meaningful time series analysis and to examine and track trends in business metrics. Data lakes tend to be more erratic in their time intervals and in their retention of history, and there are disciplines built up over the years around the time-variant nature of the data warehouse.
Time-variant is one of the defining characteristics and there are good reasons that we need time-variant data that is carefully managed if we’re going to use behavior over time as a model to understand business dynamics.
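The uniform-interval idea Dave describes can be sketched in code. This is a minimal Python illustration, not from any real warehouse toolkit – the account data and the `daily_snapshots` helper are hypothetical. It shows irregular source records being carried forward so that every interval has exactly one snapshot row, which is what makes time series comparisons valid.

```python
from datetime import date, timedelta

# Hypothetical example: source records arrive at irregular times, while the
# warehouse keeps one snapshot per uniform interval (here, daily) so that
# time series analysis compares like with like.
raw_events = [
    (date(2024, 1, 1), "acct-1", 100.0),
    (date(2024, 1, 3), "acct-1", 140.0),  # nothing arrived on Jan 2
    (date(2024, 1, 4), "acct-1", 90.0),
]

def daily_snapshots(events, start, end):
    """Carry the last known value forward so every interval has a row."""
    by_date = {d: v for d, _, v in events}
    snapshots, current, day = [], None, start
    while day <= end:
        current = by_date.get(day, current)
        snapshots.append((day, current))
        day += timedelta(days=1)
    return snapshots

snaps = daily_snapshots(raw_events, date(2024, 1, 1), date(2024, 1, 4))
assert snaps[1] == (date(2024, 1, 2), 100.0)  # Jan 2 is filled, not missing
```

A data lake that simply retains the three raw events cannot answer "what was the balance on January 2?" without this kind of carefully managed history.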
Henry: Some posit that the data lake is an alternative, but it seems like you’re saying they both serve a purpose. Is that what you’re saying and how could they work together?
Dave: They absolutely should work together. They’re complementary and architecturally we need to rethink maybe not the data warehouse, but the concept of data warehousing, keeping an inventory of data that is subject-oriented, time-variant, integrated, nonvolatile – the defining characteristics that Inmon set forth years ago for a data warehouse. We need data that has those characteristics.
Now, whether we store it in a separate data store, relationally-modeled, and call it the data warehouse or whether we treat that integrated, subject-oriented, time-variant, nonvolatile set of data as some collection, some subset of the data lake isn’t terribly relevant except from an architectural point of view. Architecturally do I want to place the data warehouse and the data lake side-by-side or do I want to place the data warehouse inside the data lake?
Henry: So what would be the results of going with one or the other? If a company just goes with the data lake or just the data warehouse, what would be the negative impacts of each?
Dave: If you just go with a data lake, you’d better be really good at data lake management, and you will need to think carefully about managing time-variant data within the data lake or you will lose all of your ability to track historical behaviors and trends and doing really meaningful time series analysis.
If you just go with a data warehouse and you follow the traditional data warehouse model, relationally-modeled structured data only, you’ll lose the opportunity to take advantage of the many opportunities that unstructured data and the big data world offer.
Henry: You mentioned earlier that a lot of people are using these legacy data warehouses. Obviously they need to evolve. How does one do that?
Dave: First, you need a plan and an architecture for how to evolve your data warehouse or, maybe more correctly, data warehouses – this is one of the challenges we have with data warehousing today. The original vision was a single data warehouse, that single version of the truth, and the reality is that most organizations today have multiple data warehouses.
I ran a live poll in a class that I taught recently where one of the poll questions was “how many data warehouses do you have in your organization?” Fewer than 10 percent of the people who responded had only one data warehouse. Only one respondent had no data warehouses and over 50 percent had four or more.
Henry: Wow.
Dave: So, through mergers and acquisitions and rogue data warehousing and whatever else may have driven it, these companies have ended up with multiple data warehouses. So we’re still not at that single version of the truth.
What companies need to do is step back and look at the purpose of the data warehouse – well, a single version of the truth is elusive and may always be a myth. The purpose of the data warehouse is primarily the integration and reconciliation of internal enterprise data and the collecting of that data as historical snapshots or time-variant with uniform time intervals, so that we can publish the kinds of information that people need.
Now, architecturally, that doesn’t mandate that it reside on premises in relational database management technology and be entirely SQL-dependent and so on. Those are artifacts of data warehousing of the ’90s.
So, we need to step back and ask, ‘For modern data warehousing, should I migrate to the cloud? Should I take advantage of Hadoop as a fast data transformation engine to accelerate my ETL? Should I take advantage of Hadoop and NoSQL as ways to offload large amounts of infrequently accessed data, so that I can get better performance for the data that is frequently accessed by reducing table sizes, database sizes, and so on?’
We need to ask all sorts of questions about how we position the data warehouse. We need to think about data virtualization – are there instances where the data warehouse is attempting to supply real-time or very low latency data, and without the demand for a body of history, and would that be better served with a data virtualization model? Or, perhaps, that would be better served by migrating some of the work that’s done by legacy data warehouses to the data lake and leaving the data warehouse doing only those things that a data warehouse does well that have to do with integrated, structured, relational data and time-variant history.
We need to ask about how we move data into the data warehouse. Most data warehouses today have a separate staging area. Does it make sense that the data lake becomes the landing zone for all incoming data and it also serves as the staging area for data that is further processed into the data warehouse?
We need to ask about our data transformation models – is ETL cast in stone? Should we sometimes be doing extract, load, transform? Should we sometimes be doing bulk loads? Are there occasions where we may want to process data streams into the warehouse? If we’re processing data streams, should they also land in the data lake first and then we pull those events from the stream that are of interest to warehousing?
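The two transformation orderings mentioned here can be contrasted in a small sketch. This Python toy is purely illustrative – all the function names are invented, and a real pipeline would use an ETL tool or SQL – but it shows the distinction: ETL conforms data before loading, while ELT lands the raw data first and transforms afterwards, which is why ELT pairs naturally with a data lake as the landing zone.

```python
# Hypothetical sketch contrasting ETL and ELT orderings.
# Names (extract, transform, etl, elt) are illustrative, not a real API.

def extract():
    return [{"cust": " Ada ", "amt": "120"}, {"cust": "Bo", "amt": "75"}]

def transform(rows):
    # cleanse and conform types before the data reaches its consumers
    return [{"cust": r["cust"].strip(), "amt": int(r["amt"])} for r in rows]

def etl(warehouse):
    # Extract -> Transform -> Load: only conformed data lands in the warehouse
    warehouse.extend(transform(extract()))

def elt(lake, warehouse):
    # Extract -> Load -> Transform: raw data lands in the lake first,
    # then is transformed on its way into the warehouse
    lake.extend(extract())
    warehouse.extend(transform(lake))

wh1, lake, wh2 = [], [], []
etl(wh1)
elt(lake, wh2)
assert wh1 == wh2                   # same conformed result either way...
assert lake[0]["cust"] == " Ada "   # ...but ELT also retains the raw form
```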
So, there are so many questions in terms of how to modernize the data warehouse that there’s no quick leap to get there. First, answer these questions and define the whole data management architecture – not just the data warehousing architecture, but where the data warehouse fits together with the data lake, with analytic sandboxes, with master data management. What is my entire data ecosystem comprised of in terms of data stores, and what are the constraints in terms of data flows? Can I move data from data lake to data warehouse, but not the reverse? Make these architectural decisions. Then I can think about ‘How do I migrate to the new architecture incrementally?’ because big bang isn’t going to work.
Henry: I know some people have suggested a combination of metadata management and data lake could replace the data warehouse. I’m assuming that you don’t buy this.
Dave: I don’t buy that. Metadata management doesn’t give me the time-variant history that I need. We can’t run only on current data. We can’t run on historical data that is not carefully managed to ensure its veracity, to ensure uniformity of time intervals, and so on and so forth.
We need master data management for shared reference data. We need data warehousing in some form for that time-variant history and we need data lake for all the dynamics of the big data world and the serving of the analytics communities that didn’t really exist when data warehousing was initially conceived.
Henry: From a practical perspective, for those with speed issues and cost concerns, how do you speed deployments and reduce costs of the data warehouse?
Dave: One option certainly is migrating to the cloud, which reduces a substantial number of on-premises operating costs, data center costs, so on and so forth. One option is, as I’d mentioned earlier, when you rearchitect there is potential that you employ Hadoop as a fast and parallel processing engine for your ETL or ELT, so that you can improve performance and throughput. And Hadoop at least is rumored to reduce costs. You know, I’ve heard it said that Hadoop is free like a free puppy.
So cloud is a possibility. Data virtualization for data that doesn’t need to be physically stored is a possibility – certainly, virtualization doesn’t work for all data. In terms of reducing deployment times, Agile helps in some respects. DevOps takes us a step further and gets us to the point where the backlog of work and the task queue are driven by operational needs, real business needs.
One of the struggles we have in data warehouse deployments is that a tremendous amount of time is spent gathering requirements when requirements are so elusive and difficult to express anyway. It seems like we’re employing ’90s thinking about how to run a project that doesn’t fit this world of fast data and fast business.
Henry: Are there any dangers of moving to the cloud serious enough that some companies shouldn’t do it, or challenges great enough that they couldn’t?
Dave: I think the biggest danger is moving to the cloud without having thoroughly thought through the business and technical case for moving to the cloud. There certainly are companies or organizations that shouldn’t move to the cloud. If you do move to the cloud, you need to do it carefully and be cognizant of the fact that moving a data warehouse to the cloud is much more complex than simply moving the data.
Moving the data is one piece. Moving the processing is yet another piece. You can’t do either of those until you migrate the schema. When you migrate the schema, it’s likely that you’re going to want to do some redesign along the way because your 1990s data model needs some rethinking as you migrate it – maybe you’d want to make some changes to your processing.
The other reality is that any data warehouse of any substance is too large to migrate all at once, so you need to think about moving incrementally. As you make those moves incrementally, you need to put in place something that makes it entirely transparent to the users where the data resides. You should be able to migrate data and processes from on-premises to cloud without disrupting the business or without users having to rethink how to access the data – and that, to me, is a good application of data virtualization. It’s just to put a virtual access layer in there so that you can do your behind the scenes work with little or no impact whatsoever to business users.
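The virtual access layer Dave describes might be sketched like this. A minimal Python illustration – the class and its routing table are hypothetical, and real data virtualization products do this with query federation – showing tables migrating behind the layer while users keep issuing the same queries.

```python
# Hypothetical sketch of a virtual access layer for incremental migration:
# users query one interface while tables move from on-premises to cloud.

class VirtualAccessLayer:
    """Routes each table name to wherever the data currently lives."""

    def __init__(self, on_prem, cloud):
        self.stores = {"on_prem": on_prem, "cloud": cloud}
        # routing table: table name -> which store currently holds it
        self.location = {name: "on_prem" for name in on_prem}

    def migrate(self, table):
        # move the table behind the scenes; users never see the change
        self.stores["cloud"][table] = self.stores["on_prem"].pop(table)
        self.location[table] = "cloud"

    def query(self, table):
        # users always ask for "orders"; where it resides is transparent
        return self.stores[self.location[table]][table]

layer = VirtualAccessLayer(
    on_prem={"orders": [1, 2, 3], "customers": ["ada", "bo"]},
    cloud={},
)
before = layer.query("orders")
layer.migrate("orders")                  # incremental, behind-the-scenes move
assert layer.query("orders") == before   # same answer after migration
```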
Henry: Moving on to modeling, has in-memory scale-out technology changed how we need to model the data warehouse? Can we model less, since machines are more powerful and will return fast queries even against ugly schemas?
Dave: That’s a nice pipedream. The reality is every time we shortcut modeling, we end up shooting ourselves in the foot essentially. There’s no substitute for design before you build – that’s what modeling is. It’s a design process to determine what it is that you’re going to build.
Henry: Yeah, you can’t avoid the hard work.
Dave: No, you’re going to pay now or pay later.
Henry: Should we still model the enterprise or just model for individual use cases?
Dave: I think we can model for individual use cases, and that can work fairly effectively as long as we have some sort of high-level enterprise architecture so that we know where each thing fits as we build it. It needs some framework so that as you build each model, it fits into an overall design. We can model the enterprise too, and we can get past the idea that ‘enterprise modeling takes too long; we have to stop the world and model the enterprise.’
One of the things we’ve not done very well in data modeling is apply data model patterns. There are many of them out there. Len Silverston has three books out on universal data models. David Hay has a couple of books out about data model patterns and there are a handful of others.
There’s a consultant in Melbourne, Australia, John Giles, who has made a very successful practice out of enterprise models in three weeks. What John does is get in there and understand the business and the dynamics of the business. He then goes through a process of selecting which data model patterns best represent the business. Then he asks, ‘How do we take those patterns and mix and match them and build the bridges so that it’s a cohesive data model?’ and then, ‘What kinds of tailoring might we need to do for things that are unique to your particular business?’ and the tailoring tends to be relatively small because of the degree of adaptability in these data model patterns.
So, it is possible to do enterprise data models rapidly. It requires that you buy in to the concept of universal data models and data model patterns and understand them well and apply them well. Do we need to model the enterprise? Probably not, but there is some value in at least having a high-level enterprise model.
Henry: Do you prescribe some optimal data warehouse strategy or architecture, or is it different for each company?
Dave: I think it is different for each company, but not terribly different. Ultimately you end up once again with a small number of patterns that work particularly well. Back in the early days of data warehousing, the big debates were Inmon’s hub-and-spoke model or Ralph Kimball’s bus architecture. And all the while those debates were going on in the world of theory, the reality was everyone was implementing a hybrid of some sort.
If I have a green field for data warehouse design, I tend to start with the idea that I will have a hub data warehouse, but it will not be normalized as the typical Inmon model is. It will be denormalized so that it contains master dimension tables with every row and every column that might be needed by any data mart for any dimension. It can contain master measures tables so that when I have key measures and key metrics that they can be collected once and calculated once for the enterprise, so I get that level of consistency among what ultimately become the facts in my fact tables. And it can contain reference tables, code lookups and those sorts of things – and if I start with that model and then begin to understand the dynamics and the needs of the business, that model can sort of adapt and morph as needed.
For example, if I go to Micron, they have some very complex data that needs to be collected about the tests that they run on their chips and electronic components. Managing that data requires a level of hyper-normalization that isn’t possible in a typical E-R model. So for their test data, you begin with a bias toward a data vault model that supports that hyper-normalization. It doesn’t mean we need to put everything in a data vault.
So this is, I think, the process of designing a data warehouse. It’s thinking through the nature of the data and the nature of how the data is used, and building models to bridge between those two. The point is to connect people with data and careful data modeling is the way that you’re going to provide the best connections.
Henry: In five years, what do you think the health status of the data warehouse will be and what do you hope it will be?
Dave: In five years, I hope it will be data warehousing as a concept: integrated, subject-oriented, nonvolatile, time-variant, with uniform time intervals, pre-aggregated and ready for use. I hope all of that will exist as part of a complete data management ecosystem. Whether we put the label of data warehouse on it or treat it as a partition within the data lake is less important than the fact that we retain those concepts, so that we manage data in that way for those use cases where that’s the best way to manage it.
I would hope that data warehousing is alive and well as one practice in data management. I would hope that data lake architecture, which is still very much in its infancy, grows and adapts so that there’s a higher level of compatibility between data warehousing practices and data lake practices.
Henry: As the data lake matures, do you think people will continue to try to abandon the data warehouse?
Dave: I think they may abandon the data warehouse as a distinct physical data store, but they can’t abandon the collecting of enterprise history and the careful management of that history. As soon as you abandon that, you can’t do time series analysis; you can’t see business trends.
Not everybody wants to be a self-service data analyst. If you run a business, you don’t always want to have to use Tableau, Qlik, Looker, Domo, whatever, to create your own information. Sometimes you just want to have it published and pushed to you. That’s more of a data warehousing use case than a data lake use case, at least as the two exist today.
Henry: Alright, so it’s less important what you call it, but certain things are always going to need to happen regardless of where they happen.
Dave: Yeah, exactly.
Henry: So what should be the primary takeaway from this discussion? What should people know?
Dave: Rethink your data warehouse or data warehouses. Understand why the concepts of data warehousing are important and things that you want to retain regardless of whether you have a physical data store called a data warehouse. Maybe, most importantly, step back from data warehouse alone and think about your overall data management architecture – who are all of the people who consume data and what’s the variety of use cases? And understand that there is no one-size-fits-all data model that will serve all of those use cases.
You are going to have the same data stored in different places in different ways to serve different use cases. Understand that some of those users will want the data prepared and published to them. Others will want access to raw data to do their own data preparation. So start with your architecture on the use case and consumption side and understand the various ways that you need to configure data to serve those users. Then, go over and look at the data sources side and ask, ‘Now, how do we take these data sources and move the data into whatever data stores are appropriate to serve all of the use cases?’
So, I believe, the most important thing to take away is that it’s time to rethink data management architecture. What we’ve been doing since the emergence of big data and data lakes and Hadoop and NoSQL is taking our old data management architecture and patching things onto it. It looks a lot like duct-tape architecture at this point, and it’s time to stop patching things on, step back, and say, ‘Let’s define data management architecture from the ground up, greenfield.’
Henry: Awesome. Do you have any suggested resources for people with more questions?
Dave: There are certainly some in the data management blogs at eckerson.com. There are several pieces out there – in fact, even an internal debate among some of the Eckerson consultants about ‘Is the data warehouse dead or not?’
I happen to teach, right now, a class about modernizing data architecture that has been very well received everywhere I’ve taught it. I teach it at TDWI conferences. I’ve taught it at Enterprise Data World. I offer it onsite to companies. We could, if people wanted to bring this kind of education into their company, bring it in through Eckerson Group, Eckerson Education.
If you liked the podcast, please subscribe!