Data Warehouses: Dead or Alive? Maybe Just Reinvented
Steve Smith’s July post about the Demise of the Data Warehouse and Dave Wells’ response that the Data Warehouse is Still Alive has us thinking about the historical context for data warehousing provides some insight into its presumed mortality and what the future holds. Just as a reminder, Steve suggested that the problems that data warehouses were designed to address (access, speed, and truth) are not only unsolved by the data warehouse paradigm, but that they can be addressed using modern approaches to master data management and managing data in a data lake (DL + MDM). Conversely, Dave notes that data warehouses are still viable options for reporting and analytics, and that a modernized approach to data warehousing can augment Steve’s DL + MDM model to continue providing data consumers access to a familiar reporting and BI architecture.
Let’s think about Steve’s premise: the data warehouse architecture was intended to address some critical gaps in enabling what today would be called descriptive analytics. Transaction and operational systems were designed to accomplish business process goals, and consequently they were not built to allow for dimensional analysis and aggregation. You could search the database to find out if a transaction had taken place, but queries to perform summarization and reporting would have increased the computational demand, potentially dragging down transaction performance below agreed-to service levels.
The concept of the data warehouse addressed both of these challenges. Pulling the data into a separate environment and organizing the data in a way that facilitates fast summarization, aggregation, and other types of queries allowed the data analysts to do their thing without impacting on-line application performance. That being said, Steve had some good points in that instantiating a data warehouse created its own sets of problems having to do with data integration, synchronization, inconsistent semantics, and (perhaps as a victim of its own success) the creation of a workload bottleneck when the data warehouse team was pressed to help more and more users configure and generate reports and analyses.
Yet Dave is also right: those data warehouses that have successfully brought into production are indispensible, and continue to meet the needs of the data consumer community. And he is also right about the need for modernization, as users become more sophisticated, impatient, and aware of the value of transforming data into actionable knowledge.
So, where are we today? The lowered barriers of high-performance computing platforms and data storage that are massively scalable have led to different reporting and analytics innovations. Instead of relying on conventional SQL queries for reporting and analysis, analytics tools convert requests into MapReduce or Spark programs that (presumably) run as fast as a well-tuned RDBMS. Instead of trying to proactively organize data for rapid summarization and aggregation, users will dump all of their ingested data into a data lake. Platform hosting vendors are rapidly pumping out more and more services that are hypnotically drawing system managers’ attention to migrating their environments to the cloud.
But are MapReduce queries really faster than SQL? Maybe yes and maybe no, depending on the configuration of the data and the types of requests. And while there are some data scientists who claim that they can easily take advantage of that data lake, when more and more data assets are dumped into the lake, that analytical agility will diminish unless there are data catalogs that can capture the structural and semantic metadata associated with data lake asset content. And while transitioning from an on-premises environment to a cloud platform seems to make economic sense, not understanding the access and usage patterns of your data and a poor organization of your data objects can lead to some surprisingly large invoices at the end of the month.
Consider this example: An organization wants to migrate all of their entity data (customers, products, services, vendors) and their transaction data (sales) to a cloud-based environment on AWS (Amazon Web Services). They want to store their data on Amazon’s S3 storage platform. The question becomes: how do you want to configure that storage?
Well, one way is to just dump it in S3 in its original form. But if you do that, you might ignore the fact that there are duplicate entries in your entity data, which could cause confusion in your downstream analysis. So Steve is correct: master data management is key here, and you will need to perform identity resolution and realignment of your entity data before it gets instantiated in the object store. And when it comes to the transaction data? If you want to be able to perform any queries, it implies that a service application needs to be spun up to scan through all the transactions to find the ones you care about. Yes, the parallel execution models of MapReduce or Spark are good for this, but are they going to be faster than a query against an indexed set of tables?
Another way is to reconfigure each transaction at its own object (perhaps in XML or JSON format). Name those objects based on the ways that you might want to slice and dice them, and then requests for individual objects and even certain types of aggregations becomes trivial. However, Amazon advises against having names that are too similar, as that impacts execution performance! Not only that, when using cloud storage like S3 you are paying for storage and access by the object; transforming your tens (or hundreds) or millions of transactions into unique objects may result in a dramatic increase of your storage and access costs.
A third way is a hybrid: transform your transactions into objects and bundle them by some criteria. That reduces the storage and access cost burden, but increases the processing burden.
Hmm, what do we have here? A scenario in which we want to put transaction data into a separate environment and transform that data so that it is organized in a way to increase accessibility and speed performance. Isn’t that the same idea as the original approach to data warehousing? Yes, the performance is better, we can handle greater data volumes and a wider variety of sources. But we still have to think carefully about how we want to manage those data assets to support the downstream users’ needs.
That being said, there are some alternatives that have some appeal. For example (as both Steve and Dave suggest), the use of data preparation and data cataloging tools can enable greater degrees of self-service, which can reduce or eliminate the IT bottleneck in providing access. Separating storage from computation and enabling “serverless computing” further improves accessibility and performance while lowering overall costs. And hybrid transactional and analytical platforms (HTAP) that store data in memory that allow transaction processing and reporting and BI queries to be executed against the same data set (without needing to extract, transform, and load the data!!) may address all of the challenges that the data warehouse was meant to address (accessibility, performance, coherence and synchronization).
In our opinion, the data warehouse is not dead. But as newer technologies address the original motivating issues, newer versions of those issues are starting to appear. The data warehouse is not going to go away, but over the next 5-10 years it will increasingly look very different than what it used to be.