All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse)

The data lakehouse is the latest incantation from a handful of data lake providers to usurp the rapidly changing cloud data warehousing market. I doubt they’ll succeed, but I applaud their efforts.

Too many companies have invested in cloud data warehousing products—and with great success—to turn back the clock. Data lakes, in contrast, have largely been a dismal failure, at least until the advent of the cloud. Giving them a new moniker (i.e., data lakehouse) and adding ACID properties doesn’t really change the equation. Meanwhile, cloud data warehousing products have increasingly co-opted data lake functionality, providing customers with the best of both worlds.

Phraseology

The term “data lakehouse” is a fusion of the terms “data warehouse” and “data lake”. Although my colleagues disparage the term, I kind of like it. It’s a straightforward attempt to merge two concepts that our clients are desperately trying to integrate in the real world. (Read the companion pieces to this article written by Dave Wells and Kevin Petrie or listen to the freewheeling debate on data lakehouses that we recently recorded.)

The term “data lakehouse” is a metaphor, just like “data warehouse” and “data lake”. These metaphors communicate conceptually what the technology tries to do physically. As such, a data lakehouse is a data structure that tries to combine the characteristics of a data warehouse and a data lake. In that sense, almost every modern data management offering can be considered a data lakehouse.

The Urge to Merge 

There is a race among data management vendors to combine the characteristics of a data lake and data warehouse into a single platform. Why? Because that is what customers want. Most seek a data lake to store all their data and a data warehouse to support fast business queries. But merging data lakes and data warehouses is challenging because their operating characteristics are completely different. (See figure 1.)

Figure 1. Operating Characteristics of Data Lakes and Data Warehouses

Data lakes and data warehouses have completely divergent operating characteristics.  

Creating a data lakehouse that combines these divergent characteristics into a single platform has become the holy grail for data management vendors. Each vendor takes a different approach, leveraging their technological heritage to meet the needs of their core constituents. Data lake vendors use open source software and file systems to appeal to data coders, data scientists, and big data advocates. Data warehousing vendors rely on SQL and relational databases to court their large base of business intelligence customers.

Who will dominate the emerging market for modern data platforms? I’m placing my wager on data warehousing vendors because they have the largest installed base and the most battle-tested technology.

Data Warehousing Vendors Adapt

Resurgence. Data warehousing products have made a remarkable resurgence. The failure of Hadoop as an enterprise platform, the advent of scalable cloud platforms, and industry acceptance of SQL as the lingua franca for data processing have made data warehouses relevant again. And the meteoric rise of Snowflake, a cloud data warehousing vendor, has returned data warehousing to rock star status. With 4,000 customers and counting, Snowflake recently raised $479 million from investors, which brought its valuation to $12.4 billion and made it one of Silicon Valley’s most valuable startups.

Teradata. Even the stodgiest data warehousing vendor, namely Teradata, has undergone a complete makeover, wrapping its formidable SQL optimizer in modern cloud computing garb that includes pay-as-you-go pricing and decoupled storage and compute. Teradata has been innovating in the data processing space for 40 years. Its list of capabilities makes any data lake vendor drool: support for multiple languages, including SQL, Python, R, Java, and SAS; support for multiple data processing engines, such as SQL, machine learning, and graph; a mind-boggling litany of analytical and data functions; and a data virtualization engine that can optimize queries across relational and non-relational object and document stores (via Presto).

Data warehouses now act a lot like data lakes. Snowflake, for instance, uses cloud object stores to store data. Customers can either land data in the object store before loading it into Snowflake or load it directly from source systems via an ETL/ELT tool. Snowflake users can also query data that remains in the object store (that is, data that hasn’t been loaded into Snowflake) via external tables. Snowflake recommends external tables for these types of queries and, when performance is critical, materialized views on top of those tables. Teradata goes a step further and allows users to join data across internal and external sources via its QueryGrid virtualization engine.
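To make the external-table idea concrete, here is a minimal Python sketch. It is not Snowflake or Teradata code. The names OBJECT_STORE, CUSTOMERS, scan_external, and revenue_by_customer are hypothetical, and an in-memory string stands in for a file in a cloud object store. The sketch reads the file only at query time, joins it to a table that is already loaded, and then stores the result the way a materialized view would.

```python
import csv
import io

# Hypothetical stand-in for a file sitting in a cloud object store.
OBJECT_STORE = {
    "landing/orders.csv": (
        "order_id,customer_id,amount\n"
        "1,10,25.0\n"
        "2,11,40.0\n"
        "3,10,15.5\n"
    ),
}

# Hypothetical table that has already been loaded into the warehouse.
CUSTOMERS = {10: "Acme", 11: "Globex"}


def scan_external(path):
    """Read rows from the object store at query time; nothing is loaded in advance."""
    reader = csv.DictReader(io.StringIO(OBJECT_STORE[path]))
    for row in reader:
        yield {
            "order_id": int(row["order_id"]),
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
        }


def revenue_by_customer(path):
    """Join the external rows to the loaded table and total the amounts."""
    totals = {}
    for row in scan_external(path):
        name = CUSTOMERS[row["customer_id"]]
        totals[name] = totals.get(name, 0.0) + row["amount"]
    return totals


# In this sketch a materialized view is simply the stored query result,
# so repeated reads skip the scan of the external file.
MATERIALIZED = revenue_by_customer("landing/orders.csv")
```

The trade-off is the one the vendors describe: scanning in place avoids a load step, while the stored result buys speed at the cost of having to be refreshed when the underlying file changes.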

Many data warehouses can also ingest and query semi-structured data, once the sole domain of big data systems. Snowflake, for instance, supports a “variant” data type that stores semi-structured formats such as JSON, Parquet, and Avro directly, which users can query without first parsing or flattening the data structures. To round out the three V’s (volume, velocity, and variety), many data warehousing vendors now support streaming data and petabyte-scale data storage. And more is coming: serverless queries, visual design and development, data preparation, data cataloging, and visualization, all things that once sat outside the realm of the traditional relational warehouse.
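As a rough illustration of what querying without flattening means, the Python sketch below keeps each JSON document whole in a single column and pulls values out by path at query time. It is an analogy, not Snowflake’s implementation. ROWS, get_path, and select are hypothetical names, and the sample documents are invented for the example.

```python
import json
from typing import Any

# Hypothetical rows: each keeps its raw document in one column,
# loosely analogous to a semi-structured "variant" column.
ROWS = [
    {
        "id": 1,
        "payload": json.loads(
            '{"user": {"name": "Ana", "plan": "pro"},'
            ' "events": [{"type": "login"}, {"type": "query"}]}'
        ),
    },
    {
        "id": 2,
        "payload": json.loads('{"user": {"name": "Bo"}, "events": []}'),
    },
]


def get_path(doc: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'user.name' or 'events.0.type'."""
    current = doc
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # A missing field yields the default instead of an error.
            return default
    return current


def select(rows, path):
    """Return (id, value) pairs for one path across all rows."""
    return [(row["id"], get_path(row["payload"], path)) for row in rows]
```

For example, select(ROWS, "user.plan") returns a value for the first row and None for the second, which has no plan field. No schema had to be declared up front, and that is the appeal of this style of storage for data whose shape varies from record to record.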

Branding Beyond the Data Warehouse

Given these new capabilities, data warehousing vendors are rebranding their offerings. For many years, Microsoft sold its cloud data lake and data warehouse services separately. But now it has brought them together in a single package called Azure Synapse. The new cloud offering, which will be released in several phases over the next 12 months, integrates Azure Data Lake Store and Azure SQL Data Warehouse into a seamless environment. It also plugs multiple services into the Azure Synapse Studio, including Azure Data Factory, Power BI, and Azure Machine Learning. Azure Synapse is more than a cloud data lake or a cloud data warehouse: it’s a one-stop shop and single pane of glass for all things data.

Likewise, Snowflake recently rebranded its offering as the Cloud Data Platform to emphasize that it does more than just data warehousing. Snowflake continues to beef up its capabilities to support non-data warehousing workloads, such as data engineering, data science, data lakes, data pipeline development, and data sharing. This last workload leverages Snowflake’s global multi-tenant, multi-cloud, multi-region infrastructure to support seamless data exchange among Snowflake customers, a truly revolutionary feature that moves the industry well beyond the debate between data warehouses and data lakes into a new paradigm of data sharing and monetization.

Conclusion

The race is on: data warehouse vendors are working furiously to co-opt the workloads of data lake vendors, while data lake vendors are returning the favor. As vendors merge capabilities, it’s hard to resist using a catchy term like the “data lakehouse” to describe the unified data environments. But soon, the data lakehouse will become an amusing architectural anachronism as organizations tap into a global data fabric and enter the world of seamless data sharing.

Wayne Eckerson

Wayne Eckerson is an internationally recognized thought leader in the business intelligence and analytics field. He is a sought-after consultant and noted speaker who thinks critically, writes clearly and presents...
