An Architect’s View of the Data Lakehouse: Perplexity and Perspective
Perplexity: The state of being perplexed; puzzled or confused.
I’m a data architect with many years of experience, and I must admit that perplexity is precisely the right word for my reaction when I first encountered the data lakehouse concept. Its advocates describe a lakehouse as a combination of a data lake and a data warehouse: one that implements warehouse-like data structures and data management functions on the low-cost storage typically used for data lakes.
The description left me puzzled and confused. Is it a new storage model for the data warehouse? Or a new architecture for data lakes? Or something entirely different? Is it a genuine effort to advance the state of data management? Or a marketer’s desperate search for a new buzzword? First came skepticism, then curiosity, and eventually serious consideration of the idea as real data architecture.
The first struggle with serious consideration was getting past the goofy name. I had visions of standing before a budget committee asking for funding to build a data lakehouse. The explanations needed after the CFO asks “To build a what?” would sound as convoluted as those of a politician putting a positive spin on a particularly foolish gaffe. The term “data lakehouse” is a barrier in itself: sort of silly, sort of frivolous, and not very enticing. To this architect (and, I want to believe, to many others), data architecture is no place for frivolity. Data is serious business, with lots of potential value but also lots of risk.
Setting aside the foolishness of the name (we can always give it a different name), I started to look for real architectural value. The challenge was separating architecture from hype: finding the nuggets among the noise. As an example of hype, one recent article declares it to be “Data LakeHouse – Paradigm of the Decade.” This is year one of that decade, with 9½ years still ahead and data management technologies evolving rapidly; declaring a paradigm of the decade now is premature at best.
The other aspect of sorting nuggets from noise is distinguishing between purposeful architectural constructs intended to ease the complexities of managing data, and marketecture concepts intended to subtly promote products and platforms. There is no shortage of the latter, with Databricks, Azure Synapse, Amazon Redshift, and others embracing the concept. Each of these is credible technology, and their marketers would be remiss if they failed to connect with emerging data management trends. Still, it is an architect’s responsibility to separate sound data management principles from product promotions and dependencies.
Moving on to the search for architectural value, the good news is that I found some real value. I also found some real drawbacks. Let’s begin with a look at how the data lakehouse concept compares with the state of data warehousing and data lakes in most organizations.
Many organizations today continue to operate the data warehouse independently of the data lake as shown in figure 1. In fact, most have multiple data warehouses. Data warehouses are the source of data for BI use cases, and the data lake serves data science use cases.
Figure 1. Separate Data Lake and Data Warehouse
Some have modernized their data architecture to combine data lake and data warehouse as a single data management platform that serves data for both BI and data science, as illustrated in figure 2. This aligns well with architectural constructs I first discussed in a 2017 blog: data warehouse in parallel with data lake, and data warehouse inside the data lake.
Figure 2. Combined Data Lake and Data Warehouse
The data lakehouse goes a step further. It advocates a single platform for data warehousing, data lake, business intelligence, and data science, as shown in figure 3.
Figure 3. Data Lakehouse – One Platform for Everything
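To make the one-platform idea concrete, here is a minimal sketch, assuming Apache Spark with the open-source Delta Lake table format (one of several open table formats associated with the lakehouse approach); the storage path, table name, and columns are hypothetical. It shows a single table, stored as open-format files on low-cost object storage, serving both a BI-style SQL query and a data-science-style DataFrame read with no copy into a separate warehouse.

```python
from pyspark.sql import SparkSession

# Delta Lake on Spark; the delta-core package must be on the classpath,
# e.g. spark-submit --packages io.delta:delta-core_2.12:<version>
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# One table, kept as open-format files on low-cost object storage.
path = "s3a://example-lake/sales"  # hypothetical location
spark.createDataFrame(
    [("2021-06-01", "widgets", 120.0), ("2021-06-01", "gadgets", 75.5)],
    ["sale_date", "product", "amount"],
).write.format("delta").mode("append").save(path)

# BI-style use: query the same files with SQL.
spark.read.format("delta").load(path).createOrReplaceTempView("sales")
spark.sql(
    "SELECT product, SUM(amount) AS revenue FROM sales GROUP BY product"
).show()

# Data-science-style use: load the same table as a DataFrame for modeling.
features = spark.read.format("delta").load(path)
```

In this sketch nothing separates the “warehouse” from the “lake”; that is the lakehouse claim in a nutshell.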
One platform for everything … On the surface it seems like a good idea, and it does have some benefits. On the plus side of my analysis:
- A single platform may ease the administrative burden.
- Data governance may be simplified with a single control point.
- It may result in less data movement and less data redundancy (although the same may be true with the pattern of data warehouse inside the data lake).
- Keeping all data in data lake format is perhaps the biggest benefit of all:
  - Schema management is simplified.
  - Data adapts readily to all use cases.
  - Transaction support with ACID compliance is practical (see the sketch after this list).
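Continuing the sketch above (reusing its hypothetical `spark` session and `path`), the snippet below illustrates the last two points: a Delta-style table format rejects an append whose schema does not match the table, and each write is an atomic commit to the table’s transaction log.

```python
from pyspark.sql.utils import AnalysisException

# Schema enforcement: a write whose schema does not match the table
# (here, an unexpected "qty" column) is rejected rather than silently
# producing incompatible files.
bad_rows = spark.createDataFrame([("2021-06-02", 99)], ["sale_date", "qty"])
try:
    bad_rows.write.format("delta").mode("append").save(path)
except AnalysisException as err:
    print("append rejected, schema mismatch:", err)

# ACID writes: a commit is all-or-nothing, so readers see either the old
# version of the table or the new one, never a half-written mix of files.
spark.createDataFrame(
    [("2021-06-02", "widgets", 60.0)],
    ["sale_date", "product", "amount"],
).write.format("delta").mode("append").save(path)
```

Whether the surrounding tooling is mature enough to rely on these guarantees at scale is a separate question, taken up below.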
My analysis also found some negatives, or at least uncertainties:
- Everyone already has data warehouses—usually more than one—that will need to migrate. Do the benefits justify the time and cost of migration?
- The tools to enable a data lakehouse are in their infancy. At this early stage, they certainly don’t have all of the features and functions needed to live up to the promise or the hype. They may be ready for early adopters, but the more cautious architects will take a wait-and-see approach.
- The mistake of the monolith is the biggest disadvantage I see. Every architect is familiar with the myth of the monolith: if we build an all-in-one solution, then management is easy because we have only one thing to manage. The reality is that the monolith is inherently complex and fragile due to unnecessary dependencies. Modularity and decoupling are sound and proven architectural principles, and they don’t seem compatible with one platform for everything.
Early in my career I had the good fortune to know and learn from Gerald Weinberg (author of The Psychology of Computer Programming, Are Your Lights On?, Becoming a Technical Leader, and many other books). I recall a time in 1974 sitting with Jerry over a beer after a long day of conference activity. He said to me (referring to software developers): “We fool ourselves when we say we’re in the problem-solving business. We’re really in the problem-trading business. If we’re good we trade big ones for little ones. If we’re not so good … well, you know what happens.”
My conclusion: The data lakehouse is an interesting idea. It is thought-provoking. There are some ideas here that architects should pay attention to as the concepts evolve and mature. But it’s not ready for prime time, and it is not the “paradigm of the decade.” It is not a solution to data management problems. It will undoubtedly trade one set of problems for another. And we don’t yet know if we’ll be trading big ones for little ones or little ones for big ones.
To hear more Eckerson Group perspectives on the data lakehouse, be sure to check out the blogs from my colleagues Wayne Eckerson and Kevin Petrie, and the recording of our recent Shop Talk discussion.