Data Lakehouses Hold Water (thanks to the Cloud Data Lake)

Like many marketing concepts, the Data Lakehouse evokes clean PowerPoint imagery, in this case a Venn diagram of two converging ovals: the data warehouse and the data lake.

Real enterprise architectures, of course, look messy, because new platforms tend to add rather than replace. Technologies never fully merge, data integration is never complete, and migrations are not always worth it. These realities hold in the case of the Data Lakehouse, puncturing what my colleague Dave Wells aptly calls the “myth of the monolith.” Check out his blog here, as well as Wayne Eckerson's blog and our recent Shop Talk Webinar on this topic.

With that said, several industry forces now drive an indisputable sea change in data warehousing. Economical, elastic and open cloud-based data lake platforms can efficiently run automated SQL-driven workloads on multi-formatted datasets. These data lakes support business intelligence (BI) use cases in ways that look and smell a lot like the data warehouse. We should expect versions of the Data Lakehouse concept to crop up in many enterprise environments, complementing (but not fully replacing) data warehouses and data lakes.

To avoid comparison with marketing concepts, let’s call this emerging platform the New Cloud Data Lake. By examining each element in the following simplified reference architecture, we can understand why and how the New Cloud Data Lake will play a very real role in modern enterprise environments.

The New Cloud Data Lake: Reference Architecture

  • Cloud-driven infrastructure. Renting storage and compute resources gives you economic and technological flexibility that no in-house data center can match. You can launch, scale and adapt IT and analytics initiatives faster and more easily by using cloud platforms such as object-storage data lakes. These projects often straddle hybrid or multi-cloud infrastructures as they layer onto existing environments.
  • The rise of object storage. Flat object stores such as Azure Blob Storage and Amazon S3 prove more cost-effective, scalable and available than the Hadoop Distributed File System (HDFS) or the hierarchical structures of traditional data warehouses. A growing number of enterprises embrace cloud object storage to improve cost and agility as they consolidate structured, semi-structured and unstructured data (see the first sketch after this list).
  • High-performance processing. Data lakes now deliver sufficient performance to meet the SLAs of interactive OLAP. Most data lakes run Apache Spark’s in-memory processing rather than slow, legacy batch-oriented MapReduce software to support batch, micro-batch and streaming workloads. Some data teams speed things up further by using Apache Arrow to focus in-memory queries on just the data columns that matter for a given job (second sketch below). New commercial tools also reduce latency and improve throughput by applying a distributed cache to parallel object storage.
  • Expanding workload types. A given dataset supports more analytics use cases today than it did just a few years ago. The same sales records, for example, might feed mainstay revenue dashboards, machine learning algorithms that predict customer behavior, and even automated real-time e-commerce offers (third sketch below). By moving some of their BI workloads to the data lake, enterprises can serve all of these constituents more efficiently.
  • Automation. Commercial automation tools help data engineers, data scientists and application developers reduce devilish scripting and development work at various layers of the stack. Graphical interfaces empower a rising population of data consumers to discover, probe, challenge and share insights without waiting on a data engineer.
  • Open architecture. Cloud data lake platforms generally integrate more easily with heterogeneous ecosystems than traditional data warehouses. Open file formats such as Parquet, ORC and JSON, and open APIs such as ODBC and JDBC, enable data teams to migrate data or plug in new tools (fourth sketch below). So enterprises stand ready for future change, which is one requirement they can count on. They should beware, however, of the risk of “creeping lock-in”: cloud service provider (CSP) data formats and APIs might become less open over time, thereby raising the cost and complexity of migrating to other clouds.
  • Resource elasticity. Cloud-based platforms of many types have become more efficient by scaling storage and compute resources up or down independently of one another. This improves utilization and reduces cost, particularly for variable or bursty workloads, compared with on-premises data lakes. A frequently queried dataset gets the CPU cycles it needs, and an archived or “cold” dataset need not consume any.
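
To make these elements concrete, here are a few short sketches in Python. First, landing a file in flat object storage, using boto3 against Amazon S3. The bucket name, key path and file are hypothetical, and the calls assume AWS credentials are already configured.

```python
# Sketch: landing a Parquet file in a flat object store (Amazon S3 via boto3).
# "acme-data-lake" and the key path are hypothetical.
import boto3

s3 = boto3.client("s3")

# Object stores have no real directories; the "/" in the key is only a naming
# convention, which is what keeps the namespace flat and cheap to scale.
with open("sales_2024.parquet", "rb") as f:
    s3.put_object(
        Bucket="acme-data-lake",
        Key="raw/sales/sales_2024.parquet",
        Body=f,
    )

# Listing by key prefix stands in for browsing a folder hierarchy.
resp = s3.list_objects_v2(Bucket="acme-data-lake", Prefix="raw/sales/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```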
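
Second, a hedged sketch of an interactive, BI-style query running on the lake with Apache Spark’s in-memory engine. The s3a:// path and column names are illustrative, and reading from S3 assumes the cluster has the hadoop-aws connector configured.

```python
# Sketch: a dashboard-style OLAP query on lake data with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-olap-sketch").getOrCreate()

# Selecting only the needed columns lets the columnar Parquet format
# skip reading the rest of the data entirely.
sales = (
    spark.read.parquet("s3a://acme-data-lake/raw/sales/")  # hypothetical path
         .select("region", "order_date", "revenue")
)
sales.cache()  # keep the working set in memory across repeated queries

daily_revenue = (
    sales.groupBy("region", "order_date")
         .agg(F.sum("revenue").alias("total_revenue"))
         .orderBy("order_date")
)
daily_revenue.show()

# Arrow accelerates the hand-off from Spark's JVM to Python, e.g. when an
# analyst pulls a result set into pandas for further work.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = daily_revenue.toPandas()
```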
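
Third, a sketch of one dataset serving two workload types, using a tiny in-memory sample with invented column names: the same sales records feed a dashboard rollup and a toy repeat-purchase model.

```python
# Sketch: the same sales records serve BI and machine learning.
import pandas as pd
from sklearn.linear_model import LogisticRegression

sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "discount":    [0.00, 0.10, 0.20, 0.00, 0.30, 0.10],
    "revenue":     [120.0, 80.0, 45.0, 60.0, 30.0, 95.0],
    "repeat_buy":  [1, 1, 0, 1, 0, 0],  # did the customer purchase again?
})

# Workload 1: the mainstay revenue dashboard (a simple rollup).
print(sales.groupby("customer_id")["revenue"].sum())

# Workload 2: a toy model predicting repeat purchases from the same records.
model = LogisticRegression().fit(sales[["discount", "revenue"]], sales["repeat_buy"])
new_order = pd.DataFrame({"discount": [0.15], "revenue": [70.0]})
print(model.predict(new_order))  # e.g. [1] -> likely to buy again
```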
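
Finally, a minimal sketch of the portability that open formats provide, using pyarrow; the file and column names are invented. Data written as Parquet by one engine can be read by any tool that speaks the format, with no proprietary export step.

```python
# Sketch: open-format interchange via Parquet with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "region":  ["east", "west", "east"],
    "revenue": [120.0, 45.0, 80.0],
})
pq.write_table(table, "sales.parquet")

# Any Parquet-aware engine (Spark, DuckDB, pandas, a cloud warehouse)
# can read the file; column pruning comes free with the format.
print(pq.read_table("sales.parquet", columns=["revenue"]).to_pandas())
```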

In real life, a lake house does not replace your primary residence. It adds to your real estate responsibilities but, on balance, improves your quality of life. So perhaps the “Data Lakehouse” term makes sense after all. It complicates your environment, but offers advantages that speak for themselves.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
