Big Fish Swim in the Data Lake

I’ve been known as something of a data lake detractor, deeply suspicious of its early “definition” by James Dixon, CTO of Pentaho, in a 2010 blog post as a place where “the contents… stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” I have elaborated elsewhere on the many problems with this description and its direct descendants, but there is an underlying truth in Dixon’s statement of the problems of traditional data warehousing, and in his vision that the Hadoop ecosystem has a significant role to play in solving them.

My recent softening toward the data lake (also known as data reservoir) is driven by a shift in the industry from an either-or position to a more realistic both-and stance. A more realistic and achievable outcome is that the data lake stands beside (or, in some marketing viewpoints, includes) the data warehouse and takes on some of the functions previously assigned to it. These are functions that have become increasingly difficult in the traditional data warehouse environment as business becomes ever faster and more agile, and technology drives ever more data into the business.

In my book “Business unIntelligence”, I suggested that the roles of both the enterprise data warehouse (EDW) and data marts would change to accommodate greater business agility and exploding data volumes. We would step back from the idea of pushing all data through the EDW for reconciliation and consistency, and limit the EDW’s scope to core, legally binding business information. Data marts would become less tied to the EDW but, like the EDW itself, would focus on data from the transactional, operational systems of the business. These traditional components comprise the process-mediated data pillar, implemented largely in relational technology today. The architecture also identified two pillars of (mostly) externally sourced information/data—human-sourced and machine-generated—that correspond to a data lake, built on Hadoop and NoSQL technologies.

This more nuanced view of data architecture is now being adopted by traditional data vendors and big data proponents alike. Teradata and IBM, for example, both place data lakes and warehouses side by side in their thinking. Early in February, Hortonworks announced an EDW Optimization Solution that positions the EDW and data lake together, defining their respective roles. This offering is worth a closer look, because it illustrates just how far Hadoop thinking has evolved.

As might be expected, and in line with the architectural discussion above, the EDW Optimization Solution recognizes the importance of Hadoop for the new, externally sourced data types. However, more interestingly, it recognizes how the Hadoop platform can play a supportive role to traditional EDW and data marts for process-mediated data.

A long-standing concern in traditional data warehouses is that a large percentage of data residing there—often called cold data—is rarely used and that storing it there for occasional use can be prohibitively expensive. Why not migrate such data to Hadoop (where storage costs are perhaps one tenth of those in traditional relational platforms) for occasional, but immediate, access as usage decreases? As a result, the ability to run SQL queries in the Hadoop environment becomes a central requirement, as does the ability to join (or virtualize) queries across multiple platforms.
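To make this concrete, here is a minimal sketch of such a cross-platform query using PySpark, one of several SQL-on-Hadoop options (Hive, Impala, and Drill are others). Everything in it is an illustrative assumption rather than part of any vendor’s offering: the table names, the JDBC connection details, and the premise that cold history has already been archived into a Hive table on Hadoop.

```python
# A minimal sketch of a cross-platform ("virtualized") query with PySpark.
# Table names, the JDBC URL, and credentials are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warm-cold-join")
         .enableHiveSupport()  # lets Spark see Hive tables on Hadoop
         .getOrCreate())

# "Cold" historical sales, previously archived from the EDW into Hive/HDFS.
cold_sales = spark.sql("SELECT customer_id, amount, sale_date "
                       "FROM archive.sales_history "
                       "WHERE sale_date < '2015-01-01'")

# "Hot" current data still living in the relational warehouse, via JDBC.
hot_customers = (spark.read.format("jdbc")
                 .option("url", "jdbc:postgresql://edw-host:5432/warehouse")
                 .option("dbtable", "dim_customer")
                 .option("user", "report_user")
                 .option("password", "***")
                 .load())

# Join across the two platforms in a single SQL-style operation.
lifetime_value = (cold_sales.join(hot_customers, "customer_id")
                  .groupBy("customer_id", "customer_name")
                  .sum("amount"))
lifetime_value.show()
```

From the analyst’s point of view, the join spans both platforms in a single statement; how much of the work is pushed down to each platform is an engine-specific detail.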

Another part of the offering is to locate extract-transform-load (ETL) processing in the Hadoop environment. This is more of a technological/physical implementation issue than an architectural consideration. From a cost perspective, the attraction is obvious: Hadoop’s highly parallel, commodity hardware platform promises worthwhile operational cost savings over running such functions in the EDW (known as extract-load-transform, ELT) or on some of the traditional ETL platforms. However, careful consideration is needed here. While this approach is likely to make sense for new, external data sources, re-engineering existing, working ETL or ELT solutions from older operational systems risks undermining these carefully crafted and complex systems. Caveat emptor applies!
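As a sketch of what such offloaded ETL might look like, the PySpark fragment below extracts a raw feed landed on HDFS, conforms it in parallel across the cluster, and publishes the result as a Hive table. The file path, column names, and target table are all hypothetical.

```python
# A hedged sketch of ETL offload on Hadoop with PySpark; paths, columns,
# and the target table are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("etl-offload")
         .enableHiveSupport()
         .getOrCreate())

# Extract: raw CSV files landed on HDFS by an upstream feed.
raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///landing/orders/2017-02/*.csv"))

# Transform: cleanse and conform in parallel across the cluster, the work
# that would otherwise run as ELT inside the warehouse.
orders = (raw
          .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
          .withColumn("order_date", F.to_date("order_date"))
          .filter(F.col("amount").isNotNull())
          .dropDuplicates(["order_id"]))

# Load: publish the conformed data as a Hive table, ready for the
# downstream warehouse or for direct SQL access on Hadoop.
orders.write.mode("overwrite").saveAsTable("staging.orders_conformed")
```

The cleansing work that would otherwise consume warehouse cycles here runs on commodity Hadoop nodes, which is precisely where the cost argument comes from.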

Whether you choose to call these data lakes or reservoirs, or even modern data warehouse approaches, it is clear that data warehousing is in transition from a monotheistic, one-size-fits-all relational environment to a multi-platform approach, offering the opportunity to choose the most appropriate tool for the job in hand. The advantage is that we can optimize performance according to processing and data management needs. However, the complexity of the resulting platform should not be underestimated, nor should the fact that open source software brings advantages in responsiveness and choice as well as challenges in management.

Data lakes have come of age; there are big fish to be found for those who venture out on the waters.

Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
