Business Intelligence on the Cloud Data Lake, Part II: Improving the Productivity of Data Engineers

Read - Business Intelligence on the Cloud Data Lake, Part I: Why It Arose, and How to Architect For It

My first blog in this series charted the rise of business intelligence (BI) on the cloud data lake, explained its appeal from an architectural and performance perspective, and recommended ways to design effective data lakes for BI. Now we will explore use cases for BI on the data lake, focusing on the data lake query engine as the linchpin. We will explain how a data lake query engine minimizes ETL and streamlines data pipelines to make data engineers more productive—which benefits the whole enterprise.

Today’s data engineers need help serving a fast-growing population of data consumers. They work tirelessly to design, build and operate the data pipelines that ingest, transform and deliver data from operational sources to analytics targets. Data engineers apply various data pipeline scripts and tools to replicate, move, reformat, join and provision datasets, all in an effort to support fast-changing BI query, reporting and dashboard workloads. It often takes weeks or even months for them to create a new pipeline and provision data. They live the life of Sisyphus.

But a data lake query engine, which combines a semantic layer with high-performance SQL queries on the cloud data lake, can break the cycle and improve results while reducing the hours spent on data engineering, ETL and data pipelines. A data lake query engine uses columnar, in-memory processing, parallel data transfer and various cache optimizations to enhance performance. It also provides a central access point for self-sufficient data consumers, a central control point for data engineers and a semantic layer that eliminates the need for sprawling data warehouse structures such as OLAP cubes and BI extracts.
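To make that concrete, here is a minimal sketch of the kind of BI query a data lake query engine runs directly against the lake. The dataset path and column names below are hypothetical; the point is that the engine resolves the path straight to columnar files in object storage, with no export into a warehouse first.

```sql
-- A minimal sketch, assuming a data lake query engine (such as Dremio)
-- that exposes Parquet files in cloud object storage as SQL tables.
-- The path lake.sales.orders and its columns are hypothetical.
SELECT region,
       SUM(order_total) AS revenue
FROM   lake.sales.orders
WHERE  order_date >= DATE '2023-01-01'
GROUP  BY region
ORDER  BY revenue DESC;
```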

A data lake query engine accelerates the “last mile” of the data pipeline that runs from the cloud data lake to the BI user. Imagine a world in which the following use cases are commonplace:

  • BI analysts and business managers execute fast ad-hoc, dashboard or reporting queries directly on the data lake. They leverage a unified data view to discover and enrich datasets in a self-service fashion, then share them with colleagues. These BI analysts and business managers serve themselves, without coding and without onerous requests of data engineers.
  • The data engineer quickly provisions, streamlines and governs data access through a rich and intuitive web interface. They manage a semantic layer that creates virtual datasets by integrating physical datasets across one or more sources in the data lake, without manual coding (see the sketch after this list). They take their first long lunch break in years.
  • Analysts and data engineers collaborate to create and reuse structured, virtual datasets. Because data engineers code less, they have more time to build and govern processes. They centralize KPIs and business logic for consistent reuse across enterprise teams.
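Under the hood, a virtual dataset of this kind typically amounts to a SQL view that the engine resolves against lake data at query time. The sketch below is illustrative only: the source datasets (sales.orders, crm.customers) and the resulting view are hypothetical names.

```sql
-- A hedged sketch of a virtual dataset in the semantic layer.
-- All object names are hypothetical; no data is copied, and the view
-- is resolved against the lake each time it is queried.
CREATE VIEW analytics.customer_revenue AS
SELECT c.customer_id,
       c.segment,
       SUM(o.order_total) AS lifetime_revenue
FROM   sales.orders o
JOIN   crm.customers c
  ON   o.customer_id = c.customer_id
GROUP  BY c.customer_id, c.segment;
```

Analysts then query analytics.customer_revenue like any other table, and because the KPI logic lives in a single definition, every team computes it the same way.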

But managing that last mile of the data pipeline in an efficient, flexible and governed way is obviously easier said than done. Let’s unpack just how data engineers can achieve this and improve productivity.

  • Standardize tools and formats. Carefully select a commercial tool that takes a systematic and automated approach to structuring and querying data in your data lake. By standardizing on commercially supported technology, you can minimize the need to develop, maintain and troubleshoot custom code. To flexibly address future needs, be sure this tool integrates comprehensively with your current platforms and with potential future ones across hybrid and multi-cloud environments. Data teams also should standardize on open and flexible data formats such as the Apache Parquet and Apache ORC columnar file formats, both of which enable fast access, high scalability and parallel processing.
  • Eliminate steps in the data pipeline. As discussed earlier, the high performance of the data lake query engine eliminates the need to export data from the data lake into separate data warehouse structures. It also replaces bulky ETL jobs with surgical, quickly materialized views that can be adapted and reused in memory (sketched after this list). And by deploying a comprehensive and automated tool, you can eliminate the need for specialized developers to manage different semantic layers for different workloads. The data engineer’s to-do list shrinks further.
  • Shift data pipeline work to analysts. Data engineers typically struggle to transform and assemble datasets in accurate and usable formats, because they lack business domain knowledge. The data lake query engine reduces this burden by empowering domain-savvy analysts to generate the right query on their own. They need little or no help from data engineers because the ETL work has gone away.
  • Maintain control. Our earlier blog underscored the value of governance capabilities such as data quality checks, role-based access controls and data masking. From a larger architectural perspective, you can simplify governance by avoiding multiple data copies and extracts. Keep everything in the data lake, and query it as needed with in-memory performance optimizations rather than relying on a sprawl of performance workarounds such as BI extracts (see the governance sketch after this list).
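For the “eliminate steps” point above, a rough sketch of what replaces a bulky ETL job might look like the following. Syntax varies by engine (Dremio, for instance, manages these acceleration structures as “reflections”), so the generic CREATE MATERIALIZED VIEW form and all object names here are assumptions.

```sql
-- A generic, hedged sketch: a small materialized aggregate standing in
-- for a nightly ETL job. Names are hypothetical; syntax varies by engine.
CREATE MATERIALIZED VIEW analytics.daily_revenue AS
SELECT order_date,
       region,
       SUM(order_total) AS revenue
FROM   lake.sales.orders
GROUP  BY order_date, region;
```

Engines that support this pattern keep the aggregate refreshed and can transparently rewrite matching dashboard queries to use it, so performance improves without anyone building a new pipeline.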
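Likewise, for the “maintain control” point, governance can be expressed once at the query engine rather than once per copy. The role, grant and masking approach below is a hedged sketch; the exact syntax for grants and masking differs across engines, and every name is hypothetical.

```sql
-- Hedged sketch: role-based access plus column masking via a view,
-- applied at the semantic layer instead of in scrubbed data copies.
-- Role and object names are hypothetical.
GRANT SELECT ON analytics.customer_revenue TO ROLE bi_analysts;

CREATE VIEW analytics.customers_masked AS
SELECT customer_id,
       segment,
       -- hide the local part of the email address for general BI consumers
       '***' || SUBSTRING(email FROM POSITION('@' IN email)) AS email
FROM   crm.customers;
```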

So what’s the best measure of success for data pipeline efficiency? Data engineers become more productive and enjoy more dinners with their families—a good step for both personal and enterprise health.

To dig more into this topic, watch our on-demand webinar with Dremio, "The Rise of the Cloud Data Lake Engine: Architecting for Real-Time Queries."

Kevin Petrie
