Why Are Enterprise Analytics and AI So Painful? The Case for Data Pipeline Observability

Explosive data growth makes data pipelines complex along every dimension. That complexity leaves data teams unable to monitor and control the thousands of components and services that make up their pipelines.

Driven by business demand, a growing population of enterprise data consumers seeks to use new data from new sources to address new use cases. This prompts data teams to adopt new tools, run workloads on new platforms, and migrate to hybrid and multi-cloud infrastructures. As interdependent technologies accumulate and data volumes rise, enterprises struggle to operate the data pipelines that feed their analytics and AI projects. To regain control, data teams need data pipeline observability: the ability to monitor, automatically detect, predict, and resolve issues, from source to consumption, across the enterprise.

This blog, the first in a series, examines the evolution of enterprise data environments, the rising pain of complexity, and the resulting requirements for data pipeline observability. The next blog will define this new paradigm and its relationship to established disciplines such as DataOps, APM and ITOps. The final blog will explore best practices for data pipeline observability, based on enterprise successes and lessons learned.

Architectural Evolution

To understand the problem, let’s review how architectures have changed.

Until recently, enterprise data pipelines served a staid, predictable, on-premises world. A handful of ETL and change data capture (CDC) tools ingested structured data from databases and applications, then transformed and stored that data in monolithic data warehouses. Traditional business intelligence software created dashboards and reports based on batch analytics workloads in the data warehouse.

But architectures evolved. To keep up with growing data volumes and processing needs, enterprises rapidly adopted new technologies. Architects and data engineers now use ELT tools, CDC, APIs and event streaming systems such as Apache Kafka. These tools ingest structured, semi-structured and unstructured data from sources that include social media, IT logs and Internet of Things (IoT) sensors. That data is transformed and stored in data warehouses, data lakes, NoSQL databases, and streaming data stores. From there it is delivered for consumption in dashboards, BI tools, AI and advanced analytics.
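To make one hop of such a pipeline concrete, here is a minimal sketch of consuming semi-structured events from a Kafka topic and landing them in a data lake folder. The broker, topic, consumer group and path names are illustrative assumptions, not a reference to any particular deployment.

```python
# Minimal, hypothetical ingestion hop: read JSON events from a Kafka topic
# and land them as newline-delimited JSON in a data lake landing zone.
# Broker, topic, group and path names are illustrative assumptions.
import json
import os
from datetime import datetime, timezone

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "iot-sensor-events",                     # assumed topic name
    bootstrap_servers=["broker1:9092"],      # assumed broker address
    group_id="lake-landing-job",             # assumed consumer group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,              # stop iterating after 10s of silence
)

# Pull whatever is currently available into one micro-batch.
batch = [message.value for message in consumer]

# Write the micro-batch to a timestamped file in the landing zone.
os.makedirs("landing_zone", exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
with open(f"landing_zone/iot_events_{stamp}.jsonl", "w") as f:
    for record in batch:
        f.write(json.dumps(record) + "\n")
```

A real pipeline strings together many such hops across ingestion, transformation, storage and serving tools, which is exactly where the fragility described below creeps in.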

A final layer of complexity: More and more of these pipelines rely on elastic cloud object stores and cloud compute nodes. This leads to hybrid and multi-cloud environments that still must integrate with legacy on-premises systems. All told, data pipelines become fragile webs of many interconnected elements.

Symptoms of Overload

This complexity and rising tide of data can overwhelm enterprise teams that manage the infrastructure, data pipeline and consumption layers.

Infrastructure layer. Platform engineers and site reliability engineers (SREs) struggle to support data pipelines at scale with distributed compute and storage resources. They use open source or commercial tools to monitor resource availability, utilization and performance in isolation. But they often cannot correlate those metrics across heterogeneous environments or gauge their impact on data pipeline flows. This lack of visibility leads to issues, outages and broken SLAs.
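As a hypothetical illustration of that isolation, the snapshot below gathers host-level metrics for a single node. The numbers tell an SRE that the node is busy, but nothing about which pipeline stage, dataset or SLA is affected.

```python
# Hypothetical sketch of "monitoring in isolation": host-level metrics are easy
# to collect, but nothing here ties them back to a pipeline stage or SLA.
import psutil  # pip install psutil

def snapshot_host_metrics() -> dict:
    """Capture point-in-time utilization for one node."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

metrics = snapshot_host_metrics()
print(metrics)
# e.g. {'cpu_percent': 73.5, 'memory_percent': 88.1, 'disk_percent': 64.0}
# High values say the node is busy, but not which pipeline, dataset,
# or downstream dashboard is affected.
```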

Data pipeline layer. Architects and data engineers struggle to diagnose and remediate bottlenecks. They monitor data processing flows and performance with Apache Spark, Apache Kafka and various commercial tools. But once again, those isolated views cannot explain how issues relate across heterogeneous components. They do not see the role of underlying resources such as compute and storage, or the impact of pipeline latency and throughput on actual analytics consumption. Consequently, data timeliness and quality suffer.
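One pipeline-level signal that teams commonly watch is consumer-group lag on a Kafka topic. The hypothetical sketch below computes that lag with kafka-python; it reveals that the pipeline is falling behind, but not whether the cause is resource starvation underneath or which dashboards suffer downstream. The broker, topic and group names are assumptions for illustration.

```python
# Hypothetical pipeline-level health check: consumer-group lag on a Kafka topic.
# Lag shows that processing is falling behind, but not why, and not which
# downstream consumers are affected. Names are illustrative assumptions.
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

TOPIC = "clickstream-raw"                     # assumed topic name

consumer = KafkaConsumer(
    bootstrap_servers=["broker1:9092"],       # assumed broker address
    group_id="clickstream-transformer",       # assumed consumer group
    enable_auto_commit=False,
)

partitions = [
    TopicPartition(TOPIC, p)
    for p in (consumer.partitions_for_topic(TOPIC) or [])
]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

for tp in partitions:
    committed = consumer.committed(tp) or 0     # last offset this group committed
    lag = end_offsets[tp] - committed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```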

Consumption layer. BI analysts, data scientists and business managers struggle to make decisions when analytics output arrives late or cannot be trusted. They escalate issues to the VP of Analytics, Chief Data Officer or business executives, who in turn press infrastructure and data teams that cannot provide conclusive answers or a path to improvement. Application performance management (APM) tools provide insufficient visibility into the root cause of issues.

These enterprise teams lack the time and skills to stitch together multiple tools or develop custom full-stack views themselves. Teams communicate with each other but lack a common language and platform to collaborate.

So far, enterprises have responded to these problems with a patchwork of partial fixes. They rented elastic cloud compute resources to ease bottlenecks. They patched together monitoring views by customizing multiple tools. And they applied fast new engines, such as Apache Arrow's in-memory columnar processing, to key parts of their pipelines. This is like putting a band-aid on a tumor. The problem may be hidden temporarily, but it's only going to get worse. Ultimately, it can only be addressed at the root cause.
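To be clear about that last fix: the short example below, with made-up data, shows the kind of local, columnar speed-up Apache Arrow delivers on a single processing step. It is genuinely fast, and it does nothing to improve visibility across the pipeline as a whole.

```python
# Illustrative Apache Arrow example: columnar, in-memory aggregation on a small
# table of made-up sensor readings. Fast for this one step, but it gives no
# visibility into the rest of the pipeline.
import pyarrow as pa  # pip install pyarrow

readings = pa.table({
    "sensor_id": ["a", "a", "b", "b", "b"],
    "temperature_c": [20.1, 20.7, 35.4, 36.0, 35.8],
})

# Columnar group-by without converting to row-oriented Python objects.
avg_by_sensor = readings.group_by("sensor_id").aggregate(
    [("temperature_c", "mean")]
)
print(avg_by_sensor.to_pydict())
# {'sensor_id': ['a', 'b'], 'temperature_c_mean': [20.4, 35.73...]}
```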

What’s Next?

It is time to take a comprehensive, end-to-end look at the issue.

Enterprises need data pipeline observability to achieve full-stack monitoring and control of all the elements that drive AI and analytics data workloads. They need to share common and intuitive views of data pipelines, and collaborate to anticipate, prevent and resolve issues. They need to observe data pipelines across the infrastructure, data and consumption layers, and across heterogeneous components.
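As a toy illustration of what observing "across layers" implies, the sketch below aligns an invented infrastructure metric (node CPU) with an invented pipeline metric (consumer lag) on a shared time axis, so the two can be read together. Real observability platforms do far more than this, but the principle of correlation is the same.

```python
# Toy illustration of cross-layer correlation: align an infrastructure metric
# (node CPU) with a pipeline metric (consumer lag) on a shared time axis so
# both can be read together. The sample data and field names are invented.
from datetime import datetime

infra_metrics = [  # from the infrastructure layer (hypothetical samples)
    {"ts": datetime(2020, 11, 17, 10, 0), "node": "worker-3", "cpu_pct": 52},
    {"ts": datetime(2020, 11, 17, 10, 5), "node": "worker-3", "cpu_pct": 97},
]
pipeline_metrics = [  # from the pipeline layer (hypothetical samples)
    {"ts": datetime(2020, 11, 17, 10, 0), "stage": "transform", "lag": 1_200},
    {"ts": datetime(2020, 11, 17, 10, 5), "stage": "transform", "lag": 48_000},
]

# Index infrastructure samples by timestamp, then attach them to pipeline samples.
cpu_by_ts = {m["ts"]: m for m in infra_metrics}
for p in pipeline_metrics:
    infra = cpu_by_ts.get(p["ts"], {})
    print(f'{p["ts"]} stage={p["stage"]} lag={p["lag"]} '
          f'cpu_pct={infra.get("cpu_pct")} node={infra.get("node")}')
# The 10:05 row shows lag spiking at the same moment CPU saturates, which is
# the kind of cross-layer correlation that isolated tools cannot surface.
```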

Data pipeline observability can help platform engineers and site reliability engineers monitor and ensure infrastructure reliability, efficiency and capacity. It can help architects and data engineers improve data access, quality and lineage. It can help data consumers understand why issues arise, and bolster their confidence that such issues can be resolved.

That is the intended value of data pipeline observability. In the next blog, we’ll unpack what data pipeline observability means in practice, assess its feasibility, and compare it with current solutions for DataOps, APM and ITOps.

To learn more in the meantime, you can register for Acceldata’s webinar, “The Role of Observability for Analytics & AI,” on Tuesday, November 17, 2020, at 10 am PT / 1 pm ET.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...
