
Making Data Observability Work for Data Lakes and the Hybrid Cloud

New users, applications, devices, platforms and clouds push data pipelines to the breaking point. Groaning under the weight of data volumes and variety, these pipelines often fail to support modern analytics and AI workloads.

Enter data observability. As defined in an earlier blog, this emerging paradigm monitors and correlates data events across application, data, and infrastructure layers to detect, predict, prevent, and resolve issues. Data observability tools help optimize performance, reduce costs, and increase data team productivity for large-scale enterprise data systems. The need is most acute in two environments: on-premises data lakes and hybrid/multi-cloud platforms. This blog explores requirements, use cases and guiding principles for applying data observability to both.

First, some background. Enterprises built the first generation of data lakes on premises five to ten years ago, leveraging the Apache Hadoop ecosystem of open-source components. Many data teams still maintain workloads on Hadoop thanks to data gravity, despite the complexity of managing them. At the same time, they embrace cloud data platforms such as Azure Synapse, Databricks and Snowflake for new data science use cases. Enterprises often straddle multiple clouds as well as legacy on-premises systems.

On-Premises Data Lakes

Performance is a real challenge for the on-premises data lake. For example, while Spark accelerates batch processing compared with its predecessor MapReduce, it often needs oversight to support analytics and AI at scale. Data observability can significantly improve data lake speed and reliability. Here are three use cases, each paired with a guiding principle, for optimizing data lake performance.

Data pipelines. Data architects and data engineers can use data observability tools to easily and efficiently collect thousands of pipeline events, correlate them, then identify and inspect anomalies or spikes. This helps predict and fix issues—for example, with Spark processors and the many components on which they depend.
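To make the "identify anomalies or spikes" step concrete, here is a minimal Python sketch. It assumes the tool has already collected one metric per pipeline run (the job durations below are made up) and flags outliers with a modified z-score based on the median. The function name, the sample data and the 3.5 cutoff are illustrative assumptions, not the method of any particular observability product.

```python
from statistics import median


def find_spikes(values, cutoff=3.5):
    """Return the indexes of values whose modified z-score exceeds `cutoff`.

    The score is 0.6745 * (value - median) / MAD, where MAD is the median
    absolute deviation. Only upward spikes are reported.
    """
    if not values:
        return []
    mid = median(values)
    mad = median([abs(v - mid) for v in values])
    if mad == 0:
        # No spread around the median, so no spike can be scored.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * (v - mid) / mad > cutoff
    ]


# Hypothetical Spark job durations in seconds, one per run.
durations = [42, 40, 45, 41, 43, 39, 44, 42, 180, 41]
print(find_spikes(durations))
```

A median-based score is used here instead of a mean-based one because a single large spike inflates the mean and standard deviation enough to hide itself in a small sample.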

Use performance findings to adjust people and processes rather than just technology. Train users on best practices for efficiency and devise processes to share resources more efficiently.

Infrastructure. DevOps, platform and site reliability engineers can use data observability applications to monitor memory, CPUs, storage, clusters, and nodes, as well as their interactions. This helps mitigate congestion, outages, and runaway workloads for Spark, YARN schedulers or the Hadoop Distributed File System (HDFS).

Identify opportunities to move to the cloud. For example, it’s now possible to support bursty workloads with elastic compute rather than dedicating CPUs on premises. The efficiency gains might justify the cost of migration.

Service Level Agreements (SLAs). Data consumers demand rigorous latency, throughput, and uptime commitments. Data architects and data engineers struggle to supply the right pipelines and infrastructure. Data observability tools can improve transparency to help both sides better understand what’s possible and agree on achievable SLAs.
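As a rough illustration of the transparency this brings to SLA discussions, the Python sketch below compares observed pipeline behavior with proposed targets. The function names, the sample latencies and the target values are all hypothetical; a real tool would pull these measurements from its own event store.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(latencies_s, successful_runs, total_runs,
               max_p95_s, min_success_rate):
    """Compare observed latency and success rate with proposed targets."""
    p95 = percentile(latencies_s, 95)
    success_rate = successful_runs / total_runs
    return {
        "p95_latency_s": p95,
        "success_rate": success_rate,
        "latency_ok": p95 <= max_p95_s,
        "success_ok": success_rate >= min_success_rate,
    }


# Hypothetical observations for one pipeline.
latencies = [30, 32, 35, 31, 29, 33, 36, 34, 90, 30]
report = sla_report(latencies, successful_runs=199, total_runs=200,
                    max_p95_s=60, min_success_rate=0.99)
print(report)
```

In this made-up case the success-rate target is met but the latency target is not, which is the kind of evidence both sides can use to renegotiate the commitment or fund the work needed to meet it.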

Involve your line-of-business and IT leaders in these discussions to help guide cost modeling, capacity planning and architectural design decisions.

Hybrid and Multi-Cloud Environments

Data teams need to optimize cloud performance, for example, to meet machine learning throughput and latency requirements. High compute prices force companies to measure costs, ration resources and streamline data pipelines. This can be tricky because enterprises often ingest data from on-premises sources and even share datasets and workloads across clouds. Data observability tools synthesize signals across layers and environments, and teams can use this intelligence to reduce costs, improve reliability, and adhere to service level objectives (SLOs).

Cost modeling and chargeback. Data observability applications can help line-of-business and IT leaders collaborate with BI analysts, data architects, and data engineers to accurately estimate individual team costs of compute, storage and memory consumption, as well as data transfers. This supports more efficient budgeting and chargeback processes.
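A simple chargeback calculation might look like the Python sketch below. It assumes usage has already been metered per team and per resource; the unit rates, team names and record layout are invented for illustration and do not reflect any cloud provider's pricing.

```python
from collections import defaultdict

# Hypothetical unit rates in dollars (assumptions, not real prices).
RATES = {
    "compute_hours": 0.50,
    "storage_gb_month": 0.02,
    "transfer_gb": 0.09,
}


def chargeback(usage_records, rates=RATES):
    """Sum metered usage multiplied by unit rate for each team."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["team"]] += record["quantity"] * rates[record["resource"]]
    # Round to cents for reporting.
    return {team: round(cost, 2) for team, cost in totals.items()}


# Hypothetical metered usage for one month.
usage = [
    {"team": "marketing", "resource": "compute_hours", "quantity": 120},
    {"team": "marketing", "resource": "storage_gb_month", "quantity": 500},
    {"team": "finance", "resource": "compute_hours", "quantity": 40},
    {"team": "finance", "resource": "transfer_gb", "quantity": 200},
]
print(chargeback(usage))
```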

Address cost modeling and chargeback questions before migrating or spinning up new workloads on the cloud to align on budget and accountability.

Capacity planning. DevOps, platform and site reliability engineers can use data observability to identify bottlenecks, redistribute workloads—for example, leveraging discounted spot instances—and avoid cost overruns. They can also improve planning for future needs based on available capacity and necessary buffer.
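The idea of planning around "available capacity and necessary buffer" can be sketched as a simple projection. The Python example below assumes steady compound monthly growth, which is a simplification, and every figure in it is hypothetical.

```python
def months_until_buffer_breached(current_use, capacity,
                                 monthly_growth, buffer_fraction=0.2):
    """Return how many whole months pass before projected use exceeds
    capacity minus the reserved buffer, or None if use never grows."""
    usable = capacity * (1 - buffer_fraction)
    if current_use > usable:
        return 0
    if monthly_growth <= 0:
        return None
    months = 0
    use = current_use
    while use <= usable:
        use *= 1 + monthly_growth
        months += 1
    return months


# Hypothetical cluster: 600 TB used of 1,000 TB, growing 5% per month,
# with 20% of capacity held back as buffer.
print(months_until_buffer_breached(600, 1000, 0.05))
```

A real plan would replace the fixed growth rate with trends observed by the tooling, but even this rough estimate shows how much lead time a team has before it must add or rebalance capacity.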

Plan for growth, and lots of it. Digital transformation and data democratization have created voracious demand for analytics and AI among business teams. Get ahead of this trend and anticipate future requests before they arrive.

Architectural design. Employing effective data observability tools gives data architects and data engineers the opportunity to step back from daily firefights to design better architectures that meet future business requirements. Data teams can now select, deploy, and configure new platforms based on observed workload behavior, scenario modeling, and impact analysis.

Make this a cross-functional effort. Cast a wide net to fully incorporate future business needs. Well-architected cloud environments that satisfy just half the business won’t be sufficient.

Netting It Out

Planned and implemented well, data observability tools can increase efficiency, productivity and analytics value. But you need to think bigger than the technology. Employing best-in-class data observability applications gives data teams the ability to think strategically and holistically, driving longer-range decisions rather than just easing the pain of daily combat. Winning enterprises will adopt a data observability approach to transform data from a liability to a true business asset.

To learn more about data observability, join this Eckerson Group webinar, “The Rise of Data Observability: Improving the Scale, Performance, and Accuracy of Modern Data Pipelines.”

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
