The Rightful Role of Data Observability within DataOps
ABSTRACT: Enterprises should adopt data observability as the monitoring foundation for a smarter version of DataOps, rather than a standalone initiative.
Data observability generates justified enthusiasm for its promise of enabling the delivery of timely, accurate data.
But to get the right results, enterprises should integrate this emerging discipline into a full DataOps program. Data engineering teams that adopt data observability as the foundation for smarter DataOps—rather than a standalone initiative—can improve the odds they’ll meet stringent business demands. They can build, test, and update better data pipelines, and weave them together with more reliable workflows.
This blog defines the architecture that supports this unified approach.
Definitions
DataOps is an established discipline for building and managing data pipelines. It applies principles of DevOps, agile software development, and total quality management to data pipelines to help deliver timely, accurate data to the business. DataOps comprises testing, continuous integration and deployment (CI/CD), orchestration, and monitoring. My colleagues Wayne Eckerson, Joe Hilleary, and Dave Wells have written extensively about DataOps.
Data observability, meanwhile, is an emerging discipline for studying the health of enterprise data environments, using techniques adapted from governance tools and application performance management. It includes data quality observability, which monitors the quality and timeliness of data. It also includes data pipeline observability, which monitors the quality and performance of data pipelines, including the infrastructure that supports them. My earlier blog, The Five Shades of Observability, described a taxonomy that encompasses these segments as well as business, operations, and ML model observability.
Architecture
Now let’s explore how the DataOps elements of testing, CI/CD, orchestration, and data observability (the new version of monitoring) come together.
Testing. Data engineers build tests into their data pipelines to examine and validate them. They inspect code, execute functions, then assess behavior and results. They also perform A/B tests that compare pipelines so they can choose the best version. DataOps tools such as DataKitchen help test pipelines at each stage of development and production, then spot and fix errors before they affect consumers.
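To make this concrete, here is a minimal sketch of the kind of test a data engineer might build into a pipeline. It is not tied to any particular DataOps tool, and the transform_orders function and its columns are hypothetical examples:

```python
# A minimal sketch of a pipeline test, independent of any specific DataOps tool.
# The transform_orders function and its column names are hypothetical.
import pandas as pd

def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Example transformation: keep valid rows and compute a total column."""
    valid = raw.dropna(subset=["order_id", "quantity", "unit_price"])
    valid = valid[valid["quantity"] > 0]
    return valid.assign(total=valid["quantity"] * valid["unit_price"])

def test_transform_orders_filters_and_computes_totals():
    raw = pd.DataFrame({
        "order_id": [1, 2, None],
        "quantity": [2, -1, 3],
        "unit_price": [10.0, 5.0, 4.0],
    })
    result = transform_orders(raw)
    # Only the first row survives validation.
    assert len(result) == 1
    assert result.iloc[0]["total"] == 20.0
```

Tests like this run at each stage of development and again in production, so broken logic surfaces before consumers see bad data.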
Continuous Integration / Continuous Deployment (CI/CD). Data engineers iterate on both pipelines and datasets to maintain quality standards while relying on a single version of truth for production. They might branch a version of pipeline code into a development platform such as GitHub so they can fix errors or make enhancements, test it again, then merge the revised code back into production. In a similar fashion, they might branch versions of data using new platforms such as Project Nessie, then revise and merge as needed.
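As a rough illustration of that loop, the sketch below drives a branch-test-merge cycle from Python. The branch, file, and test names are hypothetical placeholders, and in practice the test step would run in a CI service rather than a local script:

```python
# A minimal sketch of the branch-test-merge loop for pipeline code.
# Branch, file, and test names are hypothetical; a real setup would run the
# test step in a CI service on every pull request rather than locally.
import subprocess

def run(cmd: list[str]) -> None:
    """Run a shell command and stop if it fails."""
    subprocess.run(cmd, check=True)

# Branch the pipeline code, apply a fix, and re-run the pipeline tests.
run(["git", "checkout", "-b", "fix/orders-null-handling"])
# ... edit pipelines/transform_orders.py ...
run(["python", "-m", "pytest", "tests/test_transform_orders.py"])

# Merge back into the production branch only once the tests pass.
run(["git", "checkout", "main"])
run(["git", "merge", "fix/orders-null-handling"])
```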
Orchestration. Data engineers automate data pipelines by grouping their tasks into workflows that transform and move data between various stores, algorithms, processors, applications, and microservices. Orchestration tools such as Apache Airflow help them build, schedule, track, and reuse workflows. By orchestrating these various workflows and elements, data engineers reduce the repetitive work of managing data pipelines.
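For example, a simple orchestrated workflow in Apache Airflow might look like the sketch below. The task functions, schedule, and DAG name are placeholders for real pipeline steps:

```python
# A minimal sketch of an orchestrated workflow in Apache Airflow (2.x).
# The task functions, schedule, and dag_id are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    ...  # pull raw data from a source system

def transform_orders():
    ...  # clean and enrich the extracted data

def load_orders():
    ...  # write the result to the warehouse

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)

    # Declare the order of tasks; Airflow schedules, retries, and tracks them.
    extract >> transform >> load
```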
Enter data observability
Data observability helps ensure the quality and performance of both data and pipelines.
Let’s start with the data. Data engineers profile datasets—at rest or in flight—to check that formats, data types, and value ranges fit expectations. They use machine learning to hunt for anomalies such as conflicting records, schema changes, or sudden value changes. When they spot issues, they check lineage to find and remediate the root cause. Data observability tools such as Metaplane, Bigeye, Monte Carlo, Anomalo, Acceldata, and Databand help perform this work.
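The tools above automate this profiling and learn thresholds with machine learning, but a hand-rolled sketch shows the basic idea. The column names and expectation values here are illustrative assumptions, not any vendor's API:

```python
# A minimal, hand-rolled sketch of data quality checks: profile a dataset and
# flag values that drift outside expected ranges. Column names and thresholds
# are hypothetical; commercial tools learn these expectations automatically.
import pandas as pd

EXPECTATIONS = {
    "row_count_min": 1_000,           # expect at least this many rows per load
    "null_rate_max": 0.01,            # at most 1% nulls in the key column
    "amount_range": (0.0, 100_000.0), # plausible bounds for order amounts
}

def profile_and_check(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable data quality issues."""
    issues = []
    if len(df) < EXPECTATIONS["row_count_min"]:
        issues.append(f"row count {len(df)} below expected minimum")
    null_rate = df["order_id"].isna().mean()
    if null_rate > EXPECTATIONS["null_rate_max"]:
        issues.append(f"order_id null rate {null_rate:.2%} above threshold")
    lo, hi = EXPECTATIONS["amount_range"]
    if not df["amount"].between(lo, hi).all():
        issues.append("amount values outside expected range")
    return issues
```

When a check fails, lineage tells the data engineer which upstream source or transformation to inspect first.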
Data observability also optimizes the performance of data pipelines. Data engineers monitor indicators such as the status of transformation jobs, utilization of Apache Spark clusters, or availability of Amazon EC2 compute instances. When something goes sideways, they analyze logs and traces to identify the root cause. They also track data usage to manage cost and improve compliance. Tools such as Unravel and Acceldata help perform this work.
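As a simplified illustration rather than any vendor's API, the sketch below compares each job's latest run time against a rolling baseline and flags outliers. The metric names and the 2x threshold are assumptions:

```python
# A minimal sketch of pipeline performance monitoring: compare the latest run
# duration for each job against its historical average and flag outliers.
# Job names, metrics, and the 2x threshold are illustrative assumptions.
from statistics import mean

def flag_slow_jobs(history: dict[str, list[float]], latest: dict[str, float]) -> list[str]:
    """Return job names whose latest duration is more than 2x their baseline."""
    alerts = []
    for job, durations in history.items():
        baseline = mean(durations)
        if latest.get(job, 0.0) > 2 * baseline:
            alerts.append(f"{job}: {latest[job]:.0f}s vs baseline {baseline:.0f}s")
    return alerts

history = {"transform_orders": [310.0, 295.0, 330.0]}
latest = {"transform_orders": 720.0}
print(flag_slow_jobs(history, latest))  # ['transform_orders: 720s vs baseline 312s']
```

An alert like this prompts the engineer to dig into logs and traces for the slow job, or to check the health of the cluster it ran on.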
Better together
By extracting the right signals from lots of noise, data observability makes testing, CI/CD, and orchestration more effective. Here’s how.
Testing becomes more efficient and effective because data engineers better understand the root cause that broke the last pipeline. They also understand the full sequence of pipeline tasks and their typical KPI ranges. Knowledge like this helps data engineers test just the right weak spots so they can put better pipelines into production faster.
CI/CD becomes more efficient and effective because data engineers design pipelines and make enhancements that avoid known risks. They accelerate the branching and merging of pipeline code, with higher confidence as they go into production.
Orchestration becomes more efficient and effective because data engineers have more reliable code to bake into reusable workflows. They reduce the risk that they’ll automate and replicate error-prone pipelines.
Like most hyped-up tech concepts, data observability can lead to disappointment… unless you combine it with older and more proven practices. Data engineering teams that combine data observability with proven DataOps practices might just meet their expectations and deliver new value to business teams.