The Blending Disciplines of Data Observability, DataOps, and FinOps
ABSTRACT: Data observability provides intelligence about data quality and data pipeline performance, contributing to the disciplines of DataOps and FinOps.
“Data observability” sounds like a passive, somewhat academic pursuit, the purview of analysts and bean-counters. In fact, data observability drives mission-critical action on several fronts, enabling enterprises to understand and capitalize on fast-moving business opportunities. And it contributes to strategic initiatives such as DataOps and FinOps.
This blog profiles four vendors, DataKitchen, DataOps.live, Informatica, and Unravel, each of which offers data observability as a foundation for bigger and bolder initiatives. The profiles derive from our upcoming digital market landscape report on data observability, and the CDO TechVent we hosted in August. Check out the TechVent and report to learn more and to read additional profiles about Anomalo, Acceldata, Bigeye, FirstEigen, Databand (an IBM company), Monte Carlo, and Soda.
Data observability is an emerging discipline that studies the health of enterprise data environments. It uses techniques adapted from governance tools and application performance management to address modern use cases and hybrid or cloud-only environments. Data observability tools apply machine learning to monitor the accuracy and timeliness of data delivery, with a particular focus on cloud environments. Data observability helps optimize data delivery across distributed architectures for both analytics and operations.
Data observability includes:
- Data quality observability, which studies the quality and timeliness of data. It observes data in flight or at rest, for example by validating sample values and checking metadata such as value distributions, data volumes, schema, and lineage.
- Data pipeline observability, which studies the quality and performance of data pipelines, including the infrastructure that supports them. It observes pipeline elements such as data sources, compute clusters, landing zones, targets, and applications by studying their logs, traces, and metrics.
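To make the two categories above more concrete, here is a minimal sketch in Python of two checks of the kind described: validating sample values against an expected schema, and flagging an unusual data volume. The column names, the sample rows, the row-count history, and the three-standard-deviation threshold are all hypothetical choices for illustration. Commercial tools typically use richer statistical or machine learning models than this simple z-score.

```python
from statistics import mean, stdev

# Hypothetical expected schema for an "orders" feed: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


def schema_violations(rows, schema):
    """Return (row_index, column) pairs where a column is missing or has the wrong type."""
    bad = []
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row or not isinstance(row[col], typ):
                bad.append((i, col))
    return bad


def volume_is_anomalous(history, today, threshold=3.0):
    """Flag today's row count if it lies more than `threshold` sample
    standard deviations from the mean of recent daily counts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variation: any change at all counts as anomalous.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical sample of rows pulled from a landing zone.
sample = [
    {"order_id": 1, "amount": 19.99, "country": "US"},
    {"order_id": 2, "amount": None, "country": "DE"},  # null where a float is expected
    {"order_id": 3, "country": "FR"},                  # column missing entirely
]

print(schema_violations(sample, EXPECTED_SCHEMA))
print(volume_is_anomalous([1000, 1020, 980, 1010, 990], 400))
```

With this sample, the schema check reports the `amount` column in the second and third rows, and the volume check flags a day with 400 rows against a history of roughly 1,000 rows per day.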
Data observability also serves as the monitoring foundation for DataOps, which is an established discipline for building and managing data pipelines. DataOps applies principles of DevOps, agile software development, and total quality management to data pipelines to help deliver timely, accurate data to the business. DataOps comprises testing, continuous integration and deployment (CI/CD), and orchestration. Data observability delivers intelligence that makes each of those elements more effective.
Data observability contributes to FinOps as well. This emerging discipline helps IT and data engineers, finance managers, data consumers, and business owners collaborate to reduce cost and increase the value of cloud-related projects. FinOps instills best practices, automates processes, and makes stakeholders accountable for the cost of their actions. Data teams use FinOps to make cloud-analytics projects more profitable—and they need the intelligence of data observability to achieve that. When you hear cloud costs, think FinOps.
With these definitions in mind, let’s explore the stories of DataKitchen, DataOps.live, Informatica, and Unravel. Each of these vendors has a distinct approach to tackling these disciplines as they integrate with the ecosystem of data management tools.
Chris Bergh, Gil Benghiat, and Eric Estabrooks lived the pain of poor data delivery as they struggled to purge data errors from their prior company’s analytics services. In 2013 Bergh and team founded DataKitchen, a bootstrapped venture, to help companies fix such issues with a DataOps platform based on total quality management and statistical process control. DataKitchen pioneered many aspects of the discipline of DataOps, which seeks to make data pipelines more efficient and effective with continuous integration and delivery (CI/CD); pipeline orchestration; pipeline testing; and data observability. It now espouses an “observability first” approach to DataOps, helping enterprises understand the health and status of existing data flows before building new pipelines.
DataKitchen offers both data pipeline observability and data quality observability across what it calls the “data journey,” from inception to transformation, delivery, and consumption. Enterprises use DataKitchen to reduce the risk of their hybrid data environments by finding and fixing issues before they reach the business owners and applications that consume data. DataKitchen monitors data pipelines in development and production, and enforces user-configured rules to take action when pipeline jobs or data quality fall short of expectations.
DataKitchen also monitors the outputs of operations observability tools such as Datadog to help data teams understand the performance, utilization, and cost of compute clusters and other architectural elements that support pipelines. DataKitchen customers include AstraZeneca, Bristol Myers Squibb, and Catholic Relief Services.
CEO Justin Mullen and CTO Guy Adams co-founded London-based DataOps.live in 2020 as a software spinoff from their data consulting business. Their stated mission: “to make data products and teams as governed and agile as their software counterparts,” with an initial focus on the Snowflake Data Cloud. They also helped shape the philosophy of the TrueDataOps community. DataOps.live differentiates itself by providing a control layer that orchestrates, monitors, and controls various third-party tools and custom pipeline elements. It best serves mid-sized and large enterprises that want to improve productivity with a unified developer and operator experience on Snowflake. Other cloud data platform vendors are on their near-term roadmap.
Modern data environments rely on a broad ecosystem of tools. To help manage data flows into Snowflake, DataOps.live integrates with and orchestrates ingestion tools such as Fivetran and Stitch, transformation tools such as Matillion and dbt, and data quality observability tools such as Monte Carlo and Soda. To help manage data within and across data pipelines, it integrates with catalogs such as Collibra and data.world, notebooks such as Jupyter, AI/ML platforms such as Dataiku, and data access management tools such as Okera and Immuta. The DataOps.live platform enables data engineers and data product managers to configure, build, test, and release data pipelines and data products that comprise these various elements.
DataOps.live offers multiple levels of observability for data pipelines and data products—including pipeline performance, data quality, data usage, and infrastructure cost—by studying trends and anomalies in metadata. To assess the value and viability of a data product, users can refer to a knowledge graph that illustrates the interrelationships of metadata such as source file names, quality scores, and query performance. DataOps.live’s new release, now in private preview, supports “Data Products” as complex objects, traces their lineage, and shares their metadata with tools such as Monte Carlo and Collibra.
The largest and most mature vendor profiled here, Informatica offers data observability tools as part of a comprehensive portfolio for data management and governance. The Redwood City, Calif.-based company started as an ETL provider in 1993. It grew organically and via acquisition to address critical segments such as cataloging, DataOps, data privacy, and master data management, all integrated as modules on an AI-driven platform. Informatica has a long history in data quality observability, focused on structured data at rest and data in motion. It recently extended its offerings to address data pipeline observability, in particular to help enterprises operationalize and scale AI/ML, while controlling the cost of data delivery as they move to the cloud. Various Informatica products already gather relevant metadata, making data observability a natural add-on. All these capabilities are offered as cloud-native services of Informatica’s Intelligent Data Management Cloud.
Many enterprise stakeholders, including data engineers, DataOps engineers, and business owners, use Informatica’s graphical interface and AI-guided prompts for data observability. They study three layers: infrastructure, data pipelines, and business consumption. At the infrastructure layer, Informatica monitors elements such as compute clusters, spots performance issues, and suggests how to fix them. It also helps measure, predict, and control the compute cost of individual jobs that pipelines perform, contributing to FinOps practices. At the pipeline layer, Informatica profiles data, detects anomalies, and helps remediate issues. At the business layer, Informatica tracks consumption by user and dataset to assist compliance with internal policies and external regulations.
Enterprises such as Amgen and Discount Tire rely on Informatica for data observability. It makes about $1.5 billion in annual revenue, has more than 5,000 customers, and trades on the New York Stock Exchange.
CEO Kunal Agarwal and CTO Shivnath Babu founded Unravel in 2013 to automate performance troubleshooting for Big Data pipelines that rely on data lake elements such as Apache Spark. Since then the Unravel team has raised more than $107 million and grown to more than 100 employees. They also broadened their portfolio to address much more than data pipeline performance. Unravel offers “DataOps observability” that assists each stage of the iterative data lifecycle. This includes planning, deployment, operation, monitoring, orchestration, troubleshooting, and optimization. Their solution best serves enterprises that need to improve cross-functional collaboration between data-related roles such as data engineers, data analysts, and data scientists; and business roles such as product and platform owners, operations managers, and chief data or analytics officers.
As part of this strategy, Unravel addresses multiple categories of observability beyond just data pipeline observability. This includes operations observability, which spots and fixes performance issues with data-driven applications or infrastructure elements such as the processing engines and compute clusters that support them. It includes FinOps observability, which offers budget planning dashboards, chargeback reports, alerts, and AI-enabled optimization recommendations to control cloud costs by user, application, or infrastructure element. Unravel also now addresses data quality observability through external integration, starting with the open-source platform Great Expectations. Marquee enterprise customers such as Intel, HSBC, and Kimberly-Clark use Unravel to help data teams optimize data workloads across their stack.
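The chargeback and alerting idea behind FinOps observability can be illustrated with a few lines of Python. This is a generic sketch, not Unravel's product or API: the job records, the field names (`user`, `app`, `cost_usd`), and the budget figures are hypothetical, and a real deployment would draw costs from cloud billing exports or cluster metrics.

```python
from collections import defaultdict


def chargeback(jobs, key):
    """Sum job cost per value of `key`, for example per user or per application."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["cost_usd"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return, in sorted order, the owners whose spend exceeds their budget.
    Owners with no budget entry are never flagged."""
    return sorted(
        owner for owner, spent in totals.items()
        if spent > budgets.get(owner, float("inf"))
    )


# Hypothetical per-job cost records for one billing period.
jobs = [
    {"user": "ana", "app": "churn_model", "cost_usd": 120.0},
    {"user": "ben", "app": "daily_etl", "cost_usd": 45.5},
    {"user": "ana", "app": "daily_etl", "cost_usd": 30.0},
]

by_user = chargeback(jobs, "user")
by_app = chargeback(jobs, "app")
print(by_user)
print(by_app)
print(over_budget(by_user, {"ana": 100.0, "ben": 50.0}))
```

Grouping the same records by a different key gives the per-application view, and the budget comparison is the simplest form of the alerts described above.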
Observing the Horizon
These vendors offer just a taste of the industry convergence that is underway. In the coming year, we should expect data observability to continue to drive bigger, bolder initiatives, from DataOps to FinOps to cataloging, data governance, and beyond.