Three Data Quality Automation Tools You Should Consider
ABSTRACT: Traditional techniques for managing data quality break at scale. This article profiles three tools and approaches that use ML to automate data quality.
Traditional techniques for managing data quality break at scale. Thankfully, machine learning (ML) offers a way forward. By using ML algorithms to automate data quality workloads, organizations can reassert control over their data pipelines and deliver reliable data to business decision makers.
Traditionally, data engineers write data quality rules using SQL. This manual method works well when there are dozens or even hundreds of tables, but not when there are ten thousand or more. In a modern, data-driven organization, data engineers can never keep up with demand for data quality scripts.
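To make the scaling problem concrete, here is a minimal sketch of what hand-written rules look like in practice. The table name, column names, and rule names are hypothetical, and an in-memory SQLite database stands in for a real warehouse. Every rule is a separate SQL statement that someone must write and maintain for each table.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com")],
)

# One hand-written SQL rule per check, per column, per table.
RULES = {
    "customer_email_not_null":
        "SELECT COUNT(*) FROM orders WHERE customer_email IS NULL",
    "order_id_unique":
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)",
}


def run_rules(connection, rules):
    """Run each rule and return the number of violations it found."""
    return {
        name: connection.execute(sql).fetchone()[0]
        for name, sql in rules.items()
    }


violations = run_rules(conn, RULES)
print(violations)
```

Two rules for one small table is manageable. Multiplying this dictionary across ten thousand tables is where the manual method breaks down.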
New data quality automation (DQA) tools replace manual methods with ML models. You can view this product segment as a subset of data observability, which addresses both data quality and data pipeline performance. There are three different approaches to DQA: automated checks, automated rules, and automated monitoring. Each approach has its pros and cons, but collectively they represent the future of data quality management.
The following three vendors embody these approaches, respectively.
Ataccama employs the automated checks approach, which uses ML to classify incoming data at the row and column level and automatically apply data quality rules written by data engineers.
FirstEigen uses the automated rules approach, which uses ML to generate data quality rules. These rules cover standard quality checks (nulls, duplicates, etc.) as well as complex correlations between data columns and values.
Bigeye uses the automated monitoring approach, which uses ML to detect anomalies in the data rather than apply rules. The tool monitors changes to tables at fixed intervals, triggering alerts if it detects an uncharacteristic shift in the profile of the data.
Consider the above three vendors when evaluating data quality platforms. Below is a summary profile of each. For an in-depth discussion on data quality automation and these three vendors, see Joe Hilleary’s report, “Deep Dive on Data Quality Automation: Three Tools to Consider” (Eckerson Group, May 2022).
Founded in 2007, Ataccama builds on more than a decade and a half of experience in data quality, employing artificial intelligence to automate key functions in the data quality workflow. From data profiling to data matching and rule mapping, Ataccama provides data quality engineers with tools that supplement, rather than disrupt, the traditional approach to data quality. Data quality automation is only one part of the larger Ataccama ONE platform that also has modules for data governance and master data management. The platform is available on premises, in the cloud, or both, as a hybrid deployment. As a technology company that spun out of a consulting company, Ataccama places a strong emphasis on services in addition to its software, providing close guidance on adapting its vertical-agnostic solution to specific industries.
Ataccama is ideal for customers that:
> Don’t want to throw out their traditional approach to data quality but want to automate select aspects of the data quality workflow.
> Need a solution that supports hybrid deployments and legacy systems.
> Would like to approach data quality automation within the context of a larger platform, rather than through standalone tools.
> Value a strong service partnership with their technology vendors.
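The automated checks approach can be illustrated with a small sketch: classify each column of an incoming table, then apply whichever engineer-written rule matches that class. This is not Ataccama's implementation. A simple pattern-matching function stands in for the ML classifier, and all names (classify_column, RULE_LIBRARY, check_table) are hypothetical.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def classify_column(values):
    """Stand-in for an ML classifier: label a column by majority pattern match."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    for label, pattern in (("email", EMAIL_RE), ("date", DATE_RE)):
        matches = sum(1 for v in non_null if pattern.match(str(v)))
        if matches / len(non_null) >= 0.8:
            return label
    return "unknown"


# Rules written once by engineers, keyed by column class rather than by table.
RULE_LIBRARY = {
    "email": lambda v: v is not None and EMAIL_RE.match(str(v)) is not None,
    "date": lambda v: v is not None and DATE_RE.match(str(v)) is not None,
}


def check_table(table):
    """Classify each column, apply the matching rule, and list failing rows."""
    report = {}
    for column, values in table.items():
        label = classify_column(values)
        rule = RULE_LIBRARY.get(label)
        if rule is None:
            continue  # No rule exists for this class; leave the column unchecked.
        failed = [i for i, v in enumerate(values) if not rule(v)]
        report[column] = {"class": label, "failed_rows": failed}
    return report


new_table = {
    "contact": ["a@example.com", "b@example.com", "c@example.com",
                "d@example.com", "not-an-email"],
    "signup": ["2022-05-01", "2022-05-02", None, "2022-05-04", "2022-05-05"],
    "notes": ["call back", "vip", None, "", "n/a"],
}
report = check_table(new_table)
print(report)
```

The point of the design is that engineers still own the rules, but nobody has to map them to each new table by hand.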
Founded in 2015, FirstEigen’s DataBuck platform automates the writing of data quality rules. Using machine learning, the platform identifies the characteristics of individual tables and generates complex rule sets covering the nine most significant classes of data errors. FirstEigen focuses on providing data quality checks for large volumes of operational and transactional data. Its solution deploys both on-premises and in the cloud, and customers most often use it with cloud data lakes. DataBuck rechecks the data at every stage of refinement, performing in situ analysis, rather than moving data out of the client environment. FirstEigen provides a robust set of APIs that allow engineers to integrate DataBuck with other tools and build it directly into data pipelines. The tool also provides a no-code interface that allows users to monitor aggregated data quality indices for their tables, provide feedback to improve the rule-writing algorithm, and view root-cause analyses for flagged errors.
DataBuck is ideal for customers that:
> Want to automate the discovery and writing of most of their data quality rules.
> Need a data quality solution for a cloud data lake environment.
> Would like to apply data quality checks at multiple stages of their data pipelines.
> Require a solution that performs data quality analysis where the data resides.
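As a rough illustration of the automated rules approach, the sketch below profiles a trusted sample of rows and derives rules from it, then applies those rules to a new batch. It is not DataBuck's algorithm. Plain heuristics stand in for the ML, it covers only null, uniqueness, and range checks (not cross-column correlations), and the function names and sample data are hypothetical.

```python
def generate_rules(rows, null_tolerance=0.0):
    """Derive simple rules from a trusted sample of rows (a list of dicts)."""
    rules = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        # Column was (almost) never null in the sample: expect it to stay so.
        if len(values) - len(non_null) <= null_tolerance * len(values):
            rules.append(("not_null", col, None))
        # Every sampled value was present and distinct: expect uniqueness.
        if non_null and len(set(non_null)) == len(non_null) == len(values):
            rules.append(("unique", col, None))
        # Numeric column: expect new values inside the observed range.
        if non_null and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in non_null
        ):
            rules.append(("range", col, (min(non_null), max(non_null))))
    return rules


def find_violations(rows, rules):
    """Return (rule, column, row_index) for each row that breaks a rule."""
    found = []
    for kind, col, arg in rules:
        seen = set()
        for i, row in enumerate(rows):
            v = row[col]
            if kind == "not_null" and v is None:
                found.append((kind, col, i))
            elif kind == "unique" and v is not None:
                if v in seen:
                    found.append((kind, col, i))
                seen.add(v)
            elif kind == "range" and v is not None:
                if not (arg[0] <= v <= arg[1]):
                    found.append((kind, col, i))
    return found


history = [
    {"id": "o-1", "amount": 10.0, "note": None},
    {"id": "o-2", "amount": 25.5, "note": "gift"},
    {"id": "o-3", "amount": 10.0, "note": None},
]
rules = generate_rules(history)

new_batch = [
    {"id": "o-4", "amount": 20.0, "note": None},
    {"id": "o-4", "amount": -5.0, "note": "refund"},
    {"id": "o-5", "amount": None, "note": None},
]
found = find_violations(new_batch, rules)
print(rules)
print(found)
```

A range learned from the minimum and maximum of a small sample is deliberately naive. A feedback loop like the one described above, where users correct the generated rules, is what would keep such rules from raising false alarms.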
Bigeye eschews the rules-based approach to data quality. Instead of pass/fail tests, it uses anomaly detection models to flag unusual behavior. The solution relies on machine learning to identify the metrics most worth monitoring and then provides automated alerts when they exceed dynamically determined thresholds. It minimizes the need for human work by providing a deep library of built-in metrics, but also provides capabilities to customize monitoring for specific business needs. The platform puts the usability of data front and center, allowing stakeholders to agree on enforceable standards that data teams must meet for the metrics that matter most to consumers. The tool, which can deploy on-premises or in the cloud, offers both an intuitive, low-code user interface and a robust set of APIs, so users can interact with the software either manually or programmatically.
Bigeye is ideal for customers that:
> Want to take an SLA-based approach to data quality, rather than a rules-based one.
> Use a modern data stack, especially one centered on Snowflake.
> Need a fully automated tool to scale data quality along with their data.
> Need an API to their data quality tool.
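The idea of alerting on dynamically determined thresholds, rather than fixed pass/fail rules, can be sketched with a rolling statistical band. This is an assumption-laden illustration, not Bigeye's model: a mean plus or minus k standard deviations over the preceding window stands in for the ML, and the metric (daily row counts), the function name, and the parameter values are all hypothetical.

```python
from statistics import mean, stdev


def detect_anomalies(series, window=7, k=3.0):
    """Flag points outside mean +/- k * stdev of the preceding window."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        lower, upper = mu - k * sigma, mu + k * sigma
        if not (lower <= series[i] <= upper):
            alerts.append((i, series[i], round(lower, 1), round(upper, 1)))
    return alerts


# Hypothetical metric: rows loaded into a table each day.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015, 1008, 400, 1002]
alerts = detect_anomalies(daily_row_counts)
for index, value, lower, upper in alerts:
    print(f"day {index}: value {value} outside [{lower}, {upper}]")
```

No one wrote a rule saying that 400 rows is too few. The threshold comes from the table's own recent behavior and moves as that behavior changes. One weakness of this simple version is that an anomalous point widens the band for the days that follow it, which a production model would need to handle.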
Ensuring data quality in a world of big data requires automation. Manual approaches simply can’t keep up with the ever-growing torrents of data coursing through modern enterprises and data-first start-ups. That said, no technology is a silver bullet. Machine learning can’t eliminate the need for human oversight and intervention in data quality. Instead, ML acts as a force multiplier for data engineering teams. Depending on how it’s utilized, it can aid in applying custom rules to new datasets, creating rules from scratch, or even transitioning away from a rule-based data quality paradigm altogether.
Questions to ask DQA vendors:
> How does the vendor define data quality? Data quality can be a squishy term. Make sure that you and the target vendor think about data quality the same way. What do they consider the key aspects of high-quality data?
> What’s being automated? Lots of vendors talk about “automation,” but each automates different parts of the data quality workflow. Determine where in the workflow your engineering team feels most pinched, and look for a tool that supports that step.
> Where is the solution deployed? You should answer this question on two dimensions.
— Physical. Is the solution cloud-based or on-premises? Will it require your data to flow to another server? These are critical questions from a risk and compliance standpoint depending on the kind of data you work with.
— Within the data stack. Data quality tools embed at different points in the flow of data. Many sit on top of data warehouses or other repositories, while others tap into every pipeline. The location of the data quality tool impacts the kinds of trace back and root cause analysis functions it can provide. It also has ramifications for the complexity of deployment.
> What other tools does it integrate with? This should be obvious, but the last thing you want to do is buy an expensive new tool only to discover that it can’t connect to the rest of your data infrastructure. The degree of integration also matters. A tier-one integration is very different from a generic connection template that’s been adapted.
> What affects pricing? Pricing models vary dramatically. Figure out what levers a vendor uses to set its prices, and try to determine what the actual cost will be both in the present and as time goes on. Number of assets, overall volume of data, number of sources, number of users, and computational demands are all common elements used in data quality pricing schemes.