Developing a Robust Data Quality Strategy for Your Data Pipeline Workflows
ABSTRACT: A robust data workflow testing strategy helps ensure the accuracy and reliability of data processed within a data pipeline.
Developing a robust data workflow testing strategy is crucial to ensuring the accuracy and reliability of data processed within a data pipeline.
The strategy should be a comprehensive plan to test all aspects of data quality, from individual code units to workflows to full data pipelines. In this blog, we define the data quality (DQ) testing functions that support data extractions, transformations, processing, and loading. You can view these functions as the backbone of data quality observability, which is part of overall data observability as defined by the Eckerson Group. The blog “The Rightful Role of Data Observability within DataOps” describes how data observability assists DataOps processes such as continuous integration and deployment (CI/CD) of pipeline code. While data quality is hardly a new problem, startup vendors such as Monte Carlo, Acceldata, and Unravel offer new machine learning techniques to optimize data quality for hybrid, cloud, and multi-cloud environments.
Let’s start with some definitions. A data pipeline comprises "workflows" that execute in a defined order to process and move data from sources to destinations; typically, these workflows are collections of small, self-contained tasks. Data flow functions, meanwhile, are specific tasks or sets of jobs that workflows perform as part of a larger data pipeline. For example, a data workflow could include functions for extracting data from a database and transforming the data into a usable format. To ensure data quality across pipelines, workflows, and functions, we must test the dimensions of accuracy, completeness, conformity, consistency, integrity, precision, timeliness, and uniqueness (see Table 1 below).
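To make these terms concrete, here is a minimal sketch (not taken from any particular orchestration tool) that models a workflow as an ordered list of data flow functions; all function, field, and table names are hypothetical.

```python
# Toy model of pipeline -> workflow -> data flow functions.
# All names and sample records are hypothetical.

def extract_orders():
    """Data flow function: pull raw records from a (simulated) source."""
    return [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "7.25"}]

def transform_orders(rows):
    """Data flow function: cast amounts to float so downstream steps can aggregate."""
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load_orders(rows):
    """Data flow function: stand-in for a load into a warehouse table."""
    print(f"loaded {len(rows)} rows")

# A workflow is an ordered collection of small, self-contained tasks.
orders_workflow = [extract_orders, transform_orders, load_orders]

def run_workflow(steps):
    """Run each step in order, passing the output of one step to the next."""
    data = None
    for step in steps:
        data = step() if data is None else step(data)
    return data

run_workflow(orders_workflow)
```

In a real pipeline, an orchestrator such as Airflow or Dagster would handle the ordering, retries, and scheduling that this toy runner glosses over.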
Data errors often arise from schema changes, null values, duplication, or database loading issues (Figure 1). These errors can happen within one or multiple components of the data pipeline, including sources, targets, processing (e.g., transformations, aggregations), and loads to storage destinations. See “Data Quality Issues that Kill Your Machine Learning Models” for an elaboration.
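As an illustration of how such errors might be caught before they propagate, the sketch below checks a batch of in-memory records for schema drift, nulls, and duplicates; the expected schema and sample rows are invented for this example.

```python
# Illustrative checks for three common error sources: schema drift, nulls, duplicates.
# The expected schema and sample rows are invented for this example.

EXPECTED_SCHEMA = {"id", "email", "created_at"}

rows = [
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-01"},
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-01"},          # duplicate id
    {"id": 2, "email": None, "created_at": "2024-01-02", "source_file": "x"},  # null + new column
]

def schema_drift(rows):
    """Columns that appeared or disappeared relative to the expected schema."""
    actual = set().union(*(r.keys() for r in rows))
    return {"unexpected": actual - EXPECTED_SCHEMA, "missing": EXPECTED_SCHEMA - actual}

def null_violations(rows, required=("id", "email")):
    """Rows where a required column is null."""
    return [r for r in rows if any(r.get(col) is None for col in required)]

def duplicate_rows(rows, key="id"):
    """Rows whose key value has already been seen."""
    seen, dupes = set(), []
    for r in rows:
        if r[key] in seen:
            dupes.append(r)
        else:
            seen.add(r[key])
    return dupes

print(schema_drift(rows))
print(null_violations(rows))
print(duplicate_rows(rows))
```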
Today, most data pipelines perform "tests" within each workflow to ensure data quality. Many of these tests are designed as an integral part of production system functionality: data stewards or data engineers design them to ensure that erroneous data is identified, blocked, scrubbed, fixed, or simply logged as the pipeline runs. Such tests are necessary in part because the data flowing into pipelines may come from untrusted systems.
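A minimal sketch of such an in-workflow test is shown below, assuming a simple quarantine-and-log policy; the validation rules and record structure are placeholders, not a prescription.

```python
import logging

# Hypothetical in-workflow data quality gate: identify, block (quarantine), and log bad records.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq_gate")

def validate_row(row):
    """Return a list of rule violations for one record (rules are illustrative)."""
    issues = []
    if row.get("amount") is None:
        issues.append("amount is null")
    elif row["amount"] < 0:
        issues.append("amount is negative")
    return issues

def run_with_dq_gate(rows):
    """Pass clean rows downstream; quarantine and log the rest instead of failing the run."""
    clean, quarantined = [], []
    for row in rows:
        issues = validate_row(row)
        if issues:
            quarantined.append((row, issues))
            log.warning("blocked row %s: %s", row, "; ".join(issues))
        else:
            clean.append(row)
    return clean, quarantined

clean, bad = run_with_dq_gate([{"amount": 12.0}, {"amount": -3.0}, {"amount": None}])
```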
Data pipeline tests are closely linked to the quality of stored data. To quantitatively assess the dimensions of quality, data teams should define both the measures and the measurement methods used.
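For example, a completeness measure could be defined as the share of non-null values per required column, with an agreed pass threshold. The sketch below shows one possible measurement method; the 99% threshold and column names are assumptions for illustration.

```python
# One possible measure and measurement method for the "completeness" dimension:
# the share of non-null values per required column, compared to an agreed threshold.
# The 99% threshold and column names are assumptions for illustration.

def completeness(rows, required_columns, threshold=0.99):
    results = {}
    for col in required_columns:
        non_null = sum(1 for r in rows if r.get(col) not in (None, ""))
        ratio = non_null / len(rows) if rows else 0.0
        results[col] = {"ratio": round(ratio, 4), "passes": ratio >= threshold}
    return results

sample = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
print(completeness(sample, ["id", "email"]))
```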
The ETL process can also help ensure the quality of data that arrives and resides in the data warehouse (DW). A malfunctioning ETL process may result in data not being transferred to the DW, or not being transferred promptly. With that in mind, it is essential to assess how accurate and complete the data is. This includes assessing whether the pipeline extracts all the required data from its sources, stores it in the staging area, transforms it, and subsequently loads it into the DW, and then comparing that target data with the source.
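A simple form of that source-to-target comparison is a reconciliation query run against both sides. The sketch below uses in-memory SQLite databases as stand-ins for the source system and the DW; the table and column names are made up.

```python
import sqlite3

# Reconciliation sketch: compare record counts and a simple aggregate between
# source and target after a load. SQLite stands in for the real source and DW.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
for conn in (src, tgt):
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.5), (2, 7.25)])

def reconcile(table):
    """Row-count and amount-sum comparison between source and target."""
    query = f"SELECT COUNT(*), ROUND(SUM(amount), 2) FROM {table}"
    source_totals = src.execute(query).fetchone()
    target_totals = tgt.execute(query).fetchone()
    return {"source": source_totals, "target": target_totals, "match": source_totals == target_totals}

print(reconcile("orders"))
```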
Figure 1: Modern Data Pipeline Components (Source: The Complexities of Modern Data Pipelines)
How to leverage data quality dimensions to enhance test coverage of data pipeline workflows
Dimensions such as accuracy, completeness, and consistency help teams identify potential issues with the data and develop specific test cases to verify that the data meets the required quality standards. These data quality dimensions contribute to test planning by providing a framework for identifying and evaluating data characteristics. In addition, using data quality dimensions early in the test planning process can help identify and address data issues before they become major problems during later project phases. It can also help evaluate the suitability and quality of various datasets for testing and flag any potential issues or areas of concern. Data stewards or data engineers can then take the necessary data preparation or cleaning steps before testing begins.
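One lightweight way to act on this is to keep a test-plan registry that maps each dimension to the concrete checks planned for a dataset, as in the hypothetical sketch below; the dimension names come from Table 1, while the check functions are placeholders.

```python
# Hypothetical test-plan registry: each data quality dimension maps to the
# concrete checks planned for a dataset. The checks themselves are placeholders.

def no_null_ids(rows):
    return all(r.get("id") is not None for r in rows)

def unique_ids(rows):
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))

TEST_PLAN = {
    "completeness": [no_null_ids],
    "uniqueness": [unique_ids],
}

def run_plan(rows):
    """Run every planned check, grouped by dimension."""
    return {dim: {chk.__name__: chk(rows) for chk in checks} for dim, checks in TEST_PLAN.items()}

print(run_plan([{"id": 1}, {"id": 2}]))
```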
Many researchers have contributed to the understanding of data quality problems. They have collectively identified general causes at the following stages of data integration planning and execution:
Schema design and modeling
Data source profiling
Data staging and ETLs
Data transformations, cleansing, and enrichment
Data reporting
Table 1 displays the most common dimensions of data quality, as well as tests that help identify data issues at each stage listed above. Data engineers, developers and other implementers of data integration solutions can review and analyze these potential issues before releasing their projects. From the sample tests shown in Table 1, developers and data testing specialists can find ideas for creating test conditions.
Because of the complexity and processing logic needed to manage and control these data quality dimensions, many organizations rely on data discovery and quality management tools. Each supports a certain level of automation, and some can perform multiple types of tests. Popular testing tools include Acceldata, Appian, Katalon, Kobiton, Kualitee, Monte Carlo, Micro Focus, Qaprosoft, Sauce Labs, Selenium, Telerik Test Studio, TestArchitect, and Unravel.
Table 1: Data Quality Dimensions - Testing Checklist

| Data Quality Dimension | Description | Examples of Data Quality Workflow Tests |
|---|---|---|
| Accuracy | The extent to which data is correct, reliable, and certified | Data element precision checks; data values defined and populated per requirements; data elements correspond to correct data types; null or blank values are as expected; truncated data is as expected; all data transformations are correct; data in the source matches the target precisely |
| Completeness | The extent to which elements associated with entities (e.g., a table) comprise all expected values for all expected attributes | Record counts for sources and targets match; data integrity checks for all keys; out-of-boundary checks for data; SCD and CDC checks; record counts as expected; null value checks for all columns |
| Conformity | Data conforms to business rules and specified formats to meet users' expectations | Correct implementation of business rules; correct date/time formats; data types represent requirements; data values per column meet specifications; report formats comply with requirements; data elements conform to data models; metadata conforms to requirements across environments |
| Consistency | The extent to which data attribute values are free from contradiction and coherent with other data in a specific context of use | Source-to-target field mapping checks; values of all measures present in the data set; one-to-one, many-to-one, and many-to-many data value checks; slowly changing dimensions (SCD) consistently applied; data values conform to data types and data lengths; data equivalence maintained during data movements |
| Integrity | The extent to which relationships among data elements and records remain valid and uncorrupted | Data aggregation rules applied correctly; correctness of referential integrity (e.g., primary and foreign keys); data element values are within specified ranges; data values conform to business rules; correctness of concatenated data from multiple sources/fields; referential integrity checks across related tables or entities; data is not corrupted and can be accessed without error; data meets specified business or regulatory requirements |
| Precision | The measurement or classification detail used to specify an attribute's values | Numeric field precisions calculated and presented correctly; numeric data precisions per requirements; data trimming is correct; data values not truncated |
| Timeliness | Data reflects times and dates correctly, and data is available within the required time frame | Date and time values loaded as defined; date and time values within the boundaries of requirements; time/date formats defined as required; date/time fields not null or blank; most recent dates and times are represented |
| Uniqueness | The extent to which records and key values are not duplicated | Duplicate values checked for all fields and records; duplicate records/fields removed where specified; duplicate primary, foreign, and surrogate keys do not exist; surrogate keys are assigned where needed |
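To show how checklist entries translate into executable test conditions, here is a small sketch implementing two of them (referential integrity and duplicate keys) over in-memory records; all table, key, and function names are illustrative.

```python
# Two checklist entries from Table 1 expressed as executable checks over
# in-memory records: referential integrity and duplicate keys.
# All table, key, and function names are illustrative.

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"customer_id": 10}, {"customer_id": 11}]

def orphaned_rows(child, parent, fk, pk):
    """Integrity: child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent}
    return [row for row in child if row[fk] not in parent_keys]

def duplicate_keys(rows, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        (dupes if row[key] in seen else seen).add(row[key])
    return dupes

print(orphaned_rows(orders, customers, fk="customer_id", pk="customer_id"))  # flags order 2
print(duplicate_keys(orders, key="order_id"))                                # set()
```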
Using a data quality dimensions checklist when planning a testing strategy can help ensure that your data is accurate, complete, and reliable – i.e., that it meets the requirements of each data quality dimension. This checklist can serve as a guide to help you assess the quality dimensions of your data pipelines. It can help you identify any issues or discrepancies in the data, allowing you to take action to correct them before the data is used for analysis or decision-making. Using the checklist and guidelines in this blog, you can ensure that the data meets your organization’s specific requirements and standards.
Guest blogger bio:
Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, Tricentis, and Eckerson Group.