Developing a Robust Data Quality Strategy for Your Data Pipeline Workflows
ABSTRACT: A robust data workflow testing strategy helps ensure the accuracy and reliability of data processed within a data pipeline.
Developing a robust data workflow testing strategy is crucial to ensuring the accuracy and reliability of data processed within a data pipeline.
The strategy should be a comprehensive plan to test all aspects of data quality, from individual code units to workflows to full data pipelines. In this blog, we define the data quality (DQ) testing functions that support data extractions, transformations, processing, and loading. You can view these functions as the backbone of data quality observability, which is part of overall data observability as defined by the Eckerson Group. The blog “The Rightful Role of Data Observability within DataOps'' describes how data observability assists DataOps processes such as continuous integration and deployment (CI/CD) of pipeline code. While data quality is hardly a new problem, startup vendors such as Monte Carlo, Acceldata, and Unravel offer new machine learning techniques to optimize data quality for hybrid, cloud, and multi-cloud environments.
Let’s start with some definitions. A data pipeline comprises "workflows" that executes in a defined order to process and move data from sources to destinations. Typically these are collections of small self-contained tasks. Data flow functions, meanwhile, are specific tasks or sets of jobs that workflows perform as part of a larger data pipeline. For example, a data workflow could include functions for extracting data from a database and transforming the data into a usable format. To ensure data quality across pipelines, workflows, and functions, we must test the dimensions of accuracy, completeness, conformity, consistency, integrity, precision, timeliness, and uniqueness (see Table 1 below).
Data errors often arise from schema changes, null values, duplication, or database loading issues (Figure 1). These errors can happen within one or multiple components of the data pipeline, including sources, targets, processing (e.g., transformations, aggregations), and loads to storage destinations. See Data Quality Issues that Kill Your Machine Learning Models for an elaboration.
Today, most data pipelines perform "tests" within each workflow to ensure data quality. Many tests today are designed as integral to the production system functionality. Data stewards or data engineers design them to ensure that erroneous data is identified, blocked, scrubbed, fixed, or just logged as the pipeline is run. Such tests are necessary, partly because data that flows into pipelines may come from untrusted systems.
Data pipeline tests are closely linked to the quality of stored data. To quantitatively assess the dimensions of quality, data teams should define both the measures and the measurement methods used.
The ETL process also can ensure the quality of data that arrives and resides in the data warehouse. A malfunctioning ETL process may result in data not being transferred to the DW or not being transferred promptly. With that in mind, it is essential to assess how accurate and complete the data is. This includes assessing whether the pipeline extracts all the required data from its sources, stores it in the staging area, transforms it, and subsequently loads it into the DW–then comparing that target data with the source.
Figure 1 Modern Data Pipeline Components
How to leverage data quality dimensions to enhance test coverage of data pipeline workflows
Dimensions such as accuracy, completeness, and consistency help identify potential issues with the data and develop specific test cases to verify that the data meets the required quality standards. These data quality dimensions can contribute to test planning and provide a framework to identify and evaluate data characteristics. In addition, using data quality dimensions early in the test planning process can help identify and address data issues before they become major problems during later project phases. It also can help evaluate the suitability and quality of various datasets for testing and identify any potential issues or areas of concern. Data stewards or data engineers can then take the necessary data preparation or cleaning steps before they begin testing.
Many researchers have contributed to the understanding of data quality problems. They have collectively identified general causes at the following stages of data integration planning and execution:
Schema design and modeling
Data source profiling
Data staging and ETLs
Data transformations, cleansing, and enrichment
Table 1 displays the most common dimensions of data quality, as well as tests that help identify data issues at each stage listed above. Data engineers, developers and other implementers of data integration solutions can review and analyze these potential issues before releasing their projects. From the sample tests shown in Table 1, developers and data testing specialists can find ideas for creating test conditions.
Because of the complexity and processing logic needed to manage and control these data quality dimensions, many organizations rely on data discovery and quality management tools. Each supports a certain level of automation, and some can perform multiple types of tests. Popular testing tools include Acceldata, Appian, Katalon, Kobiton, Kualitee, Monte Carlo, Micro Focus, Qaprosoft, Sauce Labs, Selenium, Telerik Test Studio, TestArchitect, and Unravel.
Table 1: Data Quality Dimensions - Testing Checklist
Data Quality Dimension
Examples of Data Quality Workflow Tests
The extent to which data is correct, reliable, and certified
Data element precision checks
Data values defined and populated per requirements
Data elements correspond to correct data types
Null or blank values are as expected
Truncated data is as expected
Data transformations are correct
Data contained in the source precisely as in the target
All transformations are correct
The extent to which elements associated with entities (ex., table) comprise all expected values for all expected attributes
Record counts for sources & targets match
Data integrity checks for all keys
Out-of-boundary checks for data
SCD and CDC checks
Record counts as expected
Null value checks for all columns
Data conforms to business rules and specified formats to meet users' expectations
Correct implementation of business rules
Date/time formats correct
Data types represent requirements
Data values per column as per specifications
Report formats comply with requirements
Data elements conform to data models
Metadata conforms to requirements across environments
The extent to which data attributes values are free from contradiction and coherent with other data in a specific context of the use
Source to target field mapping checks
Contains the values of all measures in the data set
One-to-one, many-to-one value, many-to-many data value checks
Slowly changing dimensions (SCD) consistently applied
Data values conform to data types and data lengths
Data equivalence maintained during data movements
Data aggregation rules applied correctly
Correctness of referential integrity (ex., primary and foreign keys)
Data element values are within specified ranges
Data values conform to business rules
Correctness of concatenated data from multiple sources/fields
Check for referential integrity across related tables or entities
Ensure that the data is not corrupted and can be accessed without error
Data meets specified business or regulatory requirements
The measurement or classification detail used in to specify an attribute's values
Numeric field precisions calculated and presented correctly
Numeric data precisions per requirements
Data trimming is correct
Data values not truncated
Data reflects times and dates correctly. Data is available within the required time frame
Dates and time values loaded as defined
Dates, times values within the boundaries of requirements
Time/dates formats defined as required
Date/time fields not null or blank
Most recent dates and times are represented
Duplicate values checked for all fields and records
Duplicate records/fields removed where specified
Duplicate primary, foreign, and surrogate keys do not exist
Surrogate keys are assigned where needed
Using a data quality dimensions checklist when planning a testing strategy can help ensure that your data is accurate, complete, and reliable – i.e., that it meets the requirements of each data quality dimension. This checklist can serve as a guide to help you assess the quality dimensions of your data pipelines. It can help you identify any issues or discrepancies in the data, allowing you to take action to correct them before the data is used for analysis or decision-making. Using the checklist and guidelines in this blog, you can ensure that the data meets your organization’s specific requirements and standards.
Guest blogger bio: