Developing a Robust Data Quality Strategy for Your Data Pipeline Workflows

ABSTRACT: A robust data workflow testing strategy helps ensure the accuracy and reliability of data processed within a data pipeline.

Developing a robust data workflow testing strategy is crucial to ensuring the accuracy and reliability of data processed within a data pipeline. 

The strategy should be a comprehensive plan to test all aspects of data quality, from individual code units to workflows to full data pipelines. In this blog, we define the data quality (DQ) testing functions that support data extractions, transformations, processing, and loading. You can view these functions as the backbone of data quality observability, which is part of overall data observability as defined by the Eckerson Group. The blog “The Rightful Role of Data Observability within DataOps” describes how data observability assists DataOps processes such as continuous integration and deployment (CI/CD) of pipeline code. While data quality is hardly a new problem, startup vendors such as Monte Carlo, Acceldata, and Unravel offer new machine learning techniques to optimize data quality for hybrid, cloud, and multi-cloud environments.

Let’s start with some definitions. A data pipeline comprises "workflows" that execute in a defined order to process and move data from sources to destinations. Typically, these workflows are collections of small, self-contained tasks. Data flow functions, meanwhile, are specific tasks or sets of jobs that workflows perform as part of a larger data pipeline. For example, a data workflow could include functions for extracting data from a database and transforming the data into a usable format. To ensure data quality across pipelines, workflows, and functions, we must test the dimensions of accuracy, completeness, conformity, consistency, integrity, precision, timeliness, and uniqueness (see Table 1 below).
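To make these terms concrete, here is a minimal sketch of a workflow as an ordered set of small, self-contained tasks. The task names and sample data are purely illustrative and not drawn from any particular pipeline framework.

```python
# Minimal sketch: a "workflow" is an ordered set of small, self-contained tasks,
# and a pipeline strings such workflows together. Names and data are illustrative.

def extract_orders() -> list[dict]:
    """Extraction task: pull raw rows from a source (stubbed here)."""
    return [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": None}]

def transform_orders(rows: list[dict]) -> list[dict]:
    """Transformation task: cast amounts to float, defaulting missing values to 0.0."""
    return [{**row, "amount": float(row["amount"] or 0.0)} for row in rows]

def load_orders(rows: list[dict]) -> None:
    """Load task: write transformed rows to a destination (stubbed as a print)."""
    print(f"Loaded {len(rows)} rows")

def orders_workflow() -> None:
    """The workflow executes its tasks in a defined order."""
    load_orders(transform_orders(extract_orders()))

if __name__ == "__main__":
    orders_workflow()
```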

Data errors often arise from schema changes, null values, duplication, or database loading issues (Figure 1). These errors can happen within one or multiple components of the data pipeline, including sources, targets, processing (e.g., transformations, aggregations), and loads to storage destinations. See “Data Quality Issues that Kill Your Machine Learning Models” for an elaboration.

Today, most data pipelines perform "tests" within each workflow to ensure data quality. Many of these tests are designed as an integral part of production system functionality: data stewards or data engineers design them to ensure that erroneous data is identified, blocked, scrubbed, fixed, or simply logged as the pipeline runs. Such tests are necessary, in part because data that flows into pipelines may come from untrusted systems.
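As a simple illustration of such an in-workflow test, the sketch below splits incoming rows into valid and quarantined sets, logging what it blocks rather than letting it flow downstream. The required-field rule is a made-up example for this sketch, not a standard.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq")

REQUIRED_FIELDS = {"order_id", "amount"}  # hypothetical rule for this sketch

def validate(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Identify erroneous rows, block them from flowing downstream, and log them."""
    good, quarantined = [], []
    for row in rows:
        missing = REQUIRED_FIELDS - {k for k, v in row.items() if v is not None}
        if missing:
            log.warning("Quarantined row %s: missing %s", row, sorted(missing))
            quarantined.append(row)
        else:
            good.append(row)
    return good, quarantined
```

Downstream tasks would then consume only the valid rows, while the quarantined rows feed a fix-or-discard process.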

Data pipeline tests are closely linked to the quality of stored data. To quantitatively assess the dimensions of quality, data teams should define both the measures and the measurement methods used.
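For example, a team might define the completeness of a column as the share of non-null values and its uniqueness as the share of distinct non-null values, as in the minimal sketch below. The metric definitions and any thresholds applied to them are assumptions a team would set for itself.

```python
def completeness(rows: list[dict], column: str) -> float:
    """Measure: fraction of rows with a non-null value in `column` (0.0 to 1.0)."""
    if not rows:
        return 1.0
    return sum(1 for row in rows if row.get(column) is not None) / len(rows)

def uniqueness(rows: list[dict], column: str) -> float:
    """Measure: fraction of non-null values in `column` that are distinct."""
    values = [row.get(column) for row in rows if row.get(column) is not None]
    return len(set(values)) / len(values) if values else 1.0
```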

The ETL process can also help ensure the quality of data that arrives and resides in the data warehouse (DW). A malfunctioning ETL process may result in data not being transferred to the DW, or not being transferred promptly. With that in mind, it is essential to assess how accurate and complete the data is. This includes assessing whether the pipeline extracts all the required data from its sources, stores it in the staging area, transforms it, and subsequently loads it into the DW, and then comparing that target data with the source.
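One simple source-to-target comparison is a row-count reconciliation between a source table and its DW counterpart. The sketch below assumes two DB-API connections (for example, sqlite3 or psycopg2 connections) and trusted table names; a real pipeline would typically add column-level and checksum comparisons as well.

```python
def reconcile_counts(source_conn, target_conn, table_name: str) -> dict:
    """Compare row counts between a source table and its data warehouse target."""
    src_cur, tgt_cur = source_conn.cursor(), target_conn.cursor()
    src_cur.execute(f"SELECT COUNT(*) FROM {table_name}")  # table_name assumed trusted
    tgt_cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    source_rows = src_cur.fetchone()[0]
    target_rows = tgt_cur.fetchone()[0]
    return {
        "table": table_name,
        "source_rows": source_rows,
        "target_rows": target_rows,
        "match": source_rows == target_rows,
    }
```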

Figure 1: Modern Data Pipeline Components

Source: The Complexities of Modern Data Pipelines

How to leverage data quality dimensions to enhance test coverage of data pipeline workflows

Dimensions such as accuracy, completeness, and consistency help teams identify potential issues with the data and develop specific test cases to verify that the data meets the required quality standards. These data quality dimensions contribute to test planning and provide a framework for identifying and evaluating data characteristics. In addition, applying data quality dimensions early in the test planning process can help identify and address data issues before they become major problems in later project phases. It can also help evaluate the suitability and quality of various datasets for testing and flag any potential issues or areas of concern. Data stewards or data engineers can then take the necessary data preparation or cleaning steps before testing begins.
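One lightweight way to bring the dimensions into test planning is to keep a declarative plan that maps each dimension to the checks the team intends to run. The mapping below is only a hypothetical starting point; the dimensions and checks would come from your own requirements.

```python
# Hypothetical test plan: each data quality dimension maps to planned checks.
TEST_PLAN = {
    "accuracy": ["data types match requirements", "transformations verified against source"],
    "completeness": ["source/target record counts match", "null checks on required columns"],
    "consistency": ["source-to-target field mapping checks", "SCD handling applied consistently"],
    "uniqueness": ["no duplicate primary keys", "duplicate records removed where specified"],
}

for dimension, checks in TEST_PLAN.items():
    print(f"{dimension}: {len(checks)} planned check(s)")
```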

Many researchers have contributed to the understanding of data quality problems. They have collectively identified general causes at the following stages of data integration planning and execution: 

  1. Schema design and modeling

  2. Data source profiling

  3. Data staging and ETLs

  4. Data transformations, cleansing, and enrichment

  5. Data reporting

Table 1 displays the most common dimensions of data quality, as well as tests that help identify data issues at each stage listed above. Data engineers, developers, and other implementers of data integration solutions can review and analyze these potential issues before releasing their projects. The sample tests shown in Table 1 can also give developers and data testing specialists ideas for creating test conditions.

Because of the complexity and processing logic needed to manage and control these data quality dimensions, many organizations rely on data discovery and quality management tools. Each supports a certain level of automation, and some can perform multiple types of tests. Popular testing tools include Acceldata, Appian, Katalon, Kobiton, Kualitee, Monte Carlo, Micro Focus, Qaprosoft, Sauce Labs, Selenium, Telerik Test Studio, TestArchitect, and Unravel.

Table 1: Data Quality Dimensions - Testing Checklist

Accuracy
Description: The extent to which data is correct, reliable, and certified.
Example workflow tests:
  • Data element precision checks
  • Data values defined and populated per requirements
  • Data elements correspond to correct data types
  • Null or blank values are as expected
  • Truncated data is as expected
  • Data transformations are correct
  • Data in the source matches the target exactly

Completeness
Description: The extent to which elements associated with entities (ex., a table) comprise all expected values for all expected attributes.
Example workflow tests:
  • Record counts for sources and targets match
  • Data integrity checks for all keys
  • Out-of-boundary checks for data
  • Slowly changing dimension (SCD) and change data capture (CDC) checks
  • Record counts as expected
  • Null value checks for all columns

Conformity
Description: Data conforms to business rules and specified formats to meet users' expectations.
Example workflow tests:
  • Correct implementation of business rules
  • Date/time formats are correct
  • Data types represent requirements
  • Data values per column match specifications
  • Report formats comply with requirements
  • Data elements conform to data models
  • Metadata conforms to requirements across environments

Consistency
Description: The extent to which data attribute values are free from contradiction and coherent with other data in a specific context of use.
Example workflow tests:
  • Source-to-target field mapping checks
  • All measure values are present in the data set
  • One-to-one, many-to-one, and many-to-many data value checks
  • Slowly changing dimensions (SCD) applied consistently
  • Data values conform to data types and data lengths
  • Data equivalence maintained during data movements

Integrity
Description: The exactness with which two or more data objects are combined to create new data; the degree to which data is not missing significant relationship linkages; and the maintenance of the accuracy and consistency of data over its entire life cycle during data integration and movement.
Example workflow tests:
  • Data aggregation rules applied correctly
  • Correctness of referential integrity (ex., primary and foreign keys)
  • Data element values are within specified ranges
  • Data values conform to business rules
  • Correctness of concatenated data from multiple sources/fields
  • Referential integrity checks across related tables or entities
  • Data is not corrupted and can be accessed without error
  • Data meets specified business or regulatory requirements

Precision
Description: The measurement or classification detail used to specify an attribute's values.
Example workflow tests:
  • Numeric field precisions calculated and presented correctly
  • Numeric data precisions meet requirements
  • Data trimming is correct
  • Data values are not truncated

Timeliness
Description: Data reflects times and dates correctly, and data is available within the required time frame.
Example workflow tests:
  • Date and time values loaded as defined
  • Date and time values within the boundaries of requirements
  • Date/time formats defined as required
  • Date/time fields are not null or blank
  • Most recent dates and times are represented

Uniqueness
Description: Data constraints are applied as defined (ex., uniqueness, primary keys, foreign keys, and indexes), and unique data is defined so that there are no duplicates.
Example workflow tests:
  • Duplicate values checked for all fields and records
  • Duplicate records/fields removed where specified
  • Duplicate primary, foreign, and surrogate keys do not exist
  • Surrogate keys are assigned where needed
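To show how a few of the checklist items above might translate into automated checks, here is a minimal sketch using pandas. The table and column names (orders, customers, order_id, customer_id) are illustrative assumptions, and real datasets may be too large to check in memory this way.

```python
import pandas as pd

def run_checklist(orders_src: pd.DataFrame, orders_tgt: pd.DataFrame,
                  customers: pd.DataFrame) -> dict:
    """Evaluate a handful of the Table 1 checks against a loaded target table."""
    return {
        # Completeness: record counts for source and target match.
        "record_counts_match": len(orders_src) == len(orders_tgt),
        # Accuracy/completeness: null value checks for all columns.
        "null_counts": orders_tgt.isna().sum().to_dict(),
        # Uniqueness: duplicate primary keys do not exist.
        "duplicate_order_ids": int(orders_tgt["order_id"].duplicated().sum()),
        # Integrity: referential integrity against the customers dimension.
        "orphan_customer_ids": int(
            (~orders_tgt["customer_id"].isin(customers["customer_id"])).sum()
        ),
    }
```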


Using a data quality dimensions checklist when planning a testing strategy can help ensure that your data is accurate, complete, and reliable – i.e., that it meets the requirements of each data quality dimension. This checklist can serve as a guide to help you assess the quality dimensions of your data pipelines. It can help you identify any issues or discrepancies in the data, allowing you to take action to correct them before the data is used for analysis or decision-making. Using the checklist and guidelines in this blog, you can ensure that the data meets your organization’s specific requirements and standards.

Guest blogger bio:

Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, Tricentis, and Eckerson Group.