Developing a Robust Data Quality Strategy for Your Data Pipeline Workflows
ABSTRACT: A robust data workflow testing strategy helps ensure the accuracy and reliability of data processed within a data pipeline.
Developing a robust data workflow testing strategy is crucial to ensuring the accuracy and reliability of data processed within a data pipeline.
The strategy should be a comprehensive plan to test all aspects of data quality, from individual code units to workflows to full data pipelines. In this blog, we define the data quality (DQ) testing functions that support data extractions, transformations, processing, and loading. You can view these functions as the backbone of data quality observability, which is part of overall data observability as defined by the Eckerson Group. The blog “The Rightful Role of Data Observability within DataOps” describes how data observability assists DataOps processes such as continuous integration and deployment (CI/CD) of pipeline code. While data quality is hardly a new problem, startup vendors such as Monte Carlo, Acceldata, and Unravel offer new machine learning techniques to optimize data quality for hybrid, cloud, and multi-cloud environments.
Let’s start with some definitions. A data pipeline comprises "workflows" that execute in a defined order to process and move data from sources to destinations; typically, these workflows are collections of small, self-contained tasks. Data flow functions, meanwhile, are specific tasks or sets of jobs that workflows perform as part of a larger data pipeline. For example, a data workflow could include functions for extracting data from a database and transforming the data into a usable format. To ensure data quality across pipelines, workflows, and functions, we must test the dimensions of accuracy, completeness, conformity, consistency, integrity, precision, timeliness, and uniqueness (see Table 1 below).
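To make these terms concrete, here is a minimal sketch (not taken from any particular orchestration tool) that models a workflow as an ordered list of data flow functions; all function, field, and table names are hypothetical.

```python
# Toy model of pipeline -> workflow -> data flow functions.
# All names and sample records are hypothetical.

def extract_orders():
    """Data flow function: pull raw records from a (simulated) source."""
    return [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "7.25"}]

def transform_orders(rows):
    """Data flow function: cast amounts to float so downstream steps can aggregate."""
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load_orders(rows):
    """Data flow function: stand-in for a load into a warehouse table."""
    print(f"loaded {len(rows)} rows")

# A workflow is an ordered collection of small, self-contained tasks.
orders_workflow = [extract_orders, transform_orders, load_orders]

def run_workflow(steps):
    """Run each step in order, passing the output of one step to the next."""
    data = None
    for step in steps:
        data = step() if data is None else step(data)
    return data

run_workflow(orders_workflow)
```

In a real pipeline, an orchestrator such as Airflow or Dagster would handle the ordering, retries, and scheduling that this toy runner glosses over.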
Data errors often arise from schema changes, null values, duplication, or database loading issues (Figure 1). These errors can happen within one or multiple components of the data pipeline, including sources, targets, processing (e.g., transformations, aggregations), and loads to storage destinations. See “Data Quality Issues that Kill Your Machine Learning Models” for an elaboration.
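As an illustration of how such errors might be caught before they propagate, the sketch below checks a batch of in-memory records for schema drift, nulls, and duplicates; the expected schema and sample rows are invented for this example.

```python
# Illustrative checks for three common error sources: schema drift, nulls, duplicates.
# The expected schema and sample rows are invented for this example.

EXPECTED_SCHEMA = {"id", "email", "created_at"}

rows = [
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-01"},
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-01"},          # duplicate id
    {"id": 2, "email": None, "created_at": "2024-01-02", "source_file": "x"},  # null + new column
]

def schema_drift(rows):
    """Columns that appeared or disappeared relative to the expected schema."""
    actual = set().union(*(r.keys() for r in rows))
    return {"unexpected": actual - EXPECTED_SCHEMA, "missing": EXPECTED_SCHEMA - actual}

def null_violations(rows, required=("id", "email")):
    """Rows where a required column is null."""
    return [r for r in rows if any(r.get(col) is None for col in required)]

def duplicate_rows(rows, key="id"):
    """Rows whose key value has already been seen."""
    seen, dupes = set(), []
    for r in rows:
        if r[key] in seen:
            dupes.append(r)
        else:
            seen.add(r[key])
    return dupes

print(schema_drift(rows))
print(null_violations(rows))
print(duplicate_rows(rows))
```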
Today, most data pipelines perform "tests" within each workflow to ensure data quality. Many of these tests are designed as an integral part of production system functionality: data stewards or data engineers design them to ensure that erroneous data is identified, blocked, scrubbed, fixed, or simply logged as the pipeline runs. Such tests are necessary in part because the data flowing into pipelines may come from untrusted systems.
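A minimal sketch of such an in-workflow test is shown below, assuming a simple quarantine-and-log policy; the validation rules and record structure are placeholders, not a prescription.

```python
import logging

# Hypothetical in-workflow data quality gate: identify, block (quarantine), and log bad records.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq_gate")

def validate_row(row):
    """Return a list of rule violations for one record (rules are illustrative)."""
    issues = []
    if row.get("amount") is None:
        issues.append("amount is null")
    elif row["amount"] < 0:
        issues.append("amount is negative")
    return issues

def run_with_dq_gate(rows):
    """Pass clean rows downstream; quarantine and log the rest instead of failing the run."""
    clean, quarantined = [], []
    for row in rows:
        issues = validate_row(row)
        if issues:
            quarantined.append((row, issues))
            log.warning("blocked row %s: %s", row, "; ".join(issues))
        else:
            clean.append(row)
    return clean, quarantined

clean, bad = run_with_dq_gate([{"amount": 12.0}, {"amount": -3.0}, {"amount": None}])
```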
Data pipeline tests are closely linked to the quality of stored data. To quantitatively assess the dimensions of quality, data teams should define both the measures and the measurement methods used.
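For example, a completeness measure could be defined as the share of non-null values per required column, with an agreed pass threshold. The sketch below shows one possible measurement method; the 99% threshold and column names are assumptions for illustration.

```python
# One possible measure and measurement method for the "completeness" dimension:
# the share of non-null values per required column, compared to an agreed threshold.
# The 99% threshold and column names are assumptions for illustration.

def completeness(rows, required_columns, threshold=0.99):
    results = {}
    for col in required_columns:
        non_null = sum(1 for r in rows if r.get(col) not in (None, ""))
        ratio = non_null / len(rows) if rows else 0.0
        results[col] = {"ratio": round(ratio, 4), "passes": ratio >= threshold}
    return results

sample = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
print(completeness(sample, ["id", "email"]))
```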
The ETL process can also help ensure the quality of data that arrives and resides in the data warehouse (DW). A malfunctioning ETL process may result in data not being transferred to the DW, or not being transferred promptly. With that in mind, it is essential to assess how accurate and complete the data is. This includes assessing whether the pipeline extracts all the required data from its sources, stores it in the staging area, transforms it, and subsequently loads it into the DW, and then comparing that target data with the source.
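A simple form of that source-to-target comparison is a reconciliation query run against both sides. The sketch below uses in-memory SQLite databases as stand-ins for the source system and the DW; the table and column names are made up.

```python
import sqlite3

# Reconciliation sketch: compare record counts and a simple aggregate between
# source and target after a load. SQLite stands in for the real source and DW.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
for conn in (src, tgt):
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.5), (2, 7.25)])

def reconcile(table):
    """Row-count and amount-sum comparison between source and target."""
    query = f"SELECT COUNT(*), ROUND(SUM(amount), 2) FROM {table}"
    source_totals = src.execute(query).fetchone()
    target_totals = tgt.execute(query).fetchone()
    return {"source": source_totals, "target": target_totals, "match": source_totals == target_totals}

print(reconcile("orders"))
```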
Figure 1: Modern Data Pipeline Components (Source: The Complexities of Modern Data Pipelines)
How to leverage data quality dimensions to enhance test coverage of data pipeline workflows
Dimensions such as accuracy, completeness, and consistency help teams identify potential issues with the data and develop specific test cases to verify that the data meets the required quality standards. These data quality dimensions contribute to test planning by providing a framework for identifying and evaluating data characteristics. In addition, using data quality dimensions early in the test planning process can help identify and address data issues before they become major problems during later project phases. It can also help evaluate the suitability and quality of various datasets for testing and flag any potential issues or areas of concern. Data stewards or data engineers can then take the necessary data preparation or cleaning steps before testing begins.
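One lightweight way to act on this is to keep a test-plan registry that maps each dimension to the concrete checks planned for a dataset, as in the hypothetical sketch below; the dimension names come from Table 1, while the check functions are placeholders.

```python
# Hypothetical test-plan registry: each data quality dimension maps to the
# concrete checks planned for a dataset. The checks themselves are placeholders.

def no_null_ids(rows):
    return all(r.get("id") is not None for r in rows)

def unique_ids(rows):
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))

TEST_PLAN = {
    "completeness": [no_null_ids],
    "uniqueness": [unique_ids],
}

def run_plan(rows):
    """Run every planned check, grouped by dimension."""
    return {dim: {chk.__name__: chk(rows) for chk in checks} for dim, checks in TEST_PLAN.items()}

print(run_plan([{"id": 1}, {"id": 2}]))
```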
Many researchers have contributed to the understanding of data quality problems. They have collectively identified general causes at the following stages of data integration planning and execution:
Schema design and modeling
Data source profiling
Data staging and ETLs
Data transformations, cleansing, and enrichment
Data reporting
Table 1 displays the most common dimensions of data quality, as well as tests that help identify data issues at each stage listed above. Data engineers, developers and other implementers of data integration solutions can review and analyze these potential issues before releasing their projects. From the sample tests shown in Table 1, developers and data testing specialists can find ideas for creating test conditions.
Because of the complexity and processing logic needed to manage and control these data quality dimensions, many organizations rely on data discovery and quality management tools. Each supports a certain level of automation, and some can perform multiple types of tests. Popular testing tools include Acceldata, Appian, Katalon, Kobiton, Kualitee, Monte Carlo, Micro Focus, Qaprosoft, Sauce Labs, Selenium, Telerik Test Studio, TestArchitect, and Unravel.
Table 1: Data Quality Dimensions - Testing Checklist

| Data Quality Dimension | Description | Examples of Data Quality Workflow Tests |
|---|---|---|
| Accuracy | The extent to which data is correct, reliable, and certified | Data element precision checks; data values defined and populated per requirements; data elements correspond to correct data types; null or blank values are as expected; truncated data is as expected; all data transformations are correct; data in the source matches the target precisely |
| Completeness | The extent to which elements associated with entities (e.g., a table) comprise all expected values for all expected attributes | Record counts for sources and targets match; data integrity checks for all keys; out-of-boundary checks for data; SCD and CDC checks; record counts as expected; null value checks for all columns |
| Conformity | Data conforms to business rules and specified formats to meet users' expectations | Correct implementation of business rules; correct date/time formats; data types represent requirements; data values per column meet specifications; report formats comply with requirements; data elements conform to data models; metadata conforms to requirements across environments |
| Consistency | The extent to which data attribute values are free from contradiction and coherent with other data in a specific context of use | Source-to-target field mapping checks; values of all measures present in the data set; one-to-one, many-to-one, and many-to-many data value checks; slowly changing dimensions (SCD) consistently applied; data values conform to data types and data lengths; data equivalence maintained during data movements |
| Integrity | The extent to which relationships among data elements and records remain valid and uncorrupted | Data aggregation rules applied correctly; correctness of referential integrity (e.g., primary and foreign keys); data element values are within specified ranges; data values conform to business rules; correctness of concatenated data from multiple sources/fields; referential integrity checks across related tables or entities; data is not corrupted and can be accessed without error; data meets specified business or regulatory requirements |
| Precision | The measurement or classification detail used to specify an attribute's values | Numeric field precisions calculated and presented correctly; numeric data precisions per requirements; data trimming is correct; data values not truncated |
| Timeliness | Data reflects times and dates correctly, and data is available within the required time frame | Date and time values loaded as defined; date and time values within the boundaries of requirements; time/date formats defined as required; date/time fields not null or blank; most recent dates and times are represented |
| Uniqueness | The extent to which records and key values are not duplicated | Duplicate values checked for all fields and records; duplicate records/fields removed where specified; duplicate primary, foreign, and surrogate keys do not exist; surrogate keys are assigned where needed |
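To show how checklist entries translate into executable test conditions, here is a small sketch implementing two of them (referential integrity and duplicate keys) over in-memory records; all table, key, and function names are illustrative.

```python
# Two checklist entries from Table 1 expressed as executable checks over
# in-memory records: referential integrity and duplicate keys.
# All table, key, and function names are illustrative.

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"customer_id": 10}, {"customer_id": 11}]

def orphaned_rows(child, parent, fk, pk):
    """Integrity: child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent}
    return [row for row in child if row[fk] not in parent_keys]

def duplicate_keys(rows, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        (dupes if row[key] in seen else seen).add(row[key])
    return dupes

print(orphaned_rows(orders, customers, fk="customer_id", pk="customer_id"))  # flags order 2
print(duplicate_keys(orders, key="order_id"))                                # set()
```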
Using a data quality dimensions checklist when planning a testing strategy can help ensure that your data is accurate, complete, and reliable – i.e., that it meets the requirements of each data quality dimension. This checklist can serve as a guide to help you assess the quality dimensions of your data pipelines. It can help you identify any issues or discrepancies in the data, allowing you to take action to correct them before the data is used for analysis or decision-making. Using the checklist and guidelines in this blog, you can ensure that the data meets your organization’s specific requirements and standards.
Guest blogger bio:
Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, Tricentis, and Eckerson Group.