Testing Capabilities and Tools for Data Engineers – Part 1
ABSTRACT: This article describes how a data quality management framework helps data engineers create reliable data pipelines, data stores, and data lakes.
Introduction
A well-planned "data quality framework" for data pipeline engineers can help end production surprises. When data engineers invest in a data quality management framework that includes strong data testing skills to help the team solve issues, the result is reliable data.
The worst indicator of data quality problems is a delayed indicator - users tell you that data reporting is incorrect and data integrity is at risk. Data engineering requires building trust in data pipelines.
Figure 1 depicts the common process components of a "data stack" (see The Modern Data Stack), with a sampling of technologies used to ingest, store, transform, test, publish, and provide data for consumer use.
Figure 1: An example of the many components and technologies that often comprise a pipeline's data stack.
Source: "DataOps for the New Data Stack", Shivnath Babu: Sample Data Stack for DataOps
On Top of Everything Else, Data Engineers Need to be Effective Testers
Data engineers are responsible for carrying out many data quality checks on all data assets. This includes ETL scripts, jobs, and data pipeline workflows. They work with stakeholders to review requirements and prepare test plans that effectively assess complex data systems. One of the critical goals of this testing responsibility is to ensure that high-quality data is provided to internal stakeholders and clients.
A data pipeline testing approach for ETL projects should prioritize test automation for source and target datasets and ensure that they are up to date and accurate. It is essential to assess your data sources and targets (see Figure 2). By doing so, data pipeline teams can detect errors before they affect production applications, and they gain time to resolve issues before deployment.
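One common form of source-to-target test automation is reconciliation: after a load, compare row counts and a simple checksum between the source and target tables. The sketch below is a minimal, hypothetical example using SQLite; the table and column names are illustrative, and a real pipeline would run the same queries against its own source and target platforms.

```python
import sqlite3

def reconcile(conn, source_table, target_table, key_column):
    """Compare row counts and a key-column checksum between source and target.

    Returns a dict of pass/fail flags that a test harness can assert on.
    """
    cur = conn.cursor()
    cur.execute(
        f"SELECT COUNT(*), COALESCE(SUM({key_column}), 0) FROM {source_table}"
    )
    src_count, src_sum = cur.fetchone()
    cur.execute(
        f"SELECT COUNT(*), COALESCE(SUM({key_column}), 0) FROM {target_table}"
    )
    tgt_count, tgt_sum = cur.fetchone()
    return {
        "row_count_match": src_count == tgt_count,
        "key_checksum_match": src_sum == tgt_sum,
    }
```

A checksum over a numeric key is a cheap first-pass check; teams that need stronger guarantees typically add column-by-column hashing or sampling-based comparisons on top of it.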
Figure 2: An example of data source transformations; their range, targets, and associated platforms.
The quality assurance and testing skills and responsibilities expected of data pipeline engineers are regularly summarized on job boards and in data engineering certifications (e.g., the Google Data Engineer certification).
Essential Testing Skills for Data Engineers
Apply operational and functional knowledge, including testing standards, guidelines, and testing methodology, to meet stakeholders' expected data quality standards.
Work with project stakeholders to understand the requirements to design and execute test cases
Create unit and integration test strategies to ensure that data loads are correct
Create and maintain an automated test environment and perform tests
Develop and procure testing tools and processes to achieve desired data quality
Validate batch and real-time data ETL and pipeline data loads
Develop test cases to analyze/compare data sources and targets in data pipelines
Test data pipeline workflows end to end: data loads, data structures, data transformations, and overall data quality
Analyze test results and support developers and data analysts in bug fixing activities
Ensure that data is delivered with the expected format, structure, and data transformations
Perform data integration testing ensuring data ingestion and integration between source and destination are accurate and efficient
Plan, execute, and stress test data recoveries (i.e., fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Measure and improve query performance
Maintain a defect tracking process
Generate test reports for management
Develop production monitors for workflows, data storage, and data processing
Implement logging used by cloud services and DB monitors
Configure data pipeline monitoring services
Measure performance of data movements
Monitor data pipeline performance
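Several of the skills above, creating unit test strategies and verifying that transformations deliver the expected format and structure, come down to writing small, assertable tests around each transformation. The sketch below is a hypothetical unit test for an illustrative record-cleaning function; the function and field names are assumptions, not part of any specific pipeline.

```python
def normalize_customer(record):
    """Illustrative transformation: trim whitespace and lowercase emails."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }

def test_normalize_customer():
    """Unit test: the transformation produces the expected format."""
    raw = {"name": "  Ada Lovelace ", "email": " ADA@Example.COM "}
    clean = normalize_customer(raw)
    assert clean == {"name": "Ada Lovelace", "email": "ada@example.com"}
```

Tests of this shape are typically collected by a runner such as pytest and executed on every change, so a transformation regression is caught before a load runs.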
Data Pipeline Monitoring on Production Systems
Data observability with production monitoring: Automated testing frameworks are usually limited to unit tests for data-intensive applications. In the data world, pipeline tests and ongoing monitoring of metrics fill the need for alerting when something is amiss.
Data observability is an organization's ability to comprehend the health of data in its systems. Data observability tools use automated monitoring, alerting, and triage to identify data quality and discoverability issues, which leads to healthier pipelines, more productive teams, and happier customers.
For data to meet user requirements, these actions should be considered:
Monitoring — dashboards that provide an operational view of the pipeline
Alerting — for expected and unexpected events
Tracking — setting and tracking specific events
Comparisons — monitoring over time, with alerts for anomalies
Analysis — automated issue detection that adapts to your pipeline and data health
Logging — a record of an event in a standardized format for faster resolution
SLA Tracking — the ability to measure data quality and pipeline metadata against pre-defined standards
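The monitoring, alerting, and comparison actions above can be sketched as threshold checks over batch-level metrics. The example below is a minimal, hypothetical monitor that flags low data volume and a high null rate and writes standardized log records; the thresholds, column name, and logger name are all assumptions a real team would configure for its own pipelines.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline.monitor")

def check_batch(rows, expected_min_rows=100, max_null_rate=0.05, column="amount"):
    """Return a list of alert messages for a batch of row dicts.

    Checks two simple health metrics: total volume and the null rate
    of one monitored column. Each alert is also logged.
    """
    alerts = []
    if len(rows) < expected_min_rows:
        alerts.append(f"volume alert: {len(rows)} rows < {expected_min_rows}")
    nulls = sum(1 for r in rows if r.get(column) is None)
    null_rate = nulls / len(rows) if rows else 1.0
    if null_rate > max_null_rate:
        alerts.append(f"null-rate alert: {null_rate:.1%} in '{column}'")
    for alert in alerts:
        log.warning(alert)
    return alerts
```

In production, the same checks would feed a dashboard and an alert router rather than a local logger, and the thresholds would be compared against historical baselines rather than fixed constants.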
Source: Databand.com, https://databand.ai/data-observability/
Data engineers often develop and implement production data-related monitors and incident management that identify and test:
Anomalies in column-level statistics like nulls and distributions
Irregular data volumes and sizes
Missing data deliveries
Pipeline failures, inefficiencies, and errors
Service level agreement misses
Unexpected schema changes
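A monitor for the last item, unexpected schema changes, can be as simple as comparing a table's current columns against a stored snapshot. The sketch below is a hypothetical schema-drift check; the schema is represented as a plain dict of column name to type string, which is an assumption for illustration rather than any particular catalog's format.

```python
def detect_schema_drift(expected, actual):
    """Compare two schemas (dicts of column name -> type string).

    Returns added, removed, and retyped columns so a monitor can
    alert on any unexpected change.
    """
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    retyped = sorted(
        col for col in set(expected) & set(actual)
        if expected[col] != actual[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}
```

A scheduled job can fetch the live schema from the warehouse's information schema, run this comparison, and open an incident whenever any of the three lists is non-empty.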
Data engineers can harmonize incident management with production monitoring; together, the two processes create a more powerful synergy than either does alone. A combined monitoring and incident management system enables proactive, real-time responses that improve data quality outcomes.
In Part 2 of this article, we show why and how data engineers conduct unit, integration, and end-to-end tests, followed by several open source and commercial testing and production monitoring tools available today.
Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, and Tricentis.