Testing Capabilities and Tools for Data Engineers – Part 1

ABSTRACT: This article describes how a data quality management framework helps data engineers create reliable data pipelines, data stores, and data lakes.

Introduction

A well-planned data quality framework can help data pipeline engineers put an end to production surprises. When data engineers invest in a data quality management framework that includes strong data testing skills, the result is reliable data and a team equipped to resolve issues quickly.

The worst indicator of data quality problems is a delayed one: users telling you that reports are incorrect and that data integrity is at risk. Building trust in data pipelines is central to data engineering.

Figure 1 depicts the common process components of a "data stack" (see The Modern Data Stack), with a sampling of the technologies used to ingest, store, transform, test, publish, and deliver data for consumer use.

Figure 1: An example of the many components and technologies that often comprise a pipeline's data stack.

Source: "DataOps for the New Data Stack", Shivnath Babu: Sample Data Stack for DataOps

On Top of Everything Else, Data Engineers Need to Be Effective Testers

Data engineers are responsible for carrying out many data quality checks on all data assets, including ETL scripts, jobs, and data pipeline workflows. They work with stakeholders to review requirements and prepare test plans that effectively assess complex data systems. A critical goal of this testing responsibility is to ensure that high-quality data reaches internal stakeholders and clients.

A data pipeline testing approach for ETL projects should prioritize test automation for source and target datasets, ensuring that they are up to date and accurate (see Figure 2). By assessing data sources and targets against each other, data pipeline teams can detect errors before they affect production applications and gain more time to resolve issues before deployment.

Figure 2: An example of data sources, their transformations, targets, and associated platforms.
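To make the source-to-target idea concrete, here is a minimal sketch of an automated comparison test written for pytest with SQLAlchemy. The connection strings, schema, and table names (staging.orders, warehouse.orders) are hypothetical stand-ins, not a prescribed setup.

import sqlalchemy as sa

# Hypothetical connection strings; substitute your own source and target.
SOURCE_URL = "postgresql://user:pass@source-host/db"
TARGET_URL = "postgresql://user:pass@warehouse-host/db"

def scalar(url: str, query: str):
    """Run a single-value query against a database and return the result."""
    engine = sa.create_engine(url)
    with engine.connect() as conn:
        return conn.execute(sa.text(query)).scalar()

def test_row_counts_match():
    # Every source row should arrive in the target; silent drops are a
    # classic load failure that this check surfaces immediately.
    src = scalar(SOURCE_URL, "SELECT COUNT(*) FROM staging.orders")
    tgt = scalar(TARGET_URL, "SELECT COUNT(*) FROM warehouse.orders")
    assert src == tgt, f"row count mismatch: source={src}, target={tgt}"

def test_amount_totals_match():
    # A column-level aggregate acts as a cheap checksum that catches
    # truncated or corrupted numeric loads.
    src = scalar(SOURCE_URL, "SELECT SUM(amount) FROM staging.orders")
    tgt = scalar(TARGET_URL, "SELECT SUM(amount) FROM warehouse.orders")
    assert src == tgt, f"checksum mismatch: source={src}, target={tgt}"

Run under pytest on a schedule or in CI, checks like these catch load errors before users do.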

Data pipeline engineers' quality assurance and testing skills/responsibilities are often summarized on job boards and within data engineering certifications (Google Data Engineer Certification).

Essential Testing Skills for Data Engineers

  • Apply operational and functional knowledge, including testing standards, guidelines, and methodologies, to meet stakeholders' expected data quality standards

  • Work with project stakeholders to understand requirements, then design and execute test cases

  • Create unit and integration test strategies to ensure that data loads are correct (see the minimal sketch after this list)

  • Create and maintain an automated test environment and perform tests

  • Develop and procure testing tools and processes to achieve desired data quality

  • Validate batch and real-time data ETL and pipeline data loads

  • Develop test cases to analyze/compare data sources and targets in data pipelines

  • Test data pipeline workflows end to end: data loads, data structures, data transformations, and overall data quality

  • Analyze test results and support developers and data analysts in bug fixing activities

  • Ensure that data is delivered with the expected format, structure, and data transformations

  • Perform data integration testing, ensuring that data ingestion and integration between source and destination are accurate and efficient

  • Plan, execute, and stress test data recoveries (i.e., fault tolerance, rerunning failed jobs, performing retrospective re-analysis)

  • Measure and improve query performance

  • Maintain a defect tracking process

  • Generate test reports for management

  • Develop production monitors for workflows, data storage, and data processing 

  • Implement logging used by cloud services and DB monitors 

  • Configure data pipeline monitoring services 

  • Measure performance of data movements

  • Monitor data pipeline performance 
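As a minimal sketch of the unit-testing skill above, the following pytest example exercises a small, hypothetical transformation function (normalize_email); the function and its rules are illustrative, not drawn from any particular pipeline.

import pytest

def normalize_email(raw: str) -> str:
    """Hypothetical pipeline transformation: trim and lowercase an
    email address, rejecting values that cannot be valid."""
    cleaned = raw.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"not an email address: {raw!r}")
    return cleaned

def test_normalizes_case_and_whitespace():
    assert normalize_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"

def test_rejects_malformed_input():
    # Bad records should fail loudly rather than load silently.
    with pytest.raises(ValueError):
        normalize_email("not-an-address")

Unit tests like this run in seconds, so they can gate every commit to transformation code before slower integration and end-to-end tests run.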

Data Pipeline Monitoring on Production Systems

Data observability with production monitoring: Automated testing frameworks for data-intensive applications are usually limited to unit tests. In the data world, pipeline tests combined with ongoing monitoring of metrics provide the alerting needed when things go amiss.

Data observability is an organization's ability to comprehend the health of data in its systems. Data observability tools use automated monitoring, alerting, and triage to identify data quality and discoverability issues, which leads to healthier pipelines, more productive teams, and happier customers.

For data to meet user requirements, these actions should be considered:

  • Monitoring — dashboards that provide an operational view of the pipeline

  • Alerting — for expected and unexpected events

  • Tracking — setting and tracking specific events

  • Comparisons — monitoring over time, with alerts for anomalies

  • Analysis — automated issue detection that adapts to your pipeline and data health

  • Logging — a record of an event in a standardized format for faster resolution

  • SLA Tracking — the ability to measure data quality and pipeline metadata against pre-defined standards

Source: Databand.com, https://databand.ai/data-observability/
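Alerting and SLA tracking can start as simply as a scheduled check. The sketch below, with a hypothetical warehouse table, loaded_at column, and two-hour freshness SLA, alerts when a delivery is late; send_alert stands in for whatever notification channel (Slack, PagerDuty, email) the team uses.

from datetime import datetime, timedelta, timezone
import sqlalchemy as sa

WAREHOUSE_URL = "postgresql://user:pass@warehouse-host/db"  # hypothetical
FRESHNESS_SLA = timedelta(hours=2)  # hypothetical SLA: fresh data every 2 hours

def send_alert(message: str) -> None:
    # Stand-in for a real notification channel (Slack webhook, PagerDuty, email).
    print(f"ALERT: {message}")

def check_freshness() -> None:
    """Alert if the newest row in warehouse.orders is older than the SLA.
    Assumes loaded_at is stored as a timezone-aware UTC timestamp."""
    engine = sa.create_engine(WAREHOUSE_URL)
    with engine.connect() as conn:
        last_loaded = conn.execute(
            sa.text("SELECT MAX(loaded_at) FROM warehouse.orders")
        ).scalar()
    age = datetime.now(timezone.utc) - last_loaded
    if age > FRESHNESS_SLA:
        send_alert(f"warehouse.orders is stale: last load was {age} ago "
                   f"(SLA is {FRESHNESS_SLA})")

if __name__ == "__main__":
    check_freshness()  # schedule via cron or an orchestrator such as Airflow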

Data engineers often develop and implement production data monitors and incident management processes that identify and test for the following (a short anomaly-check sketch follows the list):

  • Anomalies in column-level statistics like nulls and distributions

  • Irregular data volumes and sizes

  • Missing data deliveries

  • Pipeline failures, inefficiencies, and errors

  • Service level agreement misses

  • Unexpected schema changes
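As a concrete illustration of the first two checks, a monitoring job can compare today's null rate or row count against a recent baseline. The sketch below uses a simple z-score test; the history values and three-sigma threshold are hypothetical.

import statistics

def detect_anomaly(history: list[float], today: float,
                   threshold: float = 3.0) -> bool:
    """Flag today's value if it lies more than `threshold` standard
    deviations from the recent history (a simple z-score test)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

# Hypothetical daily metrics collected by a monitoring job:
null_rate_history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012]
row_count_history = [98_000, 101_500, 99_700, 100_200, 100_900, 99_300, 100_100]

# A spike in nulls or a drop in volume is flagged; normal days are not.
assert detect_anomaly(null_rate_history, today=0.25)      # null-rate anomaly
assert detect_anomaly(row_count_history, today=40_000)    # volume anomaly
assert not detect_anomaly(row_count_history, today=100_400)

In production, the history would come from a metrics store populated by the pipeline, and a flagged value would open an incident rather than raise an assertion.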

Data engineers can harmonize incident management and monitoring so that the combination is more powerful than either process alone. Monitoring and incident management complement each other, enabling proactive, real-time responses that improve data quality outcomes.

In Part 2 of this article, we show why and how data engineers conduct unit, integration, and end-to-end tests, followed by a survey of the open source and commercial testing and production monitoring tools available today.

Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, and Tricentis.
