Overcoming The Challenge of Low Data Quality

ABSTRACT: This article by Piotr Czarnas, founder of DQOps, outlines a proven, team-based approach to tackling persistent issues like invalid data, delayed reporting, and inconsistent formats.
Read time: 5 mins.
Low data quality is a significant obstacle to the success of data projects and limits organizations' ability to harness the power of Artificial Intelligence (AI). When data is unreliable, decisions based on it can be flawed, leading to incorrect conclusions and ineffective strategies. Several common data quality issues contribute to this problem. Timeliness issues arise when data is not delivered promptly. For instance, if daily reports are delayed, their value for decision-making diminishes. Validity issues occur when data contains inaccuracies or inconsistencies. This could include dates that are obviously incorrect, such as dates in the future, or data in a format that does not adhere to the expected standard. Such issues erode trust in data and hinder its use for analysis and AI model training.
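To make these categories concrete, the sketch below shows what a simple validity check (future dates, malformed values) and a timeliness check might look like in Python. It assumes a pandas DataFrame with hypothetical order_date and customer_email columns and a 24-hour delivery expectation; none of these specifics come from the article.

```python
# A minimal sketch of a validity and a timeliness check on a pandas DataFrame.
# The column names (order_date, customer_email) and the 24-hour delivery
# expectation are illustrative assumptions.
from datetime import datetime, timedelta, timezone

import pandas as pd

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def find_invalid_rows(df: pd.DataFrame) -> pd.Series:
    """Flag rows with obviously incorrect values (validity)."""
    future_dates = df["order_date"] > pd.Timestamp.now()  # assumes timezone-naive timestamps
    bad_emails = ~df["customer_email"].astype("string").str.match(EMAIL_PATTERN).fillna(False)
    return future_dates | bad_emails

def is_timely(last_load_time: datetime, max_delay_hours: int = 24) -> bool:
    """Check that the latest load arrived within the expected window (timeliness)."""
    return datetime.now(timezone.utc) - last_load_time <= timedelta(hours=max_delay_hours)
```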
The Role of Data Owners and Data Stewards
One approach to tackling low data quality is to designate data owners who take responsibility for improving it. Data owners define data quality standards and address data quality issues. Ideally, a data owner has a deep understanding of how the data is used within the business and how data quality issues ultimately affect business processes. It is crucial to involve data owners who focus on the business value of the data.
Technical data stewards play a crucial role in helping data owners detect and resolve data quality issues. Technical data stewards are experts in the use of data quality tools. Their responsibilities include registering data sources for data quality monitoring, configuring thresholds for data quality checks, and re-executing data quality checks after fixes have been implemented. The role of technical data stewards is comparable to that of Software Quality Assurance (SQA) teams in software engineering. Just as software testers verify that software functions according to requirements, technical data stewards test whether data meets data quality requirements and is fit for its intended use.
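As a concrete illustration of the steward's configuration work, the sketch below shows a generic, threshold-based data quality check that can be executed, and later re-executed after a fix. This is not the DQOps API; the customer_id column and the 98% threshold are assumptions made for the example.

```python
# A generic illustration of a threshold-based data quality check configured by
# a technical data steward. Not the DQOps API; column name and threshold are
# assumptions for the example.
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class QualityCheck:
    name: str
    metric: Callable[[pd.DataFrame], float]  # measures the data
    min_threshold: float                      # configured by the data steward

    def run(self, df: pd.DataFrame) -> bool:
        value = self.metric(df)
        passed = value >= self.min_threshold
        print(f"{self.name}: {value:.1%} (threshold {self.min_threshold:.0%}) -> "
              f"{'PASS' if passed else 'FAIL'}")
        return passed

# Example: at least 98% of rows must have a non-null customer_id.
completeness_check = QualityCheck(
    name="customer_id completeness",
    metric=lambda df: df["customer_id"].notna().mean(),
    min_threshold=0.98,
)

# The steward re-executes the same check after data engineers apply a fix:
# completeness_check.run(customers_df)
```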
The Role of Data Engineers in the Fixing Process
Data engineering teams play a critical role in fixing data quality issues. These teams are responsible for the collection, transformation, and loading of data from various sources. When a data quality issue is reported, data engineers can address it if the problem originated during the data loading process. They can also implement additional data enrichment steps within data pipelines to correct invalid records during loading. For instance, they might join incoming records with a reliable dataset to perform lookups and populate missing values.
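The enrichment step described above can be sketched as a simple lookup join. The example below assumes a pandas pipeline and hypothetical country_code and country_name columns; only the pattern (join against a trusted reference, keep existing values, fill the gaps) reflects the article.

```python
# A minimal sketch of the enrichment step: incoming records are joined with a
# trusted reference dataset to fill in missing values. Column names are
# illustrative assumptions.
import pandas as pd

def enrich_missing_country_names(incoming: pd.DataFrame,
                                 reference: pd.DataFrame) -> pd.DataFrame:
    """Look up country_name from the reference data for rows where it is missing."""
    lookup = reference[["country_code", "country_name"]].drop_duplicates(subset="country_code")
    enriched = incoming.merge(lookup, on="country_code", how="left", suffixes=("", "_ref"))
    # Keep the original value when present; otherwise use the looked-up one.
    enriched["country_name"] = enriched["country_name"].fillna(enriched["country_name_ref"])
    return enriched.drop(columns=["country_name_ref"])
```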
Effective collaboration between data engineers, data owners, and data stewards is crucial. Data engineers must collaborate closely with data owners and stewards to assess data quality. After implementing changes, they need to retest the data to ensure issues are resolved. This process requires clear communication to ensure that data engineers understand the nature of the data quality issue and how the corrected data should appear or be structured. Establishing a feedback loop between data engineers, data owners, and data stewards enables data engineers to confirm when a fix has been applied.
Data Observability and the Role of Data Operations Teams
The communication loop described so far, involving data engineers and data owners, mirrors a typical communication loop in software engineering projects. Business users perform functional tests, technical testers conduct system tests, and engineers implement fixes. However, data projects have a key difference: the majority of data quality issues are not caused by bugs introduced by data engineers. Data engineering teams typically do not own the data itself. They often ingest data from sources such as OLTP databases or external providers. These data providers may alter the data schema without notifying downstream data engineering teams or data consumers.
To effectively manage data quality issues, data teams must implement data observability practices to monitor data and continuously reassess data quality issues after the data platform has been deployed. The introduction of data monitoring brings another team into the process: data operations teams. These teams are dedicated to monitoring data and reviewing incoming issues.
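A data observability check of this kind can be as simple as comparing the current state of a table with its last recorded snapshot and raising an issue when something drifts, such as an unannounced schema change from an upstream provider. The sketch below is a simplified illustration of such a scheduled check; the snapshot storage and the drift rules are assumptions, not a description of any particular tool.

```python
# A simplified sketch of a scheduled data observability check: compare the
# current schema and row count of a table with the last recorded snapshot.
# Snapshot storage and drift rules are assumptions for the example.
import json
from pathlib import Path

import pandas as pd

def snapshot(df: pd.DataFrame) -> dict:
    return {"columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
            "row_count": int(len(df))}

def detect_drift(df: pd.DataFrame, snapshot_path: Path) -> list[str]:
    issues = []
    current = snapshot(df)
    if snapshot_path.exists():
        previous = json.loads(snapshot_path.read_text())
        if current["columns"] != previous["columns"]:
            issues.append("Schema changed since the last snapshot")
        if previous["row_count"] and current["row_count"] < 0.5 * previous["row_count"]:
            issues.append("Row count dropped by more than 50%")
    snapshot_path.write_text(json.dumps(current))  # record the new baseline
    return issues
```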
Teamwork and Communication
All of these teams—data owners, data stewards, data engineers, and data operations teams—must operate as a cohesive unit. The success of the data platform depends on timely and effective communication. Before data can be effectively monitored using data observability tools, data stewards need to test the data and establish appropriate data quality thresholds. This process necessitates a communication loop involving data owners, data stewards, and data engineers.
Large-scale environments with numerous tables in frequent use can present significant challenges. The process of reviewing data quality and establishing a data quality baseline in such environments can be time-consuming. It must be organized in an agile manner, with tables prioritized according to their criticality, ensuring that the most important tables are reviewed first.
A Proven Method for Managing Data Quality Communication
There is a proven method to effectively manage the communication and collaboration between these various teams. Piotr Czarnas, the founder of DQOps, has authored an eBook that serves as a step-by-step guide to implementing a robust data quality process. The process detailed in this eBook was utilized internally by his team during a large-scale data quality project spanning five years and involving up to 50 participants, including data owners, data stewards, and data engineers. The goal of the project was to assess and improve the quality of data assets.
The process was applied to each data domain, and it consisted of the following key steps:
Collecting data quality requirements from data owners.
Profiling the data assets.
Integrating data quality checks into data pipelines (see the sketch after this list).
Prioritizing critical data assets.
Implementing a recurring feedback loop to review and address data quality for a selected group of datasets within each iteration.
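To make the pipeline-integration step concrete, the following sketch shows one way a data quality gate can be wired into a pipeline so that data is loaded only when the configured checks pass. The check definitions and pipeline structure are illustrative assumptions, not the exact process described in the eBook.

```python
# A minimal sketch of a data quality gate inside a pipeline: the load step
# runs only when every configured check passes. Checks, column names, and
# thresholds are illustrative assumptions.
from typing import Callable

import pandas as pd

# Each check returns True when the data is fit for use.
Check = Callable[[pd.DataFrame], bool]

def quality_gate(df: pd.DataFrame, checks: dict[str, Check]) -> None:
    failed = [name for name, check in checks.items() if not check(df)]
    if failed:
        # Stop the pipeline and report the issues to data owners and stewards.
        raise RuntimeError(f"Data quality gate failed: {failed}")

def run_pipeline(extract: Callable[[], pd.DataFrame],
                 load: Callable[[pd.DataFrame], None]) -> None:
    df = extract()
    quality_gate(df, {
        "customer_id completeness": lambda d: d["customer_id"].notna().mean() >= 0.98,
        "no future order dates": lambda d: (d["order_date"] <= pd.Timestamp.now()).all(),
    })
    load(df)  # only data that passed the checks reaches the platform
```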
This structured process enabled the team to achieve improvements in each data domain within a matter of weeks and to resolve all critical issues within six months. The process also emphasized providing feedback to business sponsors: data quality was measured with data quality KPIs, and domain-specific dashboards illustrated the progress of improvement across each data quality dimension.
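For illustration, a data quality KPI of this kind is often computed as the percentage of passed checks, aggregated per data quality dimension so it can feed a domain-specific dashboard. The sketch below assumes a simple table of check results; the numbers are made up for the example.

```python
# A small sketch of a data quality KPI: the share of passed checks, overall
# and per data quality dimension. The check results below are made up.
import pandas as pd

# One row per executed data quality check (illustrative results).
check_results = pd.DataFrame({
    "dimension": ["Validity", "Validity", "Timeliness", "Completeness"],
    "passed":    [True,       False,      True,         True],
})

overall_kpi = check_results["passed"].mean()
kpi_per_dimension = check_results.groupby("dimension")["passed"].mean()

print(f"Overall data quality KPI: {overall_kpi:.0%}")
print(kpi_per_dimension.map("{:.0%}".format))
```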
If you're looking for a proven method to enhance data quality, this step-by-step approach is described in full in the eBook. Visit DQOps to download it for free and start improving your data today.