A Data Quality Framework for Big Data
The data quality practices and techniques that we’ve traditionally used when working with structured enterprise data don’t work well for big data. Many of the data quality rules used with structured data—referential integrity rules, for example—don’t apply when the data is not organized and managed as relational tables.
Data quality practices from BI and data warehousing are geared toward cleansing data to improve content correctness and structural integrity in data that is used by query and reporting processes. In the big data world, quality is more elusive. Correctness is difficult to determine when using data from external sources, and structural integrity can be difficult to test with unstructured and differently structured (non-relational) data.
Data Quality for Analytics
We’ve traditionally treated data quality as an absolute: either the data is right or it isn’t. That worked okay when the primary uses of data were for query and reporting. But the game changes when using data for analytics. (See Figure 1.)
Figure 1. Different Approaches for Data Quality
Where enterprise data quality management applies concepts of data quality measurement and data quality scorecards, big data shifts from measurement to judgment. Lacking crisp data quality rules for which pass/fail counts can be obtained, big data quality judgment is more qualitative and subjective than quantitative and objective.
Data profiling is a good first step in judging data quality, but it works differently for big data than for structured data. Structured methods of column, table, and cross-table profiling can’t easily be applied to big data. Data virtualization tools can create row/column views for some types of big data, and those views can then be profiled using relational techniques. This approach provides useful data content statistics but fails to give a full picture of the shape of the data. Visual profiling, by contrast, shows patterns, exceptions, and anomalies that are helpful in judging big data quality.
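As a concrete illustration, here is a minimal visual-profiling sketch in Python, assuming only the standard library plus matplotlib; the records and field names are hypothetical stand-ins for semi-structured big data. It plots field-level value frequencies so the shape of the data, including anomalies, stays visible rather than being reduced to pass/fail counts.

```python
# Minimal visual-profiling sketch (assumes matplotlib is installed).
# The records and field names below are hypothetical illustrations.
from collections import Counter
import matplotlib.pyplot as plt

records = [
    {"country": "US", "channel": "web"},
    {"country": "US", "channel": "mobile"},
    {"country": "CA", "channel": "web"},
    {"country": "us", "channel": "web"},  # casing anomaly worth seeing, not silently "fixing"
    {"country": "US"},                    # missing field shows up as a gap in the profile
]

# Count how often each field appears and how its values are distributed.
field_presence = Counter(k for r in records for k in r)
value_counts = Counter((k, v) for r in records for k, v in r.items())

# Plot value frequencies so patterns, exceptions, and anomalies are visible at a glance.
labels = [f"{k}={v}" for (k, v) in value_counts]
plt.bar(labels, list(value_counts.values()))
plt.xticks(rotation=45, ha="right")
plt.title("Value frequency profile (visual, not pass/fail)")
plt.tight_layout()
plt.show()

print("Field presence:", dict(field_presence))
```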
Most “unstructured” data does have structure, but it is different from relational structure. Visual profiling will help to show the structure of document stores and graph databases, for example. Data samples can then be checked against the inferred structure to find exceptions—perhaps iteratively refining understanding of the underlying structure.
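To make that iterative idea concrete, here is a hedged sketch, again in Python with hypothetical documents and field names: it infers the dominant type for each field across a sample, then flags documents that deviate from the inferred structure so exceptions can be reviewed and the structural understanding refined.

```python
# Sketch: infer a rough structure from a sample of documents, then flag exceptions.
# All documents and field names here are hypothetical illustrations.
from collections import Counter

sample = [
    {"id": 1, "name": "Alice", "tags": ["a", "b"]},
    {"id": 2, "name": "Bob", "tags": []},
    {"id": 3, "name": "Carol"},                 # missing "tags"
    {"id": "4", "name": "Dan", "tags": ["c"]},  # "id" arrives as a string
]

# Infer the dominant type for each field across the sample.
type_votes = {}
for doc in sample:
    for field, value in doc.items():
        type_votes.setdefault(field, Counter())[type(value).__name__] += 1

inferred = {field: votes.most_common(1)[0][0] for field, votes in type_votes.items()}
print("Inferred structure:", inferred)

# Check each document against the inferred structure and report exceptions.
for doc in sample:
    issues = []
    for field, expected in inferred.items():
        if field not in doc:
            issues.append(f"missing {field}")
        elif type(doc[field]).__name__ != expected:
            issues.append(f"{field} is {type(doc[field]).__name__}, expected {expected}")
    if issues:
        print(f"doc {doc.get('id')}: {', '.join(issues)}")
```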
Data quality judgment and structural findings should be recorded in a data catalog, allowing data consumers to evaluate the usability of the data. With big data, quality must be evaluated in terms of fitness for purpose. With analytics, the need for data quality can vary widely by use case. The quality of data used for revenue forecasting, for example, may demand a higher level of accuracy than data used for market segmentation. The definition of quality depends on the use case, and each use case has unique needs for data accuracy, precision, timeliness, and completeness.
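One lightweight way to record such judgments, sketched here under the assumption of a simple in-house catalog rather than any particular catalog product, is an entry that pairs structural findings with a fitness judgment per use case. The dataset name, findings, and use cases below are all hypothetical.

```python
# Sketch of a catalog entry recording quality judgments per use case.
# The dataset, findings, and use cases are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    structural_findings: list = field(default_factory=list)
    fitness_by_use_case: dict = field(default_factory=dict)  # use case -> judgment

entry = CatalogEntry(
    dataset="clickstream_events",
    structural_findings=["~2% of events missing session_id", "timestamps in two formats"],
    fitness_by_use_case={
        "revenue forecasting": "not fit: accuracy and completeness too low",
        "market segmentation": "fit with caveats: segment-level trends unaffected",
    },
)

print(entry.fitness_by_use_case["market segmentation"])
```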
It is important to recognize that some kinds of analytics rely on outliers and anomalies to find interesting things in the data. Predictive analytics, for example, has a greater interest in the long tails of a value frequency distribution than in the center of the curve. Traditional data cleansing techniques are likely to regard the outliers as quality-deficient data and attempt to repair them through cleansing. We must be careful not to cleanse away analytic value and opportunity.
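A small sketch of that caution, assuming a simple numeric series and a conventional z-score rule (the threshold of 2 is an illustrative choice): flag outliers for downstream judgment rather than deleting or repairing them.

```python
# Sketch: flag outliers rather than cleanse them away.
# The values and the 2-sigma threshold are illustrative assumptions.
import statistics

values = [102, 98, 101, 99, 103, 100, 97, 480]  # the 480 may be the interesting part

mean = statistics.mean(values)
stdev = statistics.stdev(values)

for value in values:
    is_outlier = abs(value - mean) / stdev > 2
    tag = "OUTLIER (keep and flag, do not repair)" if is_outlier else ""
    print(value, tag)
```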
Judging Data Quality
Data quality judgment is necessary for big data, but it isn’t easy and it can’t be entirely subjective. We need a framework that provides criteria and guidance to judge data quality. I recommend a framework based on three dimensions:
Data Characteristics encompassing the data itself, the available metadata, and the source of the data.
Data Usefulness including interpretability, relevance, and accuracy.
Data Processing comprising ingestion, refinement, and consumption.
The three dimensions, each with three subtopics, yield nine areas for data evaluation. Intersecting the dimensions with one another, pair by pair, produces 27 questions (nine for each of the three pairings) that can help bring objectivity to the work of judging data quality.
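For readers who prefer to see the shape of the framework in code, here is a short Python sketch. The question wording is placeholder text, since the actual questions appear in Figures 2 through 4; the point is only the three-by-three-by-three structure.

```python
# Sketch of the framework's shape: three dimensions, three subtopics each,
# and 27 pairwise intersection questions. Question text is placeholder --
# the actual questions appear in Figures 2-4.
from itertools import combinations, product

dimensions = {
    "Data Characteristics": ["data itself", "metadata", "data source"],
    "Data Usefulness": ["interpretability", "relevance", "accuracy"],
    "Data Processing": ["ingestion", "refinement", "consumption"],
}

questions = []
for (dim_a, subs_a), (dim_b, subs_b) in combinations(dimensions.items(), 2):
    for sub_a, sub_b in product(subs_a, subs_b):
        questions.append(f"How does {sub_a} ({dim_a}) affect {sub_b} ({dim_b})?")

print(len(questions))  # 27: three dimension pairs x nine intersections each
```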
Quality as a Function of Data Characteristics and Data Usefulness
Figure 2 illustrates the intersection of data characteristics with data usefulness and nine questions to apply when judging data quality.
Figure 2. Data Characteristics and Data Usefulness
Quality as a Function of Data Processing and Data Usefulness
Figure 3 illustrates the intersection of data processing and data usefulness.
Figure 3. Data Processing and Data Usefulness
Quality as a Function of Data Characteristics and Data Processing
Figure 4 illustrates the intersection of data characteristics with data processing and nine questions to apply when judging data quality.
Figure 4. Data Characteristics and Data Processing
Putting the Framework to Work
Note that I have not wrapped these concepts in a lot of definitional language. Ideally, you’ll decide for yourself and your organization how to define interpretability, relevance, accuracy, consumption, refinement, and so on. The goal of this framework is to provide guidance, not to be prescriptive. By asking the 27 questions shown here and discussing them, you are sure to arrive at a better judgment of big data quality than by judging on a purely subjective basis with individual and ad hoc criteria.