Operationalizing Data Governance: Actualizing Policies for Data Validation
Businesses are finally recognizing the value of data governance as an enterprise program, with the emergence of the Chief Data Officer (CDO) as a C-level role charged with ensuring that corporate data sets meet defined business data quality expectations. There are two aspects to this governance: defining policies, and operationalizing compliance with them. One straightforward approach to data validation is to make it part and parcel of the application architecture: adjust the development methodologies so that software designers embed data validation as an integral component of their applications.
Institutionalizing data validation within the organization’s application environment is predicated on standardizing an approach for defining data quality. Yet as is becoming more apparent, the definition of “data quality” is non-monolithic. Rather, as each data consumer has a particular set of expectations, it becomes clear that there are definitions of data quality that are relevant for each data usage scenario, and this has some key implications when it comes to data validation:
- There is no single specification of validity: The business process and data consumption context frame what “valid” means, and a data instance that is valid for one consumer may be completely unacceptable to another.
- Everyone is entitled to valid data: At the same time, each data consumer is entitled to use the data in a way that conforms to the validity rules associated with their own business processes.
- Data issue severity is also context-dependent: In some cases the determination that a data instance does not meet the validity specification does not prevent that data from being used; the invalid data is logged but still loaded into the target data warehouse. There are more severe scenarios in which the invalid data prevents a process from properly completing. In that case the invalidity must be remediated before attempting to restart. Note that there is a lot of space between those extremes.
- Data issues should be tracked and managed: No matter what a data issue’s level of severity is, there is a need to specify the types of invalidities, their severity, who should be notified, and what needs to happen when that issue is identified.
- Interoperability of validation, monitoring, and alerts: All aspects of the integrated data validation have to be operational simultaneously. Embedding validation rules and reporting back on their measures may seem obvious, but many different specifications of validity will be in flight at the same time, which makes effective monitoring and alerting more complex. Different consumers have different expectations, and you can't focus on one community while ignoring the others, nor can you kill all active processes when consumers have different tolerances for the severity of invalid data. There must be a business directive guaranteeing that all data validation rules are at least monitored, if not assured, at the same time.
- Remediate while maintaining consistency: There must be ways to address invalid data, once identified, that do not materially impact how other data consumers are using the data. This might not make sense at first: why wouldn’t you always want to have the most correct data? Yet if you are willing to allow different user communities to specify their own data validity expectations, it is likely that at some point one set of expectations will clash with another community’s set. Don't introduce inconsistency when attempting to ensure enterprise data usability.
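To make the list above concrete, here is a minimal Python sketch, not a definitive implementation, of evaluating one record against several consumers' validity policies at once. All names in it (`Severity`, `Rule`, `Issue`, `validate`, `blocked_consumers`, the consumers, the rules, and the steward labels) are hypothetical and chosen only for illustration. The same missing postal code is logged for one consumer and blocks another, and only the affected consumer is halted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    LOG_ONLY = 1   # record the issue, but still load the data
    BLOCK = 2      # halt the consuming process until the issue is remediated


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]   # returns True when the record is valid
    severity: Severity
    notify: str                     # hypothetical steward or team to alert


@dataclass(frozen=True)
class Issue:
    consumer: str
    rule: str
    severity: Severity
    notify: str


def validate(record: dict, policies: dict[str, list[Rule]]) -> list[Issue]:
    """Evaluate every consumer's rules against the same record."""
    issues = []
    for consumer, rules in policies.items():
        for rule in rules:
            if not rule.check(record):
                issues.append(Issue(consumer, rule.name, rule.severity, rule.notify))
    return issues


def blocked_consumers(issues: list[Issue]) -> set[str]:
    """Only consumers with a BLOCK-level issue are halted; the rest proceed."""
    return {issue.consumer for issue in issues if issue.severity is Severity.BLOCK}


# Assumed example policies: the same rule carries a different severity per consumer.
policies = {
    "marketing": [
        Rule("postal_code_present", lambda r: bool(r.get("postal_code")),
             Severity.LOG_ONLY, "marketing-steward"),
    ],
    "billing": [
        Rule("postal_code_present", lambda r: bool(r.get("postal_code")),
             Severity.BLOCK, "billing-steward"),
        Rule("amount_non_negative", lambda r: r.get("amount", 0) >= 0,
             Severity.BLOCK, "billing-steward"),
    ],
}

record = {"customer_id": 17, "postal_code": "", "amount": 42.0}
issues = validate(record, policies)
for issue in issues:
    print(f"{issue.consumer}: {issue.rule} ({issue.severity.name}) -> notify {issue.notify}")
print("halted consumers:", blocked_consumers(issues))
```

Keeping severity and notification on the rule, rather than on the data, is one way to let each consumer's tolerance be tracked independently of the others.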
In retrospect, these implications suggest that embedding data controls and automatically generating alerts of invalid data within your code is necessary, but not sufficient, to meet the union of all enterprise expectations for data quality. Each set of data consumers defines its own set of data validity policies (along with corresponding data quality rules). In turn, the operational data governance practitioners (including data analysts and data stewards) must manage the policies, review how the data quality rules in each policy coincide (or potentially conflict) with the rules from other consumers’ data policies, and observe the degree to which validation tasks (both automatic and manual) are integrated within the end-to-end data flows. This allows the right types of alerts and notifications to be sent out without disrupting the entire set of information flows, and it maximizes the impact of integrated data validity inspection.
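One way a steward might review whether rules coincide or conflict is sketched below in Python. It rests on a strong simplifying assumption: every rule is an accepted numeric range on a named field. Real policies are richer than that, and the consumers, fields, ranges, and the `compare_policies` helper are all hypothetical.

```python
from itertools import combinations

# Assumed policies: each consumer declares an accepted (min, max) range per field.
range_policies = {
    "finance":   {"order_amount": (0.0, 10_000.0)},
    "analytics": {"order_amount": (0.0, 1_000_000.0), "age": (18, 120)},
    "returns":   {"order_amount": (-10_000.0, -0.01)},
}


def compare_policies(policies):
    """Classify each pair of consumers' rules on every field they share."""
    findings = []
    for (c1, p1), (c2, p2) in combinations(policies.items(), 2):
        for field in sorted(p1.keys() & p2.keys()):
            low = max(p1[field][0], p2[field][0])
            high = min(p1[field][1], p2[field][1])
            if p1[field] == p2[field]:
                status = "coincide"
            elif low <= high:
                status = "overlap"    # some values satisfy both consumers
            else:
                status = "conflict"   # no value can satisfy both consumers
            findings.append((field, c1, c2, status))
    return findings


findings = compare_policies(range_policies)
for field, c1, c2, status in findings:
    print(f"{field}: {c1} vs {c2} -> {status}")
```

A "conflict" finding here marks a pair of consumers for whom remediating a value to suit one would invalidate it for the other, which is the case where a steward needs to step in rather than an automated fix.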
More to the point: operational data governance is more than just having a data steward alerted every time a data value goes bad. It combines aspects of data policy management with data policy operationalization in a way that goes beyond having a collection of individuals running down lists of data issues in their monthly data governance council meeting. Embedded validation represents a deeper engagement with the different data consumer communities, so that the right types of data validation procedures are engineered directly into the application landscape.