Data Governance Part I: Data's Evil Twins: Definitions and Duplicates

First in a multi-part series on data governance

Every one of my clients suffers terribly at the hands of data’s evil twins: definitions and duplicates. Once the evil twins latch on to an organization, it’s hard to shake them. Like an anaconda, the twins encircle an organization and squeeze slowly and relentlessly until the lights go out.

Most of my clients recognize the value of data—or rather, the pain of inconsistent, poor-quality data. Most executives learn the hard way that without clean, consistent data they are running the business blindly: they can’t gauge customer interest in a new product fast enough to modify designs or marketing campaigns; they can’t pinpoint manufacturing or supply chain problems quickly enough to avoid cost overruns and delays; they can’t accurately forecast sales or profits to optimize inventory and distribution; they can’t cost-effectively schedule shipments, deliveries, or repairs based on current conditions and historical patterns.

Most executives also discover that without good data, they are flat-footed. They can’t respond nimbly to changing conditions because they can’t see conditions changing. The market continually catches them by surprise and their carefully planned strategies evaporate overnight. They are perpetually reacting to events, instead of directing them. They feel helpless while fleet-footed competitors sprint past, luring away hard-won customers. They are at the total mercy of the market.

The problem is that these executives haven’t figured out how to corral the twin sisters. They’ve allowed them to wreak havoc on their organization’s data consistency and data quality. Most feel the pain but don’t know how to ease it. Often, they blame the information technology (IT) department, erroneously thinking data is a technical problem. (It isn’t!) Then they call me in—or another consultant—to exorcise the two sisters so they can see again and run their companies with 20/20 vision.

Definitions and Dictionaries

The first twin is pervasive because she spawns inconsistent data everywhere. In every organization, I hear the same refrain, “We get different answers to the same questions because people define terms differently.” Two analysts running the same query get different answers because they select data from different fields or at different grains or timeframes. Then, the analysts spend 30 minutes in an executive meeting arguing about whose data is correct.

The most dangerous question. Ironically, the most commonly used terms are the most poorly defined. I once heard a colleague say, “The most dangerous question you can ask a client is, ‘What’s your definition of customer?’” Executives often become quite vexed debating the issue. To marketing, a customer is someone who responds to a campaign; to sales, it’s someone who has signed a purchase order; to finance, it’s someone who has paid a bill. Each is right, but collectively they are wrong.

The solution to data inconsistency is easy: a data dictionary that spells out in plain English the definitions of commonly used terms and metrics. But creating a data dictionary is hard. Gaining consensus on commonly used terms is fraught with politics. People fight tooth and nail to ensure their definitions prevail in the corporate catechism.

Standardizing data definitions. To overcome the politics, the CEO needs to appoint a cross-functional committee of subject matter experts to prioritize terms and propose definitions for each. The executive team then needs to review and refine the committee’s definitions and establish corporate standards for each. Often, this requires a lot of discussion and arm wrestling before executives reach a consensus—or more likely, a truce.

Typically, executives agree to disagree: that is, they create a corporate definition for each term and then re-label the departmental definitions so they are distinct. They then aggregate departmental data so it conforms to the corporate definitions while maintaining the local ones. This way, the organization gets a single corporate definition, and each department preserves its view of the world. This works as long as everyone uses and adheres to the corporate data dictionary.
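To make the idea concrete, here is a minimal sketch of what one entry in such a data dictionary might look like, expressed as a small Python structure. The term, field names, and definitions are illustrative assumptions, not a prescription for any particular tool or glossary format.

```python
# A minimal sketch of one data dictionary entry; all names and definitions
# below are illustrative assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    term: str                   # corporate-standard name for the term
    corporate_definition: str   # the single definition everyone reports against
    departmental_variants: dict = field(default_factory=dict)  # re-labeled local definitions

customer = DictionaryEntry(
    term="Customer",
    corporate_definition="A person or organization that has paid for a product or service.",
    departmental_variants={
        "Marketing respondent": "A person who responded to a campaign.",
        "Sales prospect": "A person or organization that has signed a purchase order.",
        "Billed customer": "A person or organization that has paid an invoice.",
    },
)
```

The point is not the code but the shape: one corporate definition that everyone reports against, with the departmental variants re-labeled and preserved alongside it.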

Defects and Duplicates

The second twin is more pernicious than the first. She welcomes consistent definitions but works surreptitiously behind the scenes to undermine the data values that comprise those definitions. The twin liberally sprinkles defects into database tables and fields using a variety of means: data entry and programming errors, systems migration mishaps, legacy system rewrites, and just plain data obsolescence.  

For example, in a customer database, five percent of the records each month deteriorate in quality due to death, divorce, marriage, and change of address. Worse, an even higher percentage of customer records spawn duplicates, largely because most organizations house customer data in multiple databases running different applications that capture different attributes of a customer at different times for different reasons. Then, organizations face the perplexing problem of trying to figure out whether “Joe Daley, 51, of 1 Prescott Lane” is the same as “J. Daley, 53, of 10 Presque Lane,” who is the same as “Joseph Dailey, 49, of 1 Prescot Ln.”
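That “Joe Daley / Joseph Dailey” puzzle is exactly what record-matching logic tries to untangle. Below is a minimal sketch using only Python’s standard library; the similarity threshold and the choice of fields are assumptions made for illustration, and real matching engines use far more sophisticated techniques.

```python
# A crude duplicate check using fuzzy string similarity from the standard
# library. The 0.8 threshold and the name/address fields are illustrative
# assumptions, not a production matching algorithm.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(rec1: dict, rec2: dict, threshold: float = 0.8) -> bool:
    """Flag two customer records as probable duplicates when both the
    name and the address are similar enough."""
    return (similarity(rec1["name"], rec2["name"]) >= threshold and
            similarity(rec1["address"], rec2["address"]) >= threshold)

joe = {"name": "Joe Daley", "address": "1 Prescott Lane"}
joseph = {"name": "Joseph Dailey", "address": "1 Prescot Ln"}
print(likely_duplicates(joe, joseph))  # True: minor spelling variations still match
```

Even this crude approach flags the two Daley records as probable duplicates; the hard part in practice is tuning the matching so that genuinely different people with similar names and addresses do not get merged.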

Dimensional data. Customer data is an example of what the industry calls master data—or dimensional data. Customer data describes “who” has conducted a transaction—bought something, sold something, or done something. Dimensional data describes transaction data, which are the numbers companies aggregate to determine sales, profits, inventory, shipments, and so on. There are many types of master or dimensional data: products, parts, employees, suppliers, partners, offices, geographies, accounts, contracts, warranties, agents, and so on.

Keeping dimensional data harmonized across applications and systems is difficult. Ideally, companies store master data in one application and system only; that way, it never gets replicated or out of synch with itself. But most companies spawn systems and applications like wildflowers after a spring rain. The only way to harmonize dimensional data in a heterogeneous application and systems environment is to apply master data management (MDM).

MDM hubs. MDM uses a data hub to store “golden records” for each type of dimensional data, such as customers or products. The hub collects new records from each application and runs them through a matching algorithm to determine whether the record already exists in the hub. If the record is new, the hub turns it into a golden record; if it’s not, the hub identifies which information is new and updates the golden record accordingly. The hub then makes the changes available to all subscribing applications, either in batch or in real time.
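Here is a deliberately simplified sketch of that match-and-merge loop, again in Python. Matching on a single name field with a fixed threshold is an assumption made for brevity; commercial MDM hubs match on many weighted attributes and keep full survivorship rules and audit history.

```python
# A minimal sketch of an MDM hub's match-and-merge step. The class, the
# single-field matching, and the threshold are illustrative assumptions,
# not a reference to any particular MDM product.
from difflib import SequenceMatcher

class MdmHub:
    def __init__(self, match_threshold: float = 0.85):
        self.golden_records = []            # one consolidated record per real-world entity
        self.match_threshold = match_threshold

    def _score(self, a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def ingest(self, record: dict) -> dict:
        """Match an incoming application record against existing golden records.
        On a match, merge in the new attributes; otherwise promote the record
        to a new golden record."""
        for golden in self.golden_records:
            if self._score(golden["name"], record["name"]) >= self.match_threshold:
                # Merge: fill in attributes the golden record does not yet have.
                for key, value in record.items():
                    golden.setdefault(key, value)
                return golden
        self.golden_records.append(dict(record))
        return self.golden_records[-1]

hub = MdmHub()
hub.ingest({"name": "Joe Daley", "address": "1 Prescott Lane"})   # becomes a new golden record
hub.ingest({"name": "Joseph Daley", "phone": "555-0100"})         # merged into the first record
print(len(hub.golden_records))  # 1
```

The essential idea survives the simplification: every incoming record either enriches an existing golden record or becomes a new one, so subscribing applications work from one consolidated view instead of many conflicting ones.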

Thus, an MDM system is designed to prevent data duplication, which, data entry mistakes aside, is the prime cause of poor quality data. Without MDM, a company might think it has twice as many customers or employees or parts as it actually has. This creates all kinds of confusion, process inefficiencies, and costly workarounds that frustrate employees and annoy customers and suppliers.

Caging data’s evil twins is not easy. But they must be confronted, apprehended, and cuffed if organizations want to lay a solid foundation for reporting and analytics. Without clean, consistent, and harmonized data, organizations cannot compete effectively in today’s economy.

Next article in the series: "Data Governance Part II: How to Create a Common Data Vocabulary."

Wayne Eckerson
