Domesticating Data: How to Realize Value from External Data

Today, dogs are a man’s best friend. Before they became the lovable pups we welcome into our homes, however, dogs were wild beasts. The process of turning wolves into our furry pals took thousands of years of taming and breeding. Creating value from external data requires a similar, though much faster, domestication process.

Thanks to the wave of digital transformations in the last decade, organizations generate more data than ever before. At the same time, new technologies such as data exchanges have allowed companies to monetize and distribute this data with unprecedented ease. Wild data is everywhere. Like our canine companions, this data has the potential to help us with a multitude of endeavors, but first we must tame and train it.

Domesticating external data consists of two steps:

  1. Selection

  2. Taming

Data Selection

To have a chance of success, first you need to pick data with potential. The first dog was probably a cub that was unusually friendly to start with. In a data sense, this means identifying a source that already has a high level of data quality and also provides the kind of information you need. The hunt for wild data typically begins in one of three places—a data vendor, a partner organization, or a data marketplace.

Data vendors. Data vendors are large-scale data providers that sell data as a primary line of business. These organizations, such as Bloomberg, S&P, Experian, Nielsen, and Acxiom, have numerous data products and their own means of delivery. Often businesses subscribe directly to the vendor, which provides new data on a regular basis. For many years, these organizations dominated the market for external data. Now, that’s changing.

Because providing data as a service is a core aspect of their business, these organizations have an incentive to provide high-quality data. Their brand reputations depend on delivering accurate, complete data sets. Because they’ve been in business for so long, however, they can be more rigid in the kinds and formats of data they provide.

Partner Organizations. If you have close partnerships with organizations that might have the data you need, you can always ask directly. These groups don’t have to be strictly external either. Within a large enterprise, different departments can have separate data environments, so the data you need might even be within the company. This is the least formal way of acquiring data. It can work for small amounts of data, but it can create a headache from a governance perspective, especially if it involves emailing around flat files.

This approach requires a high degree of trust that the partner will provide good data. Partners often have little incentive to spend time crafting a clean, usable data product. The approach also scales poorly, as in most cases there’s no infrastructure to support an ongoing exchange.

Data marketplace. Essentially, a data marketplace is a hosted platform where users can browse and acquire data from a wide range of providers. Some are large public marketplaces that carry data of all sorts; AWS and Snowflake both operate exchanges of this kind. Others specialize in data relevant to particular sectors, or charge fees for access. These are often run by non-tech companies or industry consortiums using a data exchange platform like those provided by Dawex, Harbr, or Narrative. By separating the exchange platform from the data product, data marketplaces have democratized third-party data. Smaller data vendors become discoverable, and smaller data consumers can buy just the data they need. The significant financial and technological resources of large enterprises are no longer a prerequisite to acquiring external data.

Most marketplaces require data providers to display a range of information about their data products, including a description, basic metadata, and often even a sample. The exact requirements vary from platform to platform, but they can help would-be data consumers vet the product before purchasing it. 

Data Taming

Once you’ve selected and acquired the data you want, it’s time to “tame” it. Just like training a new puppy, shaping wild data into a usable asset takes some effort. Even the highest quality external data typically needs to be remodeled and validated before it can be joined with internal sources.

Data Quality. The first checks for external data should measure basic metrics of data quality. Tests should check attributes including null values, duplicates, missing rows, and obvious errors. They should also identify data types and common patterns. The allowable threshold for different attributes and the specific metrics you need to check will depend on your intended use case. You’ll need to rely on domain experts to understand what would render data unusable. Often this process requires manually writing rigorous data quality tests in SQL to check the incoming data. Tools such as FirstEigen, BigEye, and Ataccama, however, have started to automate pieces of this workflow, making it much more scalable.
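The checks described above can be sketched in a few lines of Python. This is a minimal, illustrative example, not a production framework: the required fields, the null-rate threshold, and the sample feed are all hypothetical, and real pipelines would typically use SQL tests or a tool like those named above.

```python
# Minimal sketch of basic quality checks for an external data feed,
# assuming records arrive as a list of dicts (e.g., parsed from CSV/JSON).
# REQUIRED_FIELDS and MAX_NULL_RATE are illustrative, not prescriptive.

REQUIRED_FIELDS = ["id", "name", "revenue"]  # hypothetical schema
MAX_NULL_RATE = 0.05  # allowable share of missing values per field

def profile(records):
    """Compute simple quality metrics: row count, per-field null rates,
    and the number of duplicate primary keys."""
    n = len(records)
    null_rates = {
        field: sum(1 for r in records if r.get(field) in (None, "")) / n
        for field in REQUIRED_FIELDS
    }
    seen, duplicates = set(), 0
    for r in records:
        key = r.get("id")
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": n, "null_rates": null_rates, "duplicates": duplicates}

def passes_checks(metrics):
    """Gate the feed: reject it if any field's null rate exceeds the
    threshold or any duplicate keys are present."""
    return (
        all(rate <= MAX_NULL_RATE for rate in metrics["null_rates"].values())
        and metrics["duplicates"] == 0
    )

# A tiny sample feed with deliberate problems:
feed = [
    {"id": 1, "name": "Acme", "revenue": 120.0},
    {"id": 2, "name": "", "revenue": 95.5},        # missing name
    {"id": 2, "name": "Globex", "revenue": 80.0},  # duplicate id
]
metrics = profile(feed)
print(passes_checks(metrics))  # prints False
```

In practice the thresholds come from the domain experts mentioned above; the point is that the gate runs automatically on every delivery, not once at purchase time.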

Remodeling. Once the data has passed the requisite quality checks, an engineer typically needs to model it to fit with internal data. Most companies have their own business logic and definitions that won’t match those used by an external data provider. To join external data with company data, whether to create a product for internal consumption or to power an application, you’ll need to modify its structure to match. Only when the external data can be used with comparable friction to internal sources can it be considered domesticated.
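A remodeling step of this kind often boils down to renaming vendor fields to internal names, normalizing units, and applying internal business definitions. The sketch below illustrates the idea; every field name, unit convention, and segmentation rule here is a hypothetical stand-in for whatever your own schema and business logic dictate.

```python
# Illustrative sketch of remodeling an external record to fit an internal
# schema: field renaming, unit conversion, and internal business logic.
# All names and rules are hypothetical examples.

# Map the vendor's field names to the internal schema's names.
FIELD_MAP = {"company_nm": "company_name", "rev_usd_k": "revenue_usd"}

def remodel(external_record):
    """Translate one external record into the internal schema."""
    internal = {FIELD_MAP[k]: v for k, v in external_record.items()
                if k in FIELD_MAP}
    # Hypothetical unit mismatch: the vendor reports revenue in
    # thousands of dollars; the internal standard is whole dollars.
    internal["revenue_usd"] = internal["revenue_usd"] * 1000
    # Hypothetical internal definition: segment accounts by revenue.
    internal["segment"] = (
        "enterprise" if internal["revenue_usd"] >= 1_000_000 else "smb"
    )
    return internal

external = {"company_nm": "Initech", "rev_usd_k": 2500}
print(remodel(external))
# prints {'company_name': 'Initech', 'revenue_usd': 2500000, 'segment': 'enterprise'}
```

Once records pass through a translation layer like this, downstream joins and products can treat the external source exactly like an internal one, which is the working definition of “domesticated” used here.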

The Value of House-Trained Data

The process may seem daunting, but domesticating third-party data is a worthy endeavor. One organization, no matter how large, simply cannot generate the data required to answer every question it might need to ask. External data can fill in the gaps and power new insights, but only when thoroughly tamed. Using the data domestication process described here will help you extract the potential of third-party data without falling prey to its pitfalls.

Figure 1. The Data Domestication Process

Joe Hilleary

Joe Hilleary is a writer, researcher, and data enthusiast. He believes that we are living through a pivotal moment in the evolution of data technology and is dedicated to...
