Get Data Preparation Right or Prepare to Fail

Git along, little gigabytes, terabytes, and petabytes, git along. Head ‘em up, move ‘em out, as we head on down the trail...

Everyone wants analytic insights to make smarter, faster decisions, but (to continue the earthy/rustic metaphors) before we reap the benefits of analytics, we must first sow good data—and separating the wheat from the chaff requires effective data preparation, i.e., ETL and data wrangling.

Data wrangling conjures visions of cowpunchers wrangling cattle out on the prairie, but given the amount of data companies are facing today, the rate at which they must assimilate it needs to look a lot more like a fleet of speeding Jeep Wranglers. I have identified 41 specific data sources enterprises need to access, plus another 14 if the enterprise happens to be a communications service provider (CSP). The IoT is already posing a major challenge by unleashing massive new volumes of data on the organization. True, it holds the allure of huge benefits, provided existing data management systems don't choke on all that data first. Now the next generation of data sources, from geospatial to drones to blockchain to bitcoin and other cryptocurrencies, is poised to push systems (and organizations) to the breaking point.

Yes, we list blockchain separately: it’s not just about crypto; some companies are already putting blockchain to work in areas like adtech and retail/e-commerce.

Here are some of the things data science and IT executives tell me are keeping them up at night, the big questions they wrestle with:


Are we providing seamless access to all data: on-premises, cloud, internal and external sources? Are we helping our organizations make decisions based on the best data available? Is the data we’re using the most relevant, most current data? These are killers because we can work all the magic we want on data, streaming and analyzing it for real-time insights…but if we are working off incomplete datasets, our organization is making decisions based on only part of the truth; in effect, with blinders on.


Strategically, are we ensuring data quality at the source, in the data pipeline, and in the data sets that are being created? Day-to-day, can I as a business user trust the outcome of this prep job I’m working on right now? IT has tribal knowledge about the data that business users probably need in order to get the best results. Without it, I suffer from blind spots that will skew my results, and thus my recommendations.

Data integrity is only now starting to get the scrutiny it deserves. There are great reporting and visualization tools and services that make data look exciting and polished. But where is the proof that the data and the numbers are actually correct?


Who is using my data? What are they learning? We have access to a vast quantity of data; how can we help more people across our organization gain value from it? Day-to-day, how do I as a user get answers to my business hypotheses in more timely and cost-effective ways?

Business users will use tools and resources that are accessible to them, are easy to learn and use, and are flexible to meet their changing requirements. If data prep tools cater only to the technically oriented data experts and analysts, there will forever be bottlenecks, and business users will look for alternatives to get their work done.


Where does all of my data reside? With GDPR and other regulatory requirements, a CDO needs to know where all manifestations of a particular entity (such as a person or a transaction) reside. When you have dozens or maybe hundreds of systems and limited ability to perform data matching across them, I’ll tell you right now: you are exposed. You may start with a clean, privacy-respecting dataset and best practices, but transformations that occur during data preparation can expose sensitive personal data and create privacy violations anyway!
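To make that last point concrete, here is a minimal sketch of how an ordinary prep-time join can reintroduce personal data into a "clean" analytics extract, and how pseudonymizing the key avoids it. All record names, fields, and the salt value are hypothetical placeholders, not any particular product's API.

```python
import hashlib

# Hypothetical records: an analytics extract and a CRM reference table that a
# prep job joins together. All names and fields here are illustrative only.
orders = [{"customer_id": "C100", "amount": 42.50}]
crm = [{"customer_id": "C100", "email": "pat@example.com"}]

def pseudonymize(value: str, salt: str = "prep-pipeline-salt") -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# A naive prep join drags the email address (raw PII) into the prepared set:
naive = [{**o, **c} for o in orders for c in crm
         if o["customer_id"] == c["customer_id"]]

# A privacy-aware version pseudonymizes the join key and drops direct
# identifiers, so the prepared data set carries no raw personal data.
safe = [{"customer_key": pseudonymize(o["customer_id"]), "amount": o["amount"]}
        for o in orders]
```

The point is that the exposure happens in the pipeline itself, not at the source, which is why governance has to cover preparation steps, not just storage.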

These challenges have given rise to a new, fragmented market of data preparation tools that let data scientists quickly manipulate raw data into prepared data assets. The main risk of fragmented tooling is the lack of a secure and governed system for data management. This takes the old issue of “data silos” and extends it into data PREP silos! Data scientists never have complete access to all available data, their work is never fully operationalized, and it becomes difficult or impossible to govern sensitive data assets across the enterprise.

As a result of the challenges they face, and how they are trying to deal with them, companies cannot discover the data they need, or find it again when they do. Even when they CAN discover and find the data they need, they rely on slow, manual processes to get the job done. We assert that data science and IT teams spend up to 80% of their time cleansing and normalizing data in order to analyze it and obtain the analytic insights their organizations crave.

So, what constitutes a good data preparation strategy? The first step is to identify the data sets that will support decisions, how the data will be manipulated, and the analytical process that will define insight generation. Next, select the data sources to support the desired decisions; this will not only define the types of data available but will largely determine the kinds of data cleansing required. Then, assess and ingest additional data sets: as new data is discovered or becomes available, data preparation must occur so that new sources can inform ongoing decision making. The final pieces of the puzzle are a data preparation platform that provides access to curated and trusted data, and analytic tools to produce the desired insights.
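The steps above can be sketched in a few lines: shared cleansing rules are defined once, and every source, including ones discovered later, is ingested through them before it feeds decision making. The function names and the in-memory source are hypothetical placeholders, a sketch of the workflow rather than any vendor's implementation.

```python
from typing import Callable

def cleanse(record: dict) -> dict:
    """Normalize a record: trim whitespace, lowercase strings, drop nulls."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items() if v is not None}

def ingest(source: list, rules: Callable[[dict], dict]) -> list:
    """Run every record from a source through the shared cleansing rules."""
    return [rules(r) for r in source]

# Steps 1-2: the decision to support and a source chosen for it (hypothetical).
crm_source = [{"name": "  Ada ", "region": "EMEA", "note": None}]

# Steps 3-4: ingest and cleanse; any newly discovered source goes through the
# same rules before it can inform ongoing decision making.
curated = ingest(crm_source, cleanse)
```

The design point is that cleansing lives with the pipeline, not with each analyst, so new sources inherit the same quality guarantees automatically.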

All of this points toward an enterprise data management platform with data preparation on board (not requiring separate data calls to an external system) to handle the added volume and new types of data emanating from all sources; to empower users to quickly cleanse and prepare that data for use; and to provide advanced analytics that help users act decisively on it all. The result? Yee ha! Data wrangled. Analytic insights lassoed and put to work helping your organization survive and prosper out there on the competitive trail.

Contact me to learn more about all of this, and for suggestions on which sorts of enterprise data management platforms may be up to the job.

Jeff Cotrupe

