Automating Data Preparation in a Digitalized World, Part 4
In today’s increasingly digitalized world, social media and the Internet of Things (IoT) generate an enormous and ever-growing volume of data and information. It is claimed that 90% or more of all existing data has been created in the prior two years and that, according to IBM, this percentage has remained true over the past thirty years. Put another way, in every two-year period we generate ten times as much data as in the previous two years. Generated does not, however, equate to original. We copy and replicate data and information continuously. The first IDC Digital Universe study, in 2007, estimated that as much as three-quarters of the data it tracked was copied. Intuitively, that feels like an underestimate, given the amount of data we copy and cleanse, combine and compress in BI and analytics alone.
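The arithmetic behind that restatement is easy to verify. The few lines of Python below are just a back-of-the-envelope check of what the 90% claim implies; they are not figures from any study:

```python
# Back-of-the-envelope check: if 90% of all existing data was created
# in the prior two-year period, total volume grows tenfold every two
# years, so each period's new data is ten times the previous period's.
total = 1.0                       # all data existing today (normalized)
last_period = 0.9 * total         # created in the prior two years
older = total - last_period       # everything created before that: 0.1
previous_period = 0.9 * older     # same 90% rule, one period earlier: 0.09
print(round(last_period / previous_period, 6))  # -> 10.0
```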
Data preparation is, of course, at the heart of this great industry of data (re-)generation. From its original role in populating data warehouses and marts, data preparation is now spawning tools to fill data lakes and distribute their data goodness ever further. However, the volumes of data involved today raise serious questions about the validity of an architecture based firmly on creating ever more copies of data, especially of social media and IoT data. This simple observation was one driver of the pillared data architecture I proposed in Business unIntelligence, where these voluminous data types are stored only once—if at all—within the enterprise. In addition, for reasons of timeliness and agility, the same architecture posits the need to reduce the number of copies of traditional business data as well.
Data preparation technology has, as a result, expanded to include virtualization (aka federation) techniques, where data is accessed in situ when required and cleansed en route to the requesting application or user. This is, of course, in addition to data wrangling, data warehouse automation, and ETL, as well as supporting functionality ranging from metadata to data quality tools. With such a variety of tools and techniques included, data preparation becomes too broad to be treated as a single market. Rather, we should evaluate it as a toolkit of diverse functions that can be applied to data between its first creation and its final use by the business, in order to improve the usability of that data and information.
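To make the virtualization pattern concrete, here is a minimal Python sketch, with entirely hypothetical sources and cleansing rules, of data being read in situ and cleansed en route to the requester, with no intermediate copy persisted:

```python
from typing import Iterator

# Hypothetical in-situ sources; in a real federation layer these would
# be live connections to the underlying systems, not extracts.
def read_crm() -> Iterator[dict]:
    yield {"customer": " Acme Corp ", "revenue": "1200"}

def read_erp() -> Iterator[dict]:
    yield {"customer": "Beta Ltd", "revenue": None}

def cleanse(record: dict) -> dict:
    # Cleansing happens en route: trim names, default missing revenue.
    return {
        "customer": record["customer"].strip(),
        "revenue": float(record["revenue"] or 0.0),
    }

def federated_query() -> Iterator[dict]:
    # Data is accessed where it lives and transformed only as it flows
    # to the requesting application; nothing is persisted along the way.
    for source in (read_crm(), read_erp()):
        for record in source:
            yield cleanse(record)

for row in federated_query():
    print(row)
```

The design point is that the generators pull records straight from where they live; the only cleansed data that ever exists is the stream flowing to the consumer.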
The full list of evaluation criteria is long, but five key aspects stand out:
- Focus: from general-purpose copy and cleanse to building specific-purpose data stores
- Usage: from pure business users to IT data experts
- Starting point: from existing physical data sources to a desired (data) model of the business
- Timeliness: from point-in-time batch transfers to real-time data access
- Agility: from preparation sandboxes to guaranteed SLA (service level agreement) production
In general, most products (except the most basic) span a range of points on each of these spectra. SnapLogic, to take one example, would be characterized as focused on general-purpose copy and cleanse, usable by business users and less-skilled IT staff, starting from existing sources, covering the full range of timeliness, and offering agility from sandbox to production. The data warehouse automation product TimeXtender, in contrast, focuses on building a specific-purpose data store, bridging from data-savvy business users to IT, starting from existing physical sources, for point-in-time loads with easily maintained, production-level data preparation.
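One informal way to apply these spectra when shortlisting is to record each product's coverage as a range per criterion. The Python sketch below does just that; the numeric ranges are my own loose paraphrase of the characterizations above, not vendor-supplied ratings:

```python
# Illustrative only: where two products sit on the five spectra, with
# 0.0 as the first pole of each criterion and 1.0 as the second.
profiles = {
    "SnapLogic": {
        "focus": (0.0, 0.3),           # general-purpose copy and cleanse
        "usage": (0.0, 0.6),           # business users to less-skilled IT
        "starting_point": (0.0, 0.2),  # existing physical sources
        "timeliness": (0.0, 1.0),      # batch through real-time
        "agility": (0.0, 1.0),         # sandbox through production
    },
    "TimeXtender": {
        "focus": (0.8, 1.0),           # specific-purpose data store
        "usage": (0.3, 0.7),           # data-savvy business to IT
        "starting_point": (0.0, 0.2),  # existing physical sources
        "timeliness": (0.0, 0.2),      # point-in-time loads
        "agility": (0.6, 1.0),         # maintainable, production-level
    },
}

def covers(product: str, criterion: str, need: float) -> bool:
    """True if the product's range on a criterion spans the stated need."""
    low, high = profiles[product][criterion]
    return low <= need <= high

print(covers("SnapLogic", "timeliness", 0.9))    # real-time need: True
print(covers("TimeXtender", "timeliness", 0.9))  # point-in-time focus: False
```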
With a burgeoning array of tools and products in the broad area of data preparation, BI and data lake implementers are spoiled for choice. There is undoubtedly a product (or two) that closely meets your needs and suits your pocket. Nonetheless, whichever approach you take and product you purchase, there is one aspect that deserves special attention: ensuring the governance and quality of the data delivered. That requires an organization and process focus that will likely add considerably to the effort or cost directly associated with your chosen solution.
But, that’s a topic for another series and another day…