Automating Data Preparation in a Digitalized World, Part 3

Extract, transform and load (ETL) is a category of software that is unfortunately named. As pointed out in part 1 of this series, the name is misleading in a variety of ways. It implies that the three functions are equally important, cover the full set of functions needed, and are executed in a particular order. None of these assumptions is true. Indeed, some have never been true. So, how do we characterize products in this market?

One approach is simply to “follow the crowd” on which products are ETL tools. For example, consulting and analyst company Passionned Group in the Netherlands provides a list of almost two dozen ETL tools evaluated against over ninety criteria. (The list is out of date for some products and misses others entirely, but it provides a useful starting point for tool selection.) An examination of the products listed here and elsewhere reveals a range of functionality, from basic source/target transformation mapping to complete data and quality management suites. Some products are closely integrated or aligned with databases or business intelligence tools, while others stand independently. Some operate solely in batch mode, while others support real-time operation and even data virtualization.

With such a range of function and approach, the market defies easy categorization, either within or against other categories, such as data warehouse automation, with which its functionality overlaps extensively. However, it is instructive to consider each vendor's breadth of vision across the broad topic of data and information.

At one end of this spectrum are the niche vendors, whose products address a well-bounded and well-understood task: how to get data from one or more source structures to a different target structure in the simplest and most efficient manner. In short, these tools support a project-oriented focus on a well-defined task carried out by a largely IT-centric team. In a small organization with relatively simple data preparation needs, this approach works well. It reduces the learning curve and skills required for IT professionals tasked with getting data cleanly and reliably from sources A and B to target C. It also reduces ongoing maintenance effort and cost as the business's data needs evolve.
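To make that well-bounded task concrete, here is a minimal sketch in Python, using invented file names, column mappings, and a SQLite target, of the kind of job a project-oriented ETL tool automates: extract from sources A and B, apply a few cleansing rules, and load target C. A graphical tool manages the equivalent logic so that it does not have to be hand-coded and maintained.

```python
import csv
import sqlite3

# Hypothetical source extracts A and B, with slightly different column layouts,
# each mapped onto a single target layout (id, name, country).
SOURCES = {
    "customers_crm.csv":     {"id": "cust_id",     "name": "full_name", "country": "country"},
    "customers_billing.csv": {"id": "customer_no", "name": "name",      "country": "country_code"},
}

def extract(path, mapping):
    """Read one CSV source and rename its columns to the target layout."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {target: row[source].strip() for target, source in mapping.items()}

def transform(record):
    """Apply simple, project-specific cleansing rules."""
    record["name"] = record["name"].title()
    record["country"] = record["country"].upper()
    return record

def load(records, db_path="warehouse.db"):
    """Write the cleansed records to target C: one table in a SQLite 'warehouse'."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS customer (id TEXT PRIMARY KEY, name TEXT, country TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO customer (id, name, country) VALUES (:id, :name, :country)",
        records,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    cleansed = (transform(rec)
                for path, mapping in SOURCES.items()
                for rec in extract(path, mapping))
    load(cleansed)
```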

In fact, most, if not all, ETL products on the market today started here. However, in an organization of any size or an IT environment of any complexity, the limitations of this simple, project-oriented approach rapidly become apparent. Multiple sources contain the same or related data. Worse, multiple sources that should contain the same or related data actually contain conflicting data, as the sketch below illustrates. A single target data warehouse has long been a rarity, and today's data lakes only add to the complexity of the information/data environment. The larger vendors, with their focus on larger enterprises, have therefore broadened the scope of ETL to encompass the entire data management spectrum.
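The sketch below, assuming two invented customer extracts keyed on the same identifier, flags every attribute on which the sources disagree. Deciding which value, if either, is correct is exactly where simple source-to-target mapping ends and broader data management begins.

```python
from collections import defaultdict

# Invented extracts from two systems that *should* agree about each customer.
crm = {
    "C001": {"name": "Acme Ltd",  "country": "IE"},
    "C002": {"name": "Beta GmbH", "country": "DE"},
}
billing = {
    "C001": {"name": "ACME Ltd.", "country": "GB"},  # same customer, conflicting values
    "C003": {"name": "Gamma SA",  "country": "FR"},  # present in only one source
}

def find_conflicts(*sources):
    """Collect every value seen for each key/attribute and flag disagreements."""
    seen = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for key, record in source.items():
            for attr, value in record.items():
                seen[key][attr].add(value)
    return {
        key: {attr: values for attr, values in attrs.items() if len(values) > 1}
        for key, attrs in seen.items()
        if any(len(values) > 1 for values in attrs.values())
    }

print(find_conflicts(crm, billing))
# C001 is flagged: the two sources disagree on both name and country.
```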

This trend is clearly seen, for example, in the direction taken by Informatica in recent years. At their recent customer conference, CEO Anil Chakravarthy declared “a clear ambition—to be the number one data management company in the world”. The venerable PowerCenter, a graphical ETL tool (in the narrow sense of ETL) dating back to 1993, has been expanded and supplemented by a full range of function, including real-time integration and a publish/subscribe hub, as well as the Intelligent Data Platform, which combines data integration, master data management, and data quality on a common engine and a shared metadata (business and technical) repository.

Not being a database or data store vendor, Informatica emphasizes its data management vision in the interstices between the various data stores. Database vendors, such as IBM and Oracle, also offer comprehensive sets of ETL/data management function comparable to those listed above, based on ETL tools they have acquired and built out over the past decade and more. Despite their differing starting points, all these major ETL vendors emphasize the overarching necessity of a data quality and management focus in the digitalized world.

That databases are central to comprehensive data and quality management seems undeniable to me. However, this foundation has been destabilized by the (re-)emergence of and emphasis on primitive, exposed, file-based data stores in the world of Hadoop and big data. I use the word primitive both in the sense of their history as the first approach to storing data (often on magnetic tape) prior to the 1970s and in reference to their lack of the metadata and structure needed to capture and store business meaning. By exposed, I acknowledge that databases are, of course, also built on files, but the internal structure of those files is hidden from, and irrelevant to, a data management viewpoint; in these newer stores, the files themselves are directly visible and manipulated.

With exponentially increasing volumes of data being stored in these primitive file stores, the role of ETL as a data and quality management environment becomes critical. In this sense, ETL is a category of software that will be of growing importance to IT organizations that wish to participate fully in the digitalized world.

In parallel, however, these same volumes of data militate against old habits of creating copy after copy of the same or related data with every new business need that arises. The implications of this trend for data preparation tools—ETL, DWA, data wrangling, and more—are the subject of the next part of this series.

Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
