Automating Data Preparation in a Digitalized World, Part 1

Since time immemorial—or, to be more precise, the early 1990s—data warehousing experts and consultants like myself have been extolling the virtues of automating the creation and management of data coming into and within the data warehouse (DW) environment for informational use. One key area we focused on was that of populating the DW environment. Extract, transform and load (ETL) tools emerged in that timeframe. Informatica, for example, now among the last standing of the original independent ETL vendors, was founded in 1993.

Simply put, ETL aimed to displace hand-written code for populating data warehouses (and marts) with automated procedures, both in the initial build and, more importantly, in the ongoing changes needed as business needs evolve. Procedures are designed and edited via graphical drag-and-drop. Metadata describing the steps and actions is stored and reused; this metadata either drives a data processing engine directly or is transformed into code prior to execution.
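
To make this concrete, here is a minimal, purely illustrative sketch in Python of how stored metadata might drive a generic processing engine instead of hand-written, source-specific code. The metadata format, file names, and action names are invented for this example and belong to no actual ETL product:

```python
import csv

# Hypothetical metadata a designer might produce via drag-and-drop:
# a declarative description of source, transforms, and target.
step_metadata = {
    "source": {"file": "orders.csv"},
    "transforms": [
        {"action": "select", "columns": ["order_id", "region", "amount"]},
        {"action": "filter", "column": "region", "equals": "EMEA"},
    ],
    "target": {"file": "dw_orders.csv"},
}

def run_step(meta):
    """Generic engine: interprets the metadata rather than hard-coding logic."""
    with open(meta["source"]["file"], newline="") as f:
        rows = list(csv.DictReader(f))
    for step in meta["transforms"]:
        if step["action"] == "select":
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
        elif step["action"] == "filter":
            rows = [r for r in rows if r[step["column"]] == step["equals"]]
    with open(meta["target"]["file"], "w", newline="") as f:
        if rows:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

run_step(step_metadata)
```

The point of the pattern: when business needs change, only the metadata is edited; the engine is untouched, which is where the promised maintenance savings come from.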

Extract, transform and load was a rather unfortunate choice of name for such tooling, in my opinion. It emphasizes procedures rather than actual functional value and objective. A preferred order of execution is embedded in the name, which may not be the real order needed, leading to newer acronyms such as ELT and ELTL, with the same steps in different orders. The complexity and multiplicity of function on offer are easily overlooked in the word transform: selection, translation, encoding, cleansing, validation, integration, joining, sorting, reconciliation, aggregation, and derivation are but a sample of the actions covered. To add to the complexity, the targets of the load step may be data warehouses or marts. Add the fact that this function, which originally operated strictly in batch mode, now encompasses mini-batches, trickle-feeds, and real-time or near real-time replication. More recently, the targets have been expanded to include data lakes and more.
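
The ordering point is easy to see in miniature. In this sketch, the sample rows and the single cleansing step (standing in for the whole transform catalogue above) are invented; the same functions simply compose in ETL or ELT order:

```python
# Illustrative only: the same steps composed in ETL vs. ELT order.
source_rows = [{"amount": "12.50"}, {"amount": "bad"}, {"amount": "7.00"}]

def extract():
    return list(source_rows)

def transform(rows):
    # Cleansing: drop rows whose amount is not a valid decimal number.
    return [r for r in rows if r["amount"].replace(".", "", 1).isdigit()]

def load(rows, target):
    target.extend(rows)

# ETL: transform in flight, then load only clean data into the warehouse.
warehouse = []
load(transform(extract()), warehouse)

# ELT: land the raw data first, then transform it where it sits.
lake = []
load(extract(), lake)
lake[:] = transform(lake)

print(warehouse)  # [{'amount': '12.50'}, {'amount': '7.00'}]
print(lake)       # same end state, reached in a different order
```

In real ELT deployments, of course, the transformation is typically pushed down into the target platform's own engine, often as generated SQL, rather than run in the pipeline.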

In my 2013 book, Business unIntelligence, I took the view that we needed a more inclusive term—instantiation—to cover the increasing breadth of the field and, in particular, the fact that in many big data use cases the data may be accessed in situ, rather than extracted and loaded. I also wished to avoid phrases that had already been used or even misappropriated by specific vendor marketing efforts. Sadly, the word hasn’t caught the imagination of the BI or big data worlds.

In reality, however, there is more to automation than ETL or instantiation. ETL traditionally focuses explicitly on the more IT-oriented aspects of getting data from A and B to C. Many traditional developers find it relatively easy to write scripts or programs to do the initial loads, integrate data sources and more. However, what they often overlook is the ongoing burden of maintenance and update that will follow. This particular oversight is regularly cited by ETL vendors when they state that their main competitors are the internal script writers and coders within enterprise IT.

Furthermore, to truly automate data warehousing (or a big data environment), one has to look at the full scope of the “data preparation” process. This process starts with understanding specific and enterprise business needs, moves through finding and understanding data sources, deciding what the data must look like in the target to meet those needs, and finally, knowing how to keep all this going in rapid iteration as business needs and technologies change rapidly and often unexpectedly. These thoughts are encapsulated in the term Data Warehouse Automation (DWA), which has been around for some ten years now and is finally beginning to gain traction in the market. Like ETL, however, it is a somewhat unfortunate name, particularly because its application is, and indeed must be, far wider than the data warehouse, and its sources are far more varied and poorly defined.

The scope of this full process, combined with the increasing range of sources and targets, further implies that vendors offering such functionality usually focus on a particular subset of the scope. In his November 2015 report, "Which Data Warehouse Automation Tool is Right for You?", Wayne Eckerson defines two broad approaches to DWA: model-driven, which starts from a logical model of the business needs, and data-driven, which emphasizes the importance of first understanding the data available in the enterprise. At the heart of both approaches is the idea that business users must be involved from the beginning of the process right through to the end. Of course, ETL tool vendors are also expanding the scope of their offerings to involve business users more deeply and to address more of the process of building and maintaining data warehouses, as well as the new world of data lakes.

But, there’s more. As I pointed out in my series of articles last year on data wrangling (Part 1, Part 2, Part 3), users of big data solutions are increasingly concerned about the amount of work involved in preparing incoming data for analysis. If they had talked to anybody in BI, they would have known what to expect! However, they didn’t, and vendors are now developing a slew of data wrangling tools to ease their pain. These tools amount to self-service data preparation, as befits an environment where data scientists play and experiment with data of largely unknown characteristics. Of course, we can imagine that the next requirement to be rediscovered will be to automate the process for regular use, followed closely by the need for easier maintenance and change processes.
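
As a flavour of the first chore such tools automate, here is a toy profiling pass of the kind a data scientist might run against data of unknown characteristics before any real analysis; the sample rows are invented for illustration:

```python
from collections import Counter

# Invented sample rows standing in for a feed of unknown quality.
rows = [
    {"id": "1", "signup": "2015-03-02", "country": "DE"},
    {"id": "2", "signup": "", "country": "de"},
    {"id": "3", "signup": "03/04/2015", "country": "DE"},
]

# Per-column profile: missing values, distinct counts, most common value.
for column in rows[0]:
    values = [r[column] for r in rows]
    print(
        f"{column}: missing={sum(1 for v in values if not v)},",
        f"distinct={len(set(values))},",
        f"top={Counter(v for v in values if v).most_common(1)}",
    )
```

Even this trivial profile surfaces the inconsistent date formats and country codes that wrangling tools, and ETL before them, exist to resolve.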

The data preparation landscape as we enter an increasingly digitalized world is thus characterized by three converging and complementary approaches. ETL brings a depth of knowledge of the complexity of processing needed to integrate data from very disparate sources. Data wrangling is developing insights into how self-service can be applied by non-IT people in data preparation. DWA offers a soup-to-nuts process covering the full scope. In Part 2 of this series, I’ll take a deeper look at DWA functionality and how these tools are developing to play a central role in preparing data for informational use.


Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
