The Continuing Evolution of Data Management

Three Generations of Data Management

I first encountered data warehousing in the early 1990s. Having worked for more than 20 years in a world of disparate databases and point-to-point application interfaces, I was impressed by what seemed to be the first true architectural approach to managing data. Looking back across nearly three decades, it is clear that data warehousing was only the beginning of data management, not the end state that we naively believed it to be at the time. Today we are in the early stages of what I see as the 3rd generation of data management. Let’s look at each stage of data management evolution to understand where we’ve come from, where we are today, and what is likely in the near future.

1st Generation Data Management — Data Warehousing

In 1988 Barry Devlin and Paul Murphy published an IBM Systems Journal article describing an architecture for data management in which the core component was something that they called the Business Data Warehouse. In 1991 IBM announced the Information Warehouse as an offering. Bill Inmon’s book Building the Data Warehouse popularized data warehousing starting in 1992. Through the 1990s and the early part of this century, data warehousing was widely adopted by companies throughout the world. Data warehousing was viewed as the means to tame data chaos through integration and subject orientation. Much of the hype around data warehousing focused on creating a single version of the truth.

Figure 1 illustrates a typical data warehousing architecture. The data sources were almost exclusively enterprise data generated internally—data from operational systems such as ERP, SaaS applications, and legacy systems. The source data consisted entirely of structured data, most commonly in relational databases but sometimes in file systems, hierarchical databases, and network databases. ETL processes moved data from sources to the data warehouse, with transformations to integrate, cleanse, and aggregate the data. The warehouse itself was deployed with RDBMS technology, and use cases were largely limited to BI and reporting.

Figure 1: Data Warehousing – 1st Generation Data Management
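
To make the ETL pattern concrete, here is a minimal sketch in Python. The database files, table names, and columns are hypothetical; the point is simply the extract, transform (cleanse and aggregate), and load steps that fed a 1st generation warehouse.

```python
# Minimal, hypothetical ETL sketch: extract orders from an operational
# database, cleanse and aggregate them, then load a warehouse fact table.
import sqlite3

source = sqlite3.connect("operational.db")   # assumed operational source system
warehouse = sqlite3.connect("warehouse.db")  # assumed warehouse database

# Extract: pull raw order rows from the operational source.
rows = source.execute(
    "SELECT customer_id, order_date, amount FROM orders"
).fetchall()

# Transform: cleanse (drop rows with missing amounts) and aggregate
# to daily sales per customer, the grain of the warehouse fact table.
daily_totals = {}
for customer_id, order_date, amount in rows:
    if amount is None:
        continue  # cleansing step: discard incomplete records
    key = (customer_id, order_date)
    daily_totals[key] = daily_totals.get(key, 0.0) + float(amount)

# Load: write the integrated, aggregated rows into the warehouse.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_daily_sales "
    "(customer_id INTEGER, order_date TEXT, total_amount REAL)"
)
warehouse.executemany(
    "INSERT INTO fact_daily_sales VALUES (?, ?, ?)",
    [(c, d, t) for (c, d), t in daily_totals.items()],
)
warehouse.commit()
```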

Data warehousing made an important first step in formalizing data architecture and data management practices. Resolving inconsistencies among redundant data sources, providing an understandable source of data for business user access, reducing the complexity of tangled and fragile point-to-point application interfaces—these are all valuable achievements that illustrate the success of data warehousing.

Figure 2: Proliferation of Multiple Data Warehouses

Data warehousing, however, was not entirely successful. Very few data warehouses achieved the single-version-of-truth (SVOT) goal. In fact, in all of my research, consulting, and teaching I have never encountered a data warehousing organization that claims to have fully achieved SVOT. Data warehouses themselves became pockets of data disparity as mergers, acquisitions, and competing internal projects resulted in multiple data warehouses. Recent polls that I have conducted show fewer than 10% of companies with only one data warehouse. (See Figure 2.)

Furthermore, while data warehousing was solving data management challenges of the 1990s, big changes in data management brought new challenges for 21st century data management.

2nd Generation Data Management - Data Lakes

Shortly after the turn of the century we experienced big shifts in data management—something that I often refer to as The Data Quake. Just as an earthquake shakes the foundations of things built on the earth’s surface, the data quake has shaken the foundations of data management practices built on 1990s architectures. The once stable world of data management rapidly became dynamic and volatile. The new century arrived with many advances in data management and demands for a new approach: big data technologies, cloud computing, data lakes, and optimizing for analytics and self-service became data management priorities.

Figure 3: Big Data and Data Lakes – 2nd Generation Data Management

Figure 3 illustrates a typical reference architecture for 2nd generation data management. The data lake is a core architectural component that removes the limitations of RDBMS as the dominant database technology. NoSQL databases provide capabilities to store and manage unstructured and differently structured data. The requirement to impose schema before storing data in a database disappears. Integrating all data into a single cohesive schema is no longer required, and in fact no longer desirable. Multiple levels of data refinement, ranging from raw data for data scientists to integrated and aggregated data for basic reporting, have become the new best practice. Cloud deployment provides the scalability and elasticity that are needed when working with massive amounts of data. Hadoop and similar technologies provide the means to process big data. New data management roles emerged—data scientist and data engineer—along with new use cases in the areas of advanced analytics and data science. More data, more kinds of data, more use cases, and more users … these are the driving forces for a next generation of data management architecture.
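
To illustrate the schema-on-read idea and the notion of refinement zones, here is a minimal sketch. The lake paths, file layout (JSON lines), and field names are hypothetical; raw events are stored without any imposed schema, and structure is applied only when a refined dataset is produced.

```python
# Hypothetical schema-on-read sketch: raw events sit as-is in the lake's
# raw zone; structure is imposed only when a refined dataset is built.
import json
from pathlib import Path

raw_zone = Path("lake/raw/clickstream")        # assumed raw landing zone
refined_zone = Path("lake/refined/clickstream")  # assumed refined zone
refined_zone.mkdir(parents=True, exist_ok=True)

refined_rows = []
for raw_file in raw_zone.glob("*.json"):
    for line in raw_file.read_text().splitlines():
        event = json.loads(line)  # no schema was required to store this event
        # Schema applied at read time: keep only the fields this use case
        # needs, tolerating events that are missing them.
        refined_rows.append({
            "user_id": event.get("user_id"),
            "page": event.get("page"),
            "timestamp": event.get("ts"),
        })

# Write the refined, consistently structured records for analysts to query.
(refined_zone / "events.json").write_text(
    "\n".join(json.dumps(row) for row in refined_rows)
)
```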

But the data warehouse (or data warehouses given that most companies have multiple) did not disappear. Despite declarations by many that the data warehouse is dead, data warehouses persist. The data warehouse is not dead. Big data can extend and enrich a data warehouse, but cannot replace it. The data warehouse integrates critical and valuable enterprise data—data that is not found in big data sources and that continues to be the primary data resource for descriptive, prescriptive, and decision analytics. It serves as corporate memory, collecting the body of history that makes time-series and trend analysis possible. Equally important, the data warehouse organizes and structures data to make it understandable and useful for consumption by many different business stakeholders. In many implementations data warehouses become additional sources of data ingested into the data lake, though today many data architects struggle to determine how best to position data lakes and data warehouses as complementary.

Are data lakes a success? That’s a difficult question to answer. There are many success stories, but also many reports of costly failures. I think the reality is that the data lake concept is successful because it responds to the challenges of big data, unstructured data, scalability, and elasticity. The failures are not failures of concept. They are failures of execution—failure to plan, organize, and govern—that turn data lakes into data swamps.

Data lakes make positive contributions to the business of managing data. In a world of digital economy and digital transformation of business we can’t ignore data simply because it doesn’t fit into first generation architecture. So much of the data that is the source of meaningful insights comes from big data sources—web activity, mobile activity, social media, and machine and sensor data that depends on stream processing. Data lakes provide the architecture and principles to manage that data and make it available for analysis.

This 2nd generation of data management is not trouble free. Volatile and rapidly changing open source technologies challenge even the best of change managers. Cloud technologies raise concerns about lock-in. Processing of data has become much more complex with a shift from batch ETL to multiple, multi-directional, and often real-time data pipelines. Data lineage can be elusive to trace and data governance is difficult. Data access is frequently painful because finding data in the lake is not easy, especially when metadata is deficient. Overlaps and inconsistencies between data lakes and data warehouses suggest that the data lake is yet another data silo.

Data architecture for big data and data lakes represents the current state for most organizations today—what they are implementing or what they aspire to build. As we build toward this next level of data management maturity, it is important to look to the future. Innovative technology vendors are already moving toward the next generation of data management tools to address the challenges of 2nd generation data management.

3rd Generation Data Management — Catalogs, Hubs, and Fabrics

The next evolution of data management tackles the difficulties of lake-centric architectures. Data catalogs address the challenges of finding and understanding data and represent a big step forward for metadata management. Data catalogs are the most mature of the generation 3 technologies, but widespread adoption is just beginning. Data hubs mitigate the problems of data lakes and data warehouses operating as multiple, disconnected silos. Data fabric adds smart tools for complex data management and satisfies many of the requirements to be successful with DataOps.

Figure 4 illustrates how data catalogs, data hubs, and data fabric form the core of 3rd generation data management.

Figure 4: Catalogs, Hubs, and Fabric – 3rd Generation Data Management

Data catalogs were initially driven by self-service data analysis tools, which gave non-technical data analysts and business analysts friendly, code-free tools to report, analyze, and visualize data. Self-service data preparation tools provided capabilities to improve, enrich, format, and blend data. Yet most data analysts are working blind, without visibility into the datasets that exist, the contents of those datasets, and the quality and usefulness of each. Data catalogs remove the blinders, making datasets easily searchable and well described so analysts can quickly and easily find the data that they need and evaluate its usefulness. Catalogs have become the go-to technology for data curation, data management collaboration, and metadata management.
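
As a toy illustration of what a catalog adds, consider the sketch below. The catalog entries and fields are hypothetical, and real products layer lineage, ratings, ownership, and collaboration on top; the essential idea is searchable, well-described metadata about datasets.

```python
# Toy data catalog: searchable metadata about datasets. Entries, fields,
# and scores are hypothetical and exist only to illustrate the concept.
catalog = [
    {"name": "fact_daily_sales", "description": "Daily sales per customer",
     "owner": "finance", "quality_score": 0.9, "tags": ["sales", "warehouse"]},
    {"name": "clickstream_raw", "description": "Raw web activity events",
     "owner": "marketing", "quality_score": 0.4, "tags": ["web", "raw"]},
]

def search(term: str):
    """Return catalog entries whose name, description, or tags match the term."""
    term = term.lower()
    return [e for e in catalog
            if term in e["name"].lower()
            or term in e["description"].lower()
            or any(term in t for t in e["tags"])]

# An analyst looking for sales data finds it, along with quality context.
for entry in search("sales"):
    print(f"{entry['name']}: {entry['description']} (quality {entry['quality_score']})")
```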

A data hub is more than just another variation on data consolidation. A robust data hub includes features for data storage, harmonization, indexing, processing, governance, metadata, search, and exploration. Data from many sources—both operational and analytic—is acquired through replication and/or publish-and-subscribe interfaces and is stored in the hub. Note that the data lake and data warehouses have not disappeared. They become sources of data for ingestion into the hub. Batch processing, stream processing, and AI/ML processing are central to the hub concept. AI/ML is a distinguishing feature that makes it practical to move analytic processing to the data location instead of moving massive data volumes across a network for processing. Features such as indexing, search, data exploration, data protection, harmonization, and metadata management round out the essential capabilities of a robust data hub.
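
The following sketch illustrates the publish-and-subscribe ingestion idea at the heart of a hub, with rudimentary storage and indexing. The class, topic, and record names are hypothetical and the implementation is deliberately simplistic; a real hub adds harmonization, governance, protection, and AI/ML processing.

```python
# Minimal hub sketch: sources publish records to topics, the hub stores and
# indexes them, and downstream consumers subscribe. Illustrative only.
from collections import defaultdict

class DataHub:
    def __init__(self):
        self.store = defaultdict(list)        # topic -> stored records
        self.index = defaultdict(set)         # keyword -> topics (simple search index)
        self.subscribers = defaultdict(list)  # topic -> consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, record):
        self.store[topic].append(record)          # persist the record in the hub
        for value in record.values():             # index values for search and exploration
            self.index[str(value).lower()].add(topic)
        for callback in self.subscribers[topic]:  # push to subscribed consumers
            callback(record)

hub = DataHub()
hub.subscribe("orders", lambda r: print("analytics consumer received:", r))
hub.publish("orders", {"customer": "Acme", "amount": 120.0})
print(hub.index["acme"])  # {'orders'}: the record is now discoverable by search
```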

Data fabric is a combination of architecture and technology designed to ease the complexities of managing many different kinds of data, stored in multiple database management systems and deployed across a variety of platforms. A typical data management organization today has data deployed in on-premises data centers and multiple cloud environments. They have data in flat files, tagged files, relational databases, document stores, graph databases, and more. Processing spans technologies from batch ETL to changed data capture, stream processing, and complex event processing. The variety of tools, technologies, platforms, and data types makes it difficult to manage processing, access, security, and integration across multiple platforms. Data fabric provides a consolidated data management platform. It is a single platform to manage disparate data and divergent technologies deployed across multiple data centers, both cloud and on-premises. Data orchestration is the primary job of a data fabric—a job that requires interoperability with data storage, ingestion, transport, preparation, and pipelines. Features for security, metadata management, governance, and protection are essential. Data access is provided through support for queries, APIs, and data services.
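
Here is a hypothetical sketch of the unified-access idea behind a data fabric: a single interface routes requests to whichever underlying platform holds the data, so consumers need not know where or how each dataset is deployed. The class, dataset names, and connectors are illustrative only.

```python
# Hypothetical data-fabric access layer: one interface, many back ends.
class DataFabric:
    def __init__(self):
        self.connectors = {}  # dataset name -> callable that fetches its data

    def register(self, dataset, connector):
        self.connectors[dataset] = connector

    def read(self, dataset, **params):
        # A real fabric would also enforce security, governance, and metadata
        # policies here before dispatching to the underlying platform.
        return self.connectors[dataset](**params)

fabric = DataFabric()
# Each connector stands in for a different platform, e.g. an on-premises
# RDBMS and a cloud document store; here they just return sample records.
fabric.register("sales", lambda **p: [{"region": "EU", "amount": 100.0}])
fabric.register("events", lambda **p: [{"user": "u1", "page": "/home"}])

print(fabric.read("sales"))   # consumer neither knows nor cares where this lives
print(fabric.read("events"))
```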

The 3rd generation of data management brings new capabilities and puts the power of AI and machine learning to work for data managers. It resolves many of the challenges of managing in a data lake world. Is it perfect? I truly doubt that it is, but we’ll see advances as catalogs, hubs, and fabrics continue to mature. Is it the end state? I think that’s unlikely too. There is more to come in the evolution of data management.

Beyond 3rd Generation — What’s Next

Figure 5 illustrates the 3 generations of data management architectures and practices described above. It is clearly an evolutionary process where each advance in capabilities brings new challenges and the need for more innovation. I think of this as past, present, and near-future; 1st generation is past, 2nd generation is present, and 3rd generation is the near future. What do you think we’ll see beyond the near future?

Figure 5: The Evolution of Data Management

I expect that the 4th generation will deliver self-driving data capabilities as a logical extension of data fabric. Self-driving cars are a near certainty that will reshape the world of personal transportation. An autonomous vehicle knows its location and destination; it can navigate, avoid collisions, and send alerts when maintenance or other help is needed. The self-driving car provides a great metaphor for reshaping the world of data transport and data management. Like a self-driving car, self-driving data knows its location and destination, can navigate data pipelines, and can deliver data wherever it’s needed. From the point of ingestion, data moves autonomously through the ecosystem: from landing zone through data lakes, data warehouses, analytics applications, dashboards, scorecards, and reports, to wherever it may be needed. Throughout the route, the right integration, transformation, and tagging are automatically applied.

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...
