Data Hubs – What’s Next in Data Architecture?

Throughout the 1990s and the first decade of this century, much data management effort was directed at eliminating data silos with architectures and technologies for data warehousing and master data management. In the past several years, we’ve created new data silos with the adoption of data lakes, NoSQL, and big data technologies.

It seems that the problem of data silos keeps coming back, and this time around we need to approach the problem differently. When data warehousing was at the core of data architecture, we tackled integration by building a single data model in which to fit all data intended for sharing across the enterprise. Then we moved data from sources to that data model through ETL processing. The resulting data warehouses were characterized by high-latency data with limited adaptability and agility. As the data world continued to evolve with trends such as big data, NoSQL, self-service analytics, and data science, the limits of legacy warehouses were amplified.

Data lake architecture emerged in response to the limits of data warehouse architecture, with new capabilities for handling unstructured and differently structured data and substantial gains in scalability and elasticity. The capabilities to process and store exceptionally large volumes of data and to ingest data at high speed proved valuable for advanced analytics and data science efforts. But with new capabilities, data lakes also brought new challenges, including governance, security, and the risk of becoming a data swamp. In many instances, the data lake became yet another data silo, existing side by side with, but disconnected from, data warehouses, operational data stores, and master data repositories.

To tackle the problem once again, we need to do it differently. MarkLogic, Cloudera, SAP, Informatica, Pure Storage, and other vendors are showing their early versions of data hub technology with variations on the solution. Every data hub solution, I believe, regardless of technology, must step up to data management agility and real-time data. Data hubs must support data of all types (relational, NoSQL, geospatial, and so on). They must focus on data harmonization without proliferating unnecessary and redundant copies of data. They must (along with data catalogs) resolve the challenges of finding the right data quickly. And they must support a broad span of use cases ranging from simple reporting to data science, artificial intelligence, and machine learning.

A data hub is more than just another variation on data consolidation. A robust data hub (see Figure 1) includes features for data storage, harmonization, indexing, processing, governance, metadata, search, and exploration. Data from many sources, both operational and analytic, is acquired through replication, publish-and-subscribe interfaces, or both. Replication uses change data capture (CDC) to continuously populate the hub in real time or near real time as changes to data sources occur. Publish-and-subscribe allows the hub to subscribe to messages that data sources publish as their data changes.
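The two acquisition paths described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not any vendor's implementation: the names ChangeEvent, MessageBus, and DataHub, the "upsert" and "delete" operation labels, and the sample records are all assumptions made for the example. A real hub would read changes from a database log or a message broker rather than from Python lists.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change from a source system (hypothetical shape)."""
    source: str                  # name of the originating system
    key: str                     # primary key of the changed record
    op: str                      # "upsert" or "delete"
    data: Optional[dict] = None  # new record contents for an upsert


class MessageBus:
    """Toy in-process stand-in for a publish-and-subscribe broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[ChangeEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: ChangeEvent) -> None:
        # Deliver the event to every handler subscribed to this topic.
        for handler in self._subscribers[topic]:
            handler(event)


class DataHub:
    """Keeps the latest state of each record, keyed by (source, key)."""

    def __init__(self) -> None:
        self.records: Dict[Tuple[str, str], dict] = {}

    def apply(self, event: ChangeEvent) -> None:
        record_id = (event.source, event.key)
        if event.op == "delete":
            self.records.pop(record_id, None)
        elif event.op == "upsert":
            self.records[record_id] = dict(event.data or {})
        else:
            raise ValueError(f"unknown operation: {event.op}")

    def replicate(self, change_log: Iterable[ChangeEvent]) -> None:
        # CDC-style replication: replay a source's changes in order.
        for event in change_log:
            self.apply(event)


hub = DataHub()

# Path 1: replication from a source's ordered change log.
crm_log = [
    ChangeEvent("crm", "42", "upsert", {"name": "Ada"}),
    ChangeEvent("crm", "42", "upsert", {"name": "Ada L."}),
    ChangeEvent("crm", "7", "upsert", {"name": "Bob"}),
    ChangeEvent("crm", "7", "delete"),
]
hub.replicate(crm_log)

# Path 2: the hub subscribes to messages that a source publishes.
bus = MessageBus()
bus.subscribe("orders", hub.apply)
bus.publish("orders", ChangeEvent("orders", "A1", "upsert", {"total": 99.5}))
```

In this sketch both paths funnel into the same apply method, so the hub holds one current view of each record regardless of how the change arrived. That choice is only one way to structure it; time-variant history, which the article also expects of a hub, would require storing each change rather than overwriting.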

Figure 1 – Data Hub Reference Architecture

Data hubs are emerging as the next generation of data architecture, a third generation that evolved naturally from its data warehouse and data lake predecessors. To find their place in modern data management architecture, data hubs must distinguish themselves from data warehousing, data virtualization, and data lakes, with the goal of complementing and enriching those technologies rather than replacing them. The table below shows my effort to identify the distinguishing characteristics.


| Capability                                       | Data Warehouse | Virtualized | Data Lake | Data Hub |
|--------------------------------------------------|----------------|-------------|-----------|----------|
| Move and copy data                               | yes            | no          | yes       | limited  |
| Harmonize data                                   | yes            | yes         | limited   | limited  |
| Index data                                       | yes            | limited     | no        | yes      |
| Isolate source systems from queries              | yes            | no          | yes       | yes      |
| Collect time-variant history                     | yes            | no          | limited   | yes      |
| Minimize data latency                            | limited        | yes         | yes       | yes      |
| Work with all types of data and database systems | no             | limited     | yes       | yes      |
| Optimized for BI and reporting                   | yes            | yes         | yes       | yes      |
| Optimized for analytics                          | no             | limited     | yes       | yes      |
| Optimized for AI and ML                          | no             | no          | no        | yes      |
| Bring applications to the data                   | no             | no          | no        | yes      |


Collectively, the leading vendors in the data hub evolution describe several purposes for a data hub. Among the most common are one-stop shopping for data, data-centric storage architecture, powering data science and AI, and the ability to execute applications where the data resides. Not every vendor focuses on all of these purposes, but I think ultimately every successful data hub technology will support all of them.

Figure 2 – Modern Data Management Technologies Working Together

  

Data hubs are needed to continue the evolution of modern data management architecture. Hubs don’t fix everything, but they are a necessary response to the challenge of new data silos and the increasing demands of data science use cases. Hubs have a role, together with other recently emerged technologies, in shaping the future of data architecture. Expect the next generation of data management architecture to combine data fabric, data hub, and data services technologies to step up to today’s data management challenges. (See Figure 2.)

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...
