Feb 07, 2019 / by Dave Wells Data Management

Data Hubs – What’s Next in Data Architecture?

Throughout the 1990s and the first decade of this century, much data management effort was directed at eliminating data silos through architectures and technologies for data warehousing and master data management. In the past several years, we’ve created new data silos by adopting data lakes, NoSQL, and big data technologies.

It seems that the problem of data silos keeps coming back, and this time around we need to approach it differently. When data warehousing was at the core of data architecture, we tackled integration by building a single data model in which to fit all data intended for sharing across the enterprise. Then we moved data from sources to that data model through ETL processing. The resulting data warehouses were characterized by high-latency data with limited adaptability and agility. As the data world continued to evolve with trends such as big data, NoSQL, self-service analytics, and data science, the limits of legacy warehouses were amplified.

Data lake architecture emerged in response to the limits of data warehouse architecture, with new capabilities for handling unstructured and differently structured data and substantial gains in scalability and elasticity. The ability to process and store exceptionally large volumes of data and to ingest data at high speed proved valuable for advanced analytics and data science efforts. But along with new capabilities, data lakes brought new challenges, including governance, security, and the risk of becoming a data swamp. In many instances, the data lake became yet another data silo, existing side by side with data warehouses, operational data stores, and master data repositories but disconnected from them.

Tackling the problem once again, we need to do it differently. MarkLogic, Cloudera, SAP, Informatica, Pure Storage, and other vendors are showing their early versions of data hub technology with variations on the solution. I believe that every data hub solution, regardless of technology, must step up to data management agility and real-time data. It must support data of all types (relational, NoSQL, geospatial, and so on). It must focus on data harmonization without proliferating unnecessary and redundant copies of data. Along with data catalogs, it must resolve the challenges of finding the right data quickly. And it must support a broad span of use cases ranging from simple reporting to data science, artificial intelligence, and machine learning.

A data hub is more than just another variation on data consolidation. A robust data hub (see Figure 1) includes features for data storage, harmonization, indexing, processing, governance, metadata, search, and exploration. Data from many sources, both operational and analytic, is acquired through replication and/or publish-and-subscribe interfaces. Replication uses change data capture (CDC) to continuously populate the hub in real time or near real time as changes to data sources occur. Publish-and-subscribe allows the hub to subscribe to messages that data sources publish as their data changes.
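To make the acquisition path concrete, here is a minimal in-memory sketch of a hub that subscribes to change events published by a source. Everything in it is an assumption made for illustration: the `ChangeEvent` shape, the `MessageBus` and `DataHub` classes, and the `crm.customers` topic are hypothetical names, not any vendor's API. For brevity, this hub keeps only the latest image of each record, whereas a real hub would also retain history.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ChangeEvent:
    """One captured change from a source system (hypothetical shape)."""
    source: str                  # name of the originating system
    key: str                     # primary key of the changed record
    op: str                      # "insert", "update", or "delete"
    data: Optional[dict] = None  # new record image; None for deletes


class MessageBus:
    """Toy in-memory publish-and-subscribe broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[ChangeEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: ChangeEvent) -> None:
        # Deliver the event to every handler subscribed to the topic.
        for handler in self._subscribers[topic]:
            handler(event)


class DataHub:
    """Keeps the latest image of each record, keyed by (source, key)."""

    def __init__(self) -> None:
        self.records: Dict[Tuple[str, str], dict] = {}

    def apply(self, event: ChangeEvent) -> None:
        record_id = (event.source, event.key)
        if event.op == "delete":
            self.records.pop(record_id, None)
        else:
            # Inserts and updates both overwrite with the new image.
            self.records[record_id] = dict(event.data or {})


bus = MessageBus()
hub = DataHub()
bus.subscribe("crm.customers", hub.apply)

# The source publishes each change as it occurs; the hub applies it immediately.
bus.publish("crm.customers", ChangeEvent("crm", "42", "insert", {"name": "Ada"}))
bus.publish("crm.customers", ChangeEvent("crm", "42", "update", {"name": "Ada L."}))
bus.publish("crm.customers", ChangeEvent("crm", "7", "insert", {"name": "Bob"}))
bus.publish("crm.customers", ChangeEvent("crm", "7", "delete"))

print(hub.records)  # {('crm', '42'): {'name': 'Ada L.'}}
```

In a CDC-based replication setup the events would typically be derived from the source database's change log rather than published by the application, but the hub-side logic of applying inserts, updates, and deletes in order would look much the same.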

Figure 1 – Data Hub Reference Architecture

Data hubs are emerging as the next generation of data architecture: a third generation that evolved naturally from its data warehouse and data lake predecessors. To find their place in modern data management architecture, data hubs must distinguish themselves from data warehousing, data virtualization, and data lakes, with the goal of complementing and enriching those technologies rather than replacing them. The table below shows my effort to identify the distinguishing characteristics.

Characteristic	Data Warehouse	Virtualized	Data Lake	Data Hub
Move and copy data	yes	no	yes	limited
Harmonize data	yes	yes	limited	limited
Index data	yes	limited	no	yes
Isolate source systems from queries	yes	no	yes	yes
Collect time-variant history	yes	no	limited	yes
Minimize data latency	limited	yes	yes	yes
Work with all types of data and database systems	no	limited	yes	yes
Optimized for BI and reporting	yes	yes	yes	yes
Optimized for analytics	no	limited	yes	yes
Optimized for AI and ML	no	no	no	yes
Bring applications to the data	no	no	no	yes
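
One way to use the table is as a capability matrix that can be filtered by requirement. The sketch below simply transcribes the table above into a Python dictionary. The function name `fully_supporting` and the choice to count only a plain "yes" (not "limited") as support are my assumptions for illustration.

```python
# Column order matches the table above.
ARCHITECTURES = ("Data Warehouse", "Virtualized", "Data Lake", "Data Hub")

# Each row of the table, transcribed as-is.
MATRIX = {
    "Move and copy data": ("yes", "no", "yes", "limited"),
    "Harmonize data": ("yes", "yes", "limited", "limited"),
    "Index data": ("yes", "limited", "no", "yes"),
    "Isolate source systems from queries": ("yes", "no", "yes", "yes"),
    "Collect time-variant history": ("yes", "no", "limited", "yes"),
    "Minimize data latency": ("limited", "yes", "yes", "yes"),
    "Work with all types of data and database systems": ("no", "limited", "yes", "yes"),
    "Optimized for BI and reporting": ("yes", "yes", "yes", "yes"),
    "Optimized for analytics": ("no", "limited", "yes", "yes"),
    "Optimized for AI and ML": ("no", "no", "no", "yes"),
    "Bring applications to the data": ("no", "no", "no", "yes"),
}


def fully_supporting(required):
    """Return the architectures rated "yes" for every required characteristic."""
    return [
        name
        for idx, name in enumerate(ARCHITECTURES)
        if all(MATRIX[characteristic][idx] == "yes" for characteristic in required)
    ]


print(fully_supporting(["Index data", "Minimize data latency"]))
# ['Data Hub']
print(fully_supporting(["Isolate source systems from queries", "Collect time-variant history"]))
# ['Data Warehouse', 'Data Hub']
```

Note that no single column is rated "yes" on every row, which is consistent with the point that these technologies complement one another.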

Collectively, the leading vendors in the data hub evolution describe several purposes for a data hub. The most common are one-stop shopping for data, data-centric storage architecture, powering data science and AI, and the ability to execute applications where the data resides. Not every vendor focuses on all of these purposes, but I think that ultimately every successful data hub technology will support all of them.

Figure 2 – Modern Data Management Technologies Working Together

Data hubs are needed to continue the evolution of modern data management architecture. Hubs don’t fix everything, but they are a necessary response to the challenge of new data silos and the increasing demands of data science use cases. Hubs have a role, together with other recently emerged technologies, in shaping the future of data architecture. Expect the next generation of data management architecture to combine data fabric, data hub, and data services technologies to step up to today’s data management challenges. (See Figure 2.)


