Jun 10, 2021 / by Dave Wells / Data Management

Data Architecture: Complex vs. Complicated


The need for adaptable data management architecture has never been more pressing, yet getting there seems more confusing than ever. The field is rampant with buzzwords: data lake, data lakehouse, data fabric, data mesh, data hub, data as a network. Software vendors quite naturally adopt the buzzwords in their marketing, shifting from architecture to marketecture. Unfortunately, each vendor may define the buzzwords differently, which adds to the confusion. Figure 1 shows a few quotes pulled quickly from web searches. They illustrate the variety of buzzwords and the different perspectives from which the terms are viewed. Interestingly, collecting the quotes also showed that buzzword-based marketing works. It took me less than 30 minutes to pull these quotes together. Within 30 minutes of finishing, I received three marketing emails offering opportunities to learn more about various data fabric technologies.

Figure 1. A Sampling of Data Architecture Quotes


“Data and analytics leaders must upgrade to a data fabric design that enables dynamic and augmented data integration in support of their data management strategy.” (source: k2view.com)
“Because it is based on a graph data model, the data fabric is able to absorb, integrate, and maintain the freshness of vast quantities of data in any number of formats.” (source: info.cambridgesemantics.com)
“In simplest terms, a data fabric is a single environment consisting of a unified architecture, and services or technologies running on that architecture.” (source: talend.com)
“The data hub, provides the answer that architects are looking for.” (source: marklogic.com)
“A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.” (source: databricks.com)
“The data mesh is a new approach to designing and developing data architectures. Unlike a centralized and monolithic architecture based on a data warehouse or data lake, a data mesh is a highly decentralized data architecture.” (source: dlt.com)
“Integration is obsolete. By managing data as a network, Cinchy replaces the insecure and inefficient 40-year-old approach of integrating applications using ETLs and APIs.” (source: ai4.io)
“The idea behind a data lake is to have all your organizational data in one place while also enabling ubiquitous access to the data with a wide variety of tools, and by a broad range of services.” (source: upsolver.com)

Each of these terms (data fabric, data mesh, data lake, data lakehouse, data hub, data as a network) brings important concepts to the field of data architecture. But collectively they don’t create a clear path to adaptable data architecture. In many ways they make the path less clear. Do I need a data lake or a data lakehouse? Data mesh or data fabric? An integrated data hub or data as a network? What will work for today’s needs? What will we need in the future? What will we do when the next buzzword comes along?

Complex vs. Complicated

Complexity is an unavoidable reality of data management. It is complex by nature, but it doesn’t need to be complicated. Let’s be clear about the distinction. Complexity is an inherent characteristic of something. Complication is a result of some actions taken or not taken. Remember that complicate is a verb. Things are complicated because we complicate them. We can design architecture to handle data management complexities without making architecture complicated.

This approach to architectural design, less complicated than what we typically do, takes a different view of fabric, mesh, hub, etc. in three ways. First, they are concepts, not things. Data hub as an architectural concept is different from data hub as a database. Second, they are components, not alternatives. It is practical for architecture to include both data fabric and data mesh. They are not mutually exclusive. Finally, they are architectural frameworks, not architectures. You don’t have architecture until the frameworks are adapted and customized to your needs, your data, your processes, and your terminology.

My goal with this article is to illustrate the first two points—concepts and components—and how they help to manage complexity without adding complications. First, I’ll introduce a few architectural frameworks, then look at one example of how they might work together as a custom architecture.

Data Lake / Data Hub Framework

Data lakes and data hubs are data management approaches used to collect, refine, store, and share data. The strongest characteristic that distinguishes a lake from a hub is the degree of governance and control that is imposed upon the data. Data hubs are rigorously controlled to assure quality and trustworthiness of data. Data lakes are typically governed to manage security and compliance but are less concerned about quality control of data.

Figure 2. Data Lake / Data Hub Architectural Framework

Data hubs and data lakes use similar architectural patterns of collecting data from sources, refining the data, profiling and collecting metadata, and sharing data to be prepared and consumed for reporting and analysis.
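The shared pattern, and the governance difference between lake and hub, can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the `Repository` class, its method names, and the `has_customer_id` rule are all hypothetical names I have invented for the example, and it models only quality control, not security or compliance.

```python
from dataclasses import dataclass, field
from typing import Callable

Record = dict[str, object]


@dataclass
class Repository:
    """A shared data store. `quality_rules` is empty for a lake, populated for a hub."""
    name: str
    quality_rules: list[Callable[[Record], bool]] = field(default_factory=list)
    records: list[Record] = field(default_factory=list)
    rejected: list[Record] = field(default_factory=list)
    metadata: dict[str, object] = field(default_factory=dict)

    def collect(self, source: list[Record]) -> None:
        """Collect from a source, keeping only records that pass every quality rule."""
        for rec in source:
            if all(rule(rec) for rule in self.quality_rules):
                self.records.append(rec)
            else:
                self.rejected.append(rec)

    def refine(self) -> None:
        """Hypothetical refinement step: trim whitespace from string values."""
        self.records = [
            {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
            for rec in self.records
        ]

    def profile(self) -> None:
        """Collect simple metadata: row count and the field names observed."""
        fields = sorted({k for rec in self.records for k in rec})
        self.metadata = {"row_count": len(self.records), "fields": fields}

    def share(self) -> list[Record]:
        """Return copies so consumers cannot mutate the stored records."""
        return [dict(rec) for rec in self.records]


def has_customer_id(rec: Record) -> bool:
    """Hypothetical hub quality rule: every record must carry a customer id."""
    return bool(rec.get("customer_id"))


source = [
    {"customer_id": "C1", "name": " Ada "},
    {"customer_id": None, "name": "Unknown"},
]

lake = Repository("lake")                                 # no quality gate
hub = Repository("hub", quality_rules=[has_customer_id])  # rigorous quality gate

for repo in (lake, hub):
    repo.collect(source)
    repo.refine()
    repo.profile()
```

Both repositories run the same collect, refine, profile, and share steps. The only difference is the list of quality rules, which is the point: lake and hub are variations on one pattern, not competing designs.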

Data Fabric Framework

Data fabric is a combination of architecture, technology, and services designed to ease the complexities of managing many different kinds of data, stored in multiple database management systems and deployed across a variety of platforms. It provides a single, unified platform for data management across multiple technologies and deployment platforms. One key point in this definition needs to be highlighted: data fabric is a combination of architecture, technology, and services. Architecture alone does not make a data fabric, but architecture is an essential part of the fabric.

Figure 3. Data Fabric Architectural Framework

The architectural framework for data fabric begins with data sources, supporting all types of data and all data sources, both internal and external. Data acquisition supports batch processing, data streaming, and change data capture (CDC). Data storage and refinement supports multiple shared data repositories, including data warehouses, data lakes, operational data stores, and master/reference data management. One category of services in the data fabric supports data access, with query, virtualization, abstraction and semantic layer, and SOA-based services interoperating with data cataloging, data governance, and data security. Access services make data available to the variety of data consumers. A second category of services supports cross-platform orchestration of processing and data pipeline execution.
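To make the second category of services concrete, here is a small sketch of cross-platform orchestration. It assumes a pipeline described as steps, each tagged with the platform it runs on and the steps it depends on. The platform names, step names, and runner functions are hypothetical stand-ins; a real fabric product would call each platform's own interfaces rather than return strings.

```python
from graphlib import TopologicalSorter
from typing import Callable


# Hypothetical platform runners that only report what they would do.
def run_on_cloud_lake(step: str) -> str:
    return f"cloud_lake ran {step}"


def run_on_onprem_warehouse(step: str) -> str:
    return f"onprem_warehouse ran {step}"


RUNNERS: dict[str, Callable[[str], str]] = {
    "cloud_lake": run_on_cloud_lake,
    "onprem_warehouse": run_on_onprem_warehouse,
}

# step name -> (platform it runs on, upstream steps it depends on)
PIPELINE: dict[str, tuple[str, set[str]]] = {
    "ingest_orders": ("cloud_lake", set()),
    "refine_orders": ("cloud_lake", {"ingest_orders"}),
    "load_sales_mart": ("onprem_warehouse", {"refine_orders"}),
}


def orchestrate(pipeline: dict[str, tuple[str, set[str]]]) -> list[str]:
    """Run every step after its dependencies, dispatching to the right platform."""
    graph = {step: deps for step, (_, deps) in pipeline.items()}
    log = []
    for step in TopologicalSorter(graph).static_order():
        platform, _ = pipeline[step]
        log.append(RUNNERS[platform](step))
    return log


run_log = orchestrate(PIPELINE)
```

The pipeline definition says nothing about how each platform executes work. That knowledge lives in the runners, so adding a platform means adding a runner, not rewriting pipelines.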

Data Mesh Framework

Data mesh is an interesting contrast with data fabric. Where the fabric attempts to centralize and coordinate data management, data mesh leans in the other direction, with emphasis on decentralization and data domain autonomy. The big shift with data mesh is in managing data as a set of products, not as a collection of processes and pipelines.

Figure 4. Data Mesh Architectural Framework

With this shift, each data domain operates autonomously and manages its data independently of the practices of other domains. A data domain is a specific area of data ownership and management: perhaps a business functional area such as marketing or finance, a regional designation such as Asia Pacific, a subject area such as customer data, or any combination of these. Any domain that needs to collect and manage data is free to do so. But every domain is responsible for operating within data governance constraints, creating and sharing data products for use by other domains, and adhering to interoperability standards for those products. All domains share a common data infrastructure for storage, cataloging, access control, and other infrastructure components.
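One way to picture the balance between domain autonomy and shared standards is a data product descriptor that every domain must publish to a common catalog. The sketch below is illustrative only: the `DataProduct` fields, the `ALLOWED_FORMATS` standard, and the `SharedCatalog` class are assumptions I have made for the example, not part of any data mesh specification.

```python
from dataclasses import dataclass

# Hypothetical enterprise interoperability standard: the formats products may use.
ALLOWED_FORMATS = {"parquet", "csv"}


@dataclass
class DataProduct:
    """Descriptor a domain publishes for each data product it owns."""
    domain: str
    name: str
    owner: str
    output_format: str
    schema: dict[str, str]  # column name -> type name


def interoperability_issues(product: DataProduct) -> list[str]:
    """Return the ways a product falls short of the shared standards."""
    issues = []
    if not product.owner:
        issues.append("missing owner")
    if product.output_format not in ALLOWED_FORMATS:
        issues.append(f"unsupported format: {product.output_format}")
    if not product.schema:
        issues.append("missing schema")
    return issues


class SharedCatalog:
    """Common infrastructure: the one place where all domains list their products."""

    def __init__(self) -> None:
        self._products: dict[tuple[str, str], DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        issues = interoperability_issues(product)
        if issues:
            raise ValueError("; ".join(issues))
        self._products[(product.domain, product.name)] = product

    def find(self, domain: str, name: str) -> DataProduct:
        return self._products[(domain, name)]


catalog = SharedCatalog()
catalog.publish(DataProduct(
    domain="marketing",
    name="campaign_results",
    owner="marketing-data-team",
    output_format="parquet",
    schema={"campaign_id": "string", "conversions": "int"},
))
```

How the marketing domain builds `campaign_results` is its own business. The catalog checks only what other domains need in order to use the product.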

Architecturally, data mesh is a shift from enterprise data management to domain data management with enterprise collaboration. This means that enterprise data architecture gives way to multiple domain data architectures, and that enterprise data integration is replaced by enterprise data interoperability. To learn more about data mesh, take a look at this article by Kevin Petrie, this blog by James Serra, and this article by Zhamak Dehghani.

Designing an Adaptable Data Management Architecture

Each of the three frameworks has some desirable features. Data lake/data hub architecture is familiar, relatively simple, and well matched to the skills of a typical data engineer. Data fabric is inviting because it eases the pain of cross-platform data management and uses AI/ML to automate many data management and data operations tasks. Data mesh is compelling because it allows us to stop struggling with integration.

Each framework also has a downside. Data lake/data hub is built on the concept of storing data multiple times in multiple formats—copies of data, and then copies of copies, and then… It should be clear to everyone that this approach doesn’t scale and isn’t sustainable as data volumes grow. Data fabric looks great on the surface, but it is very technology-dependent and we can’t help but think about vendor lock-in. Most data fabric tools do a good job of cross-platform interoperability, but they don’t integrate well with other tools. What are the consequences when your fabric metadata is out of sync with your catalog metadata? And the data mesh? It is an intriguing concept, but a big paradigm shift.

It’s also important to realize that none of us gets to have a fresh start. We all have data warehouses, data lakes, data pipelines, and analytic applications to be managed. Declaring them to be legacy doesn’t make them more manageable. Where and how will they fit into the new fabric, mesh, or other modern architecture?

So, upside … downside … legacy data management. Which approach to modern data architecture makes sense for you? It’s a difficult decision, but fortunately you don’t have to choose. Remember that these are frameworks. They are concepts and components, not architectures. Treat them as the right stuff to help you define a hybrid data architecture that is a mix of hub, fabric, mesh, and other concepts that you may want to introduce. Figure 5 illustrates one example of this approach.

Figure 5. Hybrid Data Architecture Example

This example blends data lake, data fabric, data mesh, and other architecture concepts that I’ve not yet discussed. At the top, you see the data lake architecture much as it appears in the data lake/data hub framework. The significant difference is the addition of a data services component. I chose to add a services layer to decouple data consumption from data storage, and to tightly couple security and governance with data access. Moving down the diagram, on the left side you see the data mesh with independent data domains producing data products. We no longer need to treat data integration as a one-size-fits-all solution to data sharing, and we can position legacy data warehouses as independent data domains. At the bottom are the multiple data platforms. The cross-platform automation and orchestration capabilities of data fabric are the right solution for multi-platform data sprawl.
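The reasoning behind the services layer is easier to see in code. The sketch below assumes a `DataServices` class through which every consumer request passes, so that access policy, masking, and audit logging cannot be bypassed. The class, the `Policy` fields, the dataset names, and the masking convention are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]


@dataclass
class Policy:
    """Security and governance rules attached to one dataset."""
    allowed_roles: set[str]
    masked_fields: set[str] = field(default_factory=set)


class DataServices:
    """Consumers call this layer and never touch storage directly."""

    def __init__(self) -> None:
        self._readers: dict[str, Callable[[], list[Row]]] = {}
        self._policies: dict[str, Policy] = {}
        self.audit_log: list[tuple[str, str, bool]] = []

    def register(self, dataset: str, reader: Callable[[], list[Row]], policy: Policy) -> None:
        """Bind a dataset name to whatever storage currently holds it."""
        self._readers[dataset] = reader
        self._policies[dataset] = policy

    def get(self, dataset: str, role: str) -> list[Row]:
        """Check policy, record the attempt, then read and mask the data."""
        policy = self._policies[dataset]
        allowed = role in policy.allowed_roles
        self.audit_log.append((dataset, role, allowed))
        if not allowed:
            raise PermissionError(f"{role} may not read {dataset}")
        return [
            {k: ("***" if k in policy.masked_fields else v) for k, v in row.items()}
            for row in self._readers[dataset]()
        ]


# Stand-ins for a data lake table and a legacy warehouse exposed as a data domain.
lake_rows = [{"customer": "Ada", "email": "ada@example.com", "total": 120}]
warehouse_rows = [{"region": "APAC", "revenue": 5000}]

services = DataServices()
services.register("customer_orders", lambda: lake_rows,
                  Policy(allowed_roles={"analyst"}, masked_fields={"email"}))
services.register("regional_revenue", lambda: warehouse_rows,
                  Policy(allowed_roles={"analyst", "finance"}))

analyst_view = services.get("customer_orders", role="analyst")
```

Because consumers know only dataset names, moving `customer_orders` to a different platform means registering a new reader. Nothing on the consuming side changes, and the policy travels with the dataset rather than with the storage.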

Finally, let’s look at adaptability. New concepts and components will continue to change the data management landscape for the foreseeable future. It isn’t practical to continuously redefine architecture from scratch, but neither is a quick-fix bolt-on a sustainable approach. We need to be able to fit new concepts and components neatly into existing architecture, adapting to changes without disruption. The example here includes data network/data links to accommodate an emerging and highly compelling approach known as zero-copy integration. Also, note the addition of a knowledge graph alongside the data catalog as part of metadata management. This adds the concept of graph-extended data fabric advocated by Cambridge Semantics.

Data Architecture that Really Works

What I’ve shown in this example is not a perfect architecture. It is an example of how to design data architecture that really works: an architecture that meets today’s data management needs, accommodates legacy data systems not up to date with modern architectural thinking, and readily adapts to change as new thinking emerges. Data architecture is difficult, but it can also be rewarding. My advice to those doing this challenging work:

Don’t let complexity lead you to complications.
Don’t be anchored by legacy data systems, but don’t leave them behind.
Don’t let tunnel vision for today’s needs create technical debt for the future.
Do enjoy the work and be proud of what you do because it really matters!

