The Data Mesh: Re-Thinking Data Integration


Enterprises work tirelessly to centralize diverse, ever-multiplying datasets. Their data engineers struggle to transform mountains of data they don’t understand into information that analysts do understand. They have the job of Sisyphus.

The emerging concept of the data mesh proposes a refreshing alternative: decentralize.

What does this mean? Put the smartest business domain experts – perhaps the operational owners of a certain region or business unit – in charge of their data throughout the lifecycle. Have those domain experts transform and deliver their data as a discoverable, consumable product to the rest of the business. Give them the people and training needed to handle data engineering. Create a federated governance team to devise and implement policies. And relegate the rest to a standard, enterprise-wide resource pool, such as a cloud Infrastructure as a Service (IaaS) platform.

This paradigm is no cure-all. It suits certain organizations and certain use cases. But I recently met with a Fortune 100 client that is considering the data mesh as a way to apply discipline to a sprawling set of regional fiefdoms that manage sales and marketing data. Their case study sheds light on the potential efficiency benefits of such an approach.

The data mesh is similar to the concept of self-service analytics, in which business domain experts analyze data and build reports themselves, with minimal IT support. But unlike self-service analytics, in a data mesh these domain experts also create and publish their data as a product for the rest of the business to consume. Such an approach raises the tension between central governance – still of paramount importance – and business unit autonomy. Consider the data mesh a federation, with independent business owners still agreeing on a common language and common units of exchange.

Ensnared in Problems

First, let’s unpack the concept itself. The data mesh, championed by Zhamak Dehghani of the consulting firm ThoughtWorks, defines three enterprise problems to fix. Here is my take on the problems and solutions Zhamak proposes. The problems are familiar, but the solution breaks new ground.

Monolithic platforms cannot keep up. Data warehouses and data lakes often fail to give myriad data sources and data consumers the domain-specific structures they need to create value. Business domain knowledge matters, and it gets lost in central platforms. You can land supply-chain management (SCM) records and delivery-truck sensor feeds easily enough in a data lake. But then data engineers need to reconcile different formats in order to make meaningful correlations about the health of their supply chain – while still accurately representing operational reality.

Data pipelines create bottlenecks because they isolate data ingestion, transformation, and delivery from each other. One team throws data over the wall to the next. To build on our prior example, the data engineers that ingest the SCM and delivery truck data into the data lake might be too busy to properly collaborate with the engineers, developers, or data scientists charged with integrating and transforming it. They trample on each other’s deadlines and requirements, undermining their ability to identify and fix supply chain bottlenecks.

Hyper-specialized data source owners, data engineers and data consumers work at cross-purposes because they speak different languages. Without domain knowledge of how the supply chain works, and how the fleet operates, data engineers and data consumers focus on their narrow field: ETL scripting efficiency, visualization methods, etc. But they lack the know-how to map analytics back to business fundamentals – or even ask data source owners the right questions.

Enter the Mesh

To address these problems, the data mesh paradigm reverses conventional wisdom and embraces distributed data architectures. The four threads of the data mesh – distributed, domain-driven architectures; data as a product; self-service infrastructure; and federated governance – rewrite the rules by putting business experts in charge.

Distributed, domain-driven architectures. Operational data owners design the data mesh. They define the components, processes, and formats to ensure their data always represents business reality. They host, transform, and serve their data to consumers, using dedicated data engineers within their teams. In the case of our Fortune 100 client, region-specific data teams own the end-to-end process of serving that data in a consumable format to any consumer globally. The China team ensures that Corporate Finance analysts understand the nuances of their local pricing strategies.

Data as a Product. Operational data owners seek to delight their internal consumers by making data discoverable, self-describing, and interoperable. The China team with our client catalogs its data, runs integrity checks, and delivers the final product to the rest of the business. When Corporate Finance plugs that data product into its global dashboard, they understand local reality. They easily compare China KPIs with the rest of the world.
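To make this concrete, here is a minimal sketch of what "data as a product" can look like in code: a descriptor that makes a dataset discoverable (name, location), self-describing (business description, schema), and checkable for integrity. All names, fields, and the S3 path are hypothetical illustrations, not our client's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team might publish alongside its data."""
    name: str
    owner_domain: str
    description: str   # self-describing: what the data means in business terms
    schema: dict       # column name -> type name, so consumers can interoperate
    location: str      # where consumers can discover and fetch the data

    def validate_record(self, record: dict) -> bool:
        # Minimal integrity check: every declared column is present
        # with the declared type.
        types = {"string": str, "float": float, "int": int}
        return all(
            col in record and isinstance(record[col], types[t])
            for col, t in self.schema.items()
        )

# Illustrative product from the China team in our example
china_pricing = DataProduct(
    name="china_regional_pricing",
    owner_domain="China Sales",
    description="SKU-level list and promotional prices, reflecting local strategy",
    schema={"sku": "string", "list_price": "float", "promo_price": "float"},
    location="s3://example-mesh/china/pricing/",  # illustrative path
)

print(china_pricing.validate_record(
    {"sku": "A-100", "list_price": 19.9, "promo_price": 14.9}
))  # True
```

The point of the descriptor is that a consumer in Corporate Finance can understand and validate the product without asking the China team what the columns mean.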

Standardized infrastructure. Just as importantly, operational data owners outsource domain-agnostic functions such as storage and computing. Ideally, they simplify deployment and economics by using cloud-based Infrastructure as a Service (IaaS). In many cases, the mesh subsumes rather than replaces existing components, meaning the enterprise data warehouse (EDW) and/or data lake become nodes rather than centerpieces. The EDW or data lake also could complement the mesh and manage data alongside it. To standardize its data mesh, the regional data teams with our client are starting to combine their individual Amazon S3 contracts in order to pool and streamline their infrastructure.
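The division of labor here can be sketched simply: one pooled, domain-agnostic store, with each domain team keeping its own namespace inside it. The class below is a toy stand-in for a shared object store (such as a single pooled S3 account), not a real client library; every name in it is an assumption for illustration.

```python
class SharedObjectStore:
    """Toy stand-in for a pooled IaaS object store shared by all domain teams."""

    def __init__(self):
        self._objects = {}

    def put(self, domain: str, key: str, data: bytes) -> str:
        # Each domain team writes under its own prefix: infrastructure is
        # shared and standardized, but data ownership stays with the domain.
        path = f"{domain}/{key}"
        self._objects[path] = data
        return path

    def get(self, path: str) -> bytes:
        return self._objects[path]

store = SharedObjectStore()  # one pooled contract instead of one per region
path = store.put("china", "pricing/2024.csv", b"sku,list_price\nA-100,19.9\n")
print(path)  # china/pricing/2024.csv
```

The design choice is that only the storage and compute layer is centralized; the prefix convention keeps publishing and stewardship in the hands of each regional team.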

Ecosystem Governance. To ensure business owners can trust and share their data products, an enterprise data governance team must implement access controls, cataloging and compliance policies across the distributed data mesh. This team examines each point in the creation of the data product: is the data trustworthy, and have operational data owners applied the right constraints on its usage? They also need a common glossary to minimize the ever-present risk of language barriers between business units.
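A federated governance team's audit of "each point in the creation of the data product" can be pictured as a small set of central policies applied to every domain-owned product. The policies and field names below are hypothetical examples, not a prescribed standard.

```python
# Hypothetical central policies; each domain team's product must pass all of them.
POLICIES = {
    "requires_catalog_entry": lambda p: bool(p.get("catalog_id")),
    "requires_owner": lambda p: bool(p.get("owner_domain")),
    "pii_needs_access_control": lambda p: (
        not p.get("contains_pii") or p.get("access_policy") is not None
    ),
}

def audit(product: dict) -> list:
    """Return the names of governance policies this data product violates."""
    return [name for name, check in POLICIES.items() if not check(product)]

# A cataloged, owned product that holds PII but lacks an access policy
violations = audit({
    "catalog_id": "dp-042",
    "owner_domain": "China Sales",
    "contains_pii": True,
    "access_policy": None,
})
print(violations)  # ['pii_needs_access_control']
```

The federated part is that the policy list is defined once, centrally, while each domain team remains responsible for bringing its own products into compliance.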

Like any modernization approach, the data mesh is one arrow in the quiver, and should only be pointed at certain targets. To decide whether you need a data mesh, I propose asking the following questions:

  • Do your operational data owners, data engineers and data consumers struggle to collaborate effectively?
  • Do they struggle to understand one another?
  • Is a lack of business domain knowledge a primary productivity barrier for your data engineers?
  • Is it a primary productivity barrier for your data consumers?
  • Do you have unavoidable, business domain-specific variations in data across regions, BUs, etc.?

If you answered yes to all five questions, in particular the final one, the data mesh is an option to consider. Then it is time to gain executive sponsorship and budget. Start by creating one business domain-based team, staffed with data engineers, that carefully scopes and executes the necessary transformations to deliver its data as a product to decision makers of all types. Cross-train your staff to instill domain expertise everywhere. Automate data engineering tasks wherever possible, and embrace IaaS to simplify the tactical stuff.

Based on the outcome of your pilot, rinse and repeat for the next function.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...
