Data Mesh Translated: Software Engineers Try to Reform Data
ABSTRACT: The data mesh is an attempt by software engineers to remake the data industry in their image. There is a lot of goodness in the data mesh, but will it work?
Every decade or so, a segment of the information technology industry “discovers data” and makes big investments that don’t pan out as expected and sometimes fail spectacularly.
App Vendors. In the mid 2000s, enterprise application providers, including SAP, IBM, and Oracle, paid billions to buy into the maturing business intelligence market through acquisitions of Business Objects, Cognos, and Hyperion. Those products were soon eclipsed by new ad hoc visualization tools (e.g., Qlik, Tableau, Power BI) and, more recently, cloud-native BI tools. It’s unlikely the Big Three ever realized a smart return on their money.
Open Source. In the mid 2010s, the open source community coined the term “Big Data,” which was all the rage until its flagship codebase – Hadoop – collapsed under the weight of its complexity, unreliability, and, yes, cost. Spark, then cloud data platforms, such as Snowflake and Databricks, buried Hadoop and its dreams of making data warehouses obsolete.
Software Engineering. Today, software engineers have discovered data and don’t like what they see. The data mesh is a software engineer’s model for how to reinvent data so it functions more like modern object-oriented code and microservices. Among other things, it specifies a peer-to-peer architecture that eliminates the need for data warehouses and data lakes. Zhamak Dehghani, author of “Data Mesh: Delivering Data-Driven Value at Scale”, writes:
“The principles of data mesh are a generalization and adaptation of practices that have evolved over the last two decades and proved to solve our last complexity challenge: scale of software complexity led by the mass digitization of organizations…. They are an adaptation of what formulated the previous paradigm shift in software: the microservices and APIs revolution, platform-based Team Topologies, computational governance models such as Zero Trust Architecture, and operating distributed solutions securely and across multiple clouds and hosting environments.”
The Reckoning
I give Dehghani credit for accurately cataloging the ailments of most data implementations today: the endless data bottlenecks, rigid data models, poor data quality, disenfranchised data users, and lack of return on data investments. All true: guilty as charged.
I give her more credit for coining a buzzword that represents the holistic changes that organizations need to make to optimize their use of data. Yes, organizations need data governance; yes, they need a self-service platform; yes, they need a distributed architecture; yes, they need to make it easy for people and teams to find and use data; yes, they need to build solutions closer to the business; yes, people and teams need to create their own data-driven solutions and share them with others. Data consultants and experts have counseled organizations for years about doing these things.
But like the big data movement before her, Dehghani is eager to ditch the old in favor of the new. The data mesh has no place for a data warehouse, a data lake, the data pipelines that feed them, or the enterprise data engineers who create them. There is no place for an enterprise data team except to build and maintain the data sharing platform and (perhaps) facilitate the governance work that keeps the data domains from spinning into irreconcilable data silos. In other words, the data mesh is all about decentralizing data. That sounds good in theory, but will it work?
In a true data mesh, all data is created, managed, and shared by domain teams in the business. This is a true peer-to-peer data sharing environment in which domain teams create and share atomic data products through standard interfaces. Welcome to a microservices architecture for data! The question is: Will the lessons of software engineers who’ve mastered how to code complex solutions in large geographically dispersed organizations apply to data?
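To make "atomic data products through standard interfaces" concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything specified in Dehghani's book: the DataProduct and Mesh classes, the publish, consume, and read methods, and the sample "orders" product are hypothetical names. The idea is only that every domain exposes its data the same way, with a published schema, so another domain can consume it without going through a central warehouse.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    """A domain-owned dataset exposed through one standard read interface."""
    name: str
    owner_domain: str
    schema: Dict[str, type]  # column name -> expected Python type
    source: Callable[[], Iterable[Row]] = field(repr=False)

    def read(self) -> List[Row]:
        """Return all rows, rejecting any that violate the published schema."""
        rows: List[Row] = []
        for row in self.source():
            for column, expected in self.schema.items():
                if not isinstance(row.get(column), expected):
                    raise TypeError(
                        f"{self.name}.{column} is not {expected.__name__}"
                    )
            rows.append(row)
        return rows


class Mesh:
    """Hypothetical registry where domains publish and discover products."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def consume(self, name: str) -> List[Row]:
        return self._products[name].read()


mesh = Mesh()

# The sales domain publishes its own product (sample data is made up).
mesh.publish(DataProduct(
    name="orders",
    owner_domain="sales",
    schema={"order_id": int, "amount": float},
    source=lambda: [
        {"order_id": 1, "amount": 19.5},
        {"order_id": 2, "amount": 5.0},
    ],
))

# Another domain, say finance, consumes it peer to peer.
total = sum(row["amount"] for row in mesh.consume("orders"))
```

Note what the sketch leaves out: nothing here forces two domains to agree on what an "order" or an "amount" means. That gap is the governance problem, not an interface problem.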
In my last article on the data mesh, I wrote about some of its obvious problems. Most business teams don’t have the time, staff, skills, budget, or interest to hire their own data experts to build and manage data products for themselves, let alone others in the company. A self-service data platform will make it easier for them to do this, and certain groups that have already invested in their own data resources, typically finance and marketing, will benefit immensely. But most other groups will have to rely on enterprise data teams to do this work.
In addition, a big challenge with the mesh is keeping data domains from spinning off into irreconcilable data silos. A shared data platform is important, but a robust data governance program is critical. Unfortunately, data governance is notoriously hard to do right, even with a strong enterprise data team facilitating the work. Enforcing consistency on dimensions, metrics, master and reference data, and security is key to ensuring that a data mesh doesn’t descend into data chaos.
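One way to picture what enforcing consistency on metrics could look like in practice is an automated check that compares each domain's metric definitions against an enterprise standard. The Python sketch below is a toy under stated assumptions: the ENTERPRISE_METRICS glossary, the find_conflicts function, and the sample definitions are all hypothetical, and a real governance program would compare far more than definition strings.

```python
from typing import Dict, List

# Hypothetical enterprise-wide definitions agreed through governance.
ENTERPRISE_METRICS: Dict[str, str] = {
    "revenue": "sum(amount) where status = 'paid'",
    "active_customer": "customer with an order in the last 90 days",
}


def find_conflicts(domain_metrics: Dict[str, Dict[str, str]]) -> List[str]:
    """List each domain metric whose definition differs from the standard.

    Metrics that have no enterprise definition are treated as local
    to the domain and are ignored.
    """
    conflicts: List[str] = []
    for domain, metrics in sorted(domain_metrics.items()):
        for metric, definition in sorted(metrics.items()):
            standard = ENTERPRISE_METRICS.get(metric)
            if standard is not None and definition != standard:
                conflicts.append(f"{domain}.{metric}")
    return conflicts


# Made-up domain definitions: marketing counts unpaid orders as revenue.
domains = {
    "sales": {"revenue": "sum(amount) where status = 'paid'"},
    "marketing": {
        "revenue": "sum(amount)",
        "leads": "count(form_submissions)",
    },
}

conflicts = find_conflicts(domains)
print(conflicts)
```

The check itself is trivial. The hard part, as noted above, is getting the domains to agree on the glossary in the first place and to keep it current.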
Stars Are Aligned
Nonetheless, the data mesh is gaining momentum. Five new technologies are transforming the way organizations use data and propelling adoption of the data mesh. The mesh provides a suitable, if not coincidental, methodology to tie these technologies together. Many companies already use them to support modern data architectures, whether or not they are implementing a “data mesh”.
Cloud data platforms. These global, multi-tenant databases give each data domain its own space on a shared platform that makes it easy to share data across domains and access a common repository of enterprise data.
DataOps. Another carryover from software engineering, DataOps makes it easy for domain teams to quickly build reliable data pipelines on a cloud data platform, especially one that supports zero-copy cloning and time travel.
Data Catalog. Finally, we have tools that index and tag all data in an enterprise. Catalogs make it easy for business users to find that data and understand its meaning and context.
Data Marketplace. An extension to a data catalog, the data marketplace makes it easy for individuals and teams to publish data and consume data that others have published, including external data.
Data Virtualization. These modern SQL engines make it easy to find and query distributed data at scale. They federate queries across multiple systems behind a common business model.
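The federation idea in the last item can be illustrated with a toy Python sketch. It uses SQLite's ATTACH DATABASE as a stand-in for a data virtualization engine, which is a loud simplification: real engines query separate, remote systems, whereas here the two "domains" are in-memory databases on one connection. The sales and support schemas, table names, and sample rows are all made up for illustration.

```python
import sqlite3

# One connection plays the role of the federating SQL engine.
conn = sqlite3.connect(":memory:")

# Each attached database stands in for a separate domain's system.
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, severity TEXT)")

conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 30.0)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, "high"), (2, "low"), (2, "low")],
)

# A single query joins data that lives in two different domains.
rows = conn.execute(
    """
    SELECT o.customer_id, o.total_spend, t.ticket_count
    FROM (SELECT customer_id, SUM(amount) AS total_spend
          FROM sales.orders GROUP BY customer_id) AS o
    JOIN (SELECT customer_id, COUNT(*) AS ticket_count
          FROM support.tickets GROUP BY customer_id) AS t
      ON o.customer_id = t.customer_id
    ORDER BY o.customer_id
    """
).fetchall()

for customer_id, total_spend, ticket_count in rows:
    print(customer_id, total_spend, ticket_count)
```

The join only works because both domains happen to use the same customer_id. Without shared keys and reference data, a federated query has nothing reliable to join on.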
Roche, a global biotech company, began implementing a data mesh a year-and-a-half ago to replace an on-premises data warehousing architecture afflicted with delays and bottlenecks. The migration has been a success. To date, Roche has onboarded 40 data product teams that have created 50 data products. The data team has increased its delivery cadence from one release every three months to an astounding 120 releases in one month, roughly a 360-fold increase. (See “What Can Data Mesh and DataOps Do For You? Ask Roche.”)
To make data mesh work, Roche has relied heavily on Snowflake to provide a common data repository and sharing platform for its global domain teams. It also relies on DataOps.live, a DataOps tool that orchestrates development and execution across geographically distributed development teams and a dozen or more data management tools.
These tools have been “a complete game changer,” says Paul Rankin, head of data management and architecture. “We’re talking about ROI in terms of saving thousands of hours and dollars in processing and developer time.”
Although Roche has embraced the data mesh methodology and uses its terminology, it still has many vestiges of its old data environment, including a data warehouse, data lake, ETL tools and so on. It can be debated whether Roche has benefited more from the tools or the methodology.
Summary
Despite some drawbacks, the data mesh has a lot of goodness. I hope the current buzz about the data mesh, along with new data technologies, will break through the miasma that blankets many data organizations and help them make the necessary organizational and process changes to effectively harness data.