Zero-copy Approaches to Data Sharing

Driving through the southeastern United States, you can’t miss the kudzu. It grows in long, tangled vines up trees and telephone poles, swallowing forests along the highway. Originally planted to control soil erosion, kudzu now smothers its environment. One plant would be just a pretty vine, but its rapid multiplication has made it an uncontrollable menace to the ecosystem. If we aren’t careful, copies of data sets can choke our data ecosystems in the same way.

Today, the standard way to share or move data is to copy it from one place to another via integration. Whether internally or externally, when we want someone to have access to a certain set of data, 80% of the time we make a copy of it for them. Not only does this duplication cost money and introduce errors, but it also creates a nightmare for data governance. With dozens or more copies of the same data floating around, companies lose control. They can no longer easily determine who has access to data or enforce policies. This is not a new problem, but it’s getting worse.

In the not-too-distant future we will see the rise of data webs. Just as the World Wide Web links millions of documents, or websites, together, these data webs will soon link data within and between organizations. Why will this happen? Because the more data organizations have, the better they can contextualize the data they generate. To compete in the modern economy, we need access to more than just our own data. The rise of data exchanges, documented in my colleague Wayne Eckerson’s report of the same name, demonstrates this trend. These platforms make it easy for suppliers and consumers of data to interact and share data. Although there are huge potential upsides to data sharing, there are also huge risks, at least for now. In this article, I will explore a few of the new approaches to governed data sharing that seek to facilitate the creation of secure data webs by eliminating the need to copy data.

To begin, let’s visualize the problem. Data exists in application and organizational silos. As companies share data between departments and with customers, partners, and applications, they create data flows that replicate and transform data from one system to another. Together, these flows form a kind of “network,” not in the IT sense, but in the classical conception of a group of nodes connected by edges.

Figure 1. Connected Silos

In this “network,” the nodes represent external data repositories, cloud applications, customers, operational systems, and business environments. Each edge represents a data sharing relationship. Already, we can imagine that most companies have a version of this configuration based on their integration patterns. Operational data is extracted to a data warehouse for business intelligence (BI) analysts. At the same time, that data also flows into a data lake where data scientists pool it with information from other sources, such as internet of things (IoT) data, to create complex models. Finally, some portion of the data might feed into consumer-facing applications, which also collect and return data from user interactions. These relationships are complex enough on their own, and that’s before companies begin to share their data externally, essentially connecting the nodes in one organization to those in another.
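To make the mental model concrete, here is a minimal Python sketch of such a “network”: each node is a data silo and each edge is a sharing relationship. The node names and the structure are hypothetical, chosen only to mirror the example above.

```python
# A toy model of the "network" described above: nodes are data silos
# (warehouses, lakes, applications) and edges are sharing relationships.
from dataclasses import dataclass, field

@dataclass
class DataNode:
    name: str   # e.g., "operational_db", "data_warehouse"
    kind: str   # e.g., "operational", "analytic", "application"

@dataclass
class DataWeb:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, target, method) tuples

    def add_node(self, node: DataNode) -> None:
        self.nodes[node.name] = node

    def share(self, source: str, target: str, method: str = "copy") -> None:
        """Record a sharing relationship; method is 'copy' or 'zero-copy'."""
        self.edges.append((source, target, method))

# The configuration described above: operational data feeds a warehouse,
# a data lake, and a consumer-facing application.
web = DataWeb()
for name, kind in [("operational_db", "operational"),
                   ("data_warehouse", "analytic"),
                   ("data_lake", "analytic"),
                   ("customer_app", "application")]:
    web.add_node(DataNode(name, kind))

web.share("operational_db", "data_warehouse")  # BI extract
web.share("operational_db", "data_lake")       # data science pool
web.share("data_lake", "customer_app")         # consumer-facing features

print(f"{len(web.nodes)} nodes, {len(web.edges)} copy-based edges")
```

Even in this tiny example, every edge defaults to “copy,” which is exactly the pattern the rest of this article questions.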

Currently, copy-based integration makes up the vast majority of the sharing relationships in these “networks.” To move data from one node to another, data pipelines extract data from one place (by replicating it) and load it somewhere else. (See figure 2.)

Figure 2. An Integration-based Approach

As noted earlier, this approach results in numerous copies of the data floating around. The same data resides in multiple places, potentially even in places where the organization no longer has oversight of it. With the rising importance of data governance, both for regulatory compliance and operational excellence, this is no longer acceptable. Instead, we need an approach that preserves the original data as a unique resource.

The idea of data ownership is deeply entrenched in new approaches to governed data sharing within a data web. As an owner, you don’t want to give up copies of your data; you want to “rent” it out by granting temporary access. Whether or not you actually monetize your data, a zero-copy approach ensures complete and continuous control over your data even as others benefit from it. In this paradigm, an organization stores the data itself in a single location, such as a centralized repository, and permits internal and external project stakeholders or applications to request access to view and use it where it resides. This is different from a data lake because the data is never copied. Instead of exporting copies of data to applications, the repositories in a data web allow applications to work directly off of them.
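As a rough illustration of this “rent, don’t copy” idea, the sketch below keeps a single authoritative dataset and issues revocable, time-limited grants instead of exports. It is a conceptual toy, not any vendor’s actual API; all names are invented.

```python
# Conceptual sketch of zero-copy sharing: one governed dataset,
# revocable time-limited grants, and reads that happen in place.
import time

class GovernedDataset:
    def __init__(self, records):
        self._records = records   # the single, authoritative copy
        self._grants = {}         # consumer name -> grant expiry timestamp

    def grant_access(self, consumer: str, ttl_seconds: int) -> None:
        """'Rent' access: the consumer may read in place until the grant expires."""
        self._grants[consumer] = time.time() + ttl_seconds

    def revoke_access(self, consumer: str) -> None:
        self._grants.pop(consumer, None)

    def read(self, consumer: str):
        expiry = self._grants.get(consumer)
        if expiry is None or expiry < time.time():
            raise PermissionError(f"{consumer} has no active grant")
        return self._records      # a view of the original, never an export

orders = GovernedDataset([{"order_id": 1, "amount": 42.0}])
orders.grant_access("partner_analytics", ttl_seconds=3600)
print(orders.read("partner_analytics"))    # works while the grant is live
orders.revoke_access("partner_analytics")
# orders.read("partner_analytics") would now raise PermissionError
```

The important property is that revoking the grant cuts off access immediately, something that is impossible once a copy has already left the owner’s hands.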

Functionally, this approach allows the data owner to operate like an electrical utility. Any system or organization that wants access to the data can plug in, but the data owner can regulate their access or turn off the tap at any time. This contrasts with the copy-based model, in which the data owner essentially gives everyone their own battery. Although the zero-copy method has clear advantages, it is technologically non-trivial.

If we zoom in on a given node in our data web, we see that the key to the whole network is the system that manages access to the data. Without a way to seamlessly open and close the access points while protecting the data itself, the data web won’t function as intended. (See figure 3.)

Figure 3. A Node in a Zero-copy Data Web

Multi-tenant sharing within a single platform is one solution. Snowflake takes this approach as it builds out the data web it calls the “Data Cloud.” The company now positions itself as the infrastructure supporting a vast network of interconnected data and providing services that run on top. One of these services is the ability to share data through its data exchange. Because of Snowflake’s cloud-based, multi-tenant, high-concurrency architecture, multiple applications or organizations can use the same data at the same time. Instead of relying on copies, Snowflake allows them to query the original data simultaneously. This only works, however, because all the data in question is stored in Snowflake. It’s also oriented toward analytic rather than operational use cases, because non-owners cannot edit the data. The approach works best for sharing data sets that are already cleaned and organized to meet standard consumption patterns. To build data webs without getting locked into a particular vendor, and to facilitate true data collaboration, organizations need a different approach.
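For a sense of what this looks like in practice, the sketch below uses Snowflake’s Python connector to create a share, grant it read access to an existing table, and invite another account. The account, database, and share names are placeholders, and the exact syntax and options should be checked against Snowflake’s documentation.

```python
# Sketch of Snowflake's share-based, zero-copy sharing pattern using
# snowflake-connector-python. All identifiers below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="DATA_OWNER",        # placeholder credentials
    password="...",
    account="owner_account",
)
cur = conn.cursor()

# Create a share and grant it read-only access to existing objects.
cur.execute("CREATE SHARE sales_share")
cur.execute("GRANT USAGE ON DATABASE sales_db TO SHARE sales_share")
cur.execute("GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share")
cur.execute("GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share")

# Add a consumer account; it queries the shared table in place,
# and no copy of the data ever leaves the owner's account.
cur.execute("ALTER SHARE sales_share ADD ACCOUNTS = partner_account")

cur.close()
conn.close()
```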

New international standards that dictate how to create nodes and manage access may be the alternative. Currently, there are two main initiatives attempting to address this problem. The first is the Solid protocol. This set of specifications defines how to create a “data pod.” Each pod functions as a kind of data wallet. It securely stores data of all types and allows the owner to selectively share that data with other organizations or applications via defined authentication and authorization systems. Unlike a platform-specific solution, the Solid protocol is intended to facilitate interoperability and depends on W3C open standards. Tim Berners-Lee, inventor of the World Wide Web, is one of the key minds behind this project, which is currently in development at MIT. 
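In a Solid-style data web, once the pod owner has authorized a requester, that requester reads the resource over HTTP where it lives. The snippet below is a heavily simplified illustration of that pattern: the pod URL and token are hypothetical, and the real Solid-OIDC authentication flow and access-control details are omitted.

```python
# Heavily simplified sketch of reading a resource from a Solid-style data pod.
# The pod URL and token are placeholders; the full Solid-OIDC flow is omitted.
import requests

POD_RESOURCE = "https://alice.example.org/contacts/work.ttl"  # hypothetical pod resource
ACCESS_TOKEN = "..."  # obtained after the pod owner grants access

response = requests.get(
    POD_RESOURCE,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "text/turtle",  # Solid resources are commonly stored as RDF (Turtle)
    },
)
response.raise_for_status()
print(response.text)  # the data is read where it resides, never exported as a copy
```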

The other initiative is the standard for Zero-Copy Integration driven by the Data Collaboration Alliance and the CIO Strategy Council of Canada. Although presently a national project, the Data Collaboration Alliance sees its work in Canada as a jumping-off point for a global standard. The goal is to create an agreed-upon framework for the development of new applications that allows data owners to define access controls. Like Solid, the Zero-Copy Integration standard would create a platform-agnostic solution that allows for the creation of secure nodes that connect to a wide array of applications. Although this enables operational sharing, in which multiple applications can access and edit a single, governed set of data in its native format, the number of companies that actually adopt the standard will dictate its utility.

Conclusion

The volume of data in the world increases every day. Our ability to benefit from it depends on our ability to access it. But although data sharing is increasingly vital to the operation of our organizations, it continues to be at odds with the need for increased data governance. How we decide to meet this challenge will have ramifications for years to come. Data webs will continue to grow in size and prevalence, so we must lay a strong foundation now to ensure they meet our privacy and security demands into the future.

Joe Hilleary

Joe Hilleary is a writer, researcher, and data enthusiast. He believes that we are living through a pivotal moment in the evolution of data technology and is dedicated to...
