Let’s Not Get Physical: Get Logical
Despite over 50 years of database system development experience and continued advances in data management, we still have a fundamental problem that is dragging us down.
There is an obsession with physical data organization and movement, and an inability to deal with data primarily at the logical level.
The result is data management that costs too much, takes too long, and yields too little business value.
Let me explain.
A modern data architecture for a multinational firm uses a highly heterogeneous mix of technologies. It is not at all surprising to find SQL database management systems, from multiple vendors, for master data management, transactional data capture, and data warehouses; NoSQL databases such as MarkLogic and Neo4j for documents, graphs, and other data that does not fit neatly into tables; and distributed stores such as HBase and Cassandra, often hosted in the cloud, for big data analytics.
The resulting landscape is a Tower of Babel.
Data must be moved from database to database in order to be used, and that data movement is achieved through a variety of methods, including bulk data unload/load operations, extract-transform-load (ETL) jobs, and ad hoc custom code. We often play the telephone game with the data as we move it, altering it slightly each time to suit the needs of its next recipient; eventually, just as with a message whispered from ear to ear, the final data bears little resemblance to the original.
Naturally, with an enterprise landscape—and an industry landscape—in this condition, the focus of data architecture remains on physical issues: how do we get data from point A so we can use it at point B; how do we physically reorganize the data to meet the needs of a particular data consumer; how do we deal with data format incompatibilities as data moves from vendor system to vendor system; etc.
Think about this question for a moment: What is the business value of spending expensive staff time setting up jobs to move data from place to place, or to transform data from one form to another, just so you can use it? It’s very easy to answer this question: there is no business value. All of that programming is just overhead—cost that sucks profit out of the bottom line. And the resultant data architecture complexity reduces business agility and creates risk.
As mentioned above, it’s taken us 50 years to get to today’s state of data management, and there have been real advances in speed and scale. What will it take to get to a data architecture where the focus is on what the data means, rather than on its form or location? When will our data analysts stop spending 70% of their time just acquiring, cleaning, and transforming data before they can start analyzing it?
To answer these questions, let’s cast our minds forward to a more ideal future state.
In the ideal future, there would be no programmers responsible for data movement. Instead, the data infrastructure would provide the illusion that all data is almost instantly available at the physical point of its need. Data consumers, including data analysts, would log on to a data catalog, shop for, and request the data they needed. That data would be described at a high level, with its business meaning clearly spelled out for both the human and the machine. (We call that computable meaning.) When a user requested data to be delivered to a certain point (perhaps a virtual point in the cloud), the data infrastructure would start copying the data from its origin, using replication techniques—meaning no potentially deforming transformations would be built into the data movement. The user would have to wait for the data to arrive, but the delay would be primarily a function of the amount of data requested. Most data would be available in seconds to minutes. Large data sets could be delivered for stream processing, where data is processed as it arrives, thereby minimizing time to results.
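To make the last point concrete, here is a minimal Python sketch of processing data as it arrives rather than after a full copy has landed. Everything in it is an assumption made for illustration: the `replicate` and `running_total` functions, the `origin` list, and the `amount` field are hypothetical stand-ins, not any vendor’s API. The only property it tries to show is that records are copied unchanged and that results become available record by record.

```python
from typing import Iterable, Iterator


def replicate(source: Iterable[dict]) -> Iterator[dict]:
    """Yield an unmodified copy of each record, one at a time.

    Stands in for replication: records are copied, never transformed.
    """
    for record in source:
        yield dict(record)


def running_total(records: Iterable[dict], field: str) -> Iterator[int]:
    """Consume records as they arrive and emit the total seen so far."""
    total = 0
    for record in records:
        total += record[field]
        yield total


# Hypothetical origin data; in practice this would be a remote database.
origin = [{"amount": 10}, {"amount": 20}, {"amount": 5}]

# Each total is available as soon as its record arrives.
totals = list(running_total(replicate(origin), "amount"))
print(totals)
```

Because both functions are generators, nothing waits for the whole data set: the consumer sees its first result after the first record, which is the sense in which stream processing minimizes time to results.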
Today, most database management systems are built around a single physical data organization, and their vendors are busy explaining why that’s the best form for all your data needs. In the future, data consumers will consider every data organization to be valuable in its own right for specific purposes, and will demand that any one database management system be able to organize data in many ways: graph, table, document, columnar, or whatever. Physical data organization will become a detail rather than the focal point of data architecture decisions.
If this is so wonderful, why don’t we have it already? Because enterprises and vendors are focused on the physical data organization and movement problems that they face today. Enterprises are looking for solutions to their physical data problems, and each vendor is touting the benefits of one physical solution over all the others.
If we could shift the focus away from the physical and onto the logical, and demand that vendors support that focus, we could dramatically reduce the cost and increase the value of data management.
The data movement solution is already at our fingertips. Modern NoSQL database systems replicate data across hundreds of nodes, giving the illusion that data in its original form is ubiquitous. If such systems could be tweaked so that any one node can replicate specific data on demand, the illusion of local data access can be delivered to end consumers.
For the rest of the ideal future state, there are several barriers that must be overcome.
First, we need database system vendors that deliver hybrid or multi-model database systems, so that we can physically reorganize our data for various purposes without needing to switch vendors. Those systems need to have highly scalable on-demand replication out of the box.
Second, we need standard, non-vendor-specific data definition and query languages that can work across hybrid databases from multiple vendors, so that we’re not forced down into the physical layer once again. SQL, the most widely used language, is too low-level, and the many SQL derivatives created to deal with NoSQL databases share this problem.
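Third, we need a better data language for exchanging data between systems. XML is too verbose and inefficient for this purpose, and JSON offers only a handful of primitive types (strings, numbers, booleans, and null), with no native dates or exact decimals, so the type information that is critical to preserving data quality is easily lost in transit.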
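The type gap in JSON can be seen in a few lines. The following is a minimal Python sketch using only the standard library; the `record` contents and the `to_jsonable` helper are hypothetical, chosen for illustration. It sends a date and an exact decimal through a JSON round trip and shows that both come back as plain strings.

```python
import json
from datetime import date
from decimal import Decimal

# A record with two typed values that JSON cannot represent natively.
record = {"order_date": date(2024, 3, 1), "amount": Decimal("19.99")}


def to_jsonable(value):
    """Fallback encoder: dates and decimals have no JSON type, so send text."""
    if isinstance(value, (date, Decimal)):
        return str(value)
    raise TypeError(f"cannot encode {type(value).__name__}")


wire = json.dumps(record, default=to_jsonable)
received = json.loads(wire)

# The values survive, but their types do not: both are now str.
print(type(received["order_date"]).__name__, type(received["amount"]).__name__)
```

The receiver can only recover the original types if it already knows, from somewhere outside the message, which strings were dates and which were decimals. That out-of-band knowledge is exactly the kind of meaning that should travel with the data.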
Fourth, we need a better way to express computable meaning, and data needs to travel with its meaning. We have the Web Ontology Language (OWL), but it’s only efficient for predicates with two variables (triples), and it’s not human-friendly; further, we don’t have the means to package it with data efficiently.
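To illustrate the two-variable limitation, here is a minimal Python sketch in which plain tuples stand in for triples; it is not OWL syntax, and the `reify` helper, the `shipment1` identifier, and the role names are all hypothetical. A single fact with four participants has to be broken into four binary statements hung off an artificial node.

```python
# One four-place fact: supplier, customer, product, quantity.
shipment = ("Acme", "Globex", "widget", 500)


def reify(fact_id, roles, values):
    """Split one n-ary fact into binary (subject, predicate, object) triples.

    fact_id names an artificial node that exists only to tie the parts together.
    """
    return [(fact_id, role, value) for role, value in zip(roles, values)]


triples = reify(
    "shipment1",
    ("supplier", "customer", "product", "quantity"),
    shipment,
)

for triple in triples:
    print(triple)
```

The one business fact is now four statements plus an invented identifier, and a reader (or a query) must reassemble them to see what was originally said. That overhead grows with every additional participant in a fact.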
Hybrid-model, replication-based databases are already emerging from some vendors, but the number of models they support needs to increase. The data language problems remain a significant gap, and they won’t be addressed until the industry recognizes that the barriers to the next advances in data management are logical, not physical. As data product consumers, we can tell our vendors that we want a focus on replication-based, multi-model, hybrid data management solutions, and we can ask them to support advances in data languages that are data-model independent.
So, let’s get logical!