The Future of Data Cataloging
The data catalog market is in the early stages of a significant shift. Standalone data catalogs that are biased toward a single primary use case will give way to data cataloging integrated as a set of functions embedded in broader tool suites. Early signs of the shift are evident among the software vendors that offer data catalogs. Some of the early data catalog providers are being acquired by vendors with more comprehensive product offerings: Waterline has been acquired by Hitachi Vantara, and Dell Boomi has announced its intent to acquire Unifi. At the same time, vendors with other origins are adding data cataloging to their toolsets, including Informatica, Tableau, IBM Watson, Alteryx, and many more. Some of the more recent entrants into the data management market built in data cataloging right from the start. Ramesh Menon, Product Management VP at Infoworks, says that you simply can’t do DataOps without a data catalog.
The heart of a data catalog is metadata, and metadata management loses some of its value when biased toward a single primary purpose. Consider where today’s catalogs started. Alation was built originally to support data analysts and the trend toward self-service. Collibra, by contrast, is focused first and foremost on data governance. Waterline’s initial focus was simply organizing the content of a data lake. The Unifi data catalog is biased toward supporting data integration and preparation. The reality is that none of these things (analyst needs, governance, data lake management, data preparation, and data integration) are islands that exist independently of the others. Aaron Kalb, CDO at Alation, acknowledges the trend and tells me that “Alation is evolving into a platform that supports all of those use cases.”
So, with those thoughts about the data catalog market, what about Hitachi Vantara and Waterline? I think it remains to be seen to what extent this is good for both companies. For Waterline it’s probably a good thing because they really need to be part of a bigger product suite to maintain and begin to grow their somewhat limited market share. For Hitachi, it depends almost entirely on how well they can weave data catalog functionality into their DataOps strategy, make Waterline fit neatly with Lumada, achieve either integration or seamless interoperability of Pentaho and Waterline, and make Waterline part of the metadata foundation for future product evolution. There is certainly potential here, but there is always a lot of work to be done to turn potential into reality.
The Boomi acquisition of Unifi is in many ways similar to Hitachi’s acquisition of Waterline. Integration will be the key to success. If the Unifi catalog becomes an integral part of the Boomi iPaaS platform, it will make a difference for integrators working with Boomi. As an add-on without full integration, it may struggle to gain acceptance and adoption. Unifi, however, is more than a data catalog. It is data preparation technology that is well suited for self-service integrators and analysts, something that fits well with Boomi’s iPaaS foundation. Again, there is lots of potential here, and this is definitely one to watch.
Ultimately, the future will bring the demise of the standalone data catalog oriented to a single use case. The “bolt-on” data catalog is isolated from the data operations that other products perform as we integrate, cleanse, blend, and prepare data for analytics. A smart standalone catalog can draw inferences about the processing of data, but those inferences are limited, lacking in depth, and unable to provide an unbroken chain of data provenance and data lineage. Data cataloging embedded in the tools that handle data from original sources to analysis and publishing of insights is a big part of the data cataloging future. What this means, however, is that when you use many tools for data integration, data preparation, and data analysis, you will have multiple data catalogs. We don’t want to repeat the metadata muddle of the BI era, with its many disconnected and disparate pockets of proprietary metadata. Together with tool-embedded data catalogs, you’ll need an enterprise data catalog (a natural repositioning for Alation) that knows about all of your data from original sources to analytics-ready data, is capable of supporting all catalog use cases, and is able to interoperate and exchange catalog metadata with other catalogs.
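To make the interoperability point concrete, here is a minimal sketch of what an enterprise catalog might do with lineage metadata exchanged by tool-embedded catalogs. Everything in it is an assumption made for illustration: the dataset names, the three per-tool exports, the (upstream, downstream, tool) edge format, and the functions `merge_lineage` and `trace_to_sources` are hypothetical and do not reflect any vendor's actual API or exchange format. The sketch also assumes the combined lineage contains no cycles.

```python
from collections import defaultdict

# Hypothetical lineage exports from three tool-embedded catalogs.
# Each edge is (upstream dataset, downstream dataset, tool that produced it).
integration_catalog = [("crm.accounts", "staging.accounts", "integration")]
preparation_catalog = [("staging.accounts", "prepared.accounts", "preparation")]
analytics_catalog = [("prepared.accounts", "dashboard.churn", "analytics")]


def merge_lineage(*catalogs):
    """Combine edges from several catalogs into one downstream -> upstream index."""
    upstream = defaultdict(list)
    for catalog in catalogs:
        for source, target, tool in catalog:
            upstream[target].append((source, tool))
    return upstream


def trace_to_sources(dataset, upstream):
    """Return every lineage path from an original source to the given dataset.

    Assumes the lineage graph is acyclic; a cycle would recurse forever.
    """
    if dataset not in upstream:
        # No recorded upstream: treat the dataset as an original source.
        return [[dataset]]
    paths = []
    for source, _tool in upstream[dataset]:
        for path in trace_to_sources(source, upstream):
            paths.append(path + [dataset])
    return paths


# The merged view yields an unbroken chain from source to dashboard.
enterprise = merge_lineage(integration_catalog, preparation_catalog, analytics_catalog)
full_chain = trace_to_sources("dashboard.churn", enterprise)

# A single embedded catalog on its own sees only its own hop.
isolated = merge_lineage(analytics_catalog)
partial_chain = trace_to_sources("dashboard.churn", isolated)
```

With all three exports merged, the trace runs from the CRM source through staging and preparation to the dashboard. With only the analytics tool's catalog, the trace stops at the prepared dataset, which is the broken chain of provenance described above.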