The Battle for the Big Data Catalog

The battle for the big data catalog has begun.

This week at its annual users conference, Informatica fired an initial salvo when it unveiled alpha technology called Live Data Map that will eventually identify and relate every data asset in an organization.

Informatica joins data prep vendors Paxata and Trifacta and data discovery vendors Tamr and Waterline, all of which have already shipped products that directly or indirectly help business analysts discover, profile, relate, rate, and collaborate on data throughout an enterprise digital ecosystem.

Why a Data Catalog?

In the age of big data, the vendor that manages metadata wins. Its product becomes a veritable data geopositioning system—a data catalog or map if you will—that enables business analysts and application developers to quickly find and evaluate data resources to drive business insights and create innovative data-driven applications.

Speed time to insight. A big data catalog is primarily geared to business analysts and data scientists who need to locate, profile, and evaluate a company’s data assets prior to scrubbing, integrating, and analyzing that data. Since analysts currently spend more time exploring data than analyzing it, a big data catalog would increase their productivity and accelerate time to insight. This is especially true in the era of big data, where data sources, data types, and data volumes are proliferating at an unprecedented rate, overwhelming individual analysts.

Foster collaboration and reuse. The best big data catalogs capture the work of analysts using social media and auditing techniques. These digital breadcrumbs enable analysts—who often work independently in silos—to reuse data structures and analyses created by their colleagues in other areas. Moreover, the tools foster a digital collaboration that enables teams of analysts to surface and document tribal knowledge about a company’s data assets that usually resides in the heads of a few super users.

Capture data requirements. The auditing feature is also particularly useful for information technology (IT) departments. Rather than get blindsided by analysts working under the radar to create shadow data systems, IT administrators can observe what analysts do, what data sources they use, and how they manipulate that data for an analysis project. This enables IT data architects to proactively capture data requirements and build new views and subject areas in data warehouses and analyst sandboxes to further speed time to insight.

Informatica’s Live Data Map

As a $1 billion provider of data integration software, Informatica has tried for almost two decades to provide an enterprise metadata catalog without success. Live Data Map is its newest incarnation, built on a modern data platform. If indications are correct, today’s technology may finally enable Informatica (and competitors) to achieve the vision of delivering an enterprise metadata catalog.

At the heart of Live Data Map is a graph database for relating disparate data objects. The technology also contains a Web crawler to discover data assets, a search index and engine to support keyword and faceted searches, data visualization tools, and APIs for capturing metadata from third-party sources. The technology will run on Hadoop for scalability and work with both structured and unstructured data.
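The graph-at-the-core design can be pictured with a minimal sketch: data assets become nodes, relationships become edges. Everything below — the `DataCatalog` class, asset names, and metadata fields — is hypothetical and illustrative, not Informatica's actual API:

```python
from collections import defaultdict

class DataCatalog:
    """Toy catalog: assets as graph nodes, relationships as edges."""

    def __init__(self):
        self.assets = {}               # asset name -> metadata dict
        self.edges = defaultdict(set)  # asset name -> related asset names

    def add_asset(self, name, **metadata):
        self.assets[name] = metadata

    def relate(self, a, b):
        # Relationships are undirected in this sketch.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def related(self, name):
        return sorted(self.edges[name])

catalog = DataCatalog()
catalog.add_asset("sales.orders", kind="table", owner="finance")
catalog.add_asset("q2_revenue", kind="report", owner="bi-team")
catalog.relate("sales.orders", "q2_revenue")
print(catalog.related("sales.orders"))  # ['q2_revenue']
```

A production system would use a real graph database for this, but the idea is the same: once every asset is a node, "what is related to X" becomes a cheap edge lookup rather than a search across silos.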

As enabling technology, Live Data Map will eventually undergird key Informatica products, including master data management, data quality, data governance, and data integration. An initial release currently powers Informatica’s new Secure@Source data security software. And a more robust version will appear in late 2015 as part of a new data prep environment, code-named Project Sonoma, that Informatica unveiled this week in Las Vegas.

Project Sonoma

Project Sonoma combines Live Data Map with Rev—Informatica’s cloud-based data preparation tool—to provide a highly scalable and usable data discovery and preparation environment built on an enterprise data catalog.

In Project Sonoma, Live Data Map will capture and profile data from Informatica product repositories as well as third-party applications (data models, BI and ETL tools) and both cloud and on-premises business applications. Analysts will be able to search the catalog for objects using keywords and facets, examine object properties, including usage patterns and data profiles, and view comments and ratings about the data objects submitted by other analysts. In essence, the tool serves as a collaboration portal to support data discovery and analysis.
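The keyword-and-facet search described above can be sketched in a few lines. The `search` helper and the sample assets are invented for illustration, not drawn from Project Sonoma:

```python
def search(assets, keyword=None, **facets):
    """Filter asset metadata dicts by a name keyword and exact facet values."""
    results = []
    for asset in assets:
        if keyword and keyword.lower() not in asset["name"].lower():
            continue  # keyword miss
        if any(asset.get(k) != v for k, v in facets.items()):
            continue  # at least one facet filter fails
        results.append(asset["name"])
    return results

assets = [
    {"name": "sales.orders",  "type": "table",  "source": "on-premises"},
    {"name": "orders_report", "type": "report", "source": "cloud"},
    {"name": "hr.employees",  "type": "table",  "source": "cloud"},
]
print(search(assets, keyword="orders", type="table"))  # ['sales.orders']
```

Real catalogs back this with a search index so facet counts and keyword matches come back in milliseconds, but the user-facing model — free-text narrowed by structured filters — is exactly this.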

More importantly, Project Sonoma will enable analysts to graphically view and examine the lineage of data objects, which will help them evaluate the trustworthiness of data. It will also let analysts see relationships among data objects using interactive visual graphs. For example, a graph might display people, tables, reports, terms, and domains related to a data object. Analysts can then filter and drill into the view to inspect relationships in more detail. They can then export the data to Rev to manipulate and transform it for an analysis project.
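Lineage inspection of this kind reduces to walking upstream edges in the relationship graph. A minimal sketch, with a hypothetical `upstream` map standing in for the catalog's real lineage store:

```python
# upstream[x] lists the objects x was derived from (invented sample data).
upstream = {
    "q2_revenue_report": ["revenue_mart"],
    "revenue_mart": ["sales.orders", "crm.accounts"],
    "sales.orders": [],
    "crm.accounts": [],
}

def lineage(obj, graph):
    """Walk upstream edges to collect every source an object depends on."""
    seen = set()
    stack = [obj]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(lineage("q2_revenue_report", upstream))
# ['crm.accounts', 'revenue_mart', 'sales.orders']
```

Seeing at a glance that a report ultimately rests on `sales.orders` and `crm.accounts` is what lets an analyst judge whether the numbers in it can be trusted.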

Eventually, the product will also infer data models from the relationships it uncovers, which customers can then map to Informatica’s packaged industry models to speed model and rules development for master data management, data quality, data governance, and data security, among other things, according to Suresh Menon, vice president of product management in Informatica’s MDM business unit. It will also reduce some of the political infighting that often accompanies the creation of models and rules, since every business manager wants to see their definition of data objects become the standard. “Now they can empirically examine what exists and build from there,” says Menon.

The Smoke Rises

To be clear, Informatica won’t ship Project Sonoma or Live Data Map until much later this year. The demos it showed at Informatica World were alpha grade at best. Although Informatica officials said they pre-announced the technology to help identify customers interested in testing and refining the product, it is also likely that they would like to freeze the market a bit until they catch up with nimble startups that have already launched big data catalogs.

There are big stakes in the battle for the big data catalog. The vendor that provides the map to big data holds an enviable position. The catalog not only becomes a trusted companion to legions of analysts and data scientists, it holds the metadata that will drive innovative business applications that power the modern digital organization.

Wayne Eckerson
