The Next Chapter for the Data Catalog: Governing AI and Machine Learning Models

As enterprises mature with their data cataloging strategies, they need to pivot to a new challenge: cataloging artificial intelligence (AI) and machine learning (ML) models.

The data catalog is an inventory of data assets such as tables, files, schemas, queries, charts, and reports from across an enterprise. Catalogs centralize metadata about these assets, including their attributes, schema, lineage, descriptive tags, and usage metrics. By organizing all this metadata, the catalog helps data analysts find the assets they need while helping data curators and compliance officers manage data quality and control data access. Check out "Deep Dive on Data Catalogs: Three Tools to Consider" by my colleague Joe Hilleary to learn more about this market segment.

The time has come to catalog AI/ML models right alongside other data assets. Data scientists and governance officers need a single platform that centralizes metadata for both data science and business intelligence (BI) projects. This means cataloging metadata such as an AI/ML model’s technique, input data, features, expected outputs, and approval status. A unified catalog like this creates a common governed foundation for stakeholders to understand and share one another’s work. 
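To make this concrete, here is a minimal sketch of what a unified catalog entry for an AI/ML model might look like. The field names (`technique`, `input_datasets`, `approval_status`, and so on) are illustrative assumptions based on the metadata listed above, not any vendor's actual schema.

```python
# Hypothetical model metadata record for a unified data catalog.
# Field names are illustrative assumptions, not a vendor schema.
from dataclasses import dataclass, field


@dataclass
class ModelCatalogEntry:
    name: str
    technique: str                       # e.g., "logistic regression"
    input_datasets: list                 # data assets the model consumes
    features: list                       # engineered inputs to the model
    expected_output: str                 # what the model's score represents
    approval_status: str = "pending"     # governance workflow state


entry = ModelCatalogEntry(
    name="customer_churn_v2",
    technique="logistic regression",
    input_datasets=["crm.customers", "sales.orders"],
    features=["tenure_months", "orders_last_90d"],
    expected_output="probability of churn in the next quarter",
)
```

Because the entry sits beside table and report metadata in the same catalog, a data steward can review and approve it with the same workflow used for other data assets.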

For example, data scientists can publish their models for reuse—along with the right guidance about how to do that. Data analysts can discover AI/ML models, learn how they work, and import them into their BI tools. Data stewards can ensure the quality and accuracy of data used for all analytics projects across the enterprise. And business owners can gain visibility into the data and models underlying the recommendations they receive from both data analysts and data scientists.

Role of the data catalog

To put this in context, let’s consider the role of the data catalog. Enterprises depend on these tools to strike a balance between self-service and governance, which we can view as two ends of a seesaw.

  • Catalogs enable self-service. They help analysts and business owners discover, classify, and manipulate data assets for their analytics initiatives. This helps make their decisions timely and accurate.

  • Catalogs govern data. They help data stewards identify low-quality data, restrict access to sensitive data, and support compliance with regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

Effective data catalogs help achieve both objectives. They open up data access for self-service while managing governance. They restrict data usage for governance without preventing analysts from doing their jobs.

[Figure: Striking the Right Balance, self-service and governance as two ends of a seesaw]

The new challenge: cataloging AI/ML models

Now consider the entry of AI and ML. AI refers to software that recognizes speech, makes decisions, or performs other tasks that traditionally require human intelligence. Many enterprises start with machine learning, a subset of AI in which an algorithm discovers patterns in data. ML relies on a model, which is essentially an equation that defines the relationship between data inputs and outcomes. Based on this relationship, the model generates a score that predicts, classifies, or prescribes a future outcome based on data inputs. 
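The "model as an equation" idea above can be sketched in a few lines. This is a hedged illustration, not a production model: the logistic-regression form is one common choice, and the coefficients and feature names are made up for the example.

```python
# A model reduced to its essence: an equation that maps data inputs
# to a score. Coefficients below are invented for illustration.
import math

# Hypothetical learned parameters for a churn model
INTERCEPT = -1.5
WEIGHTS = {"tenure_months": -0.04, "support_tickets": 0.30}


def churn_score(inputs: dict) -> float:
    """Logistic regression: a weighted sum of inputs squashed to 0..1."""
    z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in inputs.items())
    return 1 / (1 + math.exp(-z))


# The model generates a score that predicts a future outcome
score = churn_score({"tenure_months": 24, "support_tickets": 5})
```

The score is the model's output: here, a probability between 0 and 1 that a customer will churn, driven entirely by the relationship the equation defines between inputs and outcome.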

While many data science platforms provide model catalogs, it also makes sense to integrate model metadata into data catalogs, which have accumulated data governance controls over many years. It is those data catalogs that support most analytics projects today.

Emerging Market

Informatica breaks some new ground with the Cloud Data Governance and Catalog (Cloud DGC) it unveiled this summer. The Cloud DGC combines Informatica’s heritage data catalog with new AI/ML model governance capabilities. It enables data scientists, data engineers, and governance officers such as data stewards to centralize metadata for their AI/ML projects, using the same foundation that traditional data analysts already use for BI projects. Most other data catalogs have yet to take this step.

For example, a data analyst can use Informatica Cloud DGC to search various customer datasets, including purchase histories, to understand their options for predicting customer behavior. They can pull up a file containing order data, then check its owner, example entries, business logic, and lineage. They can study relevant privacy policies to understand suitable ways to use that data. 

Then they bridge into the world of data science. They can pull up available AI/ML models to see which model might work best with that data. Perhaps they find a model that predicts customer churn or future purchases. They check the score for that model’s drift—that is, the degree to which changing business conditions have made model predictions less accurate—along with the model’s approval status, features, and expected outputs. 
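A drift score like the one described above could be computed in several ways; one simple approach is to compare the model's recent accuracy against the accuracy measured when it was approved. The function names and the 10% threshold below are illustrative assumptions, not how any particular catalog implements drift.

```python
# Hedged sketch of a drift check: how far has the model's accuracy
# decayed since approval? Threshold is an illustrative assumption.
def drift_score(baseline_accuracy: float, recent_accuracy: float) -> float:
    """Relative accuracy decay since approval (0.0 means no drift)."""
    return max(0.0, (baseline_accuracy - recent_accuracy) / baseline_accuracy)


def needs_review(drift: float, threshold: float = 0.10) -> bool:
    """Flag the model for re-approval if it lost over 10% of its accuracy."""
    return drift > threshold


drift = drift_score(baseline_accuracy=0.90, recent_accuracy=0.76)
flag = needs_review(drift)  # changing conditions have eroded this model
```

Surfacing a number like this in the catalog lets an analyst judge at a glance whether a published model is still trustworthy before importing it into a BI tool.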

Now the data analyst has a new arrow in their quiver, a trusted AI/ML model to assist their BI project. This marks a good step forward.

Conclusion

This market will evolve slowly. Data science teams will continue to build, train, and deploy models with data science platforms such as Algorithmia (now part of DataRobot) and Domino Data Lab, and those platforms rightly offer cataloging and governance capabilities. But to provide the right governance controls in production, enterprises also need to start tracking those AI/ML models, and all the data they consume, in traditional data catalogs. Informatica Cloud DGC therefore signals a promising trend.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...