Managing Metadata for AI vs BI
Too often, finding data feels like playing detective. We spend hours tracking down the right data set (emailing colleagues, interviewing project teams, and vetting assets) before we can even begin our actual work. Whether you’re a business analyst or a data scientist, you’ve probably felt this pain. Metadata management solutions such as data catalogs can help alleviate some of the frustration, but in evaluating options we have to keep the different needs of our stakeholders in mind. In particular, organizations must balance the requirements of artificial intelligence (AI) or machine learning (ML) teams with those of business intelligence (BI) folks. Fortunately, the desires of these two groups often overlap. This article will explore what features optimize a data catalog for each and highlight capabilities that benefit both.
AI and BI continue to grow closer together. Many use cases now rely on a combination of the two, and BI tools increasingly allow analysts and business users to view and manipulate ML models. The obvious synergies between the disciplines suggest that a single metadata solution ought to be capable of serving both constituencies and helping them learn from one another. At the same time, data catalogs have increased their scope over the last few years. Modern catalogs include not only metadata on tables, but also reports, ML models, visualizations, and other data assets. Nevertheless, finding a solution that truly meets the needs of both BI and AI teams requires acknowledging that certain aspects appeal more to one group or the other.
Unique Business Intelligence Requirements
Many catalogs emerged as tools for self-service BI and provide tailored features for the analyst that are less valuable for other users.
Schema information. Analysts at most large organizations rely primarily on data stored in a centralized data warehouse. As a result, information about schemas is particularly valuable.
Suggested sources for basic information. Analysts’ reliance on core business metrics also lends itself to features including data certifications, which allow data curators to identify particular sets as sources of truth; standard business calculations; and usage statistics. Since analysts often serve specific departments, these data catalog features provide a way to ensure consistency across the enterprise.
Unique Artificial Intelligence and Machine Learning Requirements
Although data scientists and ML engineers also use data catalogs primarily to find and understand data, they care about different types of assets and have different evaluation criteria, which leads to different requirements.
Data lake compatibility. Data scientists depend chiefly on raw data stored in data lakes. This means they need data catalog solutions that facilitate easy searches across those types of sources.
Cataloging of all data assets. The ability to search within the catalog for a wide variety of data assets, not just tables, is also crucial because ML models often build off of previous attempts. Seeing what others did in the past can guide data scientists as they develop new approaches. A data catalog plays a different role from a feature store, however. A feature store transforms and manages the key input variables for ML models, not the completed models themselves.
Business context. Additionally, many companies don’t yet organize ML teams by line of business. Data scientists therefore worry less about standard calculations or nicely curated data, but they have a much greater need for business context and domain knowledge. A data scientist is much more likely to be an expert on algorithms than on a particular set of data. Catalog tools that integrate business glossaries and provide common-sense column names help compensate for this lack of knowledge about specific business systems and their output.
Statistical information. Finally, model developers tend to be more concerned about technical metadata than their business analyst peers. Basic statistical analysis of data sets tends to impact decisions about model development more often than visualizations.
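To make "basic statistical analysis" concrete, here is a minimal sketch of the kind of column profile a catalog might surface next to a data set. The function name `profile_column`, the chosen statistics, and the sample values are all illustrative assumptions, not the output of any particular catalog product.

```python
import statistics
from typing import Optional


def profile_column(values: list[Optional[float]]) -> dict[str, float]:
    """Summarize one numeric column (hypothetical helper, not a catalog API)."""
    # Nulls are counted but excluded from the numeric statistics.
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-null values")
    return {
        "count": len(values),
        "null_fraction": (len(values) - len(present)) / len(values),
        "distinct": len(set(present)),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


# Made-up sample data standing in for an "order total" column.
order_totals = [12.5, 40.0, None, 12.5, 95.0]
profile = profile_column(order_totals)
```

A model developer scanning a profile like this can spot a high null fraction or a suspicious range before pulling the full data set.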
Requirements in Common
Despite their different applications of data, BI analysts, data scientists, and ML engineers share many common requirements for a data catalog. At the end of the day, they all want to find and understand data assets and minimize the time they spend repeating what someone else has already done.
Crowd-sourced knowledge. The most valuable metadata in this effort continues to be human-generated. The auto-generated insights embedded in data catalogs keep improving, but there’s no substitute yet for someone who knows the data inside and out. Social features such as a space to ask questions of the data owner or steward and comment sections where ordinary users can contribute tips allow searchers to tap into this human expertise.
Affiliated projects. Another feature I’ve often heard requested in conversations with practitioners on both sides of the AI/BI divide is the ability to see related data products. When an analyst needs to put together a new report or answer an ad hoc question for their boss, they benefit from seeing what others have done with the same data. At best, someone has already done the exact analysis, and they can avoid redundant effort. At worst, it gives them a jumping-off point for something novel. On the ML side, data scientists benefit from seeing past analyses of a set of data. Looking at analytics of historic transactions can help them choose better features for a model to predict future transactions.
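One way to picture the "related data products" lookup is as an index from each data set to the assets built on it. The sketch below is only an illustration under that assumption. The `Asset` class, the helper functions, and every asset and data set name are hypothetical, and a real catalog would typically derive these links from lineage metadata rather than a hand-built list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A cataloged data product (hypothetical structure)."""
    name: str
    kind: str  # e.g. "report", "ml_model", "dashboard"
    sources: frozenset[str]  # data sets this asset was built from


def build_index(assets: list[Asset]) -> dict[str, list[Asset]]:
    """Map each source data set to every asset that uses it."""
    index: dict[str, list[Asset]] = defaultdict(list)
    for asset in assets:
        for source in asset.sources:
            index[source].append(asset)
    return index


def related_assets(index: dict[str, list[Asset]], dataset: str) -> list[Asset]:
    """Return assets built on `dataset`, ordered by kind and then name."""
    return sorted(index.get(dataset, []), key=lambda a: (a.kind, a.name))


# Made-up catalog entries for illustration only.
catalog = [
    Asset("Quarterly revenue report", "report", frozenset({"sales.transactions"})),
    Asset("Churn model v2", "ml_model",
          frozenset({"sales.transactions", "crm.accounts"})),
    Asset("Support dashboard", "dashboard", frozenset({"support.tickets"})),
]
index = build_index(catalog)
related_names = [a.name for a in related_assets(index, "sales.transactions")]
```

Because reports and models share one index, an analyst and a data scientist searching the same transactions table would each see the other's work.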
Figure 1. Shared Requirements for a Data Catalog
Conclusion
As ML grows more prevalent, organizations must balance the requirements of AI and data science teams with those of traditional BI practitioners. Thankfully, their needs often align. Data catalogs are well-suited to both groups, offering capabilities that meet the shared desire to find and understand data. Having a single catalog for both also maximizes the synergy between the disciplines as use cases increasingly merge the two. Although certain features lend themselves more to one group or the other, many are vital to both workflows. Selecting a catalog that fits your business means negotiating between stakeholders. This article should provide a starting point for conversations within your organization, but at the end of the day, the best way to find out what your teams need from a catalog is to ask them.