Integrating, Governing, and Consuming Data for the Machine Learning Lifecycle
ABSTRACT: To succeed with machine learning projects, data science teams must handle key aspects of data management and consumption.
Ironically, data is a lot harder to manage than bleeding-edge algorithms.
Extraordinary computing power now enables data science software, including artificial intelligence and especially machine learning, to analyze massive volumes of data. But to capture real insights and business value, data science teams must master certain aspects of data management and consumption.
This blog defines the pivotal role of data integration, governance, and consumption in machine learning projects. Given the rising volumes and varieties of data, these complex processes require careful attention and diligence to ensure ML projects succeed. As a case in point, we’ll provide examples of new offerings that Informatica featured at its Informatica World user conference last month.
The data management processes of data integration, governance, and consumption play a pivotal role across the machine learning lifecycle.
The Machine Learning Lifecycle and Data Processes
What is the machine learning lifecycle?
Machine learning (ML), the most common category of artificial intelligence, learns patterns in data to predict, recommend, or classify outcomes. It depends on an ML model, which is an equation that defines the relationship between the most telling data inputs, known as features, and outcomes.
Cross-functional data science teams manage an ML lifecycle that includes data and feature engineering, model development, and model production:
Data and feature engineering. The data scientist works with the data engineer to transform input data and perform exploratory data analysis. They label historical outcomes and derive the features that best predict outcomes.
Model development. The data scientist selects an ML technique, downloads a model from a library, and “trains” that model by applying it to historical features. They check the results of each experiment, adjust parameters, and train the model again until it reaches the target accuracy.
Model production. The ML engineer implements the ML model in production applications and workflows with the help of the data scientist. They monitor models’ performance, accuracy, and cost.
The lifecycle involves frequent iterations to fix issues and adapt to business changes. (To learn more, check out Deep Dive on Machine Learning Platforms: Three Tools to Consider.)
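The train, check, adjust, and retrain loop in the model development stage can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: a k-nearest-neighbors classifier stands in for whatever technique the team selects, the data values are invented, and the names develop_model, candidate_ks, and target are hypothetical.

```python
from collections import Counter

# Hypothetical historical data: (feature value, labeled outcome).
# The point (3.0, 1) is a deliberately noisy label.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
# Held-out examples used to check each experiment.
VALIDATION = [(2.9, 0), (3.2, 0), (6.5, 1), (8.5, 1)]


def predict(train, x, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, holdout, k):
    """Share of held-out examples that the model labels correctly."""
    hits = sum(predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)


def develop_model(train, holdout, candidate_ks, target):
    """Try each parameter setting in turn and stop once the target is met."""
    history = []
    for k in candidate_ks:
        score = accuracy(train, holdout, k)
        history.append((k, score))
        if score >= target:
            return k, score, history
    # No setting met the target: report the best one seen.
    best_k, best_score = max(history, key=lambda item: item[1])
    return best_k, best_score, history


chosen_k, chosen_score, experiment_log = develop_model(
    TRAIN, VALIDATION, candidate_ks=[1, 3, 5], target=0.9
)
```

In this toy data set the first setting (k=1) follows the noisy label and misses the target, so the loop moves on to the next candidate. Keeping the experiment log is a design choice that mirrors how teams compare runs before promoting a model.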
Data processes for machine learning
Now let’s define the processes of data integration, governance, and consumption, and how they contribute to the ML lifecycle. ML lifecycle platforms such as DataRobot, Domino Data Lab, and Iguazio, and data platforms such as Databricks, Tibco, and Cloudera, contribute to these processes. But to cover the bases, data science teams also need data pipeline tools and governance platforms from vendors such as Informatica and dbt.
Data integration. Data engineers integrate data by ingesting and transforming it for analytics. They extract periodic batches or (near) real-time updates of data, schemas, and metadata from various sources, then load them to targets such as a cloud data platform for merging, formatting, and structuring. They design, build, test, and deploy data pipelines that perform these tasks, then monitor, tune, and reconfigure them. Together these capabilities support so-called “extract, transform, and load” (ETL) processes, along with variations such as ELT and ETLT.
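To make the extract, transform, and load steps concrete, here is a minimal batch sketch that uses only the Python standard library. It is an assumption-laden illustration: the CSV text stands in for a source system, an in-memory SQLite database stands in for a cloud data platform, and the column names, table name, and function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical batch extract; a real pipeline would pull from a source system.
SOURCE_CSV = """customer_id,signup_date,monthly_spend
101,2023-01-15,49.90
102,2023-02-01,
103,2023-02-20,120.00
"""


def extract(raw_text):
    """Extract: parse the raw batch into one dict per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with missing spend and cast fields to typed values."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        if not spend:
            continue  # incomplete record; a real pipeline might quarantine it
        cleaned.append(
            (int(row["customer_id"]), row["signup_date"].strip(), float(spend))
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, signup_date TEXT, monthly_spend REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(raw_text, conn):
    """Run the three steps in order and return the number of rows loaded."""
    records = transform(extract(raw_text))
    load(conn, records)
    return len(records)


target = sqlite3.connect(":memory:")
rows_loaded = run_pipeline(SOURCE_CSV, target)
```

An ELT variation would swap the order: load the raw rows first, then run the cleanup as SQL inside the target platform.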
Each stage of the ML lifecycle depends on timely, reliable data integration. Data scientists, developers, and data engineers can use tools such as Informatica’s INFACore to integrate data within ML notebooks and development environments. They build and manage data pipelines to support exploratory data analysis, labeling, and feature engineering. Then they apply models to data pipelines for both training and production. When issues arise in production, data science teams go back and refresh training data, change features, or apply new/different models to those pipelines.
The machine learning lifecycle depends on data integration to help explore and label data, define features, then train and deploy ML models.
Data governance. Data engineers, stewards, and compliance officers govern data by enforcing policies and standards that manage its usage. They create rules to profile data, track its lineage, and check its quality; then standardize and centralize its metadata in a searchable catalog. They secure data and ensure privacy, for example with role-based access controls and masking of personally identifiable information (PII). Working with business domain experts, they also create master data that defines standard attributes and terms for key business entities. Together these capabilities support a data governance program that helps ensure data is accurate and fit for purpose.
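Two of the controls above, role-based access and PII masking, can be illustrated with a short sketch. Everything named here is an assumption for illustration: the roles, the list of PII fields, and the masking approach are hypothetical. An unsalted hash is used only to keep the example small and would not be sufficient protection on its own in a real governance program.

```python
import hashlib

# Hypothetical policy: which fields count as PII and which roles may see them.
PII_FIELDS = {"email", "phone"}
ROLE_POLICIES = {
    "steward": {"can_view_pii": True},
    "data_scientist": {"can_view_pii": False},
}


def mask_value(value):
    """Replace a value with a stable token so joins still work on masked data."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def read_record(record, role):
    """Return the record as the given role is allowed to see it."""
    policy = ROLE_POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"role {role!r} has no access policy")
    if policy["can_view_pii"]:
        return dict(record)
    return {
        key: mask_value(value) if key in PII_FIELDS else value
        for key, value in record.items()
    }


customer = {"customer_id": 101, "email": "pat@example.com",
            "phone": "555-0100", "plan": "pro"}
steward_view = read_record(customer, "steward")
scientist_view = read_record(customer, "data_scientist")
```

Because the same input always produces the same token, a data scientist can still count or join on masked columns without seeing the underlying values.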
The ML lifecycle depends on data governance to enable flexible access to data while maintaining oversight and controlling risk. Data engineers and stewards can use various tools to ensure that accurate data feeds their features and models with minimal compliance risk. They can also use Informatica’s Cloud Data Governance and Catalog to centralize metadata for ML models (including feature summaries, drift scores, and training/production status) alongside metadata for traditional data assets such as tables and files.
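The idea of cataloging ML models next to tables and files can be pictured as a single searchable registry. The sketch below is a generic, hypothetical data structure and does not represent the API of Informatica's product or any other catalog. The entry names, metadata keys, and drift threshold are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One cataloged asset; asset_type is e.g. "table", "file", or "ml_model"."""
    name: str
    asset_type: str
    description: str
    metadata: dict = field(default_factory=dict)


class Catalog:
    """A tiny in-memory catalog keyed by asset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of assets whose name or description contains the keyword."""
        needle = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if needle in e.name.lower() or needle in e.description.lower()
        )

    def models_needing_review(self, drift_threshold):
        """Return names of ML models whose recorded drift score exceeds the threshold."""
        return sorted(
            e.name for e in self._entries.values()
            if e.asset_type == "ml_model"
            and e.metadata.get("drift_score", 0.0) > drift_threshold
        )


catalog = Catalog()
catalog.register(CatalogEntry(
    "customers", "table", "Customer master table",
    {"owner": "sales_ops"},
))
catalog.register(CatalogEntry(
    "churn_model", "ml_model", "Predicts customer churn",
    {"features": ["monthly_spend", "tenure_months"],
     "drift_score": 0.31, "status": "production"},
))
```

A single search then surfaces both the table and the model that depends on it, which is the practical benefit of keeping both kinds of metadata in one place.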
The machine learning lifecycle depends on data governance to enable flexible access to data while maintaining oversight and controlling risk.
Data consumption. Data analysts and data scientists consume data by exploring and retrieving it for use in various tools and applications. They review metadata, search data assets, then run queries and visualize the results in dashboards and reports. Working with data engineers and developers, they also embed analytical outputs and algorithms in operational applications—or orchestrate workflows that string together multiple tasks and applications. Together these capabilities help consume data and drive action.
Each stage of the ML lifecycle depends on timely, reliable data consumption methods. Data scientists use BI tools to search and query input data as they perform exploratory data analysis and derive features. They can use Informatica ModelServe to apply ML models to data pipelines so they can consume data for both model training and model production. They also can use notebooks or ML platforms to build model-driven applications that consume training and production data.
The machine learning lifecycle depends on timely, reliable data consumption to support exploratory data analysis, then model training and production.
As with other technology advances, the rise of machine learning creates promising opportunities but makes old, familiar problems even more challenging—and even more crucial to solve. By getting serious about data integration, governance, and consumption, data science teams can increase the odds of success with machine learning.