Refining the Right Fuel: How Data Integration Drives the AI/ML Model Lifecycle

ABSTRACT: Data teams must filter, blend, and refine raw data inputs to create the high-octane fuel that drives innovation with artificial intelligence and machine learning (AI/ML).

Read time: 4 mins.

Sponsored by CData

The mantra that data is “the new oil” falls short. In reality, data teams must filter, blend, and refine raw data inputs to create the high-octane fuel that drives innovation with artificial intelligence and machine learning (AI/ML).

This blog, the final in a three-part series, explains this process. The first blog examined how a flexible yet governed data architecture supports AI/ML innovation. The second blog described three use cases in which multi-style data integration supports AI/ML projects. This third blog considers how data integration drives the iterative lifecycle of AI/ML models.

The Accelerator

OpenAI’s release of ChatGPT, built on GPT-3.5, in November 2022 prompted business leaders to accelerate innovation with AI. Nearly two-thirds (62%) of 513 respondents to a CompTIA survey this year expect a moderate to significant increase in their company’s AI investments. This will speed up rollouts that so far remain in early stages: while nearly half (45%) of respondents are exploring AI, just one-third (33%) have reached limited implementation, and only 22% have matured to the point of integrating AI across the business.

The Engines

So what types of AI/ML models are these companies adopting? The five most popular types are predictive ML, recommendation engines, anomaly detection, natural language processing (NLP), and generative AI (GenAI).

  • Predictive ML forecasts events and trends such as customer decisions, company revenue, and market prices based on classification, clustering, or other modeling techniques (see the sketch after this list).

  • Recommendation engines suggest actions to humans or applications based on predictive ML outputs, or additional techniques such as collaborative and content-based filtering.

  • Anomaly detection describes data patterns and identifies deviations to help find opportunities, assess risks, prepare for threats, and so on.

  • Natural language processing interprets and creates speech or text to assist tasks such as translation, sentiment analysis, or document processing.

  • GenAI generates content such as text, imagery, and audio using language models or, in some cases, generative adversarial networks (GANs). This assists a wide range of use cases.
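To make the first of these concrete, here is a minimal, hypothetical sketch of predictive ML: a classifier trained on a small synthetic table of customer records. The churn scenario, the column names, and the choice of scikit-learn are illustrative assumptions, not a prescription from this series.

```python
# Minimal, hypothetical sketch of predictive ML: classify which customers are
# likely to churn, using a tiny synthetic table. Column names, values, and the
# scikit-learn library choice are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic historical records: 1 = churned, 0 = stayed
df = pd.DataFrame({
    "tenure_months": [2, 48, 12, 36, 5, 60, 8, 24],
    "monthly_spend": [90, 40, 75, 55, 95, 30, 85, 60],
    "churned":       [1,  0,  1,  0,  1,  0,  1,  0],
})

X = df[["tenure_months", "monthly_spend"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```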

The Fuel

Despite the GenAI-induced hoopla about unstructured data such as text and imagery, for now structured tables remain the favorite input for AI/ML. In fact, almost two thirds (61%) of survey respondents say structured data is critical to their AI innovation, compared with 34% for semi-structured data and 33% for unstructured data, according to a recent report by Shawn Rogers of BARC and Merv Adrian of IT Market Strategy. Whatever the data type, more than half (56%) favor real-time delivery and 27% favor streaming in particular. But these raw inputs need refining. For example, our research shows that nearly half of companies lack the data quality and governance controls they need to support AI/ML, and nearly one third lack the necessary skills or tools for preparing unstructured data.

The Refinery 

Data integration refines the fuel that in turn drives the AI/ML model lifecycle, from model development to training to deployment and operation.

To understand this relationship, let’s consider how data integration supports the foundational steps of data labeling and feature engineering for predictive ML models. This process can also apply to recommendation engines and anomaly detection, in particular those models that consume structured data.

Data scientists, data engineers, and ML engineers start by collecting the historical input data that relates to the business problem at hand. For example, they might integrate their data using a mix of ETL and data virtualization: they extract, transform, and load their operational records into a lakehouse, and they create virtual views of distributed unstructured data that is difficult to extract from heritage systems. The data scientist provides close oversight to ensure both the consolidated dataset and the virtual views meet AI/ML model requirements, no easy task in this heterogeneous environment.
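As a rough illustration of this mix (not CData’s tooling or any specific product), the sketch below uses pandas and DuckDB as stand-ins: a small ETL step that cleans operational records and loads them into a local analytics store, plus a SQL view over a file left in place, loosely analogous to a virtual view. The table names, columns, and file paths are hypothetical.

```python
# Toy sketch of an ETL-plus-virtualization mix, using pandas and DuckDB as
# stand-ins for a real pipeline and lakehouse. Table names, columns, and file
# paths are hypothetical.
import pandas as pd
import duckdb

# Extract: operational records (in practice, pulled from source systems)
orders = pd.DataFrame({
    "order_id": [1, 2, None, 4],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "amount": [120.0, 80.0, 55.0, 200.0],
})

# Transform: basic cleansing before loading
orders = orders.dropna(subset=["order_id"])
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Load: consolidate into a local DuckDB file standing in for the lakehouse
con = duckdb.connect("lakehouse.duckdb")
con.register("orders_df", orders)
con.execute("CREATE OR REPLACE TABLE orders AS SELECT * FROM orders_df")

# Loosely analogous to a virtual view: query a file left in place, without copying it
pd.DataFrame({"ticket_id": [9001], "text": ["Order arrived late"]}).to_csv(
    "support_tickets.csv", index=False
)
con.execute("""
    CREATE OR REPLACE VIEW support_tickets AS
    SELECT * FROM read_csv_auto('support_tickets.csv')
""")
print(con.execute("SELECT count(*) FROM orders").fetchone())
con.close()
```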

Next, data engineers, ML engineers, and data scientists collaborate with business owners to “label” various outcomes in their historical datasets. This means they add tags to the data that identify historical outcomes, such as robotic arm failures, fraudulent transactions, or the prices of recent house sales. They also might label customer emails and social media posts as “positive” or “negative” to create an accurate model for classifying customer sentiment. Data engineers and data scientists need to label outcomes accurately and at scale, which requires a programmatic approach, automation, and help from the business owners who best understand the domain. Note that labeling applies to supervised ML only; unsupervised ML, by definition, studies input data without known outcomes, which means the data has no labels.
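Here is a hypothetical sketch of what programmatic, rule-based labeling can look like for the fraud example. The column names and thresholds are invented for illustration; in practice, the business owners who know the domain would define and validate such rules.

```python
# Hypothetical sketch of programmatic (rule-based) labeling for supervised ML.
# Column names and thresholds are illustrative; domain experts would define
# and validate the rules in practice.
import pandas as pd

transactions = pd.DataFrame({
    "txn_id": [101, 102, 103, 104],
    "amount": [25.00, 9800.00, 40.00, 15000.00],
    "country_mismatch": [False, True, False, True],
})

def label_fraud(row) -> int:
    # Flag large amounts combined with a billing/shipping country mismatch
    if row["amount"] > 5000 and row["country_mismatch"]:
        return 1  # likely fraudulent
    return 0      # likely legitimate

transactions["label"] = transactions.apply(label_fraud, axis=1)
print(transactions)
```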

Performance

Now it’s time to drive model performance with the refined data. To start, the data scientist, data engineer, and ML engineer employ feature engineering: they extract or derive “features,” the key attributes that most strongly drive outcomes, from all that input data, and then share them. Features become the filtered, clean inputs for an ML algorithm to study, so that it does not drown in data while creating the model. Feature engineering supports everything that follows in the model lifecycle: selecting the right ML techniques based on the dataset and business use case; training and testing the ML models; and finally deploying and operating them. The ML engineer monitors models in production, ready to restart the lifecycle based on model performance. A restart might lead to changes in data labeling, feature engineering, or technique selection, all of which depend on data integration.
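To ground the idea, here is a minimal, hypothetical feature-engineering sketch that derives per-customer features from raw transaction rows with pandas. The feature names and aggregations are illustrative assumptions, not a recommended feature set.

```python
# Hypothetical feature-engineering sketch: derive per-customer features from
# raw transaction rows. Feature names and aggregations are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.0, 500.0, 10.0, 12.0, 80.0],
    "ts": pd.to_datetime([
        "2024-01-03", "2024-02-10", "2024-01-15",
        "2024-01-16", "2024-03-01", "2024-02-20",
    ]),
})

features = (
    raw.groupby("customer_id")
       .agg(
           txn_count=("amount", "size"),
           avg_amount=("amount", "mean"),
           max_amount=("amount", "max"),
           days_active=("ts", lambda s: (s.max() - s.min()).days),
       )
       .reset_index()
)
print(features)  # a compact feature table, ready to join with labels for training
```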

This series has assessed the relationship between AI/ML projects and data management from three angles. Blog 1 defined what a flexible yet governed data architecture looks like, Blog 2 described use cases for multi-style data integration, and this third blog applies data integration and refinement to the lifecycle of AI/ML models. The overarching theme: AI/ML needs data inputs that are fit for purpose.

To learn more about your options for supporting your AI/ML initiatives, be sure to check out my recent webinar with Nick Golovin of CData.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management.