Multi-Style Data Integration for AI/ML: Three Use Cases
ABSTRACT: This blog describes the need for data teams to establish a flexible yet well-governed data architecture to support dynamic AI/ML projects.
Read time: 6 mins.
Sponsored by CData
Artificial intelligence and machine learning projects are akin to a team sport: different models compete on the field, each with its own style. This makes things tricky for the data teams that support those models from the sidelines. Our last blog described how they must create a flexible, yet governed data architecture to enable dynamic AI/ML projects.
Because AI/ML projects involve multiple models and datasets, they also require multiple styles of data integration. And that is the subject of this blog. We define the most popular styles overall: ETL, ELT, change data capture, streaming, and data virtualization. Then we explore use cases that combine styles: (1) ELT + CDC, (2) ELT + data virtualization, and (3) streaming ETL. BARC research finds that these three combinations are the most appropriate ways to train and feed diverse AI/ML models in complex environments. These style combinations strike the right balance between speed, migration complexity, and compute cost.
We start with some definitions.
Extract, load, and transform. As the name suggests, the ELT style ingests data, then transforms it on targets such as data warehouses or lakehouses to prepare it for analytics.
Extract, transform, and load. The more traditional ETL style transforms data before it arrives at the target, for example on an intermediate server cluster.
Change data capture. CDC tools scan database logs to identify new or changed records, then replicate and load them in real-time increments to analytical targets.
Streaming. Apache Kafka, Pulsar, and various commercial systems receive, store, and send streams of messages—transactions, telemetry logs, etc.—between platforms.
Data virtualization. Virtualization tools present logical views of distributed datasets to users or applications, eliminating the need to replicate or move data.
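The difference between the first two styles comes down to where the transformation runs. Here is a minimal sketch of that difference, using Python's built-in sqlite3 module as a stand-in for the analytical target. The table names, column names, and sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical raw source records: (order_id, amount as text, country code)
SOURCE_ROWS = [(1, "19.50", "fr"), (2, "5.00", "de"), (3, "42.25", "FR")]


def run_etl(conn, rows):
    """ETL: transform in the pipeline, then load the cleaned rows."""
    cleaned = [(oid, float(amount), country.upper()) for oid, amount, country in rows]
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn, rows):
    """ELT: load the raw rows as-is, then transform with SQL on the target."""
    conn.execute("CREATE TABLE orders_raw (order_id INTEGER, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id, CAST(amount AS REAL) AS amount, UPPER(country) AS country
        FROM orders_raw
        """
    )


def load_both(rows=SOURCE_ROWS):
    """Run both styles against one in-memory target and return their results."""
    conn = sqlite3.connect(":memory:")
    run_etl(conn, rows)
    run_elt(conn, rows)
    etl = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
    elt = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
    conn.close()
    return etl, elt
```

Both paths end with the same cleaned table. What differs is where the compute happens and whether the raw data is retained on the target for later reprocessing.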
Perform with style
Now let’s consider our three example use cases, including their technical requirements and business results.
Extract, load, and transform (ELT) + change data capture
The ELT + CDC combination works well for AI/ML projects that involve diverse datasets, complex transformations, and fast-changing business conditions. To start, the ELT pipeline extracts one or more batches of data from various sources, then loads them into the target for consolidation and transformation. On the target, the pipeline merges the objects (files, tables, and so on) and converts them to a common format. It also might validate that record values are correct and filter out unneeded tables or columns. The CDC pipeline, meanwhile, keeps the target in sync with sources by ingesting incremental updates on a real-time or near-real-time basis.
Together ELT and CDC support AI/ML projects that involve diverse datasets, complex transformations, and changing business conditions
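To make the division of labor concrete, here is a minimal sketch in which a batch load seeds the target and a stream of change events keeps it current. The event format (an "op" field plus "id" and "data") is an assumption made for illustration, not the format of any particular CDC tool, and a plain dictionary stands in for the target table.

```python
def initial_load(source_rows):
    """Batch step: load a snapshot of source rows into the target, keyed by id."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_change(target, event):
    """CDC step: apply one change event to the target in place."""
    op, key = event["op"], event["id"]
    if op in ("insert", "update"):
        # Upsert: merge the changed columns into the existing record, if any.
        target[key] = {**target.get(key, {}), "id": key, **event["data"]}
    elif op == "delete":
        target.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return target


def sync(snapshot, changes):
    """Seed the target from a batch snapshot, then replay changes in order."""
    target = initial_load(snapshot)
    for event in changes:
        apply_change(target, event)
    return target


# Hypothetical sample data
snapshot = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [
    {"op": "update", "id": 1, "data": {"status": "closed"}},
    {"op": "insert", "id": 3, "data": {"status": "open"}},
    {"op": "delete", "id": 2},
]
```

Calling sync(snapshot, changes) leaves the target with record 1 closed, record 2 removed, and record 3 added. Because changes are replayed in order, the target converges on the source state without a full reload.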
Let’s consider an example of how ELT and CDC work together to support a customer recommendation engine based on machine learning. The data engineer with an ecommerce company designs an ELT pipeline that consolidates purchase records from Salesforce, customer service records from Zendesk, and clickstream data from the company website into the Databricks data lakehouse. The pipeline then converts these tables and log files into the Apache Parquet format so that the data scientist can identify the most telling inputs—i.e., features. Based on these features, she trains one ML model to classify customers into behavioral groups and another model to recommend purchases to them.
The ML engineer then rolls these ML models into production so they can classify customers and recommend purchases. The models are triggered when the CDC pipeline delivers updates—perhaps recent purchases, customer complaints, or website visits—to the target. The resulting customer recommendations help the ecommerce company increase cross-selling, average deal size, and customer retention.
ELT + data virtualization
Like ELT + CDC, the combination of ELT and data virtualization works well for AI/ML projects with diverse datasets. Unlike that combination, however, it also helps analyze distributed datasets that cannot be fully consolidated because of data gravity, sovereignty requirements, or migration costs.
In this case the ELT pipeline ingests all the data that’s feasible into a common data warehouse or lakehouse, using steps similar to those described earlier. In addition, the data team constructs a virtualization layer that uses pointers to view the data in place and create derived values in a semantic layer. Those derived values complement the consolidated datasets, giving data scientists a rich feature set for model training and production. Data virtualization also enables rapid prototyping of candidate models and features, helping data scientists test ideas without incurring costly migrations.
The combination of ELT and data virtualization works well for diverse datasets that cannot be fully consolidated due to data gravity, sovereignty requirements, or migration costs
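The idea of viewing data in place can be sketched with Python's built-in sqlite3 module. This is a small-scale analogy rather than a real virtualization product: the main database stands in for the consolidated warehouse, an attached database stands in for a source system that stays where it is, and a view stores only the query that joins them. All table names and rows are hypothetical.

```python
import sqlite3


def build_virtual_view():
    """Return a connection with a logical view spanning two separate databases."""
    conn = sqlite3.connect(":memory:")
    # "remote" stands in for a source system whose data is not migrated.
    conn.execute("ATTACH DATABASE ':memory:' AS remote")

    # "main" stands in for the consolidated warehouse or lakehouse.
    conn.execute("CREATE TABLE main.purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.purchases VALUES (?, ?)", [(1, 20.0), (1, 30.0), (2, 15.0)]
    )
    conn.execute("CREATE TABLE remote.tickets (customer_id INTEGER, severity INTEGER)")
    conn.executemany(
        "INSERT INTO remote.tickets VALUES (?, ?)", [(1, 3), (2, 1), (2, 2)]
    )
    conn.commit()

    # The view holds a query, not a copy; rows are read from both databases
    # at query time. The derived values play the role of semantic-layer features.
    conn.execute(
        """
        CREATE TEMP VIEW customer_features AS
        SELECT p.customer_id,
               p.total_spend,
               COALESCE(t.ticket_count, 0) AS ticket_count
        FROM (SELECT customer_id, SUM(amount) AS total_spend
              FROM main.purchases GROUP BY customer_id) AS p
        LEFT JOIN (SELECT customer_id, COUNT(*) AS ticket_count
                   FROM remote.tickets GROUP BY customer_id) AS t
          ON t.customer_id = p.customer_id
        """
    )
    return conn


def read_features(conn):
    """Query the logical view as if it were a single local table."""
    return conn.execute(
        "SELECT * FROM customer_features ORDER BY customer_id"
    ).fetchall()
```

Because nothing is replicated, a new row written to remote.tickets shows up the next time the view is queried. The trade-off is that every query pays the cost of reaching the remote source.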
To understand how ELT and data virtualization work together, consider a second scenario for our ecommerce company. Suppose this company’s UK division now wants to personalize content for French and German customers who visit its website. Its data engineer in London designs an ELT pipeline to consolidate and transform all possible data from these countries, including purchase records, clickstream log files, and results of recent customer satisfaction surveys. But he leaves the customer service records in place because they are not worth the cost of a migration. Instead, he creates a data virtualization layer that views and accesses the records without moving them.
The data scientist then uses the ELT pipeline and virtualization layer in tandem to train and prompt ML models that personalize web pages for individual customers. By engaging customers with more targeted content, the ecommerce company further increases cross-selling and retention.
Streaming extract, transform, and load (ETL)
Now we come to the combined style of streaming ETL, which works well for real-time AI/ML initiatives that involve small data volumes, simple transformations, and ultra-low latency windows. In such cases, a streaming ETL pipeline extracts data from a source, transforms it in flight, then loads it into a target. This data travels in a stream of small increments, each of which describes an event such as a database transaction, IT server task, or factory machine error. The transformation logic might merge, filter, or enrich the event streams in server memory before the pipeline loads them to the target.
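The extract, transform-in-flight, and load steps can be sketched with plain Python generators, which process one event at a time without staging the full dataset. The event fields ("machine_id", "level") and the site lookup table are hypothetical, and a list stands in for both the message stream and the target.

```python
def extract(events):
    """Yield events one at a time, as a stream consumer would."""
    for event in events:
        yield event


def transform(stream, machine_sites):
    """Filter and enrich each event in memory before it reaches the target."""
    for event in stream:
        if event["level"] != "error":
            continue  # filter: keep only error events
        site = machine_sites.get(event["machine_id"], "unknown")
        yield {**event, "site": site}  # enrich: add the site from a lookup table


def load(stream, target):
    """Append each transformed event to the target as it arrives."""
    for event in stream:
        target.append(event)
    return target


def run_pipeline(events, machine_sites):
    """Chain the three stages; each event flows through all of them in turn."""
    return load(transform(extract(events), machine_sites), [])


# Hypothetical sample data
events = [
    {"machine_id": "m1", "level": "info"},
    {"machine_id": "m2", "level": "error"},
    {"machine_id": "m9", "level": "error"},
]
machine_sites = {"m1": "Lyon", "m2": "Berlin"}
```

Chaining generators keeps memory use flat because only the current event is held at any moment. That is the property that makes this style suit small increments and tight latency windows.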
Streaming ETL gives data science teams real-time features that they can feed to AI/ML models during both training and inference stages. For example, a credit card company might use streaming ETL to prevent fraud. The streaming ETL pipeline extracts real-time transaction requests from merchants, identifies the customer, and injects their recent purchase records into the stream. Then an anomaly detection model compares the transaction request to that customer’s history and typical transaction profiles. If the model spots anomalous behavior, it sends an alert to the onsite merchant and blocks the transaction.
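Here is a minimal sketch of the enrich-then-score flow described above. A simple z-score check stands in for the anomaly detection model. It is an assumption chosen for brevity, not how any real card issuer scores fraud. The field names, the threshold of 3.0, and the rule for customers with little history are likewise hypothetical.

```python
from statistics import mean, stdev


def enrich(request, history_by_customer):
    """Inject the customer's recent purchase amounts into the request."""
    history = history_by_customer.get(request["customer_id"], [])
    return {**request, "history": history}


def is_anomalous(enriched, threshold=3.0):
    """Flag amounts far from the customer's own recent average."""
    history = enriched["history"]
    if len(history) < 2:
        return False  # assumption: too little history to judge, so do not flag
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return enriched["amount"] != mu
    return abs(enriched["amount"] - mu) / sigma > threshold


def decide(request, history_by_customer):
    """Return "block" for anomalous requests and "approve" otherwise."""
    enriched = enrich(request, history_by_customer)
    return "block" if is_anomalous(enriched) else "approve"


# Hypothetical recent purchase amounts per customer
history_by_customer = {7: [20.0, 25.0, 22.0, 18.0, 24.0]}
```

With this history, decide({"customer_id": 7, "amount": 23.0}, history_by_customer) approves the request, while an amount of 950.0 for the same customer is blocked. A production model would weigh many more signals than the amount alone.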
In this way, streaming ETL enables the AI/ML initiative to block more fraudulent transactions and approve more legitimate transactions, which improves profitability.
Get started
These use cases are examples rather than a playbook. Each use case and data environment will demand its own version of these data integration styles. To learn more about your options for supporting your AI/ML initiatives, be sure to watch my recent webinar with Nick Golovin of CData.