Machine Learning and Streaming Data Pipelines, Part I: Definitions and Architecture

Once machines learn, they have to do some fast thinking, because many compelling machine learning (ML) use cases center on real-time calculations. Enterprises have a short window of opportunity to predict market prices, recommend customer purchases, or personalize web pages. This creates the need for streaming data pipelines to support ML models, or streaming ML.

This blog, the first in a series, defines streaming ML and its supporting architecture. Part II of the series will explain how the ML lifecycle—data and feature engineering, model development and training, and model operations and governance—enables streaming ML. Part III will explore advanced ML approaches, such as in-stream learning, reinforcement learning, and deep learning. Finally, Part IV will provide guiding principles and best practices for streaming ML initiatives.

As described in an earlier blog series, machine learning is a subset of artificial intelligence (AI) in which a model discovers patterns in data in order to define the relationship between data inputs and outcomes. Based on this relationship, the model studies the most telling data inputs—known as features—and generates a score that predicts, classifies, or recommends future outcomes. Machine learning scores also might derive from anomalies—i.e., breaks in patterns that influence future outcomes.

Streaming ML is the application of an ML model to a streaming data pipeline, which is a workflow that ingests and transforms data in real-time increments between a source and a target. Real-time might mean milliseconds, seconds, or minutes, depending on the use case—whatever is fast enough to capture the perishable value of a given event. 

The machine learning model provides logic that helps the streaming data pipeline expose features within the stream and potentially within a historical data store. The model then generates real-time scores based on the features that the pipeline exposed. The pipeline delivers those real-time scores to business monitoring systems, business intelligence tools, applications, or workflows, which turn them into predictions, classifications, or recommendations.
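At its core, this loop is simple. The sketch below (all names are illustrative, not any specific product's API) shows a pipeline transforming each incoming event into features—drawing on both the event itself and a historical store—and asking a model for a score:

```python
# Minimal sketch of streaming ML: the pipeline exposes features from each
# live event plus a historical store, and a model turns them into a score.
# Function and field names here are hypothetical.

def extract_features(event, history):
    """Expose features from the live event and a historical data store."""
    amount = event["amount"]
    avg = history.get(event["user"], amount)  # historical feature lookup
    return {"amount": amount, "ratio": amount / avg}

def score(features):
    """Stand-in for a trained model: flag unusually large amounts."""
    return 1.0 if features["ratio"] > 3.0 else 0.0

history = {"alice": 50.0}  # e.g., each user's average past transaction
events = [{"user": "alice", "amount": 40.0},
          {"user": "alice", "amount": 400.0}]

scores = [score(extract_features(e, history)) for e in events]
print(scores)  # the second event scores as anomalous
```

A real deployment replaces the hand-written `score` function with a trained model and the dictionary lookup with a feature store or database query, but the flow of events to features to scores is the same.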


Streaming ML is the application of a machine learning model to a streaming data pipeline.


Architectural Overview

So, what does all this look like in an enterprise environment? Let’s explore how the architectural components operate and inter-operate to bring streaming ML to life. We’ll refer to the components illustrated in this streaming ML architecture: data sources, streaming data pipelines, ML models, data stores, and targets.

Streaming ML Architecture

Data sources. Modern enterprises have a rich set of data sources. Traditional on-premises databases such as Oracle and IBM Db2, enterprise resource planning (ERP) applications such as SAP, and newer software as a service (SaaS) applications such as Workday or Salesforce all contain vast operational records. IT logs, meanwhile, track the workings of applications and their underlying infrastructure, and internet of things (IoT) sensors transmit data about the actions and health of physical equipment. These and other sources can feed valuable real-time events into streaming data pipelines.

Streaming data pipeline. Traditional ETL data pipelines extract, transform, and load batches of data from operational sources into data warehouse targets every hour, day, or week. Modern streaming ETL pipelines, in contrast, capture just the updates, also known as events, on a real-time basis. Streaming ETL pipelines improve efficiency vs. batch pipelines and enable real-time delivery. Variations on these patterns include ELT or ETLT, both for batch and streaming workloads. This blog series focuses on the common ETL pattern.
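The unit of work in a streaming pipeline is the individual change event. As a hedged illustration (the field names are hypothetical, not any specific CDC tool's format), an event might look like this:

```python
# Illustrative shape of a change event, the unit a streaming ETL pipeline
# captures: only the update travels, not the whole source table.
change_event = {
    "op": "update",                  # insert, update, or delete
    "table": "orders",
    "key": 1017,
    "after": {"status": "shipped"},  # new state of the changed row
}

# A batch pipeline would instead reload every row of "orders" on a schedule;
# the streaming pipeline forwards just this one record in real time.
print(change_event["op"], change_event["key"])
```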


Streaming data pipelines extract, transform, and load data updates, also known as events.


Let’s walk through the four steps of a streaming pipeline: extract, transform, and load, as well as an extra step to enable ML that we’ll call “process.” Data engineers configure, execute, and monitor these steps, typically based on structured query language (SQL) commands, using either a graphical interface or command line scripting. ML projects might entail multiple streaming data pipelines that work in concert.

  • Extract. Change data capture (CDC) technology, application programming interfaces (APIs), or protocols such as MQ Telemetry Transport (MQTT) identify and capture events from various sources. Commercial tools or open-source platforms such as Apache Kafka then consolidate the events into streams and route them to one or more targets.

  • Transform. Along the way, commercial ETL tools or open-source stream processors such as Apache Flink and ksqlDB transform the streaming events while they flow through server memory, on premises or in the cloud. These tools transform the data in various ways: they reformat data, join streams, or derive new values based on individual events or groups of events called windows.

  • Load. A pipeline tool loads the transformed data to the target, often a cloud-native object store such as Amazon S3 or a cloud data platform such as Databricks. It can append events—i.e., simply add them to existing datasets—or merge them, for example by reconciling records in a table using SQL commands such as “insert,” “update,” and “delete.”

  • Process. Now the road forks! To support the in-stream analytics of an ML model, the streaming data pipeline does some additional processing of events. It uses a stream processor such as Apache Spark, Flink, ksqlDB, or Tibco Streaming to filter, evaluate, and query the streaming events. (This might be the same stream processor that transformed data in the earlier stage.) The stream processor also might query or look up historical features within the data store, call external scoring services, or perform rule-based operations. Actions like these help expose the features—i.e., those telling data inputs—to ML models so they can generate their score. 
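The four steps above can be sketched as a chain of plain Python generators. This is only an architectural illustration—a production pipeline would use Kafka, Flink, ksqlDB, or similar, and every name below is hypothetical:

```python
# Hedged sketch of the four pipeline steps as chained generators.

def extract(source_events):
    """Step 1: capture change events from a source (stands in for CDC)."""
    for event in source_events:
        yield event

def transform(events):
    """Step 2: reformat and derive new values while events are in flight."""
    for event in events:
        event["amount_usd"] = round(event["amount_cents"] / 100, 2)
        yield event

def load(events, target):
    """Step 3: append each transformed event to the target store."""
    for event in events:
        target.append(event)
        yield event

def process(events, window=2):
    """Step 4: windowed processing that exposes features for the ML model."""
    buffer = []
    for event in events:
        buffer.append(event["amount_usd"])
        buffer = buffer[-window:]  # sliding window over recent events
        yield {"avg_amount": sum(buffer) / len(buffer)}

target_store = []
source = [{"amount_cents": 500}, {"amount_cents": 1500}]
features = list(process(load(transform(extract(source)), target_store)))
print(features)           # windowed features, ready for scoring
print(len(target_store))  # both events also landed in the target
```

Chaining generators mirrors how real stream processors compose operators: each step consumes events lazily as the previous one emits them, rather than waiting for a complete batch.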

Machine learning model. Enter the machine! Our ML model now creates its score based on both the real-time and historical features that the stream processor exposed. This model—the product of much iterative work by the data scientist—delivers that score to a target that can make use of it.
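To make the scoring step concrete, here is a toy sketch of a model that weighs a real-time feature from the stream against a historical feature from the data store. The logistic form and the weights are purely illustrative stand-ins for whatever the data scientist actually trained:

```python
# Sketch of a scoring step combining a real-time feature (from the stream)
# with a historical feature (from a data store). Weights are toy values;
# a real model's parameters come from iterative training.
import math

WEIGHTS = {"realtime_ratio": 2.0, "historical_flags": 1.5, "bias": -4.0}

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def score(realtime_ratio, historical_flags):
    z = (WEIGHTS["realtime_ratio"] * realtime_ratio
         + WEIGHTS["historical_flags"] * historical_flags
         + WEIGHTS["bias"])
    return sigmoid(z)  # probability-like score between 0 and 1

low = score(realtime_ratio=0.5, historical_flags=0)   # routine event
high = score(realtime_ratio=3.0, historical_flags=2)  # suspicious event
print(round(low, 3), round(high, 3))
```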


The machine learning model creates its score based on both real-time and historical features.


Target. Now things get interesting. Four primary types of targets convert the model score into a prediction, classification, or recommendation: business intelligence (BI) tools, business monitoring tools, applications, and workflows. 

  • BI tools. Real-time reports and dashboards predict operational KPIs, classify customers, classify financial risks, recommend sales actions, or make other forward-looking statements. 

  • Business monitoring tools. Business monitoring tools serve as a smart notification system for business KPIs, sorting signal from noise and prescribing remediation plans when issues arise.

  • Application. Many applications now contain operational or analytical functions that take action based on ML scores. An ecommerce application might coax customers to buy something, for example, or automatically personalize web pages based on customer preferences.

  • Workflow. Automated workflows, such as intelligent process automation (IPA), streamline and adapt workflows across multiple applications. For example, a manufacturer might automatically stop or adjust processes when an ML model flags a faulty component.
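Fanning a score out to these four target types is essentially a routing problem. A minimal sketch, with every handler name hypothetical, might look like this:

```python
# Sketch of routing one model score to the four target types described
# above. Handler names and thresholds are illustrative placeholders.
actions = []

def to_bi_dashboard(score):
    actions.append(("dashboard", score))         # refresh a real-time report

def to_monitoring(score):
    if score > 0.9:
        actions.append(("alert", score))         # notify only on strong signal

def to_application(score):
    actions.append(("personalize", score))       # e.g., tailor a web page

def to_workflow(score):
    if score > 0.9:
        actions.append(("halt_process", score))  # e.g., stop a faulty line

for handler in (to_bi_dashboard, to_monitoring, to_application, to_workflow):
    handler(0.95)

print([name for name, _ in actions])
```

Note that some targets consume every score (dashboards, applications) while others act only when the score crosses a threshold (monitoring alerts, workflow interventions).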

Use Cases

So, how does all this take shape in modern business scenarios? Here are two real-world use cases.

  • Fraud prevention. The digital payments company Payoneer uses an ML model to classify the risk of transactions as they flow through a streaming system based on RabbitMQ. When the model flags a high-risk transaction, its workflow automatically blocks that user before funds change hands. Payoneer built and trained its model with the Iguazio data science platform.

  • Supply chain and route optimization. CargoSmart, a software provider to shipping companies, uses ML models on Tibco’s streaming platform to improve the efficiency of ocean vessels. For example, its ML models recommend speed and route changes to ships based on streams of IoT sensor data, helping reduce fuel consumption.

Exciting stuff! But as we learned in school, fast thinking isn’t always the best thinking. Our next blog will examine the ML lifecycle that helps enterprises build and deploy ML models that make timely and accurate decisions.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...