Machine Learning and Streaming Data Pipelines, Part II: Training and Operating Streaming ML Models

To catch fish in a stream, you need to spot telltale signs—flies on the surface, a pocket of deep water, the quick rise of a trout—then take prompt and careful action.

The same holds true when you apply machine learning to a streaming data pipeline. You need your machine learning model to spot telltale signs within the data stream, analyze what they mean, and guide you to take prompt and careful action. 

This blog, the second in a series, explores how to train and operate streaming ML models, with a particular focus on how to identify and manage the telltale signs in data that we call features. To make this happen, data science teams must implement a feature store as part of a flexible streaming ML architecture.


So what is machine learning? ML is a subset of artificial intelligence (AI) in which a model finds patterns in data and then predicts, classifies, or recommends outcomes. It finds the patterns by studying features such as customer purchases or metrics for mechanical operations. ML models depend on a lifecycle of three iterative phases: data and feature engineering, model development, and model production. 
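
As a minimal illustration, here is a toy scikit-learn model that studies two hypothetical truck features and learns to classify breakdowns; the feature names and numbers are invented for the example.

```python
# A toy scikit-learn sketch of the idea: a model studies features and learns
# to classify outcomes. All feature names and numbers here are invented.
from sklearn.linear_model import LogisticRegression

# Features per truck: [average tire pressure (psi), days since last service]
X = [[34.0, 10], [35.5, 20], [26.0, 160], [24.5, 200]]
y = [0, 0, 1, 1]  # Historical outcomes: 0 = no breakdown, 1 = breakdown

model = LogisticRegression().fit(X, y)  # find patterns in the features
print(model.predict([[25.0, 180]]))     # classify a new truck; likely [1]
```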

The Machine Learning Lifecycle

  • Data and feature engineering. The data scientist collaborates with the data engineer to ingest historical input data. Together they transform the data, then define the “features” that help predict historical outcomes. The data scientist also might need to label those historical outcomes.

  • Model development. The data scientist selects an ML technique, such as regression or classification, that can define the relationship between features and outcomes. They train the selected algorithm by “serving” features to it and observing what outcomes it predicts. They adjust features and parameters until the algorithm is accurate enough to become a production ML model (a toy version of this loop appears in the sketch after this list).

  • Model production. The ML engineer puts the ML model into operation. They integrate it with the right live data pipelines, perhaps working with DevOps or data engineers, then monitor its performance, accuracy, and compliance with regulatory requirements. The ML engineer also collaborates with governance officers to track those metrics in a model catalog, and goes back to the first two phases when a model goes sideways.
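
To make the model development loop concrete, here is a toy sketch of that iteration, using a hypothetical accuracy threshold of 0.90 for promotion to production.

```python
# A toy sketch of the model development loop: train, check accuracy, adjust
# a parameter, and stop once the model clears a hypothetical threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

production_model = None
for n_trees in (10, 50, 100):  # adjust a parameter on each iteration
    candidate = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_test, candidate.predict(X_test))
    if score >= 0.90:          # hypothetical "accurate enough" threshold
        production_model = candidate
        break
```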

Managing Features in the Streaming Data Pipeline

Data science teams need a robust feature store to support the ML lifecycle in their streaming ML architecture. This might entail a standalone feature store such as Tecton or Featureform, or a feature store within an ML platform such as Databricks or Amazon SageMaker. Let’s explore what that looks like. 

You can view the following steps as a more granular picture of the streaming data pipeline described in the last blog, with a particular focus on managing dynamic features. The good news is that the same feature store and steps can support both model development and model production.

The Role of the Feature Store

The same feature store and process can support both model development and model production.

The data engineer uses a streaming platform such as Apache Kafka or Apache Pulsar, or a commercial data pipeline tool such as Qlik or Equalum, to ingest data into the feature store. For example, to classify the risk that delivery trucks will break down, a data engineer at a logistics company might ingest various cellular IoT sensor feeds from the trucks as well as local weather updates.
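
As a minimal sketch of that ingestion step, assuming the kafka-python client and a hypothetical truck-sensors topic:

```python
# A minimal ingestion sketch using the kafka-python client. The topic name,
# broker address, and message fields are hypothetical placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "truck-sensors",                     # hypothetical topic of IoT events
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for event in consumer:
    reading = event.value  # e.g. {"truck_id": "T-42", "tire_psi": 31.5}
    # In a real pipeline, hand each event to the feature store's ingestion API.
    print(reading["truck_id"], reading.get("tire_psi"))
```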

The feature store performs five critical tasks that assist the data engineer, as well as the data scientist and ML engineer.

  • It transforms the streaming data, which is essentially a series of events. This might mean filtering or cleansing events, reformatting them, or deriving values from groups of events called windows. It also might mean merging multiple streams. In our delivery truck example, the data scientist or data engineer can create transformation logic that converts the truck sensor feeds to a common format (the first sketch after this list shows windowed logic of this kind).

  • It helps define features based on the streaming data, using structured query language (SQL), Python, or PySpark. In our example, the data scientist or data engineer creates a script to identify and isolate four types of events—readings for tire pressure, oil level, air temperature, and precipitation—that contribute to truck breakdowns. They also create a script that ranks driver safety risk based on patterns in braking behavior. Tire pressure, oil level, air temperature, precipitation, and driver safety risk each become a feature.

  • It serves features to the ML model, enabling the model to calculate the final score that predicts, classifies, or recommends future outcomes. The data scientist reviews and selects the right features for classifying the risk of truck breakdowns, and the feature store feeds them into the model (the second sketch after this list shows a toy version of serving).

  • It calculates feature values. Based on the transformation and feature scripts, the feature store computes the actual feature values within the streaming data. This might include hourly checks of tire pressure, oil level, air temperature, and the presence of precipitation. The scripts also might create an hourly ranking of driver safety—high, medium, or low—based on when (and how hard!) drivers hit the brakes during that window of time.

  • These tasks rely on a registry within the feature store that catalogs all the features and centralizes their metadata. The data scientist, ML engineer, and data engineer use the registry to discover, manage, and govern their features as they support the ML lifecycle.
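
Here is a minimal PySpark Structured Streaming sketch, the first referenced in the list above, covering the transformation, feature definition, and value calculation tasks. The topic, schema, and broker address are hypothetical, and a real job would also need the Spark Kafka connector package plus a writeStream sink for the computed features.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("truck-features").getOrCreate()

schema = StructType([
    StructField("truck_id", StringType()),
    StructField("tire_psi", DoubleType()),
    StructField("oil_level", DoubleType()),
    StructField("air_temp_f", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Transform: parse raw Kafka events into a common format.
readings = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "truck-sensors")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Define and calculate features: hourly values per truck over a window.
hourly_features = (
    readings.withWatermark("event_time", "1 hour")
    .groupBy(F.window("event_time", "1 hour"), "truck_id")
    .agg(
        F.avg("tire_psi").alias("avg_tire_psi"),
        F.min("oil_level").alias("min_oil_level"),
        F.avg("air_temp_f").alias("avg_air_temp_f"),
    )
)
```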
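
And here is a self-contained toy illustrating the serving and registry tasks, the second sketch referenced above. Every class and method name is a hypothetical simplification; real feature stores such as Tecton, Featureform, or SageMaker Feature Store expose their own APIs.

```python
# A self-contained toy illustrating feature serving and the registry. Every
# name below is a hypothetical simplification of what real feature stores do.
from dataclasses import dataclass, field

@dataclass
class ToyFeatureStore:
    registry: dict = field(default_factory=dict)  # feature name -> metadata
    online: dict = field(default_factory=dict)    # (entity, feature) -> value

    def register(self, name, description):
        self.registry[name] = {"description": description}

    def write(self, entity_id, name, value):
        self.online[(entity_id, name)] = value

    def serve(self, entity_id, names):
        # Return feature values in a fixed order for the model to consume.
        return [self.online[(entity_id, n)] for n in names]

store = ToyFeatureStore()
store.register("avg_tire_psi", "Hourly mean tire pressure per truck")
store.write("T-42", "avg_tire_psi", 31.5)
print(store.serve("T-42", ["avg_tire_psi"]))  # -> [31.5]
```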

Data science teams also use the feature store to handle batch data that enriches the features and increases model accuracy. In our example, the streaming ML model can classify the risk of a truck breakdown more accurately if it incorporates a feature from batch histories of truck maintenance.
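
As a hedged illustration, that batch enrichment can be as simple as a join; here is a pandas sketch with hypothetical column names and values.

```python
# A minimal pandas sketch of enriching streaming features with a batch
# maintenance history. All column names and values are hypothetical.
import pandas as pd

streaming_features = pd.DataFrame({
    "truck_id": ["T-42", "T-7"],
    "avg_tire_psi": [31.5, 29.8],
})
maintenance_history = pd.DataFrame({
    "truck_id": ["T-42", "T-7"],
    "days_since_service": [12, 148],  # batch feature from service records
})

# The joined frame gives the model real-time and historical signals together.
training_frame = streaming_features.merge(maintenance_history, on="truck_id")
```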

Most modern feature stores run on cloud platforms based on AWS, Azure, or Google Cloud infrastructure. These platforms help the data scientist and ML engineer handle data and features in three ways.

  • Cloud compute platforms such as Amazon Elastic Compute Cloud (EC2) and Google Compute Engine process streaming data in memory in order to transform it and calculate feature values. This supports production ML operations, and helps mimic production conditions when training ML models. If a delivery truck starts leaking oil, the delivery company can spot the risk fast and notify the driver in time to find a service technician.

  • NoSQL databases such as Amazon DynamoDB or Apache Cassandra store features and data online, ready for real-time retrieval. This also helps transform data and calculate feature values to assist training or production operations (see the sketch after this list).

  • Cloud object stores such as Amazon Simple Storage Service (S3) or Azure Blob Storage store features and data offline, ready for periodic retrieval. This helps define features and serve them to the ML model during the training or production stage. Offline storage also supports the feature registry. If the data scientist wants to change out some of their truck sensor features, or update truck service histories, they use offline storage.
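
To make the online/offline split concrete, here is a minimal boto3 sketch, assuming AWS credentials and a region are already configured. The table, bucket, and key names are hypothetical, and a managed feature store would perform these writes itself.

```python
import json
from decimal import Decimal

import boto3

# Online store: low-latency lookups of the latest feature values.
dynamodb = boto3.resource("dynamodb")
online_table = dynamodb.Table("truck-features-online")  # hypothetical table
online_table.put_item(
    Item={"truck_id": "T-42", "avg_tire_psi": Decimal("31.5")}
)

# Offline store: full feature history for training and the registry.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="truck-features-offline",                    # hypothetical bucket
    Key="features/2024/T-42/avg_tire_psi.json",
    Body=json.dumps({"truck_id": "T-42", "avg_tire_psi": 31.5}),
)
```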

The people who catch the most fish have learned through experience how to spot and act on telltale signs in the stream. Enterprises can achieve similar success with streaming ML if they learn how to spot and act on those telltale features in the data stream.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management.