Machine Learning and Streaming Data Pipelines, Part III: Guiding Principles

ABSTRACT: A streaming ML program helps enterprises respond to the value and risks of business events in a repeatable way.

To respond to a business event, you need to do some fast thinking. But to respond to events on an ongoing basis, you need a well-planned program.

Machine learning models can help with the fast thinking. They help respond to events—perhaps the turn of a factory gear, an unusual payment request, or the moment a consumer clicks on their shopping cart—by analyzing the stream of data related to them. A streaming ML program, meanwhile, assembles the people, process, and technology needed to respond to events in a repeatable way, creating value and reducing risk over time.

This blog, the third and final in a series, recommends guiding principles for effective streaming ML programs. The first blog defined streaming ML and its supporting architecture, and the second blog defined how to train and operate streaming ML models. The guiding principles are to start with business objectives, assemble a cross-functional team, standardize on open platforms, and execute the program in phases. Let’s examine each in turn.

  • Start with business objectives. This principle, while self-evident, bears repeating because data teams tend to get excited and jump into science experiments that deliver hit-or-miss results. Instead, start your streaming ML program by defining the business objectives your enterprise seeks to achieve. The objectives might be to improve factory efficiency, reduce fraudulent transactions, or upsell mobile customers on a new product. If they center on the time-based value or risk of an event, then they likely will benefit from streaming ML.

Based on these business objectives, you can scope and quantify the upside of a potential streaming ML program. You can do this by setting measurable targets and defining clear use cases. Here are some example targets and use cases:

  • Toy manufacturer: increase factory yield 2% by spotting wear and tear, then triggering preventive maintenance to reduce downtime.

  • Commercial bank: reduce fraud 3% by identifying suspicious debit card transactions, then freezing the account to investigate.

  • E-commerce company: increase average deal sizes 4% by recommending add-on purchases to customers as they proceed to checkout.

These business objectives, measurable targets, and use cases should determine the program budget. Work within this budget as you hire, redeploy staff, and buy the tools that support this program to ensure its costs do not exceed the upside.

  • Assemble a cross-functional team. In line with the first principle, your program team should start with business sponsorship. Enlist a business leader to define business objectives and serve as program coach from start to finish. This individual might already feel the pain of missed goals, making them motivated to find a solution! Once you have your business owner on board, you can recruit technical program contributors to help define the use cases and set achievable targets. These contributors include the data engineer, data scientist, ML engineer, and operations engineer.

  • Data engineer. As the owner of data pipelines, the data engineer integrates architectural elements, transforms data for analytics, and helps operationalize ML models.

  • Data scientist. As the expert in statistics, the data scientist performs exploratory analytics, defines features, selects the ML technique, and trains the ML model.

  • ML engineer. As the liaison between the data scientist and operations, the ML engineer puts the ML model into production.

  • Operations engineer. The ITOps or CloudOps engineer helps the ML engineer monitor and tune the performance of model-driven applications.

These cross-functional team members, with the guidance of the business owner, should collaborate to design the people, process, and technology dimensions of your streaming ML program. They should brainstorm new ways to address the use cases and drive business results, and identify obstacles such as siloed datasets and processing bottlenecks. Be sure your team includes both innovative thinkers and detractors. By recruiting and convincing the detractors, you can devise a program that has broad appeal to the organization.

  • Standardize on open platforms. This principle centers on the value of flexibility. Have your team architect a streaming ML environment—using existing components where possible—that maintains tool interoperability and data mobility as well as the ability to grow in a modular fashion. For example, your feature store, streaming platform, and cloud data platform should share data via open file formats and open APIs. The environment should enable the program team to add or remove elements such as data sources, pipelines, and models without significant re-coding. By changing out modular elements in this way, your team can adapt and scale to meet dynamic business requirements.

Let’s say the data team at a commercial bank trains and deploys ML models to address the fraud reduction use case. These models detect anomalous transactions based on location, merchant, and transaction size, as well as historical cardholder behavior. Now the data scientist wants to start correlating these data points with hourly weather reports to confirm that, for example, an umbrella purchase makes sense on a given day in a given zip code. The data engineer needs to integrate those periodic updates without significant effort to reformat data, redesign pipelines, or rework other aspects of the architecture.
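The modularity described above can be sketched in code. Here is a minimal, hypothetical illustration in Python—the class names, feature names, and values are invented for the example, not drawn from any real platform—showing how a pipeline that merges features through one narrow interface lets the data engineer register a new weather source without redesigning anything else:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: each data source implements one narrow interface,
# so the pipeline can gain or lose sources without a redesign.
class FeatureSource(ABC):
    @abstractmethod
    def features(self, transaction: dict) -> dict:
        """Return the features this source contributes for one event."""

class CardholderHistorySource(FeatureSource):
    def features(self, transaction: dict) -> dict:
        # Placeholder logic; a real source would query a feature store.
        return {"avg_spend_30d": 54.20}

class WeatherSource(FeatureSource):
    def features(self, transaction: dict) -> dict:
        # Placeholder: a real source would look up hourly weather
        # for the transaction's zip code.
        return {"precipitation_mm": 6.5}

def build_feature_vector(transaction: dict, sources: list) -> dict:
    """Merge features from every registered source into one vector."""
    vector = dict(transaction)
    for source in sources:
        vector.update(source.features(transaction))
    return vector

# Adding weather data becomes one line of registration,
# not a pipeline redesign.
sources = [CardholderHistorySource(), WeatherSource()]
vector = build_feature_vector({"amount": 19.99, "zip": "02114"}, sources)
print(sorted(vector.keys()))
```

The design choice is the point: because the fraud model consumes a merged feature vector rather than hard-coded inputs, each new correlation the data scientist wants to test plugs in behind the same interface.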

  • Execute in phases. Plan your program as a series of phases with achievable goals. Aim to achieve a quick win with demonstrable business value that can secure budget and executive support before tackling bigger goals in phase 2 and beyond. To reduce fraud, the commercial bank might start by improving the accuracy of alerts for locations where fraud is easier to spot. By addressing these locations first, the data scientist gains valuable insights into telltale behavior that help them tune model parameters, retrain the model, and make other adjustments. This prepares them to address more difficult locations in phase 2.

Modern digital business centers on real-time events whose values and risks demand fast but programmatic responses. Enterprises can meet this challenge by basing their streaming ML program on sound business objectives, a cross-functional team, open platforms, and phased execution.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...