The Machine Learning Lifecycle and MLOps: Building and Operationalizing ML Models - Part III
Read - The Machine Learning Lifecycle and MLOps: Building and Operationalizing ML Models - Part I
Read - The Machine Learning Lifecycle and MLOps: Building and Operationalizing ML Models - Part II
Parents tremble when their child steps into that first Little League game. Will they hit the ball? Will the ball hit them? Training cannot eliminate all the risks.
Similarly, training cannot eliminate the risks that arise when a machine learning model goes into production. Data scientists lead a team effort to prepare the data, build the algorithm, train it, and produce a final machine learning model. But when the ML engineer brings the model across the bridge to live operations, things get tricky fast.
Machine learning engineers work with DevOps engineers to adapt that model to production code, and with ITOps engineers to monitor its performance. When performance or accuracy suffers, the machine learning engineer must help the data scientist adapt the model. These and other stakeholders can succeed by setting clear business objectives, designing for modular changes, and over-communicating with each other.
This blog, the third in a series, examines the third and final phase of the machine learning (ML) lifecycle: model operations, or MLOps. The first blog in our series examined the steps and personas of the first phase, data and feature engineering. The second blog explored the second phase, model development and training. Future blogs will describe how stakeholders must learn new skills and collaborate to make all this work. We will also review the landscape of ML tools that help them along the way. The following diagram illustrates the three phases of the ML lifecycle, each containing three steps, that we explore in this blog series.
The Machine Learning Lifecycle
To recap, a machine learning (ML) algorithm discovers patterns in data. These patterns help people or applications predict, classify, or prescribe future outcomes, such as customer purchases, house prices, or fraudulent transactions. A fully developed and “trained” algorithm becomes a model, which is essentially an equation that defines the relationship between key data inputs (your “features”) and outcomes (also known as “labels”). ML applies various techniques to train and thereby create this model. They include supervised learning, which studies known prior outcomes, and unsupervised learning, which finds patterns without knowing outcomes beforehand.
Our last blog concluded the model development phase when the data scientist delivered a trained, production-ready model to the ML engineer. The ML engineer manages this model alongside various training model versions in a model catalog, whose role-based access controls provide a governed interface to ML project workspaces and software development platforms.
Model Operations (MLOps)
The model operations phase includes three steps, which we call Implement, Operate, and Monitor. Our description of each step below includes an italicized summary of the key challenges for enterprises to navigate. While titles and roles vary, enterprise teams must collectively address the following tasks.
STEP ONE: IMPLEMENT
The ML engineer prepares to implement the ML model in production by reviewing each model version’s features, labels, assumptions, training data, change history, and documentation. With the oversight of the data scientist, they select the version they want for production, then validate their selection. For example, they might run final A/B tests to compare the results of different model versions and datasets. (Many enterprises are just getting started and have only one model to put to work.)
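For teams that script this comparison, a minimal sketch might look like the following: load the candidate model versions and score each against the same frozen holdout set. The Python stack (joblib, scikit-learn), the file paths, and the choice of accuracy as the metric are illustrative assumptions rather than a prescribed toolchain.

```python
# A minimal sketch of comparing candidate model versions before promotion.
# File paths and the accuracy metric are illustrative assumptions.
import joblib
from sklearn.metrics import accuracy_score

# Frozen holdout data, saved during model development (hypothetical file)
X_holdout, y_holdout = joblib.load("data/holdout.pkl")

candidates = {
    "model_v3": joblib.load("models/model_v3.pkl"),
    "model_v4": joblib.load("models/model_v4.pkl"),
}

for name, model in candidates.items():
    score = accuracy_score(y_holdout, model.predict(X_holdout))
    print(f"{name}: holdout accuracy = {score:.3f}")
```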
Next, the ML engineer leads a cross-functional effort to integrate the model with production operations. They collaborate with the DevOps engineer to stage the model wherever it needs to go, perhaps in a BI application, payment-processing workflow, or ecommerce website. This integration might entail re-coding, for example to convert a model’s R program into the Java code of the commercial or custom application that operationalizes its output. Ideally, the ML engineer already scoped this integration work during the model development phase.
The ML engineer collaborates with the data engineer to connect the supporting data pipelines and establish a way to track processed data so they can spot drift later on. The ML engineer also asks the ITOps engineer to estimate the storage and compute requirements, then provision the necessary resources, typically on the cloud. They set up a process to share production model results with various stakeholders, including data scientists and business owners, and summarize results in the model catalog.
Before going live, the ML engineer needs to determine model production targets on four dimensions. Data scientists and business owners should review and approve these targets to ensure alignment with business objectives.
Performance. Key performance indicators (KPIs) for criteria such as latency, throughput, and availability must satisfy service level agreements (SLAs) with business owners.
Accuracy. Models degrade over time and make less accurate predictions because the data “drifts”: the business environment changes (consumer behavior during the COVID shock, for example), forcing you to adapt and retrain your model. Define your permissible ranges for accuracy before going into production. Also set simple benchmarks, such as moving averages, to make easy comparisons; a minimal sketch follows this list.
Governance. You should already have established the right policies to comply with internal and external requirements for handling models and Personally Identifiable Information (PII). These policies include the identification and prevention of bias based on gender, ethnicity, etc. Define the indicators and thresholds at which your model usage, outputs, or workflow tasks become non-compliant.
Cost. Like any data- or compute-intensive operation, ML costs money. Determine how you want to measure, charge back, and show back the cost of ML computation, data transfer, and storage. Also prepare basic calculations of your model’s total impact. Perhaps the model costs $3,000 per month to run but its more accurate predictions deliver $40,000 in business value.
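To make the accuracy benchmark above concrete, here is a minimal sketch of a moving-average baseline. The 30-day window and the 0.05 tolerance are illustrative assumptions; each team should substitute the ranges it agreed on with its business owners.

```python
# A minimal sketch of a moving-average accuracy benchmark.
# The window size and tolerance are illustrative assumptions.
from collections import deque

WINDOW = 30        # trailing observations used for the baseline
TOLERANCE = 0.05   # permissible drop below the moving average

recent_accuracy = deque(maxlen=WINDOW)

def accuracy_within_target(todays_accuracy: float) -> bool:
    """Return True if today's accuracy stays within the agreed range."""
    if recent_accuracy:
        baseline = sum(recent_accuracy) / len(recent_accuracy)
    else:
        baseline = todays_accuracy  # no history yet on the first day
    recent_accuracy.append(todays_accuracy)
    return todays_accuracy >= baseline - TOLERANCE
```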
ML teams should consider implementing their ML models and workflows in a microservices architecture to streamline inevitable changes once in production—data schema changes, KPI updates, etc. A microservices architecture breaks software programs into small, modular, and independent services. You can update, remove, or replace one microservice with minimal impact on the others. By adopting a microservices architecture for ML, for example, you can more easily swap in new model versions, reconfigure Spark processing clusters, or add a data source. Containers such as Docker make this easier by packaging a microservice with whatever it needs to run, including system tools, libraries, and configuration files.
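To make the microservices idea concrete, here is a minimal sketch of a model-scoring service that could be packaged into its own Docker image and swapped out independently of the rest of the workflow. Flask, the joblib-serialized model file, and the /predict request format are assumptions chosen for illustration; any web framework and serialization format could play the same role.

```python
# A minimal sketch of a model-scoring microservice.
# Flask, the model file name, and the request format are illustrative assumptions.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model_v4.pkl")  # hypothetical serialized model artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction, "model_version": "v4"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the rest of the workflow sees only this HTTP contract, swapping in a new model version means rebuilding and redeploying this one container, leaving everything else untouched.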
Key Challenges: ML engineers must lead model implementation, while enlisting the expertise and support of their DevOps and ITOps colleagues to align with existing operational processes. They should take a risk-averse, “do no harm” approach, ready to spot trouble, revise, and roll back. Stay flexible and plan for change. Perhaps most importantly, keep data scientists and business owners informed of your implementation plan. And hold them accountable for the success of that plan.
STEP TWO: OPERATE
Time to go live! The ML and DevOps engineers tag-team in this phase, dividing responsibilities based on skills and preferences. They kick off operations by scheduling, then executing, the tasks that activate ML models in production workflows. In regulated industries, compliance officers might well need to vet and approve the production-ready model before activation. Be sure to complete any such compliance checks, then track the usage dates of your models to support future audits.
As should be clear by now, this is no time to set it and forget it. ML and DevOps engineers should apply the DevOps practice of continuous integration and continuous delivery (CI/CD), which means they frequently update code to fix minor errors and maintain production quality standards. The CI/CD practice includes the following.
Continuous integration. The ML and DevOps engineers create code branches on a development platform such as GitHub so they can update features, reconfigure settings, or fix bugs. They test the revised code branch, identify issues, and resolve them; a minimal automated check of this kind is sketched after this list.
Continuous delivery. The ML and DevOps engineer, potentially working with ITOps, inspect the revised code branch, approve it, and kick off an automated release process to go live in production.
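As one illustration of the automated testing in this loop, a CI pipeline could run a check like the following on every revised branch before approval. The file paths, the frozen reference dataset, and the 0.90 accuracy floor are assumptions for the sake of the sketch, not recommended values.

```python
# A minimal sketch of an automated model check suitable for a CI pipeline.
# Paths and the 0.90 accuracy floor are illustrative assumptions.
import joblib
from sklearn.metrics import accuracy_score

def test_candidate_meets_accuracy_floor():
    model = joblib.load("models/candidate.pkl")           # model built on this branch
    X_ref, y_ref = joblib.load("data/reference_set.pkl")  # frozen evaluation data
    accuracy = accuracy_score(y_ref, model.predict(X_ref))
    assert accuracy >= 0.90, f"Candidate accuracy {accuracy:.3f} is below the release floor"
```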
Key challenges: Bugs and manual errors quickly wreak havoc in ML production workflows. Be sure you have the right tools and procedures in place for the ML and DevOps engineers to quickly notify various stakeholders and coordinate a response. Also seek out MLOps tools that automatically synchronize model updates with development platforms such as GitHub.
STEP THREE: MONITOR
Vigilant operations and response, of course, depend on the monitoring of metrics related to performance, accuracy, governance, and cost. ML engineers collaborate with ITOps engineers to track, alert, and report on these health indicators. Here are examples of how indicators might cross thresholds and prompt corrective measures.
Performance. Increased latency, decreased throughput, or outages generate alerts to stakeholders. Under the oversight of the ML engineer, the ITOps engineer troubleshoots, diagnoses the issue, and remediates the affected task, application, or workflow. Typically, these are standard IT remediation processes.
Accuracy. Customer responses to an ML recommendation decline, or false fraud alerts trigger authentication requests that prompt merchants to cancel transactions. The ML engineer spots the issue in an MLOps dashboard, pulls the model out of production, and consults the data scientist. They decide whether to re-train the model on more recent or more comprehensive historical data. They might also change or add to the ML model’s techniques, or even change its features. (A minimal drift check is sketched after these examples.)
Governance. The MLOps dashboard flags different ML recommendations for customers of different ethnicities with otherwise identical attributes. The ML engineer notifies the data scientist of potential bias, and they remove the model from production. They determine whether to re-train the model, change its ML techniques or change its features. When making any changes, they make sure they can still explain the model to potential auditors.
Cost. The ITOps engineer sees a spike in AWS compute charges for the ML model version that one business unit uses for preventive maintenance on its delivery trucks. They notify the business owner, ML engineer, and DevOps engineer, and discuss the potential cause of extra compute cycles. Did certain telemetry sensors issue too many signals? Can they be filtered?
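One widely used drift indicator is the population stability index (PSI), which compares a feature’s distribution in production against its distribution in the training data. The sketch below is a minimal illustration: the ten bins and the roughly 0.2 alert threshold are common rules of thumb rather than requirements, and the data shown is synthetic.

```python
# A minimal sketch of a population stability index (PSI) drift check.
# Bin count, the ~0.2 threshold, and the synthetic data are illustrative.
import numpy as np

def population_stability_index(training_values, production_values, bins=10):
    edges = np.histogram_bin_edges(training_values, bins=bins)
    expected = np.histogram(training_values, bins=edges)[0] / len(training_values)
    actual = np.histogram(production_values, bins=edges)[0] / len(production_values)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0) in sparse bins
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feature = rng.normal(0.0, 1.0, 10_000)  # stand-in for training data
    prod_feature = rng.normal(0.5, 1.0, 10_000)   # shifted production data
    print(f"PSI = {population_stability_index(train_feature, prod_feature):.3f}")
    # Values above roughly 0.2 often prompt the team to investigate drift.
```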
These corrective measures all rely on agreed policies and established procedures. ML engineers, ITOps engineers, and DevOps engineers should have a game plan ready for when things go sideways. Who gets notified, and who activates the game plan? If a model crashes, do you roll back to the prior version? Over time, these teams also should seek to automate the routine aspects of alerting and response.
Key Challenges: Data drifts and models degrade rapidly in this volatile post-COVID business environment. Closely monitor your model accuracy and respond quickly to negative indicators. Also keep a close eye on governance needs, including evolving expectations and regulations about model bias. Minimize the risk that your ML model will automate bad decisions.
Effective MLOps depend on achievable business objectives, clear metrics, and an abundance of team communication. With these controls in place, ML engineers and their colleagues can reduce chaos in their production environment. But like Little League players, models need to mature rapidly to stay competitive. Establish an MLOps approach that plans for fast iterations in both your models and game plan.
Our next blog will dig further into the ways that various ML team players need to learn, teach one another, and help one another—as well as the tools they can use.
Read - The Machine Learning Lifecycle and MLOps: Building and Operationalizing ML Models - Part IV