DataOps for Machine Learning: A Guide for Data Engineers and Data Scientists
The emerging discipline of DataOps seeks to streamline data pipelines and improve data quality with a mix of DevOps, agile software development, total quality management, and lean manufacturing principles. And it is long overdue. As Wayne Eckerson has written in depth, we need new processes and tools to meet new requirements. Rising data volume, variety and velocity are breaking legacy systems and torpedoing projects.
There is perhaps no better target for DataOps than machine learning. Enterprises are widely deploying machine learning, currently the most popular form of artificial intelligence, to identify data patterns, infer meaning and make predictions. Done right, machine learning improves decision speed, accuracy and efficiency. It also enables new decisions. (Also see the article I wrote with Jordan Martz, “Data Management Best Practices for Machine Learning.”)
But doing it right is not easy. Popular applications of machine learning, such as medical diagnosis, personal data security and self-driving cars, underscore the need for timely, high-quality data. As Trifacta CEO Adam Wilson observed recently to Datanami, “the last thing you want to do is automate bad decisions faster.” This article seeks to help IT leaders understand and articulate the ways in which DataOps helps data engineers and data scientists get machine learning (ML) right. With this knowledge, you can get budget and allocate resources appropriately.
Fraud detection is an ideal case study for ML DataOps because it illustrates the knife edge on which ML treads. You need to minimize both “false negatives” and “false positives,” stopping the bad guys but letting the good guys through.
With this in mind, let’s apply DataOps best practices to the development, training, deployment and operation of machine learning models whose objective is to analyze fraud and money laundering activity. This is based on numerous enterprise projects, including those at PayPal, Allianz, Thomson Reuters and Capgemini, as well as principles from the excellent DataOps Cookbook by the team at software provider DataKitchen. (To be clear, our focus is the use of DataOps to enable ML. Applying ML to data management processes themselves is a compelling but separate topic.)
Let’s consider three ML model scenarios to spot bad guys:
Cascade: In a cascade, ML models operate sequentially to identify signs of a stolen account and progressively adjust the risk assessment. For example, we might identify an account name change, then contact info and address changes. After a “grooming” period of harmless behavior, the new account owner starts withdrawing large amounts of cash.
Segment: Here we apply one ML model to different datasets concurrently to map and characterize bad guy collusion, for example using graph analysis. This might entail a merchant account that uses stolen credit cards to process payments from a network of fraudulent consumer accounts.
Ensemble: In an ensemble arrangement, different models process the same evolving dataset concurrently. For example, we’ll observe periodic transfers, each just below the reporting threshold amount, that are made to common overseas accounts. While remittance agents might vary, the contact info of the ordering and receiving customers is often the same, signaling a risk of money laundering. (Strictly speaking, cascades are also a type of ensemble.)
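To make the money laundering signal above concrete, here is a minimal sketch of a single rule-based detector that could serve as one member of an ensemble. It is an illustration only: the threshold value, the "just below" ratio, the minimum transfer count and all field names are assumptions, not figures from any real project or regulation.

```python
from collections import defaultdict
from dataclasses import dataclass

# All three constants are illustrative assumptions.
REPORTING_THRESHOLD = 10_000.00   # hypothetical reporting threshold
NEAR_THRESHOLD_RATIO = 0.90       # "just below" = within 10% of the threshold
MIN_TRANSFERS = 3                 # repeated transfers needed before flagging


@dataclass(frozen=True)
class Transfer:
    sender_contact: str    # contact info of the ordering customer
    receiver_account: str  # overseas account receiving the funds
    agent: str             # remittance agent; deliberately ignored by the rule
    amount: float


def flag_structuring(transfers):
    """Return (sender_contact, receiver_account) pairs that show repeated
    transfers just below the reporting threshold, whatever the agent."""
    lower_bound = REPORTING_THRESHOLD * NEAR_THRESHOLD_RATIO
    counts = defaultdict(int)
    for t in transfers:
        if lower_bound <= t.amount < REPORTING_THRESHOLD:
            counts[(t.sender_contact, t.receiver_account)] += 1
    return {pair for pair, n in counts.items() if n >= MIN_TRANSFERS}
```

Because the rule groups by contact info and receiving account rather than by remittance agent, it still fires when the agents vary. A real ensemble would combine a signal like this with the outputs of other models.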
Figure 1 summarizes these ML model approaches.
Figure 1: Common Approaches to ML Modeling
Each of these ML models is developed, trained, simulated/validated, deployed and then operated, with iterative loops along the way. After training the model based on sample data, we might go back to the development phase and tune the model, for example by adding contact info fields to track in the stolen account scenario. After simulating and validating the model, for example by comparing its results with a production model, we might need to train it again with additional training data. After deployment, we’ll often find the model performance degrades over time as production datasets evolve. Then it is back to development, training and simulation. Figure 2 illustrates this process flow.
Figure 2: ML Modeling Lifecycle
We can see ample room for error and therefore financial loss. A military officer might get married, change her name and relocate to Afghanistan, then set up wire transfers to a new childcare facility back home. This is not a stolen account. But there are many ways in which your ML models can be led to flash false alerts. They include insufficient development and training, inconsistent application of models to different geographies, incorrect model tuning and deficient QA.
Here is how DataOps can get things back on track. This consolidates the “7 Steps to Implement DataOps” from the DataOps Cookbook and applies them to the ML lifecycle for fraud identification.
DataOps Best Practice #1: Modular (Re)Development
A series of commands will ingest, transform and analyze data within a machine learning pipeline. When you start generating too many false positives or false negatives post-deployment, you need to go back and identify the source of the problem. Assuming it is a code problem rather than a data problem, you need to isolate that code and tune or debug it without disrupting production workflows. The DataOps Cookbook recommends the “branch and merge” best practice to handle this efficiently: copy relevant code from your version control tool, develop and test changes on that isolated local “branch,” then merge the revised code back into the “trunk” for production operations.
In addition to troubleshooting, the “branch and merge” approach enables improvements. Take our Cascade ML scenario for stolen accounts. Suppose we start to manually observe over time a pattern in the lawful “grooming period” behavior for certain geographies: New debit account owners buy baby diapers, dog food and golf balls in small amounts for exactly three weeks, then start withdrawing $200 of cash every three days until the balance is drained. We can build this new intelligence into the ML sequence by branching the “grooming period” model and inserting a new rule that identifies this pattern. We test the revised model and merge it back into the production workstream. By identifying this signature grooming period behavior, we are ready to shut down bad guys sooner when the withdrawals start.
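As a rough sketch, the new rule developed on the branch might look like the function below. The event structure, the cutoff for a "small" purchase and every name here are assumptions made for illustration; only the three-week window, the $200 amount and the three-day interval come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    day: int       # days since the account details changed (assumed field)
    kind: str      # "purchase" or "withdrawal"
    amount: float


GROOMING_DAYS = 21           # three weeks, from the example above
SMALL_PURCHASE_MAX = 50.0    # assumed cutoff for a "small" purchase
WITHDRAWAL_AMOUNT = 200.0    # from the example above
WITHDRAWAL_INTERVAL = 3      # days between withdrawals, from the example above


def matches_grooming_signature(events):
    """True if the account shows only small purchases during the grooming
    window, followed by $200 withdrawals spaced exactly three days apart."""
    grooming = [e for e in events if e.day < GROOMING_DAYS]
    later = [e for e in events if e.day >= GROOMING_DAYS]

    # The grooming window must contain activity, all of it small purchases.
    groomed = bool(grooming) and all(
        e.kind == "purchase" and e.amount <= SMALL_PURCHASE_MAX
        for e in grooming
    )

    # After the window, look for at least two evenly spaced withdrawals.
    days = sorted(
        e.day for e in later
        if e.kind == "withdrawal" and e.amount == WITHDRAWAL_AMOUNT
    )
    regular = len(days) >= 2 and all(
        b - a == WITHDRAWAL_INTERVAL for a, b in zip(days, days[1:])
    )
    return groomed and regular
```

Keeping the rule in its own small function is what makes the branch-and-merge workflow practical: it can be tested in isolation on the branch before it joins the production cascade.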
We also might want to develop a new “unsupervised” model that identifies new patterns itself as it scans unlabeled data. This will help spot the creative bad guys that are less predictable and therefore more evasive. Modular development, as you might expect, maps to the ML model development, training and simulation/validation stages in our lifecycle.
DataOps Best Practice #2: Modular Deployment
Machine learning models and other software components are often applicable to multiple use cases. To replicate successes and minimize losses, you need to make those components as interchangeable as Lego bricks. When implemented correctly, container technologies, orchestrated with a platform such as Kubernetes, can enable you to rapidly and efficiently port an ML model across on-premises VMs and/or cloud provider environments. This would help, for example, replicate an effective new US-based collusion ML model to local cloud facilities in Europe to find collusive behavior there.
DataOps Best Practice #3: Flexible Execution
Effective machine learning models can adapt to changing runtime conditions without going back to the development phase. What does this mean? A well-designed model will enable you to easily change parameters in-stream and measure the impact on results. For example, we might initially decide to filter out certain contact information fields as data is ingested, in order to accelerate ML ensemble processing. After a few months, model performance degrades, letting more bad guys get away with money laundering. So we go back and reverse the parameter, reinjecting the contact fields to cast a wider net for bad guys.
We also might want to have a parameter that easily toggles between two models to understand their respective impacts on ensemble performance. The flexible execution best practice applies to the operational stage of the ML model lifecycle, although the right parameters must be set during development.
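A minimal sketch of both ideas, assuming a runtime configuration object that the pipeline reads on each run. The field names, the two toy scoring functions and their scores are hypothetical placeholders, not real models.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeConfig:
    # Fields filtered out at ingestion; an empty set reinjects everything.
    drop_fields: set = field(
        default_factory=lambda: {"contact_email", "contact_phone"}
    )
    # Which model to run; switched without touching pipeline code.
    active_model: str = "amount_model"


def amount_model(features):
    # Toy stand-in: scores on transfer size alone.
    return 0.9 if features.get("amount", 0) >= 9000 else 0.1


def contact_model(features):
    # Toy stand-in: can only fire if contact fields survived ingestion.
    sender = features.get("contact_email")
    matched = sender is not None and sender == features.get("receiver_email")
    return 0.8 if matched else 0.2


MODELS = {"amount_model": amount_model, "contact_model": contact_model}


def score(record, config):
    """Apply the ingestion filter, then run whichever model is active."""
    features = {k: v for k, v in record.items() if k not in config.drop_fields}
    return MODELS[config.active_model](features)


record = {
    "amount": 9500,
    "contact_email": "a@example.org",
    "receiver_email": "a@example.org",
}
config = RuntimeConfig()
filtered_score = score(record, config)   # contact fields dropped

config.active_model = "contact_model"    # toggle models at runtime
config.drop_fields = set()               # reinject the contact fields
restored_score = score(record, config)
```

Both changes are made to the configuration, not to the code, which is the point of the practice: the parameters have to be designed in during development so that operators can adjust them later.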
DataOps Best Practice #4: Data and Logic Tests
Finally and perhaps most importantly, we have testing, a mandatory aspect of any effective machine learning model pipeline. Data and ML engineers must check data quality and test code functionality at each stage of data processing. Automated testing is the best way to maintain consistent performance and results. With automated checks, we can find errors in source data before they corrupt the results, driving up false positives and/or negatives. For example, a currency field error that incorrectly affixes dollar signs to Chinese yuan figures will inflate their value roughly sixfold, potentially flagging legitimate transactions as signs of a stolen account by exceeding configured thresholds. Spot this with an automated test and fix it, and you’ll avoid upsetting many law-abiding customers.
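An automated data test for the currency example could be as simple as the sketch below. The record layout, the field names and the list of allowed currencies are assumptions for illustration; a real pipeline would run a check like this at ingestion and stop or quarantine the batch when it returns errors.

```python
# Assumed set of currencies this hypothetical pipeline expects to see.
ALLOWED_CURRENCIES = {"USD", "CNY", "EUR"}


def check_currency(records):
    """Return a list of human-readable errors; an empty list means the
    batch passed. Each record is a dict with 'currency' and 'amount'."""
    errors = []
    for i, record in enumerate(records):
        currency = record.get("currency")
        raw_amount = str(record.get("amount", ""))

        if currency not in ALLOWED_CURRENCIES:
            errors.append(f"row {i}: unexpected currency {currency!r}")

        # A dollar sign on a non-USD amount is the error described above.
        if raw_amount.startswith("$") and currency != "USD":
            errors.append(
                f"row {i}: dollar sign on {currency} amount {raw_amount!r}"
            )
    return errors


batch = [
    {"currency": "USD", "amount": "$100.00"},
    {"currency": "CNY", "amount": "$650.00"},  # mislabeled yuan figure
]
problems = check_currency(batch)
```

Here `problems` holds one entry, for the second row. Returning a list rather than raising on the first failure lets the test report every bad row in a batch at once.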
Figure 3 maps these best practices to the ML lifecycle.
Figure 3: ML Modeling Lifecycle and DataOps Best Practices
My objective with this article was to help data engineering and data science leaders understand plainly the ability of DataOps to minimize the risk and maximize the upside of ML outcomes. Fraud detection is just one of many ML use cases with a high impact on both sides of the ledger.
In future articles we will further explore DataOps best practices for machine learning, addressing ETL, data quality and other critical factors that shape the success of ML initiatives.