Data Management Best Practices For Machine Learning

For decades, machine learning evolved slowly because of the difficulty of emulating human thought and of processing sufficiently large datasets. Today, many enterprises are diving into machine learning with a more positive outlook, thanks to major advances in software development, CPU and GPU processing, and the availability of rich, massive data sets.

A common attribute of Artificial Intelligence (AI) systems, machine learning software uses statistical methods to make predictions and automatically teaches itself to improve prediction accuracy over time. We are in the early stages of a steep adoption curve: according to a recent TDWI survey, 17% of enterprises are driving decisions with machine learning today, but half expect to do so within three years.

Designing, deploying, and managing effective data architectures is critical to the success of machine learning initiatives. Implementations must support high-scale data flows flexibly and efficiently to minimize the risk of project-killing bottlenecks. The pressure is high because data volume and variety keep rising as organizations accelerate model training to improve the accuracy of results.

Here are recommended data management steps and best practices to drive successful machine learning initiatives.

  • Identify your machine learning use cases. Many analytics problems are well suited to self-teaching algorithms, including quantitative investing, customer recommendations, medical diagnosis, predictive maintenance, and fraud prevention. While internal business clients will often define these based on their operational objectives, data availability and the considerations in this article will help determine what is actually feasible. The following table shows sample operational objectives, machine learning use cases, and business data inputs, as well as opportunities for more advanced algorithms, for three business functions.

Examples of Operational Objectives and Machine Learning Use Cases

  • Define the dataset. For each use case, define the data set, its sources, and the frequency with which that data should be updated. These requirements will vary widely. Real-time use cases might need live transactional data, historical purchases, and clickstream data. Predictive maintenance use cases, meanwhile, might be based on sensor data feeds from mechanical equipment and histories of component installation and maintenance. Business clients can be imaginative in recommending data sources, so it is critical to assess carefully what is feasible given timelines and budgets.
  • Define data preparation requirements. Historical machine learning use cases often need standard preparation steps, including data collection, refinement, and delivery for production analytics. Real-time use cases, in contrast, might entail short-cutting these processes for the live data inputs while still depending on more prepared data sets for the historical component. Once the right procedures are established, imputation rules are required to replace missing data with substituted values. You will also need to establish the right data profiling and quality measures, to assess false positives and data skew, because in some cases the source system of record has incomplete metadata. A minimal imputation and profiling sketch appears after the table below.
  • Define your logical data usage patterns. Once use cases, data sets, and data preparation requirements are established, you can define how you intend to use the data – i.e., the decisions the machine learning algorithm will make, the necessary input data types, the temporality of those decisions, the data to transform, etc. The following table shows examples of these components, and a declarative sketch of one such pattern follows it.

Examples of Logical Data Usage Patterns
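
To illustrate how a logical usage pattern and its underlying dataset definition might be captured, here is a minimal Python sketch. The class, field names, and the fraud-prevention values are hypothetical illustrations, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class UsagePattern:
    """Declarative description of one machine learning use case.
    All fields and example values below are illustrative only."""
    use_case: str            # the decision the model will make
    input_data_types: list   # data sets / sources feeding the model
    update_frequency: str    # how often source data must be refreshed
    decision_latency: str    # temporality of the decision
    transformations: list = field(default_factory=list)  # data to transform

# Hypothetical example for a fraud-prevention use case.
fraud_pattern = UsagePattern(
    use_case="approve or flag a card transaction",
    input_data_types=["live transaction", "purchase history", "device profile"],
    update_frequency="streaming for live data, daily for history",
    decision_latency="in-stream, under 10 milliseconds",
    transformations=["aggregate spend by merchant category", "geo-distance features"],
)
print(fraud_pattern)
```

Capturing each pattern in a declarative form like this makes it easier to compare use cases and to spot where pipelines can share components.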
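
As a sketch of the imputation and profiling step described above, the following uses pandas on a hypothetical transaction extract; the file name, column names, and fill rules are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical transaction extract; file and column names are illustrative only.
df = pd.read_csv("transactions.csv", parse_dates=["event_time"])

# Simple imputation rules: replace missing values with substitutes
# appropriate to each column's meaning.
df["amount"] = df["amount"].fillna(df["amount"].median())      # numeric: median
df["channel"] = df["channel"].fillna("unknown")                # categorical: sentinel value
df["country"] = df["country"].fillna(df["country"].mode()[0])  # categorical: most frequent

# Lightweight profiling to flag completeness and skew issues
# before the data reaches model training.
profile = pd.DataFrame({
    "null_rate": df.isna().mean(),
    "distinct_values": df.nunique(),
})
print(profile)
print(df["amount"].describe())  # inspect the distribution for skew
```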

  • Architect a data pipeline for each use case. Next, define the data pipeline requirements for each machine learning use case individually, both for model training and for production. Because different use cases require distinct datasets, processing capabilities, decision frequencies, and so on, they might require different data pipelines. Adding to the complexity, each use case and pipeline might entail multiple machine learning models that are scored against one another. Once you assess the pipeline requirements for each use case in isolation, you can identify opportunities to combine pipeline components for efficiency while still meeting the requirements of each use case.
  • Select your data platforms, using data sources and processing requirements as your guide. A vibrant open source and commercial ecosystem has created a rich set of options. Data lakes based on the Hadoop Distributed File System (HDFS) or Amazon S3 can efficiently process large and complex data sets from varied sources, either in large batches with MapReduce or in near-real-time micro-batches with Spark (a minimal Spark-on-Kafka sketch appears after this list). Many organizations also use data lakes to transform complex data sets for focused analytics in structured data warehouses. Messaging systems such as Kafka and Amazon Kinesis can support real-time analysis of streaming business events and create a persistent store of records for historical analysis.
  • Plan for fast growth. When it comes to hardware planning, be sure to allocate significant buffer capacity in the processing power, storage, and network bandwidth that each data pipeline requires.  More than other analytics use cases, machine learning invites high growth in data volumes and sources for the reasons outlined earlier.
  • Streamline data flows. Change data capture (CDC) technology copies data and metadata updates in real time from RDBMS, mainframe, and other production sources, while reducing or eliminating the need for disruptive batch replication. CDC also improves scalability and bandwidth efficiency by sending only incremental data updates from source to target (a sketch of consuming such a change stream appears after this list).
  • Carefully consider requirements for model testing. Expect to test models repeatedly to optimize results, for example with A/B comparisons, both before and after going into production. This will require frequent adjustments to underlying data sets, as well as a complete change history. Change data capture technology can provide these change histories for source production databases, and streaming systems such as Kafka can replay them for re-examination (see the replay sketch after this list).
  • Plan for fast and frequent iterations to your production environment. Just as machine learning algorithms teach themselves, practitioners must work through their own trial-and-error process. Ongoing adjustments are required on several fronts, including feature engineering (changes to data inputs), selecting and replacing algorithms, and trying new “ensemble” algorithm combinations. By scoring algorithm models against one another, you can make the most informed decisions about how best to improve results (a scoring sketch appears after this list).
  • Monitor and refine data flows. Organizations should consider solutions that centrally configure, execute, monitor, and analyze tasks such as replication across dozens or potentially hundreds of end points. This assists performance management, troubleshooting and capacity planning. With a consolidated command center, you can ensure data remains available, current, and ready for machine learning analytics.
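
To make the platform options above more concrete, here is a minimal sketch of a Spark Structured Streaming job that reads business events from Kafka in micro-batches and lands them in a data lake for training and scoring. The broker address, topic name, and S3 paths are placeholders, and the Spark-Kafka connector package must be available in your environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Requires the spark-sql-kafka connector on the classpath; names are placeholders.
spark = SparkSession.builder.appName("ml-feature-ingest").getOrCreate()

# Read business events from Kafka in near-real-time micro-batches.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "business-events")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before downstream parsing.
parsed = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)

# Land the stream in the data lake (HDFS or S3) for model training and scoring.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3a://data-lake/business-events/")
    .option("checkpointLocation", "s3a://data-lake/checkpoints/business-events/")
    .start()
)
query.awaitTermination()
```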
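
CDC tools typically publish change records to a message stream; the sketch below consumes such a stream with the kafka-python client and applies only the incremental updates. The topic name and the change-record layout (an operation type plus before/after row images) are assumptions, since CDC products differ in their output formats.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical CDC topic; the record layout is an assumption for illustration.
consumer = KafkaConsumer(
    "orders.cdc",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
    group_id="ml-feature-updater",
)

# Apply only the incremental changes -- no bulk reloads required.
for message in consumer:
    change = message.value
    op = change.get("operation")        # e.g. insert / update / delete
    if op in ("insert", "update"):
        row = change.get("after", {})
        print("upsert into feature store:", row)
    elif op == "delete":
        print("delete from feature store:", change.get("before", {}))
```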
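
One way to use replayed change histories for model testing, in the spirit of the A/B comparisons above, is to re-read a topic's retained history from the earliest offset and score two candidate models on exactly the same events. The topic name and the two predict functions below are hypothetical stand-ins.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

def model_a_predict(txn):  # stand-in for the current production model
    return txn.get("amount", 0) > 500

def model_b_predict(txn):  # stand-in for the candidate model under test
    return txn.get("amount", 0) > 400 and txn.get("country") != "home"

# Re-read the topic's retained history from the beginning, without committing
# offsets, so both models see the same sequence of historical events.
consumer = KafkaConsumer(
    "transactions.history",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    consumer_timeout_ms=10_000,  # stop once the replay catches up
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

disagreements = 0
total = 0
for message in consumer:
    txn = message.value
    total += 1
    if model_a_predict(txn) != model_b_predict(txn):
        disagreements += 1

print(f"A/B disagreement rate: {disagreements / max(total, 1):.2%} over {total} events")
```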
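
Finally, as a sketch of scoring candidate algorithms against one another, including an “ensemble” combination, the following uses scikit-learn on synthetic data; the models and features are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prepared training set.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
# "Ensemble" combination of the individual candidates.
candidates["voting_ensemble"] = VotingClassifier(
    estimators=list(candidates.items()), voting="soft"
)

# Score every candidate on the same folds so the comparison is apples to apples.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:20s} mean AUC = {scores.mean():.3f}")
```

Keeping comparisons like this on identical folds, and re-running them as features and algorithms change, is what makes the iteration loop described above informative rather than anecdotal.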

One of the world’s largest payment processing firms serves as an ideal case study for these best practices, including the deployment of specialized pipelines for different use cases and the use of CDC to streamline data flows. The firm uses machine learning to continuously improve the accuracy of automated decisions and thereby improve both revenue and margins. It makes three types of decisions on real-time transactions: whether to accept the payment card, the level of transaction risk, and the processing cost. It must support more than 40 million transactions per day, with peaks of more than 1,000 transactions per second.

The firm architected two data pipelines. First, an intelligent authentication service makes in-stream decisions on transactions, including whether to require a customer to provide additional identifying information. This engine weighs the risk of fraud against the risk of users abandoning a valid transaction, then makes its recommendation, typically in less than 10 milliseconds. The second data pipeline uses CDC to publish live transactions from the production database to a Kafka message stream, which in turn feeds a Hortonworks Hadoop data lake in which Spark-based models retroactively assess the accuracy of those in-stream authentication decisions. The firm also periodically tests the expected behavior of new recommendation-engine versions before deploying them, using change histories provided by CDC.

With a stable, efficient data architecture in place, you can realize the promise of machine learning and continuously improve the results you deliver to the business. You can address new use cases and accommodate new data sets. You can also consider extending into deep learning, which tackles more complex problems by layering models so that the output of one layer of learning becomes the input to the next. These opportunities and more all start with the data architecture best practices outlined in this article.

About the authors: Kevin Petrie is Senior Director of Product Marketing at Attunity, a provider of data integration software, and Jordan Martz is Director of Technology Solutions at Attunity.