Feature Soup: SparkBeyond Automates Feature Engineering

Data Science & AI Efficiency with Automation

While automating the creation of features to feed your machine learning algorithms can be a tad boring compared to creating new AI algorithms, it is nonetheless at the vanguard of advancements in AI. We should pay attention and learn a bit about how it works and where it can be used.

Features are more important than AI algorithms

Features are really important. They are the inputs to any machine learning algorithm and dramatically impact the accuracy of the model produced from them. Just like the difference between feeding your dog good food vs. bad food, what comes out the other end is very dependent on the quality of what goes in. (If you have a dog you know exactly what I am talking about.) See figure 1 for a more pleasant visual metaphor of the importance of features.

For example, I used to work with the head of the analytics department at one of the largest pharmaceutical companies. He once told me:

“if you have the right features you can probably solve any problem by just using linear regression…”

While this is perhaps a bit of an overstatement, he was noting what many data scientists understand: the ‘features’ in any machine learning problem define the shape of the multidimensional space within which an algorithm searches for the best model. If you can stretch, fold, and mold that space correctly, you can make the job much easier. Find the right features and that space becomes very simple to search.

Figure 1: Features are often more important than algorithms in creating predictive models.

Examples of feature spaces

For example, if you provide an unprocessed binary image to a learning algorithm as a 1024 x 1024 matrix of ones and zeros, the learning algorithm will be searching in a space of 2^(1024 x 1024) possible inputs – which is quite large. If, on the other hand, you could feed your learning algorithm higher-level features such as lines, curves, geometric objects, the number of crossing lines and the number of holes, you could dramatically decrease the size of the input space that must be searched.

In figure 2 it is easy to see the difference in the feature spaces if you ask yourself the question: Which is an easier feature space in which to classify an image of a number?

  1. The 1 million black and white pixel values of the image:
    0100101001101010100100010100000100100010001010…
  2. Or a higher-level feature description:
    “This image has two holes in it, two circles, one point where lines cross and no straight lines …”

These higher-level features make it much simpler to recognize that the image is the number eight.

Figure 2: A hand-drawn image of the number eight is easier for an algorithm to classify when it is fed the right features.
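
To make the difference concrete, here is a minimal sketch in Python (an illustration, not how any particular product does it) that reduces a raw binary digit image to a handful of higher-level shape features. It assumes the image arrives as a 2-D NumPy array of ones and zeros, and the hole counting is deliberately naive; the point is only the drop from a million raw inputs to a few descriptive ones.

```python
import numpy as np
from scipy import ndimage


def simple_shape_features(image: np.ndarray) -> dict:
    """Reduce a binary 0/1 image to a few high-level shape features."""
    # Raw representation: one input dimension per pixel (e.g. 1024 x 1024).
    n_pixels = image.size

    # Connected ink regions: a figure eight drawn in one stroke has 1.
    _, n_strokes = ndimage.label(image)

    # Holes: background components that never touch the image border.
    # A figure eight has 2.
    background, _ = ndimage.label(image == 0)
    border_labels = np.unique(np.concatenate([background[0, :], background[-1, :],
                                              background[:, 0], background[:, -1]]))
    enclosed = set(np.unique(background)) - set(border_labels) - {0}

    return {"raw_dimensionality": n_pixels,
            "stroke_components": n_strokes,
            "holes": len(enclosed)}
```

A learning algorithm fed these few summary numbers has a far smaller space to search than one fed the million raw pixel values.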

Generating 82,000 features from 4 columns

I recently spoke with researchers at SparkBeyond about their automated tools for feature engineering. They are a leader in this space among many competitive offerings, like those from IBM or DataRobot, and open source offerings like Featuretools from Feature Labs.

SparkBeyond provides a simple but powerful example of automated feature engineering by using the classic Titanic data set that many noob data scientists cut their teeth on.

SparkBeyond’s system automatically generated features from the text strings stored in the passenger name field. Normally a passenger’s first or last name would not be predictive of surviving a nautical disaster (passengers named Steve are no more likely to survive than passengers named Stan), but in this case their system detected a few short character combinations that were highly predictive.

One of these predictive sequences was the three characters “m” “r” “.”.

Though it may be obvious to a human being that “Mr.” is synonymous with being male and that being male was a strong predictor of not surviving the Titanic disaster, pulling this out of the text without background knowledge is a challenging task to automate.

To find these valuable features, SparkBeyond’s system generated over 82,000 possible features from just 4 columns of data in the dataset. The 4 columns were specifically chosen to be the most unstructured, noisiest and most difficult to extract a signal from.

Most of the generated features did not help improve the model and were discarded. The system kept only the best ones, with the most predictive potential. All 82,000 features were generated and evaluated in under 7 seconds. This is not something that an individual data scientist could accomplish without days of effort.
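
To get a feel for how such a feature can surface automatically, here is a hedged sketch (not SparkBeyond's actual system) that enumerates every short character sequence in the passenger names and scores each candidate against survival using mutual information. The file path is illustrative and the column names ("Name", "Survived") assume the standard Kaggle Titanic CSV.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("titanic.csv")  # illustrative path to the Titanic data set

# Generate thousands of candidate features: every 2- and 3-character
# sequence that appears in a passenger name, e.g. "mr", "mr.", "ss.".
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3),
                             lowercase=True, binary=True)
X = vectorizer.fit_transform(df["Name"])

# Score each candidate by mutual information with the survival label.
scores = mutual_info_classif(X, df["Survived"], discrete_features=True)
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda pair: -pair[1])
print(ranked[:10])  # fragments like "mr." tend to rank near the top
```

The point is the brute-force enumeration: thousands of candidate features are generated mechanically, and only the informative ones are kept.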

An analogy: feature soup

You can think of the automation of feature creation as similar to the process of creating a soup with everything you have in your refrigerator. Find a big pot and put everything into it (old carrots and wilted celery included).

Just like this ‘everything including the kitchen sink’ soup, a ‘feature soup’ includes all possible data that you might be able to get your hands on (even the obscure and dirty stuff that you don’t think has any value). The features are found, created, and mixed but, unlike a real soup, you can filter out and keep the parts that actually make the feature soup better.

This filtering process is the key. It is easy to create features but difficult to know which ones are valuable and make the soup taste better (i.e. improve the prediction accuracy of the models using the features). 

Here is a recipe for making ‘feature’ soup:

  1. Automate the search of the internet for snippets of code and algorithms that manipulate data into features. Take anything you can find that looks promising.
  2. Grab any data that might be relevant, even if it is messy, by looking into all the backwaters of your data lake or even scraping web pages, looking up people on the web, or using Wikipedia resources such as the graph database from DBpedia.
  3. Apply these algorithms to your data to create a large ‘soup’ of possibly valuable features.
  4. Apply a filter to the features, such as information gain/entropy reduction or conditional mutual information, to remove features that are likely to be non-productive.
  5. Plug the most promising features into your favorite AI algorithm from scikit-learn, H2O, DataRobot, or others to build predictive models and let these algorithms sort out the most predictive features. (A rough code sketch of steps 4 and 5 follows.)
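
Here is the promised rough sketch of steps 4 and 5, assuming you already have a wide feature matrix and a target from the generation steps. The synthetic data, the choice of k, and the gradient boosting model are all illustrative stand-ins, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for the feature soup: 2,000 mostly useless columns, a few informative.
X_soup, y = make_classification(n_samples=1000, n_features=2000,
                                n_informative=20, random_state=0)

soup_pipeline = Pipeline([
    # Step 4: keep only the features carrying the most information about
    # the target; the rest of the soup is poured out.
    ("filter", SelectKBest(mutual_info_classif, k=200)),
    # Step 5: any downstream learner will do; gradient boosting is one choice.
    ("model", GradientBoostingClassifier()),
])

# Cross-validation checks whether the surviving features actually make the
# soup taste better, i.e. improve prediction accuracy.
scores = cross_val_score(soup_pipeline, X_soup, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```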

A real-world example

Consumers are increasingly expecting low-touch, low-friction products for life insurance. In the past, invasive qualifying evidence, such as blood tests or doctor’s reports, created friction with the client and reduced the number of applications. Ideally, a consumer would like a policy proposed to them with little more than an online application. While this can be done in some cases, it carries risk for the insurance company if they misprice a policy based on missing information.

SparkBeyond has worked with an insurance company to improve morbidity prediction for underwriting new life insurance policies. By utilizing eleven new data sources and automated feature creation, they increased the share of policies for which risk could be assessed non-invasively from 7% to 70%. About half of these applications could then be underwritten without any manual intervention.

This automation freed up manual labor and increased customer satisfaction and close rates. It also resulted in better predictive models than the previous manual process. This improvement over a heavily expert-based process was attributed to the fact that the automated system could search much more deeply and broadly for predictive features, without being limited by the biases toward preferred data or algorithms that are unavoidable with human analysts.

Algorithms and model management are being commoditized

Focusing on features to improve the predictive accuracy of models is not a new approach for data scientists. What has changed recently is that tool creators are now passionately trying to automate this heretofore manual process.

This shift has come about because progress has plateaued in the discovery of new AI learning algorithms (or at least there are now many years of opportunity for new applications to make use of the recent algorithmic breakthroughs). When this first started to happen (let’s call it the commoditization of algorithms), tool providers shifted focus from algorithms to model management, which was then the weakest point in the production line. This has borne fruit, and tools now provide many strong solutions for model management across the data science production line.

The next area of focus (and the current weakness in the data science production line) is the automation, and eventual commoditization, of feature engineering. Figure 3 shows this evolution of automation.

Figure 3: The efficacy of data science has been rapidly progressing via the automation of various parts of its production line.

The hidden benefits of automated feature engineering

In addition to dramatic savings of staff hours and improvements in model accuracy, the automation of feature creation has several other benefits:

It decreases cognitive bias. All possible features are constructed and tested, not just the ones that the data scientist is familiar with or happens to think of.

It automates the finding of data. Since it is automated, it can look everywhere in the data ecosystem, including places a particular data scientist may not have thought of or might incorrectly dismiss.

It encourages automation. Automation in one part of a system puts pressure on every other part that interacts with it to standardize and automate. That is why WalMart vendors automate their supply chain processes if they want to work with WalMart. In the case of automated feature creation, you will be strongly encouraged to automate your metadata as well. When feature engineering is automated, it will no longer be feasible to retrieve metadata by asking people where data resides or what it means.

It fits the data science production line. Feature automation is not creating a new step in the production line which might be disruptive to current business processes. It is simply making an existing piece of the process faster, bigger, and better.

It is way faster. Data scientists and AI researchers are expensive animals to feed and maintain, and the automation of feature engineering allows them to be more productive and removes some of the drudgery from their jobs. This will make them happier and increase their tenure at your company.

Four years from now

In the not-too-distant future, the automation of feature engineering will be well understood and many vendors will be providing tools to automatically create features. Here are some more specific predictions for the next four years:

Commoditization is coming. Just like AI algorithms and model management, feature engineering will be commoditized within the next four years. It will become cheaper and easier to use and relieve much of the burden from the data scientist. The data scientist, in turn, will have time freed up to focus on exploration and solving new AI challenges.

Explainable algorithms will dominate. Simpler and more transparent machine learning algorithms will begin to predominate. For instance, a lot of the pre-processing that happens in the early layers of a convolutional neural network (used for deep learning) may no longer be required. New feature creation tools can provide those features as input to simpler neural networks. Less complex but more understandable predictive algorithms such as nearest-neighbor, clustering, or decision trees may see a resurgence of use as improved features make them competitive in accuracy.

Feature creation platforms will be assimilated. Companies like SparkBeyond will be highly desirable acquisition targets for companies with full-service data science platforms. This will be especially true if their platforms are integrated tightly with, and make use of the metadata provided by, master data management tools.

Humans get more vacation time. Tagging of metadata and overall human intervention will become less important and more automated as these systems figure out for themselves the semantics and the syntax of the data. A nice by-product will be that metadata will be enriched and its production will become more automated.

Related articles: 

The Rise of Data Science Platforms: Key Features for Automating Analytics and Driving Value

DataOps for Machine Learning: A Guide for Directors of Data Engineering and Data Science

Four Signs You Need A Data Science Platform

DataOps Explained: A Remedy For Ailing Data Pipelines

Don't Underestimate the Power of Stupid Artificial Intelligence Algorithms

Commodity AI and the Next Best Experiment
