Deep Learning, AI and Privacy
ODSC Finally Explains Deep Learning
Last week I spoke at the Open Data Science Conference in Boston. My talk was on whether it is possible to automate predictive analytics, or, in other words: “Could we really ever have a ‘citizen data scientist’ who isn’t actually a data scientist?”
My answer was: “No, we can’t create a data scientist out of someone who isn’t one.” But it was also: “Yes, we can automate data science tools so that they benefit both the power user and the novice.”
But I had a caveat to my answers. Just like you don’t want too many people playing with plutonium or inexperienced carpenters playing with power saws, you may not want the average business user employing data science techniques. At least you don’t want novice users doing so unless they are using products that keep them from injuring themselves and others.
Consider the example of my ‘most expensive’ employee from a company I used to run. This employee was not a skilled data scientist and he didn’t follow our best practices, so he ended up applying a predictive model multiple times to the same data. The result was that our mail vendor sent each prospect in a large customer list multiple free samples of an OTC healthcare product. The net cost to my company was over $200,000.
With this example and others, I showed in my talk that data science can be a powerful force for good, but it can also easily go awry. I proposed 12 product features that every advanced analytics tool should have to keep both experts and novices much safer. Stay tuned for my next paper, where I’ll discuss them in some detail.
Deep Learning Just Got a Lot Easier with TensorFlow
Soumith Chintala from Facebook led an interesting talk on PyTorch, a new deep learning framework. He and the other presenters did a nice job explaining deep learning and confirmed that ‘deep learning’ really is just another name for using an Artificial Neural Network (ANN). Deep learning uses the same layers of nodes and connections, and the same functions for weighting and summing inputs from previous layers, that ANNs always have.
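To make that concrete, here is a minimal sketch (in plain NumPy, with made-up shapes and random weights) of the building block both terms describe: each layer takes the previous layer’s outputs, computes a weighted sum plus a bias, and passes the result through a nonlinearity.

```python
import numpy as np

def dense_layer(x, weights, bias, activation=np.tanh):
    """One fully connected layer: weighted sum of the previous layer's outputs, then a nonlinearity."""
    return activation(x @ weights + bias)

# Toy forward pass through a two-layer network (shapes are illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                                  # one input with 4 features
h = dense_layer(x, rng.normal(size=(4, 8)), np.zeros(8))     # hidden layer with 8 nodes
y = dense_layer(h, rng.normal(size=(8, 2)), np.zeros(2))     # output layer with 2 nodes
print(y)
```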
They also demonstrated the new TensorBoard visualization tool. If you are interested in learning more about TensorFlow and TensorBoard, there are some excellent short introductions online, and the basic idea is easy to try.
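As a rough sketch (using the current tf.summary API; the log directory and loss values are made up for illustration), logging a metric for TensorBoard looks like this:

```python
import tensorflow as tf

# Write a scalar "loss" curve that TensorBoard can plot.
writer = tf.summary.create_file_writer("logs/demo")   # hypothetical log directory
with writer.as_default():
    for step in range(100):
        fake_loss = 1.0 / (step + 1)                   # placeholder value, just for the demo
        tf.summary.scalar("loss", fake_loss, step=step)
writer.flush()

# Then point the UI at the logs:  tensorboard --logdir logs/demo
```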
AI at Teradata
Ron Bodkin, the CTO of Services and Architecture at Teradata, gave a very interesting talk on Artificial Intelligence efforts at Teradata. He emphasized the importance of building cross-functional teams when executing AI projects and restated a Gartner prediction that by 2020 AI will be a top-five investment priority for more than 30% of CTOs. He also mentioned that Keras is looking promising as a high-level API running on top of TensorFlow.
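For a sense of what “high-level” means here, this is roughly what defining a small network in Keras looks like (a sketch only; the layer sizes and 10-feature input are invented for illustration):

```python
from tensorflow import keras

# Keras describes the network layer by layer; TensorFlow does the numerical work underneath.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # hypothetical 10-feature input
    keras.layers.Dense(1, activation="sigmoid"),                   # single binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```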
Privacy in Data Science
Jim Klucar from Immuta gave a great talk on privacy techniques in regulated environments. He argued that the GDPR (the EU’s General Data Protection Regulation) will be the most important change the data privacy industry has seen in 20 years.
He provided several very interesting examples where anonymized data was re-identified by combining it with other information (a.k.a. ‘data fusion’). For example, the highly publicized Netflix Prize provided anonymized Netflix user data for research purposes. Several data scientists were able to re-identify many users in the database by noting that people who reviewed movies on Netflix often also reviewed them on IMDb; by matching similar reviews between the two databases, many of the individuals could be identified. Netflix has since withdrawn the database.
Another example occurred when New York City taxi data was made available to data scientists via a freedom of information (FOIA) request. But the taxi ID fields were poorly anonymized, and several celebrities were matched to trips in the taxi database using paparazzi photos stamped with the same time and place. This led to the revelation of the tipping behavior of celebrities like Bradley Cooper.
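The published write-ups of this incident attributed the weakness to hashing the taxi IDs without a salt over a small, structured space of possible values, so every possible ID could simply be hashed and looked up. Here is a rough sketch of that failure mode, assuming MD5 and a simplified four-character medallion format chosen purely for illustration:

```python
import hashlib
import string
from itertools import product

# If the ID space is small and the hash is unsalted, an attacker can hash
# every possible ID once and then reverse any "anonymized" value by lookup.
def build_lookup():
    table = {}
    for digit, letter, rest in product(string.digits, string.ascii_uppercase,
                                       (f"{i:02d}" for i in range(100))):
        medallion = f"{digit}{letter}{rest}"
        table[hashlib.md5(medallion.encode()).hexdigest()] = medallion
    return table

lookup = build_lookup()                            # only 26,000 candidates to hash
leaked_hash = hashlib.md5(b"5X44").hexdigest()     # pretend this came from the released data
print(lookup.get(leaked_hash))                     # -> 5X44
```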
There are new ways to anonymize data while it is being collected. One interesting technique (a classic ‘randomized response’ design) goes as follows: if you want to collect sensitive mental health information, ask people to think of a number between 1 and 5 and then say, “If you thought of the number 4, OR if you suffer from depression, respond affirmatively.” People who suffer from a sometimes under-reported problem will feel safer answering, because they can plausibly claim they responded only because they picked the matching number.
Data scientists can later subtract out the expected share of people who answered yes simply because they picked the number 4 and still get valid, useful estimates from the data, even though they can’t know for sure whether any one person has depression.
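The arithmetic behind that correction is simple: one fifth of respondents say yes regardless, so the observed yes-rate is 1/5 + (4/5) times the true rate, and the true rate can be recovered by inverting that formula. A minimal simulation (the 15% “true” prevalence is made up purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.15                                   # assumed prevalence, for simulation only
n = 100_000

depressed = rng.random(n) < true_rate
picked_four = rng.integers(1, 6, size=n) == 4      # each person picks 1-5, so a 1-in-5 chance
answered_yes = depressed | picked_four             # yes if they picked 4 OR suffer from depression

# Observed yes-rate = 1/5 + (4/5) * true_rate, so invert:
estimated_rate = (answered_yes.mean() - 0.2) / 0.8
print(f"true rate {true_rate:.3f}, estimated rate {estimated_rate:.3f}")
```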
The downside is that all of these processes for obscuring and privatizing data incur either a sizable computational load (e.g. for encrypting and then decrypting the data) or some degradation in the quality of the research and outcomes that data scientists can deliver. Privacy does have its costs.
There is also something called a ‘model inversion’ attack, in which researchers took a neural network trained on thousands of images of people’s faces and reverse-engineered it to reveal a recognizable image of one of the people in the training data. So just because you’ve crunched your large private database down into a sophisticated model of neural-network link weights doesn’t mean it is impervious to giving up secrets about the original data.
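The core idea is easy to see on a toy model. This is not the attack from the talk, just a minimal sketch on a single sigmoid unit with made-up weights: starting from a blank input, gradient ascent on the model’s confidence pulls the input toward whatever the model considers a match, which for a face classifier can end up looking like a real face from the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)                 # pretend: trained weights for one "person" class
b = 0.0

def confidence(x):
    """Sigmoid score the model gives to the target class."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Model inversion idea: start from nothing and climb the confidence gradient.
x = np.zeros(64)
learning_rate = 0.1
for _ in range(200):
    p = confidence(x)
    grad = (1 - p) * w                  # gradient of log-confidence with respect to the input
    x += learning_rate * grad           # nudge the "image" toward what the model calls a match

print(f"confidence after inversion: {confidence(x):.3f}")
```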
Deep Learning at Facebook
Brandon Rohrer gave a great talk: “How Deep Neural Networks Work and How We Put Them to Work at Facebook”. He showed how two of the most important techniques in deep learning, convolution and pooling, work.
Convolution works by automatically breaking an image up into smaller pieces or fragments and then sliding them over the original image to see where they match. So, for instance, two hand-drawn images of the letter ‘X’ might not match each other very well due to differences in penmanship or style, but both have at least one place where two lines cross. That crossing point may sit in a different part of each image, but it exists in both, and it helps identify the two images as the same letter.
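Here is a minimal sketch of that sliding-match idea in NumPy (the tiny 5x5 ‘X’ image and 3x3 ‘crossing’ fragment are invented for illustration): the fragment is scored against every position in the image, and the resulting feature map peaks wherever the fragment matches, no matter where in the image the crossing happens to sit.

```python
import numpy as np

# A tiny hand-drawn "X" and a 3x3 fragment that looks like two lines crossing.
image = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=float)
crossing = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

# Slide the fragment over the image and score the match at every position.
rows, cols = image.shape[0] - 2, image.shape[1] - 2
feature_map = np.zeros((rows, cols))
for i in range(rows):
    for j in range(cols):
        feature_map[i, j] = np.sum(image[i:i + 3, j:j + 3] * crossing)

print(feature_map)   # the largest score sits where the two strokes actually cross
```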
Another technique in the deep learning arsenal is the ReLU (Rectified Linear Unit) function. The ReLU outputs 0 when its input is negative and the input value itself when it is positive. This simplification makes it much cheaper to compute than the sigmoid function used in the original ANNs.
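Side by side, the two activations look like this; the ReLU is just a comparison with zero, while the sigmoid needs an exponential for every node.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # 0 for negative inputs, the input itself for positive

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # the classic ANN activation: needs an exponential

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))      # [0.  0.  0.  0.5 2. ]
print(sigmoid(x))   # smooth values squashed between 0 and 1
```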
Data Science is Software
There was also a workshop titled “Data Science is Software”, led by Peter Bull. He argued that we should hold data science to the same standards of reproducibility that we now expect of computer programming. He pointed out that when his company ran data science competitions, they would receive entries in a wide variety of formats, which made it difficult for them to reproduce the data science models. He proposed that the standardized formats, QA, and archiving techniques that have become well developed and commonplace in writing computer code should be applied to predictive models as well.
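To give a flavor of what that looks like in practice, here is a small sketch of one such habit borrowed from software engineering: pin the random seed, wrap training in a function, and add an automated test that retraining produces identical predictions (the data, model, and helper names here are all invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_model(X, y, seed=0):
    """Training wrapped in a function with a pinned seed so results are reproducible."""
    return LogisticRegression(random_state=seed, max_iter=1000).fit(X, y)

def test_training_is_reproducible():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    preds_a = train_model(X, y).predict(X)
    preds_b = train_model(X, y).predict(X)
    assert (preds_a == preds_b).all()   # same data + same seed -> same predictions

test_training_is_reproducible()
print("reproducibility check passed")
```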
Four Years From Now
ODSC East 2017 was an amazing conference with some 3,000 attendees. Here are some predictions I’ll make based on what I learned. Four years from now, in 2021:
- Keeping data private and secure will be largely solved, but with some significant degradation to the power of predictive analytics (i.e. data scientists will no longer get easy access to any and all data).
- Brute-force and simple process solutions that have proven effective in other domains (e.g. secrecy in the military) will gain traction. Examples would be locking your data in a vault with restricted access and having very specific processes for which data can be combined and by whom. Bureaucracy to the rescue.
- There will be a Cambrian-like explosion in the use of deep learning for image recognition tasks: everything from sorting cucumbers (true) to tracking manatees (true).
- Open source tools like TensorFlow will continue to dominate among data science power users, but expect standards for unstructured data and mainstream assimilation of deep learning techniques to bring more traditional vendor solutions for the less-than-rocket-scientist data scientist.
On a personal note, I have seen the light. The research into deep learning tools and techniques really has produced some amazing breakthroughs in the last decade. The skeptic in me still needs to point out that all of these techniques already existed in 1996. But like the Apollo space program, where the science and the technology existed in 1960, it still took a steady evolution of small steps to eventually put a man on the moon in 1969. Now, with deep learning, the next person on the moon will be touring the lunar landscape in a self-driving car.