Validating Models: A Key Step on the Path to Artificial Intelligence


To stay competitive in a digital economy, businesses increasingly need to move beyond simple reporting and descriptive analytics to a more predictive approach that puts artificial intelligence (AI) strategies to work to engage with customers in new ways.

So how can you find a practical way to start applying AI in your business? One path forward follows three steps: leverage predictive models to improve how you engage with customers, put machine learning to work to improve those models, and then validate your models. In an earlier blog, I explored the dynamics of predictive analytics and machine learning. In this post, I will focus on the validation of predictive models. First, let me provide a quick overview of predictive analytics and machine learning, and explain why validation is important when you apply these approaches.

Predictive analytics

Predictive analytics is about using algorithms to predict the result of a measurement that you can’t make, based on measurements that you can make. Why can’t you make the measurements you need? Perhaps you are trying to predict what’s going to happen next. Unless you have a time machine (in which case you are probably wasting your time on analytics!), you can’t measure something that hasn’t happened yet. Forecasting customer behavior, business trends and future events is always of value in running a business.
Often, it is not possible to measure something in the present because of practical considerations. For example, suppose you want to promote a lunch-ordering application for your chain of lunchtime food delivery restaurants. Your best chance of getting someone’s attention is when they are hungry and online. Obviously, it’s not possible to ask everyone who is online whether they are hungry, so you need to infer their hunger status from their behavior: Is it lunchtime? Are they looking up lunch options? The goal of your predictive analytics in this case is to infer who is most likely to respond to your product or offer, based on the data you are able to collect.
In general, your predictive analytics application can take into account a customer’s past account history, past conversations with the call center, behavior of “similar” customers, the location of the customer, and even what’s trending on social media at any given time. Good predictive analytics will give you the best chance of a mutually beneficial interaction with your customer.
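To make this concrete, here is a minimal sketch of what such an inference might look like in Python. The signal names and weights are entirely hypothetical; a real system would learn them from historical customer-response data rather than hard-coding them.

```python
# Hypothetical behavioral signals and weights; a real system would
# learn these weights from historical customer-response data.
WEIGHTS = {
    "is_lunchtime": 0.5,
    "browsing_lunch_options": 0.4,
    "near_restaurant": 0.1,
}

def hunger_score(signals):
    """Combine boolean behavioral signals into a 0-to-1 likelihood score."""
    return sum(WEIGHTS[name] for name, present in signals.items() if present)

score = hunger_score({
    "is_lunchtime": True,
    "browsing_lunch_options": True,
    "near_restaurant": False,
})
print(score)  # 0.9
```

The point of the sketch is only the shape of the problem: many weak signals combine into one prediction about something you cannot measure directly.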

The challenge is that, compared with diagnostic and descriptive analytics, predictive analytics is a new world. You are actually making predictions or inferences based on past data. To be successful at this, and to avoid making grossly inaccurate predictions (or at least understand how accurate your predictions may be), you will need to validate your models to ensure that you have discovered useful, generalizable patterns in your data.

Machine learning as part of a predictive analytics system

So how do we build a good predictive analytics application? Two words: machine learning (ML). Predictive analytics leverages machine learning algorithms to build systems that learn iteratively from data, identify patterns, and predict future results. Machine learning algorithms organize things into meaningful groups, find unusual patterns in data, and can predict the next data point in a time series.
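As a tiny, self-contained illustration of that last capability, predicting the next point in a time series, consider a moving-average predictor. This is a deliberately minimal sketch, not a production forecasting method:

```python
def predict_next(series, window=3):
    """Predict the next value as the mean of the last `window` points,
    i.e., learn a simple pattern from past data to forecast the future."""
    recent = series[-window:]
    return sum(recent) / len(recent)

daily_orders = [10, 12, 11, 13, 12, 14]
print(predict_next(daily_orders))  # 13.0
```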

There’s an important caveat to call out here. Machine learning algorithms learn from data, but on their own they are not great at distinguishing between memorizing past data and finding generalizable underlying patterns in data. When learning from past data for predictive analytics, the goal is to generalize, not memorize. A poorly constructed ML algorithm can memorize all of the data in a huge data set, resulting in a system that is very poor at predicting the outcome of any situation it hasn’t already seen in the past. Instead, you need to train ML algorithms to focus on a limited number of free parameters that enable reliable predictions about the future. ML algorithms that generalize well are called “robust” algorithms.
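The difference between memorizing and generalizing can be shown in a few lines of plain Python. The two "models" below are deliberately extreme toy examples: a lookup table that stores every training point verbatim, and a two-parameter least-squares line that is forced to capture the underlying trend.

```python
def memorizing_model(train):
    """'Learns' by storing every (x, y) pair verbatim."""
    table = dict(train)
    # Fails (returns None) on any x it has never seen before.
    return lambda x: table.get(x)

def linear_model(train):
    """Fits y = a*x + b by least squares: only two free parameters,
    so it must capture the underlying trend rather than memorize points."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    b = my - a * mx
    return lambda x: a * x + b

train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]  # noisy samples of roughly y = 2x
print(memorizing_model(train)(5))  # None: no idea about an unseen input
print(linear_model(train)(5))      # close to 10: the trend generalizes
```

The memorizer is perfect on its training data and useless everywhere else; the constrained model gives up a little training accuracy to gain predictive power on new inputs.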

The Importance of Validation

Given the concerns described above, how do you know if you can trust the results generated by a ML algorithm? You need to validate your predictive models.

Validation is the automated process of looking at all the data you have in different ways to determine how robust your predictive models are likely to be. To understand why validation is important, let’s look at the process of creating predictive analytics from data using machine learning models. This will help us see why not all models are created equal.

One way to proceed would be to create a model that you think is predictive, try it out for a while, and then see if it actually works. But there’s a big downside here. If the model doesn’t work, you could end up paying a big price in terms of lost opportunities and misguided business strategies. In practice, you will monitor the performance of your model in the “real world,” but you don’t want to rely on that approach alone.

A better way to validate is to split your available data into a training set and a test set. You will use the training set to train your ML algorithm to predict the known outcomes in your training data. You then try out the model on the data in the test set to see how well your predictions match the actual outcomes in the test set.

This gives you a number of quantitative performance measures for your ML model. If your model has poor performance on the test set, then you know that the ML model you came up with did not discover a generalizable pattern in your data and you will need to change your assumptions and build another model.
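That train-then-test loop can be sketched in plain Python. This is a toy illustration of the idea, not any particular library’s API; in practice you would use your ML library’s built-in splitting and scoring utilities.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction as an unseen test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def accuracy(model, test_set):
    """Quantitative performance measure: fraction of correct predictions."""
    return sum(1 for x, y in test_set if model(x) == y) / len(test_set)

# Toy data: the true label is 1 exactly when x >= 10.
data = [(x, int(x >= 10)) for x in range(20)]
training_set, test_set = train_test_split(data)

def model(x):
    return int(x >= 10)  # a hand-built "model" that matches the true rule

print(accuracy(model, test_set))  # 1.0: every held-out prediction is correct
```

A model that scored well on the training set but poorly here would be a sign of memorization rather than generalization.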

Going forward, you can tweak your model to try different training and test set data partitions and repeat the validation sequence multiple times: define your training set, train your algorithm, test your algorithm. Most ML libraries, like the Intel® Data Analytics Acceleration Library (Intel® DAAL), can automate this process for you using a variety of methods called “cross validation.” Cross validation enhances the likelihood that your algorithm will be reliable and robust.
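The k-fold variant of cross validation is simple enough to sketch by hand. The function names below are my own, and this toy version skips the fold-construction care a real library applies, but it shows the repeated split-train-score loop:

```python
def k_fold_scores(data, fit, score, k=4):
    """Generic k-fold cross validation: for each fold, train on the
    other k-1 folds and evaluate on the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = fit(training)
        scores.append(score(model, held_out))
    return scores

def fit_mean(training):
    """A trivially simple model: always predict the mean training target."""
    mean = sum(y for _, y in training) / len(training)
    return lambda x: mean

def mse(model, held_out):
    """Mean squared error on the held-out fold (lower is better)."""
    return sum((model(x) - y) ** 2 for x, y in held_out) / len(held_out)

data = [(x, 5.0) for x in range(12)]  # constant target: easy to predict
print(k_fold_scores(data, fit_mean, mse))  # [0.0, 0.0, 0.0, 0.0]
```

Because every data point gets a turn in the held-out fold, a consistently good set of scores is much stronger evidence of robustness than a single lucky train/test split.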

Tools for the Data Scientist

Intel offers various software technologies and hardware products to support the efforts of data scientists and application developers who want to use ML methodologies to extract value from huge datasets. Let’s take a few examples.

Intel spearheaded the initiative to create the open source Trusted Analytics Platform. TAP accelerates the creation of advanced analytics and machine learning solutions by streamlining the process of assembling big data tools and by automating the many steps required to build and publish big data applications. It’s a key platform for putting powerful analytics into the hands of every business decision maker. For a fuller explanation, see my earlier post discussing three attributes of TAP.

Intel also offers access to a range of open-source frameworks for machine and deep learning, as well as code and reference architectures for distributed training and scoring. You can explore these resources on our Machine Learning site.

And then there is also the hardware side of the story. The Intel® Xeon® processor E7 v4 family delivers the processing performance and large memory capacities required for real-time analytics on huge datasets. It’s ready for the largest high-volume workloads, like those in healthcare, energy, financial trading, and logistics applications.

Intel non-volatile memory (NVM), including Intel® Optane™ technology, speeds things up even more. It can greatly accelerate machine learning systems by reducing memory latency to a matter of just tens of nanoseconds. These are the kind of breakthroughs you can achieve with new 3D XPoint™ technology, which brings non-volatile memory speeds up to 1,000 times faster than NAND, the most popular non-volatile memory in the marketplace today.

The main takeaway for a data scientist: to achieve success in predictive analytics, you want to be confident that your predictive model is both reliable and robust. The process of validating your ML model gives you this confidence.