Driving Innovations in Machine Learning with Intel

We’ve long known that there are many tasks that computers can perform faster, and better, than humans. Of course, we still have to teach computers how to do these tasks, and with conventional programming techniques we have to be very specific about what computers should do and when. Machine learning changes this model. With machine learning, we’re essentially teaching computers how to learn what to do, and some of them are becoming better than we are at complex tasks. For example, machine learning is a key enabler of self-driving cars, and experts predict that these cars will eventually be safer than human-driven vehicles. That’s just one example of how machine learning is letting us use computers in new ways to do new things. At Intel, we are quickly moving machine learning from an academic pursuit to a driver of innovation and competitive advantage for businesses.

Machine learning is a technique implemented in software that lets a computer become better at a task the more it performs it. Rather than relying on explicit procedural programming (if X is true then do Y), it enables the computer to discern patterns in vast quantities of data. This allows the computer to develop neural network models representing these patterns and use these models to make decisions. While there have been simple implementations of this concept for years, much of our effort today is applied to enabling “deep learning”—the creation of complex, multi-level models that operate much the way humans do.
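
To make the contrast with explicit programming concrete, here is a minimal sketch in Python. It is not taken from any Intel library: the data set is a made-up, one-dimensional toy, and the names explicit_rule, train_perceptron, and learned_rule are hypothetical. A single artificial neuron (a perceptron) stands in for the much larger multi-level models used in deep learning.

```python
def explicit_rule(x):
    # Conventional programming: the programmer hard-codes the decision
    # ("if X is true then do Y").
    return 1 if x > 5.0 else 0


def train_perceptron(samples, max_epochs=5000, lr=1.0):
    """Learn a weight and bias from labeled examples instead of hard-coding a rule."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            prediction = 1 if w * x + b > 0 else 0
            error = label - prediction  # 0 when correct, +1 or -1 when wrong
            if error != 0:
                # Nudge the parameters toward the correct answer.
                w += lr * error * x
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            break  # every training example is now classified correctly
    return w, b


# Hypothetical labeled examples: small readings are class 0, large ones class 1.
samples = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
           (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
w, b = train_perceptron(samples)


def learned_rule(x):
    # The decision boundary was discovered from the data, not written by hand.
    return 1 if w * x + b > 0 else 0


for x in (0.5, 3.5, 6.5, 9.5):
    print(x, explicit_rule(x), learned_rule(x))
```

Nobody told the learned rule where the boundary lies; it was inferred from the examples, which is the shift in approach described above.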

There are two parts of the process: training and scoring. In the training phase we “teach” the computer by having it process huge amounts of example data with embedded clues. This happens in the data center and requires vast computing power. In the process of training, the computer creates a compact model that can be deployed to and executed by end devices—like sensor and actuator systems or even PCs and smartphones—that are encountering real-world data of the same kind. They apply the model to make decisions and to act on them. What’s key is that when the model is applied, the results are evaluated (scored) and fed back into the training process, so the system continues to learn and the model gets better and better.
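
The train, deploy, score, and feed-back cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "model" is a simple per-label average rather than a neural network, and the sensor readings, labels, and function names (train, score) are all hypothetical.

```python
def train(examples):
    """Training phase: condense labeled examples into a compact model.

    The model here is just one average value per label.
    """
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def score(model, value):
    """Scoring phase: apply the deployed model to a new reading."""
    return min(model, key=lambda label: abs(model[label] - value))


# "Data center" side: train on the initial labeled examples (made-up values).
training_set = [(1.0, "idle"), (2.0, "idle"), (8.0, "busy"), (9.0, "busy")]
model = train(training_set)

# "Device" side: the compact model is applied to real-world readings.
# The true labels are assumed to become known later so results can be evaluated.
field_readings = [(4.0, "idle"), (4.5, "idle"), (5.5, "idle")]
correct = 0
for value, truth in field_readings:
    if score(model, value) == truth:
        correct += 1
    training_set.append((value, truth))  # feed the evaluated result back

# Retrain with the feedback so the next deployed model is better.
model_v2 = train(training_set)

print("first model correct on", correct, "of", len(field_readings))
print("reading 5.5, first model:", score(model, 5.5))
print("reading 5.5, retrained model:", score(model_v2, 5.5))
```

In this toy run the first model misjudges the borderline reading, and the retrained model gets it right once the evaluated field results have been folded back into the training data.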

Machine learning requires both enormous computational ability that drives down the time to train and software frameworks that open up the possibilities of machine learning to software developers in every industry. At Intel, we’re working hard on both of those enablers and have announced significant advances this week.

The first advance is in computational ability. Machine learning requires parallel processing of vast amounts of mostly unstructured data—like video streams or sensor data feeds—in real time. Because of the specialized nature of this task, much of the preliminary effort has focused on specialized graphics processors rather than industry-standard computing platforms. However, this week we announced the Intel® Xeon Phi™ processor (earlier known as Knights Landing, or KNL). It’s the first bootable host processor specifically designed for highly parallel workloads, and the first to integrate both memory and fabric technologies.

We designed it from the ground up to eliminate bottlenecks and to scale out in a near-linear fashion across cores and nodes to reduce the time to train machine learning apps. For AlexNet, training on 128 KNL nodes takes about 50 times less time than training on a single node[1]. For GoogLeNet training, we achieve a scaling efficiency of 87% at 32 KNL nodes, 38% better than the best published result to date[2]. The processor is also binary-compatible with Intel® Xeon® processors, which allows it to support a broad set of workloads.
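
The figures in footnotes [1] and [2] can be checked with simple arithmetic. The Python sketch below assumes the common definition of scaling efficiency as speedup divided by node count. The text does not state which definition was used, so treat that as an assumption; the function names are illustrative only.

```python
def speedup(single_node_hours, multi_node_hours):
    """How many times faster the multi-node run finishes."""
    return single_node_hours / multi_node_hours


def scaling_efficiency(speedup_factor, nodes):
    """Assumed definition: fraction of ideal (linear) speedup actually achieved."""
    return speedup_factor / nodes


# AlexNet, footnote [1]: 39.17 hours on one node vs. 0.75 hours on 128 nodes.
alexnet_speedup = speedup(39.17, 0.75)

# GoogLeNet, footnote [2]: 87% efficiency at 32 nodes implies this speedup.
googlenet_implied_speedup = 0.87 * 32

# Comparison point in footnote [2]: 20x speedup on 32 GPUs.
comparison_efficiency = scaling_efficiency(20, 32)

print(f"AlexNet speedup on 128 nodes: {alexnet_speedup:.1f}x")
print(f"GoogLeNet implied speedup on 32 nodes: {googlenet_implied_speedup:.1f}x")
print(f"Comparison efficiency on 32 GPUs: {comparison_efficiency:.1%}")
```

Under this definition, the AlexNet training times give a speedup of roughly 52x, consistent with the "about 50x" claim, and a 20x speedup on 32 GPUs corresponds to 62.5% efficiency, in line with the 62% quoted in footnote [2].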

On the software side, we already provide access to machine learning functions on Intel® Xeon® processors with tools like the Intel® Math Kernel Library (Intel® MKL) and the Intel® Data Analytics Acceleration Library (Intel® DAAL). We’ve optimized those libraries to provide up to 30 times greater performance in deep learning applications[3]. Now, we are developing the open source Intel MKL-DNN (Math Kernel Library for Deep Neural Networks), optimized for machine learning. Available later this year, it integrates with and dramatically enhances the performance of machine learning applications developed with frameworks like Caffe* and Theano*. That provides a drag-and-drop interface to enable data scientists and programmers to create high-performance machine learning applications.

[1] Up to 50x faster training on 128 nodes as compared to a single node, based on AlexNet* topology workload (batch size = 1024) training time using a large image database running on one node with Intel Xeon Phi processor 7250 (16 GB MCDRAM, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux* 6.7 (Santiago), 1.0 TB SATA drive WD1003FZEX-00MK2A0 System Disk, running Intel® Optimized DNN Framework, training in 39.17 hours, compared to 128 identically configured nodes with Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors, training in 0.75 hours. Contact your Intel representative for more information on how to obtain the binary. For information on workload, see https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[2] Up to 38% better scaling efficiency at 32 nodes claim based on GoogLeNet deep learning image classification training topology using a large image database, comparing one node Intel Xeon Phi processor 7250 (16 GB, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux 6.7, Intel® Optimized DNN Framework with 87% efficiency to unknown hosts running 32 each NVIDIA Tesla* K20 GPUs with a 62% efficiency (Source: http://arxiv.org/pdf/1511.00175v2.pdf showing FireCaffe* with 32 each NVIDIA Tesla* K20s (Titan Supercomputer*) running GoogLeNet* at 20x speedup over Caffe* with 1 each K20).

[3] Up to 30X gain based on internal testing with LeTV Cloud (www.lecloud.com) for video detection. Results based on a system using Intel® Xeon® processor E5-2680 v3. The LeTV Cloud Caffe* optimization test compared a baseline using BVLC Caffe + OpenBLAS to Caffe optimized for Intel® Architecture + Intel® MKL.