Pushing Machine Learning to a New Level with Intel Xeon and Intel Xeon Phi Processors

Traditionally, there has been a balance of intelligence between computers and humans: all forms of number crunching and bit manipulation are left to computers, and intelligent decision-making is left to us humans. We are now at the cusp of a major transformation poised to disrupt this balance. There are two triggers for this. First, trillions of connected devices (the “Internet of Things”) are converting the large untapped analog world around us into a digital one. Second, thanks to Moore’s Law, beyond-exaflop levels of compute are becoming available, making a large class of inferencing and decision-making problems computationally tractable.

This leads to a new level of applications and services in the form of “Machine Intelligence Led Services”. These services will be distinguished by machines being in the ‘lead’ for tasks that were traditionally human-led, simply because computer-led implementations will reach and even surpass the best human-led quality metrics. Self-driving cars, where machines have literally taken the front seat, and IBM’s Watson machine winning the game of Jeopardy are just the tip of the iceberg in terms of what is computationally feasible now. This extends the reach of computing to largely untapped sectors of modern society: health, education, farming and transportation, all of which often operate well below the desired levels of efficiency.

At the heart of this enablement is a class of algorithms generally known as machine learning. Machine learning was most concisely and precisely defined by Prof. Tom Mitchell of CMU almost two decades ago: “A computer program learns, if its performance improves with experience”. Or alternately, “Machine Learning is the study, development, and application of algorithms that improve their performance at some task based on experience (previous iterations).” The human-like nature of machine learning is apparent in the definition itself.

The theory of machine learning is not new; its potential, however, has largely been unrealized due to the absence of the vast amounts of data needed to take machine performance to useful levels. All of this has now changed with the explosion of available data, making machine learning one of the most active areas of emerging algorithm research. Our research group, the Parallel Computing Lab, part of Intel Labs, has been at the forefront of such research. We seek to be an industry role model for application-driven architectural research. We work in close collaboration with leading academic and industry co-travelers to understand the architectural implications, in both hardware and software, for Intel's upcoming multicore/many-core compute platforms.

At the Intel Developer Forum this week, I summarized our progress and findings.  Specifically, I shared our analysis and optimization work with respect to core functions of machine learning for Intel architectures.  We observe that the majority of today’s publicly available machine learning code delivers sub-optimal compute performance. The reasons for this include the complexity of these algorithms, their rapidly evolving nature, and a general lack of parallelism-awareness. This, in turn, has led to a myth that industry standard CPUs can’t achieve the performance required for machine learning algorithms. However, we can “bust” this myth with optimized code, or code modernization to use another term, to demonstrate the CPU performance and productivity benefits.

Our optimized code running on Intel’s latest family of Xeon processors delivers significantly higher performance (often by more than two orders of magnitude) than the corresponding best-published performance figures to date on the same processing platform. Our optimizations for core machine learning functions such as K-means based clustering, collaborative filtering, logistic regression, support vector machine training, and deep learning classification and training achieve high levels of architectural, cost and energy efficiency.
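To make one of these core functions concrete, here is a minimal, unoptimized sketch of K-means clustering (Lloyd's iteration) in plain Python. It is not the optimized implementation discussed in this post; the function names `assign`, `update`, and `kmeans` are hypothetical and chosen only for illustration. The point to notice is that the assignment step treats every data point independently, which is the kind of structure that parallelism-aware code can exploit.

```python
def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point.
    Each point is handled independently, so this loop is data-parallel."""
    labels = []
    for pt in points:
        dists = [sum((p - c) ** 2 for p, c in zip(pt, cen)) for cen in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points.
    A centroid with no assigned points is left where it is."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [pt for pt, lab in zip(points, labels) if lab == k]
        if not members:
            new_centroids.append(old)
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iters):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return centroids, labels


# Toy data (an assumption for illustration): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
```

In a production setting the distance computations would typically be expressed as dense linear algebra and spread across cores and vector units; this sketch only shows the algorithmic skeleton.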

In most cases, our achieved performance also exceeds best-published-to-date compute performance of special-purpose offload accelerators like GPUs. These accelerators, being special-purpose, often have significantly higher peak flops and bandwidth than our general-purpose processors. They also require significant software engineering efforts to isolate and offload parts of computations, through their own programming model and tool chain. In contrast to this, the Intel® Xeon® processor and upcoming Intel® Xeon Phi™ processor (codename Knights Landing) each offer common, non-offload-based, general-purpose processing platforms for parallel and highly parallel application segments respectively.

A single-socket Knights Landing system is expected to deliver over 2.5x the performance of a dual-socket Intel Xeon processor E5 v3 family based system (E5-2697v3; Haswell) as measured by images per second using the popular AlexNet neural network topology. Arguably, the most complex computational task in machine learning today is scaling state-of-the-art deep neural network topologies to large distributed systems. For this challenging task, using 64 nodes of Knights Landing, we expect to train the OverFeat-FAST topology (trained to 80% classification accuracy in 70 epochs using synchronous minibatch SGD) in a mere 3-4 hours. This represents more than a 2x improvement over the result on the same-sized, two-socket Intel Xeon processor E5-2697 v3 based Intel® Endeavour cluster.
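For readers unfamiliar with the term, synchronous minibatch SGD can be sketched as follows. This is a toy, single-process illustration in plain Python on a one-dimensional linear model, not the multi-node implementation behind the figures above. The worker count, learning rate, and all function names are assumptions made for the example. Each simulated worker computes a gradient on its own shard of the minibatch, the partial gradients are combined (standing in for an all-reduce across nodes), and a single shared update is applied so that every worker continues from identical weights.

```python
import random


def shard_gradient(w, b, shard):
    """Sum of squared-error gradients for y ~ w*x + b over one worker's shard."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw, gb


def synchronous_sgd_step(w, b, minibatch, num_workers, lr):
    """One synchronous step over a minibatch split across num_workers."""
    # Split the minibatch into one shard per (simulated) worker.
    shards = [minibatch[i::num_workers] for i in range(num_workers)]
    # On a real cluster these gradient computations would run in parallel.
    partials = [shard_gradient(w, b, s) for s in shards]
    # Combine partial sums (stand-in for all-reduce) and average.
    gw = sum(p[0] for p in partials) / len(minibatch)
    gb = sum(p[1] for p in partials) / len(minibatch)
    # Single shared update: all workers see the same new weights.
    return w - lr * gw, b - lr * gb


def mean_squared_error(w, b, data):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def train(data, num_workers=4, batch_size=16, epochs=100, lr=0.3, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new minibatch composition each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w, b = synchronous_sgd_step(w, b, batch, num_workers, lr)
    return w, b


# Toy data (an assumption for illustration): noiseless samples of y = 3x + 1.
data = [(i / 63.0, 3.0 * (i / 63.0) + 1.0) for i in range(64)]
w, b = train(data)
```

Because the update is applied only after all partial gradients are combined, the result of a step does not depend on how many workers share the minibatch (up to floating-point rounding). That property is what makes the synchronous variant straightforward to reason about, at the cost of every worker waiting for the slowest one at each step.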

More importantly, the coding and optimization techniques employed here deliver optimal performance for both Intel Xeon and Intel Xeon Phi processors, both at the single-node, as well as multi-node level.  This is possible due to their shared programming model and architecture.  This preserves the software investment the industry has made in Intel Xeon, and hence reduces TCO for data center operators.

Just as importantly, we are making these performance optimizations available to our developers through the familiar Intel-architecture tool chain, specifically through enhancements over the coming couple of quarters to the Intel® Math Kernel Library (MKL) and Data Analytics Acceleration Library (DAAL). This significantly lowers the software barrier for developers while delivering fast, efficient, and portable implementations.

Let us together grow the use of machine learning and analytics to turn big data into deep insights and prescriptive analytics – getting machines to reason and prescribe a course of action in real-time for a smart and connected world of tomorrow, and extend the benefit of Moore’s Law to new application sectors of our society.

For further information, visit http://www.intel.com/idfsessionsSF and search for SPCS008.