Intel® AI – The Tools for the Job

How heterogeneous hardware architectures break down the barriers between AI models and real-world deployment across complex and diverse AI workloads

This is an incredibly exciting time in the advancement of Artificial Intelligence (AI). Previously, AI capabilities were accessible only to companies with deep expertise in the field. In just a few years, we’re seeing Intel customers around the world realize transformative successes with AI across a wide range of use cases and environments, thanks to the rising maturity of software tools, ecosystems, and hardware capabilities.

Customers are discovering that there is no single “best” piece of hardware to run the wide variety of AI applications, because there is no single type of AI. The constraints of each application dictate the hardware capabilities required, from data center to edge to device, and this reinforces the need for a more diverse hardware portfolio. Covering the full variety of applications, wherever they occur, will deliver the highest return for Intel’s customers.

From Intel® Xeon® Scalable processors that excel at training and inference on massive amounts of unstructured voice and text data, to flexible Intel® FPGAs that provide excellent throughput and low latency for real-time inference, to Intel® Movidius™ vision processing units (VPUs) that deliver extremely low-power inference in cameras, to the upcoming Intel® Nervana™ Neural Network Processor (Intel® Nervana™ NNP), built from the ground up to accelerate deep learning, Intel provides a deep silicon foundation for data-centric innovation wherever data lives: from endpoint devices, to the edge, to the data center and cloud.

We’re investing heavily in software to make these capabilities portable across our portfolio and bring AI to diverse applications, regardless of constraints. Open source projects like nGraph™ ease the work of optimizing different deep learning frameworks (e.g., TensorFlow*, MXNet*, and PyTorch*) across different hardware platforms, giving developers the choice they need to deliver the best experience to their customers.
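To make that portability concrete, here is a minimal sketch of what a framework-level bridge can look like, assuming the 2018-era nGraph bridge for TensorFlow (the ngraph_bridge package, a detail not spelled out above): the model code is ordinary TensorFlow, and importing the bridge routes execution of supported operations to whichever nGraph backend is available.

```python
# Minimal sketch: running a standard TensorFlow graph through the nGraph bridge.
# Assumes the 2018-era ngraph-tensorflow-bridge package; importing ngraph_bridge
# is enough to register nGraph as the execution backend for supported ops.
import numpy as np
import tensorflow as tf
import ngraph_bridge  # assumed package name; installed separately from TensorFlow

# An ordinary TensorFlow (1.x) computation; nothing nGraph-specific in the model code.
images = tf.placeholder(tf.float32, shape=(None, 224, 224, 3), name="input")
weights = tf.Variable(tf.random_normal([3, 3, 3, 64]), name="weights")
conv = tf.nn.conv2d(images, weights, strides=[1, 1, 1, 1], padding="SAME")
activations = tf.nn.relu(conv)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(activations,
                   feed_dict={images: np.zeros((1, 224, 224, 3), np.float32)})
    print(out.shape)  # the same graph can target CPU or other nGraph backends
```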

At the recent Data Centric Innovation Summit, I had the chance to discuss Intel’s comprehensive product portfolio that addresses a variety of applications.

Advanced Deep Learning Training with Intel® Xeon® Scalable Processors

Challenge: Discover new therapies by automating the analysis of thousands of different features in microscopy images far larger than those in traditional deep learning datasets.

Solution: The large memory capacity and high-performance compute of an Intel® Xeon® Scalable processor-based platform.

High-content screening is an important tool in drug discovery, but it is challenging and time-consuming, requiring the extraction of thousands of predefined features from the images.

Today, at our Data Centric Innovation Summit, we described how Novartis teamed with Intel to use deep learning to accelerate the analysis of cell culture microscopy images and study the effects of various treatments. Because the team worked with whole microscopy images, the images in this evaluation were much larger than those used in common deep learning benchmarks: more than 26 times larger than those in the ImageNet* dataset, for example.

Despite the computational and memory demands imposed by the number of parameters in the training model and the size and number of images used, the team achieved a 20x improvement in time to train on a system using Intel® Xeon® Gold 6148 processors, Intel® Omni-Path Architecture, and TensorFlow v1.7.0 [1]. The system sustained more than 120 3.9-megapixel images per second, in part due to the superior memory capacity offered by Intel® hardware.
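A back-of-the-envelope calculation, using only the figures above, shows why memory capacity matters so much for whole-image workloads; the numbers below are illustrative and cover only the input tensors, before any intermediate activations.

```python
# Back-of-the-envelope memory arithmetic for the microscopy workload described above.
# An input tensor's footprint scales linearly with pixel count, which is why
# whole microscopy images stress memory far more than ImageNet-scale crops.
BYTES_FP32 = 4
CHANNELS = 3

def input_tensor_mb(megapixels, batch_size=1):
    """FP32 input footprint in MB, before any intermediate activations."""
    return megapixels * 1e6 * CHANNELS * BYTES_FP32 * batch_size / 1e6

print(input_tensor_mb(3.9))        # ~46.8 MB per 3.9-megapixel microscopy image
print(input_tensor_mb(3.9 / 26))   # ~1.8 MB for an image ~26x smaller (ImageNet-scale)
print(input_tensor_mb(3.9, 32))    # ~1,498 MB (~1.5 GB) for a batch of 32, inputs alone
```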

As this and other examples demonstrate, CPU architectures are a preferred choice for meeting the demands of many real-world deep learning applications. Additionally, ongoing investments are enhancing Intel® Xeon® Scalable processor-based platform performance, with improvements of more than 1.4x for training and nearly 5.4x for INT8 inference [2] across many popular frameworks since the platform’s launch, extending to 11x for inference with the introduction of our next Intel® Xeon® Scalable processor, code-named Cascade Lake [3]. Future platform support for Intel® Optane™ DC Persistent Memory will expand memory capacity near the CPU to enable training on larger data sets.
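For readers less familiar with INT8 inference, the sketch below illustrates the general idea behind it: symmetric per-tensor quantization, in which values are stored as 8-bit integers plus a floating-point scale, so four times as many values fit per cache line and SIMD register as in FP32. It is a generic illustration, not the specific quantization scheme used by any Intel library.

```python
import numpy as np

# Illustrative symmetric per-tensor INT8 quantization (generic technique, not a
# specific Intel implementation): store int8 values plus one float scale.
def quantize_int8(x):
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize_int8(x)
print(x)
print(dequantize(q, s))  # close to x, within one quantization step (s)
```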

At the Innovation Summit, we also announced Intel® DL Boost, a set of processor technologies designed to accelerate AI deep learning. Cascade Lake will also feature the Vector Neural Network Instructions (VNNI), which accomplish in one instruction what had previously taken three. Cooper Lake, the Intel® Xeon® Scalable processor following Cascade Lake, will add bfloat16 support to Intel® DL Boost, further improving training performance.
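The sketch below models, in plain Python, the per-lane arithmetic that a fused VNNI instruction performs: multiply four 8-bit activations by four 8-bit weights, sum the products, and accumulate into a 32-bit integer, work that previously required a three-instruction sequence. This is an illustrative model of the arithmetic only, not a description of the instruction encoding.

```python
import numpy as np

# Illustrative model of the per-lane arithmetic a fused VNNI instruction performs:
# 4 unsigned 8-bit activations x 4 signed 8-bit weights, summed and accumulated
# into a 32-bit integer in a single step.
def vnni_lane(acc_i32, a_u8, w_s8):
    assert len(a_u8) == len(w_s8) == 4
    products = a_u8.astype(np.int32) * w_s8.astype(np.int32)
    return acc_i32 + int(products.sum())

acc = 0
a = np.array([1, 2, 3, 4], dtype=np.uint8)
w = np.array([10, -20, 30, -40], dtype=np.int8)
print(vnni_lane(acc, a, w))  # 1*10 + 2*(-20) + 3*30 + 4*(-40) = -100
```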

We’re also working to make Intel® Xeon® processors easier for customers to deploy in a full stack, like those in Intel® Select Solutions. The new Intel® Select Solution configuration for BigDL on Apache Spark* is the result of our work with industry leaders like Alibaba, Amazon, China Telecom, Microsoft, and Telefonica and key learnings from hundreds of deployments of BigDL to deliver a configuration that will enable customers to quickly deploy AI capability for existing data lakes. The solution includes both hardware and software components and is our first Intel® Select Solution for AI, available from our partners in the second half of 2018.
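As an illustration of how little framework code such a deployment involves, here is a minimal training sketch assuming the BigDL 0.x Python API; module and parameter names may differ in your BigDL release, and the training data here is synthetic rather than drawn from a real data lake.

```python
# Minimal sketch of training a small model with BigDL on Spark, assuming the
# BigDL 0.x Python API (names and parameters may differ in your release).
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())  # BigDL-aware Spark configuration
init_engine()                                # initialize BigDL's engine on the cluster

# Synthetic stand-in for data already sitting in a Spark data lake
train_rdd = sc.parallelize(range(1024)).map(
    lambda i: Sample.from_ndarray(np.random.rand(784),
                                  np.array([float(i % 10 + 1)])))  # 1-based labels

model = (Sequential()
         .add(Linear(784, 128)).add(ReLU())
         .add(Linear(128, 10)).add(LogSoftMax()))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=256)
trained_model = optimizer.optimize()
```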

Real-Time Deep Learning Inference with Intel® FPGAs

Challenge: Develop a real-time deep learning platform with the flexibility to scale across multiple Microsoft use cases.

Solution: Microsoft Project Brainwave* hardware architecture utilizing Intel® Arria® FPGAs.

Microsoft Project Brainwave* is a deep learning acceleration platform built atop adaptable, power-efficient, high-throughput Intel FPGAs. Project Brainwave enables real-time inference at competitive cost and with very low latency. With the ability to be reprogrammed for maximum performance in an ever-evolving AI landscape, FPGAs are an important tool for many deep learning applications, from search, to speech recognition, to video content analysis.

Microsoft recently announced hardware-accelerated models for Azure Machine Learning, backed by Project Brainwave. This service allows developers and data scientists to run real-time models on Azure and at the edge, across a variety of real-time applications, including those in manufacturing, retail, and healthcare.

Microsoft also applied Project Brainwave to new Bing* search features to speed up search results and present intelligent answers. Using machine learning and reading comprehension, Bing rapidly provides intelligent answers that help users find what they’re looking for faster, instead of a list of links to check manually. Intel® FPGAs enabled Bing to decrease model latency by more than 10x while increasing model size by 10x [4].

Visual Intelligence at the Edge with Intel® Movidius™ Myriad™ Vision Processing Units

Challenge: Automatically capture and curate motion photos of a person’s family, friends, and pets, with visual processing occurring within the edge device itself.

Solution: The Google Clips* wireless smart camera with Intel® Movidius™ Myriad™ 2 vision processing unit (VPU).

Low-power, high-performance VPUs from Intel® Movidius™ have helped Google bring its vision for its Google Clips* camera to life. With Intel® Movidius™ Myriad™ 2 VPUs, advanced machine learning algorithms run in real time directly on the camera itself. This has enabled Google to improve camera capabilities, decrease power consumption, and support offline use.

With target applications including embedded deep neural networks, pose estimation, 3D depth sensing, and gesture/eye tracking, Intel® Movidius™ VPUs offer the capabilities for innovative new applications as the “Internet of Cameras” explodes, while adhering to privacy and security policies by keeping these AI applications within the edge device itself. These and future Intel® Movidius™ VPUs will continue to deliver value in applications like video analytics, robotics, and augmented reality.

Next-Gen Training and Inference with Intel® Nervana™ Neural Network Processors

Challenge: Enable the next generation of breakthrough deep learning solutions by circumventing current system barriers with an architecture built from the ground up.

Solution: Intel® Nervana™ Neural Network Processors, coming in 2019.

As AI evolves, models are becoming increasingly complex, driving a growing need for memory. Enabling the future of deep learning means surmounting the memory barrier that is holding us back: current solutions cannot take advantage of all available compute, like an engine starved for gasoline. As a result, data scientists and researchers increasingly see the need for silicon purpose-built for deep learning training and inference. Breaking this memory barrier has driven us to take an entirely new approach with the Intel® Nervana™ Neural Network Processor, which has been designed and built specifically to support deep learning.
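A simple roofline-style estimate makes the “starved engine” point concrete: when a workload performs few operations per byte fetched from off-chip memory, attainable throughput is bounded by bandwidth rather than peak compute. All numbers below are hypothetical and are not specifications of any Intel product.

```python
# Illustrative roofline-style estimate of the memory barrier described above.
# All figures are hypothetical; they show how low arithmetic intensity leaves
# compute idle when model parameters must be fetched from off-chip memory.
peak_tflops = 100.0            # hypothetical peak compute, TFLOP/s
off_chip_bw_tbs = 1.0          # hypothetical off-chip memory bandwidth, TB/s
arithmetic_intensity = 20.0    # FLOPs performed per byte fetched (workload-dependent)

attainable = min(peak_tflops, arithmetic_intensity * off_chip_bw_tbs)
print(f"Attainable: {attainable} TFLOP/s "
      f"({100 * attainable / peak_tflops:.0f}% of peak)")
# Keeping parameters in on-die SRAM or high-bandwidth memory raises the effective
# bandwidth, pushing the attainable rate back toward peak.
```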

Intel® Nervana™ NNP puts memory first with a large amount of high-bandwidth memory and SRAM much closer to where compute actually happens. This means more of the model parameters can be stored on-die for significant power savings and performance gains. It supports most deep learning primitives while making core hardware components as efficient as possible, ensuring that there is nothing extra, like graphics, stealing memory from your deep learning applications. Additionally, the Intel® Nervana™ NNP’s high-speed on- and off-chip interconnects enable massive bidirectional data transfer, so multiple processors can be connected chassis-to-chassis and act as one larger, efficient chip that accommodates larger models for deeper insights.

Intel has worked with key customers on the Lake Crest software development vehicle (SDV) for NNP development, testing, and feedback. This is all being incorporated as we prepare to ship our first commercial product in 2019. I cannot wait to see our customers’ innovations and insights from these groundbreaking chips as the AI field is further advanced.

Software to Ease Heterogeneous Deployments and Accelerate AI Innovation

Frameworks and libraries are of the utmost importance in moving AI forward—the hardware is nothing without the software that brings it all together and delivers the greatest impact. With a robust, multiarchitecture approach, our goal at Intel is to bring all things AI under one software umbrella. This is why initiatives like our open source nGraph compiler are so important. Your time shouldn’t be spent reinventing the wheel.

At Intel, we believe it is our responsibility to optimize software and provide tools that make our hardware perform at its best and simplify the path from model to reality. Get more from direct optimizations for deep learning with our open source performance library; explore nGraph, our open source deep learning compiler that runs both training and inference across multiple frameworks and architectures; use the OpenVINO™ toolkit to quickly optimize pretrained models and deploy neural networks for video across a variety of hardware; and harness massive amounts of data with BigDL, our distributed deep learning library for Apache Spark* and Hadoop* clusters.
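As one example of that path from model to reality, here is a minimal OpenVINO inference sketch, assuming the 2018-era Inference Engine Python API (IENetwork/IEPlugin); the IR files model.xml and model.bin are hypothetical outputs of the Model Optimizer, and the input shape is assumed for illustration.

```python
# Minimal sketch of running inference with the OpenVINO toolkit, assuming the
# 2018-era Inference Engine Python API. model.xml / model.bin are hypothetical
# Model Optimizer outputs; the input shape below is assumed for illustration.
import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

plugin = IEPlugin(device="CPU")   # "GPU", "MYRIAD" (VPU), or "FPGA" are also possible targets
net = IENetwork(model="model.xml", weights="model.bin")
input_blob = next(iter(net.inputs))
output_blob = next(iter(net.outputs))

exec_net = plugin.load(network=net)
frame = np.zeros((1, 3, 224, 224), dtype=np.float32)  # assumed NCHW input, stand-in for a video frame
result = exec_net.infer(inputs={input_blob: frame})
print(result[output_blob].shape)
```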

Providing Customer Solutions Optimized for the Data Era

Intel is helping customers better deal with and derive value from the wealth of data that is being generated every day. We’re committed to providing a comprehensive portfolio of hardware and tools to achieve any AI vision.

The complexity of real-world AI requires a mix of the right hardware and software to make applications come to life. Intel provides these tools within a cohesive, versatile, well-known technology ecosystem.

For more on Intel’s heterogeneous portfolio of AI solutions, please visit AI.Intel.com.

Check out all the news from Intel Data Centric Innovation Summit here.


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. § For more information go to www.intel.com/benchmarks.

The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

[1] https://newsroom.intel.com/news/using-deep-neural-network-acceleration-image-analysis-drug-discovery/

[2] For Intel® Caffe on ResNet-50

1.4x training throughput improvement:
Intel tested 8/2/2018 Processor :2 socket Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz / 28 cores HT ON , Turbo ON Total Memory 376.46GB (12slots / 32 GB / 2666 MHz). CentOS Linux-7.3.1611-Core kernel 3.10.0-693.11.6.el7.x86_64, SSD sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB , Deep Learning Framework Intel® Optimizations for caffe version:a3d5b022fe026e9092fc7abc7654b1162ab9940d Topology::resnet_50  BIOS:SE5C620.86B.00.01.0013.030920180427 MKLDNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396 NoDataLayer. Measured: 123 imgs/sec vs Intel tested July 11th 2017 Platform: Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.

5.4x inference throughput improvement:
Intel tested on July 26th 2018 on 2 socket Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz / 28 cores HT ON , Turbo ON Total Memory 376.46GB (12slots / 32 GB / 2666 MHz). CentOS Linux-7.3.1611-Core, SSD sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB , Deep Learning Framework Intel® Optimized caffe version:a3d5b022fe026e9092fc7abc7654b1162ab9940d Topology::resnet_50_v1 BIOS:SE5C620.86B.00.01.0013.030920180427 MKLDNN: version:464c268e544bae26f9b85a2acb9122c766a4c396 instances: 2 instances socket:2 (Results on Intel® Xeon® Scalable Processor were measured running multiple instances of the framework. Methodology described here: https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-phi)   NoDataLayer, datatype:INT8. Measured: 1233.39 imgs/sec measured vs Intel tested on July 11th 2017Platform: Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.

[3] 11x inference throughput improvement with Cascade Lake:

Future Intel Xeon Scalable processor (codename Cascade Lake) results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance vs Tested by Intel as of July 11th 2017: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (ResNet-50),. Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.

[4] https://windowsreport.com/bing-fast-search/

Naveen Rao

About Naveen Rao

Trained as both a computer architect and neuroscientist, Rao joined Intel in 2016 with the acquisition of Nervana Systems. As chief executive officer and co-founder of Nervana, he led the company to become a recognized leader in the deep learning field. Before founding Nervana in 2014, Rao was a neuromorphic machines researcher at Qualcomm Inc., where he focused on neural computation and learning in artificial systems. Rao's earlier career included engineering roles at Kealia Inc., CALY Networks and Sun Microsystems Inc. Rao earned a bachelor's degree in electrical engineering and computer science from Duke University, then spent a decade as a computer architect before going on to earn a Ph.D. in computational neuroscience from Brown University. He has published multiple papers in the area of neural computation in biological systems. Rao has also been granted patents in video compression techniques, with additional patents pending in deep learning hardware and low-precision techniques and in neuromorphic computation.