Unlocking Data Insights with the Powerful Intel Xeon Scalable Processor

On-demand video streaming. Gene mapping. Recommendation engines. GPS-enabled smartphones that give you directions to and from anywhere, at any time. These are just a few examples of the incredible technology innovations of the past decade. And the Intel® Xeon® processors that power data centers worldwide have been the foundation for many of these advancements. Coupled with improvements in storage and networking, the ever-increasing performance gains of Intel Xeon processors have powered an explosion of compute capability.

Now add in the exponential growth of (mostly unstructured) data. As our digital world expands, so does the amount of data it generates. But what do we do with all this data? That’s where Artificial Intelligence (AI) comes in. AI will enable us to gain insight from this flood of data and further fuel innovation. To keep pace with the rapid advancement of AI, Intel is providing a foundation with Intel® Xeon® Scalable processors that address the compute, bandwidth, and software optimization needs of this next wave of computing.

Excellent Performance Across a Range of Workloads

The recently launched Intel® Xeon® Scalable processor family provides powerful performance for the widest variety of workloads, including a 1.73X average performance boost vs. the previous generation across key industry-standard workloads1. Architected with increased memory and I/O bandwidth, as well as advanced security features, Intel Xeon Scalable processors are optimized to deliver up to 2.2X higher deep learning training performance and up to 2.4X higher inference performance compared to the prior generation2.

Enterprises can benefit from real-time responsiveness that boosts productivity for advanced analytics and AI workloads, while researchers and scientific organizations working on complex high-performance computing problems can unlock greater insights from data and drive breakthrough discoveries.

Figure 1. Intel® Xeon® Scalable Processor average performance gains and AI performance gains vs. previous generation

Intel Xeon Scalable Processors: Architected for Data Insights

The Intel Xeon Scalable processor brings Intel® Advanced Vector Extensions 512 (Intel® AVX-512), delivering up to 32 double-precision FLOPs per cycle per core, together with a cache hierarchy of 1 MB of L2 cache per core and a non-inclusive L3 cache. The Intel Xeon Scalable processor also includes the new mesh architecture, an enhanced memory subsystem, the new Intel® Ultra Path Interconnect, Intel® Speed Shift Technology, and security and virtualization enhancements.
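To see where the figure of 32 double-precision FLOPs per cycle per core comes from, here is a back-of-envelope sketch. The 28-core, 2.5 GHz configuration below is only an illustrative assumption, and sustained frequencies under AVX-512 load differ in practice, so treat the result as a theoretical upper bound:

```python
# Back-of-envelope peak double-precision FLOPS for an AVX-512 core.
FMA_UNITS_PER_CORE = 2         # up to two 512-bit FMA units per core
DP_LANES_PER_FMA = 512 // 64   # 8 double-precision lanes per 512-bit register
FLOPS_PER_FMA = 2              # a fused multiply-add counts as two FLOPs

flops_per_cycle_per_core = FMA_UNITS_PER_CORE * DP_LANES_PER_FMA * FLOPS_PER_FMA
print(flops_per_cycle_per_core)  # -> 32, matching the figure above

# Scaling to a hypothetical 28-core socket at a nominal 2.5 GHz:
cores, freq_ghz = 28, 2.5
peak_gflops = flops_per_cycle_per_core * cores * freq_ghz
print(f"theoretical peak: {peak_gflops:.0f} GFLOP/s")  # ~2240 GFLOP/s per socket
```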

Principal Engineer Akhilesh Kumar provides a detailed description of the architectural advancements of Intel Xeon Scalable processors.

AI: A Short Introduction

AI is currently undergoing a rapid expansion – some approaches and techniques are well known, some are emerging, and others we can’t yet imagine. Emerging applications for AI include image recognition, speech recognition, natural language processing, page ranking, data generation, and reinforcement learning.

To get a good sense of deep learning and which benchmarks play an important role in performance, let’s look at one emerging use case: image recognition. A deep learning workload consists of neural network layers that operate on an input image, either to train a network or to use a trained network to make inferences.

In simple terms, the input image is passed through a series of mathematical operations called layers. These operations enable the network to learn the features of the image. For example, a filter (think of a matrix of randomly assigned weights) operates on the input image matrix, and the results of this operation act as inputs to the second layer of the neural network, and so on until the last layer.
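To make the filter idea concrete, here is a minimal NumPy sketch of a single convolution pass. The image size, filter size, and random weights are arbitrary illustrative choices, not the optimized kernels this article discusses:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))             # toy grayscale input image
kernel = rng.standard_normal((3, 3))   # filter: randomly assigned weights

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
feature_map = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # Each output element is the dot product of the filter with a patch.
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map.shape)  # (6, 6); this output feeds the next layer
```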

During this process, the neural network learns the features of the image, and the weights used are captured. This whole process is called forward propagation. The output is then compared with the expected output (the ground truth) and the error is computed. Next, we work backward through the network to correct the errors, a process called backward propagation.

The input images can be grouped into a set that you operate on in parallel, whose size is called the batch size. There are multiple parameters you can tune during the learning process to achieve a well-trained model, one where the network successfully approximates the target function and learns in a short amount of time. This is deep learning training. Some of the key performance metrics are time-to-train, accuracy, throughput, and total cost of ownership.
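Here is a minimal sketch of what a training loop does, assuming a single linear layer, synthetic data, and a plain mean-squared-error gradient step. Real training runs on the optimized frameworks described later in this article:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 256, 64, 10
W = rng.standard_normal((n_in, n_out)) * 0.01   # randomly assigned weights

X = rng.random((batch_size, n_in))              # one batch of inputs
Y = X @ rng.standard_normal((n_in, n_out))      # synthetic ground truth

lr = 0.01
for step in range(100):
    pred = X @ W                          # forward propagation
    err = pred - Y                        # compare output with ground truth
    loss = np.mean(err ** 2)
    grad = X.T @ err * (2 / batch_size)   # backward propagation (MSE gradient)
    W -= lr * grad                        # correct the errors
print(f"final loss: {loss:.4f}")          # far below the loss at step 0
```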

We then take this trained model and provide it a new input image. The image goes through forward propagation, and the network makes inferences about it, assigning probabilities to the output classifications. If we trained the model well, we can expect high accuracy in our classification. This is deep learning inference. Some of the key performance metrics are throughput, accuracy, and total cost of ownership.
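Inference, by contrast, is forward propagation only. A toy sketch, with a random weight matrix standing in for a trained model and a softmax turning the outputs into class probabilities:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 10))       # stand-in for trained weights
new_image = rng.random(64)              # a new, unseen input

logits = new_image @ W                  # forward propagation only
probs = softmax(logits)                 # probability per output class
print("predicted class:", probs.argmax(), "p =", round(float(probs.max()), 3))
```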

Now that we have an idea of what deep learning entails, we can look at three factors that will drive deep learning performance: Compute, bandwidth and software optimizations.

AI Performance Driver: Compute

Compute performance can be measured in operations per second using Single-Precision General Matrix Multiply (SGEMM) and lower-precision Integer General Matrix Multiply (IGEMM) benchmarks. These are good indicators of achievable matrix-multiply compute performance for a given precision.
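One rough way to probe achievable single-precision matrix-multiply throughput on a given machine is to time a large matmul through a BLAS-backed library. The sketch below uses NumPy, which dispatches matmul to whatever BLAS it was built against (often Intel MKL), so results will vary by installation:

```python
import time
import numpy as np

n, reps = 4096, 5
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
a @ b  # warm-up so one-time setup costs don't pollute the timing

start = time.perf_counter()
for _ in range(reps):
    a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3 * reps  # an n x n matmul is ~2*n^3 floating-point operations
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s single precision")
```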

This is important because the convolution operations in a neural network (convolution layers) are essentially matrix multiply operations, and a higher number of achievable operations per second is key. The Intel® Xeon® Platinum 8180 processor performs 2.3X better than the previous-generation Intel Xeon processor (codenamed Broadwell) on SGEMM and up to 3.4X better on INT8 IGEMM4.

Intel Xeon Scalable processors also bring increased parallelism and vectorization with Intel AVX-512, with up to two 512-bit FMA units computing in parallel. Intel AVX-512 instructions can process twice the number of data elements per instruction as Intel® AVX/AVX2 and four times as many as Intel® SSE.
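The element-count comparison is simple register arithmetic; for single-precision (32-bit) data:

```python
# Single-precision (32-bit) elements processed per vector instruction.
for name, bits in [("SSE", 128), ("AVX/AVX2", 256), ("AVX-512", 512)]:
    print(f"{name:8s}: {bits // 32} fp32 elements per instruction")
# SSE: 4, AVX/AVX2: 8, AVX-512: 16 -- i.e., 2X AVX2 and 4X SSE, as stated above.
```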

In an experimental setup, we measured the benefit of Intel AVX-512 on convolutional neural network layers to be up to 1.65X better performance compared to running the same convolutions without Intel AVX-512 instructions on the same Intel Xeon Scalable processor5.

AI Performance Driver: Bandwidth

Balanced I/O and memory bandwidth is the second performance driver for AI. Intel Xeon Scalable processors contain high-throughput, low-latency innovations, such as up to six DDR4 memory channels per socket and the new mesh architecture, which help deliver outstanding STREAM Triad benchmark performance of up to 211 GB/s4.

This synthetic benchmark measures sustainable memory bandwidth (in MB/s) and a corresponding computation rate for simple vector kernels. The large, efficient caches in the Intel Xeon Scalable processor also provide a major advantage in achieving higher performance. This matters for AI workloads because neural network layer computations, such as ReLU (rectified linear unit) or pooling layers, continuously access data stored in memory.
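The Triad kernel itself is very simple; a NumPy sketch of the idea follows. The official STREAM benchmark is carefully tuned C code, so this only illustrates what Triad does, not how to reproduce the published number (NumPy’s temporary array for scalar * c adds extra traffic, which understates the hardware figure):

```python
import time
import numpy as np

n = 20_000_000                # large enough to overflow the caches
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

start = time.perf_counter()
a = b + scalar * c            # the Triad operation: a[i] = b[i] + scalar*c[i]
elapsed = time.perf_counter() - start

moved_bytes = 3 * n * 8       # STREAM counts read b, read c, write a (8 B doubles)
print(f"~{moved_bytes / elapsed / 1e9:.1f} GB/s effective bandwidth")
```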

In max pooling, for example, we make multiple read accesses to memory to gather a set of data, compute the max of that set, and write the result back to memory. Hence, to achieve better performance, measured as faster time to train (hours or minutes) and higher throughput (images/second), these memory accesses need to have low latency and high throughput to keep the compute units busy.
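A minimal max-pooling sketch makes the memory-bound nature visible: the arithmetic per element is just a comparison, so the cost is dominated by streaming the feature map through memory (the sizes here are arbitrary):

```python
import numpy as np

# 2x2 max pooling with stride 2: every input element is read once,
# and one output element is written per 2x2 window.
x = np.arange(36, dtype=np.float64).reshape(6, 6)    # toy feature map
h, w = x.shape
pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled.shape)  # (3, 3)
```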

AI Performance Driver: Software Optimization

Software optimizations are crucial for AI workloads. They bring out the true potential of the underlying hardware and continue to unlock more performance through various optimization strategies. These optimizations happen across the entire software stack.

The Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is an open source library that is continuously tuned to achieve the best performance. Software experts and data scientists also rigorously tune deep learning frameworks, such as Neon, TensorFlow, Caffe, Theano, Torch, and others, to give the industry the best-performing frameworks for Intel® architecture.

Some of the key challenges that data scientists and software optimization engineers work on daily are efficient memory/cache usage, vectorization, and methodologies to best utilize all cores and improve scaling. They research the effects of data reuse, memory allocation, prefetching, data layout optimizations, load balancing, and reducing synchronization events to gain better performance.
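As one concrete example of these techniques, cache blocking restructures a loop nest so that each tile of data is reused many times while it is still resident in cache. This sketch only illustrates the access pattern; production kernels live in optimized libraries such as Intel MKL:

```python
import numpy as np

def matmul_blocked(A, B, block=64):
    # Cache-blocked matrix multiply: process small tiles so each tile of
    # A and B is reused many times while it is still hot in cache.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, block):
        for k0 in range(0, n, block):
            for j0 in range(0, n, block):
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(matmul_blocked(A, B), A @ B)  # same result, better locality
```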

These optimizations to frameworks and high-performance libraries determine the performance we see on a given piece of hardware. The current snapshot of AI workload performance on the Intel Xeon Scalable processor, for training and inference across multiple optimized frameworks and topologies, shows up to 2.2X performance gains in deep learning training throughput and up to 2.4X performance gains in inference throughput.

Figure 2. Intel® Xeon® Scalable processors are optimized to deliver up to 2.2X higher deep learning training performance compared to the prior generation
Figure 3. Intel® Xeon® Scalable processors are optimized to deliver up to 2.4X higher inference performance compared to the prior generation

It is imperative that users get the best and latest software libraries and frameworks to see advanced AI performance on the latest Intel hardware. The architectural improvements, together with enhanced software optimizations, demonstrate potent performance on the new Intel Xeon processor-based platforms, with up to 138X improvement in inference and up to 113X improvement in training compared to the typical install base running unoptimized software3.

Summary

The Intel Xeon Scalable processor offers impressive performance across a range of workloads and has been architected for data-intensive AI and HPC workloads. When we dig into a deep learning example, it’s easy to see why compute, bandwidth, and software optimizations play a critical role in determining performance, and the Intel Xeon Scalable processor delivers on benchmarks across all three areas.

You can begin your AI journey today using existing, familiar high-performance Intel Xeon Scalable processors and be secure in the knowledge that this scalable and efficient platform addresses your current and future AI needs.

For more detail on Xeon Scalable Processors, refer to the Intel Xeon Scalable Processor product brief.

For additional information, visit the Intel® Nervana™ AI Academy for academia, developers, and startups.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For complete information visit http://www.intel.com/performance. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect the performance of systems available for purchase. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

1 1.73x Average Performance: Geomean based on Normalized Generational Performance (estimated based on Intel internal testing of OLTP Brokerage, SAP SD 2-Tier, HammerDB, Server-side Java, SPEC*int_rate_base2006, SPEC*fp_rate_base2006, Server Virtualization, STREAM* triad, LAMMPS, DPDK L3 Packet Forwarding, Black-Scholes, Intel Distribution for LINPACK, and AI training and inference on Neon ResNet18).

Up to 1.33x on TPC*-E:  1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Lenovo Group Limited with 512 GB Total Memory on Windows Server* 2012 Standard using SQL Server 2016 Enterprise Edition. Data Source:http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=116032402, Benchmark: TPC Benchmark* E (TPC-E), Score: 4938.14 vs. 1-Node, 2 x Intel® Xeon® Platinum 8180 processor on Lenovo Group Limited with 1536 GB Total Memory on Windows Server* 2016 Standard using SQL Server 2017 Enterprise Edition. Data Source: http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=117062701, Benchmark: TPC Benchmark* E (TPC-E), Score: 6598.36. Higher is better

Up to 1.40x on SPECvirt_sc* 2013: Claim based on best-published 2-socket SPECvirt_sc* 2013 result submitted to/published at http://www.spec.org/virt_sc2013/results/res2016q3/virt_sc2013-20160823-00060-perf.html as of 11 July 2017, Score: 2360 @ 137 VMs vs. 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor with 768 GB (24 x 32 GB, 2R x4 PC4-2666 DDR4 2666MHz RDIMM) Total Memory on SUSE Linux Enterprise Server 12 SP2. Data Source: http://www.spec.org, Benchmark: SPECvirt_sc* 2013, Score: 3323 @ 189 VMs. Higher is better

Up to 1.44x on 2-Tier SAP* SD : Claim based on best-published two-socket SAP SD 2-Tier on Linux* result published at http://global.sap.com/solutions/benchmark/sd2tier.epx as of 11 July 2017. New configuration: 2-tier, 2 x Intel® Xeon® Platinum 8180 Processor (56 cores/112 threads) on DellEMC PowerEdge* R740xd with 768 GB total memory on Red Hat Enterprise Linux* 7.3 using SAP Enhancement Package 5 for SAP ERP 6.0, SAP NetWeaver 7.22 pl221, and Sybase ASE 16.0.  Source: Certification #: 2017017: www.sap.com/benchmark, SAP* SD 2-Tier enhancement package 5 for SAP ERP 6.0 score: 32,085 benchmark users.

Up to 1.53x on SPECint*_rate_base2006 :  Claim based on best-published two-socket SPECint*_rate_base2006 result submitted to/published at http://www.spec.org/cpu2006/results/ as of 11 July 2017. New configuration: 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Huawei 2288H V5 with 384 GB total memory on SUSE Linux Enterprise Server 12 SP2 (x86_64) Kernel 4.4.21-69-default, using C/C++: Version 17.0.1.132 of Intel C/C++ Compiler for Linux. Source: submitted to www.spec.org, SPECint*_rate_base2006 Score: 2800. Results are pending SPEC approval; they are considered estimates until SPEC approves

Up to 1.58x on SPECjbb*2015 MultiJVM critical-jOPS: Claim based on best-published two-socket SPECjbb*2015 MultiJVM critical-jOPS results published at http://www.spec.org/jbb2015/results/jbb2015multijvm.html as of 11 July 2017. New configuration: 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Cisco* Systems UCS C240 M5 with 1536 GB total memory on Red Hat Enterprise Linux* 7.3 (Maipo) using Java* HotSpot 64-bit Server VM, version 1.8.0_131. Source:  submitted to http://www.spec.org, SPECjbb2015* - MultiJVM scores: 141,360 max-jOPS and 118,551 critical-jOPS

Up to 1.65x on est. SPECfp*_rate_base2006: Claim based on best-published two-socket SPECfp*_rate_base2006 result submitted to/published at http://www.spec.org/cpu2006/results/ as of 11 July 2017. New configuration: 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Huawei 2288H V5 with 384 GB total memory on SUSE Linux Enterprise Server 12 SP2 (x86_64) Kernel 4.4.21-69-default, using C/C++ and Fortran: Version 17.0.0.098 of Intel C/C++ and Intel Fortran Compiler for Linux. Source: submitted to www.spec.org, SPECfp*_rate_base2006 Score: 1850. Results are pending SPEC approval; they are considered estimates until SPEC approves

Up to 1.65x on est STREAM - triad:  1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Grantley-EP (Wellsburg) with 256 GB Total Memory on Red Hat Enterprise Linux* 6.5 kernel 2.6.32-431 using Stream NTW avx2 measurements. Data Source: Request Number: 1709, Benchmark: STREAM - Triad, Score: 127.7 Higher is better vs. 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Neon City with 384 GB Total Memory on Red Hat Enterprise Linux* 7.2-kernel 3.10.0-327 using STREAM AVX 512 Binaries. Data Source: Request Number: 2500, Benchmark: STREAM - Triad, Score: 199 Higher is better

Up to 1.73x on HammerDB:1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Grantley-EP (Wellsburg) with 384 GB Total Memory on Red Hat Enterprise Linux* 7.1 kernel 3.10.0-229 using Oracle 12.1.0.2.0 (including database and grid) with 800 warehouses, HammerDB 2.18. Data Source: Request Number: 1645, Benchmark: HammerDB, Score: 4.13568e+006 Higher is better vs. 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Purley-EP (Lewisburg) with 768 GB Total Memory on Oracle Linux* 7.2 using Oracle 12.1.0.2.0, HammerDB 2.18. Data Source: Request Number: 2510, Benchmark: HammerDB, Score: 7.18049e+006 Higher is better

Up to 1.73x on LAMMPS: LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is used to simulate the movement of atoms to develop better therapeutics, improve alternative energy devices, develop new materials, and more. E5-2697 v4: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, Intel® Turbo Boost Technology and Intel® Hyperthreading Technology on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.  Gold 6148: 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, Intel® Turbo Boost Technology and Intel® Hyperthreading Technology on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

Up to 1.77x on DPDK L3 Packet Forwarding: E5-2658 v4: 5 x Intel® XL710-QDA2, DPDK 16.04. Benchmark: DPDK l3fwd sample application Score:  158 Gbits/s packet forwarding at 256B packet using cores. Gold 6152: Estimates based on Intel internal testing on Intel Xeon 6152 2.1 GHz, 2x Intel®, FM10420(RRC) Gen Dual Port 100GbE Ethernet controller (100Gbit/card) 2x Intel® XXV710 PCI Express Gen Dual Port 25GbE Ethernet controller (2x25G/card), DPDK 17.02. Score:  281 Gbits/s packet forwarding at 256B packet using cores, IO and memory on a single socket

Up to 1.87x on Black-Scholes: which is a popular mathematical model used in finance for European option valuation. This is a double precision version. E5-2697 v4: 2S Intel® Xeon® processor CPU E5-2697 v4 , 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 128GB total memory, 8 x16GB 2400 MHz DDR4 RDIMM, 1 x 1TB SATA, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327. Gold 6148: Intel® Xeon® Gold processor 6148@ 2.4GHz, H0QS, 40 cores 150W. QMS1, turbo and HT on, BIOS SE5C620.86B.01.00.0412.020920172159, 192GB total memory, 12 x 16 GB 2666 MHz DDR4 RDIMM, 1 x 800GB INTEL SSD SC2BA80, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327

Up to 2.27x on LINPACK*: 1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Grantley-EP (Wellsburg) with 64 GB Total Memory on Red Hat Enterprise Linux*  7.0 kernel 3.10.0-123 using MP_LINPACK 11.3.1 (Composer XE 2016 U1). Data Source: Request Number: 1636, Benchmark: Intel® Distribution of LINPACK, Score: 1446.4 Higher is better vs. 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Wolf Pass SKX with 384 GB Total Memory on Red Hat Enterprise Linux* 7.3 using mp_linpack_2017.1.013. Data Source: Request Number: 3753, Benchmark: Intel® Distribution of LINPACK, Score: 3295.57 Higher is better

2 2.4X deep learning inference and training performance: Inference throughput batch size 1, Training throughput batch size 256. Source: Intel measured as of June 2017 Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).  Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance.  Deep Learning Frameworks: Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking , in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425. Platform: Platform: 2S Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz (22 cores), HT enabled, turbo disabled, scaling governor set to “performance” via acpi-cpufreq driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC). Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0‘, OMP_NUM_THREADS=44, CPU Freq set with cpupower frequency-set -d 2.2G -u 2.2G -g performance. Deep Learning Frameworks: Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking , in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.

3 138X improvement in inference and up to 113X improvement in training: INFERENCE using FP32 Batch Size Caffe GoogleNet v1 256  AlexNet 256.  Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).  Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance.  Deep Learning Frameworks: Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“. Platform: 2S Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.el7.x86_64. OS drive: Seagate* Enterprise ST2000NX0253 2 TB 2.5" Internal Hard Drive. Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0‘, OMP_NUM_THREADS=36, CPU Freq set with cpupower frequency-set -d 2.3G -u 2.3G -g performance. Deep Learning Frameworks: Intel Caffe: (http://github.com/intel/caffe/), revision b0ef3236528a2c7d2988f249d347d5fdae831236. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, MKLML version 2017.0.2.20170110. BVLC-Caffe: https://github.com/BVLC/caffe, Inference & Training measured with “caffe time” command.  For “ConvNet” topologies, dummy dataset was used. For other topologies, data was st ored on local storage and cached in memory before training  BVLC Caffe (http://github.com/BVLC/caffe), revision 91b09280f5233cafc62954c98ce8bc4c204e7475 (commit date 5/14/2017). BLAS: atlas ver. 3.10.1.

4 SGEMM, IGEMM, STREAM benchmarks performance: SGEMM: System Summary 1-Node, 1 x Intel® Xeon® Platinum 8180 Processor GEMM - GF/s 3570.48 Processor Intel® Xeon® Platinum 8180 Processor (38.5M Cache, 2.50 GHz)Vendor Intel Nodes 1 Sockets  1 Cores 28 Logical Processors 56 Platform Lightning Ridge SKX Platform Comments Slots 12 Total Memory 384 GB Memory Configuration 12 slots / 32 GB / 2666 MT/s / DDR4 RDIMM Memory Comments  OS Red Hat Enterprise Linux* 7.3 OS/Kernel Comments kernel 3.10.0-514.el7.x86_64 Primary / Secondary Software ic17 update2 Other Configurations BIOS Version: SE5C620.86B.01.00.0412.020920172159 HT No Turbo Yes 1-Node, 1 x Intel® Xeon® Platinum 8180 Processor on Lightning Ridge SKX with 384 GB Total Memory on Red Hat Enterprise Linux* 7.3 using ic17 update2. Data Source: Request Number: 2594, Benchmark: SGEMM, Score: 3570.48 Higher is better  SGEMM, IGEMM proof point: SKX: Intel(R) Xeon(R) Platinum 8180 CPU Cores per Socket 28 Number of Sockets 2 (only 1 socket was used for  experiments) TDP Frequency  2.5 GHz  BIOS Version SE5C620.86B.01.00.0412.020920172159 Platform  Wolf Pass OS Ubuntu 16.04  Memory  384 GB Memory Speed Achieved  2666 MHz BDX: Intel(R) Xeon(R) CPU E5-2699v4 Cores per Socket 22 Number of Sockets 2 (only 1 socket was used for experiments) TDP Frequency 2.2 GHz BIOS Version GRRFSDP1.86B.0271.R00.1510301446 Platform Cottonwood Pass  OS Red Hat 7.0 Memory 64 GB Memory Speed Achieved  2400 MHz

STREAM: 1-Node, 2 x Intel® Xeon® Platinum 8180 Processor on Neon City with 384 GB Total Memory System Configuration CFG1; Platform Wolf-Pass Qual; Number of Sockets 2;Motherboard Intel Corporation, S2600WFD; Memory         12x32GB DDR4 2666MH; OS Distribution          "RHEL 7.3Kernel: 3.10.0-514.el7.x86_64 x86_64"Bios Version                                                                                 SE5C620.86B.01.00.0470.040720170855 Storage                Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC)

Convolution, ReLU, Pooling Speedup Configuration Details:  Intel® Xeon® Processor E5-2699v4 E5-2699v4 @2.2 GHz, 145 Watt Peak*: 3,097 Gflops 2 sockets x 22 cores Turbo OFF, HT OFF LLC: 56320K  2 sockets x 4 dimms per socket x 8Gb per dimm @2400 MHz DDR4 = 64 GB RAM  RHEL 7.0 (Maipo) 3.10.0-123.el7.x86_64

Intel® Xeon® Platinum 8180 Processor  @2.5 GHz, 205 Watt Peak*: 8,960 Gflops 2 sockets x 28 cores EIST ON, Turbo OFF, HT OFF  LLC: 39424K  2 sockets x 6 dimms per socket x 32Gb per dimm @2666 MHz DDR4 = 376 GB RAM RHEL 7.2 (Maipo) 3.10.0-327.el7.x86_64MKL 2018.0.0 Gold (build 20170511) Intel Compiler (ICC) 17.0.4 20170512 IntelCaffe w/ MKL2017 engine built with -xAVX2-CORE hash 614c605d68f067d65888fe3e4573aabdf593d3fa SKX run: OMP_NUM_THREADS=56 KMP_AFFINITY=granularity=fine,compact numactl -l .build_release/tools/caffe time -iterations 150 -model $model -engine MKL2017  BDW run: OMP_NUM_THREADS=44 KMP_AFFINITY=granularity=fine,compact numactl –i all .build_release/tools/caffe time -iterations 150 -model $model -engine MKL2017 Batch Sizes AlexNet:256 VGG-19: 64 ResNet-50: 50 GoogleNet-V1:  96 IntelCaffe: commit 614c605d68f067d65888fe3e4573aabdf593d3fa Merge: fb54d025 c52cbd7e task-collect-compare-inplace-computations: Compare-collect tool does not check/report data used in in-place computations

5 AVX-512 On off experimental setup: Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC). Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance Deep Learning Frameworks:Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.