Covering Your HPC Workloads with Intel Xeon and Intel Xeon Phi Processors

At industry events and in my interactions with customers, I often encounter people who ask how the Intel® Xeon Phi™ processor compares with the latest Intel® Xeon® processors. Today, with the launch of the new Intel® Xeon® Scalable platform, this is an especially good time to field this question.

Both processors are binary compatible and run existing x86 software, but the architectural features unique to each play a significant role in how an application will perform. So let’s walk through some features to show how the two product families complement each other to cover diverse high-performance computing (HPC) workloads.

Let’s begin by looking at some commonalities between the two platforms. On both platforms, the processors support common x86 instructions and cross-platform code portability, along with the standards-based programmability that Intel® architecture is known for. They also incorporate a new generation of Intel® Advanced Vector Extensions (Intel® AVX) and can integrate the Intel® Omni-Path Architecture (Intel® OPA) fabric onto the chip package. Intel® Advanced Vector Extensions 512 (Intel® AVX-512) delivers up to double the flops per clock cycle compared to the previous-generation Intel® Advanced Vector Extensions 2 (Intel® AVX2). Intel AVX-512 revs up performance for demanding computational workloads, from modeling and simulation to data analytics, machine learning, and visualization. Intel OPA, an optional integrated feature on both processors, delivers 100 Gbps of port bandwidth and low latency, a combination that is ideal for HPC clusters ranging from small deployments to exascale supercomputers. Intel OPA keeps fabric latency low as clusters scale up to thousands or tens of thousands of nodes.
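To see where the "up to double" figure comes from, here is a minimal sketch of the per-core arithmetic. It assumes double-precision data, fused multiply-add (FMA) instructions counted as two flops each, and two FMA units per core; the function name and the two-unit default are illustrative assumptions, not figures from this article.

```python
def flops_per_cycle(vector_bits, element_bits=64, fma_units=2):
    """Peak flops per clock cycle for one core.

    ASSUMPTIONS: each FMA counts as 2 flops (multiply + add) and the core
    has `fma_units` vector FMA units; both are illustrative defaults.
    """
    lanes = vector_bits // element_bits  # elements processed per instruction
    return lanes * 2 * fma_units


avx2_peak = flops_per_cycle(256)    # 4 double-precision lanes
avx512_peak = flops_per_cycle(512)  # 8 double-precision lanes

print(avx2_peak, avx512_peak, avx512_peak / avx2_peak)
```

Doubling the vector width doubles the lanes per instruction, so the peak doubles when everything else is held equal. Real throughput also depends on the number of FMA units in a given processor and on the clock speed it sustains under vector load.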

The Intel Xeon Scalable platform is the next generation of the Intel® Xeon® processor family. Designed into a broad portfolio of balanced platforms, this new processor family offers up to 28 cores and significant increases in memory and I/O bandwidth. With six DDR4 memory channels and 48 PCI Express* lanes, the Intel Xeon Scalable platform provides the performance needed for extremely large compute- and data-intensive workloads. The Intel Xeon Scalable processor is our best general-purpose CPU, offering great parallel and serial performance. It excels on compute-bound applications that demand faster cores with a larger cache. It’s built for the challenges of HPC, enterprise and cloud workloads, the Internet of Things (IoT), and other demanding applications.

For example, in LS-DYNA, a popular crash simulation application used by the automobile, aerospace, construction, military, manufacturing, and bioengineering industries worldwide, simulation performance on the Intel Xeon Scalable processor family was up to 25% faster than on the previous-generation Intel Xeon processor E5 v4 [1]. This was driven by more cores and threads, 50% more memory bandwidth, Intel AVX-512, and an improved cache hierarchy. Overall, the Intel Xeon Gold 6148 processor accelerates insights by an average of 63% [2], with an innovative approach to platform design that unlocks scalable performance for a broad range of HPC workloads, from the smallest clusters to the world’s largest supercomputers.
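The memory bandwidth gain can be approximated from channel counts. The sketch below computes theoretical peak DDR4 bandwidth per socket, assuming 8 bytes per transfer per channel and assuming the previous generation has four memory channels (that channel count is not stated in this article). The function name is illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s for one socket.

    ASSUMPTION: each DDR4 channel moves 8 bytes (64 bits) per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# ASSUMPTION: four channels on the previous generation, six on the new one,
# compared at the same 2400 MT/s so only the channel count differs.
four_channel = peak_bandwidth_gbs(4, 2400)
six_channel = peak_bandwidth_gbs(6, 2400)

print(four_channel, six_channel, six_channel / four_channel)
```

At equal memory speed, going from four channels to six gives a 1.5x theoretical peak, which lines up with the 50% figure. Faster DIMMs raise the peak further, and sustained bandwidth in real applications is lower than these theoretical numbers.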

With its uniquely high degree of parallelism, the Intel Xeon Phi processor excels on applications that are massively parallel and highly vectorizable. Existing code runs as is, but the greatest benefits come from optimizing HPC code to expose that parallelism. It’s also a great choice for applications that are memory-bandwidth bound and can benefit from the 16GB of on-package high-bandwidth MCDRAM, which offers up to 490GB/s of data transfer. Platforms with the Intel Xeon Phi processor product family offer attractive total cost of ownership (TCO) benefits for many HPC applications, with improved energy efficiency compared to Intel Xeon processors.
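One common way to reason about whether a kernel is memory-bandwidth bound is a roofline-style estimate, which this article does not itself use. The sketch below is illustrative only: it takes the 490GB/s MCDRAM figure from the text and the 68-core, 1400 MHz configuration from the notes at the end, and it assumes 32 double-precision flops per cycle per core. The function and constant names are hypothetical.

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gbs):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)


# ASSUMPTION: 32 double-precision flops per cycle per core.
PEAK_GFLOPS = 68 * 1.4 * 32   # cores x GHz x flops per cycle
MCDRAM_GBS = 490.0            # "up to" figure quoted in the text

# Triad-style kernel a[i] = b[i] + s * c[i]:
# 2 flops per element, 3 doubles (24 bytes) moved per element.
triad_intensity = 2 / 24

bound = attainable_gflops(triad_intensity, PEAK_GFLOPS, MCDRAM_GBS)
ridge = PEAK_GFLOPS / MCDRAM_GBS  # flops per byte needed to become compute bound

print(round(bound, 1), round(ridge, 2))
```

A kernel whose arithmetic intensity sits well below the ridge point is limited by memory bandwidth rather than by flops, which is the kind of application that stands to gain from high-bandwidth memory.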

For example, let’s consider a widely used Financial Services Industry (FSI) workload, Black Scholes DP. Since it is massively parallel, this workload can be optimized to make use of all 72 cores in the Intel Xeon Phi processor. The code is also highly vectorizable, and the same code can run on up to 288 threads (four threads per core, up to 72 cores). The workload further benefits from the Intel AVX-512 Exponential and Reciprocal (ER) instructions, which are unique to the Intel Xeon Phi processor series. Another FSI workload, Binomial Options DP, though also massively parallel and highly vectorizable, benefits in terms of raw performance from the larger cache-per-core-per-thread offered by the Intel Xeon Phi processor family.
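To make the structure of this workload concrete, here is a minimal scalar sketch of the standard Black-Scholes formula for a European call, in double precision. It is not the benchmark code referred to above; the function names and sample inputs are illustrative. The point is that each option is priced independently of every other, which is what lets the work spread across cores, threads, and vector lanes.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, volatility, years):
    """Price of a European call under the Black-Scholes model."""
    sigma_sqrt_t = volatility * math.sqrt(years)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * years) / sigma_sqrt_t
    d2 = d1 - sigma_sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def price_batch(options):
    """Price many options; no iteration depends on any other."""
    return [black_scholes_call(*option) for option in options]


# Hardware threads available on a 72-core part with four threads per core.
HARDWARE_THREADS = 72 * 4

# Hypothetical inputs: (spot, strike, rate, volatility, years)
sample = [(100.0, 100.0, 0.05, 0.2, 1.0), (100.0, 110.0, 0.05, 0.2, 1.0)]
print(HARDWARE_THREADS, [round(p, 4) for p in price_batch(sample)])
```

The formula leans on exponentials, logarithms, square roots, and divisions, so hardware support for fast exponential and reciprocal operations matters for this kind of kernel. An optimized version would process arrays of options with vector instructions rather than one option at a time.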

Many organizations can benefit from deploying both Intel Xeon Scalable processors and Intel Xeon Phi processors in mixed clusters of servers, to gain the unique attributes of each of the platforms. In many situations, large clusters run a variety of applications instead of being dedicated to a single application. A mixed cluster of Intel Xeon processor-based and Intel Xeon Phi processor-based machines can be used to distribute each application to the servers best suited for running it. Additionally, when other considerations such as power or cost are factored in, there are further benefits to a mixed cluster. Linpack benchmark performance is very similar on both Intel Xeon Scalable processors and Intel Xeon Phi processors, but the benchmark uses less power on Intel Xeon Phi processors. This allows similar workloads to be distributed to the Intel Xeon Phi processors to reduce the power consumed by the cluster overall.
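The placement idea can be sketched as a simple rule that maps workload traits to a node type. Everything here is hypothetical: the `Job` fields, the node labels, and the rule itself are illustrative assumptions drawn loosely from the guidance in this article, not an Intel tool or a real scheduler API. In practice, placement should be driven by measured performance and power on each platform.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a workload's dominant traits."""
    name: str
    massively_parallel: bool
    highly_vectorizable: bool
    bandwidth_bound: bool


def choose_node_type(job):
    """Pick a node type for a job using a deliberately simple rule."""
    # Massively parallel, highly vectorizable code and bandwidth-bound
    # code are the cases described above as good fits for Xeon Phi nodes.
    if (job.massively_parallel and job.highly_vectorizable) or job.bandwidth_bound:
        return "xeon-phi"
    # Everything else goes to the general-purpose nodes.
    return "xeon-scalable"


# Hypothetical job mix for illustration.
jobs = [
    Job("option-pricing", True, True, False),
    Job("stream-like-kernel", False, False, True),
    Job("serial-preprocessing", False, False, False),
]
placement = {job.name: choose_node_type(job) for job in jobs}
print(placement)
```

A real scheduler would also weigh queue depth, node availability, and power or cost targets, but the same idea applies: describe each workload, then route it to the node type where it runs best.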

Both Intel Xeon Scalable processors and Intel Xeon Phi processors are powerful, general-purpose CPUs using familiar x86 instructions and the well-known standards of Intel architecture. Both offer the core performance, memory and I/O bandwidth, and parallelism required by the most demanding HPC workloads. Both are highly capable foundations for a variety of server configurations. With the two platforms to choose from, organizations have the flexibility to match workloads to the capabilities of the processor to get work done faster and more efficiently.

For a closer look at the features of the platforms, explore the Intel Xeon processor and Intel Xeon Phi processor sites.

[1] 2M elements, Car2car model with 120ms simulation time:

  • Intel® Xeon® Processor E5-2697 v4 Configuration:
    Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz, Turbo mode ON, 18 Cores/Socket, 36 Cores (HT off), DDR4 128GB, 2400 MHz, Wildcat Pass Platform. Disk: 800GB Intel SSD. Kernel: 3.10.0-229.20.1.el6.x86_64
  • Intel® Xeon® Gold 6148 Processor Configuration:
    Dual Socket Intel® Xeon® Gold 6148 processor 1.8 GHz, Turbo mode ON, 28 Cores/Socket, 56 Cores (HT off), DDR4 128GB, 2667 MHz, Wolf Pass Platform. Disk: 800GB Intel SSD. Kernel: 3.10.0-327.el7.x86_64
  • Intel® Xeon Phi™ Processor Configuration:
    Intel® Xeon Phi™ processor 7250 68 core, 1400 MHz, Turbo mode ON, HT off, 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, DDR4 96GB 2400 MHz, cache mode, Adams Pass Platform. Disk: 480 GB Intel SSD. Kernel: 3.10.0-229.20.1.el6.x86_64.knl2

[2] Up to 1.63x gains based on geomean of Weather Research Forecasting - Conus 12Km, HOMME, LSTC LS-DYNA Explicit, INTES PERMAS V16, MILC, GROMACS water 1.5M_pme, VASP Si256, NAMD stmv, LAMMPS, Amber GB Nucleosome, Binomial option pricing, Black-Scholes, Monte Carlo European options. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.

Barry Davis

About Barry Davis

Barry Davis has over 28 years of experience in the computing and telecommunications industries. While at Intel Corporation, Mr. Davis built multiple businesses from the ground up, creating Intel’s I/O Storage and Wireless Networking groups. Mr. Davis was one of the original people behind the worldwide success of Wi-Fi, including the introduction of Intel Centrino™ Mobile PCs. He has received numerous industry awards and holds 11 U.S. and 2 international patents. Mr. Davis has been at the forefront of Intel’s recent fabric plans and was the lead on creating the strategies, plans, and products that have resulted in the Intel® Omni-Path Architecture. Barry is currently General Manager of the Accelerated Workload Group, which is part of the Enterprise & Government Group (E&G) in Intel’s Data Center Group. He holds a B.S. in Electrical Engineering from Lehigh University.