Applications on Intel® Omni-Path Architecture Run 4.6 Percent Faster in 2018 than 2017—with Spectre and Meltdown Mitigations

Our customers have asked us what impact the Spectre and Meltdown issues are expected to have on computing and communications performance in High Performance Computing (HPC) clusters—especially regarding Intel® Omni-Path Architecture (Intel® OPA) fabrics. We know it’s important for our customers to have access to accurate, useful, and complete data to assess the potential impact to their workloads, so I’m writing this blog to share what we’ve observed in our own labs.

The data shows that the mitigations, along with other Intel® OPA software stack enhancements, have actually improved the performance of our fabric—not only in tests comparing Intel® OPA 2018 performance to 2017, but also in comparisons against InfiniBand* EDR.

The most common question we hear is about onload versus offload communications processing. It’s helpful to remember that Intel® OPA uses intelligent decision-making to process communications in a manner that is best for the application transaction taking place at the time. The objective for Intel OPA is to benefit the application without delaying the flow. And, our fabric uses more than just offload/onload to help optimize communications. If you’re interested in learning more about these technologies, read the Intel® OPA article series, starting here.

While addressing Spectre and Meltdown is critical, we are also constantly improving the Intel® OPA software stack for the benefit of the HPC community. Over the last year, we’ve been tuning Intel® OPA drivers and the Intel® MPI library, and making OS refinements to further improve Intel® OPA performance across a wide range of MPI codes. We are pleased with the changes we’re seeing—including the mitigations for Spectre and Meltdown.

Testing across 54 MPI applications, Intel OPA runs 4.6 percent faster on average in 2018 than it did in 2017. We tested these applications with the code improvements mentioned above, which include mitigations for Spectre and Meltdown, against the non-mitigated 2017 software stack.

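For readers who want to see what goes into a figure like "4.6 percent faster on average," the sketch below shows one way per-application results can be rolled up. It is only an illustration under assumptions we are stating here, not a description of Intel's exact method: each application's speedup is taken as the ratio of its 2017 runtime to its 2018 runtime, the ratios are combined with a geometric mean (a common convention for summarizing benchmark ratios), and the application names and timings are made-up placeholders rather than measured data.

```python
"""Illustrative sketch only: rolling up per-application speedups.

Assumptions (not from the blog): speedup = 2017 runtime / 2018 runtime, and the
per-application ratios are summarized with a geometric mean. The entries below
are hypothetical placeholders, not Intel measurements.
"""
from math import prod

# Hypothetical (application, 2017 runtime in seconds, 2018 runtime in seconds).
runs = [
    ("app_a", 100.0, 95.0),
    ("app_b", 250.0, 240.0),
    ("app_c", 80.0, 78.5),
]

# Per-application speedup: values above 1.0 mean the 2018 stack (with the
# Spectre/Meltdown mitigations) finished the same work in less time.
speedups = [t_2017 / t_2018 for _, t_2017, t_2018 in runs]

# Geometric mean is the usual way to average ratios across many benchmarks.
geo_mean = prod(speedups) ** (1.0 / len(speedups))

for (name, _, _), s in zip(runs, speedups):
    print(f"{name}: {100.0 * (s - 1.0):+.1f}% vs. 2017")
print(f"average: {100.0 * (geo_mean - 1.0):+.1f}% vs. 2017")
```

Under the same convention, a figure such as "19 percent faster" corresponds to a mean runtime ratio of roughly 1.19.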

Intel® OPA performance has dropped slightly on some applications from 2017 to 2018. Intel has a performance engineering team that examines these kinds of changes and develops enhancements to improve the results. But for the majority of MPI codes used by scientists and engineers around the world, users can expect the same or better performance, and they can rest assured that the software enhancements include the Spectre and Meltdown mitigations.

Customers have also asked us how Intel OPA with the mitigations compares to InfiniBand. We not only achieved a 4.6 percent average performance boost across 54 MPI codes compared to Intel® OPA performance in 2017, we are also outperforming InfiniBand EDR by an average of 4 percent across the same applications—with the mitigations turned on. Four percent is good. But it gets better. Thirty of those applications run 19 percent faster on average than on InfiniBand EDR, and the NWCHEM codes run 44 percent to 141 percent faster than on InfiniBand EDR.

Some applications run slower on Intel® OPA than on InfiniBand EDR; 24 applications run 7 percent slower on average. Again, the Intel performance engineering team is working on addressing these differences and exploring refinements that may improve performance in the future. But the data shows that for the majority of MPI codes used by scientists and engineers, users can expect great performance on Intel® OPA compared to InfiniBand EDR, while enabling the Spectre and Meltdown mitigations.

The Intel® OPA design team is committed to delivering the most performant fabric with the lowest latency for distributed computing applications. Intel has a long history in HPC, and our continued investment in fabric technology, along with the rest of the ingredients that make up Intel’s balanced HPC framework, has resulted in 1) HPC systems that lead the list of the fastest supercomputers around the world, and 2) Intel® OPA leading the adoption of 100 Gbps fabrics that enable those systems.

Configuration for Application Performance—Intel® Xeon® Platinum 8170 Processors

As with all our claims about performance, at Intel we back up our statements with the details of how we obtained the data, so readers can verify for themselves that what we state is achievable.

Intel® Omni-Path Architecture 2018 Compared to 2017 Performance

Here is how we arrived at the 4.6 percent average performance improvement for Intel® OPA in 2018 versus 2017.

Application-specific configurations:

  • BSMBench - An HPC Benchmark for BSM Lattice Physics Version 1.0. 32 ranks per node. Parameters: global size is 64x32x32x32, proc grid is 8x4x4x4. Machine config build file: cluster.cfg
  • FDS (Fire Dynamics Simulator) version 6.5.3. strong_scaling_test, a General purpose input file to test FDS timings. 50 MPI ranks per node, 800 total MPI ranks.
  • GROMACS version 2016.2. http://www.prace-ri.eu/UEABS/GROMACS/1.2/GROMACS_TestCaseB.tar.gz lignocellulose-rf benchmark. -g -static-intel. CC=mpicc CXX=mpicxx  -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX512 GMX_OPENMP_MAX_THREADS=256. Run detail: gmx_mpi mdrun -s run.tpr -gcom 20 -resethway -noconfout
  • LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) Feb 16, 2016 stable version release. Official Git Mirror for LAMMPS (http://lammps.sandia.gov/download.html). ls, rhodo, sw, and water benchmarks. 52 ranks per node and 2 OMP threads per rank. Common parameters: I_MPI_PIN_DOMAIN=core. Run detail: number of time steps=100, warm-up time steps=10 (not timed), number of copies of the simulation box in each dimension: 8x8x4, problem size: 8x8x4x32k = 8,192k atoms. Build parameters: Modules: yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule yes-mpiio yes-opt yes-replica yes-rigid yes-user-omp yes-user-intel. Binary to be built: lmp_intel_cpu. Runtime LAMMPS parameters: -pk intel 0 -sf intel -v n 1
  • LS-DYNA, A Program for Nonlinear Dynamic Analysis of Structures in Three Dimensions. Version: mpp s R9.1.0, Revision 113698, single precision (I4R4). OPA parameters: better of I_MPI_FABRICS shm:tmi and tmi. EDR parameters: better of I_MPI_FABRICS shm:dapl and shm:ofa. Example pfile: gen { nodump nobeamout dboutonly } dir { global one_global_dir local /tmp/3cars }. 2017: mpp s R9.1.0, Revision 113698; 2018: mpp s R8.1.0, Revision 105896.
  • NAMD version 2.10b2, stmv and apoa1 benchmark. Build detail: CHARM 6.6.1. FFTW 3.3.4. Relevant build flags: ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --cxx icpc --cc icc --with-fftw3.
  • NWCHEM release 6.6. Binary: nwchem_armci-mpi_intel-mpi_mkl with MPI-PR run over MPI-1. Workload: siosi3 and siosi5. http://www.nwchem-sw.org/index.php/Main_Page. Intel OPA for 2017 result: -genv PSM2_SDMA=0. 2 ranks per node, 1 rank for computation and 1 rank for communication. -genv CSP_VERBOSE 1 -genv CSP_NG 1 -genv LD_PRELOAD libcasper.so
  • OpenFOAM is a free, open-source CFD software package developed primarily by [OpenCFD](http://www.openfoam.com). Version v1606+. GCC version 4.8.5 for Intel MPI. All default make options.
  • Quantum ESPRESSO is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. http://www.quantum-espresso.org/ ./configure --enable-openmp --enable-parallel. BLAS_LIBS= -lmkl_intel_lp64   -lmkl_intel_thread -lmkl_core ELPA_LIBS_SWITCH = enabled SCALAPACK_LIBS = $(TOPDIR)/ELPA/libelpa.a -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 DFLAGS= -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__ELPA -D__OPENMP $(MANUAL_DFLAGS) AUSURF112 benchmark, all default options
  • SPECFEM3D_GLOBE simulates the three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). It is a time-step algorithm which simulates the propagation of earth waves given the initial conditions, mesh coordinates/ details of the earth crust. small_benchmark_run_to_test_more_complex_Earth benchmark, default input settings. specfem3d_globe-7.0.0. FC=mpiifort CC=mpiicc MPIFC=mpiifort FCFLAGS=-g -xCORE_AVX2 CFLAGS=-g -O2 -xCORE_AVX2. sh and run_mesher_solver.sh, NCHUNKS=6, NEX_XI=NEX_ETA=80, NPROC_XI=NPROC_ETA=10. 600 cores used, 52 cores per node
  • Spec MPI2007, https://www.spec.org/mpi/. *Intel Internal measurements marked estimates until published. Applications listed with “-Large” or “-Medium” in the name were part of the spec MPI suite. Compiler options: -O3 -xCORE-AVX2 -no-prec-div. Intel MPI: mpiicc, mpiifort, mpiicpc. Open MPI: mpicc, mpifort, mpicxx. Run detail: mref and lref suites, 3 iterations. 121.pop2: CPORTABILITY=-DSPEC_MPI_CASE_FLAG. 126.lammps: CXXPORTABILITY = -DMPICH_IGNORE_CXX_SEEK. 127.wrf2: CPORTABILITY      = -DSPEC_MPI_CASE_FLAG -DSPEC_MPI_LINUX. 129.tera_tf=default=default=default: srcalt=add_rank_support 130.socorro=default=default=default: srcalt=nullify_ptrs FPORTABILITY  = -assume nostd_intent_in CPORTABILITY = -DSPEC_EIGHT_BYTE_LONG CPORTABILITY = -DSPEC_SINGLE_UNDERSCORE. 2017 Intel® OPA: 32 MPI ranks per node for 115.fds4 benchmark
  • WRF - Weather Research & Forecasting Model (http://www.wrf-model.org/index.php) version 3.5.1. -xCORE_AVX2 -O3. NetCDF 4.4.1.1 built with icc. NetCDF-Fortran version 4.4.4 built with icc.
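
To make the run details above concrete, here is a minimal, purely illustrative timing harness. It is not Intel's measurement methodology: the mpirun launch line, binary name, and input file are hypothetical placeholders, and several of the applications listed report their own internal timings rather than relying on an external wall clock. The sketch simply shows the general pattern implied by entries such as the FDS configuration (800 total MPI ranks, 50 per node) and the SPEC MPI2007 run detail (3 iterations): launch the job a few times and summarize the elapsed time.

```python
"""Hypothetical timing harness; not Intel's actual measurement methodology.

Shows the general pattern implied by the run details above: launch an MPI job a
fixed number of times, keep a summary of the wall-clock times, and compare the
two software stacks (or fabrics) on that number. The launch line is a placeholder.
"""
import statistics
import subprocess
import time


def time_run(cmd, iterations=3):
    """Run `cmd` `iterations` times and return the median wall-clock seconds."""
    elapsed = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the job exits non-zero
        elapsed.append(time.perf_counter() - start)
    return statistics.median(elapsed)


if __name__ == "__main__":
    # Placeholder launch line loosely modeled on the FDS entry above
    # (800 total MPI ranks, 50 ranks per node); real runs use the
    # per-application commands and environment settings listed in this post.
    cmd = ["mpirun", "-np", "800", "-ppn", "50", "./fds_mpi", "strong_scaling_test.fds"]
    print(f"median wall time: {time_run(cmd):.1f} s")
```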

Intel® Omni-Path Architecture Compared to InfiniBand* EDR

Here is how we arrived at the 4 percent average performance improvement for 2018 Intel® OPA compared to InfiniBand EDR.

Application-specific configurations:

  • BSMBench - An HPC Benchmark for BSM Lattice Physics Version 1.0. 32 ranks per node. Parameters: global size is 64x32x32x32, proc grid is 8x4x4x4. Machine config build file: cluster.cfg
  • FDS (Fire Dynamics Simulator) version 6.5.3. strong_scaling_test, a General purpose input file to test FDS timings. 50 MPI ranks per node, 800 total MPI ranks.
  • GROMACS version 2016.2. http://www.prace-ri.eu/UEABS/GROMACS/1.2/GROMACS_TestCaseB.tar.gz lignocellulose-rf benchmark. -g -static-intel. CC=mpicc CXX=mpicxx  -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX512 GMX_OPENMP_MAX_THREADS=256. Run detail: gmx_mpi mdrun -s run.tpr -gcom 20 -resethway -noconfout
  • LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) Feb 16, 2016 stable version release. Official Git Mirror for LAMMPS (http://lammps.sandia.gov/download.html). ls, rhodo, sw, and water benchmarks. 52 ranks per node and 2 OMP threads per rank. Common parameters: I_MPI_PIN_DOMAIN=core. Run detail: number of time steps=100, warm-up time steps=10 (not timed), number of copies of the simulation box in each dimension: 8x8x4, problem size: 8x8x4x32k = 8,192k atoms. Build parameters: Modules: yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule yes-mpiio yes-opt yes-replica yes-rigid yes-user-omp yes-user-intel. Binary to be built: lmp_intel_cpu. Runtime LAMMPS parameters: -pk intel 0 -sf intel -v n 1
  • LS-DYNA, A Program for Nonlinear Dynamic Analysis of Structures in Three Dimensions. Example pfile: gen { nodump nobeamout dboutonly } dir { global one_global_dir local /tmp/3cars }. Higher performance shown with mpp s R8.1.0, Revision 105896 or mpp s R9.1.0, Revision 113698.
  • NAMD version 2.10b2, stmv and apoa1 benchmark. Build detail: CHARM 6.6.1. FFTW 3.3.4. Relevant build flags: ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --cxx icpc --cc icc --with-fftw3.
  • NWCHEM release 6.6. Binary: nwchem_armci-mpi_intel-mpi_mkl with MPI-PR run over MPI-1. Workload: siosi3 and siosi5. http://www.nwchem-sw.org/index.php/Main_Page. 2 ranks per node, 1 rank for computation and 1 rank for communication. -genv CSP_VERBOSE 1 -genv CSP_NG 1 -genv LD_PRELOAD libcasper.so
  • OpenFOAM is a free, open-source CFD software package developed primarily by [OpenCFD](http://www.openfoam.com). Version v1606+. GCC version 4.8.5 for Intel MPI. All default make options.
  • Quantum ESPRESSO is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. http://www.quantum-espresso.org/ ./configure --enable-openmp --enable-parallel. BLAS_LIBS= -lmkl_intel_lp64   -lmkl_intel_thread -lmkl_core ELPA_LIBS_SWITCH = enabled SCALAPACK_LIBS = $(TOPDIR)/ELPA/libelpa.a -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 DFLAGS= -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__ELPA -D__OPENMP $(MANUAL_DFLAGS) AUSURF112 benchmark, all default options
  • SPECFEM3D_GLOBE simulates the three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). It is a time-step algorithm which simulates the propagation of earth waves given the initial conditions, mesh coordinates/ details of the earth crust. small_benchmark_run_to_test_more_complex_Earth benchmark, default input settings. specfem3d_globe-7.0.0. FC=mpiifort CC=mpiicc MPIFC=mpiifort FCFLAGS=-g -xCORE_AVX2 CFLAGS=-g -O2 -xCORE_AVX2. sh and run_mesher_solver.sh, NCHUNKS=6, NEX_XI=NEX_ETA=80, NPROC_XI=NPROC_ETA=10. 600 cores used, 52 cores per node
  • Spec MPI2007, https://www.spec.org/mpi/. *Intel Internal measurements marked estimates until published. Applications listed with “-Large” or “-Medium” in the name were part of the spec MPI suite. Compiler options: -O3 -xCORE-AVX2 -no-prec-div. Intel MPI: mpiicc, mpiifort, mpiicpc. Open MPI: mpicc, mpifort, mpicxx. Run detail: mref and lref suites, 3 iterations. 121.pop2: CPORTABILITY=-DSPEC_MPI_CASE_FLAG. 126.lammps: CXXPORTABILITY = -DMPICH_IGNORE_CXX_SEEK. 127.wrf2: CPORTABILITY      = -DSPEC_MPI_CASE_FLAG -DSPEC_MPI_LINUX. 129.tera_tf=default=default=default: srcalt=add_rank_support 130.socorro=default=default=default: srcalt=nullify_ptrs FPORTABILITY  = -assume nostd_intent_in CPORTABILITY = -DSPEC_EIGHT_BYTE_LONG CPORTABILITY = -DSPEC_SINGLE_UNDERSCORE.
  • WRF - Weather Research & Forecasting Model (http://www.wrf-model.org/index.php) version 3.5.1. -xCORE_AVX2 -O3. NetCDF 4.4.1.1 built with icc. NetCDF-Fortran version 4.4.4 built with icc.

The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.  For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.