Intel Omni-Path Architecture Performance – Setting the Record Straight

A recent press release by a provider of InfiniBand* EDR technology stated that InfiniBand EDR demonstrated 30 to 250 percent better performance than Intel® Omni-Path Architecture (Intel® OPA). The release states that three example codes were compared: GROMACS, NAMD, and LS-DYNA. While the release reveals that testing was "conducted at end-user installations and [the vendor's] benchmarking and research centers", it offers few details about the testing: the actual results, the configurations of the systems that ran the tests, or where the tests occurred. Intel believes this lack of transparency about testing, configuration, and settings is not in the best interest of customers and does not reflect industry best practices for reporting performance results.

In the spirit of providing customers with complete and accurate data, and to set the record straight, Intel engineers recently ran these same applications on an internal cluster. The results shown here illustrate that Intel OPA performance was seriously understated by the InfiniBand vendor. As always, Intel discloses its test configurations and settings so that users can verify the accuracy of the results for themselves by replicating the tests.

Intel located the Intel OPA performance data for GROMACS and NAMD on the HPC Advisory Council website [1, 2]. The LS-DYNA performance for Intel OPA is reported in a third-party article [3], which likewise does not disclose details about the test configuration.

Intel's results are significantly different from the claims presented in the vendor's release. For the LS-DYNA testing, because the referenced article does not disclose the CPU or memory configuration, it is not possible to compare absolute performance.

GROMACS

The HPC Advisory Council test configuration used a 128-node cluster. Intel performed the same test on an available 64-node cluster built with the same processor (Intel® Xeon® E5-2697A v4). The Intel results on this cluster are compared below to the published results.

Intel's tests reveal that GROMACS on the 64-node Intel OPA cluster performs 32 percent better than the results published on the HPC Advisory Council site, even though the Intel cluster ran with one-quarter of the memory capacity (64 GB vs. 256 GB) clocked 11 percent slower (2133 vs. 2400 MHz).

[Chart: GROMACS v2016.2 lignocellulose-rf performance]

NAMD

The HPC Advisory Council also published NAMD tests, using the apoa1 and stmv benchmarks to measure performance. At 64 nodes, Intel's apoa1 tests returned 110 percent better Intel OPA performance than the Intel OPA results published by the HPC Advisory Council, again while running with one-quarter of the memory capacity (64 GB vs. 256 GB) clocked 11 percent slower (2133 vs. 2400 MHz). This large discrepancy between the results shown on the Advisory Council site and those generated by Intel indicates that Intel OPA performance was again seriously understated.

[Chart: NAMD 2.10b2 apoa1 performance]

LS-DYNA

The LS-DYNA performance for Intel OPA appears to come from a third-party article [3], which reveals few configuration details for the tests and claims. Intel's results here are normalized to two-node performance to highlight the scaling of the Intel OPA fabric. The published measurements suggest that Intel OPA does not scale above ten nodes, but Intel's measured results, taken up to 16 nodes, indicate 45 percent better scaling than the published results at 12 nodes.

[Chart: LS-DYNA Refined Neon model scaling]

The Industry Has Voted

The industry is quickly adopting Intel OPA, both among end customers and across the system vendor ecosystem, driven by Intel OPA's performance, cost effectiveness, and advanced features (e.g., Traffic Flow Optimization, Packet Integrity Protection, and Dynamic Lane Scaling). As a proxy for Intel OPA's success in the marketplace, one merely needs to look at the Top500 list of the fastest supercomputers in the world: Intel OPA has twice the number of placements and 2.5 times the total petaflops of its competitor, InfiniBand EDR. Most of these customers chose Intel OPA after running a performance comparison against InfiniBand EDR, a common best practice before investing in new technology. Most HPC server vendors have chosen Intel OPA and are promoting and selling it as a viable solution for a high-performance fabric. To learn more about why Intel Omni-Path Architecture is becoming the fabric of choice for HPC leaders, please talk with your preferred system vendor or visit Intel at www.intel.com/omnipath.

 

*Other names and brands may be claimed as the property of others.

Configuration Information

Vendor data were obtained by digitizing the plots in the public references.

All Intel tests were performed on an internal Intel benchmarking cluster consisting of dual-socket Intel® Xeon® E5-2697A v4 (Broadwell) nodes connected with Intel® Omni-Path Architecture. Each compute node has 64 GB of 2133 MHz DDR4 and runs Red Hat Enterprise Linux* Server release 7.2 (Maipo) with the 3.10.0-327.36.3.el7.x86_64 kernel. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology are enabled. Relevant BIOS settings include: IOU non-posted prefetch disabled, snoop timer for posted prefetch = 9, early snoop disabled, and Cluster-on-Die disabled. The Intel® OPA software level is IFS 10.3.1.0.22. There is one 100 Gbps Intel® OPA Host Fabric Interface (HFI) PCIe* adapter per compute node. Non-default driver parameters: sge_copy_mode=2 eager_buffer_size=8388608 krcvqs=4 max_mtu=10240. All testing was performed with one MPI rank per CPU core.
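For readers who want to reproduce these settings, a minimal sketch of how the non-default hfi1 driver parameters above could be applied on each compute node is shown below; the modprobe.d file name and the driver-reload step are assumptions, while the parameter values are exactly those listed above.

    # /etc/modprobe.d/hfi1.conf  (hypothetical file name; standard modprobe.d syntax)
    # Non-default Intel OPA hfi1 driver parameters used for these tests
    options hfi1 sge_copy_mode=2 eager_buffer_size=8388608 krcvqs=4 max_mtu=10240

    # Reload the driver (or reboot) so the new parameters take effect
    modprobe -r hfi1 && modprobe hfi1
    # Verify one of the active values
    cat /sys/module/hfi1/parameters/krcvqs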

GROMACS version 2016.2, lignocellulose-rf benchmark. Benchmark obtained from http://www.prace-ri.eu/UEABS/GROMACS/1.2/GROMACS_TestCaseB.tar.gz. Open MPI 1.10.4-hfi as packaged with IFS 10.3.1.0.22. FFTW 3.3.4. Relevant build flags: cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_MPI=on -DFFTWF_LIBRARY=/home/user/fftw-3.3.4/lib/libfftw3f.so -DFFTWF_INCLUDE_DIR=/home/user/fftw-3.3.4/include.
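For illustration only, a 64-node run with this configuration might be built and launched roughly as follows; the hostfile name, the rank count of 32 ranks per node, and the .tpr file name inside the benchmark tarball are assumptions, not the exact command line used.

    # Build GROMACS 2016.2 with the flags listed above (from a build/ subdirectory)
    cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF \
        -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_MPI=on \
        -DFFTWF_LIBRARY=/home/user/fftw-3.3.4/lib/libfftw3f.so \
        -DFFTWF_INCLUDE_DIR=/home/user/fftw-3.3.4/include
    make -j

    # Run: one MPI rank per core, two 16-core CPUs per node, 64 nodes -> 2048 ranks (Open MPI)
    mpirun -np 2048 -hostfile hosts.64 gmx_mpi mdrun -s lignocellulose-rf.tpr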

NAMD version 2.10b2, apoa1 and stmv benchmarks. Intel® MPI 2017.1.132, I_MPI_FABRICS=shm:tmi. CHARM 6.6.1. FFTW 3.3.4. Relevant build flags: ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --cxx icpc --cc icc --with-fftw3 --fftw-prefix /home/user/fftw-3.3.4.
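A rough sketch of how a 64-node apoa1 run could be launched with this Intel MPI configuration follows; the hostfile name, the rank count, and the omission of SMP thread-placement options are simplifications and assumptions, not the exact command line used.

    # Intel MPI fabric selection used for the NAMD tests (value from the configuration above)
    export I_MPI_FABRICS=shm:tmi

    # Launch apoa1 on 64 nodes, one MPI rank per core (2048 ranks is illustrative)
    mpirun -n 2048 -f hosts.64 ./namd2 apoa1/apoa1.namd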

LS-DYNA version mpp s R8.1.0 (Feb 24, 2016), Livermore Software Technology Corporation. Intel MPI 2017.1.132 with the -PSM2 option. I_MPI_ADJUST_ALLREDUCE=5, I_MPI_ADJUST_BCAST=1, I_MPI_PIN_PROCESSOR_LIST=allcores, LSTC_MEMORY=AUTO. Run options: i=input memory=200m memory2=30m p=pfile_32. pfile_32 contains: general { nodump nobeamout dboutonly }, dir { global tempdir local /dev/shm/tempdir }.
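Putting the pieces above together, a 16-node run might look roughly like the following; the executable name, rank count, and hostfile are illustrative assumptions, while the environment variables, run options, and pfile contents are those listed above.

    # MPI tuning and LS-DYNA environment from the configuration above
    export I_MPI_ADJUST_ALLREDUCE=5
    export I_MPI_ADJUST_BCAST=1
    export I_MPI_PIN_PROCESSOR_LIST=allcores
    export LSTC_MEMORY=AUTO

    # pfile_32 contents as listed above
    cat > pfile_32 <<'EOF'
    general { nodump nobeamout dboutonly }
    dir { global tempdir local /dev/shm/tempdir }
    EOF

    # Launch: one rank per core, 16 nodes -> 512 ranks (executable name is illustrative)
    mpirun -n 512 -f hosts.16 ./mppdyna i=input memory=200m memory2=30m p=pfile_32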

References

[1] http://hpcadvisorycouncil.com/pdf/GROMACS_Analysis_Intel_E5_2697Av4.pdf

[2] http://hpcadvisorycouncil.com/pdf/NAMD_Analysis_Intel_E5_2697Av4.pdf

[3] https://www.hpcwire.com/2016/04/12/interconnect-offloading-versus-onloading/