Intel Omni-Path Architecture (Intel OPA) Performance Scaling – The Real Numbers

In a recently published article1, there are unsubstantiated claims about Intel® Omni-Path Architecture (Intel® OPA) application performance versus Mellanox EDR InfiniBand*. This post is an effort to be completely transparent with regard to Intel OPA performance numbers and to correct some of the claims made in that article. To do so, we provide Intel-validated and substantiated Intel OPA performance for ANSYS* Fluent, LS-DYNA, and VASP. The article states that “Mellanox hauled out some benchmarks ran by Intel”1. Although we are not aware of the source, we will provide data here on the test cases that were cited in the article.

ANSYS* Fluent

ANSYS Fluent is a memory-bandwidth-sensitive application, which means each core benefits from having a larger share of the system memory bandwidth. If you run the application on one system with 32 cores per node and on another with 36 cores per node, given equal memory configurations, the first system has more memory bandwidth and shared data cache per core and will likely perform better. In other words, the application usually scales better across cluster nodes than across cores within one node. In this sense, the article's comparison of EDR InfiniBand data at 32 cores per node against Intel OPA data at 36 cores per node is not an apples-to-apples comparison.
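
To make the arithmetic concrete, here is a minimal Python sketch of the per-core bandwidth calculation. The ~256 GB/s node figure is only an illustrative estimate for a two-socket DDR4-2666 system, not a measured value from any of the configurations discussed here.

# Illustrative per-core memory bandwidth; node_bandwidth_gbs is a hypothetical
# two-socket DDR4-2666 estimate, not a measured number.
node_bandwidth_gbs = 256.0

for cores_per_node in (32, 36, 40):
    per_core = node_bandwidth_gbs / cores_per_node
    print(f"{cores_per_node} cores/node -> {per_core:.1f} GB/s per core")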

Even with this discrepancy in the referenced article, we show below the true performance of Intel OPA measured with Intel® Xeon® Gold 6148 processor nodes (40 cores per node, with even less memory bandwidth per core). We have compared the performance data represented in graphical form in the article with our in-house measurements using ANSYS Fluent 18.2. We encourage readers to refer to the article: we have done our best to represent the data, but we were not able to locate configuration details for either the InfiniBand or the Intel OPA systems. One can see that the actual Intel OPA performance at 64 nodes is almost identical to the EDR InfiniBand performance. We have also noticed performance improvements with successive releases of Fluent. Specifically, the oil_rig_7m case got a 68% performance boost from Fluent 17.0 to 18.0 on the same hardware on 1536 cores, so the Fluent version is an important configuration detail missing from the referenced data1.

68% performance boost from Fluent 17.0 to 18.0 on the same hardware on 1536 cores

...and at even larger cluster scales:

Using the larger f1_racecar_140m benchmark (140 million cells), Intel OPA scaling continues to impress. The following plot shows that even up to 128 cluster nodes and 5120 compute cores, Intel OPA provides 91% scaling efficiency relative to the 8-node result.
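
For readers who want to reproduce scaling-efficiency figures like this from their own runs, the following minimal Python sketch shows the standard calculation from a solver rating (for example, Fluent's jobs-per-day metric). The ratings in the example are hypothetical placeholders, not our measured data.

def scaling_efficiency(nodes, rating, base_nodes, base_rating):
    """Efficiency of the measured speedup relative to ideal linear scaling."""
    speedup = rating / base_rating      # measured speedup over the baseline run
    ideal = nodes / base_nodes          # ideal (linear) speedup
    return speedup / ideal

# Hypothetical ratings: a 128-node run rated 14.56x the 8-node run
# gives 14.56 / 16 = 0.91, i.e. 91% scaling efficiency.
print(f"{scaling_efficiency(128, 1456.0, 8, 100.0):.0%}")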

ANSYS independently highlights Intel OPA performance in a July blog that shows a 25% to 45% increase compared to EDR InfiniBand across a range of simulations on larger configurations.2

LS-DYNA

In the LS-DYNA comparison, our internal tests show more than double the performance reported in the referenced article. Again, there is minimal configuration information available to understand how that data was collected. At the risk of comparing dissimilar configurations, we show below our internal benchmark performance versus the number of nodes, along with the digitized data from the article1. At the end of this article, the complete configuration is provided for the Intel-measured data, and we encourage readers to inquire about the configuration for the EDR InfiniBand reference data. More importantly, we encourage you to test your own performance if possible.

The following figure shows the data from the article for EDR InfiniBand (red) and Intel OPA (grey). We include our internally measured and documented performance in blue, showing that Intel OPA performance is more than two times greater than the article reports at 32 nodes. Without the configuration details of the reference data, there is no way to account for the remaining 2% gap relative to the InfiniBand number at 32 nodes. This performance is achieved with earlier versions of the Intel OPA host software by specifying the driver parameter eager_buffer_size=8388608 (bytes). This new value of 8 MB is greater than the previous default value of 2 MB. The latest versions of the host software, including version 10.6 which is now publicly available, already include this driver setting, and no user adjustment is required.

Intel OPA strives to provide the best performance out-of-the-box with minimal user tunings

Although Intel OPA strives to provide the best performance out-of-the-box with minimal user tunings, sometimes manual tuning is required until such changes are incorporated into software releases. The above-mentioned driver tuning has been recommended in many recent releases of the Intel OPA Performance Tuning Guide3. We strongly encourage users of Intel OPA to consult the tuning guide periodically if they are concerned about lower-than-expected performance. We suspect, based on the article's Intel OPA performance claims, that there was either a node-health issue or another performance problem with the cluster. The reported result is certainly not indicative of actual Intel OPA performance.
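
As one concrete illustration of checking this tuning, the short Python sketch below reads the active hfi1 driver setting through the standard Linux module-parameter path in sysfs. The sysfs path and the /etc/modprobe.d mechanism are assumptions based on how Linux kernel module parameters are normally inspected and configured, not steps quoted from the referenced article or the tuning guide.

# Minimal sketch: check the active hfi1 eager_buffer_size via sysfs.
# Assumptions: the hfi1 module exposes this parameter through the standard
# /sys/module/.../parameters path, and a persistent override would live in
# /etc/modprobe.d/hfi1.conf.
from pathlib import Path

PARAM = Path("/sys/module/hfi1/parameters/eager_buffer_size")
TUNED = 8 * 1024 * 1024  # 8 MB tuned value vs. the older 2 MB default

def check_eager_buffer_size():
    if not PARAM.exists():
        print("hfi1 module not loaded, or parameter not exposed on this kernel")
        return
    current = int(PARAM.read_text().strip())
    if current >= TUNED:
        print(f"eager_buffer_size = {current} bytes (tuned value or newer default)")
    else:
        print(f"eager_buffer_size = {current} bytes; the recommended setting is")
        print("  options hfi1 eager_buffer_size=8388608   (in /etc/modprobe.d/hfi1.conf)")

if __name__ == "__main__":
    check_eager_buffer_size()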

VASP

In the VASP comparison, there is minimal configuration information available from the referenced article to understand how the data was collected, other than that Intel® Xeon® Scalable processors were used1. At the end of this article, the complete configuration is given for the Intel-measured data, and we encourage readers to inquire about the configuration for the EDR InfiniBand reference data. More importantly, we encourage you to test your own performance if possible.

Below we show our previously measured VASP benchmark performance, along with the digitized data from the article1. We are currently unable to locate the referenced HEZB workload, so in order to compare with the reference data, we took four other similar benchmarks and normalized all runs to the 4-node result. This reduces the impact of system configuration, such as CPU and memory, on the comparison and focuses it on the scalability of the interconnect.
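
The normalization itself is simple; the Python sketch below illustrates it with hypothetical numbers. Each system's results are divided by its own 4-node result, so only relative scaling is compared. The performance values shown are placeholders, not our measured VASP data.

def normalize_to_base(results, base_nodes=4):
    """results: {node_count: performance}; returns relative speedup vs. the base run."""
    base = results[base_nodes]
    return {nodes: perf / base for nodes, perf in sorted(results.items())}

# Hypothetical performance values for two different systems (placeholders only):
system_a = {4: 100.0, 8: 195.0, 16: 370.0, 32: 690.0}
system_b = {4: 120.0, 8: 230.0, 16: 430.0, 32: 780.0}

for name, results in (("system A", system_a), ("system B", system_b)):
    print(name, normalize_to_base(results))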

The following figure shows the data from the article for EDR InfiniBand (red) and Intel OPA (grey). We include our internally measured performance on Intel® Xeon® processor E5-2697v4 in various shades of blue. The first two cases, GaAsBi-64 and PdO4, are medium-sized DFT calculations, and the latter two cases, CuC (vdW) and Si256 (HSE), are expected to scale to higher node counts due to their problem sizes and computation intensity. One can see from the figure that at 16 nodes, Intel OPA scales 1.8 to 2.2 times better than the article reports. The two larger datasets continue to scale up to 32 nodes, with 82% scaling efficiency measured for the Si256 workload from 16 to 32 nodes.

At 16 nodes, Intel OPA scales 1.8 to 2.2 times better than the article reports

Unfortunately, we do not have access to the HEZB workload and therefore cannot directly refute the scaling claim made by the article. However, we do show that for a range of other VASP workloads, Intel OPA scaling is quite healthy. There is a range of publicly available resources for discussion of performance optimizations and access to VASP workloads5,6,7. The latest Intel OPA Performance Tuning Guide3 has optimization tips, which are then incorporated as defaults into future Intel OPA software releases. In general, Intel OPA strives to deliver the best out-of-the-box experience possible across a wide range of workloads.

We hope that after reading this post you’ll understand a little better how Intel OPA performs with ANSYS Fluent, LS-DYNA, and VASP benchmarks at various scales. We hope also that you’ll take the time to run these tests yourself. To learn more about Intel OPA, please visit http://intel.com/omnipath.

 

Intel® OPA configurations:

ANSYS* Fluent: Dual socket Intel® Xeon® Gold 6148 processors. 192 GB DDR memory per node, 2666 MHz. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. Red Hat Enterprise Linux* Server release 7.3 (Maipo). 3.10.0-514.6.2.0.1.el7.x86_64 kernel. Intel® Omni-Path Architecture (Intel® OPA): Intel Fabric Suite 10.2.0.0.158, default driver parameters. Intel Corporation Series 100 Host Fabric Interface (HFI), Series 100 Edge Switch – 48 port. Command line options to enable Intel® OPA with Intel MPI 5.1.3 and AVX2: -mpi=intel -pib.infinipath. ANSYS* Fluent v18.2 is a general purpose CFD and multiphysics solver widely used in automotive manufacturing, aerospace, academia, and Formula 1 racing. Typical workload sizes range from 2 million to 500 million cells. www.ansys.com.

LS-DYNA: Dual socket Intel® Xeon® processor E5-2697Av4. 192 GB DDR memory per node, 2333 MHz. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. Red Hat Enterprise Linux* Server release 7.2. 3.10.0-327.36.3.el7.x86_64 kernel. Intel® Omni-Path Architecture (Intel® OPA): Intel Fabric Suite 10.3.1, eager_buffer_size=8388608. Intel Corporation Series 100 Host Fabric Interface (HFI), Series 100 Edge Switch – 48 port. LS-DYNA MPP 9.1 AVX2 binary (ls-dyna_mpp_s_r9_1_113698_x64_redhat54_ifort131_avx2_intelmpi-413.tar.gz). Intel MPI 2017 Update 1. Example run command: mpirun -np 1024 -ppn 32 -PSM2 -hostfile ~/host32opa ~/mppdynar91avx2 i=input memory=1400m memory2=140m p=pfile_32.

VASP: Dual socket Intel® Xeon® processor E5-2697v4. 128 GB DDR memory per node, 2400 MHz. Red Hat Enterprise Linux* 7.2. Intel® Omni-Path Architecture (Intel® OPA): Intel Fabric Suite 10.0.1.0.50. Intel Corporation Series 100 Host Fabric Interface (HFI), Series 100 Edge Switch – 48 port. IOU Non-posted prefetch disabled. Snoop hold-off timer=9 in BIOS.

VASP 5.4.1. Intel MPI 5.1.3 with I_MPI_FABRICS=shm:tmi. Example run command: mpiexec.hydra -genv OMP_NUM_THREADS=9 -ppn 4 -n 8 -f nodes.txt vasp_std