Intel Omni-Path Architecture Enables Deep Learning Training On HPC

We’re beginning to see the acceleration of deep learning algorithms on HPC clusters through multi-node processing. Customers are using their HPC clusters for more than simulation and analytics: they’re also successfully running sophisticated deep learning training projects on them. Customers who have deployed clusters with the new Intel® Xeon® Scalable processors and Intel® Omni-Path Architecture (Intel® OPA), such as Barcelona Supercomputing Center (BSC), are finding that the combination is a strong platform for deep learning training across multiple nodes.

The Benefits of Multi-Node Training

Training is getting more complex as data and computational scientists explore the field more deeply. Deep learning training is an intensive and repetitive computation, and several factors drive this: many operational neural networks are part of different applications; each neural network may be trained with domain-specific training sets; and evolving input datasets drive the need for re-training. Another consideration is time to train. Trimming it from weeks to days, and then to hours or minutes, makes AI much more viable across multiple verticals.

As researchers drive for greater accuracy, they’re finding that bigger datasets help. But more data demands greater computational resources to deliver results in a timely manner. Single-node workstation solutions cannot keep up with the flood of data and the complexity of deep learning. That leaves two options: scaling up to larger and larger nodes, or scaling out with more nodes.

Scalability is what the Intel Xeon Scalable processor was designed for: delivering consistently powerful performance as the cluster expands. Scale-out training solutions built on Intel® architecture servers interconnected by a high-performance fabric are an excellent choice for improving performance and resource flexibility. Using Intel OPA to couple large numbers of nodes, together with data and model parallelism and smart node grouping, makes near-linear scalability achievable:

  • Texas Advanced Computing Center (TACC) on Stampede2 reached 97% scaling efficiency up to 256 Intel® Xeon Phi™ processor servers with Intel OPA and ResNet-50.¹
  • BSC’s MareNostrum 4, with Intel Xeon Scalable processors, delivered 90% scaling efficiency.¹

[Figure: Scaling efficiency on TACC Stampede2 with Intel Xeon Phi processors and Intel OPA]

TACC Stampede2

  • 97% scaling efficiency from 4 to 256 Intel Xeon Phi processor 7250 nodes interconnected with Intel OPA
  • Convergence with Top1/5 > 74%/92%
  • 4 to 256 node runs: batch size of 16 per node, scaling efficiency of 97% in 63 minutes

[Figure: Intel Caffe ResNet training on Intel Xeon Platinum processors with Intel Omni-Path Architecture]

BSC MareNostrum 4

  • Convergence with Top1/5 > 74%/92%
  • 4 to 256 node runs: batch size of 32 per node, 90% scaling efficiency, total time to train of 70 minutes
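
To make the scaling figures above concrete, here is a minimal Python sketch of how scaling efficiency is commonly computed: measured throughput divided by the throughput that perfect linear scaling from the baseline run would give. The function names and the throughput values are hypothetical, chosen only so the arithmetic lands on 97%; they are not measurements from Stampede2 or MareNostrum 4. The global batch size calculation assumes synchronous data parallelism, where every node processes its own mini-batch in each step.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Fraction of ideal (linear) speedup kept when growing from base_nodes to nodes."""
    ideal_throughput = base_throughput * (nodes / base_nodes)
    return throughput / ideal_throughput


def global_batch_size(per_node_batch, nodes):
    """Samples processed per step when every node trains on its own mini-batch."""
    return per_node_batch * nodes


# Hypothetical throughput values (images per second), for illustration only.
baseline_throughput = 400.0    # assumed rate on the 4-node baseline run
measured_throughput = 24832.0  # assumed rate on the 256-node run

efficiency = scaling_efficiency(4, baseline_throughput, 256, measured_throughput)
print(f"Scaling efficiency from 4 to 256 nodes: {efficiency:.0%}")

# Per-node batch sizes quoted in the lists above, evaluated at 256 nodes.
print("Global batch at 16 per node:", global_batch_size(16, 256))
print("Global batch at 32 per node:", global_batch_size(32, 256))
```

Efficiency is measured against the smallest run in the series (4 nodes here) rather than against a single node, which matches the way the results above are reported.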

HPC clusters are desirable resources for many businesses and research institutions to provide multiple users with sharable computing for simulation and high-performance data analytics. Over the last few years, many institutions have abandoned departmental resources in favor of sharable, ‘condo’ clusters to gain greater computing power for their projects. Now, they’re using those same condo clusters for AI, too.

Why Intel Omni-Path Architecture for AI

When we look at AI, and specifically deep learning training, we see many similarities to HPC applications and their need for high bandwidth, high message rates, and low latency. Models are trained iteratively, and each iteration requires inter-node communication before the next can proceed. Communication must keep up with the neural network calculations, or the processors are left waiting. This becomes more critical as more deep learning frameworks migrate to or make use of scale-out solutions to reduce the time to train.

So, as more nodes participate, the HPC fabric becomes key to ensuring that calculations continue and the processor cores have data to work on. A high message injection rate and consistently low latency are critical for iterative global weight updates and for improving AI performance as node counts increase. Intel OPA is designed to ensure fast, efficient, and scalable performance.
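
As an illustration of why every training step depends on the fabric, here is a minimal pure-Python sketch of synchronous data-parallel training. Each simulated worker computes a gradient on its own shard of data, the gradients are averaged across workers, and every replica applies the same update. The `allreduce_mean` function is a hypothetical stand-in for the collective operation that would cross the fabric on a real cluster, and the one-parameter model and data are invented for the example. This is not how any particular framework is implemented.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for a fabric allreduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(shards, steps=200, learning_rate=0.01):
    w = 0.0  # weight replicated on every worker
    for _ in range(steps):
        # Compute phase: each worker touches only its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # Communication phase: no worker can update until this completes.
        averaged = allreduce_mean(grads)[0]
        # Identical update on every replica keeps the weights in sync.
        w -= learning_rate * averaged
    return w


# Invented data following y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
trained_w = train(shards)
print(f"Trained weight: {trained_w:.6f}")
```

The communication phase sits between every compute phase, so any latency added there is paid once per iteration and multiplied by the number of iterations in a training run.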

Intel OPA has only been in the market for about a year, yet it is sought after by institutions around the world for their latest HPC deployments. Intel’s fabric combines high bandwidth with deterministic low latency and a very high message injection rate. Benchmarks have revealed:²

  • Low latency: 940 ns latency measured through a single switch.
  • High message injection rate: 249 million messages per second bi-directionally with one switch hop (157 Mmps uni-directionally).
  • 100 Gbps bandwidth: 12.4 GB/s uni-directional bandwidth and 24.6 GB/s bi-directional bandwidth with one switch hop.
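
To put these figures in context, the following Python sketch compares the measured uni-directional bandwidth to the nominal 100 Gbps link rate and estimates message transfer time with a simple latency-plus-bandwidth model. The model itself, the assumption that 1 GB means 10^9 bytes, and the large payload (25.6 million 32-bit values, roughly the size of a ResNet-50-class model) are our assumptions for illustration, not part of the benchmark results.

```python
# Figures quoted in the benchmark list above.
LATENCY_S = 940e-9          # latency through a single switch, in seconds
UNI_BANDWIDTH_BPS = 12.4e9  # uni-directional bytes per second (assumes 1 GB = 10**9 bytes)
UNI_MESSAGE_RATE = 157e6    # uni-directional messages per second
LINK_RATE_BITS = 100e9      # nominal link rate in bits per second


def fraction_of_line_rate(measured_bytes_per_s, link_bits_per_s=LINK_RATE_BITS):
    """Measured bandwidth as a fraction of the nominal link rate."""
    return measured_bytes_per_s / (link_bits_per_s / 8)


def transfer_time(message_bytes, latency_s=LATENCY_S, bandwidth=UNI_BANDWIDTH_BPS):
    """Simple model (an assumption): fixed latency plus bytes divided by bandwidth."""
    return latency_s + message_bytes / bandwidth


line_rate_fraction = fraction_of_line_rate(UNI_BANDWIDTH_BPS)
small_message_s = transfer_time(8)               # latency-dominated
large_message_s = transfer_time(25_600_000 * 4)  # bandwidth-dominated, hypothetical payload
mean_gap_s = 1.0 / UNI_MESSAGE_RATE              # average spacing between injected messages

print(f"Fraction of nominal line rate: {line_rate_fraction:.3f}")
print(f"8-byte message: {small_message_s * 1e9:.1f} ns")
print(f"Large payload: {large_message_s * 1e3:.2f} ms")
print(f"Mean gap between messages: {mean_gap_s * 1e9:.2f} ns")
```

Under this model, small messages are governed almost entirely by latency and message rate, while large weight exchanges are governed by bandwidth, which is why all three metrics matter for training at scale.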

Intel OPA integrates a rich set of reliability and performance features that maximize quality of service across the fabric and maintain link continuity even in the event of lane failures. Together, performance and reliability reduce time to train, for example ImageNet-1K training in less than 40 minutes with the Intel® Distribution of Caffe*.³ Compared to InfiniBand EDR*, Intel OPA also delivers reduced communication latency, including:⁴

  • Up to 21% higher performance, with lower latency at scale⁴
  • Up to 53% higher messaging rate²
  • Up to 9% higher application performance⁴

As shown above, Intel OPA is part of the framework that delivers scaling efficiency of 90% or more. It also enables dramatic reductions in time to train.⁵

Convergence of Simulation and AI on HPC

Because of the benefits of scale-out computing and the availability of existing HPC cluster resources, many disciplines are combining deep learning training with existing HPC installations for the following objectives:

  • Using trained models to improve and augment input data for research
  • Using trained models to identify and discard unneeded information captured by research equipment, or to generate data where equipment in the field has left gaps or missing information
  • Running traditional HPC simulations through a neural network to uncover additional patterns and insights
  • Aligning many training results with HPC research and development scenarios

Such work is going on at world-renowned institutions. Below is a small sampling of organizations using AI/HPC convergence to further their research.

TSUBAME 3.0

  • Institute: Tokyo Institute of Technology, a scientific research center
  • Workload: Combines AI with traditional HPC simulation
  • Solution: Intel Xeon processor E5 v4 family and Intel Omni-Path Architecture from SGI/HPE, with GPUs
  • Intel Omni-Path Architecture advantages: Price/performance and features to meet all requirements

Marconi

  • Institute: Cineca, the largest supercomputing center in Italy
  • Workload: Enables an AI system used for applications such as physics and precision medicine
  • Solution: Intel Xeon and Intel Xeon Phi processor-based nodes in Lenovo’s NeXtScale platform connected with Intel Omni-Path Architecture

Stampede2

  • Institute: Texas Advanced Computing Center (TACC)
  • Workload: Pairs machine learning and HPC to classify neuroimaging data
  • Solution: Dell PowerEdge C6320P based on Intel Xeon Phi and Intel Xeon processors with Intel Omni-Path Architecture

Bridges

  • Institute: Pittsburgh Supercomputing Center (PSC)
  • Workload: Libratus, an AI program that beat the world’s top players at poker
  • Solution: Intel Xeon processor-based nodes, a variety of HPE server types, and Intel Omni-Path Architecture

Intel Omni-Path Architecture advantages cited for Marconi, Stampede2, and Bridges:

  • High-performance interconnectivity required to efficiently scale thousands of servers
  • Integrates IBM Spectrum Scale* (GPFS) file system in Lenovo GSS storage subsystem
  • Intel’s complete end-to-end solution with processors and fabric
  • Processor/fabric integration for improved Total Cost of Ownership and optimization
  • Leadership price/performance, which enabled purchase of a more robust cluster
  • High-performance fabric: 100 Gbps with accelerated error detection and correction (no added latency)
  • Fast installation and seamless integration: applications “just run”

As research and application of deep learning training continue, Intel Omni-Path Architecture has proven to be an excellent interconnect for scalable multi-node training. The world’s leading computing institutions use clusters built with Intel OPA combined with Intel Xeon Scalable and Intel Xeon Phi processors. Intel OPA is helping enable the convergence of AI and HPC on Intel architecture.

Also see:
Goyal, Priya, et al. “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” arXiv preprint arXiv:1706.02677 (2017).
Cho, Minsik, et al. “PowerAI DDL.” arXiv preprint arXiv:1708.02.
2 Intel® Xeon® Platinum 8170 processor, 2.10 GHz 26 cores, 64 GB 2666 MHz DDR4 memory per node. 52 ranks per node for message rate tests RHEL* 7.3, 3.10.0-514.el7.x86_64 kernel.
Dual socket servers with one switch hop. Intel® Turbo Boost Technology enabled, Intel® Hyper-Threading Technology enabled. OSU Microbenchmarks version 5.3. Benchmark processes pinned to the cores on the socket that is local to the PCIe adapter before using the remote socket. osu_mbw_mr source code adapted to measure bi-directional bandwidth. We can provide a description of the code modification if requested. EDR based on internal testing: Open MPI 2.1.1 built with hpcx-v1.8.0-gcc-MLNX_OFED_LINUX-4.0-  Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR InfiniBand switch. MLNX_OFED_LINUX-4.0- Tuned performance obtained with MXM_TLS=rc specification.  FEC automatically disabled when using <=2M copper IB* cables.  2. Intel® OPA: Open MPI 1.10.4-hfi as packaged with IFS Intel® Xeon® Platinum 8180 processor, 2.50 GHz, 28 cores, 64 GB 2666 MHz DDR4 memory per node for Message Rate.
4 Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, Snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo).  Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48-port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package.  EDR IB* testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR InfiniBand switch. EDR tested with MLNX_OFED_Linux-3.2.x.  OpenMPI 1.10.x contained within MLNX HPC-X.  Message rate claim:  Ohio State Micro Benchmarks v. 5.0. osu_mbw_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time, average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM_TLS=self, rc, and -mca pml yalla tunings. All measurements include one switch hop.  Latency claim:  HPCC 1.4.3 Random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks.  Application claim:  GROMACS version 5.0.4 ion_channel benchmark. 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Intel® MPI Library 2017.0.064.  Additional configuration details available upon request.