Improving Collectives Performance with Dispersive Routing and Intel® Omni-Path Architecture

In High Performance Computing (HPC), simulation speed is determined not only by the performance of each individual server but also by the interconnect, and in large-scale clusters the interconnect becomes even more important. For some workloads and communication patterns, servers may try to use similar paths through the network, which can create performance bottlenecks. Intel® Omni-Path Architecture (Intel® OPA) has an advanced feature known as Dispersive Routing, and in this article we show that it improves performance of the Alltoall collective communication pattern by up to 30%1.

Dispersive routing assigns multiple static communication paths through the fabric between two endpoints. Each endpoint (compute node) is assigned multiple local identifiers (LIDs), depending on the level of dispersive routing used. Performance Scaled Messaging (PSM2), the high-performance software layer used by MPI applications, can then break a message up among these multiple routes, distributing the traffic across a larger number of inter-switch links (ISLs). This has the potential to reduce congestion on any ISL in the fabric that would normally be reused by multiple communicating compute nodes. Dispersive routing can be enabled by modifying the <Lmc> (LID mask control) variable in the opafm.xml configuration file and restarting the Intel® OPA Fabric Manager.
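As a rough illustration, the relevant portion of opafm.xml might look like the excerpt below. Treat this as a sketch rather than a drop-in file: the exact section in which the <Lmc> element appears can vary between IFS releases.

    <!-- Excerpt from opafm.xml: LID mask control. Each endpoint is assigned
         2^Lmc LIDs, giving 2^Lmc unique routes between any pair of endpoints.
         Lmc=0 (the default) corresponds to static routing. -->
    <Lmc>2</Lmc>

After saving the change, the fabric manager must be restarted for the new LID assignments to take effect; on a systemd-based host running the fabric manager this is typically a command along the lines of systemctl restart opafm, though the exact service name may differ by installation.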

The Perfect Candidate for Performance Testing

Dispersive routing's performance benefits can be observed when executing the bandwidth-demanding Alltoall MPI collective. MPI collectives are communication operations that involve all ranks (communicating processes) in an MPI application. They are therefore more demanding of fabric and system-level performance than point-to-point messages, which makes them a perfect candidate for testing dispersive routing.
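For readers less familiar with the pattern, the following is a minimal, self-contained sketch of an Alltoall exchange; the 64 KB per-peer chunk size is chosen purely for illustration and is not taken from the benchmark below.

    /* Minimal Alltoall sketch: every rank sends a fixed-size chunk to every
     * other rank, so total fabric traffic grows with the square of the rank count. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int chunk = 64 * 1024;  /* illustrative 64 KB chunk per peer */
        char *sendbuf = malloc((size_t)chunk * nranks);
        char *recvbuf = malloc((size_t)chunk * nranks);
        memset(sendbuf, 0, (size_t)chunk * nranks);

        /* Each rank contributes 'chunk' bytes to every rank and receives the same. */
        MPI_Alltoall(sendbuf, chunk, MPI_CHAR,
                     recvbuf, chunk, MPI_CHAR, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

In practice, the measurements in this article use the IMB-MPI1 Alltoall benchmark listed in the system configuration at the end of the article rather than hand-written code.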

Because an Alltoall collective requires every rank to communicate with every other rank, the communication requirements grow with the square of the number of ranks (N²); doubling the rank count roughly quadruples the total traffic. In a typical HPC cluster, the number of ISLs between the edge switches and the core switches is much smaller, usually on the order of the number of connected servers. This means that for a large-rank-count Alltoall collective, some communications will have to share ISLs. When ISLs are shared, the communicating ranks have to share bandwidth with other ranks, which limits the performance of the collective.

Performance Gains with Dispersive Routing

In this example, we use a cluster arranged in a fat tree topology, as illustrated in Figure 1, to show the performance gains from dispersive routing. Each edge switch has 24 servers connected; some are storage servers, some are management nodes, and some are compute nodes. Between each edge switch and each core switch are 12 ISLs.

We first test the Alltoall collective with static routing, meaning a single communication path between two endpoints through the ISLs is pre-determined and does not change while the application runs. In the second test, we enable dispersive routing. A non-zero Lmc value means that each compute node talks to other compute nodes using 2^Lmc unique routes through the fabric (for example, Lmc=2 provides four routes). The message is broken up into smaller pieces and "dispersed" across the available routes in the fabric. This reduces "hot spots" in the fabric, where links would traditionally be over-utilized due to the high communication demand of an Alltoall collective.
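The stand-alone sketch below is purely conceptual and is not PSM2 source code; it simply illustrates the striping idea, assigning successive fragments of a message round-robin across the 2^Lmc available routes (the 64 KB fragment size is a made-up value for the illustration).

    /* Conceptual illustration of dispersive striping (not PSM2's implementation):
     * fragments of a message are assigned round-robin to 2^Lmc routes. */
    #include <stdio.h>

    #define LMC            2                 /* LID mask control value */
    #define NUM_ROUTES     (1 << LMC)        /* 2^Lmc unique routes per endpoint pair */
    #define FRAGMENT_BYTES (64L * 1024L)     /* hypothetical fragment size */

    int main(void) {
        long message_bytes = 1024L * 1024L;  /* a 1 MB message to disperse */
        long offset = 0;
        int  route  = 0;

        while (offset < message_bytes) {
            long len = message_bytes - offset;
            if (len > FRAGMENT_BYTES) len = FRAGMENT_BYTES;
            printf("fragment at offset %7ld (%6ld bytes) -> route %d\n",
                   offset, len, route);
            offset += len;
            route = (route + 1) % NUM_ROUTES; /* spread fragments over all routes */
        }
        return 0;
    }

With static routing, by contrast, every fragment of every message between a given pair of endpoints would follow the same single route.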

In Figure 1, two unique routes are shown between Node A and Node B for the example case where Lmc=1. With static routing (Lmc=0), only one route would be used. In addition to testing a fully subscribed fabric with 24 ISLs from each edge switch, we also disable active ISLs using Intel® Fabric Suite (Intel® IFS) software tools such as opadisableports (and opaenableports to re-enable them). This provides an easy way to test the impact of subscription ratios through simple software commands.

Figure 1: Configuration and dispersive routing example. With Lmc=1 (2 LIDs per endpoint), a message from Node A to Node B will take two unique paths through the fabric, Route 1 and Route 2.

The collective is performed across 59 available compute nodes, each with dual-socket Intel® Xeon® processor E5-2697A v4. This amounts to 1888 MPI ranks, and roughly 3.5 million point-to-point exchanges (1888 × 1887) for a single Alltoall operation!
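To put that in perspective, here is a rough back-of-envelope sketch assuming, for illustration, a 256 KB per-pair message size (one of the sizes tested below):

    /* Back-of-envelope illustration (not a benchmark): how much data a single
     * Alltoall moves for the node/rank counts described above, assuming a
     * 256 KB message per pairwise exchange. */
    #include <stdio.h>

    int main(void) {
        const long long ranks = 59LL * 32LL;         /* 59 nodes x 32 ranks/node = 1888 */
        const long long pairs = ranks * (ranks - 1); /* each rank exchanges with every other */
        const double msg_bytes = 256.0 * 1024.0;     /* 256 KB per pairwise exchange */
        const double total_bytes = (double)pairs * msg_bytes;

        printf("MPI ranks:               %lld\n", ranks);  /* 1888 */
        printf("pairwise exchanges:      %lld\n", pairs);  /* ~3.56 million */
        printf("data moved per Alltoall: %.0f GB\n", total_bytes / 1e9);  /* ~934 GB */
        return 0;
    }

Under that assumption, close to a terabyte of data crosses the fabric for every 256 KB Alltoall, which is why ISL congestion, and therefore routing, matters at this scale.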

As seen in Figure 2, up to 21% faster2 Alltoall times are observed for 64KB and 256KB message sizes using <Lmc>2</Lmc>. Larger message sizes with Alltoall exceeded the memory capacity of the current test system. Negligible performance impact is seen at the 8-byte message size, because such small messages do not consume the full bandwidth of the ISLs and congestion is not as prevalent.

Figure 2: Performance improvements using Dispersive Routing (Lmc=2) relative to Static routing, fully subscribed fabric.

It is common practice to "oversubscribe" clusters, meaning the ratio of connected nodes to available ISLs is increased; for example, 24 nodes sharing 16 ISLs corresponds to a 1.5:1 subscription ratio. In theory, each node then has less available fabric bandwidth because there is more ISL sharing, but the benefit is that less switching infrastructure is required to connect more nodes. Figure 3 shows the performance variation versus cluster subscription ratio for the 256KB message size (far left is heavily oversubscribed, far right is fully subscribed). Up to a 30% faster1 Alltoall collective is measured with dispersive routing for the 1.5:1 subscription ratio case!

Figure 3: Performance improvements for 256 KB message size using Dispersive Routing (Lmc=2), relative to static routing, versus various cluster subscription ratios. Heavily oversubscribed (left) to fully subscribed (right)

We've demonstrated the advantages of using dispersive routing for the Alltoall MPI collective, with a peak benefit of 30% over the static-routing baseline. The application-level performance impact will depend on how the application actually uses MPI collectives. Before enabling dispersive routing on your cluster, we recommend that you carefully evaluate the impact across a range of workloads of interest.

Stay tuned for future updates and examples of how you can tweak the performance of Intel® OPA using other enhanced features. For more information on Intel® OPA, please visit http://intel.com/omnipath.


System configuration: Dual-socket Intel® Xeon® processor E5-2697A v4 nodes, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. 2133 MHz DDR4, 64GB per node. Intel® OPA 48-port Edge switches connected with 1-3 meter copper cables. Intel MPI 2018 Update 1, Intel compilers 18.0.1. I_MPI_FABRICS=shm:tmi. IMB-MPI1 Alltoall using 32 ranks per node (ppn=32). Red Hat Enterprise Linux Server release 7.4 (Maipo). Intel Fabric Suite (IFS) 10.6.1.0.2. 3.10.0-693.21.1.el7.x86_64 kernel, 0xb00002a microcode. Default opafm.xml file used for “static” routing. <Lmc>2</Lmc> used for dispersive routing.

1 30% improvement is based on 256KB Alltoall collective completion time for 1.5:1 oversubscribed fabric, for dispersive routing normalized to static routing, as shown in Figure 3.

2 21% improvement is based on 256KB Alltoall collective completion time for fully subscribed fabric, for dispersive routing normalized to static routing, as shown in Figure 2.

The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Optimization notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.