Adaptive Routing with Intel Omni-Path Architecture

Scaling applications to hundreds or thousands of servers is a common practice in today’s High-Performance Computing (HPC) centers. Intel® Omni-Path Architecture (Intel® OPA) is a leading interconnect used to connect servers (commonly referred to as nodes) in a data center to one another. In its first generation, Intel OPA provides a data throughput of 100 Gigabits per second in each direction of the link, which corresponds to up to 12.5 Gigabytes per second of uni-directional bandwidth and 25 Gigabytes per second of bi-directional bandwidth. In addition to high bandwidth, Intel OPA is a low-latency and highly resilient interconnect with many Quality of Service (QoS) features. One of these features is adaptive routing. In this article, we show exactly how to enable and test adaptive routing with a simple microbenchmark, and in a future article, we will show its impact on application performance in a cluster environment. Stay tuned as well for articles on other QoS features such as Traffic Flow Optimization, where the fabric intelligently prioritizes latency-sensitive messages during the simultaneous transmission of larger messages or bulk data such as storage.

When an Intel OPA fabric is installed in a cluster, the nodes are often connected in a topology known as a "fat tree". In this configuration, compute nodes are connected to “edge” or “leaf” switches, and these edge switches are in turn connected to “core” or “spine” switches, forming a two-tier fabric. With the 48-port radix design of the Intel® Omni-Path Edge Switch (100 series), a maximum of 1152 nodes can be connected in a cluster with full bisection bandwidth. In this configuration, 48 edge switches form the first tier and are connected to 24 switches in the second, core tier. Twenty-four hosts connect to each of the 48 edge switches, and the remaining 24 ports on each edge connect to the core switches, with one inter-switch link (ISL) between each edge and each core. In theory, this configuration is non-blocking and is capable of providing full bandwidth from any node in the cluster to any other unique node at a given time. However, static fat tree routing algorithms (which determine exactly which path is taken between any two node pairs) have limitations, and full bandwidth is not always seen in practice.
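
To make the port math explicit, the following short shell sketch reproduces the sizing arithmetic above (the switch and port counts are taken directly from the text; nothing here queries real hardware):

    #!/bin/bash
    # Sizing arithmetic for a two-tier fat tree built from 48-port switches.
    SWITCH_PORTS=48
    HOSTS_PER_EDGE=$((SWITCH_PORTS / 2))      # 24 ports down to compute nodes
    UPLINKS_PER_EDGE=$((SWITCH_PORTS / 2))    # 24 ports up, one ISL per core switch
    CORE_SWITCHES=$UPLINKS_PER_EDGE           # 24 core switches, one per edge uplink
    MAX_EDGE_SWITCHES=$SWITCH_PORTS           # each 48-port core accepts one link from each edge
    MAX_NODES=$((MAX_EDGE_SWITCHES * HOSTS_PER_EDGE))
    echo "Edges: $MAX_EDGE_SWITCHES  Cores: $CORE_SWITCHES  Max nodes: $MAX_NODES"   # prints 1152 nodes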

Static fat tree routing must provide routing paths for every node talking to every other node. Since a node talking to itself would typically use shared memory within the node, the number of potential routes R for a total node count of N is governed by the simple equation R = N(N-1) = N² - N. Because the number of routes grows as a function of N² while the number of ISLs grows only as a function of N, there are scenarios where certain host communications will use the same ISL (the arithmetic is spelled out after this paragraph). Although customized routing rules can be implemented for a very specific communication pattern that would allow for full bandwidth throughout the cluster, a general-purpose HPC cluster needs more general routing rules in place. How can you tell exactly which routing path your communication will take through the fabric? Packaged tools such as opareport can be used with the sending and destination nodes specified, for example:

    opareport -o route -S nodepat:"NodeA hfi1_0":port:1 -D nodepat:"NodeB hfi1_0":port:1
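
To put concrete numbers on the N² versus N scaling, the same kind of shell arithmetic applied to the 1152-node example gives:

    #!/bin/bash
    # Route count versus ISL count for the 1152-node fat tree example above.
    N=1152
    ROUTES=$((N * (N - 1)))      # R = N(N-1) = 1,325,952 potential routes
    ISLS=$((48 * 24))            # 48 edge switches x 24 uplinks each = 1152 ISLs
    echo "$ROUTES routes must share $ISLS ISLs (~$((ROUTES / ISLS)) routes per ISL)"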

Adaptive routing has the ability to detect when hosts are trying to use the same routes and to adjust the routes in real time to alleviate the congestion, potentially improving the bandwidth delivered to each node. Consider a simplified scenario with only two edge switches and two core switches, as illustrated in Figure 1. In this example we have Node A communicate with Node C, and Node B communicate with Node D. This is representative of how a real HPC application might communicate. Based on the fat tree topology, all traffic needs to travel through either Core switch 1 or Core switch 2 to reach the destination node. To further simplify the demonstration, we connect each edge switch to each core switch with only one ISL. Therefore, there are only two possible paths for any messages traveling between the node pairs. Monitoring the ISL traffic is easy with tools such as the Intel OPA Fabric Manager GUI or command-line tools such as opatop.

Figure 1: Simplified Fat tree topology

The performance test selected is the osu_mbw_mr test from the Ohio State Microbenchmarks* suite, version 5.3 (http://mvapich.cse.ohio-state.edu/benchmarks/). This test returns the aggregate bandwidth and message rate across the node or CPU core pairs used in the test. In the source code, we modified the timing during the bandwidth calculation to use the maximum pair communication time instead of the average time. This prevents the code from reporting bandwidth above the maximum theoretical line rate of 12.5 GB/s.
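
For completeness, a typical build of the benchmark suite against an MPI compiler wrapper looks roughly like the following (a sketch assuming the standard tarball layout from the URL above; the max-time modification mentioned above is a small edit to the osu_mbw_mr C source and is not shown here):

    # Build the OSU microbenchmarks with the MPI compiler wrappers on the PATH.
    tar xf osu-micro-benchmarks-5.3.tar.gz
    cd osu-micro-benchmarks-5.3
    ./configure CC=mpicc CXX=mpicxx
    make
    # The osu_mbw_mr binary is produced under mpi/pt2pt/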

We have chosen to use the Message Passing Interface (MPI) library packaged with Intel OPA software release 10.3.1.0.22, which is Open MPI 1.10.4; however, any MPI library could be used. It is especially important to use the MPI build packaged with the -hfi suffix to ensure Performance Scaled Messaging 2 (PSM2) is used, which is the most performant code path for Intel OPA. If you compile Open MPI yourself, make sure you include PSM2 support in your configure command. We ran this benchmark between Node A and Node C, using one MPI rank on each node.
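
A launch command along the following lines could be used (the hostnames are placeholders for the actual node names, and the path to the osu_mbw_mr binary depends on where it was built):

    # Two nodes, one rank per node: Node A sends to Node C.
    mpirun -np 2 -host nodeA,nodeC ./osu_mbw_mr

The benchmark returns the following result: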
[Image: OSU MPI Multiple Bandwidth / Message Rate Test results, one node pair]

We have omitted the first lines of output for the smaller message sizes for brevity. The bandwidth achieved between two nodes using one core per node is 12.38 GB/s (uni-directional), which is 99% of the theoretical line rate of first-generation Intel OPA. We expect that when running between two node pairs (again, one core per node), the reported bandwidth should be close to double the bandwidth measured for one node pair. However, with static routing that is not always the case, as we will now show. In this test, in addition to Node A sending to Node C, Node B now sends to Node D.
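
The launch is the same as before, but with four hosts; as explained below, the ordering of the host list matters, so the two sending nodes are listed first (hostnames and the binary path are again placeholders):

    # Four nodes, one rank per node: the first half (nodeA, nodeB) send,
    # the second half (nodeC, nodeD) receive.
    mpirun -np 4 -host nodeA,nodeB,nodeC,nodeD ./osu_mbw_mr

The aggregate throughput is reported: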
[Image: OSU MPI Multiple Bandwidth / Message Rate Test v5.3 results, two node pairs with static routing]

Note that the order of the node list is important: the first half of the nodes are identified as the “sending” nodes, or group 1, and the second half of the nodes are the corresponding “receiving” nodes in group 2. As seen in the output, the aggregate bandwidth is only slightly higher than that of a single node pair. For this demonstration, we purposely selected these nodes because opareport told us that they share the same static route through Core switch 2 (see the red path in Figure 1). Figure 2 shows the output of “opatop” (0 → W → D keystrokes) during the above four-node benchmark test: 100% ISL utilization through Core switch 2 and only 50% utilization on the host ports, because two host pairs are sharing the single ISL routed through Core switch 2. Core switch 1 is not utilized, so it is not listed; opatop automatically sorts by the highest utilization (or by any of several other available metrics).

Figure 2: Output of opatop during 4 node test with static routing. Only the red path in Figure 1 is used.

Let us repeat the above test, this time with adaptive routing enabled. To enable adaptive routing, simply set <Enable> to 1 in the <AdaptiveRouting> section of /etc/sysconfig/opafm.xml on the node running the fabric manager. In addition, we are using a <Threshold> value of 7, which tolerates ISL congestion up to 55% before adaptive routing takes over. For this specific test, where we expect full contention, a lower Threshold value, implying a higher congestion tolerance, would also suffice. After modifying opafm.xml, we restart the opafm service on the fabric manager node with “systemctl restart opafm”. The complete contents of the <AdaptiveRouting> section are given in the test configuration section at the end of this article.
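
Condensed into commands, the change on the fabric manager node looks like the following (a sketch; use any editor for the XML edit):

    # On the fabric manager node:
    vi /etc/sysconfig/opafm.xml        # set <Enable>1</Enable> in the <AdaptiveRouting> section
    systemctl restart opafm            # restart the fabric manager so the change takes effect
    systemctl status opafm             # confirm the service came back up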

With adaptive routing now enabled, we repeat the identical test between the same four hosts:
[Image: OSU MPI Multiple Bandwidth / Message Rate Test results, two node pairs with adaptive routing enabled]

The aggregate bandwidth of 24.7 GB/s is now close to double the single-pair bandwidth, which is the expected result. Monitoring with opatop now reveals that the fabric is balanced, with the ISL paths through Core switch 1 and Core switch 2 both fully utilized, allowing full utilization of the corresponding nodes.

Figure 3: Output of opatop during 4 node test with adaptive routing enabled. Both the red and blue paths in Figure 1 are used.

In conclusion, we have demonstrated with a basic example the ability of adaptive routing to enable higher bandwidth through an HPC cluster. Multiple node bandwidth tests revealed the potential for static fat tree routing to oversubscribe inter-switch links. Enabling adaptive routing doubled the throughput achievable between the nodes. In future articles, we will demonstrate the impact of adaptive routing on application testing, as well as other QoS features such as Traffic Flow Optimization. These enhanced features of Intel OPA ultimately mean higher and more consistent performance out of an HPC cluster.

For more information on Intel Omni-Path Architecture, please visit intel.com/omnipath.

Co-Authored by Vivek Kumar Rai, Technical Sales Specialist at Intel Corporation

Test Configuration

Tests were performed on dual-socket Intel® Xeon® processor E5-2697A v4 nodes with 64 GB of 2133 MHz memory per node. Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology enabled. Open MPI 1.10.4-hfi as packaged with IFS 10.3.1.0.22. Red Hat Enterprise Linux* Server release 7.2 (Maipo), 3.10.0-327.36.3.el7.x86_64 kernel. BIOS settings: IOU non-posted prefetch disabled. Snoop timer for posted prefetch=9. Early snoop disabled. Cluster on Die disabled. Intel® Omni-Path 100 Series Edge switch. Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10). Contents of /etc/sysconfig/opafm.xml that enable adaptive routing:

<AdaptiveRouting>
  <Enable>1</Enable>
  <LostRouteOnly>0</LostRouteOnly>
  <Algorithm>2</Algorithm>
  <ARFrequency>0</ARFrequency>
  <Threshold>7</Threshold>
</AdaptiveRouting>