Taking Low I/O Latency Even Lower with RDMA

As processing power increases and cores multiply on servers, new bottlenecks will inevitably be found. If you make a processor that can do more, system performance increases until you hit the next limiting factor. I’ve talked on several occasions about the system bottlenecks that become apparent when using the latest Xeon processors or when many virtual machines are loaded onto a single physical box, and about what can be done to improve them. As you may have noticed, a key laggard in modern multi-processor servers is frequently the network connection. Moving to 10 Gigabit clearly alleviates this issue in many high-horsepower systems, and advanced features like VMDq and SR-IOV will continue to push the envelope on network I/O in virtualized environments.

However, for some applications where latency is of the utmost importance, especially in the HPC market and in parts of the financial services industry, an I/O solution that performs even better than standard 10 Gigabit will be needed. A solution that has existed in this space for a few years but is now gaining momentum is Remote Direct Memory Access (RDMA), which lets one server place information directly into the memory of another server, essentially bypassing the kernel and networking software stack.

At first blush, this solution may sound odd. Why bypass the entire stack? Doesn’t this complicate things? Well, it certainly does add some complications by requiring a modified OS stack and support on both sides of the network link, but there are some telling details about where real-world latencies come from that make this approach attractive in certain circumstances.

If you look at the typical breakdown of CPU utilization when processing networking traffic, you see that the workload is consumed by buffer copies, application context switching, and some TCP/IP processing. Viewed visually (albeit in a simplified way), there is a vertical stack of tasks that need to take place in the server:

iWARP Stack Before1.JPG

There are application buffer copies from the app to the kernel, which are then handled by the NIC. Additionally, the TCP/IP processing that takes place within the OS is a large consumer of CPU cycles, and I/O commands add further latency to the communication process.

So the question is what can be done to reduce these latencies. RDMA has been adapted for standard Ethernet via the IETF Internet Wide Area RDMA Protocol (iWARP). Before the iWARP specification, RDMA was a capability seen only with InfiniBand networking. By porting the goodness of RDMA to Ethernet, iWARP offers the promise of ultra-low latency with all the benefits of standard Ethernet.

With Intel’s acquisition of NetEffect in October, we now have one of the leading iWARP product lines for 10 Gigabit Ethernet. Within these products, the iWARP processing engine can help eliminate some of the key bottlenecks described above and provide very low latency and high-performance networking for even the most demanding HPC applications.

The first item that an iWARP engine can address is TCP processing, which can bog down the CPU as bandwidth loads increase. The Intel NetEffect iWARP solution can offload this processing by handling the sequencing, payload reassembly, and buffer management in dedicated hardware.

The next item that iWARP addresses is the extra copies the system has to make when transferring data. The iWARP extensions for RDMA and Direct Data Placement (DDP) allow the iWARP engine to tag the data with the necessary application buffer information and place the payload directly in the target server’s memory. This eliminates the delays associated with memory copies by moving to a so-called ‘zero copy’ model.
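To make the tagging idea a little more concrete, here is a minimal Python sketch of direct placement. It is a toy model, not NetEffect's implementation: the ToyAdapter class, the TaggedSegment type, and the tag value 7 are all made up for illustration, and the "adapter" is just ordinary Python writing into a bytearray. The point it tries to show is that each segment names a registered application buffer and an offset, so the payload can be written once into its final location, even if segments arrive out of order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedSegment:
    """One tagged segment: which buffer, where in it, and the bytes (hypothetical type)."""
    stag: int       # tag naming a registered application buffer
    offset: int     # byte offset inside that buffer
    payload: bytes


class ToyAdapter:
    """Hypothetical stand-in for an iWARP engine's placement logic."""

    def __init__(self) -> None:
        self._registered: dict[int, memoryview] = {}

    def register(self, stag: int, buffer: bytearray) -> None:
        # Registration exposes the application's own memory to the adapter.
        self._registered[stag] = memoryview(buffer)

    def place(self, seg: TaggedSegment) -> None:
        target = self._registered[seg.stag]  # KeyError if the tag was never registered
        end = seg.offset + len(seg.payload)
        if seg.offset < 0 or end > len(target):
            raise ValueError("segment falls outside the registered buffer")
        # Written once, straight into application memory: no staging buffer.
        target[seg.offset:end] = seg.payload


app_buffer = bytearray(11)
adapter = ToyAdapter()
adapter.register(stag=7, buffer=app_buffer)

# Segments arrive out of order; the tag and offset say where each one belongs.
adapter.place(TaggedSegment(stag=7, offset=6, payload=b"world"))
adapter.place(TaggedSegment(stag=7, offset=0, payload=b"hello "))

print(app_buffer.decode())  # hello world
```

In a real adapter the placement is done by hardware DMA rather than by the CPU, which is where the latency savings come from; the sketch only illustrates the addressing scheme.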

Finally, the iWARP extensions also implement user-level direct access, which allows a user-space application to post commands directly to the network adapter without having to make latency-consuming calls to the OS for each request. This, along with the other pieces of iWARP, provides dramatically reduced latency and increased performance.
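The post-and-poll pattern behind user-level direct access can be sketched in a few lines of Python. Again, this is a toy model under stated assumptions: ToyQueuePair, WorkRequest, and adapter_step are hypothetical names, and the "hardware" is simulated by a method call. It is loosely modelled on the way RDMA verbs libraries let an application post work to a send queue and later poll a completion queue, with no OS call on the data path.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    """A request the application hands to the adapter (hypothetical type)."""
    wr_id: int
    payload: bytes


class ToyQueuePair:
    """Hypothetical model of a user-space send queue plus completion queue."""

    def __init__(self) -> None:
        self.send_queue: deque[WorkRequest] = deque()
        self.completion_queue: deque[int] = deque()
        self.wire: list[bytes] = []  # stands in for the network link

    def post_send(self, wr: WorkRequest) -> None:
        # The application appends to a queue it shares with the adapter;
        # no system call is modelled on this path.
        self.send_queue.append(wr)

    def adapter_step(self) -> None:
        # Stand-in for the hardware draining the send queue on its own.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.wire.append(wr.payload)
            self.completion_queue.append(wr.wr_id)

    def poll_completions(self) -> list[int]:
        # The application checks for finished work without blocking in the OS.
        done = list(self.completion_queue)
        self.completion_queue.clear()
        return done


qp = ToyQueuePair()
qp.post_send(WorkRequest(wr_id=1, payload=b"order A"))
qp.post_send(WorkRequest(wr_id=2, payload=b"order B"))
qp.adapter_step()
completed = qp.poll_completions()
print(completed)  # [1, 2]
```

The design choice worth noticing is that the application and the adapter communicate only through shared queues, so the kernel is needed to set things up but not to move each message.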

The diagram below summarizes what the new system stack looks like after the implementation of iWARP: much simpler, and with much lower latency.

iWARP Stack After2.JPG

One obvious question raised by the above diagram is what modifications need to be made at the OS or application level. Clearly, applications need to be modified to be iWARP compatible, and this can be a time-consuming process. This is one of the reasons that this solution has been slow to gain adoption. However, the OpenFabrics Alliance (OFA) is working on a unified RDMA stack for the open source community. The OFA publishes the OpenFabrics Enterprise Distribution (OFED), a set of libraries and code that can unify different solutions that use RDMA. There have been several OFED releases so far, and further releases are planned to align and expand various RDMA capabilities. In this way, applications that run using the OFED stack over InfiniBand can be run without any changes over iWARP and Ethernet.

As more applications get modified to support the feature set of iWARP RDMA, there will be wider understanding and acceptance in the HPC community of the incremental performance, cost, and standards advantages of using Ethernet with RDMA for the most performance-sensitive applications. Moving from standard non-iWARP Ethernet to iWARP-enabled Ethernet provides more bounded latency, with a reduction from ~14 µs to well under 10 µs… now that is fast.

We live in exciting times!

Ben Hacker


For those looking for some more detail, there is a nice whitepaper on iWARP located here.

For those interested in learning more about the OpenFabrics Alliance (OFA), please see here.