Optane and Intel Memory Drive Technology: a big surprise

A new era of memory, and it's not DRAM

I believe March 19th, 2017 is one of the most exciting days in the non-volatile memory industry: the day the Intel® Optane™ SSD DC P4800X was introduced. This is an outstanding drive, which opens new possibilities in both the storage and memory domains. While the storage use case is clear (most likely, you are reading the drive's specifications at this very moment), the memory use case sounds unusual for an SSD, which is a block device. Yet even with the Optane SSD connected to the PCIe bus, its unique latency and QoS characteristics make the SSD well suited as a memory device. There are multiple ways to use it that way, ranging from standard OS paging all the way down to application-based acceleration.

We support any of those directions, but we also introduced something special: Intel® Memory Drive Technology. It is a software middle layer which runs below the OS and presents SSDs as memory. When you pair this software with an Intel Optane SSD and some system DRAM, the combination is presented to the operating system as a single memory pool, transparent to the OS and applications. This means there is no need to change your applications to take advantage of an Optane SSD expanding the memory pool. (For more details, please see the Intel® Optane™ SSD DC P4800X Series product brief.)

While it is expected that Optane SSDs perform slower than DRAM, the combination of both with intelligent software enables background data movement for write-back and read prefetching, which are key to predictable performance. Being optimized for Optane SSDs, the software takes full advantage of the drive's low-latency capabilities. Still, even with all the software "magic" and Optane's advantages, some performance hit compared to an all-DRAM configuration is expected. High-concurrency applications run extremely efficiently, with near-DRAM performance: for example, MySQL under a SysBench stress test demonstrated 80% of all-DRAM system performance, while the memory footprint scales beyond hardware platform limitations (up to 24TB for 2-socket systems and up to 48TB in 4-socket configurations).

GEMM and its big surprise

However, one particular example that Intel presented at the Optane launch definitely was not in line with those expectations. We ran a General Matrix Multiplication, to be exact a Segmented GEMM (relying on the Intel MKL library). It is easy to achieve a high memory footprint simply by scaling the matrix sizes. For the performance comparison, we put two systems side by side: one with a 768GB all-DRAM configuration, and another with 128GB of DRAM, 4 x Optane SSDs, and the Intel Memory Drive Technology software. SGEMM is configured to generate three matrices with an overall size of 700GB.
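To give a feel for the workload, here is a minimal sketch of the kind of MKL SGEMM call such a benchmark is built around. The matrix dimension below is an illustrative placeholder, not the actual benchmark geometry; the real test scales the sizes until the three matrices occupy roughly 700GB.

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Illustrative size only; scale n up to grow the memory footprint. */
    const MKL_INT n = 8192;
    const float alpha = 1.0f, beta = 0.0f;

    /* Three dense single-precision matrices for C = alpha*A*B + beta*C. */
    float *A = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
    float *B = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
    float *C = (float *)mkl_malloc((size_t)n * n * sizeof(float), 64);
    if (!A || !B || !C) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* Simple sequential initialization, the same single-threaded,
       NUMA-unaware allocation pattern discussed later in this post. */
    for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);

    printf("C[0] = %f\n", C[0]);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}

Build it with the MKL link line for your compiler, and increase n to push the total footprint past DRAM capacity.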

With Intel® Memory Drive Technology and Optane, it runs at 112% of the DRAM performance. The first time we looked at that, we were surprised by the unexpected result. Why did that happen? A mistake in the test scenario? Definitely not.

Intel Optane plus Intel Memory Drive Technology configuration and memory usage

Optane + Intel Memory Drive Technology configuration – 2 x Intel® Xeon® CPU E5-2699 v4, Intel® Server Board S2600WT, 128GB DDR4 + 4 x Intel® Optane™ SSD (SSDPED1K375GA), CentOS 7.3.1611.

All-DRAM configuration – 2 x Intel® Xeon® CPU E5-2699 v4, Intel® Server Board S2600WT, 768GB DDR4, CentOS 7.3.1611.

Test – SGEMM MKL, segment size 18689, factor 22, threads 42.

Data location is the key. When GEMM initializes the matrices in memory, it runs in a single thread and allocates them sequentially, without considering NUMA locality. Once the benchmark starts and every thread begins its segmented compute, it finds that the data is physically located in both local and remote memory. Intel® Memory Drive Technology, whenever it can, proactively moves the needed data into the DRAM adjacent to the workload to avoid the penalty of QPI traffic. In this GEMM workload, it automatically optimizes data locality by using the nearest DRAM and Optane SSD, and so minimizes the QPI impact. This makes a lot of sense since we used a high-core-count CPU, the Intel® Xeon® CPU E5-2699 v4. An easy button, isn't it?
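To illustrate why the initialization pattern matters, here is a minimal sketch (my own illustration, not code from the benchmark) of the Linux first-touch behavior: pages are physically allocated on the NUMA node of the thread that first writes them, so a serial initialization loop puts the whole matrix on one socket, while a parallel first-touch loop spreads it across the nodes where the compute threads run.

#include <stdlib.h>
#include <omp.h>

#define N 20000LL   /* illustrative matrix dimension (about 1.6GB of floats) */

int main(void)
{
    float *A = malloc((size_t)(N * N) * sizeof(float));
    if (!A) return 1;

    /* Serial init: every page is first touched by the main thread, so the
       entire matrix lands on that thread's NUMA node (the benchmark's case). */
    /* for (long long i = 0; i < N * N; i++) A[i] = 1.0f; */

    /* Parallel first-touch init: each thread touches the elements it will
       later compute on, so those pages are allocated on its local node. */
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < N * N; i++)
        A[i] = 1.0f;

    /* ... the compute phase would now find most of its data in local DRAM ... */
    free(A);
    return 0;
}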

When you are using a very large working set and programming a workload to run on multiple CPUs, it is extremely difficult to design the workload so that the memory for an operation is directly attached to the CPU that needs the data, and the CPU does not have to reach across the QPI link to the other CPU to fetch it. It is also not easy to optimize every workload to avoid remote NUMA accesses, because every unique data set and workload would need to be tuned individually. We knew it was possible to optimize the application for the all-DRAM configuration, and doing so yielded a 20% improvement. That puts the non-optimized GEMM running under Intel® Memory Drive Technology within 8% of the optimized application on the DRAM-only configuration. Also note that this improvement only applies to that particular hardware configuration (it assumes a specific CPU and DRAM layout), and with a new system we must work on the memory layout all over again.
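For completeness, here is a hedged sketch of what that kind of hand tuning can look like on Linux using libnuma (node numbers and segment sizes are illustrative assumptions, not the exact changes we made): each matrix segment is placed explicitly on the NUMA node of the threads that will work on it.

#include <stdio.h>
#include <numa.h>   /* link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    const size_t seg_bytes = (size_t)1 << 30;   /* a 1GB segment, for illustration */

    /* Place one segment on node 0 and one on node 1, matching the sockets
       where the threads working on each segment are pinned. */
    float *seg0 = numa_alloc_onnode(seg_bytes, 0);
    float *seg1 = numa_alloc_onnode(seg_bytes, 1);
    if (!seg0 || !seg1) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* ... initialize and compute, with thread affinity set per socket ... */

    numa_free(seg0, seg_bytes);
    numa_free(seg1, seg_bytes);
    return 0;
}

Tools like numactl can achieve similar placement without code changes, but either way the tuning is tied to a specific socket and DRAM layout, which is exactly the maintenance burden noted above.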

Why GEMM? What makes this case special?

Deep learning is a rapidly emerging branch of machine learning, which relies on large data sets to iteratively "train" many-layered neural networks inspired by the human brain. Trained neural networks are used to "infer" the meaning of new data, with increased speed and accuracy for processes like image search, speech recognition, natural language processing, and other complex tasks. Typically, speech and NLP deep learning applications have multiple stages, and the last stage involves mapping the results of the network onto a large vocabulary using a one-hot vector or another such representation. The vocabulary tends to be very large, up to on the order of 1e6 entries, so this translates into a very large fully connected layer mapping onto up to 1e6 output neurons. This is what the GEMM benchmark attempts to capture. This last layer becomes the primary bottleneck for the entire model, since the other stages of the model are relatively much smaller. So, this disproportionately large layer poses severe challenges: having to go to a distributed implementation just for this layer, compromising on accuracy by performing block operations with partial updates, or taking the additional performance hit of running it off disks. With Intel® Memory Drive Technology and Optane these overheads are alleviated, which benefits performance, cost, and the effort and complexity of programming.
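To put rough numbers on that (an illustrative back-of-the-envelope estimate, not a figure from the benchmark above): with a hidden layer of 2,048 neurons and a vocabulary of 1,000,000 words, the weight matrix of the final fully connected layer alone holds 2,048 x 1,000,000 single-precision values, roughly 8GB, and training typically keeps additional copies for gradients and optimizer state, multiplying that footprint several times over. Evaluating that layer for a mini-batch is exactly the kind of large, memory-hungry matrix multiplication the GEMM benchmark models.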

Big thanks to:

James Myers, Director, SSD Solutions Architecture, Intel Corp
Dheevatsa Mudigere, Software Architect, Intel Corp