Tuning the performance of Intel® Optane™ SSDs on Linux Operating Systems

Frank_O_Intel · ‎01-17-2020

Updated: For changes in Linux kernel 4.20 and beyond (5.x)

Intel® Optane™ SSDs are ultra-fast, and we want to share a few Linux tips to help you get the most out of one of the world’s fastest SSDs. An Intel Optane SSD can achieve sub-10 microsecond response times for 4 KiB I/O and can operate as Software Defined Memory. There are a few key things to know and do before you run your first fio script to test the device and verify that it is working at peak capability. The steps are fast and easy, so you can quickly move on to your application work. You should have your own fio script that matches the needs of your application, or you can use mine below as a first test.

Intel Optane SSDs perform best in a newer architecture (with newer Intel Xeon Scalable Processors). A higher performance processor with a base frequency of 3.0 GHz or higher is recommended, but not required. Intel Optane SSDs will work with slower CPUs, but you will see less throughput per worker than what we achieve in this blog. The P4800X is an NVMe SSD, so any x4-capable PCIe 3.0 slot will work fine for connectivity. PCIe is always backward compatible as well. In addition to the add-in card shown in the picture above, the U.2 interface is also available, so choosing an NVMe-capable server with front enclosures will improve serviceability. Read this blog for more details on how to optimize hardware for best Intel Optane SSD performance.

What else do we specifically recommend in Linux?

Steps to improving performance of Intel® Optane™ SSDs on Linux OS

Step 1: Put your CPUs in performance mode

# echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Ensure the CPU scaling governor is in performance mode by checking the following; here you will see the setting from each processor (vcpu).

# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

You should see performance as the return of this command.
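On a system with many cores, reading that output by eye gets tedious. The following is a minimal sketch of a helper that prints only the CPUs that are not in the expected mode. The function name check_governor and its optional second argument (a directory to scan instead of the real sysfs tree) are my own assumptions for illustration, not part of any tool.

```bash
# List CPUs whose scaling governor differs from the expected one.
# Returns 0 if every CPU matches, 1 otherwise.
check_governor() {
    local want="${1:-performance}"
    local cpu_root="${2:-/sys/devices/system/cpu}"   # override only for testing
    local bad=0 f gov cpu
    for f in "$cpu_root"/cpu*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue        # skip entries without a cpufreq interface
        gov=$(cat "$f")
        if [ "$gov" != "$want" ]; then
            cpu=${f#"$cpu_root"/}      # e.g. cpu3/cpufreq/scaling_governor
            echo "${cpu%%/*}: $gov"    # e.g. cpu3: powersave
            bad=1
        fi
    done
    return "$bad"
}

# Usage: check_governor && echo "all CPUs are in performance mode"
```

No output and a zero exit status mean every CPU that exposes a governor is already set as expected.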

I recommend you make this setting persistent between reboots by changing your Linux startup configuration (for example, rc).

Step 2: Disable IRQ balance (only in kernel versions prior to Linux 4.8)

In kernels before version 4.8, IRQ balancing was not managed as efficiently as it is in the current in-box Linux NVMe driver. If you are on a kernel older than 4.8, turn off the irqbalance service and run the short script below to balance your IRQs; this will allow for the best I/O processing possible.

You can stop and disable the service with the following command on CentOS and Ubuntu:

# systemctl disable --now irqbalance
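Because this step only applies to older kernels, it can help to check the running kernel before touching irqbalance. Here is a small sketch of such a check; the function name kernel_older_than is hypothetical, and it assumes GNU sort with the -V (version sort) option is available.

```bash
# Succeed (exit 0) if the kernel version is older than the given minimum.
# The version defaults to `uname -r`; pass one explicitly to check other values.
kernel_older_than() {
    local min="$1"
    local ver="${2:-$(uname -r)}"
    ver="${ver%%-*}"                    # "5.4.1-1.el8.elrepo.x86_64" -> "5.4.1"
    [ "$ver" = "$min" ] && return 1
    # With version sort, the older of the two strings comes first. If that is
    # not the minimum, the kernel is older than the minimum.
    [ "$(printf '%s\n' "$ver" "$min" | sort -V | head -n 1)" != "$min" ]
}

# Usage:
# if kernel_older_than 4.8; then echo "Step 2 applies to this kernel"; fi
```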

Here is a bash script to set SMP affinity, if you wish to do that. Remember this is only necessary on kernels prior to 4.8.

#!/bin/bash
# For every IRQ with an nvme handler, copy affinity_hint into smp_affinity.
# Only needed on kernels prior to 4.8. Run as root.
for folder in /proc/irq/*; do
    for file in "$folder"/*; do
        if [[ $file == *"nvme"* ]]; then
            echo "$file"
            contents=$(cat "$folder/affinity_hint")
            echo "$contents" > "$folder/smp_affinity"
            cat "$folder/smp_affinity"
        fi
    done
done
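If you would like to look before you write, the sketch below walks the same directories as the script above but only prints the hint and the current affinity for each NVMe IRQ. The function name show_nvme_irq_affinity and its optional directory argument are assumptions made for illustration.

```bash
# Print "<irq> hint=<affinity_hint> current=<smp_affinity>" for every IRQ
# that has an nvme handler entry. Read-only: nothing is modified.
show_nvme_irq_affinity() {
    local irq_root="${1:-/proc/irq}"    # override only for testing
    local folder entry
    for folder in "$irq_root"/*; do
        [ -d "$folder" ] || continue
        for entry in "$folder"/*nvme*; do
            [ -e "$entry" ] || continue # glob did not match anything
            echo "${folder##*/} hint=$(cat "$folder/affinity_hint")" \
                 "current=$(cat "$folder/smp_affinity")"
            break                       # one line per IRQ is enough
        done
    done
}

# Usage: show_nvme_irq_affinity
```

Lines where the hint and current values differ are the ones the script above would change.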

Step 3: Enable polling or poll queues in your Linux in-box NVMe driver

Linux 4.20 brought optimizations to the NVMe driver, including a new parameter that governs polling. Polling should not involve interrupts of any kind, and the NVMe driver developers needed to make changes to allow for this improvement. The result is poll queues, which are available in 4.20 and later.

To run NVMe with poll queues, load the driver with I/O polling enabled. You might want to set up as many poll queues as there are virtual cores in your system, or decide based on your implementation goals; there is no simple answer here.

My example is:

To enable polling mode per device (before kernel 4.20):

# echo 1 > /sys/block/nvme0n1/queue/io_poll

To enable poll_queues in the NVMe driver system wide (kernel 4.20 and later):

# modprobe -r nvme && modprobe nvme poll_queues=4

The above is for a quick system test. If you want to enable it on boot, or if you are booting from an NVMe drive, don’t use the method above; use the following method instead.

# more /etc/modprobe.d/nvme.conf

options nvme poll_queues=4
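If you want to script the creation of that file, here is a minimal sketch. The function name write_nvme_poll_conf is hypothetical; it defaults the queue count to the number of CPUs reported by nproc, which is only one of the possible choices discussed above, and it overwrites the target file, so keep any other nvme options you already have in mind.

```bash
# Write the poll_queues option for the nvme module to a modprobe.d file.
# Writing under /etc requires root.
write_nvme_poll_conf() {
    local count="${1:-$(nproc)}"
    local conf="${2:-/etc/modprobe.d/nvme.conf}"
    case "$count" in
        ''|*[!0-9]*)
            echo "poll_queues must be a non-negative integer" >&2
            return 1 ;;
    esac
    # Note: this replaces the file contents.
    printf 'options nvme poll_queues=%s\n' "$count" > "$conf"
}

# Usage: write_nvme_poll_conf 4
```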

Next, rebuild the initramfs so that the module parameter is picked up. Using these commands, first back up the image, and then rebuild it.

# cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.$(date +%m-%d-%H%M%S).bak

# dracut --force
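The commands above are the CentOS way. On Ubuntu the initramfs is normally rebuilt with update-initramfs and the image lives at /boot/initrd.img-$(uname -r), so the backup path differs as well. As a rough sketch, the hypothetical helper below prints which rebuild command appears to apply on the current system without running it.

```bash
# Print the initramfs rebuild command that appears to apply on this system.
# Nothing is executed; run the printed command yourself as root.
initramfs_rebuild_cmd() {
    if command -v dracut >/dev/null 2>&1; then
        echo "dracut --force"           # CentOS/RHEL, as used above
    elif command -v update-initramfs >/dev/null 2>&1; then
        echo "update-initramfs -u"      # Debian/Ubuntu equivalent
    else
        echo "no known initramfs tool found" >&2
        return 1
    fi
}

# Usage: initramfs_rebuild_cmd
```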

Now reboot your system and check your work…

# systool -vm nvme

You should see that poll_queues is set to your desired value, and is no longer 0.

poll_queues         = "4"
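If systool is not installed, the same value can usually be read straight from sysfs. The sketch below assumes the driver exposes the parameter at /sys/module/nvme/parameters/poll_queues, which is where module parameters normally appear; the function name nvme_poll_queues is my own.

```bash
# Print the poll_queues value the nvme driver is currently using.
nvme_poll_queues() {
    local module_root="${1:-/sys/module}"   # override only for testing
    local param="$module_root/nvme/parameters/poll_queues"
    if [ ! -r "$param" ]; then
        echo "poll_queues is not exposed (driver not loaded, or kernel older than 4.20?)" >&2
        return 1
    fi
    cat "$param"
}

# Usage: nvme_poll_queues
```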

You can also set up Grub2 to add this kernel option by following the instructions in the CentOS wiki for CentOS 7.

https://wiki.centos.org/HowTos/Grub2

Step 4: Choose appropriate fio ioengine, and I/O polling mode

Next, check the configuration of the system you built. The most critical performance will show itself at QD1 (queue depth 1) with just 1 worker thread. You can run this with any number of ioengines, but we recommend pvsync2 or io_uring in hipri mode. Here are the requirements:

Polling mode per device requires Linux kernel 4.8 or newer.

Poll queues require Linux kernel 4.20 or newer.

io_uring requires Linux kernel 5.1 or newer.

If you are new to developing with io_uring, you should move to the most stable Linux 5.x kernel available when you do this.
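The kernel requirements listed above can be turned into a quick check. This is only a sketch: the function name kernel_features is hypothetical, the version thresholds are taken directly from the list above, and it assumes a version string that starts with major.minor, as `uname -r` prints.

```bash
# Print which of the polling-related features a kernel version supports,
# according to the requirements listed above.
kernel_features() {
    local ver="${1:-$(uname -r)}"
    local major minor code
    major="${ver%%.*}"                  # "5.4.1-1.el8" -> "5"
    minor="${ver#*.}"                   # "5.4.1-1.el8" -> "4.1-1.el8"
    minor="${minor%%[!0-9]*}"           # "4.1-1.el8"   -> "4"
    code=$(( major * 1000 + ${minor:-0} ))
    [ "$code" -ge 4008 ] && echo "per-device polling (io_poll)"
    [ "$code" -ge 4020 ] && echo "poll queues (poll_queues)"
    [ "$code" -ge 5001 ] && echo "io_uring"
    return 0
}

# Usage: kernel_features            # checks the running kernel
#        kernel_features 4.18.0     # checks a specific version
```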

Below is a recommended fio script:

[global]
name=OptaneFirstTest
ioengine=pvsync2
hipri
direct=1
size=100%
randrepeat=0
time_based
ramp_time=0
norandommap
refill_buffers
log_avg_msec=1000
log_max_value=1
group_reporting
percentile_list=1.0:25.0:50.0:75.0:90.0:99.0:99.9:99.99:99.999:99.9999:99.99999:99.999999:100.0
filename=/dev/nvme0n1

[rd_rnd_qd_1_4k_1w]
bs=4k
iodepth=1
numjobs=1
rw=randread
cpus_allowed=0-17 # dependent on your NUMA goals
runtime=300
write_bw_log=bw_rd_rnd_qd_1_4k_1w
write_iops_log=iops_rd_rnd_qd_1_4k_1w
write_lat_log=lat_rd_rnd_qd_1_4k_1w

We use cpus_allowed for NUMA locality here and for no other reason. On the server I ran this on, this job will burn just one core. You may also need to compile fio a specific way to get the particular ioengines that you want; in fact, around here in Intel Storage land, there are custom fio builds all over the place.
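Before running the job, it is worth confirming that your fio build actually includes the engine you picked. fio can list its engines with the --enghelp option; the sketch below filters that list, assuming the engine names are printed one per line (possibly indented). The function name engine_listed is hypothetical.

```bash
# Succeed if the named engine appears in an engine list read from stdin.
engine_listed() {
    local engine="$1"
    # Trim surrounding whitespace on each line, then require an exact match
    # so that, for example, "sync" does not match "pvsync2".
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -Fxq -- "$engine"
}

# Usage:
# fio --enghelp | engine_listed io_uring || echo "this fio build lacks io_uring"
```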

Results from a build with Intel Xeon Gold 6254 CPUs and Linux 5.4.1-1 kernel

Summary output from fio on my system:

rd_rnd_qd_1_4k_1w: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=pvsync2, iodepth=1
fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=487MiB/s][r=125k IOPS][eta 00m:00s]
rd_rnd_qd_1_4k_1w: (groupid=0, jobs=1): err= 0: pid=3036: Wed Jan 15 14:00:45 2020
  read: IOPS=125k, BW=487MiB/s (511MB/s)(143GiB/300001msec)
    clat (usec): min=7, max=202, avg= 7.76, stdev= 1.29
    lat (usec): min=7, max=202, avg= 7.78, stdev= 1.29
    clat percentiles (usec):
     | 1.000000th=[ 8], 25.000000th=[ 8], 50.000000th=[ 8],
     | 75.000000th=[ 8], 90.000000th=[ 8], 99.000000th=[ 10],
     | 99.900000th=[ 33], 99.990000th=[ 38], 99.999000th=[ 106],
     | 99.999900th=[ 159], 99.999990th=[ 167], 99.999999th=[ 204],
     | 100.000000th=[ 204]
  bw ( KiB/s): min=498144, max=500708, per=100.00%, avg=499144.54, stdev=489.10, samples=299
  iops : min=124536, max=125177, avg=124786.12, stdev=122.32, samples=299
  lat (usec) : 10=99.27%, 20=0.55%, 50=0.17%, 100=0.01%, 250=0.01%
  cpu : usr=5.80%, sys=94.07%, ctx=1018, majf=0, minf=23
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=37437069,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=487MiB/s (511MB/s), 487MiB/s-487MiB/s (511MB/s-511MB/s), io=143GiB (153GB), run=300001-300001msec

Disk stats (read/write):
  nvme0n1: ios=37424564/0, merge=0/0, ticks=261212/0, in_queue=0, util=99.99%
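Two quick pieces of arithmetic help when sanity-checking numbers like these. Bandwidth is IOPS times block size, and with one worker at queue depth 1 only one I/O is in flight at a time, so IOPS cannot exceed one divided by the average latency. The sketch below does both calculations with awk; the function names are my own.

```bash
# Convert IOPS at a given block size (in bytes) to MiB/s.
iops_to_mibps() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1048576 }'
}

# Ceiling on IOPS for one worker at QD1, given average latency in microseconds.
qd1_iops_ceiling() {
    awk -v lat="$1" 'BEGIN { printf "%.0f\n", 1000000 / lat }'
}

iops_to_mibps 124786.12 4096    # average IOPS above, 4 KiB blocks -> 487.4
qd1_iops_ceiling 7.78           # average total latency above      -> 128535
```

The first result matches the 487MiB/s that fio reports. The second shows that the measured 124,786 IOPS is about 97% of what a 7.78 microsecond average latency allows for a single worker, with the remainder spent outside the measured I/O path.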

We hope these steps provide you a great first experience with Intel Optane based SSDs on Linux. Now comes the fun part. It’s time for you to achieve amazing innovations and a new level of storage flexibility for your business goals; getting more per server or CPU core just got a lot easier. Intel Optane SSDs are clearly a great choice for a fast accelerator, tier, or caching device. A whole new world of application use cases has been evolving around this device, particularly with the advent of io_uring, a new asynchronous high-speed poll mode interface that allows applications to change the way they do I/O and what they can do with I/O, including offloading or complementing system memory.

Feel free to reach out to the Intel support site and we’ll be happy to give you more help on achieving amazing performance with Intel Optane SSDs.

Please note: All tests were done on a CentOS 8.0 distribution, with some on Ubuntu 18.04.3 LTS, using kernel 5.4.1-1.el8.elrepo.x86_64 and fio-3.16-64-gfd988. Your hardware and software configuration may not be compatible with everything you read here.