Updated: For changes in Linux kernel 4.20 and beyond (5.x)
Intel Optane SSDs are ultra-fast, and we wanted to share a few Linux tips to help you get the most out of one of the world’s fastest SSDs. An Optane SSD can achieve sub-10 microsecond response times for 4 KiB I/O and can operate as Software Defined Memory. There are a few key things to know and do before you run that first fio script to test the device and verify that it’s working at its peak capability. This is fast and easy, so you can quickly get into your application efforts. You should have your own fio script that matches the needs of your application, or you can use mine below as a first test.
Optane SSDs perform best in a newer architecture (with newer Intel Xeon Scalable Processors). A higher-performance processor with a base frequency of 3.0 GHz or higher is recommended, but not required. Of course, Optane will work on slower CPUs, but you’ll see less throughput per worker than what we achieve in this blog. The P4800X is an NVMe SSD, so any x4-capable PCIe 3.0 slot will work fine for connectivity, and PCIe is always backward compatible. In addition to the add-in card shown in the picture above, a U.2 version is also available, so choosing an NVMe-capable server with front enclosures will improve serviceability. Read this blog for more details on how to optimize hardware for best Optane SSD performance.
What else do we specifically recommend in Linux?
Steps to improving performance of Intel SSDs on Linux OS
Step 1: Put your CPUs in performance mode
# echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Ensure the CPU scaling governor is in performance mode by checking the following; here you will see the setting from each processor (vcpu).
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
You should see performance as the return of this command.
I recommend you make this setting persistent between reboots by changing your Linux startup configuration (e.g. rc).
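With many cores, the output of the cat command above is easy to misread. The following is a minimal sketch of a helper that lists only the CPUs whose governor is not performance. The function name check_governors and its optional directory argument are illustrative assumptions, not part of any Intel tooling.

```shell
# Report every CPU whose scaling governor is not "performance".
# $1 optionally overrides the sysfs root, which is handy for a dry run.
# Returns 0 if all are "performance", 1 if any is not, 2 if none were found.
check_governors() {
    root="${1:-/sys/devices/system/cpu}"
    found=0
    bad=0
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # glob did not match, or file unreadable
        found=1
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            echo "$f: $gov"
            bad=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no cpufreq governor files under $root" >&2
        return 2
    fi
    return "$bad"
}
```

Running check_governors with no argument inspects the live system; no output and a zero exit status mean every CPU is in performance mode.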
Step 2: Disable IRQ balance (only in kernel versions prior to Linux 4.8)
In kernels before version 4.8, IRQ balancing was not managed as efficiently as it is in the current in-box Linux NVMe driver. So, if you are on a kernel older than 4.8, please turn off the irqbalance service and run a short script (below) to balance your IRQs. This will allow for the best I/O processing possible.
You can stop and disable the service with the following command on CentOS and Ubuntu:
# systemctl disable --now irqbalance
Here is a bash script to set SMP affinity, if you wish to do that. Remember this is only necessary on kernels prior to 4.8.
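# Assumption: each NVMe IRQ is pinned to the CPU mask the driver suggests in affinity_hint.
for folder in /proc/irq/*; do
  for file in "$folder"/*; do
    if [[ $file == *"nvme"* ]]; then
      cat "$folder/affinity_hint" > "$folder/smp_affinity"
    fi
  done
done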
Step 3: Enable polling or poll queues in your Linux in-box NVMe driver
Since Linux 4.20 there have been optimizations to the NVMe driver to allow for a new parameter that governs polling. Polling should not involve interrupts of any kind, and NVMe driver developers needed to make changes to allow for this improvement. This brought the advent of poll queues which are now available in 4.20 and later.
To run NVMe with poll queues, load the driver with I/O polling enabled. You might want to set up a number of poll queues equal to the number of virtual cores in your system, or decide based on your implementation goals; there is no simple answer here.
My example is:
To enable polling mode by device (before kernel 4.20) -
# echo 1 > /sys/block/nvme0n1/queue/io_poll
To enable poll_queues in the NVMe driver system-wide (kernel 4.20 and later) -
# modprobe -r nvme && modprobe nvme poll_queues=4
The above is for a quick system test. If you want to enable it on boot, or if you are booting from an NVMe drive, don’t use the method above; use the following method instead.
# more /etc/modprobe.d/nvme.conf
options nvme poll_queues=4
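If you would rather generate that options line than type it, here is a minimal sketch. The helper name nvme_poll_conf_line is an illustrative assumption; with no argument it falls back to one poll queue per logical CPU as reported by nproc, which is only one of the sizing choices discussed above.

```shell
# Print a modprobe.d options line for the nvme module.
# $1 is the number of poll queues; the default is the logical CPU count.
nvme_poll_conf_line() {
    n="${1:-$(nproc)}"
    case "$n" in
        ''|*[!0-9]*)
            echo "poll_queues must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "options nvme poll_queues=$n"
}
```

As root, redirect the output into place, for example: nvme_poll_conf_line 4 > /etc/modprobe.d/nvme.conf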
Next, rebuild the initramfs so the module parameter is picked up at boot. Using these commands, first back up the image, and then rebuild it.
# cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.$(date +%m-%d-%H%M%S).bak
# dracut --force
Now reboot your system and check your work…
# systool -vm nvme
You should see that poll_queues is set to your desired value, and no longer 0.
poll_queues = "4"
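If systool is not installed, the same value can usually be read straight from sysfs. This is a sketch that assumes the loaded nvme module exposes the parameter at /sys/module/nvme/parameters/poll_queues; the function name nvme_poll_queues and its optional path argument are illustrative.

```shell
# Print the poll_queues value the loaded nvme module is using.
# $1 optionally overrides the parameter path (assumed default shown below).
nvme_poll_queues() {
    param="${1:-/sys/module/nvme/parameters/poll_queues}"
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "cannot read $param (is the nvme module loaded, and is the kernel 4.20 or newer?)" >&2
        return 1
    fi
}
```

A result of 0 means poll queues are not enabled and the module option has not been picked up.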
You can also set up GRUB2 to add this kernel option by following the instructions in the CentOS wiki for CentOS 7.
Step 4: Choose appropriate fio ioengine, and I/O polling mode
Next, check the configuration of the system you built. The most critical performance will show itself at QD1 (queue depth 1) with just 1 worker thread. You can run this with any number of ioengines, but we recommend pvsync2 or io_uring in hipri mode. Here are the requirements:
- Polling mode per device requires Linux kernel 4.8 or newer.
- Poll queues require Linux kernel 4.20 or newer.
- io_uring requires Linux kernel 5.1 or newer.
If you are new to developing with io_uring, you should move to the most stable Linux 5.x kernel available when you do this.
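The version requirements above can be checked mechanically. The sketch below compares a kernel release string against the three thresholds listed in this step; the function names kernel_at_least and polling_features are illustrative assumptions, and distribution kernels with backported features may differ from what a plain version comparison suggests.

```shell
# Succeed when release string $1 (e.g. "5.4.1-1.el8") is at least $2.$3.
kernel_at_least() {
    ver="${1%%-*}"                   # drop any "-1.el8..." style suffix
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    minor="${minor%%[!0-9]*}"        # drop trailing non-digits such as "+"
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# List the polling features available for kernel $1 (default: running kernel).
polling_features() {
    v="${1:-$(uname -r)}"
    kernel_at_least "$v" 4 8  && echo "per-device polling (io_poll)"
    kernel_at_least "$v" 4 20 && echo "poll queues (poll_queues)"
    kernel_at_least "$v" 5 1  && echo "io_uring"
    return 0
}
```

Calling polling_features with no argument checks the running kernel; passing a release string such as 4.19.0 shows what an older kernel would support.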
Below is a recommended fio script:
cpus_allowed=0-17 # dependent on your NUMA goals
We use cpus_allowed for NUMA locality here and for no other reason. On the server I ran this on, this job burns just one core. You may also need to compile fio a specific way to get the particular ioengines that you want; in fact, around here in Intel Storage land, there are custom fio builds all over the place.
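For a quick start from the command line, here is a hedged sketch of an equivalent QD1 random-read run. The job name, ioengine, block size, queue depth and 300 second runtime are inferred from the fio summary output in this post; direct=1, numjobs=1 and the default device /dev/nvme0n1 are assumptions, and this is not the complete job file used for the published results. The helper name fio_qd1_randread is illustrative, and it only prints the command so you can review it first.

```shell
# Print a fio command line for a 4 KiB random-read test at queue depth 1.
# $1 is the block device to read (assumed default: /dev/nvme0n1).
fio_qd1_randread() {
    dev="${1:-/dev/nvme0n1}"
    echo fio --name=rd_rnd_qd_1_4k_1w \
        --filename="$dev" \
        --ioengine=pvsync2 --hipri \
        --direct=1 \
        --rw=randread --bs=4k \
        --iodepth=1 --numjobs=1 \
        --time_based --runtime=300 \
        --cpus_allowed=0-17
}
```

Once the printed command looks right for your system, run it as root, for example with: fio_qd1_randread /dev/nvme0n1 | sh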
Results from a build with Intel Xeon Gold 6254 CPUs and Linux 5.4.1-1 kernel
Summary output from fio on my system:
rd_rnd_qd_1_4k_1w: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=pvsync2, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=487MiB/s][r=125k IOPS][eta 00m:00s]
rd_rnd_qd_1_4k_1w: (groupid=0, jobs=1): err= 0: pid=3036: Wed Jan 15 14:00:45 2020
read: IOPS=125k, BW=487MiB/s (511MB/s)(143GiB/300001msec)
clat (usec): min=7, max=202, avg= 7.76, stdev= 1.29
lat (usec): min=7, max=202, avg= 7.78, stdev= 1.29
clat percentiles (usec):
| 1.000000th=[ 8], 25.000000th=[ 8], 50.000000th=[ 8],
| 75.000000th=[ 8], 90.000000th=[ 8], 99.000000th=[ 10],
| 99.900000th=[ 33], 99.990000th=[ 38], 99.999000th=[ 106],
| 99.999900th=[ 159], 99.999990th=[ 167], 99.999999th=[ 204],
| 100.000000th=[ 204]
bw ( KiB/s): min=498144, max=500708, per=100.00%, avg=499144.54, stdev=489.10, samples=299
iops : min=124536, max=125177, avg=124786.12, stdev=122.32, samples=299
lat (usec) : 10=99.27%, 20=0.55%, 50=0.17%, 100=0.01%, 250=0.01%
cpu : usr=5.80%, sys=94.07%, ctx=1018, majf=0, minf=23
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=37437069,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=487MiB/s (511MB/s), 487MiB/s-487MiB/s (511MB/s-511MB/s), io=143GiB (153GB), run=300001-300001msec
Disk stats (read/write):
nvme0n1: ios=37424564/0, merge=0/0, ticks=261212/0, in_queue=0, util=99.99%
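The numbers in this output are easy to cross-check with a little arithmetic. At queue depth 1 with one worker, bandwidth is IOPS times block size, and the time per I/O is simply the inverse of IOPS. The sketch below does both conversions with awk; the function names are illustrative.

```shell
# Convert IOPS ($1) at a block size in bytes ($2) to MiB/s.
iops_to_mib_per_s() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.2f\n", iops * bs / 1048576 }'
}

# At queue depth 1 with one worker, time per I/O in microseconds for IOPS ($1).
qd1_time_per_io_us() {
    awk -v iops="$1" 'BEGIN { printf "%.2f\n", 1000000 / iops }'
}
```

With the average of 124786.12 IOPS reported above, iops_to_mib_per_s 124786.12 4096 prints 487.45, matching the 487MiB/s that fio reports, and qd1_time_per_io_us 124786.12 prints 8.01. That is slightly above the 7.78 usec average latency, presumably because each loop iteration also includes a small amount of work outside the latency that fio measures.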
We hope these steps give you a great first experience with Optane-based SSDs on Linux. Now comes the fun part. It’s time for you to achieve amazing innovations and a new level of storage flexibility for your business goals; getting more per server or CPU core just got a lot easier. Optane SSDs are clearly a great choice for a fast accelerator, tier, or caching device. A whole new world of application use cases has been evolving around this device, particularly with the advent of io_uring, a new asynchronous, high-speed, poll-mode interface that allows applications to change the way they do I/O and what they can do with I/O, including offloading or complementing system memory.
Feel free to reach out to the Intel support site and we’ll be happy to give you more help on achieving amazing performance with Optane SSDs.
Please note: All tests were done on a CentOS 8.0 distribution, and a few on Ubuntu 18.04.3 LTS, using kernel 5.4.1-1.el8.elrepo.x86_64 and fio-3.16-64-gfd988. Your hardware and software configuration may not be compatible with everything you read here.