Back in 2011, I made the statement, "I have put my Oracle redo logs or SQL Server transaction log on nothing but SSDs" (Improve Database Performance: Redo and Transaction Logs on Solid State Disks (SSDs)). In fact, since the release of the Intel® SSD X25-E series in 2008, it is fair to say I have never looked back. Even though those X25-Es have long since been retired, every new product has convinced me further still that, from a performance perspective, a hard drive configuration just cannot compete. This is not to say that there have not been new skills to learn, such as the configuration details explained here (How to Configure Oracle Redo on SSD (Solid State Disks) with ASM). The Intel® SSD 910 series provided a definite step up from the X25-E for Oracle workloads (Comparing Performance of Oracle Redo on Solid State Disks (SSDs)) and proved that concerns about write peaks were unfounded (Should you put Oracle Database Redo on Solid State Disks (SSDs)). Now, with the PCIe*-based Intel® SSD DC P3600/P3700 series, we have the next step in the evolutionary development of SSDs for all types of Oracle workloads.
Additionally, we have updates in operating system and driver support, so a refresh of the previous posts on SSDs for Oracle is warranted to help you get the best out of the Intel SSD DC P3700 series for Oracle redo.
One significant difference in the new SSDs is the change in interface and driver from AHCI and SATA to NVMe (Non-Volatile Memory Express). For an introduction to NVMe see this video by James Myers, and to understand the efficiency that NVMe brings read this post by Christian Black. As James noted, high-performance, consistent, low-latency Oracle redo logging also needs high endurance, so the P3700 is the drive to use. With a new interface comes a new driver, which fortunately is included in the Linux kernel at the Oracle-supported releases of Red Hat and Oracle Linux: 6.5, 6.6 and 7.
I am using Oracle Linux 7.
Booting my system with both a RAID array of Intel SSD DC S3700 series and Intel SSD DC P3700 series shows two new disk devices:
First the S3700 array using the previous interface
- Disk /dev/sdb1: 2394.0 GB, 2393997574144 bytes, 4675776512 sectors
- Units = sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 4096 bytes
- I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Second the new PCIe P3700 using NVMe
- Disk /dev/nvme0n1: 800.2 GB, 800166076416 bytes, 1562824368 sectors
- Units = sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 512 bytes
- I/O size (minimum/optimal): 512 bytes / 512 bytes
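The same sector sizes can also be read directly from sysfs, which is handy for scripting a check before and after reformatting. This is a minimal sketch: the function name `sector_sizes` and its optional second argument (an alternative sysfs root, there only so it can be tried without the hardware) are my own, not part of any tool used in this post.

```shell
# Print the logical and physical sector size of a block device from sysfs.
# $1 = device name (for example nvme0n1)
# $2 = sysfs root, defaults to /sys (hypothetical override for dry runs)
sector_sizes() (
    dev=$1
    root=${2:-/sys}
    q="$root/block/$dev/queue"
    # Fail cleanly if the device is not present
    [ -r "$q/logical_block_size" ] || { echo "no such device: $dev" >&2; exit 1; }
    printf '%s logical=%s physical=%s\n' "$dev" \
        "$(cat "$q/logical_block_size")" "$(cat "$q/physical_block_size")"
)
```

Called as `sector_sizes nvme0n1` on the system above, it should report 512 for both values until the drive is reformatted.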
Changing the Sector Size to 4KB
As Oracle introduced support for 4KB sector sizes at Oracle release 11g R2, it is important to be at a minimum of this release or Oracle 12c to take full advantage of SSD for Oracle redo. However, ‘out of the box’ the P3700 presents a 512 byte sector size, as shown above. We can use this ‘as is’ and set the hidden Oracle parameter ‘_disk_sector_size_override’ to true. With this set we can then specify a blocksize of 4KB when creating a redo log file. Oracle will then use 4KB redo log blocks and performance will not be compromised.
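As a rough sketch of this first option, the helper below only prints the two SQL statements involved so they can be reviewed before being run. It assumes the override is set as the hidden parameter `_disk_sector_size_override` (with a leading underscore); the function name and its arguments (group number, disk group, size) are illustrative choices of mine.

```shell
# Emit, but do not execute, the SQL for adding a redo log group with an
# explicit 4KB block size on a device that still reports 512-byte sectors.
# $1 = redo group number, $2 = ASM disk group name, $3 = log size
# (all three are hypothetical inputs chosen by the caller)
redo_4k_sql() {
    cat <<EOF
alter system set "_disk_sector_size_override"=true scope=both;
alter database add logfile group $1 ('+$2') size $3 blocksize 4096;
EOF
}
```

For example, `redo_4k_sql 4 REDO 32g` prints the statements for a fourth group in the REDO disk group; the output could then be pasted into a SQL*Plus session connected as SYSDBA.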
As a second option, the P3700 offers a feature called ‘Variable Sector Size’. Because we know we need 4KB sectors, we can set up the P3700 to present a 4KB sector size instead. This can then be used transparently by Oracle without the requirement for additional parameters. It is important to do this before you have configured or started to use the drive for Oracle as the operation is destructive of any existing data on the device.
To do this, first check that everything is up to date by using the Intel Solid State Drive Data Center Tool from https://downloadcenter.intel.com/download/23931/Intel-Solid-State-Drive-Data-Center-Tool Be aware that after running the command it will be necessary to reboot the system to pick up the new configuration and use the device.
- [root@haswex1 ~]# isdct show -intelssd
- - IntelSSD Index 0 -
- Bootloader: 8B1B012D
- DevicePath: /dev/nvme0n1
- DeviceStatus: Healthy
- Firmware: 8DV10130
- FirmwareUpdateAvailable: Firmware is up to date as of this tool release.
- Index: 0
- ProductFamily: Intel SSD DC P3700 Series
- ModelNumber: INTEL SSDPEDMD800G4
- SerialNumber: CVFT421500GT800CGN
Then run the following command to change the sector size. The parameter LBAFormat=3 sets it to 4KB and LBAFormat=0 sets it back to 512 bytes.
- [root@haswex1 ~]# isdct start -intelssd 0 Function=NVMeFormat LBAFormat=3 SecureEraseSetting=2 ProtectionInformation=0 MetaDataSetting=0
- WARNING! You have selected to format the drive!
- Proceed with the format? (Y|N): Y
- Running NVMe Format...
- NVMe Format Successful.
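Because the format is destructive, it can help to build the command line first and look at it before running anything. The sketch below does only that: it maps a sector size to the LBAFormat value given above and prints the isdct command. The function names are mine, and only the two LBAFormat values mentioned in this post are handled.

```shell
# Map a sector size in bytes to the LBAFormat value quoted in the text
# (3 for 4KB, 0 for 512 bytes). Any other size is rejected.
lba_format_for() {
    case $1 in
        4096) echo 3 ;;
        512)  echo 0 ;;
        *)    echo "unsupported sector size: $1" >&2; return 1 ;;
    esac
}

# Print, without running it, the format command for drive index $1 and
# sector size $2. The remaining settings are copied from the command above.
format_command() (
    fmt=$(lba_format_for "$2") || exit 1
    echo "isdct start -intelssd $1 Function=NVMeFormat LBAFormat=$fmt SecureEraseSetting=2 ProtectionInformation=0 MetaDataSetting=0"
)
```

`format_command 0 4096` reproduces the command I ran above, and `format_command 0 512` gives the command to return the drive to 512 byte sectors.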
After the format completed I rebooted. The reboot is needed because the device requires an NVMe reset, and I am on Oracle Linux 7 with a UEK kernel at 3.8.13-35.3.1. At Linux kernels 3.10 and above you can instead run the following command with the system online to do the reset.
- echo 1 > /sys/class/misc/nvme0/device/reset
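To decide between a reboot and the online reset in a script, the kernel release can be compared against 3.10. This is a small sketch with a function name of my own choosing; it looks only at the major and minor numbers of the release string.

```shell
# Succeed when kernel release $1 (as printed by uname -r) is at or above
# 3.10, the level given in the text for resetting the device online.
supports_online_reset() (
    release=$1
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
)
```

On my system `supports_online_reset "$(uname -r)"` fails because the UEK kernel is at 3.8.13, which is why I rebooted.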
The disk should now present the 4KB sector size we want for Oracle redo.
- Disk /dev/nvme0n1: 800.2 GB, 800166076416 bytes, 195353046 sectors
- Units = sectors of 1 * 4096 = 4096 bytes
- Sector size (logical/physical): 4096 bytes / 4096 bytes
- I/O size (minimum/optimal): 4096 bytes / 4096 bytes
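The capacity in bytes is unchanged by the format; only the number of sectors reported changes. A one-line helper (the name is mine) makes the relationship explicit.

```shell
# Number of sectors a device of $1 bytes presents at a sector size of $2 bytes.
sector_count() {
    echo $(( $1 / $2 ))
}
```

`sector_count 800166076416 512` gives 1562824368 and `sector_count 800166076416 4096` gives 195353046, matching the two fdisk listings.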
Configuring the P3700 for ASM
For ASM (Automatic Storage Management) we need a disk with a single partition. After giving the disk a GPT label, I use the following commands to create an aligned partition and check its alignment.
- (parted) mkpart primary 2048s 100%
- (parted) print
- Model: Unknown (unknown)
- Disk /dev/nvme0n1: 195353046s
- Sector size (logical/physical): 4096B/4096B
- Partition Table: gpt
- Disk Flags:
- Number Start End Size File system Name Flags
- 1 2048s 195352831s 195350784s primary
- (parted) align-check optimal 1
- 1 aligned
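The alignment that parted reports can be sanity-checked with simple arithmetic. In this sketch the function names are mine, and the 1 MiB default boundary is an assumption (a common choice), not something parted was asked to use here.

```shell
# Succeed when a partition starting at sector $1, on a device with $2-byte
# sectors, begins on a multiple of $3 bytes (default 1 MiB, an assumption).
partition_aligned() {
    [ $(( ($1 * $2) % ${3:-1048576} )) -eq 0 ]
}

# Sectors in a partition running from start sector $1 to end sector $2 inclusive.
partition_sectors() {
    echo $(( $2 - $1 + 1 ))
}
```

With 4096 byte sectors, a start of 2048s places the partition 8 MiB into the device, so `partition_aligned 2048 4096` succeeds, and `partition_sectors 2048 195352831` gives the 195350784 sectors shown in the print output.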
I then use udev to set the device permissions. Note: the scsi_id command can be run independently to find the device id to put in the file and the udevadm command used to apply the rules. Rebooting the system is useful during configuration to ensure that the correct permissions are applied on boot.
- [root@haswex1 ~]# cd /etc/udev/rules.d/
- [root@haswex1 rules.d]# more 99-oracleasm.rules
- KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="3600508e000000000c52195372b1d6008", OWNER="oracle", GROUP="dba", MODE="0660"
- KERNEL=="nvme0n1p1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="365cd2e4080864356494e000000010000", OWNER="oracle", GROUP="dba", MODE="0660"
Once the rules are successfully applied, the oracle user has ownership of the DC S3700 RAID array device and the P3700 presented by NVMe.
- [root@haswex1 rules.d]# ls -l /dev/sdb1
- brw-rw---- 1 oracle dba 8, 17 Mar 9 14:47 /dev/sdb1
- [root@haswex1 rules.d]# ls -l /dev/nvme0n1p1
- brw-rw---- 1 oracle dba 259, 1 Mar 9 14:39 /dev/nvme0n1p1
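Both rules follow the same pattern, so a small generator avoids typing mistakes when adding more devices. This is only a sketch: the function name is mine, and the owner and group defaults simply mirror the values used in my rules file.

```shell
# Print one udev rule in the same shape as the rules shown above.
# $1 = kernel device pattern, $2 = expected scsi_id result,
# $3 = owner (default oracle), $4 = group (default dba)
asm_udev_rule() {
    printf 'KERNEL=="%s", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="%s", OWNER="%s", GROUP="%s", MODE="0660"\n' \
        "$1" "$2" "${3:-oracle}" "${4:-dba}"
}
```

The output can be appended to the rules file, after which `udevadm control --reload-rules` followed by `udevadm trigger` applies it without a reboot.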
Use ASMLIB to mark both disks for ASM.
- [root@haswex1 rules.d]# oracleasm createdisk VOL2 /dev/nvme0n1p1
- Writing disk header: done
- Instantiating disk: done
- [root@haswex1 rules.d]# oracleasm listdisks
As the Oracle user, use the ASMCA utility to create the ASM disk groups.
I now have 2 disk groups created under ASM.
Because of the way the disks were configured, Oracle has automatically detected and applied the sector size of 4KB.
- [oracle@haswex1 ~]$ sqlplus sys/oracle as sysasm
- SQL*Plus: Release 12.1.0.2.0 Production on Thu Mar 12 10:30:04 2015
- Copyright (c) 1982, 2014, Oracle. All rights reserved.
- Connected to:
- Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
- With the Automatic Storage Management option
- SQL> select name, sector_size from v$asm_diskgroup;
- NAME SECTOR_SIZE
- ------------------------------ -----------
- REDO 4096
- DATA 4096
SPFILES in 4K DISKGROUPS
In previous posts I noted Oracle bug “16870214 : DB STARTUP FAILS WITH ORA-17510 IF SPFILE IS IN 4K SECTOR SIZE DISKGROUP”, and even with Oracle 12.1.0.2 this bug is still with us. As both of my diskgroups have a 4KB sector size, this will affect me if I try to create a database in either without having applied patch 16870214.
With this bug, upon creating a database with DBCA you will see the following error.
The database is created and the spfile does exist so can be extracted as follows:
- ASMCMD> cd PARAMETERFILE
- ASMCMD> ls
- ASMCMD> cp spfile.282.873892817 /home/oracle/testspfile
- copying +DATA/TEST/PARAMETERFILE/spfile.282.873892817 -> /home/oracle/testspfile
This spfile is corrupt and attempts to reuse it will result in errors.
- ORA-17510: Attempt to do i/o beyond file size
- ORA-17512: Block Verification Failed
However, you can extract the parameters by using the strings command and then create an external spfile, or an spfile in a diskgroup with a 512 byte sector size. Once complete, the Oracle instance can be started.
- SQL> create spfile='/u01/app/oracle/product/12.1.0/dbhome_1/dbs/spfileTEST.ora' from pfile='/home/oracle/testpfile';
- SQL> startup
- ORACLE instance started
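The extraction step with strings can be sketched as below. The function names and the filter pattern are my own: the pattern keeps only lines that look like spfile entries (`*.name=value` or `SID.name=value`), and the `tr` fallback for systems without strings is my substitution rather than part of the method above. Long parameter values may be split across lines in the output, so the resulting pfile should be checked by hand before use.

```shell
# List printable runs in file $1: use strings when it is installed,
# otherwise fall back to tr (the fallback is a substitution of mine).
printable_runs() {
    if command -v strings >/dev/null 2>&1; then
        strings "$1"
    else
        tr -c '[:print:]\n' '\n' < "$1"
    fi
}

# Keep only lines shaped like spfile entries, dropping binary debris.
spfile_to_pfile() {
    printable_runs "$1" | grep -E '^[*A-Za-z0-9_]+\.[A-Za-z0-9_]+='
}
```

For the file copied out of ASM above, `spfile_to_pfile /home/oracle/testspfile > /home/oracle/testpfile` would produce the pfile used in the create spfile command.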
Creating Redo Logs under ASM
In viewing the same disks within the Oracle instance, the underlying sector size has been passed right through to the database.
- SQL> select name, SECTOR_SIZE BLOCK_SIZE from v$asm_diskgroup;
- NAME BLOCK_SIZE
- ------------------------------ ----------
- REDO 4096
- DATA 4096
Now it is possible to create a redo log file with a command such as follows:
- SQL> alter database add logfile '+REDO' size 32g;
…and Oracle will create a redo log automatically with an optimal blocksize of 4KB.
- SQL> select v$log.group#, member, blocksize from v$log, v$logfile where v$log.group#=3 and v$logfile.group#=3;
Running an OLTP workload with Oracle Redo on Intel® SSD DC P3700 series
To put the Oracle redo on P3700 through its paces I used a HammerDB workload. The redo is set with a standard production-type configuration without the commit_write and commit_wait parameters. A test shows we are running almost 100,000 transactions per second while generating over 500MB of redo per second, and therefore we would be archiving almost 2TB per hour.
|Redo size (bytes):||
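The archiving estimate is straightforward arithmetic on the redo rate. The helper below (the name is mine) converts a sustained rate in MB per second to decimal TB per hour.

```shell
# Convert a sustained redo rate in MB/s ($1) to decimal TB per hour.
redo_tb_per_hour() {
    awk -v rate="$1" 'BEGIN { printf "%.2f\n", rate * 3600 / 1000000 }'
}
```

`redo_tb_per_hour 500` gives 1.80, which is the basis for the figure of almost 2TB per hour.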
Log file sync, even at this level of throughput, is just above 1ms:
|Event||Waits||Total Wait Time (sec)||Wait Avg (ms)||% DB time||Wait Class|
|log file sync||19,927,449||23.2K||1.16||38.7||Commit|
…and the average log file parallel write shows an average disk response time of just 0.13ms:
|Event||Waits||%Time-outs||Total Wait Time (s)||Avg wait (ms)||Waits/txn||% bg time|
|log file parallel write||3,359,023||0||442||0.13||0.12||2237277.09|
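The average waits quoted here follow directly from the total wait time and the number of waits. A short sketch, with a function name of my own:

```shell
# Average wait in milliseconds from total wait time in seconds ($1)
# and the number of waits ($2).
avg_wait_ms() {
    awk -v secs="$1" -v waits="$2" 'BEGIN { printf "%.2f\n", secs * 1000 / waits }'
}
```

`avg_wait_ms 442 3359023` gives 0.13 for log file parallel write, and `avg_wait_ms 23200 19927449` gives 1.16 for log file sync.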
There are six log writers on this system. As with previous blog posts on SSDs I observed the log activity to be heaviest on the first three and therefore traced the log file parallel write activity on the first one with the following method:
- SQL> oradebug setospid 67810;
- Oracle pid: 18, Unix process pid: 67810, image: oracle@haswex1 (LG00)
- SQL> oradebug event 10046 trace name context forever level 8;
- ORA-49100: Failed to process event statement [10046 trace name context forever level 8]
- SQL> oradebug event 10046 trace name context forever, level 8;
The trace file shows the following results for log file parallel write latency to the P3700.
|Log Writer Worker||Over 1ms||Over 10ms||Over 20ms||Max Elapsed|
A scatter plot of all of the log file parallel write latencies, recorded in microseconds on the y axis, clearly illustrates that any outliers are statistically insignificant and that none exceeds 15 milliseconds. Most of the writes are sub-millisecond on a system that is processing many millions of transactions a minute.
A subset of iostat data shows that the device is also far from full utilization.
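Counts like these can be pulled from the trace file with a short awk script. This sketch assumes the usual 10046 wait line layout, in which each wait appears as `WAIT #n: nam='log file parallel write' ela= <microseconds> ...`; the function name and the output format are my own.

```shell
# Summarize log file parallel write waits in a 10046 trace file ($1).
# ela is taken to be the elapsed time in microseconds.
lgwr_write_summary() {
    awk '
    /nam=.log file parallel write./ {
        if (match($0, /ela= *[0-9]+/)) {
            s = substr($0, RSTART, RLENGTH)
            sub(/ela= */, "", s)          # leave only the number
            us = s + 0
            n++
            if (us > 1000)  over1++       # over 1ms
            if (us > 10000) over10++      # over 10ms
            if (us > 20000) over20++      # over 20ms
            if (us > max)   max = us
        }
    }
    END {
        printf "writes=%d over_1ms=%d over_10ms=%d over_20ms=%d max_us=%d\n",
               n, over1, over10, over20, max
    }' "$1"
}
```

Run against the LG00 trace file, it prints one line with the number of writes, the counts over each threshold and the maximum elapsed time in microseconds.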
- avg-cpu: %user %nice %system %iowait %steal %idle
- 77.30 0.00 8.07 0.24 0.00 14.39
- Device: wMB/s avgrq-sz avgqu-sz await w_await svctm %util
- nvme0n1 589.59 24.32 1.33 0.03 0.03 0.01 27.47
As a confirmed believer in SSDs, I have long been convinced that most experiences of poor Oracle redo performance on SSDs have been due to an error in configuration, such as sector size, block size and/or alignment, as opposed to the performance of the underlying device itself. Following the configuration steps I have outlined here, the Intel SSD DC P3700 series shows itself to be an ideal candidate to take Oracle redo to the next level of performance without compromising endurance.