Should you put Oracle Database Redo on Solid State Disks (SSDs)?

Writing this blog post on St Patrick’s Day, I am reminded by the St Patrick’s Day Google Doodle that, amongst other things, it is a day that brings myths and legends to the fore – an ideal day, then, to look at Oracle redo on SSDs, surely one technology topic that attracts a great deal of conflicting opinion.  For example, the Oracle support document Troubleshooting: "log file sync" Waits (Doc ID 1376916.1) describes the following issue:

If the proportion of the 'log file sync' time spent on 'log file parallel write' times is high, then most of the wait time is due to IO … As a rule of thumb, an average time for 'log file parallel write' over 20 milliseconds suggests a problem with IO subsystem.

and includes the following as one of the recommendations:

  • Do not put redo logs on Solid State Disk (SSD)
  • Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'

This seems unequivocal.  Firstly, if your 'log file parallel write' time is on average (and not at peak) greater than 20 milliseconds you should tune your IO – this I can agree with.  However, “Do not put redo logs on Solid State Disk (SSD)” is worth investigating to see if it is really based on established fact.  After all, I have used SSDs exclusively for Oracle workloads for the last 5 years, starting with the Intel® SSD X25-E Series in 2009 and currently with the Intel® SSD 910 Series and DC S3700 Series.  I have also worked with other storage for Oracle, but always an SSD based solution.  Not only do SSDs have excellent latency characteristics, the endurance of modern Intel SSDs, measured in DWPD (Drive Writes Per Day), essentially eliminates concerns over wear, and modern SSDs are very, very good at data path protection – exactly what you need for high intensity redo.  Nevertheless, Oracle Support are specific on the issue: they have identified that SSD write performance may be good on average, but that it is “write peaks” that impact 'log file sync' times.  Using the rule of thumb of an average 'log file parallel write' time over 20 milliseconds, we would expect these “write peaks” to exceed 20 milliseconds to cause these concerns.  This is something we can test for, to see if there is truth behind the myths and legends that always surround disruptive technologies.

Log File Sync and Log File Parallel Write

Stepping back a bit, it is worth clarifying the relationship between ‘log file sync’ and ‘log file parallel write’.  ‘Log file sync’ is the time a foreground session waits, after issuing a COMMIT (or ROLLBACK), for the redo it has placed in the memory resident redo log buffer to be flushed to disk to make the transaction permanent.  ‘Log file parallel write’ is the time it takes for that redo to be written to disk by the background process.  Prior to 12c the background process doing the writing was the log writer (LGWR); at 12c the ‘log file parallel writes’ are performed by a number of log writer workers (LGnn), with LGWR elapsed time recorded under the ‘target log write size’ event.  Historically the log writer and foreground processes have communicated with a post/wait mechanism using semaphores (semctl); however, with the parameter _use_adaptive_log_file_sync set to true by default since 11g, polling may be used by the foreground processes instead, with the method selected dynamically.  What this means is that whereas ‘log file parallel write’ is the actual write to disk, ‘log file sync’ also includes the scheduling of, and communication between, the foreground and background processes; on a busy system most of the time spent in ‘log file sync’ by the foreground process may therefore not be spent waiting on the ‘log file parallel write’ by the background process.  However, if there are significant “write peaks” then there may be a number of foreground processes waiting in ‘log file sync’ for the same write to complete, exacerbating the elapsed time.  Consequently, what we want to do is capture the time spent on ‘log file parallel write’ by the background processes on SSD storage, to observe whether it really does highly increase waits on 'log file sync'.
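To see why a single slow write can inflate ‘log file sync’ far more than ‘log file parallel write’, a toy model helps (illustrative numbers only, not measurements from this system, and a deliberate simplification of Oracle’s group commit mechanics): one group redo write serves many committing sessions, so its elapsed time is charged once to the background event but once per piggybacked session to the foreground event.

```python
# Toy model (illustrative, not Oracle internals): one group redo write
# serves many committing sessions. The write's elapsed time counts once
# as 'log file parallel write', but every waiting session also accrues
# it (plus scheduling overhead) under 'log file sync'.

def wait_times(write_ms, sessions, scheduling_ms):
    """Return (total lfpw ms, total lfs ms) for one group write.

    write_ms      -- elapsed time of the single redo write
    sessions      -- committing sessions piggybacked on that write
    scheduling_ms -- assumed per-session post/wait or polling overhead
    """
    lfpw_total = write_ms                              # one background write
    lfs_total = sessions * (write_ms + scheduling_ms)  # every session waits
    return lfpw_total, lfs_total

# A sub-millisecond write with 50 waiters: scheduling dominates 'log file sync'.
fast = wait_times(write_ms=0.5, sessions=50, scheduling_ms=0.3)
# A 20 ms "write peak" with the same 50 waiters: over a second of foreground
# wait time accumulates from one slow write.
peak = wait_times(write_ms=20.0, sessions=50, scheduling_ms=0.3)
print(fast, peak)
```

The asymmetry is the point: a “write peak” is recorded once in the background statistics but multiplied by the number of concurrent committers in the foreground statistics, which is exactly the effect the Oracle support note warns about.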

Capturing Log File Parallel Write

To do this I used a 4 socket E7 v2 system to drive a significant level of throughput through Oracle 12c running on Oracle Linux 6.5.  It is worth noting that the E7 v2 includes the Intel® Integrated I/O  and Intel® Data Direct I/O Technology features and therefore the CPU latency aspect of I/O is further minimised.  For the SSD storage I used 2 x Intel® SSD 910 Series configured with Oracle ASM as per the post referenced here. Naturally I used HammerDB for the workload and configured the redo logs of a size to ensure log file switch checkpoint activity during the test.

I could list the Log Writer and Log Writer worker processes as follows:

oracle   127307      1  0 10:23 ?        00:00:00 ora_lgwr_IVYEXDB1
oracle   127312      1  0 10:23 ?        00:00:01 ora_lg00_IVYEXDB1
oracle   127316      1  0 10:23 ?        00:00:00 ora_lg01_IVYEXDB1
oracle   127320      1  0 10:23 ?        00:00:00 ora_lg02_IVYEXDB1
oracle   127322      1  0 10:23 ?        00:00:00 ora_lg03_IVYEXDB1

and begin a trace of all of the Log Writer Workers, shown here for process ora_lg03_IVYEXDB1 from the listing above:

[oracle@ivyex1 ~]$ sqlplus sys/oracle as sysdba
SQL> oradebug setospid 127322;
Oracle pid: 21, Unix process pid: 127322, image: (LG03)
SQL> oradebug event 10046 trace name context forever, level 8;
Statement processed.
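Repeating the setospid/event pair for each worker by hand is error prone, so it can help to script the PID lookup. A minimal sketch (assuming the `ps -ef` output format shown above; the instance name IVYEXDB1 is from this system, and the generated commands would still be pasted into a SYSDBA sqlplus session):

```python
import re

# Parse `ps -ef` style output and collect the PIDs of the log writer
# worker processes (ora_lg00 .. ora_lg03), then emit the oradebug
# commands needed to start a level 8 10046 trace on each one.
ps_output = """\
oracle   127307      1  0 10:23 ?        00:00:00 ora_lgwr_IVYEXDB1
oracle   127312      1  0 10:23 ?        00:00:01 ora_lg00_IVYEXDB1
oracle   127316      1  0 10:23 ?        00:00:00 ora_lg01_IVYEXDB1
oracle   127320      1  0 10:23 ?        00:00:00 ora_lg02_IVYEXDB1
oracle   127322      1  0 10:23 ?        00:00:00 ora_lg03_IVYEXDB1
"""

worker_pids = []
for line in ps_output.splitlines():
    # Second field is the PID; the command must be a lgNN worker,
    # which excludes the coordinating ora_lgwr process itself.
    m = re.match(r"\S+\s+(\d+)\s.*\bora_lg\d\d_\w+$", line)
    if m:
        worker_pids.append(int(m.group(1)))

for pid in worker_pids:
    print(f"oradebug setospid {pid};")
    print("oradebug event 10046 trace name context forever, level 8;")
```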

I ran a test for 10 minutes with a number of virtual users, running the system at high CPU utilisation and generating a significant number of transactions.  When complete, I stopped the trace as follows, before collecting the trace files from the trace directory:

SQL> oradebug event 10046 trace name context off;
Statement processed.

Looking in the trace file, we can see the timing of the 'log file parallel write' event that we are interested in:

WAIT #0: nam='log file parallel write' ela= 501 files=1 blocks=58
WAIT #0: nam='LGWR worker group idle' ela= 39
WAIT #0: nam='log file parallel write' ela= 466 files=1 blocks=52
WAIT #0: nam='LGWR worker group idle' ela= 33
WAIT #0: nam='log file parallel write' ela= 368 files=1 blocks=54

Of course, one of the advantages of HammerDB is that with a scripted interface you are not restricted to simply running the pre-built workloads – you can run any workload you choose.  Therefore, in the Script Editor window of HammerDB I entered the following script to process the trace files, and ran it to extract the elapsed times into CSV format.

set filename "lg00.trc"
set filename2 "output00.csv"
set fid [open $filename r]
set fid2 [open $filename2 w]
set elapsed 0
set maxelapsed 0
set count 0
set overmilli 0
set over10milli 0
set over20milli 0
while {[gets $fid line] != -1} {
    if {[string match {*log file parallel write*} $line]} {
        incr count
        regexp {ela= ([0-9]+) } $line all elapsed
        puts $fid2 "$count,$elapsed"
        if { $elapsed > 1000 } {
            incr overmilli
            if { $elapsed > $maxelapsed } { set maxelapsed $elapsed }
            if { $elapsed > 10000 } {
                incr over10milli
                if { $elapsed > 20000 } {
                    incr over20milli
                }
            }
        }
    }
}
puts "max elapsed was [ format "%.2f" [ expr {double($maxelapsed) / 1000} ]] milliseconds"
puts "[ format "%.2f" [ expr {double($overmilli) / $count * 100} ]]% over 1 millisecond"
puts "[ format "%.2f" [ expr {double($over10milli) / $count * 100} ]]% over 10 milliseconds"
puts "[ format "%.2f" [ expr {double($over20milli) / $count * 100} ]]% over 20 milliseconds"
close $fid
close $fid2

The output from the script is summarised as follows (note that nearly all of the workload went through workers LG00 and LG01, while LG02 and LG03 were mostly idle, which is why the percentages for LG02 and LG03 are higher despite far fewer writes):

Log Writer Worker   Over 1ms   Over 10ms   Over 20ms   Max Elapsed
LG00                0.56%      0.02%       0.00%       13.66ms
LG01                0.39%      0.01%       0.00%       13.44ms
LG02                7.96%      0.07%       0.00%       13.05ms
LG03                7.07%      0.08%       0.00%       13.00ms

I verified that the output from the trace files corresponded with v$event_histogram: over 99% of the redo writes completed in less than 1 millisecond, and the maximum elapsed write time was consistently around 13 milliseconds but represented an extremely small proportion of all the writes.

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write';

WAIT_TIME_MILLI WAIT_COUNT
--------------- ----------
              1    2460371
              2       6605
              4       1702
              8       1774
             16        726
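The histogram buckets can be turned into percentages directly. A quick sketch (bucket counts copied from the query output above; in v$event_histogram a bucket labelled N holds waits that took between N/2 and N milliseconds, so the 1 ms bucket is the sub-millisecond writes):

```python
# wait_time_milli -> wait_count for 'log file parallel write',
# copied from the v$event_histogram output above.
histogram = {1: 2460371, 2: 6605, 4: 1702, 8: 1774, 16: 726}

total = sum(histogram.values())
sub_ms_pct = histogram[1] / total * 100
over_8ms_pct = histogram[16] / total * 100  # waits between 8 and 16 ms

print(f"{sub_ms_pct:.2f}% of redo writes completed in under 1 ms")
print(f"{over_8ms_pct:.3f}% took between 8 and 16 ms")
```

This confirms the trace file summary: roughly 99.56% of writes completed in under a millisecond, and fewer than 0.03% fell into the slowest (8 to 16 ms) bucket.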

Of course, how busy was the system?  The AWR report shows a load average beginning around 75:

Host CPU

Load Average Begin   Load Average End   %User   %System   %WIO   %Idle
75.13                114.68             73.1    5.1       0.5    21.2

And a fairly busy redo rate of almost 340MB/sec.  Given the redo per transaction and the redo per second, it is clear that this system is processing tens of thousands of transactions a second and millions of transactions a minute.

Load Profile         Per Second   Per Transaction
DB Time(s):          160.0        0.0
DB CPU(s):           93.2         0.0
Redo size (bytes):

And the total time spent waiting for ‘log file sync’:

Foreground Wait Events
Event           %Time -outs   Total Wait Time (s)   Avg wait (ms)   Waits /txn   % DB time
log file sync                                                       0.70         39.68

was considerably greater than the ‘log file parallel write’ component, and therefore consistent with high performance IO, with most ‘log file sync’ time spent in communication between foreground and background processes.

Background Wait Events
Event                     %Time -outs   Total Wait Time (s)   Avg wait (ms)   Waits /txn   % bg time
log file parallel write                                                       0.05         125.03
target log write size                                                         0.03         94.05

I then loaded the CSV files into a spreadsheet, highlighted the data and selected scatter plot.  Although they don’t overlay perfectly on a time basis, the data is close enough to combine LG00 with LG01 and LG02 with LG03 respectively, noting that LG00 and LG01 were considerably busier workers than LG02 and LG03.  The data for LG00 and LG01 is here:

[chart: write latency over time for LG00 and LG01]

and the number of blocks written for LG00 only, to set the write latency against the amount of redo written, is as follows:

[chart: blocks written per redo write over time for LG00]

And to complete the picture, the latency data for LG02 and LG03 is here:

[chart: write latency over time for LG02 and LG03]
Clearly the latency is, as we would expect, strongly related to the amount of redo written, and typically, and on average, the response time for redo writes is sub-millisecond, even though we are running at a throughput of millions of transactions a minute and generating over 20GB of redo per minute.  Where there are longer latencies these are typically in the order of 12 to 13ms, for 0.01 to 0.02% of writes – certainly not a factor that will highly increase waits.  Furthermore, cross-referencing with iostat data, even at this rate of redo the average disk utilisation was still only around 10%.

Running a 3 Hour Workload

Of course, maybe a 10 minute test is not sufficient, so I then ran the same configuration for 3 hours, generating, as shown below, 342MB/sec of redo and 4.7 Terabytes in total.

Function Name   Writes: Data   Reqs per sec   Data per sec   Waits: Count   Avg Tm(ms)
LGWR            4.7T           11574.37       342.071M
DBWR            1.5T           7746.17        110.681M

And the waits on ‘log file parallel write’?  Proportionally the same as before.

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write';

WAIT_TIME_MILLI WAIT_COUNT
--------------- ----------
              1   51521618
              2     165210
              4      36313
              8      36242
             16      18478


There is one further test we can do, by setting the following parameters:

commit_logging BATCH
commit_wait NOWAIT

redo will continue to be generated; however, with commit_logging set to ‘batch’ the foreground process will not notify the log writer to write its redo, and with commit_wait set to ‘nowait’ the foreground process will also not wait for the log writer to notify it that the redo has been written.  In other words, these settings remove the scheduling aspect of ‘log file sync’, and therefore, if the disk is not a bottleneck, we should see throughput increase, while noting that with batching the writes will be larger.  Sure enough, this gives a significant increase in throughput, with Oracle now writing over 400MB/sec of redo and CPU utilisation at 97%.

Per Second   Per Transaction
Redo size (bytes):
In other words removing the waits for scheduling meant that the SSDs could easily cope with higher levels of throughput.


In one of my previous posts on Redo on SSD I mentioned a new performance view in 12c called V$LGWRIO_OUTLIER.  This view reports log writer writes that take over 500ms.  I touched on this in passing but didn’t investigate further, so decided to take another look, and sure enough the new view had entries.  This is not the only place that long writes to the log file are reported: a trace file is also generated when a write to the log file takes more than 500ms, for example:

Warning: log write elapsed time 780ms, size 4KB

So it would be reasonable to expect these to correlate, and for a trace file to be generated for each entry in V$LGWRIO_OUTLIER.  It does not help that the documentation is not explicit about the units of the time value this view reports; however, from related output it seems reasonable to assume that IO_LATENCY is reported in milliseconds, so we should expect entries above the value of 500.  Additionally, the view does not report a timestamp for each entry, but the underlying x$ view does.

SQL> select view_definition from v$fixed_view_Definition where view_name ='GV$LGWRIO_OUTLIER';

So querying these ordered by timestamp shows the following result:

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

   IO_SIZE IO_LATENCY  TIMESTAMP
---------- ---------- ----------
        32 2694340698 1394629089
        72 2695317938 1394629186
        28 2696135038 1394629268
        40 2696636948 1394629318
        64 2697984098 1394629454
        52 2698638788 1394629519
        68 2699724048 1394629628
        56 2699982768 1394629653
        24 2700330618 1394629688
        32 2752639988 1394634918
        72 2752946678 1394634949
        36 2753861848 1394635041
       108 2754161328 1394635071
        64 2754341738 1394635089

If the value is in milliseconds, then 2754341738 milliseconds is approximately equivalent to 1 month, and the latency increases with every entry.  Cross-referencing this against the trace data, the AWR report, the event histogram and the log write trace files, the only correlation appears to be with the timestamp.  Just to be certain, after running the 3 hour test the outlier data showed 396 entries (note that V$LGWRIO_OUTLIER itself only shows the last 180), all in ascending order.
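One way to make the timestamp correlation concrete is to compare consecutive deltas. A quick sketch (value pairs copied from the first query output above): the ratio of the change in IO_LATENCY to the change in TIMESTAMP is close to 10,000 for every pair of rows, which would be consistent with IO_LATENCY counting elapsed clock time since some epoch in roughly 100-microsecond units rather than recording a per-write latency – an inference from this data, not documented behaviour.

```python
# (IO_LATENCY, TIMESTAMP) pairs copied from the X$KSFDSTLL output above.
rows = [
    (2694340698, 1394629089),
    (2695317938, 1394629186),
    (2696135038, 1394629268),
    (2696636948, 1394629318),
    (2697984098, 1394629454),
    (2698638788, 1394629519),
    (2699724048, 1394629628),
    (2699982768, 1394629653),
    (2700330618, 1394629688),
    (2752639988, 1394634918),
]

# If IO_LATENCY really were a latency, its deltas would have no reason
# to track the TIMESTAMP deltas; instead the ratio is ~10,000 throughout.
ratios = []
for (lat1, ts1), (lat2, ts2) in zip(rows, rows[1:]):
    ratios.append((lat2 - lat1) / (ts2 - ts1))

print([round(r) for r in ratios])
```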

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

   IO_SIZE IO_LATENCY  TIMESTAMP
---------- ---------- ----------
        40 3471738158 1394706828
        44 3472415808 1394706896
        52 3472562658 1394706911
        16 3475418188 1394707196
        16 3475860828 1394707240
        40 3477258658 1394707381
       128 3626810268 1394722336
       256 3627883738 1394722444
        20 3627919588 1394722447
       128 3628119698 1394722467
       320 3628620898 1394722517
       112 3629137298 1394722569

396 rows selected.

This is sufficient evidence for me to suggest that, for the time being, V$LGWRIO_OUTLIER on Linux should not be relied upon for measuring IO latency, or at the very least that there is insufficient documentation to accurately interpret what it is meant to show.


So, in summary, should you put Oracle redo on SSD?  If you want sub-millisecond response times coupled with high levels of data path protection, then surely the answer is yes.  More to the point, a modern SSD can handle redo with such high throughput and low latency that the ‘log file parallel write’ component of ‘log file sync’ – and therefore SSD write performance – is not the determining factor, even on systems generating redo at very high rates.  If it is, then I would recommend reviewing how to correctly configure Oracle redo on SSD before taking your own measurements of Oracle redo performance on SSD.