Database Performance: How to reduce “library cache: mutex X” waits for scalability in the Oracle database

As the demand for database workload throughput increases along with server CPU core counts there can be an increasing challenge of maintaining scalability even with the most optimized workloads.  In fact in some scenarios following best practices can result in contention as more sessions running on more CPUs contend for access to the same SQL or data. In this blog post I will look at an extremely high throughput transactional Oracle database workload and the options available for improving concurrency and throughput without modifying the application.

The workload I will use is the HammerDB Oracle OLTP workload on a 4-socket Intel Xeon server - importantly the server CPU utilisation is over 95% and the throughput is already very high  so we need to be aware that contention may not "true" ie if the CPU is already fully utilised then contention may be as a result of a process waiting for CPU time as much as genuine contention on a particular object.  Nevertheless we are in the business of maximising as much as possible from our IT investments and it is interesting to see that even with full CPU utilisation there are options to reduce contention and improve performance still further.  This HammerDB workload is a good example of an optimized workload as it shares cursors, reduces parsing and makes use of bind variables. How it does this is when a workload is run, for each stored procedure a cursor is opened and as shown here in Oratcl the statement that calls the stored procedure is parsed once only:

set curn_sl [oraopen $lda ]

set sql_sl "BEGIN slev(:st_w_id,:st_d_id,:threshold); END;"

oraparse $curn_sl $sql_sl

return $curn_sl

When the stored procedure is called bind variables are used

orabind $curn_sl :st_w_id $w_id :st_d_id $stock_level_d_id :THRESHOLD $threshold

oraexec $curn_sl

using bind variables means that the same cursor can be shared between sessions, reducing the impact on the amount of memory needed for the shared pool and keeping hard parsing activity to a minimum: Running the workload and measuring the output with an AWR report shows the following top 10 foreground events:

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           34.2K           84.5

library cache: mutex X            1,443,752       2284       2    5.6 Concurrenc

cursor: pin S                     1,060,588     2004.2       2    5.0 Concurrenc

latch: In memory undo latch       1,036,240        401       0    1.0 Concurrenc

latch: enqueue hash chains          103,783      331.7       3     .8 Other

enq: TX - row lock contention       171,164      266.9       2     .7 Applicatio

db file sequential read             249,612      151.8       1     .4 User I/O

log file sync                        35,981      117.3       3     .3 Commit

db file scattered read              155,862      105.7       1     .3 User I/O

buffer busy waits                   270,967         90       0     .2 Concurrenc

Performance is good and CPU utilisation is maximised, however the top wait events of library cache: mutex X and cursor: pin S are related to the good practices of cursor sharing. Each of these events covers multiple mutexes and drilling down into the AWR report shows more detailed information on each.

                                                                         Wait

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Cursor Pin            kksfbc [KKSCHLFSP2]                   634,827      933,897

Library Cache         kglpin1   4                           610,798      747,221

Library Cache         kglpndl1  95                          599,973      753,431

Library Cache         kglpnal1  90                          503,140      155,556

Cursor Pin            kksLockDelete [KKSCHLPIN6]            454,516      701,271

Library Cache         kglhbh1   63                                2            1

Here the top mutexes are related to child cursor lookup activity (kksfbc) and library cache pinning and unpinning activity (kglpin/kglpndl) - for further information on this there is an excellent presentation on Library Cache internals here.  Mutexes are extremely lightweight and use atomic instructions implemented at the CPU level with the LOCK CMPXCHG instruction, however a session sleeping waiting to acquire a mutex will do so utilising CPU time and therefore in this case the optimization has meant the high CPU and session count running the same stored procedures shows evidence of contention.

The tried and tested workaround is to modify the application to to artificially duplicate the SQL by changing the text. One such way to do this is by introducing 'dummy hints' that are not stripped out by the parser but are sufficient for Oracle to recognise that the statements are different and thereby reduces contention by having different sessions use different 'versions' of the same statements. This is not ideal however by modifying the application for a particular hardware platform and in some cases it is simply not an option to modify the application, instead, dbms_shared_pool.markhot can be used.  Instead of duplicating the SQL ourselves dbms_shared_pool.markhot serves this purpose introducing multiple versions of the same packages, procedures and cursors into the library cache for different sessions to use - to do this all we have to do is mark the objects in question hot.  We already know which packages and procedures are being used for this workload so can mark them hot at a SQL prompt as the SYS user as follows:

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_RANDOM',NAMESPACE=>2);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'SYS',objname=>'DBMS_OUTPUT',NAMESPACE=>2);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'NEWORD',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'PAYMENT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'DELIVERY',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'OSTAT',NAMESPACE=>1);

exec dbms_shared_pool.markhot(schema=>'TPCC',objname=>'SLEV',NAMESPACE=>1);

Re-running the workload shows a significant impact on the library cache: mutex X event and despite the high CPU utilisation already throughput has also improved.

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           34.5K           88.6

cursor: pin S                     1,071,740     2093.3       2    5.4 Concurrenc

enq: TX - row lock contention       373,472      548.9       1    1.4 Applicatio

latch: In memory undo latch         956,992      415.6       0    1.1 Concurrenc

latch: enqueue hash chains          103,312      355.7       3     .9 Other

db file sequential read             211,498      139.8       1     .4 User I/O

db file scattered read              168,610      123.5       1     .3 User I/O

log file sync                        36,418      118.7       3     .3 Commit

buffer busy waits                   257,958       95.9       0     .2 Concurrenc

library cache: mutex X               23,288       24.3       1     .1 Concurrenc

The detailed mutex information also shows the reduction in library cache mutex activity.

                                                                         Wait

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Cursor Pin            kksfbc [KKSCHLFSP2]                   644,771      973,310

Cursor Pin            kksLockDelete [KKSCHLPIN6]            455,400      725,202

Library Cache         kglpin1   4                            10,879       11,115

Library Cache         kglpndl1  95                            8,330       10,836

Library Cache         kglpnal1  90                            6,959        1,857

Library Cache         kglhbh1   63                                1            0

Using dbms_shared_pool.markhot can be also be used for cursors and the most active ones can be seen within the view X$KGLOB -  there is a SQL statement from Andrey Nikolaev that shows the hottest objects in the library cache  here so we can modify that to find the top 4 cursors for the TPCC user and mark them hot as follows:


SET ECHO OFF

SET SERVEROUTPUT ON

DECLARE

BEGIN

FOR hash_id in (

select KGLNAHSV from (

select

case when (kglhdadr =  kglhdpar) then 'Parent' else 'Child' ||kglobt09 end,

kglhdadr ADDRESS,substr(kglnaobj,1,20),

parsing_schema_name, kglnahsh, KGLNAHSV, kglobtyd, kglobt23, kglobt24, kglhdexc, kglhdnsp

from x$kglob, v$sql

where x$kglob.kglnahsh = v$sql.hash_value

and v$sql.parsing_schema_name='TPCC'

and kglhdexc > 1

and kglobt24 > 1

order by kglobt24 desc) where rownum < 5

    ) LOOP

DBMS_OUTPUT.PUT_LINE('hash=>'|| hash_id.KGLNAHSV ||',namespace=>0');

SYS.DBMS_SHARED_POOL.MARKHOT(hash=>hash_id.KGLNAHSV,namespace=>0);

END LOOP;

END;

/

An example output is as follows - it is important to note that the hash value required is the full 16 byte hash value of KGLNAHSV in X$KGLOB although on some versions of Oracle no error is given if using the incorrect hash value however it has no effect so you should ensure that you are using the correct hash value as shown in the example: 

SQL> @markhot4sql

hash=>b45763c79b8f4f0e8439f6a3190f1b8f,namespace=>0

hash=>554b1093c9cd87db8f6f55016b20198a,namespace=>0

hash=>15e9d1f93391ce5edb7be748a2ab9f80,namespace=>0

hash=>6508e1443bf2a2948165d5ad91803582,namespace=>0

In this case cursor: pin S events have been reduced, however at the expense of increasing library cache mutex events and lowering throughput.

Top 10 Foreground Events by Total Wait Time

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                            Total Wait    Wait   % DB

Event                                 Waits Time (sec) Avg(ms)   time Wait Class

------------------------------ ------------ ---------- ------- ------ ----------

DB CPU                                           33.4K           79.7

library cache: mutex X              849,072     5296.8       6   12.6 Concurrenc

enq: TX - row lock contention       277,268      745.3       3    1.8 Applicatio

cursor: pin S                       222,045      455.4       2    1.1 Concurrenc

...

The detail shows however that it is a different area of library cache mutex x contention and from referring to the presentation noted previously kglhd is related to the handle structure and therefore creation of a new object.

Mutex Type            Location                               Sleeps    Time (ms)

--------------------- -------------------------------- ------------ ------------

Library Cache         kglhdgn1  62                        1,032,273    2,763,927

Cursor Pin            kksfbc [KKSCHLFSP2]                   202,786      338,137

Library Cache         kglhdgn2 106                          186,681      496,834

Library Cache         kglpin1   4                           127,693      190,311

Library Cache         kglpndl1  95                          126,592      200,987

Library Cache         kglpnal1  90                           83,421       19,982

Cursor Pin            kksLockDelete [KKSCHLPIN6]             56,208      100,403,


As a consequence with the system already at full CPU utilisation further optimization has proved counter

productive. In fact testing showed that even when marking packages and procedures hot only and limiting the number of copies created with the following parameter resulted in the optimal overall system performance.


SQL> show parameter kgl

NAME                     TYPE     VALUE

------------------------------------ ----------- ------------------------------

_kgl_hot_object_copies             integer     4


In summary what we have seen is that if experiencing contention on library cache: mutex X or cursor: pin S events then dbms_shared_pool.markhot can be a highly effective tool.  Whether looking to achieve a few extra percent on an extreme high performance workload or to achieve high scalability on systems with very high core counts Oracle on Intel makes this possible all without without needing to modify the application.