Image from Bertrand Drouvot
UPDATE: thanks to a discussion in the comments it’s come to my attention that there should be some more clarification on values used and what the mean at the beginning of this post.
Ever wonder where CPU wait comes from in EM performance screens as seen above? well the following discussion will give you a SQL query to calculate CPU wait. In the above image we see both CPU and CPU Wait as two colors (light green and dark green) in EM. Below we will get these two values from the output of a SQL script (CPU_ORA and CPU_ORA_WAIT).
Oracle CPU statistics are measured from calls to the OS to see how much CPU is burned over a given elapsed time (i.e. certain quantity of code). The CPU values are cycles used and not time, thus it does not include time on the run queue waiting to get the CPU.
Oracle ASH, on the other hand, lists all Oracle sessions that want to run from Oracle’s perspective, i.e. they aren’t idle and they aren’t waiting for a non-idle wait event like I/O. Thus ASH includes time spent both running on CPU burning cycles and time spent waiting to get on the CPU.
Thus we can take the the amount of time “On CPU” from ASH and subtract the amount of CPU in Oracle statistics for CPU usage then the remainder is roughly time spent by Oracle sessions waiting to get onto the CPU.
The two challenges to getting the value of “Wait for CPU” are getting CPU cycles burned and ASH time “ON CPU” into the same units and making sure that we are measuring both over the same interval.
Oracle already reports CPU in % used and in centi-seconds used. I transform these units into the unit Average Active Sessions on CPU so I can compare CPU time in ASH to CPU cycles burned from the CPU statistics.
There are 3 kinds of CPU in the Oracle stats.
Oracle CPU used
System CPU used
Oracle demand for CPU
Starting in 10g Oracle records both the CPU used by the instance as well as the load on the system in v$sysmetric. This is awesome as we can see how busy the system is and how much of the CPU Oracle is responsible for:
col metric_name for a25
col metric_unit for a25
select metric_name, value, metric_unit from v$sysmetric where metric_name like'%CPU%' and group_id=2;
METRIC_NAME VALUE METRIC_UNIT
------------------------------ ---------- ------------------------------
CPU Usage Per Sec 251.067016 CentiSeconds Per Second
CPU Usage Per Txn 5025.52477 CentiSeconds Per Txn
Host CPU Utilization (%) 11.6985845 % Busy/(Idle+Busy)
Database CPU Time Ratio 76.3291033 % Cpu/DB_Time
Now the question is how do we convert these to something useful? For me I put it into the equivalent of AAS and compare it to the core count:
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) sAAS
from
( select value busy from v$sysmetric
where metric_name='Host CPU Utilization (%)'
and group_id=2 ) prcnt,
( select value cpu_count
from v$parameter
where name='cpu_count' ) parameter;
CLASS AAS
---------------- ----------
CPU_ORA_CONSUMED .002
CPU_OS .022
An AAS of 1 is equivalent to 100% of a core, so, OS CPU is about 2.0% of a core and of that Oracle used 0.2% of a core. Not a very active system, and we can look at an active system later, but what I wanted to point out is that this query is missing an important statistic: the demand for CPU by Oracle. We can only add that by joining in ASH:
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) AAS
from
( select value busy from v$sysmetric
where metric_name='Host CPU Utilization (%)'
and group_id=2 ) prcnt,
( select value cpu_count from v$parameter
where name='cpu_count' ) parameter
union
select
'CPU_ORA_DEMAND' CLASS,
nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS
from v$active_session_history ash
where SAMPLE_TIME > sysdate - (60/(24*60*60));
CLASS AAS
---------------- ----------
CPU_ORA_CONSUMED .001
CPU_ORA_DEMAND .02
CPU_OS .019
So the demand for CPU was higher than the amount consumed. Now the demand for CPU is coming from ASH which is sampled so the accuracy is weak, but in larger sample sets or busier systems it’s pretty darn good. A higher demand for CPU demanded than CPU used alerts us to CPU starvation on a busy system.
I like to wrap all this up into a query with all the wait classes to see the overall load on Oracle including CPU consumed by Oracle, CPU demanded by Oracle and CPU used at the OS level:
select
decode(n.wait_class,'User I/O','User I/O',
'Commit','Commit',
'Wait') CLASS,
sum(round(m.time_waited/m.INTSIZE_CSEC,3)) AAS
from v$waitclassmetric m,
v$system_wait_class n
where m.wait_class_id=n.wait_class_id
and n.wait_class != 'Idle'
group by decode(n.wait_class,'User I/O','User I/O', 'Commit','Commit', 'Wait')
union
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) AAS
from
( select value busy from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
( select value cpu_count from v$parameter where name='cpu_count' ) parameter
union
select
'CPU_ORA_DEMAND' CLASS,
nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS
from v$active_session_history ash
where SAMPLE_TIME > sysdate - (60/(24*60*60));
CLASS AAS
---------------- ----------
CPU_ORA_CONSUMED .002
CPU_ORA_DEMAND .03
CPU_OS .023
Commit 0
User I/O 0
Wait 0
Ideally I’d want the CPU stats to be subsets of each other so that I could have a graphically stack-able set of statistics
now rolling it all together
with AASSTAT as (
select
decode(n.wait_class,'User I/O','User I/O',
'Commit','Commit',
'Wait') CLASS,
sum(round(m.time_waited/m.INTSIZE_CSEC,3)) AAS
from v$waitclassmetric m,
v$system_wait_class n
where m.wait_class_id=n.wait_class_id
and n.wait_class != 'Idle'
group by decode(n.wait_class,'User I/O','User I/O', 'Commit','Commit', 'Wait')
union
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) AAS
from
( select value busy from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
( select value cpu_count from v$parameter where name='cpu_count' ) parameter
union
select
'CPU_ORA_DEMAND' CLASS,
nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS
from v$active_session_history ash
where SAMPLE_TIME > sysdate - (60/(24*60*60))
)
select
( decode(sign(CPU_OS-CPU_ORA_CONSUMED), -1, 0, (CPU_OS - CPU_ORA_CONSUMED )) +
CPU_ORA_CONSUMED +
decode(sign(CPU_ORA_DEMAND-CPU_ORA_CONSUMED), -1, 0, (CPU_ORA_DEMAND - CPU_ORA_CONSUMED ))) CPU_TOTAL,
decode(sign(CPU_OS-CPU_ORA_CONSUMED), -1, 0, (CPU_OS - CPU_ORA_CONSUMED )) CPU_OS,
CPU_ORA_CONSUMED CPU_ORA,
decode(sign(CPU_ORA_DEMAND-CPU_ORA_CONSUMED), -1, 0, (CPU_ORA_DEMAND - CPU_ORA_CONSUMED )) CPU_ORA_WAIT,
COMMIT,
READIO,
WAIT
from (
select
sum(decode(CLASS,'CPU_ORA_CONSUMED',AAS,0)) CPU_ORA_CONSUMED,
sum(decode(CLASS,'CPU_ORA_DEMAND' ,AAS,0)) CPU_ORA_DEMAND,
sum(decode(CLASS,'CPU_OS' ,AAS,0)) CPU_OS,
sum(decode(CLASS,'Commit' ,AAS,0)) COMMIT,
sum(decode(CLASS,'User I/O' ,AAS,0)) READIO,
sum(decode(CLASS,'Wait' ,AAS,0)) WAIT
from AASSTAT)
/
CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
---------- ---------- ------------ ---------- ---------- ----------
.02 .002 0 0 0 0
Now let’s run up some load on a machine and database. Take two databases, run up the CPU demand on both and add some wait contention. The machine has 24 cores so there is a definitely a problem when the CPU_TOTAL goes over 24. I’m running 14 sessions each trying to burn a core on two different databases. The first few lines the test is ramping up
SQL> /
CPU_TOTAL CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
---------- ---------- ---------- ------------ ---------- ---------- ----------
14.887 .387 13.753 .747 0 0 .023
SQL> /
CPU_TOTAL CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
---------- ---------- ---------- ------------ ---------- ---------- ----------
21.989 7.469 12.909 1.611 0 0 .044
SQL> /
CPU_TOTAL CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
---------- ---------- ---------- ------------ ---------- ---------- ----------
26.595 12.125 11.841 2.629 0 0 .025
SQL> /
CPU_TOTAL CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
---------- ---------- ---------- ------------ ---------- ---------- ----------
27.045 12.125 11.841 3.079 0 0 .025
Historically CPU used by Oracle was derived from
v$sysstat.name=’CPU used by this session’
but this statistic had problems as the value was only updated every time a call ended. A call could be a 1 hour PL/SQL procedure which would thus report zero cpu usage in the stats until it finished and the CPU would spike off the scale.
ASH had always been the most stable way to gather CPU demand, though Oracle has made improvements in gathering CPU statistics. I believe that the time model gathers CPU every 5 seconds in 10g, and in 11g it’s possible that CPU stats are gathered every second
Here is a visual example of a machine that has server memory contention, massive amounts of paging. There is OS CPU being used, but hardly any CPU being used by Oracle which makes sense as it’s an idle database, but what is revealing is the massive amount of CPU wait by Oracle. Oracle only has a little bit of work to do to take care of an idle database but we can see that most of Oracle’s CPU time is wait for CPU time as when it wants to work, pages have to be read back in,
I have my doubts as to the clarity of the layout of the above graph. A possibly clearer graph would be simply adding a line representing available CPU and take out the OSCPU bars. In the above graph I’ve charted OSCPU usage as AAS, ie average active sessions, mixing AAS of the database with AAS at the OS level. I think a possible clear representation would be to show the Core count line, and draw the OSCPU usage shown upside down from the # of core lines, thus the space from the bottom axis to where the OSCPU reaches down would be available CPU.
UPDATE
Thanks to the eagle eyes of John Beresniewicz a small error was identified in the above script. The last script didn’t correlate the time windows of v$sysmetric with v$active_session history. They both reported the last minute of statistics but the last minute reported in v$sysmetric could be up to a minute behind those in v$active_session_history, so here is a version that tries to correlate to two time windows so they are in sync
with AASSTAT as (
select
decode(n.wait_class,'User I/O','User I/O',
'Commit','Commit',
'Wait') CLASS,
sum(round(m.time_waited/m.INTSIZE_CSEC,3)) AAS,
BEGIN_TIME ,
END_TIME
from v$waitclassmetric m,
v$system_wait_class n
where m.wait_class_id=n.wait_class_id
and n.wait_class != 'Idle'
group by decode(n.wait_class,'User I/O','User I/O', 'Commit','Commit', 'Wait'), BEGIN_TIME, END_TIME
union
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS,
BEGIN_TIME ,
END_TIME
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) AAS,
BEGIN_TIME ,
END_TIME
from
( select value busy, BEGIN_TIME,END_TIME from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
( select value cpu_count from v$parameter where name='cpu_count' ) parameter
union
select
'CPU_ORA_DEMAND' CLASS,
nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS,
cast(min(SAMPLE_TIME) as date) BEGIN_TIME ,
cast(max(SAMPLE_TIME) as date) END_TIME
from v$active_session_history ash
where SAMPLE_TIME >= (select BEGIN_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
and SAMPLE_TIME < (select END_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
)
select
to_char(BEGIN_TIME,'HH:MI:SS') BEGIN_TIME,
to_char(END_TIME,'HH:MI:SS') END_TIME,
CPU_OS CPU_TOTAL,
decode(sign(CPU_OS-CPU_ORA_CONSUMED), -1, 0, (CPU_OS - CPU_ORA_CONSUMED )) CPU_OS,
CPU_ORA_CONSUMED CPU_ORA,
decode(sign(CPU_ORA_DEMAND-CPU_ORA_CONSUMED), -1, 0, (CPU_ORA_DEMAND - CPU_ORA_CONSUMED )) CPU_ORA_WAIT,
COMMIT,
READIO,
WAIT
-- ,( decode(sign(CPU_OS - CPU_ORA_CONSUMED), -1, 0,
-- (CPU_OS - CPU_ORA_CONSUMED))
-- + CPU_ORA_CONSUMED +
-- decode(sign(CPU_ORA_DEMAND - CPU_ORA_CONSUMED), -1, 0,
-- (CPU_ORA_DEMAND - CPU_ORA_CONSUMED ))) STACKED_CPU_TOTAL
from (
select
min(BEGIN_TIME) BEGIN_TIME,
max(END_TIME) END_TIME,
sum(decode(CLASS,'CPU_ORA_CONSUMED',AAS,0)) CPU_ORA_CONSUMED,
sum(decode(CLASS,'CPU_ORA_DEMAND' ,AAS,0)) CPU_ORA_DEMAND,
sum(decode(CLASS,'CPU_OS' ,AAS,0)) CPU_OS,
sum(decode(CLASS,'Commit' ,AAS,0)) COMMIT,
sum(decode(CLASS,'User I/O' ,AAS,0)) READIO,
sum(decode(CLASS,'Wait' ,AAS,0)) WAIT
from AASSTAT)
/
The output now looks like
BEGIN_TI END_TIME CPU_TOTAL CPU_OS CPU_ORA CPU_ORA_WAIT COMMIT READIO WAIT
-------- -------- ---------- ---------- ---------- ------------ ---------- ---------- ----------
07:23:35 07:24:35 .044 .024 .002 .018 0 0 .001
I’m still open that there might be some more tweaking to do, so your milage may vary. Test, Test, Test and only trust yourself! As a reminder the code above is on Github as part of the ASH Masters project. Feel free to get a Github account, fork the code and make changes. If you find cool new code or errors in old code, let me know and we will merge it into the ASH Masters project
Comments