A customer was working with a clone copy of their production database for reports, but the cloned database reports were running slower than the same reports on production and they wanted to know why.
Ostensibly the machines were similar and the memory was the same, though it’s always worth verifying. The first thing I checked was the load profile. If you have OEM you can look at the performance page and/or top activity page to see the load profile. The load profile of production was much more CPU bound and much less I/O bound than the clone and their was clearly a difference.
The customer knew the time they the report was run on the clone but was unsure exactly when it was run on production, though knew it had been run in the past day.
On the clone box, during the report run the load profile using SQL script to reproduce the performance graph looked like :
07-NOV CLONE
TM AAS GRAPH
----- ---- ----------------------------------------------------
11:00 .2 + 6
12:00 .2 - 6
13:00 .1 6
14:00 .1 6
15:00 .1 6
16:00 14.9 ++++++ooooooooo6ooooooooooooooooooooooooooooooooooo-
17:00 .0 6
The reports were run between 4pm and 5pm (ie 16:00-17:00) and that’s easy to see from the load chart. The “+” represent CPU, “o” represent I/O, and “-” other wait time. Now the reports on production were suppose to be run around 7am but the client wasn’t sure. Here is what the load profile looked like on production looked like
07-NOV PRODUCTION
time AAS GRAPH
----- --- ---------------------------------------------------------------------
00:00 1.5 ++++o
01:00 1.1 +++
02:00 .4 +
03:00 .2 +
04:00 .4 ++
05:00 1.5 ++ooo
06:00 1.6 +++oo
07:00 3.2 ++++++++ooo-
08:00 3.6 ++++++++++ooo--
09:00 6.1 +++++++++++++++++oooooo----
10:00 4.7 +++++++++++++++++ooo
11:00 5.3 +++++++++++++++++++ooo-
12:00 0.0 +++++++++++++++++++++++++++++++++oooooooo---------------------------
13:00 0.5 ++++++++++++++++++++++++oooooooooooooooooooo
14:00 19.2++++++++++++++++++++++++++++++++oooooooooooooooooooooooooooooooooooooo
15:00 6.5 +++++++++++++++oooooo
The customer thought the report had been run at 8am on production and at 8am there is little I/O wait and some CPU. Looking at the top SQL from the clone during the report verses the production database at 8am didn’t show any of the same top SQL. At this point I ask the customer to send me AWR exports of the two databases. I imported the AWR exports giving the clone database DBID=1 and production DBID=2
The first thing I check was the I/O latency in the AWR repository data. Both production and the clone showed I/O averaging 6ms, so latency was unlikely to be the problem.
Clone Latency during the report:
BTIME EVENT_NAME AVG_MS CT
--------------- ------------------------------ --------- ----------
07-NOV-11 16:00 db file scattered read 6.48 4,246
07-NOV-11 16:00 db file sequential read 8.15 4,760,454 **
07-NOV-11 16:00 direct path read 3.50 680,192
07-NOV-11 16:00 direct path write 10.22 14
07-NOV-11 16:00 direct path write temp .62 2
Production looked like (for example)
BTIME EVENT_NAME AVG_MS CT
--------------- ------------------------------ --------- ----------
07-NOV-11 14:00 db file scattered read 2.93 148,226
07-NOV-11 14:00 db file sequential read 6.35 4,961,761 **
07-NOV-11 14:00 direct path read 2.32 2,706,322
07-NOV-11 14:00 direct path write .63 204
07-NOV-11 14:00 direct path write temp .86 820
Thus the clone looks slightly slower, but not enough to explain I/O load seen on the clone.
Then I check for the top SQL during the report on the clone which gave:
Top SQL on clone during the report
select
SQL_ID ,
sum(decode(session_state,'ON CPU',1,0)) "CPU",
sum(decode(session_state,'WAITING',1,0)) -
sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0)) "WAIT" ,
sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0)) "IO" ,
sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
from dba_hist_active_sess_history
where SQL_ID is not NULL
and dbid=1
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1)) desc
SQL_ID CPU WAIT IO TOTAL
------------- ------ ------ ------ ------
4sfx95xxxxc03 31.0 .0 736.0 767
I then looked for the reports most expensive query on the clone in the AWR from production. The load profile for the same query on production was completely different.
Load on Production of top SQL from report on Clone
select
SQL_ID ,
sum(decode(session_state,'ON CPU',1,0)) "CPU",
sum(decode(session_state,'WAITING',1,0)) -
sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0)) "WAIT" ,
sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0)) "IO" ,
sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
from dba_hist_active_sess_history
where SQL_ID is not NULL
and dbid=2
and sql_id = '4sfx95xxxxc03',
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1)) desc
/
SQL_ID CPU WAIT IO TOTAL
------------- ------ ------ --------- ------
4sfx95xxxxc03 12.0 .0 39.00 51
So the load profile of the most expensive report query on the clone looks completely different than production. On production the overall load is much lower and in particular I/O is drastically lower.
Why?
Now I wanted verify the stats for each SQL statement to make sure the query was doing the same work:
CLONE
SNAP_ID EXECS DISK LIO RWS CPU ELAPSE IO_TIME READS
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
18798 25648 2746798 21669637 12554 1768040000 7867477859 7430523627 2746594
Prod
18887 116449 52123 3026366 77023 67260000 377033758 313874605 52122
So query is being executed more on Prod yet doing much less disk and less CPU.
Why?
Let’s fine the sql plan hash for production and the clone :
DB SQL_ID PLAN_HASH_VALUE
--------- ------------- ---------------
clone 4sfx95xxxxc03 1805550729
Prod 4sfx95xxxxc03 3038004819
They are different!
Lets look at the two execution plans
col operation for a75
SELECT
LPAD(' ',depth)||
OPERATION||'_'||
OPTIONS||' '||
OBJECT_NAME operation
FROM
DBA_HIST_SQL_PLAN
WHERE
sql_id='&SQL_ID'
and DBID=&DBID
order by ID,PLAN_HASH_VALUE
/
(I used a direct query on DBA_HIST_SQL_PLAN but you can also used the package provided by Oracle, for example:
select * from table(dbms_xplan.display_awr('4sfx95xxxxc03',1805550729,1)); -- sql_id, plan_hash, DBID
)
Plan Hash
CLONE Production
----------- ------------
1805550729 3038004819
OPERATION
---------------------------------- --------------------
SELECT STATEMENT_ SELECT STATEMENT_
SORT_ORDER BY SORT_ORDER BY FILTER_
FILTER_ FILTER_
NESTED LOOPS_ NESTED LOOPS_
NESTED LOOPS_ NESTED LOOPS_
-> TABLE ACCESS_BY INDEX ROWID PAY_CHECK TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS
-> INDEX_RANGE SCAN I1PAY_CHECK INDEX_RANGE SCAN EPAY_EARNINGS
-> INDEX_RANGE SCAN PAY_EARNINGS INDEX_UNIQUE SCAN PAY_CHECK
-> TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS TABLE ACCESS_BY INDEX ROWID PAY_CHECK
TABLE ACCESS_BY INDEX ROWID PAY_CHECK TABLE ACCESS_BY INDEX ROWID PAY_CHECK
INDEX_RANGE SCAN I0PAY_CHECK INDEX_RANGE SCAN I0PAY_CHECK
So the execution plans have changes. Now why have they changed? That’s a story for another day, but the plan would only change if the data and/or the statistics on the tables had changed.
The clone in fact wasn’t an exact clone. Either it had different object statistics or the actual data was changed. Ih this case and with more investigation, it turns out the statistics were run more recently on the clone than on production.
REFERENCE: DBA HIST Views
photo by Daniel Sjöström
Comments