Attached is the power point version of the presentation Joe Priest and I delivered at PTCLive 2015.

Below is my 2014 PTCLive Global presentation "Guerilla Oracle Tuning for Windchill Administrators".


Presentation Description:

PTC Windchill performance and stability depend heavily on a well-functioning Oracle database. Unfortunately, many PTC Windchill administrators are at the mercy of their DBAs, who for various reasons may not give the PTC Windchill system as much attention as they should. This presentation aims to make the PTC Windchill administrator less reliant on the DBA, able to tell when there are problems in Oracle, and able to deal with some of the problems which have been identified.

Below is the presentation Tim Atwood and I just presented at PTCLive Global 2014 this morning titled "PTC System Monitor Deep-Dive: PTC Windchill Pro-Active Monitoring and Performance Troubleshooting".


Internal PSM Training

Posted by vaillan Feb 27, 2014

There was recently an effort to locate and combine all the existing internal Technical Support PSM training materials into one file.  The idea was to create a document that a technical support engineer who isn't familiar with PSM could spend an hour or two reviewing and, by the end, have an understanding of the basics of PSM along with how to get it installed.  To the best of my knowledge it is the most concise set of intro to mid-level knowledge available for PSM, and I wanted to make it available to the community.


Special thanks to Pat Kulenkamp from Integrity support for putting this document together. 


Other resources and the latest documentation links can be found on the main PSM resource page.

This post is the first in a planned series of posts on how to deal with performance problems relating to queue processing, with an emphasis on workflow performance. 


Even though the name 'workflow' suggests doing something, workflow processes are more like containers that direct queue entries and hold the results when the entries are processed.  The queue entries are executed on one of two types of queues: 'processing' queues, which aim to execute an entry as quickly as possible, and 'schedule' queues, which execute an entry at a determined point in the future.


Processing queues are first in, first out (FIFO) queues.  They execute one entry at a time based on the order it was received into the queue; the oldest "Ready" entry is the next to process.  Processing queues also will not allow another entry to start until the entry currently executing completes.  This model of operating isn't a problem when a queue entry takes a fraction of a second to complete, but when an entry takes minutes or longer it blocks the waiting entries from executing, which can result in users not getting a task as expected.  Long running expressions, or set states in a system with a large number of class based sync robots, are common reasons why a processing queue may not process as quickly as it should.


A schedule queue differs from a FIFO queue in that it doesn't act on a first in, first out basis, but more like a calendar function where at a specific time in the future the schedule entry needs to "fire" and do something.  Overdue notifications or timer robots are examples of workflow actions which would be processed in a schedule queue.  When editing a workflow template, any time a time to wait is entered you are creating an entry to be executed in the schedule queue. 


A 'Queue' is made up of several components:


1) Its definition in either the processingQueue or ScheduleQueue database tables, which defines how the queue operates.  A SQL statement to see some of the attributes for all of the processing queues in the system is:


set linesize 200


col  QUEUESTATE for a20
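A sketch of the SELECT itself; the column list is an assumption based on the attributes discussed later in this post, so verify the names against the table definition in your release (for example with desc processingQueue):

```sql
-- Column names (queueState, suspendDuration, exceptionRetries) are assumptions;
-- verify with: desc processingQueue
select name, queueState, suspendDuration, exceptionRetries
from processingQueue;
```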




b) For a list of the schedule queues, change "from processingQueue" in the above SQL to "from ScheduleQueue".



The same list of queue attributes can also be viewed in the queue manager.


The most important attribute from a performance perspective is "suspendDuration".  This parameter is the default number of seconds a queue will stop processing entries when a queue entry generates an error.  In some releases the default time can be set as high as 120 seconds.  According to the developer who set this default, the two minute time was to give the administrator time to fix the problem.  I pointed out to the developer that detecting, diagnosing and correcting a failed queue entry in two minutes wasn't going to happen, and that having a queue suspend itself, especially if a series of entries failed in a short period of time, would give the appearance of a hung queue. 

Related to the suspend duration is the exception retries setting.  Some types of queue errors are resource contention problems, and after waiting a couple of seconds, re-trying the entry will allow it to process.  Unfortunately, for other failed entries no number of retries will matter.  It's hard for an administrator to know which type of exception they experienced, but in general, deadlock and Oracle exceptions are most likely transient, while exceptions like ClassCast or "object no longer exists" probably aren't going to be fixed by waiting.  Setting exception retries and suspend duration to something low is generally a good idea.  Values between 2 and 5 may be reasonable; an exception retries of 3 and a suspend duration of 5 are something I've recommended in the past.
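As a sketch, and assuming these attributes can be edited directly in the queue tables (take a backup first, and restart the BGMS so the changed definitions are re-read), the recommended values could be applied with:

```sql
-- Assumes suspendDuration / exceptionRetries columns exist on both queue tables;
-- verify with: desc processingQueue
update processingQueue set exceptionRetries = 3, suspendDuration = 5;
update ScheduleQueue set exceptionRetries = 3, suspendDuration = 5;
commit;
```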



2) The second part of a queue is a Background Method Server (BGMS) thread to process entries.  On startup of the BGMS the queue definition is read into memory and defines the parameters of the java thread created to process the entries for that queue.  Queues can run on multiple background method servers, but each queue name must be unique among all BGMS.  For example, it's not possible to have a queue named WfProcessingQueue2 running on two different background method servers.


3) Queue entries are the third and final component of a queue.  A queue is notified when an entry is placed into it, and there is also a polling mechanism in place as a backup, polling the queueEntry table every 60 seconds, to make sure an entry hasn't been missed.  In my experience, though, the queue missing entries to process isn't something which happens.  A queueEntry can be generated either by a user action, like completing a task, or automatically by another step in the workflow.  One of the most common problems with workflows is a failure of a queue entry to process.  The SQL below will show an ordered summary of the failures which have built up over time.


select codec5, name,  count(*) count from queueEntry,processingQueue where queueEntry.ida3a5=processingQueue.ida2a2 group by codec5, name order by 3,1;


Sometimes administrators think that failed or severe queue entries should be removed or deleted from the queue.  NO!  Failed entries are potential work which may or may not have been done.  If the related task or workflow has been closed, then it's ok to remove the entry.  Similarly, resetting a severe or failed entry to a READY state to process isn't a good idea either.  Queue entries fail for all sorts of reasons, but at its core a queue entry is a business task which needs to be completed, wrapped inside a queue processing jacket.  In some cases the workflow-related part can complete (like completing a task) but for some reason the queue processing part of the entry fails, leading to a severe queue entry.  Setting the queue entry to READY will re-execute the business task, which can result in earlier steps in the workflow "waking up".  Unless it is understood what the queue entry was supposed to do and what the current state of things is, neither delete nor set to READY the failed queue entry.


There are two ways to deal with failed queue entries.  One is to use the QueueAnalyzer utility; this utility will generate a csv file of all processing queue entries which can be imported into Excel for analysis.  There are a lot of instructions in the above link, but interpreting the output really comes down to: if the workflow or task is complete, the entry is probably safe to remove.  Other conditions need to be evaluated on a case by case basis.  The second option is to use the Site > Utilities > Workflow Process Administration capabilities to drill into problem workflows/queue entries.


The next post will be on using the workflow report tool to spot and diagnose problems, along with details on how sync robots, the number one trouble spot with workflows and queues, impact the performance and stability of a BGMS. 


Lastly, one problem which I see, particularly in large systems or ones which have a large number of custom queues, is a lack of DB connections.  In most method servers having a maximum of 15 or 20 connections is typical, but in a BGMS this few connections can lead to problems depending on the concurrency of work being done in the queues.  In some systems setting the max db connections to 50 or more is necessary.  I'd use the wt.method.MethodContextMonitor.stats.summary output to see if the max number of active contexts is exceeding the available connections.  If the BGMS is configured with 4G+ of heap it might be worthwhile setting the maxDBconnections to 50 to prevent problems too.
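As a sketch, the pool size is controlled in wt.properties; the property name and value below are illustrative and should be verified for your release:

```
# site-modified codebase/wt.properties entry for the BGMS (value illustrative)
wt.pom.maxDbConnections=50
```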

I had the opportunity this morning to present to some of PTC's partners on Windchill Performance.  The presentation is below.


Comment Jan 24, 2014:

There was a typo on slide 5 which has been corrected.  The database parameter should be memory_max_target and not memory_max_size as was originally in the presentation.

I was recently asked to look at a set of log files from a MethodServer in a Windchill PDMLink 10.x system that became unresponsive because of high memory usage and garbage collection activity.  In addition to what I thought the root cause of the problem was, they were also interested in knowing how I did the analysis. 

Although PSM was not installed to aid in an after-action root cause analysis, in 10.x systems, for certain types of critical warnings including out of memory errors, the JMX infrastructure is configured to generate thread dumps to the log files.  The thread dumps can be of great benefit in doing the root cause analysis.


Note: I reference Textpad below; it's my editor of choice when doing log analysis, but other text editors should work well also.


Step 1: Finding the right log file(s) to analyze


It is quite common for me to get a set of data for analysis that might be 50+ log files and hundreds of megabytes (or more) in size.  Opening and scrolling through each one isn't efficient (and it's time consuming); instead I search for a couple of patterns in the log files using the "Find in Files" capability. 


I like to have all the files I'm going to look at in a single directory or directory structure (including the date in the directory name can make things easier if you need to look for patterns in the future among a series of outages). 



  • I most often start by searching for either "summary" or "wt.method.MethodContextMonitor.stats.summary"
  • This will generate a large list of results, from within this list I then search for "WARN"
  • This will lead to a much smaller list of matches indicating both the MethodServer log(s) which had problems and the times when those problems occurred (summary lines by default cover the previous 10 minutes)
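For those who prefer the command line to Textpad's "Find in Files", the same two searches can be sketched with grep (the directory and log file names below are illustrative sample data):

```shell
# Sketch of the search steps above; paths and log content are illustrative.
mkdir -p /tmp/ms_logs
printf '%s\n' \
  '2014-02-01 08:00:00 INFO  wt.method.MethodContextMonitor.stats.summary - ok' \
  '2014-02-01 08:10:00 WARN  wt.method.MethodContextMonitor.stats.summary - slow contexts' \
  > /tmp/ms_logs/MethodServer-1651-log4j.log

# Step 1: which logs contain summary lines at all?
grep -l 'wt.method.MethodContextMonitor.stats.summary' /tmp/ms_logs/*.log

# Step 2: narrow the matches to WARN summary lines; each hit marks a
# 10-minute window in which that MethodServer had problems.
grep -n 'WARN' /tmp/ms_logs/*.log | grep 'stats.summary'
```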


textpad sorting3.png

  • For more information on the summary lines please see this link
  • Other options to search the logs for are:
    • WARN
    • Low Memory Detector
    • Full thread stack trace

Of these additional search terms, the second two (Low Memory Detector and Full thread stack trace) are going to be the most beneficial in identifying the logs containing problems.  WARN messages can appear for a large number of reasons and result in a lot of noise in the search results.


Step 2: Which thread(s) are responsible for the problems in the MethodServer logs identified in Step 1?

Find the long running thread and you'll very likely have identified the root cause of the method server's problem.  Occasionally it will be two or three threads working together, but most often you are looking for one thread.  Prior to Windchill 10.x, the thread names containing the lower numbers were the ones that had been running the longest.  However, in 10.x systems, with Tomcat embedded into the MethodServer, there is now a pool of thread names which are re-used.  In almost all cases there are one or a handful of threads which have been running for orders of magnitude longer than the others.  Identifying these threads is the goal; my usual minimum time of interest is two minutes. 

There are several ways to identify long running threads:

  • Searching for "StatementCache" lines; a series of log file lines with the same thread name generating these lines in a short period of time indicates a high resource consuming operation. I will often copy all the lines from this thread into a separate file and look to understand what was happening based on the verbosity, errors and stacks which the thread generated.
  • Sometimes the operation completes and the operation will be flagged with a "WARN" because of a long elapsed time 
  • My favorite method though is to identify the probable problematic threads from the thread dumps which have been generated automatically


Step 3: Making sense of thread dumps


Most often, because of the "Low Memory Detector" warnings, the MethodServer generates thread dumps; the first line of each stack trace starts with "Full thread stack trace".  There are often multiple dumps in a single MethodServer log.  A sample stack looks like:


"ajp-bio-8010-exec-12" Id=3218 daemon prio=5 RUNNABLE (in native)

      Blocked (cnt): 652; Waited (cnt): 422

      CPU nanos: 32573008800; User nanos: 32073805600; Bytes allocated: 2728562624

  at Method)



  at org.apache.coyote.ajp.AjpProcessor.readMessage(

  at org.apache.coyote.ajp.AjpProcessor.process(

  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(


    - locked

  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(

  at java.util.concurrent.ThreadPoolExecutor$



Locked synchronizers:

    - java.util.concurrent.locks.ReentrantLock$NonfairSync@76edd8e3


The most crucial information for diagnosing a problem is the third line, containing the CPU and User nanos times.  A nano is 1x10^-9 seconds.  In this example 32573008800 translates to (moving the decimal 9 places to the left for seconds) 32.5 seconds.


Using Textpad, find and "Mark All" the lines in the MethodServer log file containing "CPU nanos", then use Edit > Copy Other > Bookmarked Lines to copy these lines into a new file in Textpad.


Next sort the lines on CPU time (Tools > Sort), for example:

textpad sorting.png


Starting at the top of the sorted list (i.e. with the longest running thread), search in the MethodServer log for the CPU time and its associated stack.  Work your way down the list until a pattern becomes clear. 
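The same mark, copy and sort steps can be sketched on the command line with grep and sort (the sample log content below is illustrative):

```shell
# Pull the "CPU nanos" lines out of a log and sort them numerically,
# largest first. The sample log content is illustrative.
mkdir -p /tmp/ms_logs_sort
printf '%s\n' \
  '      CPU nanos: 32573008800; User nanos: 32073805600; Bytes allocated: 2728562624' \
  '      CPU nanos: 3283727449400; User nanos: 3282354640600; Bytes allocated: 12848812024' \
  > /tmp/ms_logs_sort/MethodServer-sample.log

# -t: splits on ':' so field 2 starts with the CPU nanos value;
# -k2 -rn sorts that value numerically, descending, putting the
# longest-running thread at the top of the list.
grep 'CPU nanos' /tmp/ms_logs_sort/MethodServer-sample.log | sort -t: -k2 -rn
```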


For the case that led me to create this posting, the problem was a search operation which had been running for 3283.7 seconds.



"DataSource Task Executor-285" Id=4411 prio=5 RUNNABLE

      Blocked (cnt): 469; Waited (cnt): 0

      CPU nanos: 3283727449400; User nanos: 3282354640600; Bytes allocated: 12848812024

      Method context: 182lr2z;hnalyq3j;5908;oyzooc;7146; DB session: 473

  at java.util.LinkedList.remove(




  at wt.pds.CompositeAccessController$FilterResultProcessorBridge.addElement(

  at wt.pds.SQLDatabasePds.query(

  at wt.pom.PersistentObjectManager.query(

  at wt.fc.StandardPersistenceManager._query(

  at wt.fc.StandardPersistenceManager.query(

  at com.ptc.core.query.server.impl.BasicQueryService.execute(

  at com.ptc.core.query.server.impl.BasicQueryService.query(

  at com.ptc.core.query.server.impl.BasicQueryService.query(






  at com.ptc.mvc.components.ds.AbstractDataSourceComponentDataBuilder$

  at com.ptc.mvc.ds.server.DefaultDataSourceManager$DataSourceTaskCallable.callInContext(


This customer didn't appear to have any query limits set (which is a very common cause of memory problems); the recommendation to them was to consider setting the currently unset (unbounded) query limit properties:

    1. wt.pom.paging.snapshotQueryLimit=10000
    2. wt.pom.queryLimit=250000


If you go through this analysis and identify the problem stack(s), searching the knowledge base is a good next option.  If you can't find anything in the knowledge base, opening a case with Technical Support and providing them with your analysis is the next step.

I was looking at a Dynatrace AJAX capture of slow scrolling in a table with 900 rows in a Windchill 10 system this morning.  The complaint was that when using IE 8 the operation would appear to hang indefinitely, but the same operation would complete in Firefox.  The Dynatrace AJAX capture showed all of the time was spent in the internal browser operation of 'rendering'.  I talked to our resident guru for all things client side and his suggestions were the following:


  • Apply the bundled registry entries from CS23960
  • Disable compatibility view mode for IE
  • Set the browser preference to use smooth scrolling (Tools > Internet Options > Advanced > Use smooth scrolling)
  • Upgrade the browser to IE 9 32-bit (but not 64-bit, which is about equivalent to IE 8), based on the testing which has been done.  IE 9 is about twice as fast as IE 8. 

Image taken from page 8

  • 8-7-2013 11-31-00 AM.png

A Windchill based system can be slow or may appear not to respond for any number of reasons.  The underlying reasons can range from the unlikely, such as a DNS routing problem, to the more common, like a misconfigured JVM or method server parameter.  But by far the most common cause of poor Windchill system performance relates to database interaction with Windchill. 


Too Many Small SQL statements Being Sent to the Database:


Database related performance problems come in two main forms.  The first is a problem in the software which results in hundreds or thousands of SQL statements being sent to the database from the method server.  With this type of problem each SQL statement executes quickly within the database, but the back and forth of sending the statement and then receiving the results leads to the delay in response.  In some cases these types of performance problems are caused by the lack of a cache or by an existing cache which is too small, but more commonly these problems are caused by software defects and require a correction to the code from R&D.  To identify a code defect relating to a missing cache or too many round trips, we usually use either the Windchill Profiler or PSM.  To identify an undersized cache the Windchill Configuration Assistant (see chapter 1) can be used.


Without using a profiler of some sort it's tough for an administrator to know whether there are problems in their system, or whether to bother looking for them using either the Windchill Profiler or PSM.  However, a SQL statement like the one below will identify SQL statements which might be part of scalability problems and warrant further investigation using PSM or the Windchill Profiler.


SQL: Identify SQL statements which could indicate a scalability problem at the application level which needs to be looked into further using PSM or the Windchill Profiler

    • Highest execution counts are found at the bottom of the report 
    • Generally only execution counts in the hundreds of thousands or more are of concern, and then the concern is moderate

spool executions.txt
set linesize 200
col LAST_LOAD_TIME for a20
col SQL_TEXT for a60
select username, LAST_LOAD_TIME,EXECUTIONS,SQL_TEXT from v$sql, dba_users
where (upper(SQL_TEXT) like '%INSERT%'
or upper(SQL_TEXT) like '%DELETE%' 
or upper(SQL_TEXT) like '%UPDATE%'
or upper(SQL_TEXT) like '%DBMS_LOB.WRITE%')
and upper(SQL_TEXT) not like '%SELECT%'
and dba_users.user_id=v$sql.PARSING_USER_ID
and EXECUTIONS >250000
and username not in ('SYS','SYSTEM','OUTLN','SYSMAN')
order by EXECUTIONS;

spool off


SQL Statements taking too long in Oracle


The second type of performance problem is slow running SQL.  Unfortunately "slow" can mean different things depending on the context and who is being affected, if anyone (some SQL statements originate in the BGMS and don't have a user waiting for the response).   When I evaluate an Oracle system to identify "slow" SQL I do so in a couple of different ways.  I want to find the SQL which cumulatively is taking the most amount of time (per-execution time * number of executions), and the SQL statements which don't execute very often but take a long time, sometimes several minutes, when they do.  Lastly, I look at SQL statements which may not appear in either of the two above lists but which, primarily because of disk reads, negatively impact the rest of the system.

The goal of all three of the queries below is to identify a small handful of SQL statements which are most impacting the system and need to be improved.  When the problem SQL has been identified there are a couple of possible next steps:

  1. Contact the system's DBA and ask them to look into the problem
  2. Search the Technical Support Knowledge base for the same SQL and the index which improves it.  When searching it's generally better to use the tables and their aliases in the FROM clause and some of the WHERE clause conditions, rather than starting with the columns listed in the SELECT clause.
  3. Open a case with Technical Support to have a performance specialist review the statement and make a recommendation.


  • The cumulative execution time of the SQL statement is too long.  What's too long depends, but for a statement that has been executed hundreds or thousands of times, an average time of 2 or 3 seconds is probably too long.   To identify SQL statements that meet this criteria the following SQL can be used:



SQL 1: Identify SQL statements in which combined elapsed times are too long.


    • SQL statements of most concern are at the bottom
    • I usually look at the ELAPSED_TIME column to understand the overall impact of the statement
    • A new SQL statement starts when the LINE# is 0


spool long_cumulative_elapsed_times.txt

set linesize 200

set pagesize 75

col username for a16

col cpu_time_secs for 999,999,999,999

col elapsed_time_secs  for 999,999,999,999,999

col Elap_per_Exec  for 999,999,999,999

col ROWS_PROCESSED for 999,999,999,999,999

col executions for 999,999,999

col sql_text for a65

select a.SQL_ID,c.username, CPU_TIME*0.000001 cpu_time_secs,

ELAPSED_TIME*0.000001 elapsed_time_secs,round(ELAPSED_TIME*0.000001/nvl(nullif(executions,0),1)) Elap_per_Exec,

executions,ROWS_PROCESSED,b.piece line#,b.sql_text

from v$sql a, v$sqltext b, dba_users c

where a.HASH_VALUE = b.HASH_VALUE

and (ELAPSED_TIME*0.000001>10 or executions  > 1000)

and executions>0

and round(ELAPSED_TIME*0.000001/nvl(nullif(executions,0),1)) >1.99

and c.user_id=a.PARSING_USER_ID

and username not in ('SYS','SYSTEM','OUTLN','SYSMAN')

order by ELAPSED_TIME,CPU_TIME,a.HASH_VALUE, b.piece asc;


spool off


  • The execution time of a single SQL statement is high.  These are generally SQL statements that execute infrequently, often less than a couple of dozen times.  The threshold of acceptability depends to some extent on what the SQL is doing and its impact, combined with what is reasonable for the statement.  Not all Windchill statements can execute quickly; there is a class of statements which, because of the statement design, inputs or data size, Oracle can't execute quickly.  As a general rule, a SQL statement taking more than 20-60 seconds is probably too long and should be looked at for tuning.  For a SQL statement that executed once two days ago, took 45 seconds, and was a keyword search or custom report execution, it may or may not make sense to spend time looking into that query. 


SQL 2: Identify Long Running SQL statements


spool long_elapsed_time.txt
set linesize 200
set pagesize 75
col username for a16
col cpu_time_secs for 999,999,999,999
col elapsed_time_secs  for 999,999,999,999,999
col Elap_per_Exec  for 999,999,999,999
col LAST_LOAD_TIME for a20
col executions for 999,999,999
col sql_text for a65

select c.username,CPU_TIME*0.000001 cpu_time_secs,
ELAPSED_TIME*0.000001 elapsed_time_secs,
round(ELAPSED_TIME*0.000001/nvl(nullif(executions,0),1)) Elap_per_Exec,
b.piece line#,
b.sql_text sql_text
from v$sql a, v$sqltext b, dba_users c
where a.HASH_VALUE = b.HASH_VALUE
and (ELAPSED_TIME*0.000001 > 10
or executions  > 1000)
and executions>0
and c.user_id=a.PARSING_USER_ID
and round(ELAPSED_TIME*0.000001/nvl(nullif(executions,0),1)) >20
and username not in ('SYS','SYSTEM','OUTLN','SYSMAN')
order by Elap_per_Exec,ELAPSED_TIME,CPU_TIME,a.HASH_VALUE, b.piece asc;

spool off



  • Resource consumption by a statement is high.  In many cases most of the SQL statements of interest will be found by the two queries immediately above, but sometimes some database scalability limiting statements aren't captured by that SQL.  The SQL below may identify additional statements which are worth investigating.  This SQL applies a percentage bad (% bad) to each statement as compared to the other statements in the system and then returns a limited number of under performing SQL statements which, according to the formula, are the most resource intensive in the system. 



SQL3 : Resource Intensive SQL statements


spool resource_intensive_sql_statements.txt

set linesize 160

set pagesize 75

col  buffer_gets for 999,999,999,999

col disk_reads  for 999,999,999,999

col bad for 99.999

col sql_text for a64


select * from (

select a.SQL_ID SQL_ID,buffer_gets,disk_reads,executions,b.piece line#,a.HASH_VALUE HASH_VALUE,

((disk_reads*1000)+buffer_gets)/tot.sumed*100  bad ,b.sql_text sql_text

from v$sql a, v$sqltext b,

(select sum((disk_reads*1000)+buffer_gets) sumed from v$sql) tot

where a.HASH_VALUE = b.HASH_VALUE

and (disk_reads > 1000

or buffer_gets>100000

or executions  > 1000)

)

where bad >3

order by  disk_reads*1000 + buffer_gets ,HASH_VALUE,line# asc;


spool off


Note: All above SQL statements are intended to be run from SQL*Plus as the sys or system user. 

At this year's PTC Live Global I was a co-presenter of the presentation Using PSM to Diagnose Performance Bottlenecks, which is downloadable below.  This presentation covered how to use PSM (PTC System Monitor) both to monitor a Windchill system and to dig into a problem when one occurs.


The second presentation is titled Improving Windchill Performance and was given by Ram Krishnamurthy, the product manager for Windchill performance; it covers a lot of the core information for attaining acceptable Windchill performance.  If you only look at one slide, make it slide #9.

A problem that came across my desk today involved a customer who recently went down because of a method server stability problem related to displaying the Event Manager for a specific user.  Numerous other systems have also experienced stability problems related to this problem.  The Event Manager contains the data for conflicts and failed operations related to Wildfire/Creo interactions with PDMLink; the data is intended to be informational, or, if the operation failed, it can be retried.  The problem is that this data is only removed when the user accesses the Event Manager, but if the results are too large to display the records aren't removed, compounding the problem.  This can lead to a user experiencing a hang or performance problem displaying a page containing the Event Manager, or in some cases even cause the method server to run out of memory.  This problem is unfixed and is present in R9.x and R10.x systems.


Earlier today I updated the article for this problem and made the delete script which corrects the problem available for download. 

This is the first post in a series of posts providing greater detail on using the gather info script to diagnose both Windchill and other database performance problems.


  1. Introduction

Using the gather info script to perform Guerilla Oracle tuning is a proven process in which key aspects of a database's performance are considered, the major bottlenecks and most impactful SQL statements are identified, and the corrective actions for the major problems are determined.  The goal of this process is to have a high positive impact on the system's overall performance with a moderate amount of time invested by an analyst.  Included in the process are identifying the system's biggest problems, guidelines on tuning SQL statements and rules for creating indexes, along with considerations for carefully making changes to a production system. 


  2. Process History, Benefits and Limitations

This tuning process first started to be developed around 2000 and in the years since has been refined and improved.  The drive behind the process's development was the need to enable people in an enterprise software technical support department to diagnose and resolve Oracle performance problems remotely.  In many, even most, cases gaining the appropriate database login and network security access is time consuming, difficult, or, in the case of government locations or defense contractors, not feasible.  It’s hard to overstate the importance of an application’s performance to its success.  This, combined with the need to evaluate a large volume of systems per day with little delay between the data coming in and recommendations being made, the absolute requirement NOT to cause additional problems, and the knowledge that most recommendations would be made directly to a production system with little or no testing, required a process which was repeatable, fast, scalable and accurate, and gave results that didn’t cause harm to the system in the process.  The guerilla Oracle tuning process is the result.  It has been taught to numerous people and applied to the vast majority of Windchill customers around the world at one time or another.

The first version of the diagnostic SQL script which supports the process was developed initially for Oracle 8 and has been enhanced for each new Oracle release.  The script started development back when Oracle's Statspack was the only diagnostic output available, and not even close to being enough to diagnose a system.  Since then AWR & ADDM have been added, and both outputs fit into this process if they are available; in particular an AWR report covering a specific time period fits in well.  But an AWR report isn’t enough; the output from the SQL script is needed also.  It was these competing needs (quick and accurate, remote access only, high stakes) and the lack of available Oracle diagnostic output which drove the script's development and the acquisition of the knowledge to interpret its output.  The result is the Oracle tuning process documented here.

  3. The goals and benefits of the Guerilla Tuning Process are:


    • Reduce the time it takes to diagnose a problem.  The goal is, within a couple of minutes of reviewing the data, to have a pretty good idea of what the major problems are.  Drilling down into the specifics of each of the problems and creating solutions takes longer, but generally most systems can be evaluated and recommendations generated in under an hour by a trained analyst.


    • Quickly create solutions to improve the performance of a system.  Generating a series of steps to be applied to the target system is the output of the process.  It is important that the recommendations be developed speedily and not be too complicated, so they can be applied easily by someone with little database knowledge.  For example, changing a table to be partitioned might be “the best” solution, but it would be exceedingly rare for this to be the recommendation, both because of the complexity to initially create it and because of its ongoing maintenance costs.  It is better to work up to more involved and complicated solutions rather than immediately jumping to them.  Most recommendations involve changing init.ora parameters or adding indexes.  Other solutions can be recommended on subsequent passes at improving the system, but keeping recommendations as simple as possible initially is preferred.


    • A "check your analysis" step to ensure the solutions address the biggest problems.  Understanding the database's "story" (discussed in a later post) allows an analyst to identify the major bottlenecks in a system and ensure that the solutions provided address the biggest problems, creating a built-in feedback loop between the wait events, the segment statistics and the resource-intensive SQL.  For example, if the system wait events indicate an I/O problem and the memory advisor views indicate a lack of buffer cache, the segment statistics will indicate which tables or indexes are experiencing a disproportionate amount of I/O.  With this information the analyst knows they will likely need to give the system additional memory, provide tuning solutions for SQL involving the high I/O tables and indexes, or make a change to the physical location of these segments to improve performance (there are several other solutions which could be pursued for this example too).  For a given problem there is a series of available solutions, worked through from the easiest to implement.  The first step, then, is identifying the scope of the problem(s) the database is experiencing and ensuring that the recommended solutions are aimed at improving those problems.  This may sound obvious, but much of the literature about Oracle performance covers things which could improve performance while saying very little about how to recognize when a solution is the most appropriate one and when it is worthwhile to implement.  For example, it's possible to look at a system and realize it's disk I/O bound, but without knowing whether there are memory bottlenecks or whether one or more SQL statements are overwhelming the system and in need of tuning, it's not possible to determine whether balancing the disk I/O would be a good solution, although that is a solution found in the literature for disk I/O problems.
The guerilla tuning process generally aims to address the underperforming SQL first, then adjust init.ora parameters, and only then perhaps adjust a physical aspect of the database.  In many cases, however, adding indexes and changing a few init.ora parameters are done together.


    • Requires limited analysis time.  The time required to analyze a report varies, but it is generally somewhere in the 15 to 90 minute range.  How long an analysis takes depends on a number of factors, including the skill of the analyst, the number of problems present and the severity of those problems.  Counter-intuitively, the more severe the problems are, the faster the analysis can often be done.  This is because there is usually one overriding problem that needs to be addressed, often a severe lack of available memory or one particularly egregious SQL statement which needs to be tuned or re-written.  When a system is struggling as the result of a number of different problems, however, the analysis takes longer to complete as solutions for each of these problems need to be worked out.


  2. Limitations of the Guerilla Tuning Process:

There are unfortunately a few limitations to the process, but they are generally not serious and do not impose many constraints on the final tuned system.


    • May miss some problems the first time through the report, although the process can be repeated.  When the report output is combined with AWR reports from time intervals when there was a problem, problems will not be missed.  It is only when depending solely on the diagnostic report (gather info report) output that something may be missed.  This can happen if SQL statements which caused a problem in the past have aged out of the shared pool and are no longer in the V$ views the report accesses.  However, based on experience it is unusual for the report.txt file not to contain the most common and biggest problems of the system.  It's also possible for AWR reports covering narrow time intervals to miss the bigger problems the database is experiencing.


    • Requires a higher level of skill from the analyst.  The process can be done at different levels.  At one end it's possible to bring relative newcomers to Oracle performance up to speed and productive in a relatively short period of time.  However, getting the highest level of results from the analysis requires a higher level of skill and a great deal of practice.  There is a dedicated performance group in many regions, and many of the members of these groups have looked at hundreds of reports, if not thousands in some cases.  If you are having a critical or highly impactful performance problem, opening a technical support case is something that should be done as soon as possible.


    • Aims for an acceptable solution.  In order to quickly and competently tune a system it's necessary to avoid getting stuck in 'paralysis by analysis', which is easy enough to do.  The recommendations made from an analysis need to be reasonable and defensible; they must be reasonably expected to work and bring relief to the users without causing harm or degrading performance.  Unfortunately, there will be situations where uncertainty exists; in those situations, making the minimum number of changes and then re-evaluating is often the best course of action.  Things will be clearer the second time through with the first set of recommendations in place, as will what the next steps should be.


The next post will discuss the process's guiding principles and how there is no such thing as a free change.

I was a presenter at this year's Hotsos Symposium on Oracle performance.  The Symposium was three days of presentations dedicated to Oracle performance given by some of the top experts in the field.  I came back having learned a number of things.  I will be working on a relatively major change to the gather info script in the next couple of months as the result of one of these presentations (it relates to including a lightweight version of SQLTXPLAIN in the report output).


My presentation was titled "Guerilla Database Tuning" and covered the key components needed to remotely, quickly and safely improve the performance of an Oracle database.  My presentation is available at the bottom.

There have recently been several reports of serious intermittent stability problems resulting from long garbage collection times when method servers are running on Red Hat Linux.  This problem is the result of a bug in the Linux operating system. 


An example garbage collection line when this problem occurs is:


2013-01-24T16:10:26.934-0500: 6090.538: [GC6216.218: [SoftReference, 0 refs, 0.0000090 secs]6216.218: [WeakReference, 1918 refs, 0.0003090 secs]6216.219: [FinalReference, 1864 refs, 0.0035750 secs]6216.222: [PhantomReference, 3 refs, 0.0000080 secs]6216.222: [JNI Weak Reference, 0.0061630 secs] [PSYoungGen: 1039229K->134051K(1120960K)] 1693558K->838339K(3851648K), 125.6905170 secs] [Times: user=82.96 sys=1301.77, real=125.67 secs]


Observations about this line:

  1. Garbage collection (GC) times were excessive; 125.67 seconds elapsed time in this case.  Similar GC activity earlier in the file was taking fractions of a second.
    • The Server Manager was unable to contact the method server (MS) during this time, so it marked the MS as dead and tried to start a new one; the unresponsive method server was unable to respond to the stop command because GC was still occurring.  On machines where RAM is limited, having one extra method server running can cause swapping, further degrading performance.
  2. There are no memory problems which would explain the long GC time. 
    • The heap was configured for a maximum of 8G, but it was only using about 4G (3851648K), which was the Xms setting.
    • GC problems typically occur when more memory than the configured maximum is required.  What usually happens is the system uses memory up to the maximum (8G in this case) and still needs more to satisfy the request.  The JVM then tries to perform GC and free more memory, but with limited success.  The system continues to try to free memory, but little memory is actually freed.  Eventually almost all time is spent in GC and no other work is being done; the method server appears unresponsive when this happens.  A common cause of this condition is not having wt.pom.queryLimit set, but other situations can result in this condition as well.
  3. Memory reclaimed by the garbage collection activity was moderate; about 800M (1693558K->838339K).  Based on earlier GC activity in the log file recovering this memory should have happened quickly.  An earlier line from the same file which recovered similar but slightly less memory was:
    • 2013-01-24T14:31:47.485-0500: 171.089: [GC171.137: [SoftReference, 0 refs, 0.0000110 secs]171.137: [WeakReference, 1154 refs, 0.0002600 secs]171.137: [FinalReference, 2078 refs, 0.0014780 secs]171.139: [PhantomReference, 0 refs, 0.0000030 secs]171.139: [JNI Weak Reference, 0.0022820 secs] [PSYoungGen: 601033K->44968K(972480K)] 1089070K->533005K(3703168K), 0.0520670 secs] [Times: user=0.60 sys=0.00, real=0.05 secs]
  4. There was a disproportionate and excessive amount of time spent in "sys" time (kernel time) compared to 'user' time.  Normal GC activity is largely recorded against user time, and usually very little time is spent in "sys" time.
    • In the 125.67 seconds of wall clock time, the CPUs spent a total of 1301.77 seconds in kernel mode.  It was a 16 CPU machine, so during those 125 seconds there were 2000 seconds of CPU time available (125*16).  This one GC was using about 70% of the available resources during this period ((1300+125)/2000), very likely negatively impacting the performance of other method servers or other processes running on the same node.
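The resource-consumption arithmetic above can be checked directly (a minimal sketch; the figures are taken from the GC log line quoted earlier):

```python
# Numbers from the GC log line: [Times: user=82.96 sys=1301.77, real=125.67 secs]
cpus = 16
real_secs = 125.67          # wall-clock duration of the GC pause
user_secs = 82.96           # CPU time in user mode
sys_secs = 1301.77          # CPU time in kernel mode

available = real_secs * cpus                      # total CPU seconds available during the pause
used_fraction = (user_secs + sys_secs) / available

print(f"CPU seconds available: {available:.0f}")
print(f"Fraction consumed by this one GC: {used_fraction:.0%}")
```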


See the article CS113478 for additional information and a workaround to prevent this problem.
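A quick way to scan a GC log for this symptom is to flag pauses where kernel time dwarfs user time.  This is a minimal sketch, not PTC tooling; the regex assumes the `[Times: user=... sys=..., real=... secs]` format shown in the lines above, and the thresholds are illustrative:

```python
import re

# Matches the trailing timing summary of a HotSpot GC log line, e.g.
# [Times: user=82.96 sys=1301.77, real=125.67 secs]
TIMES_RE = re.compile(r"\[Times: user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs\]")

def suspicious_gc_lines(log_lines, sys_user_ratio=10.0, min_real=1.0):
    """Yield (line_no, user, sys, real) for pauses where kernel time dwarfs user time."""
    for n, line in enumerate(log_lines, 1):
        m = TIMES_RE.search(line)
        if not m:
            continue
        user, sys_t, real = map(float, m.groups())
        if real >= min_real and sys_t > user * sys_user_ratio:
            yield (n, user, sys_t, real)
```

Running this over the two sample lines above would flag the 125-second pause while ignoring the normal 0.05-second one.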

I recently reviewed a gather info report from a site which has around 1000 registered users and a peak concurrency of around 100 users a day.  The SQL statement below stuck out like a sore thumb in the report as needing to be tuned; it was the last statement in both the "Report End: Worst Total SQL by CPU Time" and "Report End: Worst SQL By Factor" sections of the report.txt file.  The statement had been executed more than 1500 times, with an average elapsed time of 11 seconds per execution.  The statement had a cumulative elapsed time of 17,535 seconds, which means users of the system spent almost five hours waiting for this one statement to complete.  But the statement only had a CPU time of 1,118 seconds, suggesting there was a significant delay, which on analysis was determined to be 40 million disk I/O reads.  Any statement that performs excessive disk I/O reads inhibits overall system scalability in two ways:

  • First, by using a disproportionate amount of the finite disk I/O resources available; other operations may be delayed or need to wait because of the over-use of resources by one statement.
  • Second, by forcing cached table and index blocks out of memory, forcing them to be re-read (incurring additional disk I/O) and delaying subsequent SQL statements, which must then wait for the I/O to happen rather than getting a more immediate response from blocks already cached in memory.
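The cumulative cost quoted above is worth making concrete (a quick check using the figures from the report):

```python
# Figures from the gather info report discussed above.
executions = 1500                 # "more than 1500" executions
total_elapsed_secs = 17_535       # cumulative elapsed time of the statement

avg_elapsed = total_elapsed_secs / executions   # ~11.7 s per execution
hours_waited = total_elapsed_secs / 3600        # ~4.9 hours of cumulative waiting

print(f"{avg_elapsed:.1f} s per execution, {hours_waited:.1f} hours total")
```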


When looking to improve the performance of a SQL statement several things to initially consider are:

  • How many rows & blocks are in the table.  Does it make sense, based on this information, to evaluate adding an index?
    • In this case the table had almost 500k rows and 26k table blocks with an 8k block size.  It was reasonable to assume that a table of this size could cause performance problems and has enough rows that adding an index could be justified.  In some reports, high disk I/O tables have a very modest number of rows, maybe 2K rows or fewer.  In other cases there is an extreme shortage of SGA available (see this link on how to do a quick SGA check), or there is a large amount of empty space in the table caused by row deletion.  To tell whether the table is made up of far more blocks than necessary and should be rebuilt, use a formula like the following (the numbers required for this equation are in the gather info report output):
      • The number of table rows (NUM_ROWS)* average row length (AVG_ROW_LEN)
        • should be about equal (or within 10-15%)
      • of the number of table BLOCKS * db_block_size


  • Are all of the necessary indexes present to ensure optimal execution of the statement?
    • There are several different reasons to make an index:
      1. The most common reason is to reduce I/O by allowing Oracle to more quickly identify the table rows which need to be returned to the user making the request.
      2. The next most common reason is to keep Oracle from needing to access the table when the statement is executing, by having all of the columns in the SQL statement contained within the index.  This type of index can work very well, but can have a serious downside if not done correctly, which I'll discuss in a future blog post.
      3. The index is created to give Oracle better information about the columns in the WHERE clause.  This is the least common type of index to create, and requires an in-depth understanding of the current execution plan and why it's incorrect.


In this case, since there was only one table in the FROM clause, I suspected the index would likely fall into the first category of reducing disk I/O.  If there had been only a few columns in the SELECT part of the statement, adding the second type of index might have been worth considering.  If the right index had existed for this statement (which it didn't) but was not being used, the third type of index might have been considered.
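The rebuild check described earlier (NUM_ROWS * AVG_ROW_LEN versus BLOCKS * db_block_size) can be sketched as a simple ratio.  The statistics in the example call are illustrative values, not the ones from this report:

```python
def bloat_ratio(num_rows, avg_row_len, blocks, db_block_size=8192):
    """Return allocated-space / live-data ratio; well above ~1.15 suggests a rebuild."""
    live_bytes = num_rows * avg_row_len        # NUM_ROWS * AVG_ROW_LEN
    allocated_bytes = blocks * db_block_size   # BLOCKS * db_block_size
    return allocated_bytes / live_bytes

# Hypothetical example: 100k rows of ~200 bytes spread across 10,000 8K blocks.
ratio = bloat_ratio(num_rows=100_000, avg_row_len=200, blocks=10_000)
print(f"{ratio:.2f}")   # ~4x more space allocated than the rows need -> rebuild candidate
```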


In this case, as in most cases when I initially tune a SQL statement, the steps I follow are:

  1. Bring the SQL statement into a new file, reformat it, and include the existing table index definitions along with the table's statistics.  The index definitions and table statistics are in the gather info report output.  An example of this output is below (reformatted and abbreviated to fit).
  2. Determine if the 'right' index or indexes exist for the statement.  Start with the input conditions to the WHERE clause, which in this case are the idA3A2A5 & markForDeleteA2 columns.
  3. Based on the existing indexes for the table, it was clear neither of these columns was part of any index.  The questions are then which column, if either, should be indexed; if both should be in the same index; and in which order the columns should be included in the index definition.
  4. Using the table column statistics, I could see:
    • There was one distinct value for the markForDeleteA2 column; an index on this column would not decrease the number of table blocks Oracle needed to access, and therefore this column should not be included in a type #1 index.
    • There are 13770 distinct values for the idA3A2A5 column.  Comparing this against the number of distinct values for the ida2a2 column (which is the Primary Key for all Windchill tables; the number of distinct values in ida2a2 should/will be the same as the number of table rows) indicates that for one input value there will be 392344/13770 ~= 29 rows returned on average.  Oracle should prefer to access the index and then at most 29 table blocks (compared to the 26k total table blocks) instead of performing the full tablescan which is what is happening now.
      • Note: whether Oracle uses an index or not depends on a large number of things and there is no guarantee any index will be used.
  5. Creating an index on idA3A2A5 is a reasonable next step based on the analysis.
    • create index EPMDocConfigSpec$idA3A2A5 on EPMDocConfigSpec(idA3A2A5);
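The selectivity arithmetic from step 4 can be reproduced directly (using the column statistics quoted above):

```python
import math

# Column statistics from the gather info report discussed above.
table_rows = 392_344      # distinct ida2a2 values == number of table rows
distinct_keys = 13_770    # distinct idA3A2A5 values

# Average rows matched by one bind value: far fewer than the ~26k table
# blocks a full tablescan must read, so an index access should win.
rows_per_key = table_rows / distinct_keys
print(math.ceil(rows_per_key))
```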


SQL Statement to tune:

SELECT 'wt.epm.EPMDocConfigSpec',A0.administrativeLockIsNull,A0.typeadministrativeLock,A0.asStoredActive,A0.asStoredConfig,
       A0.idA3domainRef,A0.effectivityIsNull,A0.effectiveConfigItemRefIsNull,A0.classnamekeyA2D5,A0.idA3A2D5,TO_CHAR(A0.effectiveDateD5,'dd mm yyyy hh24:mi:ss'),
       A0.idA3A2A2E5,A0.classnamekeyB2E5,A0.idA3B2E5,A0.promotionNoticeActive,A0.sandboxActive,A0.sandboxConfig,A0.securityLabels,TO_CHAR(A0.createStampA2,'dd mm yyyy hh24:mi:ss'),
       A0.markForDeleteA2,TO_CHAR(A0.modifyStampA2,'dd mm yyyy hh24:mi:ss'),A0.idA2A2,A0.updateCountA2,TO_CHAR(A0.updateStampA2,'dd mm yyyy hh24:mi:ss'),A0.workingIncluded
  FROM EPMDocConfigSpec A0
 WHERE ((A0.idA3A2A5 = :1 ))
   AND (A0.markForDeleteA2 = 0)


Indexes Available:


TABLE_NAME        INDEX_NAME           COLUMN_NAME  COLUMN_POSITION  DISTINCT_KEYS
----------------  -------------------  -----------  ---------------  -------------
EPMDOCCONFIGSPEC  PK_EPMDOCCONFIGSPEC  IDA2A2                     1         392344



Table Statistics:

