Wednesday, August 31, 2005

Essential performance forecasting, part 1

Craig Shallahamer
16 Aug 2005

SearchOracle: Oracle tips, scripts, and expert advice

Craig's Corner
Insights on Oracle technologies and trends by Craig Shallahamer

At a basic level, forecasting Oracle performance is absolutely essential for every DBA to understand and perform. When performance begins to degrade, it's the DBA who hears about it, and it's the DBA who's supposed to fix it. It's the DBA who has the most intimate knowledge of the database server, so shouldn't they be able to forecast performance? When a bunch of new users are going be added to a system, it's the DBA who is quickly asked, "That's not going to be a problem, is it?" Therefore, DBAs need the ability to quickly forecast performance. Low precision forecasting can be done very quickly and it is a great way to get started forecasting Oracle performance.

The key metrics we want to forecast are utilization, queue length, and response time. With only these three metrics, as a DBA you can perform all sorts of low precision what-if scenarios. To derive the values, you essentially need 3 things:

- a few simple formulas
- some basic operating system statistics
- some basic Oracle statistics


Before you are inundated with the formulas, it's important to understand some definitions and recognize their symbols.

S : Time to service one workload unit. This is known as the service time or service demand. It is how long it takes the CPU to service a single transaction. For example, 1.5 seconds per transaction or 1.5 sec/trx. The best way to get the value for Oracle systems is to simply derive it.

U : Utilization or CPU busyness. It's commonly displayed as a percentage and that's how it works in this formula. For example, in the formula it should be something like 75% or 0.75, but not 75. A simple way to gather CPU utilization is simply running sar -u 60 1. This will give you the average CPU utilization over a 60 second period.

λ : Workload arrival rate. This is how many transactions enter the system per unit of time. For example, 150 transactions each second or 150 trx/sec. When working with Oracle, there are many possible statistics that can be used for the "transaction" arrival rate. Common statistics gathered from v$sysstat are logical reads, block changes, physical writes, user calls, logons, executes, user commits, and user rollbacks. You can also mix and match as your experience increases. For this paper, we will simply use user calls.

Q : Queue length. This is the number of transactions waiting to be serviced. This excludes the number of transactions currently being serviced. We will derive this value.

M : Number of CPUs. You can get this from the instance parameter cpu_count.

The CPU formulas for calculating averages are as follows:

U = ( S λ ) / M [Formula 1]

R = S / (1 - U^M) [Formula 2]

Q = ( MU / (1 - U^M) ) - M [Formula 3]


Before we dive into real-life examples, let's check these formulas out by doing some thought experiments.

Thought experiment 1. Using formula (1), if the utilization was 50% with 1 CPU, it should be 25% with 2 CPUs. And that's what the formula says. As you probably already figured out, scalability is not taken into consideration.

Thought experiment 2. Using formula (1), if we increased the arrival rate, CPU utilization would also increase.

Thought experiment 3. Using formula (1), if we used faster CPUs, the service time would decrease, then the utilization would also decrease.

Thought experiment 4. Using formula (2), if the utilization increased, the denominator would decrease, which would cause the response time to increase!

Thought experiment 5. This one's tricky, so take your time. Using formula (2), if the number of CPUs increased, the denominator would increase, which would cause the response time to decrease.

Thought experiment 6. Using formula (3), if the utilization increased, the denominator would decrease and the numerator would increase, which would cause the queue length to increase.

Now that you have a feel and some trust in the formulas, let's take a look at a real life example.

Example 1. Let's say for the last 60 seconds you gathered the average CPU utilization and the number of user calls from a two CPU Linux box. You found the average utilization was 65% and Oracle processed 750 user calls. The number of user calls each second is then 12.5 (i.e., 750/60 = 12.5).

Therefore,

S = 0.104 sec/call ; U = ( S λ ) / M ; 0.650 = ( S * 12.500 ) / 2

R = 0.180 sec/call ; R = S / (1 - U^M); R = 0.104 / ( 1 - 0.65^2 )

Q = 0.251 calls ; Q = ( MU / (1 - U^M) ) - M ; Q = ( 2*0.65/(1-0.65^2) ) - 2

The only number that is immediately useful to us is the queue length. There is, on average, less than one process waiting for CPU cycles. That's OK for performance and for our users. But there is some queuing occurring, so now would be a good idea to plan for the future!

The response time and service time calculations will become more useful when we recalculate them using a different configuration or workload scenario. For example, let's say your workload is expected to increase 15% each quarter. How many quarters do we have until response time significantly increases? For Example 2, we will see this demonstrated.

Example 2. Let's suppose performance is currently acceptable, but the DBA has no idea how long the situation is going to last. Assuming the worst case, workload will increase each quarter by 15%. Using the system configuration described in Example 1 and using our three basic formulas, here's the situation quarter by quarter.

Right away we can see that utilization is over 100% by the fourth quarter (i.e., 114%). This results in an unstable system because the queue length will always increase. The response time and queue length calculations also both go negative, indicating an unstable system.

The answer to the question, "When will the system run out of gas?" is something like, "Sometime between the first and second quarter." The answer is not the third quarter and probably not even the second quarter! While the system is technically stable in the second and third quarters, the response time has massively increased and by the third quarter there are typically 85 processes waiting for CPU power! Let's dig a little deeper.

Performance degradation occurs way before utilization reaches 100%. Our simple example shows that at 75% utilization, response time has increased by 33% and there is usually over one process waiting for CPU power. So while the system will technically function into the 3rd quarter, practically speaking it will not come close to meeting users expectations.

Based upon the above table, users are probably OK with the current performance. If you have done a good job setting expectations, they may be OK with performance into the first quarter. But once you get into the second quarter, with the utilization at 86%, the response time more than doubling, and over four processes waiting for CPU power, not matter what you do, your uses will be very, very unhappy.

So what are the options? There are many options at this point, but we'll save that for another article…sorry.

The forecast precision using the method described above is very low. This is because of a few reasons, some of which are; only one data sample was gathered, the forecasts were not validated, the workload was not carefully characterized, and our model only considered the CPU subsystem. When a more precise forecast is required, a product like HoriZone (horizone.orapub.com) is required. But many times a quick and low precision forecast is all that is necessary. When this is the case, you can get a general idea of the sizing situation using the formulas outlined above.

As you can see, with only a few basic formulas and some performance data, an amazing amount of useful forecasting can occur. Performance forecasting is an fascinating area that can expand a DBAs area of expertise, help answer those nagging questions we all get asked at 4:30pm on Fridays, and help anticipate poor performance.


About the Author
Craig Shallahamer has 18-plus years experience in IT. As the president of OraPub, Inc., his objective is to empower Oracle performance managers by "doing" and teaching others to "do" whole system performance optimization (reactive and proactive) for Oracle-based systems.

In addition to course development and delivery, Craig is a consultant who was previously involved with developing a landmark performance management product, technically reviews Oracle books and articles, and keynotes at various Oracle conferences.

To view more of Craig's work, visit www.orapub.com.

No comments: