Verification using unit, functional, and regression
tests is extremely important for thoroughly testing an application, but
it is performance testing that really gets the attention of most
developers. Performance testing is imperative for ensuring a positive
user experience. Users don't want to wait any longer than absolutely
necessary for data.
Developing
high-performance database applications is getting more difficult. As
applications grow larger, and especially as disk space continues to get
cheaper, the amount of data that must be processed by the typical
application has grown to be enormous. Whereas 5 years ago applications
with hundreds of gigabytes of data seemed large, today that figure is in
the terabytes—and that number will only continue to grow. Jim Gray of
Microsoft Research estimates that the cost for a petabyte (that's 1000
terabytes) of storage will be down to $1000 within 15 to 20 years.
The key to dealing with
large amounts of data in a timely manner is to reduce bottlenecks.
Unfortunately, problem areas can often be subtle, manifesting themselves
only under load—almost always, it seems, when a key user is trying to
get some extremely important piece of data. The only way to ensure that
this doesn't happen is to exercise the application to the highest level
of expected load and determine which areas need improvement.
Capturing Baseline Metrics
Just as with unit
and functional testing, having an overall process in place is extremely
important when it comes to performance evaluation. Performance tests
should be repeatable and should be done in an environment that can be
rolled back to duplicate the same conditions for multiple test runs.
Keep in mind that any
component in the application may be contributing in some way to
performance, so starting from the same point each time is imperative. I
recommend using a test database that can be restored to its original
state each time, as well as rebooting all servers involved in the test
just before beginning a run, in order to make sure that the test starts
from the same initial conditions. Another option that might be
easier than backing up and restoring a test database is using SQL
Server 2005's database snapshot feature. Try each technique in your
environment to determine which fits better into your testing system.
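As an illustration, here is a minimal sketch of a snapshot-based reset; the database name TestDB, the logical file name TestDB_Data, and the snapshot path are hypothetical and would need to match your environment:
-- Create a snapshot of the test database before the run begins
CREATE DATABASE TestDB_Baseline
ON ( NAME = TestDB_Data, FILENAME = 'C:\Snapshots\TestDB_Baseline.ss' )
AS SNAPSHOT OF TestDB;

-- After the run, revert the database to its pre-test state
RESTORE DATABASE TestDB
FROM DATABASE_SNAPSHOT = 'TestDB_Baseline';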
In addition to making
sure the servers are in the same state, you should also collect exactly
the same performance counters, query trace data, and other metrics in
precisely the same way for each test run. Consistency is the key to not
only validating that changes are effective, but also measuring how
effective they are.
During a testing process, the first test that is run should be used as a baseline.
The metrics captured during the baseline test will be used to compare
results for later runs. As problems are solved, or if test conditions
change (for instance, if you need to collect more performance counters
in a certain area), you should establish a new baseline from which to go
forward. Keep in mind that fixing issues in one area of an application
might have an impact on performance of another area. For instance, a
given query may be I/O-bound, whereas another is CPU-bound. By fixing
the I/O problems for the first query, you may introduce greater CPU
utilization, which in turn will cause the other query to degrade in
performance if they are run simultaneously.
Baselining
metrics in a database environment is generally a fairly straightforward
process. Server-side traces should be used to capture performance data,
including query duration and resources used. This data can then be
aggregated in order to determine minimum, maximum, and average
statistics for queries. In order to determine which resources are
starved, performance counters can be used to track server utilization.
As changes are made to fix performance issues, the baseline data can be
analyzed against other test data in order to establish performance
trends.
Profiling Using Traces and SQL Server Profiler
SQL Server
2005's tracing capabilities allow DBAs and developers to store or view
real-time information about activity taking place on the server. With
almost 200 events available for monitoring, there is a wealth of
information about the state of the server available through traces.
However, for most performance monitoring work, there are only a few key
events that you'll need to worry about.
When initially baselining an application, I'll generally start by looking at only the SQL:BatchCompleted and RPC:Completed events. Each of these events fires on completion of queries; the only difference between them is that RPC:Completed fires on completion of a remote procedure call (RPC), whereas SQL:BatchCompleted fires on completion of a SQL batch. Different access methods, same end result.
The most valuable columns available for both of these events are Duration, CPU, Reads, and Writes.
The Duration
column reports the total time elapsed for the call, in microseconds
(note, the Profiler tool shows this column in milliseconds by default).
The duration of a query is a direct reflection of the user experience,
so this is generally the one to start with. If the application is
performing slowly, find the worst offenders using this column.
The CPU
column reports the amount of CPU time, in milliseconds, spent parsing,
compiling, and executing the query. Because this column includes
compilation time, it is common to see the reported time drop on
consecutive executions of the same query thanks to plan caching.
The Reads column reports the number of logical
reads performed by the query. A logical I/O occurs any time SQL
Server's query engine requests data, whether from the physical disk or
from the buffer cache. If you see high numbers in this column, it may
not necessarily indicate a performance problem, because the majority of
data may be read from cache. However, even reading data from memory does
cost the server in terms of CPU time, so it is a good idea to try to
keep any kind of reads to a minimum.
The Writes column reports the number of physical
writes performed by the query. This means that only writes that were
actually persisted to disk during the course of the query will be
reported.
By using these basic
columns, you can isolate potential candidates for further investigation.
First, think about limits that need to be set for any given query in
the system. What is the maximum amount of time that a query can be
allowed to run? What should the average amount of run time be? By
aggregating the Duration column, you can determine whether these times have been exceeded.
Once you've isolated possible problem areas (see the "Granular Analysis" section), you can delve deeper with more specific sets of events. For instance, the Scan:Started
event can be used to identify possible queries that are making
inefficient use of indexes and therefore may be causing I/O problems.
The SP:Recompile event, on the
other hand, indicates queries that are getting recompiled by the query
optimizer, and may therefore be consuming larger than necessary amounts
of CPU time.
Server-Side Traces
It is important to
choose your tool wisely when tracing SQL Server during performance
tests. The Profiler tool that ships with SQL Server 2005 is extremely
useful, and the ease with which it can be used to set up and monitor
traces cannot be beat. However, in order to facilitate real-time data
collection and display, SQL Server needs to continually stream the data
back to the tool—and there is definitely overhead associated with doing
so.
In an extremely
high-transaction performance test, you should strive to minimize the
impact of monitoring on results of the test by using server-side traces
instead of the Profiler tool. A server-side trace runs in the
background on the SQL Server, saving its results to a local file instead
of streaming them to Profiler.
To create a server-side
trace, first use Profiler to define the events, columns, and filters
needed, and then click the Run button. Immediately stop the trace, click
the File menu, expand the Export option, and then expand the Script
Trace Definition option. Choose For SQL Server 2005 and select a
filename to save the script.
Once the script has been
saved, open it for editing in SQL Server Management Studio. The
following line of the script must be edited, and a valid path must be
specified, including a filename:
exec @rc = sp_trace_create @TraceID output, 0, N'InsertFileNameHere',
@maxfilesize, NULL
The specified filename should not
include an extension of any kind. One will automatically be added by
the trace. You might also wish to modify the default maximum file size
set at the top of the script, which is 5MB. Increasing it will help to
minimize the number of rollover files created during the trace. I
generally set this to 200MB as a starting point.
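The maximum file size is set near the top of the script that Profiler generates; the relevant lines look something like the following, with the value changed from the default of 5 to 200:
declare @maxfilesize bigint
set @maxfilesize = 200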
Once you have finished
editing, the script can be run, and the trace will begin collecting data
in the background. The generated script will also select back a trace
identifier, which you should make note of so that you can easily control the trace later.
When you are done
tracing, use the trace identifier to stop and close the trace, using the
following T-SQL (in this case the trace identifier is listed as 99):
EXEC sp_trace_setstatus @traceid=99, @status=0
EXEC sp_trace_setstatus @traceid=99, @status=2
Once the trace is stopped and closed, the fn_trace_gettable
function can be used to read the data from the trace file. This
function takes two arguments: a full path to the trace file
name—including the .trc extension
automatically added by SQL Server—and the maximum number of rollover
files to read. The following T-SQL would be used to read the trace file
from the path C:\Traces\myTrace.trc.
The number of rollover files is set high enough that all of the data
will be read back, even if the trace rolled over to new files several
times:
SELECT *
FROM ::fn_trace_gettable('C:\Traces\myTrace.trc', 999)
Once selected in this way,
the trace data can be used just like any other data in SQL Server. It
can be inserted into a table, queried, or aggregated in any number of
ways in order to evaluate which queries are potentially causing
problems.
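For example, the following is a minimal sketch of loading the trace into a work table and then finding the worst offenders by average duration; the table name TraceData is arbitrary, and grouping on the first 256 characters of the query text is only a rough way of bucketing similar queries:
-- Load the trace file (and any rollover files) into a work table
SELECT *
INTO TraceData
FROM ::fn_trace_gettable('C:\Traces\myTrace.trc', 999);

-- Aggregate by query text to find the worst offenders by average duration
-- (Duration is reported in microseconds, so divide by 1000 for milliseconds)
SELECT TOP (25)
    CONVERT(NVARCHAR(256), TextData) AS QueryText,
    COUNT(*) AS Executions,
    AVG(Duration) / 1000 AS AvgDurationMs,
    MAX(Duration) / 1000 AS MaxDurationMs,
    AVG(CPU) AS AvgCPU,
    AVG(Reads) AS AvgReads,
    AVG(Writes) AS AvgWrites
FROM TraceData
GROUP BY CONVERT(NVARCHAR(256), TextData)
ORDER BY AvgDurationMs DESC;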
Evaluating Performance Counters
For a bigger-picture view of
the overall performance of a server, performance counters are an
invaluable resource. Similar to SQL Server trace events, there are
hundreds of counters from which to choose—but only a handful generally
need to be monitored when doing an initial performance evaluation of a
SQL Server installation.
The following counters are a
good starting point for determining what kinds of performance issues to
look for (note, the MSSQL$ counter categories include the name of the
SQL Server instance, denoted here by <instance_name>):
Processor:% Processor Time
reports the total processor time with respect to the available capacity
of the server. If this counter is above 70% during peak load periods,
it may be worthwhile to begin investigating which routines are making
heavy use of CPU time.
PhysicalDisk:Avg. Disk Queue Length
indicates whether processes have to wait to use disk resources. As a
disk is fulfilling requests (i.e., reading and writing data), requests
that cannot be immediately filled are queued. Too many simultaneous
requests results in wait times, which can mean query performance
problems. It's a good idea to make sure that queue lengths stay below 1
(meaning, effectively, that there is no queue) whenever possible.
PhysicalDisk:Disk Read Bytes/sec and PhysicalDisk:Disk Write Bytes/sec
report the number of bytes read from and written to the disk,
respectively. These figures are not especially interesting on their own,
but coupled with Avg. Disk Queue Length can help to explain problems.
Slow SELECT queries coupled with
high physical reads and low queue lengths can indicate that the buffer
cache is not being effectively utilized. Slow DML queries coupled with
high physical writes and high queue lengths are a typical indication of
disk contention, and a good sign that you might want to evaluate how to
reduce index fragmentation in order to decrease insert and update times.
MSSQL$<instance_name>:Locks:Average Wait Time (ms)
reports the average amount of time that queries are waiting on locks.
Decreasing lock contention can be quite a challenge, but many shops have
discovered that it can be solved in many cases by using either dirty
reads (the READ UNCOMMITTED isolation level) or row versioning (the
SNAPSHOT isolation level).
MSSQL$<instance_name>:Buffer Manager:Page life expectancy
is the average amount of time, in seconds, that pages remain in buffer
cache memory after they are read off of the disk. This counter, coupled
with Disk Read Bytes/sec, can help to indicate where disk bottlenecks
are occurring—or, it might simply indicate that your server needs more
RAM! Either way, values below 300 (i.e., five minutes) may indicate that
you have a problem in this area.
MSSQL$<instance_name>:Plan Cache:Cache Hit Ratio and MSSQL$<instance_name>:Plan Cache:Cached Pages
are counters that deal with the query plan cache. The Cache Hit Ratio
counter is the ratio of cache hits to lookups—in other words, what
percentage of issued queries are already in the cache. During a
performance run, this number should generally start out low (assuming
you've rebooted the SQL Server before starting in order to put it into a
consistent state) and go up during the course of the run. Toward the
end, you should see this number fairly near to 100, indicating that
almost all queries are cached. The Cached Pages counter indicates how
many 8KB pages of memory are being used for the procedure cache. A low
Cache Hit Ratio combined with a high Cached Pages value means that you
need to consider fixing the dynamic SQL being used by the system.
These counters can be read using the System Monitor console (a.k.a. performance monitor or perfmon),
although many load testing tools have built-in counter collection and
reporting mechanisms. SQL Server Profiler also has the ability to import
performance counter logs in order to correlate them with traces. This
can be useful for helping to pinpoint the cause of especially large
spikes in areas such as CPU time and disk utilization.
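The SQL Server-specific counters (though not the operating system counters, such as Processor and PhysicalDisk) can also be spot-checked from within the server itself via the sys.dm_os_performance_counters view. A minimal sketch:
-- Read a few SQL Server-specific counters directly from the server
-- (ratio counters such as Cache Hit Ratio must be divided by their
-- corresponding "base" counters to be meaningful)
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Average Wait Time (ms)');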
Big-Picture Analysis
Once you have set up
performance counters and traces, you are ready to begin actual
performance testing. But this raises the question, "Where to begin?"
Especially in a large legacy application, running an end-to-end
performance test can be a daunting task.
A first step is to
determine what kinds of unit and functional tests exist, and evaluate
whether they can be used as starting points for performance tests. Some
load testing tools, such as Microsoft's Visual Studio 2005 Team System,
have the ability to directly load test prebuilt unit tests. However,
most commercial load tools are designed to exercise applications or web
code directly. Try to collect as many tests as possible to cover the
most-used parts of the application. Absolute coverage is nice, but is
unrealistic in many cases.
The next step is to implement
a load testing suite using the prebuilt unit and functional tests.
Depending on which load tool you are using, this can take some time. The
key is to make sure that the load tool passes random or semirandom
inputs into any function that goes back to the database, in order to
simulate real traffic and make sure that caching does not play too big a
part in skewing numbers. Nonrandom inputs can mask disk I/O issues
caused by buffer cache recycling.
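As a simple illustration, a test harness might select a random existing key value for each simulated call rather than reusing a constant; the dbo.Customers table and dbo.GetCustomerOrders procedure below are hypothetical names:
-- Pick a random existing customer for each simulated call so that repeated
-- iterations do not keep hitting the same cached pages
DECLARE @CustomerID INT;

SELECT TOP (1) @CustomerID = CustomerID
FROM dbo.Customers
ORDER BY NEWID();

EXEC dbo.GetCustomerOrders @CustomerID = @CustomerID;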
Set goals for the test to
determine what level of load you need to test at. If you are testing on a
system that mirrors the application's production environment, try to
test at a load equal to what the application encounters during peak
periods. If the test servers are less powerful than the production
systems, scale back appropriately. Note that it can be difficult to test
against servers that aren't scaled the same as production systems. For
instance, if the production database system has eight processors and is
attached to a dedicated SAN, and the test database system has four
processors and internal drives, there may be an I/O mismatch. In this
situation it might be advisable to modify SQL Server's processor
affinity on the test system so that less processor power is available,
bringing the ratio of available processor power to available disk I/O
more closely in line with the actual environment in which the code will run.
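For instance, something like the following could be used to restrict SQL Server to four of the processors on a test system; the affinity mask is a bitmask, so 15 (binary 1111) enables the first four CPUs. This is only a sketch, and affinity should normally be left at its default outside of this kind of scenario:
-- Restrict SQL Server to the first four processors on the test system
-- (affinity mask is an advanced option)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'affinity mask', 15;
RECONFIGURE;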
In addition to making sure
that the test and production systems are scaled similarly, make sure
that the SQL Server configurations in both systems are similar. For
example, ensure that the maximum degree of parallelism is set similarly
so that processors will be used the same way in queries on both the test
and production systems. Likewise, you should monitor the RAM options to
ensure that they are configured to equivalent percentages on both
systems—so if your production system has 16GB of RAM but SQL Server's
maximum server memory setting is 12GB, you'll want to set your test
system to use 75 percent of the available RAM as well.
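Continuing that example, if the test server has 8GB of physical RAM, setting max server memory to 6GB (6144MB) preserves the same 75 percent ratio; the same sp_configure approach applies to max degree of parallelism (the value of 4 below is only an example, and both settings assume advanced options are already enabled as shown above):
-- Match the production memory ratio: 75 percent of 8GB is 6GB
EXEC sp_configure 'max server memory (MB)', 6144;
RECONFIGURE;

-- Keep the degree of parallelism consistent with production
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;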
Once user goals are set,
load tests should generally be configured to step up load slowly, rather
than immediately hit the server with the peak number of users. Stepping
up load more closely matches most production environments, in which the
server may get rebooted or reset during a maintenance period, then
slowly accept user requests and warm up its caches before encountering a
larger number of requests during more active times of the day. Note
that step testing may not accurately model a situation such as a cluster
failover, in which a server may be subjected to a full load immediately
upon starting up.
The goal of a big-picture
test is to see how the system scales overall. Try to look at general
trends in the performance counters to determine whether the system can
handle load spikes, or generally sustained load over long periods of
time (again, depending on actual application usage patterns, if
possible). SQL Server traces should be run during these tests in order
to capture data that can be used later for more granular analysis of
specific components.
Granular Analysis
If the results of a
big-picture test show that certain areas need work, a more granular
investigation into specific routines will generally be necessary. Using
aggregated trace data collected from a full-system test, it's important
to evaluate both queries and groups of queries that are long running or resource intensive.
While it is often
tempting to look only at the worst offending queries—for instance, those
with the maximum duration—this may not tell the complete story. For
instance, you may notice that certain stored procedures are taking
longer than others to run during peak load. This may translate into
longer user interface wait times, but may not indicate the longest user
interface wait times due to stored procedures. This is due to the fact that
many applications call more than one stored procedure every time the
interface needs to be updated with additional data. In these cases it is
important to group procedures that are called together and aggregate
their total resource utilization.
If stored procedures are
called sequentially, duration should be totaled in order to determine
the total user wait time for that group, and maximum resource
utilization should be noted. If, on the other hand, they are called
simultaneously (for instance, on different connections), resource
utilization should be totaled in order to determine the group's overall
impact on the system, and the maximum duration for the group should be
noted.
For example, assume that
in a given system, whenever a user logs in, three different stored
procedures are called to get data for the first screen. Table 1 shows the average data collected for these stored procedures.
Table 1. Stored Procedures Called After Login, with Averaged Data
Stored Procedure | Duration (ms) | CPU | Reads | Writes
--- | --- | --- | --- | ---
LogSessionStart | 422 | 10 | 140 | 1
GetSessionData | 224 | 210 | 3384 | 0
GetUserInfo | 305 | 166 | 6408 | 0
If the system calls these
stored procedures sequentially, the total duration that should be
recorded for this group is 951 ms. Since each is called individually,
total system impact at any given time will only be as much as the
maximum values for each of the given columns. So we record 210 for CPU,
6408 for Reads, and 1 for Writes.
On the other hand, if these
stored procedures are called simultaneously, the impact will be much
different. Total duration will only be as much as the longest running of
the three—422 (assuming, of course, that the system has enough
resources available to handle all three requests at the same time).
However, CPU time during the run should be recorded as 386, Reads as
9932, and Writes as 1.
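This kind of grouping can also be expressed directly against the trace data. The following is a minimal sketch, assuming the trace has been loaded into a table called TraceData (as shown earlier), that the ObjectName column was captured in the trace, and that a hypothetical mapping table, FeatureGroups, associates each stored procedure with a feature:
-- Aggregate every captured call by feature, computing both the sequential
-- figures (total duration, maximum resource use) and the simultaneous
-- figures (maximum duration, total resource use); which set applies
-- depends on how the procedures are actually called
SELECT
    fg.FeatureName,
    SUM(td.Duration) / 1000 AS TotalDurationMs,   -- sequential calls
    MAX(td.Duration) / 1000 AS MaxDurationMs,     -- simultaneous calls
    MAX(td.CPU) AS MaxCPU,                        -- sequential calls
    SUM(td.CPU) AS TotalCPU,                      -- simultaneous calls
    MAX(td.Reads) AS MaxReads,
    SUM(td.Reads) AS TotalReads,
    MAX(td.Writes) AS MaxWrites,
    SUM(td.Writes) AS TotalWrites
FROM TraceData AS td
JOIN FeatureGroups AS fg ON fg.ProcedureName = td.ObjectName
GROUP BY fg.FeatureName;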
By grouping stored
procedures in this way, the total impact for a given feature can be
assessed. It may be the case that individually long-running stored
procedures are not the primary performance culprits, and are actually
being overshadowed by groups of seemingly less resource-intensive stored
procedures. This can also be an issue with cursors that are doing a
large number of very small fetch operations. Each individual fetch may
fall under the radar, but taken as a whole, it may become clear that the
cursor is using a lot of resources.
Another benefit of this
kind of grouping is that further aggregation is possible. For instance,
given these figures, it is possible to determine how much impact a
certain number of users logging in simultaneously would have on the
system. That information can be useful when trying to reach specific
scalability goals.
Fixing Problems: Is Focusing on the Obvious Issues Enough?
When evaluating the
performance of a system and trying to determine where to look to fix
problems, it can be tempting to focus on the obvious worst offenders
first. However, some care should be taken to make effective use of your
time; in many cases what appear to be the obvious problems are actually
side effects of other, more subtle issues.
Looking at duration alone
is often the easiest mistake to make when analyzing performance issues.
Duration tells a major part of the story, but a long duration does not
necessarily indicate a performance problem with the query itself. It may
indicate that the query had to wait for some other query—or queries—to
complete, or that the query was competing with other queries for
resources. When performance tuning, it is best to be suspicious of
long-running queries with very low reported resource utilization. These
are often not the real culprits at all.
By using the granular
analysis technique and aggregating, it is often possible to find the
real offenders more easily. For instance, in one fairly high-transaction
system a procedure was getting called from time to time that was
writing 10MB of data to the disk. This procedure reported a high
duration, which was interpreted as a possible performance problem.
Unfortunately, there wasn't much to tune in that stored procedure, but
further aggregate analysis revealed another stored procedure in the
system that was getting called over 1000 times a minute and writing as
much as 50KB to the disk each time it was called. Each call to the
second stored procedure reported a small enough duration that it did not
appear to be causing performance problems, yet as it turned out it was
causing issues in other areas. By tuning it and reducing the amount of
data it was writing on each call, the average duration of the first
stored procedure was reduced dramatically.