Measuring Latch Contention
A latch is like a lock on a piece of
memory. As more threads get involved, they will start to compete to
access the same pieces of memory, causing blocking. Blocking due to
latch contention is exhibited in waits; but unlike a lock, a latch can
be released as soon as the physical operation is completed.
The main sources of data about latches are two DMVs called sys.dm_os_wait_stats and sys.dm_os_latch_stats.
The details of the values stored in these DMVs will be examined later
in an explanation of latch modes and latch types, but for the purposes
of recognizing the symptoms, a brief explanation will suffice. The DMVs
are restarted when the service restarts, or when the DBCC SQLPERF command is called to clear them, as in the following code (code file Ch7Symptoms.sql):
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR);
Among the columns of the DMV sys.dm_os_wait_stats are ones called wait_type, wait_time_ms and waiting_tasks_count.
These three columns represent the wait type, the total number of
milliseconds that have been spent waiting on this wait type, and the
number of times this type of wait has occurred, respectively. Wait
types that associate with latches start with LATCH_, PAGELATCH_, or PAGEIOLATCH_.
Dividing the total wait time by the number of
waits will give an average wait time (I’d recommend wrapping the
waiting_tasks_count column in the NULLIF function to avoid a division by zero error). By querying this DMV repeatedly over time (as in the following example, code file Ch7Symptoms.sql, further illustrated in Figure 4),
a picture can be built up of the frequency of latch waits, along with
the amount of time the system must wait for these latches. This picture
allows a database administrator to understand what kind of behavior is
being exhibited — whether it matches Figure 1 or Figure 2.
SELECT
wait_type,
wait_time_ms,
waiting_tasks_count,
wait_time_ms / NULLIF(waiting_tasks_count,0) AS avg_wait_time
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'LATCH_%'
OR wait_type LIKE 'PAGELATCH_%'
OR wait_type LIKE 'PAGEIOLATCH_%';
The DMV called sys.dm_os_latch_stats lists similar statistics for latch classes.
Measuring Spinlock Contention
For the time being, you can think of a
spinlock as a latch, except that if the memory needing access is not
available, the spinlock will keep checking it (known as spinning) for a
while. There is slightly more to it, but that will be kept for later in
the chapter.
The main DMV for spinlocks is sys.dm_os_spinlock_stats. The metrics of concern for spinlocks are around collisions and spins_per_collision, which are both columns in this DMV, along with the name column for the type of spinlock. Collisions and spins will be described later in the chapter. Here is an example of using sys.dm_os_spinlock_stats (code file Ch7Symptoms.sql), further illustrated in Figure 5:
SELECT name, collisions, spins_per_collision
FROM sys.dm_os_spinlock_stats
ORDER BY spins_per_collision DESC;
The sys.dm_os_spinlock_stats DMV can also be cleared using DBCC SQLPERF.
Contention Indicators
Locks are used to maintain the ACID
properties of a transaction, but latches are used to provide
consistency over a memory operation. A spinlock should be considered as
similar, in that a resource is needed for a particular operation, but
is not available.
Therefore, when you see higher than normal wait
stats associated with latches and spinlocks, you may need to
investigate further.
However, only seeing higher than normal wait
stats does not in and of itself indicate a contention problem. It is
perfectly reasonable to expect that if your system is busier now than
when you took your benchmark, your wait stats would be higher. You
should look for the proportion of wait stats compared to others, and
compare this to the throughput being achieved. You might also want to
consider how many transactions per second you’re managing to achieve as
you increase the load, and the effect of extra processors.
One strong indicator of latch contention can be
seen when the proportion of latch-related wait stats increases
unreasonably as your throughput increases, as shown in Figure 2.
For spinlock contention, a strong
indicator is the number of spins per collision increasing, typically
combined with increased CPU. Spinning is an activity that requires CPU
effort, so if spinning increases disproportionally, CPU is likely to do
the same. CPU may go up simply because of a busier system, but if the
transactions per second counters are leveling off while CPU is
increasing, then this would suggest the CPU is being used for something
else — such as spinning. This is the behavior shown by the graph in Figure 3.