SQL Server 2008 R2 : A Performance Monitoring Approach (part 3) - Monitoring Memory, Monitoring the Disk System

7/24/2013 9:42:21 AM

3. Monitoring Memory

Memory, like a processor, is divided into segments for each process running on the server. If there is a demand for more memory than is available, the operating system has to use virtual memory to supplement the physical memory. Virtual memory is storage allocated on the hard disk; it is named PAGEFILE.SYS under Windows.

It is important to remember that when the operating system or SQL Server isn’t able to use memory to find something and has to use virtual memory stored on the disk, performance degrades. You need to work on minimizing this situation, known as swapping or page faulting.

Monitoring Memory with Performance Monitor

Table 3 reflects the main performance objects and counters that are best utilized to monitor memory for SQL Server.

Table 3. Memory-Related Performance Objects and Counters
Performance Monitor Object	Description
Process: Working Set\|sqlservr	The set of memory pages touched recently by the threads in the process (SQL Server, in this case).
SQLSERVER:Buffer Manager: Buffer cache hit ratio	The percentage of pages that were found in the buffer pool without having to incur a read from disk.
SQLSERVER:Buffer Manager: Total Pages	The total number of pages in the buffer pool, including database pages, free pages, and stolen pages.
SQLSERVER:Memory Manager: Total Server Memory(KB)	The total amount of dynamic memory the server is currently consuming.
SQLSERVER:Memory Manager: SQL Cache Memory(KB)	The total amount of dynamic memory the SQL Server cache is currently consuming.
SQLSERVER:Plan Cache: Cache hit ratio	The ratio between lookups and cache hits for query plans in the plan cache.
SQLSERVER:Buffer Manager: Total Pages	The total number of pages in the plan cache.
Memory: Pages/sec	The number of pages read from or written to disk to resolve hard page faults. This value usually gives a direct indication of memory issues.
Memory: Pages Read/sec	The number of times the disk was read to resolve hard page faults.
Memory: Page Faults/sec	The overall rate at which faulted pages are handled by the processor.
Process: Page Faults/sec\|sqlservr	The rate of page faults occurring in the threads associated with a process (SQL Server, in this case).

To observe the level of the page faulting, you can look at the Memory: Page Faults/sec and Process: Page Faults (for a SQL Server instance) counters. Next in line are the MSSQL Buffer Manager: Buffer Cache hit ratio and MSSQL Buffer Manager: Total Pages counters. They directly indicate how well SQL Server is finding data in its controlled memory (cache). You want to achieve a near 90% or higher ratio here.

If the Memory: Pages/sec counter is greater than 0 or the Memory: Page Reads/sec counter is greater than 5, the operating system is being forced to use the disk to resolve memory references. These are called hard faults. The Memory: Pages/sec counter is one of the best indicators of the amount of paging that Windows is doing and the adequacy of SQL Server’s current memory configuration.

Because the memory used by SQL Server 2008 dynamically grows and shrinks, you might want to track the exact usage by using either Process: Working Set: SQLServer or MSSQL: Memory Manager: Total Server Memory (KB) counters. These counters indicate the current size of the memory used by the SQL Server process. If these numbers are consistently high, compared to the amount of physical memory in the machine, you are probably ready to install more memory on the server. If you see a performance degradation because SQL Server must continually grow and shrink its memory, you should either remove some of the other services or processes running or configure SQL Server to use a fixed memory allocation.

Monitoring Memory in SSMS

If you are collecting performance data into the MDW using the Data Collector for a SQL Server instance, you can view more detailed memory usage and memory wait information in the Server Activity History report . You can click on the Memory Usage graph to drill down and examine the details of the total memory usage in SQL Server and the cache and page ratios. It also displays a list of all processes in the server and the total working set size. You can use this information to see if other running processes may be using memory that should be left available for SQL Server.

DMVs or System Views for Monitoring Memory

The DMVs associated with memory are numerous. The ones you’ll most likely utilize are memory clerks, memory pools, and cache counters.

You can find memory allocations by type by using the sys.dm_os_memory_clerks DMV view:

SELECT type, sum(multi_pages_kb) from sys.dm_os_memory_clerks
WHERE multi_pages_kb <> 0
GROUP BY type
order by 2 desc

To see how the cache is being used, you can query sys.dm_os_memory_cache_counters:

SELECT substring(name,1,25) AS Name, single_pages_kb,
       single_pages_in_use_kb
FROM sys.dm_os_memory_cache_counters
order by single_pages_kb desc

Finally, when you want to see the total pages allocated to the different objects in memory, you use the sys.dm_os_memory_objects DMV:

SELECT substring(type,1,25) as Type,
       sum(pages_allocated_count) as Total_Memory_Allocated
FROM sys.dm_os_memory_objects
group by type
order by 2 desc

Several new DMVs were introduced in SQL Server 2008 to simplify retrieval of information that can be helpful with memory troubleshooting. In some cases, the newly introduced DMVs provide information that was previously available only in DBCC MEMORYSTATUS output. Table 4 provides a summary of the new DMVs available for memory troubleshooting.

Table 4. New Memory-Related DMVs in SQL Server 2008
DMV	Description
sys.dm_os_memory_brokers	Provides information about memory allocations using the internal SQL Server memory manager. The information provided can be useful in determining very large memory consumers.
sys.dm_os_memory_nodes and sys.dm_os_memory_node_access_stats	Provide information about physical nonuniform memory access (NUMA) memory nodes and node access statistics grouped by the type of the page. (sys.dm_os_memory_node_access_stats is populated only if dynamic trace flag 842 is enabled due to its performance impact.)
sys.dm_os_nodes	Provides information about CPU node configuration for SQL Server. This DMV also reflects software NUMA (soft-NUMA) configuration.
sys.dm_os_process_memory	Provides overview information about SQL Server memory usage, including the total physical memory in use and the memory utilization percentage
sys.dm_os_sys_memory	Provides overview information about the system memory usage including total physical memory on the system and the available physical memory.
sys.dm_resource_governor_configuration, sys.dm_resource_governor_resource_pools, and sys.dm_resource_governor_workload_groups	Provide information about the state of the Resource Governor feature of SQL Server 2008. Some of the configuration parameters of Resource Governor affect how SQL Server allocates memory; you should check these parameters during memory troubleshooting.

4. Monitoring the Disk System

SQL Server performance depends heavily on the I/O subsystem. SQL Server is constantly reading pages from and writing pages to disk via the data cache. It is also constantly writing to disk via the transaction log. Focusing on the database data files, transaction log files, and especially tempdb can yield great performance for your SQL Server platform. Table 5 lists the essential performance objects and counters related to monitoring the disk system.

Table 5. Disk Usage-Related Performance Objects and Counters
Performance Monitor Object	Description
Physical Disk: Current Disk Queue Length	The number of outstanding requests (read/write) for a disk.
Physical Disk: Avg. Disk Queue Length	The average number of both read and write requests queued for disks.
Physical Disk: Disk Read Bytes	The rate at which bytes are transferred from the disk during read operations.
Physical Disk: Disk Write Bytes	The rate at which bytes are transferred to the disk during write operations.
Physical Disk: % Disk Time	The percentage of elapsed time that the selected disk drive is busy servicing read or write requests.
Logical Disk: Current Disk Queue Length	The number of outstanding requests (read/write) for a disk.
Logical Disk: Avg. Disk Queue Length	The average number of both read and write requests queued for disks.
Logical Disk: Disk Read Bytes	The rate at which bytes are transferred from the disk during read operations.
Logical Disk: Disk Write Bytes	The rate at which bytes are transferred to the disk during write operations.
Logical Disk: % Disk Time	The percentage of elapsed time that the selected disk drive is busy servicing read or write requests.

Slow disk I/O causes a reduction in the transaction throughput. To identify which disks are receiving all the attention, you should monitor both the Physical Disk and Logical Disk performance objects. You have many more opportunities to tune at the disk level than with other components, such as processors. This has long been the area where database administrators and system administrators have been able to get better performance. You can start by looking at the behavior of the Physical Disk: Current Disk Queue Length and Physical Disk: Avg. Disk Queue Length counters for all disks or for each particular disk. This way, you can identify where most of the activity is, from a disk-usage point of view.

As you monitor each individual disk, you might see that some drives are not as busy as others. You can relocate heavily used resources to minimize these long queue lengths that you have uncovered and spread out the disk activity. Common techniques for this are to relocate indexes away from tables, isolate read-only tables away from volatile tables, and so on. You need to take special care with tempdb. The best practice is to isolate it away from all other disk I/O processing.

The Physical Disk: % Disk Time counter for each physical disk drive shows the percentage of time that the disk is active; a continuously high value could indicate an underperforming disk subsystem.

Of course, the monitoring up to this point shows only half the picture if drives are partitioned into multiple logical drives. To see the work on each logical drive, you need to examine the logical disk counters; in fact, you can monitor read and write activity separately with Logical Disk: Disk Write Bytes/sec and Logical Disk: Disk Read Bytes/sec. You should be looking for average times below 20ms. If the averages are over 50ms, the disk subsystem is in serious need of replacement, reconfiguration, or redistribution.

If you use RAID, you need to know how many physical drives are in each RAID array to figure out the monitored values of disk queuing for any one disk. In general, you just divide the disk queue value by the number of physical drives in the disk array. This calculation gives you a fairly accurate number for each physical disk’s queue length.

5. Monitoring SQL Server’s Disk Activity

In the preceding section, we looked at monitoring overall disk activity. In this section, we examine what SQL Server’s contribution is to all this disk activity. Disk activity can be categorized into reads and writes. SQL Server carries out writes to the disk for the following processes:

Logging records
Writing dirty cache pages at the end of a transaction
Freeing space in the page cache

Logging is a constant occurrence in any database that allows modifications, and SQL Server attempts to optimize this process by batching a number of writes together. To see how much work is done on behalf of the database logs, you can examine the SQLServer:Databases:Log Bytes Flushed and SQLServer:Databases:Log Flushes/sec counters. The first tells you the quantity of the work, and the second tells you the frequency.

The third kind of write occurs to make space within the page cache. This is carried out by the Lazy Writer process, which you can track with the counter SQLServer:Buffer Manager:Lazy Writes.

You also can easily monitor the amount of reading SQL Server is doing by using the counter SQLServer:Buffer Manager:Page Reads.

Monitoring Disk Performance in SSMS

If you are collecting performance data into the MDW using the Data Collector for a SQL Server instance, you can view more detailed disk performance in the Server Activity History report. You can click on the Disk Usage graph to drill down and examine the details on Disk Response Time, Average Disk Queue Length, and Disk Transfer Rate. In addition, the System Disk Usage detail report also lists the top 20 processes with the highest I/O writes or highest I/O reads per second. This information can help you determine which processes besides SQL Server are performing a large number of disk reads and/or writes that could be affecting the I/O performance of SQL Server.

DMVs or System Views for Monitoring Disk System Items

There are several I/O-related DMVs and functions. They cover backup tape I/O, pending I/O requests, I/O on cluster shared drives, and virtual file I/O statistics.

The best of these is the sys.dm_io_virtual_file_stats function, which allows you to see the file activity within a database allocation. You supply the database ID as the first parameter, along with the file ID of the database file as the second parameter. This yields an accumulating set of statistics that can be used to isolate and characterize heavy I/O:

SELECT cast(db_name(database_id) as varchar(12)) as dbname,
       file_id,
       num_of_reads as numreads,
       num_of_bytes_read as bytesread,
       num_of_bytes_written as byteswritten,
       size_on_disk_bytes as size
FROM sys.dm_io_virtual_file_stats (5,1)
GO
dbname       file_id numreads bytesread byteswritten size
------------ ------- -------- --------- ------------ ----------
UnleashedMDW 1       7022     411140096 759291904    1048576000

In addition, the OS wait stats for I/O latch waits are great for identifying when reading or writing of a page is not available from the data cache. These latch waits account for the physical I/O waits when a page is accessed for reading or writing. When the page is not found in the cache, an asynchronous I/O gets posted. If there is any delay in the I/O, the PAGEIOLATCH_EX or PAGEIOLATCH_SH latch waits are affected. An increased number of latch waits indicates that an I/O bottleneck exists. The following query reveals this latch wait information:

SELECT substring(wait_type,1,15) AS Latch_Waits, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type like 'PAGEIOLATCH%'
ORDER BY wait_type
GO
Latch_Waits     waiting_tasks_count  wait_time_ms
------------- -------------------- --------------------
PAGEIOLATCH_DT  0                    0
PAGEIOLATCH_EX  2871                 61356
PAGEIOLATCH_KP  0                    0
PAGEIOLATCH_NL  0                    0
PAGEIOLATCH_SH  7305                 98120
PAGEIOLATCH_UP  1372                 7318

6. Monitoring Other SQL Server Performance Items

In addition to the main system items that affect performance which have been covered already, there are other items in SQL Server that affect system performance that you should include in your performance monitoring approach.

Monitoring Locks

One of the often-overlooked areas of performance degradation is locking. You need to ensure that the correct types of locks are issued and that the worst kind of lock, a blocking lock, is kept to a minimum. A blocking lock, as its name implies, prevents other users from continuing their own work. An easy way to identify the level of blocking locks is to use the counter SQLServer:Memory Manager:Lock Blocks. If this counter frequently indicates a value greater than 0, you need to examine the queries being executed or even revisit the database design.

Monitoring Users

Even though you cannot always trace performance problems directly to the number of users connected, it is a good idea to occasionally monitor how this number fluctuates. It is fairly easy to trace one particular user who is causing a massive performance problem.

The leverage point here is to see the current number of user connections with the SQLServer:General Statistics:User Connections counter in conjunction with other objects and counters. It is easy to say that the disk subsystem is a bottleneck, but how many users is SQL Server supporting at the time?