Hopefully the previous section has given
you sufficient guidance to architecting and deploying your first
virtualized database servers, even if only in non-production
environments. This section focuses on real-world monitoring of your
virtualized database servers, identifying specific changes you may need
to make to your monitoring processes and confirming which aspects of
your current monitoring can remain unchanged.
Traditionally we have monitored Windows servers
and servers running SQL Server with tools such as Performance Monitor.
These tools have counters that are designed to expose the true
utilization of a server’s hardware and the operating system’s demands on
it. For example, we can look at the workload of a server’s CPUs by
monitoring the % utilization values shown in Performance Monitor.
Likewise, we can see how much memory the server has both used and
available by looking at similar counters. These counters were perfect in
the physical server world because we knew if Windows booted up and saw 4
logical CPUs and 16GB of memory then all of that resource would be
available to the operating system and usually SQL Server as well.
This can cause issues; what does 100% of CPU
utilization or 8GB of available memory actually represent in the virtual
world? In environments where no restriction, contention, or
over-allocation of resources has been configured, some certainty can be
found from performance data. In larger, more complex environments,
contention ratios or memory allocation might be changing on a
minute-by-minute basis.
The example shown in Figure 1
demonstrates how in a Hyper-V environment, the same performance metric
monitored in two places can be so different because of an underlying
resource limitation in place. VMware provides its own Performance
Monitor counters through the VM Memory and VM Processor objects.
Information and Misinformation from Performance Monitor
Clearly, the same tools that used to
reliably provide us with monitoring information can now be sources of
misinformation as you’ll now see.
Performance Monitor is still the most efficient
way to monitor a virtual server’s performance. The only caveat is to
ensure that you monitor the right counters from the right source.
Some Performance Monitor counters collected from
within the virtual server are as valid and useful as they are on a
physical server. Logical storage performance data, for example, will
help you monitor the virtual server’s I/O workload and enables you to
measure what percentage of the host server’s HBA capacity SQL Server is
using, as well as ensure that SQL Server’s read and write latencies are
acceptable.
While the role of the hypervisor is to make the
virtual server believe it is running on dedicated hardware and totally
abstract it from the underlying physical hardware, some calls to
specific hardware-related APIs are passed by the hypervisor straight
through to the physical hardware. An example of this is retrieving
technical information about the CPU that an instance of Windows is
using. Figure 2 shows an example of information retrieved by Windows running on a Hyper-V virtual server.
These hardware query requests are passed straight
through to the hardware because it would be difficult for virtualization
vendors to know what “artificial” value to pass back today and in the
future in order to guarantee compatibility with any applications that
check the version of the CPU on which they’re running. This behavior
enables you to put the information that Windows or a tool like CPU-Z
returns into perspective, particularly as it’s able to find the clock
speed of the physical CPU even though the hypervisor might be limiting
your access to only a portion of the available clock speed.
SQL Server wait stats are another area to consider
when you are determining your sources of information or misinformation.
However, even in the physical world, wait stats identify only the
symptom of a system issue, not its cause. Therefore, in the virtual
world they are still excellent indicators of potential issues hindering
SQL Server’s performance, and wait stats are a good source of
information, rather than potentially misleading misinformation.
Agent job runtimes are another source of excellent
information within SQL Server you can use for performance monitoring.
By creating jobs that perform the same tasks with the same volumes of
data repeatedly, you can compare the time they took to run today with
the time they took to run yesterday.
If, for example, you have a job to back up a
database approximately 20GB in size, and for six weeks it took 20
minutes to run but in the last few days it started taking longer, you
may have identified a reduction in the host server’s I/O capabilities.
This information on its own may not be of significant operational value,
but if your SQL Server instance has also started reporting a much
greater occurrence of pageiolatch_xx wait stats, you may well want to start looking outside of your virtual server first.
Likewise, if you have a very CPU-intensive SQL
Server agent job, then looking for changes in that job’s runtime might
also help you detect signs of CPU contention at the host server level.
Detecting Hypervisor Memory Reclamation
Knowing how much memory your virtual
server has access to at any point in time is something you should be
able find the moment you suspect a previously healthy SQL Server
instance has issues. While different hypervisors have different ways to
assign memory to virtual servers — based on either demand, static
allocations, or host server load — they usually all use a balloon driver
to reclaim memory if they ever have to start taking memory back from a
virtual server.
Although Windows and SQL Server have hot-add
memory features that enable you to add memory to a running instance of
SQL Server, they don’t have an opposite feature whereby memory can be
taken away from Windows and SQL Server, yet this is sometimes a
requirement in the virtual world when memory is to be reclaimed by a
balloon driver.
Monitoring this reclamation happening can be
tricky because the amount of “physical” memory the virtual server’s
operating system thinks it has never decreases. However, when the
balloon driver “inflates,” the amount of available memory within Windows
begins to drop, and when it falls below an internal threshold SQL
Server begins releasing memory to prevent the operating system from
running out. Figure 3 shows a before and after representation of the balloon drive inflating.
One way to detect changes in the allocation of the
memory SQL Server is using is to look for falls in the Page Life
Expectacy counter value or increases in the Pages/sec value.
If you’re using Hyper-V, then another way is to
query how SQL Server’s memory utilization has potentially changed in
response to activity by the Dynamic Memory feature. The following query
returns not how much memory SQL Server is currently using but what
percentage of your virtual server’s memory it’s using. The reason for
monitoring a percentage rather than an absolute value is that the
percentage is proportional to the total amount of the virtual server’s
memory. If the hypervisor grants the virtual server more memory at any
time, you would expect the percentage of memory being used to remain
constant even though SQL Server is using more memory. Likewise, if the
balloon driver begins to inflate, you would expect the percentage of
total memory SQL Server is using to begin to drop. You could use
absolute values, but monitoring a percentage takes into account the
physical memory size. The way this query works is dependant on Hyper-V’s
Dynamic Memory model so it won’t work in VMware based environments.
select (m.total_physical_memory_kb /1024)
as 'Physical Memory (MB)',
convert(decimal(3,1),(convert(decimal(10,1),m.available_physical_memory_kb / 1024)
/ convert(decimal(10,1),m.total_physical_memory_kb / 1024)) * 100)
as 'Available Memory as % of Physical Memory'
from sys.dm_os_sys_memory m,sys.dm_os_sys_info i
select convert(decimal(10,1),(convert(decimal(10,1),i.committed_target_kb / 1024)
/convert(decimal(10,1),m.total_physical_memory_kb / 1024))
* 100) as 'Committed Target as % of Physical Memory',
convert(decimal(3,1),(convert(decimal(10,1),i.committed_kb / 1024)
/convert(decimal(10,1),m.total_physical_memory_kb / 1024))
* 100) as 'Total Committed as % of Physical Memory'
From sys.dm_os_sys_memory m, sys.dm_os_sys_info i
Examples of output from this script run on a Hyper-V virtual server are shown in Figure 4 and Figure 5.
The first instance was run just after SQL Server started, the second
after a heavy workload when the hypervisor allocated additional memory
to the virtual server using Dynamic Memory.
Recommended Performance Monitor Counters
I recommend monitoring the following
three general groups of Performance Monitor counters to track the health
of your virtual database server.
On the Host Server
Collecting performance information from
the host server your SQL Server instance is running on will help you
understand whether SQL Server itself should be performing as you expect
it to be. For example, how busy are the host server’s physical CPU cores
and how much of the HBA’s capacity is currently being used? It may also
be necessary to collect performance data from the host server about a
virtual server’s resource utilization. This depends on your hypervisor
but with Hyper-V, for example, a virtual server’s true CPU utilization
data is exposed at Virtual Processor counters on the host server.
Interestingly, VMware does the opposite, exposing performance data about
the host server to Performance Monitor in the virtual server.
Monitor the following performance counters on the host server:
- Percentage of physical CPU utilization
- Percentage of physical memory utilization
- Any CPU latency experienced by the hypervisor (VMware calls this CPU Ready Time)
- Total amount of memory currently allocated to the host’s virtual servers
- Physical storage MB/s and IOPS that the hypervisor is using
- Physical storage read and write latency times
- Percentage of HBA bandwidth utilization
- Percentage of NIC bandwidth utilization
On the Virtual Server
Performance data gathered from within a
virtual server is likely to be more meaningful if it refers to logical
objects or is measuring the proportion of the virtual server’s resource
currently being used. For example, the % CPU utilization values that
Task Manager and Performance Monitor display inside a virtual server
indicate how much of the processor resource available to the virtual
server is currently being used. Unfortunately, those values cannot be
related to a specific physical and quantifiable hardware sizing metric —
i.e., they don’t indicate whether 100% is a throttled 1GHz of CPU
resource or the entire native 3GHz of the physical CPU. These metrics
still have value, however. If you look at Task Manager on a virtual
server and see that your anti-virus software is using 60% of your CPU
resource, that’s an issue whether the server is virtual or physical.
The counters you should monitor here are as follows:
- % logical CPU utilization
- Total physical memory (MB)
- Available memory (MB)
- Logical storage MB/s and IOPS that the operating system is using
- Logical storage read and write latency times
On the SQL Server Instance
The monitoring you perform within SQL
Server when it’s running on a virtual server shouldn’t be very different
from your regular SQL Server monitoring. The only difference now is the
potential for new or more causes of performance issues to investigate.
The DMV sys.dm_os_sys_info
has been changed in SQL Server 2012 and offers some new columns that
provide information for virtual environments. First, it confirms that
SQL Server recognized that it started on a virtual server. This is shown
in the virtual_machine_type and virtual_machine_type_desc
columns; and although it may not seem to offer much benefit, it can be
useful if you ever forget which of your servers are virtual and which
are physical! The other interesting column in this DMV is committed_kb,
which now describes the total amount of memory SQL Server is using at
the time you query the DMV. This is important to know if you operate in a
demand-based memory allocation environment.
The traditional performance counters you probably
monitor already are included in the list of SQL Server–related
performance metrics that can be used to monitor a virtual server:
- SQL Server file I/O latencies from sys.dm_io_virtual_file_stats
- Committed and Target Committed Memory values from sys.dm_os_sys_info
- Batch requests/sec as a benchmark of server activity levels