Now that you are familiar with some of
the fundamental concepts of virtualization, this section looks at some
of the more advanced features and capabilities the technology offers.
This is where the unique magic of the technology begins to appear, as
some of these capabilities simply were never available to traditional
physical servers. While a hypervisor’s primary function is to “run” a
virtual server and grant it the resources it requires as it needs them,
the current versions of VMware’s and Microsoft’s server virtualization
products also provide many of the features discussed in the following
sections.
Snapshotting
Snapshotting a virtual server is very
similar in principle to how SQL Server’s own database snapshot feature
works. The hypervisor suspends the virtual machine, or perhaps requires
it to be shut down, and places a point-in-time marker within the virtual
machine’s data files. From that point on, as changes are made within the
virtual machine’s virtual hard drive files, the original data is
written to a separate physical snapshot file by the hypervisor. This can
impose a slight overhead on the I/O performance of the
virtual server and, more important, require potentially large amounts of
disk space, because multiple snapshots can be taken of a virtual server,
each having its own snapshot file capturing the “before” version of the
data blocks. The benefit is that a copy of all the pre-change data is
preserved on disk.
Having these snapshot files available enables the
hypervisor, upon request, to roll back all the changes in the
virtual server’s actual data files. Once completed, the virtual server
will be exactly in the state it was at the point in time the snapshot
was taken.
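To make the copy-on-write mechanism concrete, here is a minimal Python sketch of the idea. It is purely illustrative: the block and snapshot structures are invented for this example and bear no relation to any hypervisor’s actual on-disk format.

```python
# Illustrative model of hypervisor snapshot copy-on-write; the block and
# snapshot structures here are hypothetical, not any vendor's real format.

class VirtualDisk:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # the "live" virtual hard drive file
        self.snapshots = []          # one "before image" file per snapshot

    def take_snapshot(self):
        self.snapshots.append({})    # the point-in-time marker

    def write_block(self, index, data):
        # Copy-on-write: before overwriting, preserve the current data in
        # every snapshot that hasn't yet captured this block.
        for snapshot in self.snapshots:
            if index not in snapshot:
                snapshot[index] = self.blocks[index]
        self.blocks[index] = data    # then apply the change

    def revert_to_latest_snapshot(self):
        # Roll back by restoring every preserved "before" block.
        snapshot = self.snapshots.pop()
        for index, original in snapshot.items():
            self.blocks[index] = original

disk = VirtualDisk(["a", "b", "c"])
disk.take_snapshot()
disk.write_block(0, "A")               # "a" is saved to the snapshot first
disk.revert_to_latest_snapshot()
assert disk.blocks == ["a", "b", "c"]  # exactly the state at snapshot time
```

Note how only changed blocks consume space in each snapshot file, which is why the disk space cost grows with both the number of snapshots and the amount of change after each one.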
While this sounds like a great feature that can
offer a level of rollback functionality, it is unsupported by Microsoft
for use with virtual servers running SQL Server. Microsoft gives more
information about this in Knowledge Base article 956893; however,
until Microsoft supports its use, snapshotting should not be used with
virtual servers running SQL Server.
High-Availability Features
You read earlier that encapsulation
means that a virtual server is ultimately just a collection of files
stored on a file system somewhere. These files can normally be broken
down into the virtual hard drive data files, as well as a number of
small metadata files that give the hypervisor information it needs to
“run” the virtual server, such as the CPU, memory, and virtual hard
drive configuration. Keeping these files in a centralized storage
location — a SAN, for example — enables several different host servers
to access the virtual server files. The trick that the file system and
hypervisor have to perform is controlling concurrent read/write access
to those files in a way that prevents corruption and stops two host
servers from running the same virtual server at once.
Support for this largely comes from the file
systems the hypervisors use; VMware, for instance, has a proprietary
VMFS file system that is designed to allow multiple host servers to both
read and write files to and from the same logical storage volumes at the
same time. Windows Server 2008 R2 has a similar feature called Cluster
Shared Volumes that is required in larger Hyper-V environments where
multiple physical host servers concurrently run virtual servers from the
same file system volume. This is a departure from the traditional NTFS
limitation of granting only one read/write connection access to an NTFS
volume at a time. The hypervisors themselves ensure that a virtual
machine is started in only one place at a time, typically using
traditional file system locks and metadata database updates to allow or
prevent a virtual server from starting (see Figure 1).
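As a rough sketch of that locking arbitration, the following Python fragment uses an exclusive lock file on the shared volume to decide which host may power on a given virtual machine. This is a simplified illustration with invented file names, not VMware’s or Microsoft’s actual locking implementation.

```python
# Simplified model of VM start-up arbitration using a lock file on shared
# storage; the file naming and layout are hypothetical.
import os

def try_power_on(vm_name, shared_volume, host_name):
    lock_path = os.path.join(shared_volume, vm_name + ".lck")
    try:
        # O_EXCL makes creation atomic: exactly one host can win the race.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                 # another host already runs this VM
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(host_name)   # record which host owns the VM
    # ...load the metadata, allocate vCPUs and memory, start the VM...
    return True
```

A host that powers the virtual machine off would then delete the lock file, allowing another host to start it.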
By the way, while the Cluster Shared Volumes
feature of Windows sounds like a great solution to numerous other
requirements you might have, the technology is only supported for use
with Hyper-V.
Online Migration
After you have all the files needed to
run your virtual servers stored on some centralized storage, accessible
by multiple physical host servers concurrently, numerous features unique
to virtualization become available. The key differentiator here between
the physical and virtual worlds is that you are no longer dependent on a
specific physical server’s availability in order for your virtual
server to be available. As long as a correctly configured physical host
server with sufficient CPU and memory resources is available and it can
access your virtual server’s files on the shared storage, the virtual
server can run.
Microsoft calls its implementation Live Migration, while VMware calls
its own vMotion; both enable a virtual server to be moved from one
physical host server to another without taking the virtual server
offline.
For those unfamiliar with this technology and who can’t believe what they’ve just read, an example should clarify the idea. In Figure 2,
the virtual server SrvZ is currently running on the physical host
server SrvA, while all of its files are stored on the SAN. By performing
an online migration, you can move SrvZ to run on SrvB without having to
shut it down, as shown in the second half of the diagram.
Why you might want to do this is a legitimate
question for someone new to virtualization, especially as in the
physical world this kind of server administration was impossible. In
fact, server administrators receive many benefits from being able to
move running virtual servers off of a specific physical host server. If a
specific host requires patching, upgrading, or repairing, or perhaps
has too much load, then these issues can be resolved without affecting
the availability of the applications and services that the virtual
servers support. Some or all of the virtual servers running on a host
server can transparently be migrated to another host, freeing up the
host server for maintenance.
The basic concept behind online migration is
readily understandable, but some complex operations are needed to
actually perform it. After the virtualization administrator identifies
where the virtual server should move from and to, the hypervisor
logically “joins” the two host servers and they start working together —
to support not only the running of the virtual server but also its
migration. Each host server begins sharing the virtual server’s data
files stored on the shared storage; the new host server loads the
virtual server’s metadata and allocates the physical hardware and
network resources it needs, such as vCPUs and memory; and then, the
final clever part, the hypervisor sends a snapshot of the virtual
machine’s memory from the original host server to the new one over the
local area network.
Because changes are constantly being made to the
memory, the process can’t finish here, so at this point every memory
change made on the original server needs to be copied to the new server.
This copying can’t always keep pace with the rate of change, so a
combination of virtual server activity and network bandwidth determines
how long this “synchronization” takes. As a consequence, you may need to
perform online migrations during quiet periods, although modern server
hardware, hypervisor technology, and 10Gb Ethernet mean that these
migrations are very quick these days. To copy the last few remaining
memory changes from the original host server to the new host
server, the hypervisor “pauses” the virtual server for literally a
couple of milliseconds. In these few milliseconds, the last remaining
memory pages are copied across, the virtual server’s network addresses
are re-announced via ARP, and full read/write access to the data files
is handed over to the new host. Next, the virtual server is “un-paused,”
and it carries on exactly what it was doing before it was migrated, with
the same CPU instructions, memory addresses, and so on.
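The whole pre-copy loop can be sketched in a few lines of Python. Everything here (the page tracking, the pause threshold, the shrinking dirty set) is an invented simulation of the behavior just described, not any hypervisor’s real algorithm or API.

```python
# Conceptual model of pre-copy live migration. All names and the pause
# threshold are illustrative only.
import random

class VM:
    def __init__(self, pages):
        self.memory = {i: f"page-{i}" for i in range(pages)}
        self.paused = False

    def dirty_pages(self, count):
        # Simulate the guest writing to some of its memory while we copy.
        return set(random.sample(sorted(self.memory), count))

def migrate(vm, target_memory, pause_threshold=8):
    dirty = set(vm.memory)                    # first pass: copy every page
    while len(dirty) > pause_threshold:
        for page in dirty:
            target_memory[page] = vm.memory[page]
        # Pages written to during that pass must be re-sent; here we
        # simulate an ever-smaller set of re-dirtied pages.
        dirty = vm.dirty_pages(max(len(dirty) // 4, 1))
    vm.paused = True                          # the few-millisecond pause
    for page in dirty:                        # copy the final stragglers
        target_memory[page] = vm.memory[page]
    # ...re-announce network addresses (ARP) and hand over file locks...
    vm.paused = False                         # un-pause on the new host

vm = VM(pages=1024)
target = {}
migrate(vm, target)
assert target == vm.memory                    # memory identical on arrival
```

The loop also shows why a busy virtual server takes longer to migrate: the more pages the guest dirties per pass, the more passes are needed before the remaining set is small enough to pause and finish.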
If you are thinking that this pause sounds
dangerous or even potentially fatal to the virtual server, in reality
this technology has been tried and tested successfully — not only by the
vendors themselves but also by the industry. Online migrations have
been performed routinely in large service provider virtualization
environments, and with such confidence that the end customer never
needed to be told they were happening. Nor is this technology limited to
virtual servers with low resource allocations; Microsoft has written
white papers and support articles demonstrating how its Live Migration
feature can be used with servers running SQL Server. In fact, the SQLCAT
team has even released a white paper, downloadable from its website,
with advice about how to tune SQL Server to make online migrations
slicker and more efficient.
However, while the technology is designed to make
the migration as invisible as possible to the virtual server being
migrated, it is still possible for the virtual server to notice. The
dropping of a few network packets is typically the most visible effect,
so client connections to SQL Server can be lost during the process; or,
perhaps more critical, if you deploy Windows Failover Clustering on
virtual servers, the cluster can detect a failover situation. Because of
this, Windows Failover Clustering is not supported for use with online
migration features.
While online migrations may seem like a good
solution to virtual and host server availability, keep in mind that they
are on-demand services — that is, they have to be manually initiated;
and, most important, both the original and the new host servers have to
be available and online in order for the process to work. Both host
servers must also have the same type of CPU; otherwise, differences in
low-level hardware calls would cause issues.
Highly Available Virtual Servers
Understanding how online migrations work
will help you understand how some of the high-availability features in
hypervisors work. When comparing the high-availability features of the
two most prevalent server platform hypervisors, you can see a difference
in their approach to providing high availability. VMware’s vSphere
product has a specific high-availability feature, vSphere HA, built-in;
whereas Microsoft’s Hyper-V service utilizes the well-known services of
Windows Failover Clustering.
Both of these HA services use the same principle
as online migration in that all the files needed to start and run a
virtual server have to be kept on shared storage that is always
accessible by several physical host servers. This means a virtual server
is not dependent on any specific physical server being available in
order for it to run — other than the server on which it’s currently
running, of course. However, whereas online migrations require user
intervention following an administrator’s decision to begin the process,
HA services themselves detect the failure conditions that require
action.
VMware’s and Microsoft’s approaches are ultimately the
same, just implemented differently. Both platforms constantly monitor
the availability of a virtual server to ensure that it is currently
assigned to a host server and that the host server is running it
correctly. However, passing the hypervisor’s checks doesn’t
necessarily mean that anything “inside” the virtual server is working;
monitoring that is an option in VMware’s feature, which can respond to a
failure of the virtual server’s operating system by restarting it.
As an example, if a physical host server went offline through
unexpected failure, all the virtual servers running on it would also go
offline — the virtual equivalent of pulling the power cord out of a
server while it’s running. The hypervisor would detect this and, if
configured to, re-start all of those virtual servers on another host
server.
In this situation, whatever processes were running
on the virtual server are gone and whatever was in its memory is lost;
there is no preemptive memory snapshotting for this particular feature
as there is for online migrations. Instead, the best the hypervisor can
do is automatically start the virtual server on another physical host
server when it notices the virtual server go offline — this is the
virtual equivalent of powering up and cold booting the server. If the
virtual server is running SQL Server, then, when the virtual server is
restarted, there may well be an initial performance degradation while
the plan and data caches build up, just like in the physical world.
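A toy sketch of that detect-and-restart behavior, assuming a simple heartbeat mechanism (the timings, structures, and placement policy are all invented for illustration; vSphere HA and Windows Failover Clustering each do this their own way):

```python
# Toy model of hypervisor HA: hosts emit heartbeats, and when one goes
# silent its virtual servers are cold-booted on the surviving hosts.
import time

class Host:
    def __init__(self, name):
        self.name = name
        self.vms = []                        # VMs currently running here
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

def monitor(hosts, timeout=2.0):
    now = time.monotonic()
    for host in list(hosts):
        if now - host.last_heartbeat > timeout:   # host presumed failed
            hosts.remove(host)
            for vm in host.vms:
                # Cold boot elsewhere: running state and memory are lost,
                # so SQL Server's caches must warm up again afterward.
                # (Assumes at least one surviving host remains.)
                target = min(hosts, key=lambda h: len(h.vms))
                target.vms.append(vm)
            host.vms.clear()

hosts = [Host("HostA"), Host("HostB")]
hosts[0].vms.append("SrvZ")
hosts[0].last_heartbeat -= 10       # simulate HostA going silent
monitor(hosts)
assert hosts[0].name == "HostB" and "SrvZ" in hosts[0].vms
```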
What makes this feature exciting is the
opportunity to bring some form of high availability to virtual servers
regardless of what operating system or application software is running
inside the virtual server. For example, you could have standalone
installations of Windows and SQL Server running on a virtual server,
neither of which is configured with any high-availability services, and
yet still protect SQL Server against unplanned physical server failure.
This technology isn’t a replacement for the
application-level resilience that traditional failover clustering
brings; we already saw that while the hypervisor might be successfully
running the virtual machine, Windows or SQL Server may have stopped.
However, this feature can provide an increased level of availability for
servers that may not justify the cost of failover clustering or
availability groups.
Host and Guest Clustering
To conclude this discussion of
virtualization’s high-availability benefits, this section explains how
the traditional Windows Failover Clustering instances we’re used to
using fit in with it. Host clustering is Microsoft’s term for
implementing the virtual server high availability covered in the
previous section; that is, should a physical host server fail, it will
re-start the virtual servers that were running on it on another physical
host server. It does this by using the Windows Failover Clustering
services running on the physical host servers to detect failure
situations and control the re-starting of the virtual servers.
Guest clustering is where Windows Failover
Clustering is deployed within a virtual server to protect a resource
such as an instance of SQL Server and any resource dependencies it might
have, such as an IP address and host name.
This is deployed in the same way Windows
Failover Clustering would be deployed in a physical server environment,
but with virtual rather than physical servers.
Support from Microsoft for clustering SQL Server
in this manner has been available for some time now, but adoption had
been slow because the range of storage options that could be used was
small. Today, however, many more types of storage are supported,
including SMB file shares in SQL Server 2012 and VMware’s raw device
mappings, which is making the use of guest clustering much more common.
Deploying SQL Server with Virtualization’s High-Availability Features
When SQL Server is deployed in virtual
environments, trying to increase its availability by using some of the
features described becomes very tempting. In my experience, every
virtualization administrator wants to use online migration features, and
quite rightly so. Having the flexibility to move virtual servers
between host servers is often an operational necessity, so any concerns
you may have about SQL Server’s reaction to being transparently
relocated should be tested in order to gain confidence in the process.
You might find that you agree to perform the task only at quiet periods,
or you might feel safe with the process irrespective of the workload.
Likewise, the virtualization administrator is also
likely to want to use the vendor’s high-availability feature so that in
the event of a physical host server failure, the virtual servers are
automatically restarted elsewhere. This is where you need to carefully
consider your approach, if any, to making a specific instance of SQL
Server highly available. My advice is not to mix the different
high-availability technologies available at each layer of the technology
stack. This is because when a failure occurs, you want only a single
end-to-end process to react to it; the last thing you want is for two
different technologies, such as VMware’s HA feature and Windows Failover
Clustering, to respond to the same issue at the same time.