Recovering from a Site Failure
When a site becomes unavailable because of a physical access limitation
or a disaster such as a fire or earthquake, steps must be taken to
recover the Exchange server in that site. Exchange does not have a
single-step method of merging information from the failed site server
into another server, so the process involves recovering the lost server
in its entirety.
To prepare for the recovery of a failed site, an organization can create
redundancy in a failover site. With redundancy built into a remote site,
the recovery and restore effort is kept to a minimum if a recovery needs
to be performed.
For environments where
SLAs offer very little time to bring up a recovery location,
administrators should strongly consider implementing Cluster Continuous
Replication, a new feature of Exchange 2007.
Creating Redundant and Failover Sites
Redundant sites are created
for a couple of different reasons. First, a redundant site can have a
secondary Internet connection and bridgehead routing server so that if
the primary site is down, the secondary site can be the focus for
inbound and outbound email communications. This redundancy can be built,
configured, and set to automatically provide failover in case of a site
failure.
The other reason for redundant site preparation is to provide a warm
spare server site so that a company is prepared to restore a site server
in case of a site failure. Preparation can be as simple as keeping
server documentation at another site, or as thorough as storing a full
image of the server there. The more preparatory work is conducted up
front, the faster the organization will be able to recover from a system
failure.
If you plan to utilize
warm spares, be sure to update those warm spares with patches and
applications as you apply them to the production systems. This
eliminates several steps when it comes time to put them into use.
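One way to keep warm spares in step with production is to compare the two update inventories regularly. The following is a minimal sketch, assuming each inventory is available as a set of update identifiers; the function name and the KB numbers are hypothetical placeholders, not real updates.

```python
def missing_updates(production: set[str], spare: set[str]) -> list[str]:
    """Return updates installed on production but absent from the warm spare."""
    return sorted(production - spare)


# Hypothetical inventories; in practice these would be exported from
# whatever patch-management tool the organization already uses.
production_updates = {"KB0000001", "KB0000002", "KB0000003"}
spare_updates = {"KB0000001"}

for update in missing_updates(production_updates, spare_updates):
    print(f"Warm spare is missing {update}")
```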
Creating the Failover Site
When an organization
decides to plan for site failures as part of a disaster recovery
solution, many areas need to be addressed and many options exist. For
organizations looking for redundancy, network connectivity is a
priority, along with spare servers that can accommodate the user load.
The spare servers need to have enough disk space to accommodate a
complete restore. As a best practice, to ensure a smooth transition, the
following list of recommendations provides a starting point:
Allocate the appropriate hardware devices, including servers with enough processing power and disk space to accommodate the restored machines’ resources.
Host the organization’s DNS zones and records using primary DNS servers located at an Internet service provider (ISP) colocation facility, or have redundant DNS servers registered for the domain and located at both physical locations.
Publish the recovery site’s mail server as a lower-priority MX record. This way, when the recovery server comes online you won’t have to wait for DNS propagation to advertise the new MX record.
For the Exchange servers, ensure that the host records in the DNS tables are set to low Time to Live (TTL) values so that DNS changes do not take extended periods to propagate across the Internet. The Microsoft Windows Server 2003 default TTL is 1 hour.
Ensure that network connectivity is already established and stable between sites and between each site and the Internet.
Create at least two copies of the backup tape media for each site. One copy should remain at the site, and a second copy should be stored with an offsite data storage company. An optional third copy could be stored at another site and used to perform regular test restores to spare hardware, and to restore Windows if a site failover is necessary.
Have a copy of all disaster recovery documentation stored at multiple locations as well as at the offsite data storage company. This provides redundancy if a recovery becomes necessary.
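The MX recommendation in the list above relies on how sending mail servers choose a destination: they try the MX host with the lowest preference value first and fall back to higher values when that host cannot be reached. The following is a minimal sketch of that selection logic. The host names, preference values, and the `pick_mail_host` function are hypothetical, and no real DNS lookup is performed.

```python
from typing import NamedTuple, Optional


class MxRecord(NamedTuple):
    preference: int  # lower number = tried first (higher priority)
    host: str


def pick_mail_host(records: list[MxRecord], reachable: set[str]) -> Optional[str]:
    """Return the first reachable host, trying the lowest preference value first."""
    for record in sorted(records):
        if record.host in reachable:
            return record.host
    return None


# Hypothetical records: the recovery site is published in advance
# with a higher preference value (that is, a lower priority).
records = [
    MxRecord(10, "mail.primary.example.com"),
    MxRecord(20, "mail.recovery.example.com"),
]

# Normal operation: both sites are up, so the primary receives mail.
print(pick_mail_host(records, {"mail.primary.example.com", "mail.recovery.example.com"}))

# Primary site down: delivery falls back to the recovery site.
print(pick_mail_host(records, {"mail.recovery.example.com"}))
```

Because the recovery record already exists, the fallback needs no DNS change at failure time.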
Allocating hardware and
making the site ready to act as a failover site are simple tasks in
concept, but the actual failover and failback process can be
troublesome. Keep in mind that the preceding list applies to failover
sites, not mirrored or redundant sites configured to provide load
balancing.
Failing Over Between Sites
Before a failover between sites can succeed, administrators need to know
which services need to fail over and in what order of precedence. For
example, before an Exchange server can be restored, Active Directory
domain controllers, global catalog servers, and DNS servers must be
available.
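The order of precedence can be written down as a small dependency map and sorted so that prerequisites always come first. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The service names and the map itself are illustrative assumptions based on the dependencies described above; a real environment will have more entries.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services that must be available before it.
# Assumption: only the dependencies named in the text are modeled here.
prerequisites = {
    "internal DNS servers": set(),
    "domain controllers": set(),
    "global catalog servers": set(),
    "Exchange servers": {
        "internal DNS servers",
        "domain controllers",
        "global catalog servers",
    },
}

# static_order() yields every service after all of its prerequisites.
restore_order = list(TopologicalSorter(prerequisites).static_order())

for step, service in enumerate(restore_order, start=1):
    print(f"{step}. {service}")
```

Keeping the map in the disaster recovery documentation makes the restore sequence explicit instead of leaving it to memory during an outage.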
At a high level, such a cutover requires the following tasks to be executed in a timely manner:
1. Update Internet DNS records pointing to the Exchange server(s) if the recovery site wasn’t already advertised.
2. Restore any necessary Windows Server 2003 domain controllers, global catalog servers, and internal DNS servers.
3. Restore the Exchange server(s).
4. Test client connectivity, troubleshoot, and provide remote and local client support as needed.
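For the connectivity test in the last step, a quick first check is whether the restored server accepts TCP connections on the ports clients and other mail servers use. The following is a minimal sketch, assuming SMTP on port 25 and HTTPS (for Outlook Web Access) on port 443. The function names and the host name in the usage comment are hypothetical, and an open port shows only that something is listening, not that Exchange is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Check each named port on the host and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


# Example usage (hypothetical host name):
# check_services("mail.recovery.example.com", {"SMTP": 25, "HTTPS": 443})
```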
Failing Back After Site Recovery
When the initial
site is back online and available to handle client requests and provide
access to data and networking services and applications, it is time to
consider failing back the services. This can be a controversial subject
because failback procedures are usually more difficult than the initial
failover procedure. Most organizations plan on the failover and have a
tested failover plan that might include database log shipping to the
disaster recovery site. However, they do not plan how they can get the
current data back to the restored servers in the main or preferred site.
Questions to consider for failing back are as follows:
Will downtime be necessary to restore databases between the sites?
When is the appropriate time to fail back?
Is the failover site less functional than the preferred site? In other words, are only mission-critical services provided in the failover site, or is it a complete copy of the preferred site?
The answers really lie in the complexity of the failed-over environment.
If the cutover is simple, there is no reason to wait to fail back.
Providing Alternative Methods of Client Connectivity
When failover sites are too
expensive and are not an option, it does not mean that an organization
cannot plan for site failures. Other lower-cost options are available
but depend on how and where the employees do their work. For example,
many times users who need to access email can do so without physically
being at the site location. Email can be accessed remotely from other
terminals or workstations.
The following are some ways to deal with these issues without renting or buying a separate failover site:
Consider renting racks or cages at a local ISP to colocate servers that can be accessed during a site failure.
Have users dial in from home to a terminal server hosted at an ISP to access Exchange.
Set up remote user access using Terminal Services or Outlook Web Access at a redundant site so that users can access their email, calendar, and contacts from any location.
Rent temporary office space, printers, networking equipment, and user workstations with common standard software packages such as Microsoft Office and Microsoft Internet Explorer. You can plan for and execute this option in about 1 day. If this is an option, be sure to find a computer rental agency first and get pricing before a failure occurs and you have no choice but to pay the rental rates.
Recovering from a Disk Failure
Organizations create disaster recovery plans and procedures to protect
against a variety of system failures, but disk failures tend to be the
most common in networking environments. The technology used to create
processor chips and memory chips has improved drastically over the past
couple of decades, minimizing the failure of system boards. The quality
of hard drives has also improved drastically over the years, but because
hard drives are constantly spinning and have the most moving parts in a
computer system, they still tend to be the components that fail most
often.
The key to a fault-tolerant disk solution is creating hardware fault
tolerance on key server drives so that they can be recovered in case of
failure. Information is stored on system, boot, and data volumes that
have varying levels of recovery needs. Many options exist, such as
storage area networks (SANs) or various RAID levels, to minimize the
impact of drive failures.
Hardware-Based RAID Array Failure
Common uses of hardware-based disk arrays for Windows servers include
RAID 1 (mirroring) for the operating system and RAID 5 (striped sets
with parity) for separate data volumes. Some deployments use a single
RAID-5 array for both the OS and data volumes, and RAID 0+1 (mirrored
striped sets) has been used in more recent deployments.
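The RAID levels named above trade usable capacity for redundancy in different ways. The following is a minimal sketch of the capacity arithmetic, assuming equal-size disks and modeling RAID 1 as a two-disk mirror. The function name and the 300 GB disk size are illustrative assumptions.

```python
def usable_capacity_gb(level: str, disks: int, disk_gb: float) -> float:
    """Usable capacity for RAID 1, 5, or 0+1, assuming equal-size disks."""
    if level == "1":
        if disks != 2:
            raise ValueError("this sketch models RAID 1 as a two-disk mirror")
        return disk_gb  # one disk of space; the second disk is the mirror
    if level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_gb  # one disk's worth of space holds parity
    if level == "0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, at least four")
        return (disks // 2) * disk_gb  # half the disks mirror the other half
    raise ValueError(f"unsupported RAID level: {level}")


for level, disks in [("1", 2), ("5", 4), ("0+1", 4)]:
    capacity = usable_capacity_gb(level, disks, 300)
    print(f"RAID {level} with {disks} x 300 GB disks: {capacity:.0f} GB usable")
```

With four 300 GB disks, RAID 5 yields 900 GB usable and RAID 0+1 yields 600 GB. Each of these configurations is guaranteed to survive the loss of one disk.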
RAID controllers provide
a firmware-based array-management interface, which can be accessed
during system startup. This interface enables administrators to
configure RAID controller options and manage disk arrays. This interface
should be used to repair or reconfigure disk arrays if a problem or
disk failure occurs.
Many controllers offer
Windows-based applications that can be used to manage and create arrays.
Of course, this requires the operating system to be started to access
the Windows-based RAID controller application. Follow the manufacturer’s
procedures on replacing a failed disk within hardware-based RAID
arrays.
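When a failed disk in a RAID-5 array is replaced, its contents are rebuilt from the surviving disks and the parity data. The following is a simplified sketch of that idea using XOR parity over three tiny data blocks. The block contents and function name are hypothetical, and real controllers do this work in firmware and rotate parity across the disks, which is not modeled here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per disk, plus a parity block on a fourth disk.
data = [b"mail", b"logs", b"pubs"]
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the survivors with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same calculation explains why RAID 5 tolerates only one failed disk at a time: with two blocks missing, the XOR of the survivors no longer isolates either one.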
Note
Many RAID controllers allow an array to be configured with a hot spare disk.
This disk automatically joins the array when a single disk failure
occurs. If several arrays are created on a single RAID controller card,
hot spare disks can be defined as global and can be used to replace a
failed disk on any array. As a best practice, hot spare disks should be
defined for arrays.
System Volume
If a system disk failure is encountered, the system can be left in a
completely failed state. To prevent this problem from occurring, the
administrator should always try to create the system disk on a
fault-tolerant disk array such as RAID 1 or RAID 5. If the system disk
was mirrored (RAID 1) in a hardware-based array, the operating system
will operate and boot normally because the disk and partition referenced
in the boot.ini file will remain the same and will be accessible. If the
RAID-1 array was created within the operating system using Disk
Management or diskpart.exe, the mirrored disk can be accessed upon
bootup by choosing the second option in the boot.ini file during
startup. If a disk failure occurs on a software-based RAID-1 array
during regular operation, no system disruption should be encountered.
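To make the "second option" concrete, the following sketch lists the boot menu entries from a sample boot.ini. The file contents are an illustrative assumption: the ARC paths, menu text, and switches on a real server depend on its disk layout, so check the actual file before relying on any entry. The `boot_entries` function is hypothetical.

```python
# Illustrative boot.ini for a software mirror; real paths and labels will differ.
SAMPLE_BOOT_INI = r"""
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect
multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="Boot Mirror C: - secondary plex" /fastdetect
"""


def boot_entries(text: str) -> list[tuple[str, str]]:
    """Return (ARC path, menu text) pairs from the [operating systems] section."""
    entries = []
    in_os_section = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            in_os_section = line.lower() == "[operating systems]"
        elif in_os_section and "=" in line:
            path, label = line.split("=", 1)
            entries.append((path, label))
    return entries


for number, (path, label) in enumerate(boot_entries(SAMPLE_BOOT_INI), start=1):
    print(f"{number}. {label}  ->  {path}")
```

In this sample, the second entry differs from the first only in its rdisk value, which points at the second physical disk holding the mirror.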
Boot Volume
If Windows Server 2003
has been installed on the second or third partitions of a disk drive, a
separate boot and system partition will be created. Most manufacturers
require that for a system to boot up from a volume other than the
primary partition, the partition must be marked active before
functioning. To satisfy this requirement without having to change the
active partition, Windows Server 2003 always tries to load the boot
files on the first or active partition during installation, regardless
of which partition or disk the system files will be loaded on. When this
drive or volume fails, if the system volume
is still intact, a boot disk can be used to boot into the OS and make
the necessary modification after changing the drive.
Data Volume
A data volume is by far the
simplest of all types of disks to recover. If an entire disk fails,
simply replacing the disk, assigning the previously configured drive
letter, and restoring the entire drive from backup will restore the data
and permissions.
A few issues to watch out for include the following:
Setting the correct permissions on the root of the drive
Ensuring that file shares still work as desired
Validating that data in the drive does not require a special restore procedure