Logging Daily Backup Results and Evaluation
When
running regular backups of mission-critical systems, it is important to
monitor the process to ensure that backup jobs are running properly. It
is equally important to ensure that the data being backed up can
actually be restored.
Tracking Success and Failure
Most third-party
backup software packages have the ability to send a summary of the
result of the backup job to the administrator. This is a critical
function because failures or inconsistent results need to be immediately
brought to the attention of the administrator who is responsible for
backups.
The results of these nightly backups should be reviewed each day, not only to confirm that the backup process succeeded, but also to sanity-check the results. For example, if your backup normally runs for 6 hours and fills up 80GB of space, you should be suspicious of a 16-hour job of the same size or a 1-hour job that backed up only 12GB of data. Because either of those results could show up as a successful run of the backup job, it is critical for an administrator to review the results.
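This kind of daily sanity check is easy to script. The following is a minimal sketch; the baseline figures, the 50% tolerance, and the function names are placeholders chosen for illustration, not part of any backup product:

```python
# Hypothetical sanity-check sketch: baseline figures, tolerance, and the way
# job results are collected are assumptions, not part of any backup product.
BASELINE_HOURS = 6.0    # how long the nightly job normally runs
BASELINE_GB = 80.0      # how much data it normally writes
TOLERANCE = 0.5         # flag anything more than 50% off the baseline

def review_job(hours_run: float, gb_written: float) -> list[str]:
    """Return a list of warnings for a completed backup job."""
    warnings = []
    if abs(hours_run - BASELINE_HOURS) / BASELINE_HOURS > TOLERANCE:
        warnings.append(f"Run time of {hours_run:.1f} hours deviates from the {BASELINE_HOURS:.0f}-hour baseline")
    if abs(gb_written - BASELINE_GB) / BASELINE_GB > TOLERANCE:
        warnings.append(f"Backup size of {gb_written:.1f}GB deviates from the {BASELINE_GB:.0f}GB baseline")
    return warnings

# A 1-hour job that wrote only 12GB reports as "successful" but warrants review:
print(review_job(hours_run=1.0, gb_written=12.0))
```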
In the case of NTBackup, the built-in backup utility included in Windows, the ability to get the results of the backup job is fairly limited. Luckily, this information is posted to the event log of the server and can be easily checked each morning. The status of the backup appears as event 8019.
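Checking for that event can also be scripted rather than done by hand. The following is a minimal sketch, assuming Python is available on the monitoring machine and that wevtutil.exe is present (it ships with Windows Server 2008 and later; on Windows Server 2003 the same check can be made in Event Viewer). The event source name and query are assumptions based on the NTBackup behavior described above:

```python
# Hedged sketch: queries the Application log for recent NTBackup completion
# events (event 8019). Assumes wevtutil.exe is available and that the event
# source is named "NTBackup"; adjust the query if your logs differ.
import subprocess

def last_backup_events(count: int = 3) -> str:
    """Return the most recent NTBackup completion events as readable text."""
    cmd = [
        "wevtutil", "qe", "Application",
        "/q:*[System[Provider[@Name='NTBackup'] and (EventID=8019)]]",
        f"/c:{count}",   # number of events to return
        "/rd:true",      # newest first
        "/f:text",       # human-readable output
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(last_backup_events())
```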
Validating Your Backups
The benefit of backing up data to a remote location or media is the ability to recover the data at a later time. As such, it is very important to regularly verify that your backups are valid and can be successfully restored. It is recommended that you adopt a practice of periodically selecting a backup at random, picking random directories and files from it, and restoring them to a nonproduction location. After the restore, verify that you can access the data successfully. This process helps ensure that your data can be restored in the event of an emergency.
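For file-level data, this spot check can be partially automated by comparing the restored copies against the originals. The sketch below is a hypothetical illustration; the paths, sample size, and function names are placeholders, and it does not apply to Exchange databases themselves, which are better validated by restoring and mounting them in a nonproduction recovery environment:

```python
# Hypothetical verification sketch: compares files restored to a nonproduction
# location against the originals by hash. Paths and sample size are examples.
import hashlib
import random
from pathlib import Path

def file_hash(path: Path) -> str:
    """Return the SHA-256 hash of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(source_root: Path, restore_root: Path, sample_size: int = 25) -> list[str]:
    """Pick random files from the source tree and confirm the restored copies match."""
    candidates = [p for p in source_root.rglob("*") if p.is_file()]
    problems = []
    for original in random.sample(candidates, min(sample_size, len(candidates))):
        restored = restore_root / original.relative_to(source_root)
        if not restored.exists():
            problems.append(f"missing from restore: {original}")
        elif file_hash(original) != file_hash(restored):
            problems.append(f"contents differ: {original}")
    return problems

# Example with hypothetical paths:
# print(spot_check(Path(r"D:\FileShares"), Path(r"E:\RestoreTest\FileShares")))
```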
Roles and Responsibilities
With any process that is
likely to include more than one person, it is useful to clearly define
the roles and responsibilities of those people. This ensures that the
people involved know what is expected of them and they know who to go to
in various situations.
Separation of Duties
A typical
Exchange environment involves members from potentially many groups. For
example, one group might be responsible for Exchange services and
configuration, whereas another group might be tasked with management of
Windows and security patches. Often, yet another group is responsible
for performing backups of the systems. It is very important for each of
these groups to be aware of what other groups are doing. For example,
if the Windows group needed to install Windows patches on the Exchange
servers, the backup group would also need to be aware of this because
they might need to change the scheduling of the backup job. This type of
interdependency must be taken into account when configuring the backup
schedule.
Escalation and Notification
If a backup job fails, it is
critical for the support staff to know what they are supposed to do and
who they should contact. It is recommended to build a matrix of common
issues and create an escalation path for various events. It is also
quite useful to have those events automatically notify the responsible
party. For example, the server monitoring group might be told that in
the event of a backup failure, they should do the following:
Contact the backup group to alert them of the failed job.
Contact the Exchange group to alert them of the failed job.
If neither group contacts you within 30 minutes, contact the IT manager.
If the IT manager doesn’t contact you within 60 minutes, contact the IT director.
By knowing who to call, it is easier to get a qualified party to look at the issue and potentially fix it in time to allow another backup job to be attempted before the backup window expires.
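One way to keep such an escalation matrix actionable is to capture it in a form that monitoring tooling can read. The following is a hypothetical sketch; the group names, contact addresses, wait times, and the send_alert stub are all placeholders for whatever notification system the organization actually uses:

```python
# Hypothetical escalation-matrix sketch; the groups, contacts, and wait times
# are placeholders, and send_alert is a stub for the real paging or e-mail system.
from dataclasses import dataclass

@dataclass
class EscalationStep:
    contact: str          # who to notify
    wait_minutes: int     # how long to wait for a response before this step fires

ESCALATION_MATRIX = {
    "backup_job_failed": [
        EscalationStep("backup-group@example.com", 0),
        EscalationStep("exchange-group@example.com", 0),
        EscalationStep("it-manager@example.com", 30),
        EscalationStep("it-director@example.com", 60),
    ],
}

def send_alert(contact: str, issue: str) -> None:
    """Stub: replace with the organization's actual paging/e-mail mechanism."""
    print(f"ALERT -> {contact}: {issue}")

def escalate(issue: str) -> None:
    """Walk the escalation path for a known issue, in order."""
    for step in ESCALATION_MATRIX.get(issue, []):
        # A real implementation would wait step.wait_minutes for an
        # acknowledgement before notifying the next contact in the path.
        send_alert(step.contact, issue)

escalate("backup_job_failed")
```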
Developing a Backup Strategy
Developing an effective backup strategy involves detailed planning around the logistics of backing up the necessary information or data: the backup software, the media type, and accurate documentation. To be truly effective, a backup strategy should take into account all of the resources available for recovery.
Along with planning and documentation, other aspects of a backup strategy include assigning specific tasks and responsibilities to individual IT staff members based on their strengths and areas of expertise, choosing the best person to be responsible for backing up a particular service or server, and ensuring that documentation is accurate and current.
What Is Important to Exchange Backups?
In general, the critical
thing to capture in an Exchange backup is any unique data whose loss
would impact users. This typically means that you need to back up the
mailbox databases, public folder databases, and the log files that go
with them. Files such as the operating system itself or the System State
data are less important.
Creating Standard Backup Procedures
Creating
a regular backup procedure helps ensure that the entire enterprise is
backed up consistently and properly on a regular basis. When a regular
procedure is created, the assigned staff members soon become accustomed
to the procedure because they are given a guide that walks through each
required step. If there is no documented procedure, certain items might
be overlooked and not be backed up, which can be a major problem if a
failure occurs. For example, a regular backup procedure for an Exchange 2007 server might back up the Exchange databases on the local drives every night, and also perform an Automated System Recovery (ASR) backup once a month and whenever a hardware change is made to the server. These less frequent tasks might be overlooked if no one is following regular change control and documented procedures.
Tip
It is a best
practice to add documentation updates into standard server change
control processes. This ensures that any modifications to server
configurations also get added into server build documents.
Protecting Data in the Event of a System Failure
Server
failures are the primary concern most organizations plan for, because a
complete system failure creates the most impact and, ultimately, a
scenario where data needs to be restored from backup tape. Server
hardware failures include failed motherboards, processors, memory,
network interface cards, disk controllers, power supplies, and, of
course, hard disks. Each of these failures can be minimized through the
implementation of RAID-configured hard disk drives, error-correcting memory, redundant power supplies, or redundant controller adapters. In a
catastrophic system failure, however, it is likely that the entire data
backup would have to be restored to a new system or repaired server.
Because data is read from and written to hard drives on a constant basis, hard drives are frequently singled out as the most likely cause of a server hardware failure. To address this, Windows Server 2003 supports hot-swappable hard drives and RAID storage systems, allowing a failed drive to be replaced without server downtime, provided that the server chassis and disk controllers support such a change. Windows Server 2003 supports two types of disks: Basic disks, which provide backward compatibility, and Dynamic disks, which enable software-level disk arrays to be configured without a separate disk controller. Both Basic and Dynamic disks, when used as data disks, can be moved to other servers easily. This provides data or disk capacity elsewhere if a system hardware failure occurs and the data on these disks needs to be made available as soon as possible.
Note
If
hardware-level RAID is configured, the controller card configuration
should be backed up using a utility available through the vendor.
With most array
controllers today, dynamic reading of the disk configuration can be done
as long as the disks are placed into a new system using the same disk
order. If this is not supported, the controller can be moved to the new system, or the configuration might need to be re-created from scratch to complete a successful disk move to a new machine.
This process should
always be tested, verified, and documented in a lab environment before
being considered as a valid recovery option.
To protect against a
system failure, organizations need to have a full image backup that can
then be restored in its entirety to a new or repaired server system.
This also requires completing and documenting these steps in advance to
ensure that it can be completed and administrators understand the steps
involved.
Protecting Data in the Event of a Database Corruption
Data recovery is also needed in the event of database corruption in Exchange. Unlike a catastrophic system failure, which can be recovered from the last tape backup, data corruption creates a more challenging situation for information recovery. If data is corrupt on the server system, a restore from the last backup might also contain corrupt information in its database, so a data restore needs to predate the point of corruption. This typically requires the ability to restore the database from an older full backup tape and then recover the incremental data taken since that clean full backup.
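The logic for choosing which backups to restore is simple to reason about: the newest full backup that predates the corruption, plus every incremental taken after that full backup but before the corruption. The sketch below illustrates that selection; the backup records, dates, and labels are hypothetical:

```python
# Hypothetical sketch of choosing a restore chain that predates a corruption:
# the newest full backup taken before the corruption, plus the incrementals
# taken after that full backup but before the corruption was introduced.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Backup:
    taken: datetime
    kind: str            # "full" or "incremental"
    label: str           # tape or job label

def restore_chain(backups: list[Backup], corrupted_at: datetime) -> list[Backup]:
    fulls = [b for b in backups if b.kind == "full" and b.taken < corrupted_at]
    if not fulls:
        raise ValueError("No clean full backup predates the corruption")
    base = max(fulls, key=lambda b: b.taken)
    incrementals = [
        b for b in backups
        if b.kind == "incremental" and base.taken < b.taken < corrupted_at
    ]
    return [base] + sorted(incrementals, key=lambda b: b.taken)

history = [
    Backup(datetime(2007, 3, 4), "full", "FULL-WK09"),
    Backup(datetime(2007, 3, 5), "incremental", "INC-MON"),
    Backup(datetime(2007, 3, 6), "incremental", "INC-TUE"),
    Backup(datetime(2007, 3, 7), "incremental", "INC-WED"),
]
# Corruption introduced late on March 6: restore FULL-WK09, INC-MON, INC-TUE.
print([b.label for b in restore_chain(history, datetime(2007, 3, 6, 18, 0))])
```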
Providing the Ability to Restore a Message, Folder, or Mailbox
In other situations,
an organization might need to recover a single message, folder, or
mailbox rather than a full database. With most full backups of an
Exchange server, the restore process requires a full restore of all
messages, folders, and mailboxes. If an administrator has to work with
only a full image backup, typically a full restore must be performed on a
spare server and information extracted from the full restore as
necessary.
If message, folder,
or mailbox recovery is required on a regular basis, the organization
might elect to back up information in a format or process that provides
an easier method of information recovery. This might involve the
purchase and use of a third-party tape backup system, or a combination
of various utilities available in Exchange 2007 to restore individual
sets of information.
Assigning Tasks and Designating Team Members
Each particular server or
network device in the enterprise has specific requirements for backing
up and creating documentation around hardware and the service it
provides. To make sure that a critical system is being backed up
properly, IT staff should designate a single individual to monitor that
device and ensure the backup is completed and documentation is accurate
and current at all times. It is also wise to assign a secondary staff member with the same skill set to act as a backup when the primary staff member is unavailable, ensuring that there is no single point of failure among the IT staff performing these tasks.
Assigning only
primary and secondary resources to specific devices or services helps
improve the overall security and reliability of the device and services
provided to network users.
By limiting who can back up and restore data—and even who can manage
servers and devices—to just the primary and secondary qualified staff
members, the organization can rest assured that only competent, trained
individuals are working on systems they are assigned to manage. Even
though the backup and restore responsibilities lie with the primary and
secondary resources, the backup and recovery plans should still be
documented and available to the remaining IT staff for additional
training and a final means of support if needed.
Selecting the Best Devices for Your Backup
Each device used on any network can have specific backup requirements. As mentioned earlier, each assigned IT staff member should also be responsible for researching and learning the backup and recovery requirements of each device, to ensure that the backups contain everything necessary to recover from a device failure.
As a rule of thumb for network devices, the device configuration should be backed up whenever possible, ideally by using the device manufacturer’s configuration software, or at least by documenting the configuration for use as a reference should a device require reconfiguration.
Tip
It is also a best practice
to evaluate the hardware used in your environment to determine which
areas might be the most likely points of failure. Having spare devices
can reduce the overall downtime in case of a failure. When dealing with
Exchange 2007 considerations, these spare hardware devices can be pieces
such as hard drives to support a failed drive in a RAID configuration.
Understanding How Devices Affect Backups
Depending on how a given environment is architected, there might be several different options for how it can be backed up. Administrators lucky enough to have network attached storage (NAS) or storage area networks (SANs) for their Exchange 2007 servers might have significantly faster options for performing backups than administrators who are using direct attached storage (DAS). In many cases, the NAS or SAN devices are able to perform local snapshots, or the SAN can be backed up by a tape device that is plugged directly into the Fibre Channel fabric. This has great advantages when compared to backing up an Exchange 2007 server over the network. For example, Gigabit Ethernet allows for 1Gb/sec of throughput, whereas Fibre Channel not only offers speeds of 4Gb/sec, but is also a more efficient protocol.
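As a rough back-of-the-envelope comparison using those link speeds and the 80GB figure from earlier (and ignoring protocol overhead, disk performance, and backup-software processing, all of which slow real jobs down), a simple calculation shows why the fabric-attached approach is attractive:

```python
# Back-of-the-envelope comparison of raw transfer time for an 80GB backup.
# Ignores protocol overhead, disk speed, and backup-software processing, all
# of which make real-world jobs slower than these theoretical figures.
BACKUP_GB = 80

def transfer_minutes(link_gbits_per_sec: float) -> float:
    gigabytes_per_sec = link_gbits_per_sec / 8      # 8 bits per byte
    return BACKUP_GB / gigabytes_per_sec / 60

print(f"Gigabit Ethernet (1Gb/sec): {transfer_minutes(1):.0f} minutes")   # ~11 minutes
print(f"Fibre Channel    (4Gb/sec): {transfer_minutes(4):.0f} minutes")   # ~3 minutes
```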
Determining Backup Speeds and Times
The time needed to perform a backup of Exchange 2007 is influenced mostly by the speed of the backup device itself. Although vendors quote values for how many megabytes per minute their device can back up, this isn’t always an accurate value when backing up an Exchange 2007 server. It is always recommended to perform test backups of Exchange servers to determine the speed at which they can be backed up. By knowing how long jobs will take, an administrator can better select the backup window in which the backups will occur. As Exchange servers grow in terms of the storage used by mail data, the backups take longer to complete. Pay careful attention to the network utilization and to the backup device utilization so that you can watch for bottlenecks that cause backup jobs to take too long.
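One simple way to turn a test backup into a planning figure is sketched below; the sample numbers (a 10GB test job measured at 25 minutes, projected to 400GB of mail data) are purely illustrative:

```python
# Hypothetical planning sketch: measure throughput from a small test backup,
# then project the window needed for the full data set. All figures are examples.
def measured_rate_mb_per_min(test_gb: float, test_minutes: float) -> float:
    return (test_gb * 1024) / test_minutes

def projected_window_hours(total_gb: float, rate_mb_per_min: float) -> float:
    return (total_gb * 1024) / rate_mb_per_min / 60

rate = measured_rate_mb_per_min(test_gb=10, test_minutes=25)      # ~410 MB/min
print(f"Projected window for 400GB: {projected_window_hours(400, rate):.1f} hours")  # ~16.7 hours
```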
Tip
Consider backing up
Exchange 2007 to a backup server that is using disks as the media for
the backup. This is typically the fastest media that you will be able to
utilize for “over the network” backups. Then take the locally stored
backup and back that up to tape. Because you are backing up “cold” data,
there is no concern about performing the backup during the day. This
allows you to keep your backup window relatively short. The side benefit
is that if you ever experience a failure that requires you to restore
from the backups, you’ll be doing a disk-to-disk restore, which is much
faster than a tape-to-disk restore.
Validating the Backup Strategy in a Test Lab
Regardless of what
methodology you choose for backups of your Exchange 2007 environment, it
is critical to test the processes in a lab environment. The goal of
this validation is not only to prove that data can be backed up and
restored, but also to refine and document the exact steps used. It is
much easier to figure out how to perform a restore in the lab than it is
in production when hundreds or thousands of mailbox users are down. The
goal of a production restore is to be able to follow accurate,
validated instructions and not have to figure out what you need to do on
the fly.