Logging Daily Backup Results and Evaluation
When
running regular backups of mission-critical systems, it is important to
monitor the process to ensure that backup jobs are running properly. It
is equally important to ensure that the data being backed up can
actually be restored.
Tracking Success and Failure
Most third-party
backup software packages have the ability to send a summary of the
result of the backup job to the administrator. This is a critical
function because failures or inconsistent results need to be immediately
brought to the attention of the administrator who is responsible for
backups.
The results of these nightly backups should be reviewed each day, not only to confirm that the backup process succeeded, but also to sanity-check the results. For example, if your backup normally runs for 6 hours and fills up 80GB of space, you should be suspicious of a 16-hour job of the same size or a 1-hour job that backed up only 12GB of data. Because either of those results could show up as a successful run of the backup job, it is critical for an administrator to review the results.
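This kind of daily sanity check is easy to script. The following is a minimal sketch; the baseline figures, the 50% tolerance, and the function names are placeholders chosen for illustration, not part of any backup product:

```python
# Hypothetical sanity-check sketch: baseline figures, tolerance, and the way
# job results are collected are assumptions, not part of any backup product.
BASELINE_HOURS = 6.0    # how long the nightly job normally runs
BASELINE_GB = 80.0      # how much data it normally writes
TOLERANCE = 0.5         # flag anything more than 50% off the baseline

def review_job(hours_run: float, gb_written: float) -> list[str]:
    """Return a list of warnings for a completed backup job."""
    warnings = []
    if abs(hours_run - BASELINE_HOURS) / BASELINE_HOURS > TOLERANCE:
        warnings.append(f"Run time of {hours_run:.1f} hours deviates from the {BASELINE_HOURS:.0f}-hour baseline")
    if abs(gb_written - BASELINE_GB) / BASELINE_GB > TOLERANCE:
        warnings.append(f"Backup size of {gb_written:.1f}GB deviates from the {BASELINE_GB:.0f}GB baseline")
    return warnings

# A 1-hour job that wrote only 12GB reports as "successful" but warrants review:
print(review_job(hours_run=1.0, gb_written=12.0))
```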
In the case of NTBackup, the built-in backup utility included in Windows, the ability to get the results of the backup job is fairly limited. Luckily, this information is posted to the event log of the server and can be easily checked each morning. The status of the backup appears as event 8019.
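Checking for that event can also be scripted rather than done by hand. The following is a minimal sketch, assuming Python is available on the monitoring machine and that wevtutil.exe is present (it ships with Windows Server 2008 and later; on Windows Server 2003 the same check can be made in Event Viewer). The event source name and query are assumptions based on the NTBackup behavior described above:

```python
# Hedged sketch: queries the Application log for recent NTBackup completion
# events (event 8019). Assumes wevtutil.exe is available and that the event
# source is named "NTBackup"; adjust the query if your logs differ.
import subprocess

def last_backup_events(count: int = 3) -> str:
    """Return the most recent NTBackup completion events as readable text."""
    cmd = [
        "wevtutil", "qe", "Application",
        "/q:*[System[Provider[@Name='NTBackup'] and (EventID=8019)]]",
        f"/c:{count}",   # number of events to return
        "/rd:true",      # newest first
        "/f:text",       # human-readable output
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(last_backup_events())
```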
Validating Your Backups
The benefit of backing up data to a remote location or media is the ability to recover the data at a later time. As such, it is very important to regularly verify that your backups are valid and can be successfully restored. It is recommended that you adopt a practice of periodically selecting a backup at random, picking random directories and files from it, and restoring them to a nonproduction location. After the restore, verify that you can access the data successfully. This process helps ensure that your data can be restored in the event of an emergency.
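For file-level data, this spot check can be partially automated by comparing the restored copies against the originals. The sketch below is a hypothetical illustration; the paths, sample size, and function names are placeholders, and it does not apply to Exchange databases themselves, which are better validated by restoring and mounting them in a nonproduction recovery environment:

```python
# Hypothetical verification sketch: compares files restored to a nonproduction
# location against the originals by hash. Paths and sample size are examples.
import hashlib
import random
from pathlib import Path

def file_hash(path: Path) -> str:
    """Return the SHA-256 hash of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(source_root: Path, restore_root: Path, sample_size: int = 25) -> list[str]:
    """Pick random files from the source tree and confirm the restored copies match."""
    candidates = [p for p in source_root.rglob("*") if p.is_file()]
    problems = []
    for original in random.sample(candidates, min(sample_size, len(candidates))):
        restored = restore_root / original.relative_to(source_root)
        if not restored.exists():
            problems.append(f"missing from restore: {original}")
        elif file_hash(original) != file_hash(restored):
            problems.append(f"contents differ: {original}")
    return problems

# Example with hypothetical paths:
# print(spot_check(Path(r"D:\FileShares"), Path(r"E:\RestoreTest\FileShares")))
```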
Roles and Responsibilities
With any process that is
likely to include more than one person, it is useful to clearly define
the roles and responsibilities of those people. This ensures that the
people involved know what is expected of them and they know who to go to
in various situations.
Separation of Duties
A typical
Exchange environment involves members from potentially many groups. For
example, one group might be responsible for Exchange services and
configuration, whereas another group might be tasked with management of
Windows and security patches. Often, yet another group is responsible
for performing backups of the systems. It is very important for each of
these groups to be aware of what other groups are doing. For example,
if the Windows group needed to install Windows patches on the Exchange
servers, the backup group would also need to be aware of this because
they might need to change the scheduling of the backup job. This type of
interdependency must be taken into account when configuring the backup
schedule.
Escalation and Notification
If a backup job fails, it is
critical for the support staff to know what they are supposed to do and
who they should contact. It is recommended to build a matrix of common
issues and create an escalation path for various events. It is also
quite useful to have those events automatically notify the responsible
party. For example, the server monitoring group might be told that in
the event of a backup failure, they should do the following:
Contact the backup group to alert them of the failed job.
Contact the Exchange group to alert them of the failed job.
If neither group contacts you within 30 minutes, contact the IT manager.
If the IT manager doesn’t contact you within 60 minutes, contact the IT director.
By knowing who to call, it is easier to get a qualified party to look at the issue and potentially fix it in time to allow another backup job to be attempted before the backup window expires.
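One way to keep such an escalation matrix actionable is to capture it in a form that monitoring tooling can read. The following is a hypothetical sketch; the group names, contact addresses, wait times, and the send_alert stub are all placeholders for whatever notification system the organization actually uses:

```python
# Hypothetical escalation-matrix sketch; the groups, contacts, and wait times
# are placeholders, and send_alert is a stub for the real paging or e-mail system.
from dataclasses import dataclass

@dataclass
class EscalationStep:
    contact: str          # who to notify
    wait_minutes: int     # how long to wait for a response before this step fires

ESCALATION_MATRIX = {
    "backup_job_failed": [
        EscalationStep("backup-group@example.com", 0),
        EscalationStep("exchange-group@example.com", 0),
        EscalationStep("it-manager@example.com", 30),
        EscalationStep("it-director@example.com", 60),
    ],
}

def send_alert(contact: str, issue: str) -> None:
    """Stub: replace with the organization's actual paging/e-mail mechanism."""
    print(f"ALERT -> {contact}: {issue}")

def escalate(issue: str) -> None:
    """Walk the escalation path for a known issue, in order."""
    for step in ESCALATION_MATRIX.get(issue, []):
        # A real implementation would wait step.wait_minutes for an
        # acknowledgement before notifying the next contact in the path.
        send_alert(step.contact, issue)

escalate("backup_job_failed")
```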
Developing a Backup Strategy
Developing an effective backup strategy involves detailed planning around the logistics of backing up the necessary information or data: the backup software, the media type, and accurate documentation. To be truly effective, a backup strategy should take into account all of the resources available for recovery.
Along with planning and documentation, other aspects of a backup strategy include assigning specific tasks and responsibilities to individual IT staff members based on their strengths and areas of expertise, choosing the best person to be responsible for backing up a particular service or server, and ensuring that documentation is accurate and current.
What Is Important to Exchange Backups?
In general, the critical
thing to capture in an Exchange backup is any unique data whose loss
would impact users. This typically means that you need to back up the
mailbox databases, public folder databases, and the log files that go
with them. Files such as the operating system itself or the System State
data are less important.
Creating Standard Backup Procedures
Creating
a regular backup procedure helps ensure that the entire enterprise is
backed up consistently and properly on a regular basis. When a regular
procedure is created, the assigned staff members soon become accustomed
to the procedure because they are given a guide that walks through each
required step. If there is no documented procedure, certain items might
be overlooked and not be backed up, which can be a major problem if a
failure occurs. For example, a regular backup procedure for an Exchange 2007 server might back up the Exchange databases on the local drives every night, and also perform an Automated System Recovery (ASR) backup once a month and whenever a hardware change is made to the server. These less frequent tasks might be overlooked if no one is following regular change control and documented procedures.
Tip
It is a best
practice to add documentation updates into standard server change
control processes. This ensures that any modifications to server
configurations also get added into server build documents.
Protecting Data in the Event of a System Failure
Server
failures are the primary concern most organizations plan for, because a
complete system failure creates the most impact and, ultimately, a
scenario where data needs to be restored from backup tape. Server
hardware failures include failed motherboards, processors, memory,
network interface cards, disk controllers, power supplies, and, of
course, hard disks. Each of these failures can be minimized through the
implementation of RAID-configured hard disk drives, error-correcting memory, redundant power supplies, or redundant controller adapters. In a
catastrophic system failure, however, it is likely that the entire data
backup would have to be restored to a new system or repaired server.
Because data is read from and written to hard drives on a constant basis, hard drives are frequently singled out as the most likely cause of a server hardware failure. To address this, Windows Server 2003 supports hot-swappable hard drives and RAID storage systems, allowing a failed drive to be replaced without server downtime, provided that the server chassis and disk controllers support such a change. Windows Server 2003 supports two types of disks: Basic disks, which provide backward compatibility, and Dynamic disks, which enable software-level disk arrays to be configured without a separate disk controller. Both Basic and Dynamic disks, when used as data disks, can be moved to other servers easily. This provides data or disk capacity elsewhere if a system hardware failure occurs and the data on these disks needs to be made available as soon as possible.
Note
If
hardware-level RAID is configured, the controller card configuration
should be backed up using a utility available through the vendor.
With most array
controllers today, dynamic reading of the disk configuration can be done
as long as the disks are placed into a new system using the same disk
order. If this is not supported, the controller can be moved to the new system, or the configuration might need to be re-created from scratch to complete a successful disk move to a new machine.
This process should
always be tested, verified, and documented in a lab environment before
being considered as a valid recovery option.
To protect against a
system failure, organizations need to have a full image backup that can
then be restored in its entirety to a new or repaired server system.
This also requires completing and documenting these steps in advance to
ensure that it can be completed and administrators understand the steps
involved.
Protecting Data in the Event of a Database Corruption
Data recovery is also needed in the event of database corruption in Exchange. Unlike a catastrophic system failure, which can be recovered from the last tape backup, data corruption creates a more challenging situation for information recovery. If data is corrupt on the server system, a restore from the last backup might also contain corrupt information in its database, so a data restore needs to predate the point of corruption. This typically requires the ability to restore the database from an older full backup tape and then recover the incremental data taken since that clean full backup.
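The logic for choosing which backups to restore is simple to reason about: the newest full backup that predates the corruption, plus every incremental taken after that full backup but before the corruption. The sketch below illustrates that selection; the backup records, dates, and labels are hypothetical:

```python
# Hypothetical sketch of choosing a restore chain that predates a corruption:
# the newest full backup taken before the corruption, plus the incrementals
# taken after that full backup but before the corruption was introduced.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Backup:
    taken: datetime
    kind: str            # "full" or "incremental"
    label: str           # tape or job label

def restore_chain(backups: list[Backup], corrupted_at: datetime) -> list[Backup]:
    fulls = [b for b in backups if b.kind == "full" and b.taken < corrupted_at]
    if not fulls:
        raise ValueError("No clean full backup predates the corruption")
    base = max(fulls, key=lambda b: b.taken)
    incrementals = [
        b for b in backups
        if b.kind == "incremental" and base.taken < b.taken < corrupted_at
    ]
    return [base] + sorted(incrementals, key=lambda b: b.taken)

history = [
    Backup(datetime(2007, 3, 4), "full", "FULL-WK09"),
    Backup(datetime(2007, 3, 5), "incremental", "INC-MON"),
    Backup(datetime(2007, 3, 6), "incremental", "INC-TUE"),
    Backup(datetime(2007, 3, 7), "incremental", "INC-WED"),
]
# Corruption introduced late on March 6: restore FULL-WK09, INC-MON, INC-TUE.
print([b.label for b in restore_chain(history, datetime(2007, 3, 6, 18, 0))])
```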
Providing the Ability to Restore a Message, Folder, or Mailbox
In other situations,
an organization might need to recover a single message, folder, or
mailbox rather than a full database. With most full backups of an
Exchange server, the restore process requires a full restore of all
messages, folders, and mailboxes. If an administrator has to work with
only a full image backup, typically a full restore must be performed on a
spare server and information extracted from the full restore as
necessary.
If message, folder,
or mailbox recovery is required on a regular basis, the organization
might elect to back up information in a format or process that provides
an easier method of information recovery. This might involve the
purchase and use of a third-party tape backup system, or a combination
of various utilities available in Exchange 2007 to restore individual
sets of information.
Assigning Tasks and Designating Team Members
Each particular server or
network device in the enterprise has specific requirements for backing
up and creating documentation around hardware and the service it
provides. To make sure that a critical system is being backed up
properly, IT staff should designate a single individual to monitor that
device and ensure the backup is completed and documentation is accurate
and current at all times. It is also wise to assign a secondary staff member with the same skill set to act as a backup when the primary staff member is unavailable, ensuring that there is no single point of failure among the IT staff performing these tasks.
Assigning only
primary and secondary resources to specific devices or services helps
improve the overall security and reliability of the device and services
provided to network users.
By limiting who can back up and restore data—and even who can manage
servers and devices—to just the primary and secondary qualified staff
members, the organization can rest assured that only competent, trained
individuals are working on systems they are assigned to manage. Even
though the backup and restore responsibilities lie with the primary and
secondary resources, the backup and recovery plans should still be
documented and available to the remaining IT staff for additional
training and a final means of support if needed.
Selecting the Best Devices for Your Backup
Each device used on any network can have specific backup requirements. As mentioned earlier, each assigned IT staff member should also be responsible for researching and learning the backup and recovery requirements of each device, to ensure that the backups contain everything necessary to recover from a device failure.
As a rule of thumb for network devices, the device configuration should be backed up whenever possible, ideally by using the device manufacturer’s configuration software, or at least by documenting the configuration for use as a reference should a device require reconfiguration.
Tip
It is also a best practice
to evaluate the hardware used in your environment to determine which
areas might be the most likely points of failure. Having spare devices
can reduce the overall downtime in case of a failure. When dealing with
Exchange 2007 considerations, these spare hardware devices can be pieces
such as hard drives to support a failed drive in a RAID configuration.
Understanding How Devices Affect Backups
Depending on how a given environment is architected, there might be several different options for how it can be backed up. Administrators lucky enough to have network attached storage (NAS) or storage area networks (SANs) for their Exchange 2007 servers might have significantly faster options for performing backups than administrators who are using direct attached storage (DAS). In many cases, the NAS or SAN devices are able to perform local snapshots, or the SAN can be backed up by a tape device that is plugged directly into the Fibre Channel fabric. This has great advantages when compared to backing up an Exchange 2007 server over the network. For example, Gigabit Ethernet allows for 1Gb/sec of throughput, whereas Fibre Channel not only offers speeds of 4Gb/sec, but is also a more efficient protocol.
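As a rough back-of-the-envelope comparison using those link speeds and the 80GB figure from earlier (and ignoring protocol overhead, disk performance, and backup-software processing, all of which slow real jobs down), a simple calculation shows why the fabric-attached approach is attractive:

```python
# Back-of-the-envelope comparison of raw transfer time for an 80GB backup.
# Ignores protocol overhead, disk speed, and backup-software processing, all
# of which make real-world jobs slower than these theoretical figures.
BACKUP_GB = 80

def transfer_minutes(link_gbits_per_sec: float) -> float:
    gigabytes_per_sec = link_gbits_per_sec / 8      # 8 bits per byte
    return BACKUP_GB / gigabytes_per_sec / 60

print(f"Gigabit Ethernet (1Gb/sec): {transfer_minutes(1):.0f} minutes")   # ~11 minutes
print(f"Fibre Channel    (4Gb/sec): {transfer_minutes(4):.0f} minutes")   # ~3 minutes
```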
Determining Backup Speeds and Times
The time needed to perform a backup of Exchange 2007 is influenced mostly by the speed of the backup device itself. Although vendors quote values for how many megabytes per minute their device can back up, this isn’t always an accurate value when backing up an Exchange 2007 server. It is always recommended to perform test backups of Exchange servers to determine the speed at which they can be backed up. By knowing how long jobs will take, an administrator can better select the backup window in which the backups will occur. As Exchange servers grow in terms of the storage used by mail data, the backups take longer to complete. Pay careful attention to the network utilization and to the backup device utilization so that you can watch for bottlenecks that cause backup jobs to take too long.
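One simple way to turn a test backup into a planning figure is sketched below; the sample numbers (a 10GB test job measured at 25 minutes, projected to 400GB of mail data) are purely illustrative:

```python
# Hypothetical planning sketch: measure throughput from a small test backup,
# then project the window needed for the full data set. All figures are examples.
def measured_rate_mb_per_min(test_gb: float, test_minutes: float) -> float:
    return (test_gb * 1024) / test_minutes

def projected_window_hours(total_gb: float, rate_mb_per_min: float) -> float:
    return (total_gb * 1024) / rate_mb_per_min / 60

rate = measured_rate_mb_per_min(test_gb=10, test_minutes=25)      # ~410 MB/min
print(f"Projected window for 400GB: {projected_window_hours(400, rate):.1f} hours")  # ~16.7 hours
```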
Tip
Consider backing up
Exchange 2007 to a backup server that is using disks as the media for
the backup. This is typically the fastest media that you will be able to
utilize for “over the network” backups. Then take the locally stored
backup and back that up to tape. Because you are backing up “cold” data,
there is no concern about performing the backup during the day. This
allows you to keep your backup window relatively short. The side benefit
is that if you ever experience a failure that requires you to restore
from the backups, you’ll be doing a disk-to-disk restore, which is much
faster than a tape-to-disk restore.
Validating the Backup Strategy in a Test Lab
Regardless of what
methodology you choose for backups of your Exchange 2007 environment, it
is critical to test the processes in a lab environment. The goal of
this validation is not only to prove that data can be backed up and
restored, but also to refine and document the exact steps used. It is
much easier to figure out how to perform a restore in the lab than it is
in production when hundreds or thousands of mailbox users are down. The
goal of a production restore is to be able to follow accurate,
validated instructions and not have to figure out what you need to do on
the fly.