7. Designing an Efficient Replication Topology
This section examines some best practices concerning the design of the replication topology. One of the things I've noticed in reviewing designs and observing topologies in place from a troubleshooting perspective is that the topology is often made much more complex than it needs to be. “Behind every simple problem is a complex solution” certainly holds true in replication topology design. Let's look at a checklist of some best practices in topology design, and then I'll detail some topologies I've observed that illustrate poor planning or implementation (or both) and how to fix them.
Best Practices in Replication Topology Design
This section is a summary of my experience specifying
replication design topology solutions and troubleshooting broken
designs, mixed in with recommendations from Microsoft.
Keep the Topology as Simple as Possible
Sounds logical, but this principle seems to be ignored in the heat of battle, even in simple environments. Consider Figure 5,
which shows the actual topology snapshot of one of HP's customers,
generated by HP OpenView Operations for Windows Active Directory
Topology Viewer (ADTV). This tool connects to the domain and generates a
3D image of the replication topology. Although it will display sites,
DCs, site links, and connection objects, I did not display connection
objects in this figure for the sake of clarity. Without even looking at
any logs or problem descriptions, this picture tells the story. The
squares are sites, and the lines between the squares are site links. If
the intent is to force replication along logical paths in an organized
fashion, this is a failure. After identifying the sites that were “hub” sites, and the associated sites that replicate to those hubs, we reconfigured the site links to produce the topology shown in Figure 6. A simple comparison should show even the novice that the replication shown in Figure 6 will be more efficient and trouble-free than that in Figure 5. Obviously, a tool of this nature makes a big difference in your ability to quickly diagnose these kinds of problems.
Rename (But Don't Delete) and Reconfigure the Default-First-Site-Name Site
This problem is a result of a normal process of
building sites and site links. When you build a site, it must be
assigned to a site link. Yet you can't build a site link without a site
in it. Microsoft provided a default site, Default-First-Site-Name, as
well as a default site link, DefaultIPSiteLink. Normally you will create
sites and assign them to the DefaultIPSiteLink. After the sites are
defined, you can create the necessary site links and add the sites you
created. Unfortunately, this process creates a problem. After all the
sites are assigned to new site links, you still have the original
DefaultIPSiteLink containing all of the sites. Left in this situation,
it can cause replication to break. A good example of what can happen in
this case is described later in this section as “Case Study #1.”
Structure Site Link Costing
Design a costing structure that forces replication across your network along a path that makes sense. I've seen documents from Microsoft as recently as July 2003 advising you to look at the network and use a formula to calculate cost based solely on available bandwidth.
Figure 7 shows a complex 3-tier topology with the cost between tier 2 and tier 3 set to 100, tier 1 to tier 2 set to 50, and between tier 1 sites set to 25. This forces replication from Calgary to Barcelona to go to Seattle, then Boston, London, Madrid, and finally to Barcelona. This follows the fastest physical path for the network, but it also reflects deliberate decisions about where you want traffic routed, rather than simply letting a calculation determine it without any planning.
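To make the costing concrete (a rough illustration only; the formula Microsoft's guides usually cite is cost = 1024 / log of the available bandwidth in Kbps, and the tier assignments here are my assumption, with Calgary and Barcelona at tier 3, Seattle and Madrid at tier 2, and Boston and London at tier 1), the KCC simply adds the costs of the site links along each candidate route and uses the cheapest total:
Calgary-Seattle (100) + Seattle-Boston (50) + Boston-London (25) + London-Madrid (50) + Madrid-Barcelona (100) = 325
Any alternative route adds up to a higher total, so the costing structure, rather than the raw bandwidth numbers, is what ends up dictating the replication path.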
Identify Locations That Can Be Combined to a Single Site
This follows the example Compaq used, as noted
previously in this section. Lumping multiple locations together changes
replication from intersite with scheduled replication, compressed data,
and site link costing to intrasite with no schedule, uncompressed data,
and no way to force a replication path. In addition, intrasite replication includes urgent replication, forcing immediate replication for LSA secrets, account lockouts, and RID changes. Windows Server 2003 also changes the intrasite notification delays. In Windows 2000, there was a 5-minute delay before a DC notified its first replication partner in the site of a change, followed by 30-second intervals for notifying the other DCs in the site. Windows Server 2003 shortens these delays so that the first partner is notified in 15 seconds, with a 3-second delay between notifications to the other DCs in the site.
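If you need to verify whether a Windows 2000 DC is using the default initial notification delay or an override, the setting lives in the registry; a quick check (the value is absent when the default of 300 seconds is in effect, and any change should be tested before being rolled out):
reg query "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Replicator notify pause after modify (secs)"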
Identify GC and DC Placement
Identify sites that require GCs and/or DCs for each
domain. If you have a single domain, as far as replication is concerned,
a GC is the same as a DC. For multiple-domain structures, GCs will generate more replication traffic than DCs, as a GC replicates approximately 70% of the data from all other domains in the forest. Make sure you plan for this in the network design. Note that Windows Server 2003 reduces the instances that require a full sync of the GC compared with Windows 2000, and thus makes this configuration easier.
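As part of this planning, it helps to know which servers currently hold the GC role and where they sit. A quick way to list them with the Windows Server 2003 dsquery tool (the site name below is a placeholder):
rem List every GC in the forest
dsquery server -forest -isgc
rem List the GCs in a particular site
dsquery server -site Boston -isgc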
Identify Sites with “Poor Connectivity”
Some companies place GCs in sites with poor connectivity simply to shield the users from slow WAN links, scheduling the GCs to replicate at nonpeak times. However, sites with small numbers of users or insufficient bandwidth, which cannot justify a GC but could justify a DC, can take advantage of Windows Server 2003's Universal Group Membership Caching (GC Caching) feature, as described previously in this section on replication.
This feature gives users the performance of having a local GC,
without incurring the expense and network load of having a GC at that
site. Note that this is a fairly complex feature and should be
thoroughly tested so the Administrator is able to properly control the
caching feature and understand its implications. You can't just turn it
on—you need to monitor it. At some point, usually when the number of
users exceeds the capability of the DCs in the site to handle the
caching feature with acceptable performance or when the network is
upgraded, you can consider putting a GC in that site.
Design Site Link Bridges to Exclude Poorly Connected Sites
Poorly connected sites should be excluded from site link bridges. Their inclusion forces the KCC to try to build transitive connections and, if the links are unavailable, results in a lot of errors in the event log and possible replication failures. Remember, the default is to put all sites in a single site link bridge, so excluding sites requires proactive action on the part of the Administrator. Including sites connected via dial-up or unreliable WAN links will result in the KCC trying to build transitive links, failing, and generating errors. It doesn't make sense to include these sites in a site link bridge because they are the most likely to fail, and the purpose of site link bridging is to provide replication in case of failure. As pointed out previously in this section, in a true hub-and-spoke configuration, there isn't a real justifiable need to enable site link bridging.
Determine How Replication Will Be Handled Through Firewalls, If Required
The issue of pushing AD replication through firewalls
really has no one good answer. You might want to do this if you have
remote sites protected by a firewall, or if you want to replicate to DCs
that sit in a DMZ, outside the corporate firewall. Although you usually
shouldn't replicate sensitive data outside the corporate firewall,
sometimes this might be desired. The best, most secure answer, in my opinion, is to connect through the firewall with a VPN connection. An end-to-end VPN connection using IPSec encryption gives you a secure channel through the firewall. Microsoft
KB article 832017—“Port Requirements for the Microsoft Windows Server
System”—identifies the ports that need to be open for replication. If
you decide to do this, remember that the firewall's purpose is to
protect company infrastructure. By opening ports, you are poking holes
in the firewall, thus compromising that protection. Make sure you can
justify this action. Another good document on this issue is located at http://www.microsoft.com/serviceproviders/columns/config_ipsec_p63623.asp.
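If you do open the firewall rather than tunnel, one technique Microsoft documents for keeping the opening small is to pin AD replication to a fixed RPC port instead of the dynamic RPC range. A minimal sketch, assuming port 56789 is unused in your environment (set it on each DC involved, restart the DC, and test before deploying broadly):
reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "TCP/IP Port" /t REG_DWORD /d 56789
The RPC endpoint mapper port (TCP 135) still has to be reachable so that the partner DC can discover the fixed port.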
Determine Whether Replication Connection Load Balancing Is Required
The issues here are managing large numbers of connections to a single site and its bridgehead server (BHS), and the difficulty of load balancing those connections.
If your environment contains more than 100 sites replicating to a single BHS at the hub in a single domain, or if it contains remote sites over slow or unreliable links that require manual management of the replication connections, you should determine the measures necessary to accomplish this. In Windows 2000, follow the recommendations of the “Branch Office Deployment Guide” whitepaper (now updated for Windows Server 2003) and the management scripts associated with it. In Windows Server 2003, the ADLB tool (available in the Windows Server 2003 Resource Kit tools) can be used to balance connections across BHSs and to stagger replication schedules.
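To see which DCs are currently acting as bridgeheads and how they are loaded, the Windows Server 2003 version of repadmin can report them directly; a sketch (the server name is a placeholder for a DC in the hub site):
repadmin /bridgeheads DC01.corp.example.com /verbose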
Monitor Replication
Monitoring replication is an absolute requirement,
yet often is not done until a problem occurs. Recently, a coworker was
talking to a customer reporting a replication problem. During the course
of troubleshooting, my friend discovered one DC that hadn't replicated
since October 2002—more than a year! Replication problems are often
masked. If you don't notice a problem, there doesn't seem to be anything
to investigate. Replication problems can be manifested by password changes not being propagated; Group Policy changes not being applied to certain sites or users; or additions, modifications, or deletions of user accounts not appearing in some sites. Some basic tools include Replication Monitor from the Windows 2000 and 2003 support tools, the event logs, third-party utilities, and command-line tools such as repadmin.exe and dcdiag.exe. Repadmin is an excellent tool to quickly determine the current replication status of a DC and can be executed remotely. The Windows Server 2003 version offers enhanced features such as the removal of lingering objects and the /replsummary command, which reports the status of replication on all DCs using one command. This version can be executed in a Windows 2000 forest from a Windows XP client.
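A basic health pass with these tools might look something like the following (the DC name is a placeholder, and the exact output varies by version):
rem Summarize replication status across all DCs in one report (Windows Server 2003 repadmin)
repadmin /replsummary
rem Show inbound replication partners and the last success or failure for a specific DC
repadmin /showrepl DC01.corp.example.com
rem Run the replication-specific diagnostics against that DC
dcdiag /test:replications /s:DC01.corp.example.com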
Identify Site Coverage for Sites Without a DC
If you have sites defined for client Site Affinity but without a DC in the site, identify the DC and site that are providing coverage. The “closest” site (in terms of site link cost) that contains a DC automatically covers the DC-less site, and clients find that DC through the DC Locator process. You can determine which DC is providing coverage by any of the following procedures:
- From the client, note the value of the LOGONSERVER environment variable
- From the client, execute the command
nltest /server:<client name> /dsgetsite
<client name> is the name of the client you want the command executed on.
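For example, the following commands, run on the client, show the DC it authenticated to and ask the DC Locator which DC and site it resolves (the domain name is a placeholder):
rem The DC this client authenticated to
echo %LOGONSERVER%
rem The DC Locator's answer; the output includes the DC's site and the client's site
nltest /dsgetdc:corp.example.com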
In addition, a DC in another site will assume responsibility for each DC-less site. If you are planning to use Site Affinity for clients without a DC in the site, you must make sure the client is authenticating to the DC that you think it's using. You should also test by making the authenticating DC unavailable and then checking which DC becomes authoritative for that site. Make sure this works according to your design.
This section provides a few examples of customers with broken replication topologies, how we fixed the problems, and a summary of what could have been done to avoid them in the first place.
Case Study #1
This case study came from a customer who was concerned about replication performance between two sites. The diagram in Figure 8
shows the situation. The topology has only four sites: Boston, Phoenix,
Albany, and Richmond. The physical network has comparable bandwidth
between all sites and is underutilized. The customer complained that
replication between the hub site in Albany and satellite site in Boston
takes almost twice as long to complete as between Albany and Richmond.
We examined the physical network, client load at the
sites, server configuration, and so on, but couldn't find the solution
until we looked at the replication topology. Note that three site links
contain the hub site of Albany and each of the remote sites of Richmond,
Phoenix, and Boston, as well as site links from Richmond to Phoenix and
from Phoenix to Boston. The links from Albany to the other sites have a
cost of 100, whereas secondary links Phoenix to Boston and Richmond to
Phoenix have a cost of 200. No problem.
However, the customer had forgotten about the DefaultIPSiteLink, which still contained all four sites and still had the default cost of 100. The KCC was confused. When replicating from Richmond to Albany, the KCC saw two options: the DefaultIPSiteLink and the Albany-Richmond link, both with the same cost. If it picked the DefaultIPSiteLink, it could replicate to any of the other three sites, not necessarily Albany. This same scenario held for each of the other sites. The DefaultIPSiteLink gave the KCC a way to route replication along paths other than what was intended.
The solution was to do the following:
- Delete one of the existing site links; we picked Albany-Boston.
- Rename the DefaultIPSiteLink to Albany-Boston.
- Remove all sites from the Albany-Boston link, except the Albany and Boston sites.
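The rename itself can be done in Active Directory Sites and Services or from the command line with dsmove from the Windows Server 2003 administration tools; a sketch, with the configuration naming context shown as a placeholder for your forest:
dsmove "CN=DEFAULTIPSITELINK,CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=corp,DC=example,DC=com" -newname "Albany-Boston"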
I would never recommend deleting the DefaultIPSiteLink, and I'm sure Microsoft would back me up on this, although when I asked Microsoft about it, the opinions were divided over whether it would do any harm. I prefer to err on the side of safety, so we just rename it. It keeps the same GUID, so if there are other functions it's used for, they will still work. Likewise, I would not recommend deleting the Default-First-Site-Name site. Rather, just rename it. When DCs are promoted and have no subnet mapped to a site, they are put in the Default-First-Site-Name site. If it has been renamed, DCs will still be put in that site because it still has the same GUID, which is used for identification. Thus, if you see DCs being put into the Chicago site, for instance, it's because they are on subnets that aren't mapped to a site and the Chicago site was originally the Default-First-Site-Name site.
Case Study #2
This customer had a serious and perplexing problem. The company had implemented a fairly complex multi-tier topology by dividing the United States into four geographic regions, as
illustrated in Figure 9.
Sites in these four regions made up the third-level tier. The company
wanted tier 3 sites to replicate to a specific tier 2 site in the
correct geography, and tier 2 sites to replicate to a specific tier 1
site.
The design concept was sound, but the problem was the implementation. As shown in Figure 9,
the company collected multiple sites in site links at the tier 3 and
tier 2 levels. For instance, LAXLink and NYCLink are tier 2 site links,
whereas OMHLink, DENLink, PITLink, and so on are tier 3 links. For
replication to work, the tiers had to be connected together, so the
customer placed a tier 2 site in the tier 3 site it was to be channeled
through. For example, the Atlanta site, ATL, was included in the tier 2
NYCLink, as well as the tier 3 ATLLink. Note that ATLLink contains four
sites: ATL, WDC, RAL, and RCH.
Although this didn't make much sense to me, it
worked. That is, it worked until the DC in Raleigh (RAL) had network
problems and was offline for a few days. This caused the KCC to reroute
the topology so that instead of the other DCs in ATLLink replicating to
Atlanta, they now replicated through Washington D.C. (WDC) and then to Atlanta. In addition, the KCC at a DC (not a GC) in the LIT site generated connections to all DCs in all sites. Through
experimentation, the customer found that the only way to force it back
was to demote the Raleigh DC and repromote it; after that, the customer
deleted all the connections at the LIT DC and regenerated them. That
seemed like a simple irritation until another DC in another tier 3 site
went offline for a few days. The same thing happened—the KCC rerouted
replication that did not follow the design. The company was justifiably
concerned because this meant that to keep the replication topology
intact, the company would have to rebuild any DC that went offline for a
day or so. This obviously wasn't acceptable.
The reason for this behavior was that more than two
sites were collected into a single site link. Combined with some sites
being in multiple site links, it gave the KCC the freedom to choose
another routing scheme when it encountered a failure. The important rule
here is that if you want replication to follow a specific path, you
have to give the KCC explicit instructions and not allow it any
latitude.
We fixed the problem, as shown in Figure 10,
by creating a true hub-and-spoke topology with no more than two sites
in any site link. Although only the Atlanta topology is shown, the
others were reconfigured in a similar manner. Note the naming standard used for each link, which names it after the two sites it contains.
The only deviation from this configuration would have been if the
company had three or more sites at tier 1. In this case, all tier 1
sites would be in a single Core Link. This solution immediately forced
replication the way the company wanted it. Figure 11
is a snapshot from ADTV showing the cleaned up topology. The squares
are sites, the server icons are DCs, and the curved lines are connection
objects. Note the simplicity of the topology. The two core sites in the
center connect to the second-tier sites that in turn connect to the
third-tier sites. As shown in the circled area in the figure, the tier 2
site becomes a hub with tier 3 sites as satellites or spokes.
The HP topology is a multi-tier topology utilizing
the network configuration to force replication. Ten core sites are
located globally in Tokyo; Swindon, UK; Boeblingen, Germany; Grenoble,
France; Singapore; and Sydney, Australia; as well as the four sites in
the United States in Palo Alto, California; Houston, Texas; Atlanta,
Georgia; and Littleton, Massachusetts. Lower-level sites are connected
to one of these core sites, and costing is implemented to force
replication from a lower-level site up to the core site that serves the
site of the target DC. Because the topology was created to take
advantage of the physical network, this forces replication expeditiously
through the network.
HP also adopted the philosophy that it would use the
bandwidth of the network as much as possible to reduce the need for DCs
in every physical location. If a location has at least a 2Mbps link to
another location, the two locations can be combined into a single site,
sharing the DC, GC, and so on. This has a lot of implications, as noted
previously, such as performing urgent replication and sending
uncompressed data across the WAN without the capability to schedule it.
However, this has been in place since Compaq's rollout of Windows 2000
in early 2000, and it has served well. I've not seen any others who have taken this approach, even though HP realized reduced hardware costs as well as associated costs, such as administration, support, and maintenance.
Reed Elsevier's site topology strategy is a four-tier approach, as shown in Figure 12, with the tiers defined as follows:
Tier 1:
Consists of two top-level network hubs, one located in Oxford, England
and the other in Dayton, Ohio. The FSMO role holders are located in
these sites as well.
Tier 2:
Consists of sites containing major business units that have a direct connection to the backbone.
Tier 3:
Consists of sites that house medium-sized business offices that have a
network connection to their parent business unit, a tier 2 site.
Tier 4:
Consists of sales and field offices that connect to a tier 3 site. Some
physical locations of these sites are not AD sites, and depend on the
KCC to locate a “closest site” for authentication and other DC tasks as
described previously in this section.
The strategy here is that replication will flow up to the parent site. That is, tier 4 will replicate up to tier 3, tier 3 will replicate to tier 2, and tier 2 will replicate to tier 1. The replication schedule is as follows:
Tier 1 to tier 1:
15 minutes.
Tier 2 to tier 1:
15 minutes.
Tier 3 to tier 2:
Depends upon the speed of the link, as shown in Table 1.
Table 1. Reed Elsevier's Replication Schedule
Link Speed        | Interval   | Schedule
------------------|------------|-----------------
1Mbps or faster   | 15 minutes | Always available
768Kbps or faster | 1 hour     | Always available
512Kbps           | 2 hours    | Always available
256Kbps           | 4 hours    | Always available
192Kbps           | 4 hours    | Nonpeak time
128Kbps           | 4 hours    | Nonpeak time
64Kbps or less    | 4 hours    | Nonpeak time
Reed Elsevier also identified exceptions to these
schedules and intervals. The exceptions covered cases when the BHS
became overloaded, serving more than 25 sites or more than 50
replication partners, in which case the company would consider adding
another site at that location. Note that because this will be a Windows Server 2003 forest, the company could also use ADLB in these cases.
The only caution with this schedule is not to apply these rules without looking at the big picture. This works well if the network is clearly defined so that the tiers match the network link speeds without exception. The problem is, I've seen a lot of companies with a
lot of exceptions. Make a diagram that shows the sites, their tier
level, the associated network link speed (and available bandwidth), the
location of the site links, and the replication interval and schedule.
See if it makes sense. The diagram might cause you to redesign your
topology or upgrade certain network links to match the topology you
want. My advice here is to never make rules like this and then forget about them. Make sure you test this in a lab, of course, but also monitor it
after it has been put in place. Monitor the network utilization, and see
if the replication is healthy. Make sure Group Policies are applied
consistently, logon scripts are updated in a timely fashion, password
changes are replicated expeditiously, and the event logs are clean.
In monitoring the network, keep in mind that
replication probably won't have a big impact if the design is sound and
replication is healthy. In the winter of 2001, I attended a conference
in which a Microsoft program manager claimed that up to that point, not a
single customer had complained about AD replication causing high
network utilization.
AD replication replicates only changes at the
attribute level, so the traffic should be minimal. Of course, the
variables are the number of users, GCs, DNS servers, computers, servers,
and so on at a site and the available network bandwidth. One customer indicated to me that the network Administrators felt AD replication was the cause of high network utilization at a couple of sites and were trying to adjust schedules to fix it. The problem was, they had no proof.
Because the Admins didn't understand AD replication, it was an easy
suspect. I told my contact to tell the Admins to prove it was AD that
was causing the problem. We helped the company by simply scheduling
replication so it didn't occur at those sites at the problem time
periods, thus eliminating AD as the culprit.
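One low-effort way to get that kind of proof is to watch the built-in NTDS replication counters on the DCs at the affected sites during the problem periods. A sketch using typeperf, which ships with Windows XP and Windows Server 2003 (the sample interval and count are arbitrary):
rem Sample inbound and outbound replication traffic every 15 seconds for an hour
typeperf "\NTDS\DRA Inbound Bytes Total/sec" "\NTDS\DRA Outbound Bytes Total/sec" -si 15 -sc 240
If these counters stay low while the link is saturated, replication isn't the culprit.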
Tip
Implementation of a multi-tiered topology must be done carefully to get the desired results. The topology previously cited in Case Study #2 failed because the company made it too complicated and gave the KCC too much freedom. Simply create a site link from every tier 1
site to the tier 2 sites it should replicate with. Create a site link
from every tier 2 site to every tier 3 site it should replicate with,
and so on. The golden rule of site link creation is that each site link should contain only two sites, except for a core site link such as HP's, where there are more than two sites in tier 1. This forces the KCC to replicate in the manner you want and gives it no freedom to choose, as it had in Case Study #2.