4. Diagnostics and Troubleshooting Methods and Tools
FRS has been one of the most difficult components to
troubleshoot: it has many dependencies, its logs are
cryptic and almost too verbose, and few administrators understand FRS
internals well enough to read the logs anyway. Let's look at some
typical troubleshooting methods and some new tools Microsoft has
provided.
Methods
You can test the overall health of FRS in a couple of
ways. A good way to see who is replicating to whom is to create a text
file (empty if you like), name it after the DC it is on (such as DC1.txt), and place it in the %systemroot%\sysvol\sysvol
directory, inside the domain-named folder, which is the part of the tree that FRS replicates. Do this on every DC in the domain, and then wait for
end-to-end replication to occur. Every DC should have a text file from
every other DC. For instance, if four DCs are in the domain—DC1, DC2,
DC3, and DC4—you would create DC1.txt on DC1, DC2.txt on DC2, and so on.
After replication, each DC should have DC1.txt, DC2.txt, DC3.txt, and
DC4.txt. Table 1 shows the results from an example domain in which
replication is not healthy. DC1 and DC3 receive inbound replication from
all DCs. DC2 isn't getting inbound replication from DC4, and DC4 is not
getting inbound replication from anyone.
Table 1. Using a Text File in SYSVOL to Test FRS Replication Health
DC Name | DC1 | DC2 | DC3 | DC4 |
---|---|---|---|---|
Text files appearing on each DC after replication | DC1.txt, DC2.txt, DC3.txt, DC4.txt | DC1.txt, DC2.txt, DC3.txt | DC1.txt, DC2.txt, DC3.txt, DC4.txt | DC4.txt |
Result | Inbound replication from all DCs | No inbound replication from DC4 | Inbound replication from all DCs | No inbound replication from any DC |
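With more than a handful of DCs, the comparison in Table 1 becomes tedious to do by eye. The following is a minimal Python sketch of the same check, assuming you have already gathered the list of marker files found on each DC (the `missing_inbound` function and the `table_1` dictionary are illustrative names, not part of any Microsoft tool). A missing marker only shows that the file never arrived by any path; it does not by itself identify which connection is broken.

```python
from typing import Dict, Iterable, List


def missing_inbound(observed: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each DC to the DCs whose marker file never arrived on it."""
    dcs = sorted(observed)
    report = {}
    for dc in dcs:
        # File names on Windows are case-insensitive, so compare in lowercase.
        present = {name.lower() for name in observed[dc]}
        report[dc] = [
            partner for partner in dcs
            if partner != dc and f"{partner}.txt".lower() not in present
        ]
    return report


# Marker files seen on each DC, copied from Table 1.
table_1 = {
    "DC1": ["DC1.txt", "DC2.txt", "DC3.txt", "DC4.txt"],
    "DC2": ["DC1.txt", "DC2.txt", "DC3.txt"],
    "DC3": ["DC1.txt", "DC2.txt", "DC3.txt", "DC4.txt"],
    "DC4": ["DC4.txt"],
}

for dc, partners in missing_inbound(table_1).items():
    status = "healthy" if not partners else "no inbound from " + ", ".join(partners)
    print(f"{dc}: {status}")
```

Run against the Table 1 data, this reports DC1 and DC3 as healthy, DC2 as missing the file from DC4, and DC4 as missing the files from DC1, DC2, and DC3.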
We know that DC4 has inbound problems and that its outbound
replication to DC2 doesn't work, so concentrate on DC4. Check for DNS and AD
replication errors on DC4 and DC2, and then concentrate on FRS. One
powerful tool Microsoft has given us is MPS Reports, located at http://microsoft.com/downloads/details.aspx?FamilyId=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en.
There are several versions: clusters, DS, FRS, network, and so on. Get
the FRS and DS versions and run them on the problem DCs (DC4 and DC2 in
this example). These are simple executables that run a variety of
command-line utilities and wrap the output in a single cab file located
in %systemroot%\MPS Reports. Now comes the hard part: reading
the logs and, worse yet, figuring out what they mean. To do this
effectively, you need that FRS PhD degree, achieved mostly through
experience.
Advanced Diagnostic Tools
You can collect logs from suspect DCs in several ways: the NtFrs_xxxxxx.log files in %systemroot%\debug,
the output generated by the NTFRSUTL.exe tool, and the event logs.
The problem is interpreting them, which takes experience and enough
depth of knowledge to apply that information and resolve the problem.
Microsoft has now provided four powerful tools to help the average Admin
diagnose and troubleshoot FRS problems: Sonar, Ultrasound, FRSDiag, and
the Ultrasound help file.
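Before turning to those tools, a first pass over the raw debug logs can at least narrow down where to look. The sketch below is a rough filter, assuming the logs follow the NtFrs_xxxxxx.log naming mentioned above; the keyword list and the `scan_frs_logs` name are assumptions made for illustration, not an FRS-defined severity scheme.

```python
import glob
import os
from typing import Iterator, Tuple

# Assumed coarse filter; FRS does not define this keyword list.
KEYWORDS = ("error", "warn", "fail")


def scan_frs_logs(debug_dir: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, text) for log lines containing a keyword."""
    pattern = os.path.join(debug_dir, "NtFrs_*.log")
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, "r", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(word in line.lower() for word in KEYWORDS):
                    yield os.path.basename(path), number, line.rstrip()


if __name__ == "__main__":
    # Falls back to C:\Windows if the SystemRoot variable is not set.
    root = os.environ.get("SystemRoot", r"C:\Windows")
    for name, number, text in scan_frs_logs(os.path.join(root, "debug")):
        print(f"{name}:{number}: {text}")
```

A filter like this only reduces the volume of text to read. Interpreting the lines it finds still requires the FRS knowledge that the tools below try to supply.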
Sonar
Sonar (see Figure 3)
is a GUI-based tool that monitors FRS data such as file backlog,
errors, missing SYSVOL shares, and so forth for all DCs in the domain,
and presents it in a nice table format with options for refresh
frequency and categories such as replication status. You can sort the
table to show errors, replication consistency, and other factors.
Ultrasound
Ultrasound (see Figure 4)
is a GUI-based tool that is a step beyond Sonar. Ultrasound stores its
data in a SQL database (MSDE will work), so you can view the history of
a problem, and it can send e-mail when a failure occurs, among other
features. Ultrasound's real value is in its
capability to capture SYSVOL- and DFS-related replication data and
present it in a clean, easy-to-read format. Notice in this example that
on the far right side of the screen, Ultrasound has listed all
FRS-related warnings and errors for all members of the replica set we
are monitoring. This is much easier than scanning event logs.
FRSDiag.exe
FRSDiag.exe (see Figure 5)
is a tool with a simple UI that allows you to click check boxes for
types of data you want, and then it runs the appropriate utility to get
the data (sort of like customizable MPS Reports). It also produces an
FRSDiag.txt file that is similar to the output of the DCDiag.exe tool used for AD
diagnostics. A sample output is shown here:
——————————————————————————————
FRSDiag v1.7 on 12/11/2003 11:43:23 AM
.\qtest-dc22 on 2003-12-11 at 11.43.23 AM
——————————————————————————————
Checking for minimum FRS version requirement ... passed
Checking for errors/warnings in ntfrsutl ds ... passed
Checking for Replica Set configuration triggers... passed
Checking for suspicious file Backlog size...
ERROR : File Backlog TO server "QTEST\QTEST-DC6$" is : 2770248
:: Unless this is due to your schedule, this is a problem!
failed with 1 error(s) and 0 warning(s)
Checking Overall Disk Space and SYSVOL structure (note: integrity is not checked)... passed
Checking for suspicious inlog entries ... passed
Checking for suspicious outlog entries ...
ERROR: 101.80% (2994 out of 2941) of your outlog contains Security ACL events.
See KB articles below for further information:
279156 - The Effects of Setting the File System Policy on a Disk Drive or Folder
284947 - Antivirus Programs May Modify Security Descriptors and Cause Excessive Replication of FRS Data in SYSVOL and DFS
......... failed
Checking for appropriate staging area size ... passed
Checking NTFRS Service (and dependent services) state...passed
Checking NTFRS related Registry Keys for possible problems...Checking Repadmin Showreps for errors...
DC=Qtest,DC=cpqcorp,DC=net
Atlanta\QTEST-DC99 via RPC
objectGuid: bde1b194-93d1-420d-ae14-3483e9eb8fb7
Last attempt @ 2003-12-11 10:54.16 failed, result 8524:
The DSA operation is unable to proceed because of a DNS
lookup failure.
Last success @ 2003-12-03 15:16.49.
189 consecutive failure(s).
CN=Configuration,DC=Qtest,DC=cpqcorp,DC=net
Atlanta\QAMERICAS-MDC1 via RPC
objectGuid: 1388a125-9318-4992-aa53-1a0519e24d0a
Last attempt @ 2003-12-11 10:54.14 failed, result 1722:
The RPC server is unavailable.
Last success @ 2003-11-13 19:20.23.
665 consecutive failure(s).
Rather than working through a lengthy evaluation of cryptic log files, you can see several issues at a glance:
Server Qtest-DC6 has a backlog of 2,770,248 files, so it is far behind.
The outlog is full of security ACL events. Note that FRSDiag lists two handy Microsoft KB articles to help resolve this.
There is excessive replication of FRS data.
AD replication is failing with a DNS lookup failure and an RPC server unavailable error, which probably accounts for some of the other problems.
Now we have direction: Fix the AD replication problem, run FRSDiag again, and work through the problems.
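If you run FRSDiag on many DCs, pulling the failures out of each FRSDiag.txt by hand gets repetitive. Here is a minimal Python sketch that extracts the ERROR lines and the repadmin result codes, assuming the output follows the layout of the sample shown above (the patterns are inferred from that sample only, and `summarize_frsdiag` is a hypothetical helper name, not something FRSDiag provides).

```python
import re
from typing import Dict, List

# Patterns inferred from the sample output; other FRSDiag versions may differ.
ERROR_LINE = re.compile(r"^\s*ERROR\s*:\s*(.+)$")
REPL_FAILURE = re.compile(r"failed, result (\d+):")
CONSECUTIVE = re.compile(r"(\d+) consecutive failure\(s\)")


def summarize_frsdiag(text: str) -> Dict[str, List]:
    """Collect ERROR messages and replication failure details from FRSDiag output."""
    summary = {"errors": [], "result_codes": [], "consecutive_failures": []}
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            summary["errors"].append(match.group(1).strip())
        match = REPL_FAILURE.search(line)
        if match:
            summary["result_codes"].append(int(match.group(1)))
        match = CONSECUTIVE.search(line)
        if match:
            summary["consecutive_failures"].append(int(match.group(1)))
    return summary


# A few lines copied from the sample output above.
sample = """\
Checking for suspicious file Backlog size...
ERROR : File Backlog TO server "QTEST\\QTEST-DC6$" is : 2770248
Last attempt @ 2003-12-11 10:54.16 failed, result 8524:
189 consecutive failure(s).
Last attempt @ 2003-12-11 10:54.14 failed, result 1722:
665 consecutive failure(s).
"""

print(summarize_frsdiag(sample))
```

The result codes (8524 and 1722 in the sample) are the quickest pointer to the underlying AD replication problem, which is why the sketch collects them separately from the FRS-specific ERROR lines.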
Ultrasound Help File
Simple, yet perhaps the most powerful of all the tools,
this help file compiles Microsoft's experience and
knowledge into descriptions of errors and problem conditions, along with
their causes and solutions. The file also contains FRS operation basics,
terminology, and, of course, info about the Ultrasound, Sonar, and
FRSDiag tools. This is a desktop reference for all FRS events, errors,
and problem conditions, and should help you resolve FRS issues without
involving tech support. Figure 6
shows one of my favorites: the Event ID list. Microsoft has listed all
FRS-related event IDs in the left pane. In this example, I selected
13568, the journal wrap error. In the right pane, you see a description
and the resolution. There is no need to search the Microsoft site or Google for the
KB. It's right there.
The FRS troubleshooting section of the help file, illustrated in Figure 7, is another powerful feature.
Here you see how to resolve a corrupt FRS database. Microsoft has
collected its considerable experience and documented it in this help
file to help the rest of us resolve FRS issues without calling for
support.