Troubleshooting crawls on content sources can be a frustrating and time-consuming task, which is why it is important to understand the crawl and indexing process and the tools that reveal problems or errors at each stage. The following sections cover the crawl logs, then the crawl reports, and finish with Diagnostic Logging.
The Search Administration page uses the Crawl History Web Part to present a summary of crawl activities, as shown in Figure 1.
The hyperlinked content sources open the Edit Content Source page for
that content source, and the hyperlinked numbers in the Success or All
Errors columns open filtered views of the crawl log to display just
those items.
Note:
By default, the Web Part shows the last six crawls per page, but that number can be modified by editing the Web Part.
1. Using Crawl Logs
The crawl logs are your main tool for determining which crawl settings need modification, including timeouts and crawler impact rules. They are also your primary troubleshooting tool for diagnosing problems such as why the crawler is not accessing certain documents or sites. On the positive side, you can use the logs to confirm that an individual document was crawled successfully; if it was, but a user cannot find the document via a query, you can focus your troubleshooting efforts on helping the user refine the query. The crawl logs can be accessed from the Crawl Log link in the Crawling section of the Search Administration page or from the context menu for a content source on the Manage Content Sources page.
Note:
Crawl logs may be your first indication of problems on your sites. If the crawler (which has Read permissions for everything) cannot access items, then users cannot access them either. URLs that exceed protocol limitations may first surface as crawl log errors.
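Because URL length limits vary by protocol and browser, it can help to check for over-long URLs before the crawler reports them. Here is a minimal Python sketch, assuming a plain-text file of URLs (one per line); the file name and the 2,083-character threshold (Internet Explorer's classic URL cap) are illustrative assumptions you should adjust for your environment.

    # flag_long_urls.py -- a sketch, not a product tool; the file name and
    # threshold below are assumptions.
    MAX_URL_LEN = 2083  # Internet Explorer's classic URL cap; adjust as needed

    def flag_long_urls(path):
        """Yield (length, url) for every URL longer than MAX_URL_LEN."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.strip()
                if len(url) > MAX_URL_LEN:
                    yield len(url), url

    if __name__ == "__main__":
        for length, url in flag_long_urls("crawled_urls.txt"):
            print(f"{length:>6}  {url}")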
Each hyperlinked number on a crawl log page opens a filtered view of the log: clicking a number filters the view by that status type, without regard to date or time. You can then apply additional filters from the drop-down boxes.
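The same filter-then-drill workflow can be reproduced offline. The Python sketch below assumes you have exported crawl log entries to a CSV file with url, status, message, and timestamp columns; that layout is a hypothetical convention for illustration, not a format the product produces for you.

    # filter_crawl_log.py -- mirrors the UI's filtered views against a
    # hypothetical CSV export (columns: url,status,message,timestamp).
    import csv
    from datetime import datetime

    def filter_log(path, status_type, since=None):
        """Return rows with the given status, optionally on/after a datetime."""
        matches = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if row["status"] != status_type:
                    continue
                if since and datetime.fromisoformat(row["timestamp"]) < since:
                    continue
                matches.append(row)
        return matches

    # Example: all Error entries since June 1
    for row in filter_log("crawl_log.csv", "Error", since=datetime(2009, 6, 1)):
        print(row["timestamp"], row["url"], row["message"])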
The crawl
logs have five different views that present different levels of
information, filtering options, and drilldown capabilities. The default
Content Source view, shown in Figure 2, presents summary counts of the five status types: Successes, Warnings, Errors, Top Level Errors, and Deletes.
After the crawl log is opened, other views can be selected from the toolbar. The Host Name view in Figure 3 presents a per-host summary of items crawled, with the same status-type counts as the Content Source view plus a Total column. It also provides a search box for locating specific URLs, which is useful when crawling large numbers of URLs or when isolating errors for specific sites under a given host.
The URL view shown in Figure 4 presents more information about each error, such as the error message, the content source, and the time of the last crawl. This list is normally extremely long because it records every action on every item. The advanced search filtering tool is useful for troubleshooting crawls and for revealing user activities that create problems, such as uploading files with names so long that the files cannot be opened or downloaded. It also reveals files that are too large for the default crawl settings; in that case, only the first portion of the file is indexed.
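Both problems can be caught at the file system before a crawl runs. The following sketch assumes the content lives on a file share you can walk; the share path, the 128-character name limit, and the 16 MB size cap are all illustrative assumptions, so substitute the limits documented for your version.

    # scan_library.py -- flags over-long file names and oversized files.
    # The path and both limits are assumptions for illustration.
    import os

    MAX_NAME_LEN = 128            # assumed file-name length limit
    MAX_SIZE = 16 * 1024 * 1024   # assumed crawler download cap (16 MB)

    def scan(root):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                if len(name) > MAX_NAME_LEN:
                    print(f"NAME TOO LONG ({len(name)} chars): {full}")
                if os.path.getsize(full) > MAX_SIZE:
                    print(f"OVER SIZE CAP: {full}")

    scan(r"\\fileserver\docs")   # hypothetical content-source path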
The status message for each document appears below its URL, along with a symbol indicating whether the crawl was successful. The right-hand column of the table shows the date and time of each message.
The status types are as follows; a short scripted tally of them appears after the list.
Success The crawler successfully connected to the content source, read the content item, and passed the content to the indexer.
Warning The crawler connected to the content source and attempted to crawl the content item but could not complete the crawl.
Error The crawler could not communicate with the content source.
Top Level Errors These are errors at the root of an application or site collection that affect all content below it. Top level errors can result in shorter logs, because identical errors for individual items in that container are not recorded.
Deletes Items deleted by the gatherer. In Figure 4, the deletion occurred because an application was moved from one content source to another.
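To see how such a summary is built, here is a minimal Python sketch that reproduces the Content Source view's counts from the same hypothetical CSV export used earlier; the column name is an assumption.

    # tally_statuses.py -- summary counts per status type, as in the
    # Content Source view. Assumes a CSV export with a "status" column.
    import csv
    from collections import Counter

    def tally(path):
        with open(path, newline="", encoding="utf-8") as f:
            return Counter(row["status"] for row in csv.DictReader(f))

    for status, count in tally("crawl_log.csv").most_common():
        print(f"{status:<16} {count}")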
The Crawl History view shown in Figure 5 gives information about specific crawls but provides no drilldown tools and filters only on content source. This information is useful for adjusting crawl schedules and for identifying the more dynamic content sources.
The Error Message view shown in Figure 6
aggregates all errors into a list of errors with a count of each.
Clicking the hyperlinked number opens the URL view filtered to that
particular error message. Other filters can then be applied to focus the
presentation.
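The aggregation this view performs is easy to approximate offline. The sketch below groups error rows by message and keeps the URLs behind each count, again working against the hypothetical CSV export described earlier.

    # group_errors.py -- counts per distinct error message, with the URLs
    # behind each count (a drilldown like the hyperlinked numbers).
    import csv
    from collections import defaultdict

    def group_errors(path):
        by_message = defaultdict(list)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if row["status"] == "Error":
                    by_message[row["message"]].append(row["url"])
        return by_message

    for message, urls in sorted(group_errors("crawl_log.csv").items(),
                                key=lambda kv: len(kv[1]), reverse=True):
        print(f"{len(urls):>5}  {message}")
        for url in urls[:3]:   # first few URLs, as the drilldown would show
            print(f"       {url}")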