SharePoint 2010: Managing Crawls

3/23/2012 3:47:42 PM

To manage crawls, you must understand the differences between full and incremental crawls. A full crawl follows the instructions of the content source and the crawl rules to crawl the entire content source according to the content type, whether hierarchical, enumerated list, or link traversal. A full crawl replaces the current index for that content source with a new index. However, because some full crawls take many hours, the old index for that content source remains on the index and query servers to meet query demand from your users and is replaced only after the full crawl has completed successfully. This means that, for a length of time, two full indexes of the same content source exist on your hard drives. Be sure to plan for enough disk space to commit full crawls.
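As a rough planning aid, the disk space point above can be written as simple arithmetic. This is a minimal sketch, not a SharePoint sizing formula: the function name, the sizes, and the 20 percent headroom are illustrative assumptions, and it assumes the new index is about the same size as the one it replaces.

```python
def peak_index_disk_gb(index_sizes_gb, source_being_crawled, headroom=0.20):
    """Estimate peak disk use while one content source is fully recrawled.

    index_sizes_gb: hypothetical mapping of content source name -> index size in GB.
    headroom: illustrative safety margin, not a documented SharePoint value.
    """
    steady_state = sum(index_sizes_gb.values())
    # The old and the new index for the crawled source coexist until the crawl commits.
    second_copy = index_sizes_gb[source_being_crawled]
    return (steady_state + second_copy) * (1 + headroom)


sizes = {"Intranet": 40.0, "File shares": 25.0}  # made-up sizes for illustration
print(round(peak_index_disk_gb(sizes, "Intranet"), 1))
```

Running the estimate once per content source and taking the largest result gives a conservative figure if only one full crawl commits at a time.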

What is crawled during an incremental crawl depends on the content type and how changes are detected for that content type. For a file system crawl or normal Web crawls, the date/time stamp is compared to a crawl history log. However, for SharePoint incremental crawls, the change logs maintained in the content databases are used. SharePoint 2010 now supports a very quick ACL-only crawl to update security information for index items. Most databases do not support incremental crawls. FAST technology supports change notifications from SQL databases that essentially “push” changes to the crawler, but the SharePoint 2010 Search feature does not.
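The date/time comparison used for file system and Web content can be pictured with a small sketch. It is only a conceptual model of the idea, not the crawler's actual implementation: the function name and both dictionaries are hypothetical, and items missing from the history are assumed to be new.

```python
from datetime import datetime


def items_to_recrawl(current_stamps, crawl_history):
    """Return the items whose date/time stamp is newer than the crawl history entry.

    current_stamps: hypothetical mapping of item -> last-modified datetime.
    crawl_history: hypothetical mapping of item -> datetime recorded at the last crawl.
    """
    to_crawl = []
    for item, modified in current_stamps.items():
        last_crawled = crawl_history.get(item)
        # Items never seen before, or modified since the last crawl, are picked up.
        if last_crawled is None or modified > last_crawled:
            to_crawl.append(item)
    return sorted(to_crawl)


stamps = {
    "a.docx": datetime(2012, 3, 20),
    "b.docx": datetime(2012, 3, 22),
    "c.docx": datetime(2012, 3, 23),
}
history = {"a.docx": datetime(2012, 3, 21), "b.docx": datetime(2012, 3, 21)}
print(items_to_recrawl(stamps, history))
```

SharePoint content does not work this way; as noted above, incremental crawls of SharePoint sites read the change logs in the content databases instead.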

In the following sections, you’ll learn how to manage crawls from the Manage Content Sources page shown in Figure 1, which presents the tools for managing crawls.

Figure 1. Crawl management options from the Manage Content Sources page

1. Global Crawl Management

Crawls for all content sources can be managed globally with the toolbar option to Start All Crawls, which changes to Stop All Crawls and Pause All Crawls after crawls are started. The type of crawl initiated by the Start All Crawls option depends on several factors.

It follows the next crawl scheduled for each content source, whether that is a full or an incremental crawl.
If a crawl has been paused, then that crawl will be resumed.
If no crawl is scheduled and a full crawl has been completed, then an incremental crawl is started. However, remember that the first crawl of any content source is always a full crawl.
If either type of crawl has been stopped, the next crawl will always be a full crawl. Therefore, careful consideration should be given to the impact of using the Stop All Crawls tool.
The indexing process can always force a full crawl if it determines that enough errors exist in the index that an incremental crawl may not correct them.
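The factors listed above can be summarized as a small decision function. This is a sketch under stated assumptions rather than SharePoint's own logic: the class, its fields, and especially the order in which the rules are checked are hypothetical, because the text does not say which factor wins when several apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentSourceState:
    """Hypothetical snapshot of one content source; not a SharePoint object."""
    paused: bool = False
    last_crawl_stopped: bool = False
    full_crawl_completed: bool = False
    next_scheduled: Optional[str] = None  # "full", "incremental", or None
    index_needs_full_crawl: bool = False  # the indexer found too many errors


def crawl_started_by_start_all(state):
    """Return "resume", "full", or "incremental" (rule order is an assumption)."""
    if state.paused:
        return "resume"
    if not state.full_crawl_completed:
        return "full"  # the first crawl of a content source is always full
    if state.last_crawl_stopped or state.index_needs_full_crawl:
        return "full"
    if state.next_scheduled is not None:
        return state.next_scheduled
    return "incremental"


print(crawl_started_by_start_all(ContentSourceState(full_crawl_completed=True)))
```

Modeling the rules this way makes the cost of Stop All Crawls visible: setting last_crawl_stopped on every content source turns every next crawl into a full crawl.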

Note:

Although the crawl process is read-only and does not modify the files, it will change the last read date on some files, which can impact access auditing.

2. Content Source Crawl Management

The context menu of each content source presents crawl management tools. You can start both full and incremental crawls from the context menu. You can also use the menu to pause, resume, or stop an active crawl. Remember that any time a crawl is stopped or does not complete for any reason, the next crawl of that content source will be a full crawl, because the information in the crawl log and markers set on the change logs are considered inaccurate. When a crawl is paused, the instructions for the crawl and the information about the crawl are retained in memory on the host of the crawl component for use when the crawl is resumed.

3. User Crawl Management

SharePoint crawlers have always obeyed “Do Not Crawl” instructions embedded in Web content. SharePoint 2010 continues to offer content owners of lists, libraries, and sites the ability to add these instructions through the user interface and eliminate their content from search indexes. Site collection administrators can also flag site columns (metadata) to keep them from being crawled. Personally identifiable information (PII) is an example of information that should not be indexed on public sites. Be sure to have clear policies regarding what type of content should or should not appear in your index.

4. Scheduling Crawls

The management of crawl schedules is an ongoing process that may require daily monitoring and tweaking. The Manage Content Sources page presents information on the duration of the current and last crawl but does not indicate the type of crawl involved.

However, the Crawl History view of the crawl logs itemizes each crawl’s start and end times with the calculated duration as well as the activity accomplished during the crawl. This information permits search administrators to adjust the crawl schedules as the corpus grows so that a crawl can complete successfully before the next crawl begins. Crawls must be scheduled as often as needed to meet the “freshness” requirements of your organization. You might need to adjust the topology of your search service to add resources to complete crawls often enough to meet these needs. When determining additional resources, consider the impact the additions will have on the WFEs being crawled and on the SQL servers hosting the content and search databases.
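One way to use the Crawl History figures is to compare each crawl's duration with the interval between scheduled crawls. The sketch below assumes the start and end times have been copied out of the Crawl History view by hand; the function name and the sample data are hypothetical, and nothing here calls a SharePoint API.

```python
from datetime import datetime, timedelta


def crawls_exceeding_interval(crawl_history, interval):
    """Return (start, duration) for every crawl that ran at least as long as the interval.

    crawl_history: list of (start, end) datetime pairs taken from Crawl History.
    interval: timedelta between scheduled crawl starts.
    """
    overruns = []
    for start, end in crawl_history:
        duration = end - start
        # A crawl this long would still be running when the next one is due.
        if duration >= interval:
            overruns.append((start, duration))
    return overruns


history = [
    (datetime(2012, 3, 20, 1, 0), datetime(2012, 3, 20, 1, 40)),
    (datetime(2012, 3, 21, 1, 0), datetime(2012, 3, 21, 2, 15)),
]
for start, duration in crawls_exceeding_interval(history, timedelta(hours=1)):
    print(start, duration)
```

Any crawl this reports is a signal to lengthen the schedule interval or add crawl resources, as discussed above.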

With the improvements in incremental crawl instructions, you may need to schedule full crawls only when required instead of on a regular basis. The crawl component can itself switch to a full crawl if:

A search application administrator stopped the previous crawl or the previous crawl did not complete for any reason.
A content database was restored from backup without the appropriate switch on the STSADM –restore operation that allows the farm administrators to restore a content database without forcing a full crawl.
A farm administrator has detached and reattached a content database.
A full crawl of the content source has never been done.
The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.
Corruption is detected in the index. Depending on the severity of the corruption, the index server might force a full crawl.

Finally, when is a full crawl required?

When a search application administrator has added a new managed property.
To re-index ASPX pages on Windows SharePoint Services 3.0 or SharePoint Server 2007 sites.

Note:

Incremental crawls do not re-index views or home pages when content within the page has changed, such as the deletion of individual list items. This is because of the inability of the crawler to detect when ASPX pages on SharePoint sites have changed. You should periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed unless you have the site configured to not have ASPX pages crawled. This behavior is the same as in previous versions of SharePoint.

To resolve consecutive incremental crawl failures. The index server has been reported to remove content that could not be accessed in 100 consecutive attempts.
When crawl rules have been added, deleted, or modified.
To repair a corrupted index.
When the search services administrator has created one or more server name mappings.
When the account assigned to the default content access account or crawl rule account has changed. This also automatically triggers a full crawl. Account password changes do not require or trigger a full crawl.
When file types have been added and/or IFilters have been installed and the new content needs to be indexed.