The top portion of the Search Administration page is shown in Figure 1
and presents a dashboard of System Status information with links to
configuration dialog boxes for general search service configuration.
On the left side of the page is a Quick Launch organized into three
sections: Administration, Crawling, and Queries And Results.
The top dashboard section
presents information on the crawl and indexing processes as well as the
recent query rate and scopes status. There are also six configuration
links.
The default
content access account was entered when you ran the Farm Configuration
Wizard and can be overridden here for this search service application's
components; a PowerShell alternative is sketched after these settings.
The
Contact e-mail address is left behind by the crawl components of this
service on the sites they crawl. It should be a monitored e-mail address,
so that remote site administrators can reach you if the crawler is
creating performance issues on their sites.
The
Proxy server link opens the Farm Search Administration configuration
and changes the farm-wide setting, although the dialog does not indicate
that this is a farm-level setting rather than a local setting for this service.
The
scopes update schedule is a toggle between the automatic schedule of
the timer job and manually triggered updates. If there are scopes to be
updated, a link is presented to start the update process manually.
Search
alert status can be enabled or disabled here. Normally, it is
automatically disabled during an index reset but must be manually
re-enabled after the index is rebuilt.
If you disable Query logging, no data will be available for query reports.
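For administrators who prefer scripting, the default content access account mentioned above can also be changed from the SharePoint 2010 Management Shell. The following is a minimal sketch using the search administration object model; the service application name and crawl account are placeholders for your environment.

    # Run in the SharePoint 2010 Management Shell (or load the Microsoft.SharePoint.PowerShell snap-in first).
    # The service application name and crawl account below are placeholders.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

    # The Content object exposes the default gathering (crawl) account for this service application.
    $content = New-Object Microsoft.Office.Server.Search.Administration.Content($ssa)
    $password = Read-Host "Crawl account password" -AsSecureString
    $content.SetDefaultGatheringAccount("CONTOSO\svc_search_crawl", $password)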
1. Creating and Managing Content Sources
The first step in setting up
crawling is the creation of content sources. Click the Content Sources
link in the Quick Launch (refer back to Figure 1) to open the Manage Content Sources page shown in Figure 2.
This page continues the dashboard theme to give the status of each
content source and crawl information that will be useful in scheduling crawls.
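The same status information can be retrieved with PowerShell if you want to script reporting. A minimal sketch, assuming a search service application named "Search Service Application"; the properties selected are those exposed by the content source objects.

    # List the content sources for a search service application with their crawl status.
    # "Search Service Application" is a placeholder name for your environment.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
        Select-Object Name, Type, CrawlStatus, CrawlStarted, CrawlCompleted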
A content source is a
collection of addresses that can be crawled with the same connector or
protocol handler along with instructions about how deep to crawl from
the start address, how often to crawl the collection, and at what
priority.
When creating a new content
source, the name should be unique but meaningful to everyone managing
the search service. The default content source types are SharePoint
Sites, Web Sites, File Shares, Exchange Public Folders, Line of Business
Data, and Custom Repository.
A start address cannot appear
in more than one content source but does not have to start at the root
of an application. A single content source can contain up to 50 start
addresses and a search service application can have up to 50 content
sources. Start address entries are not validated during the
creation of a content source. Start addresses should match the type of
content source you are creating.
For example, if you create a Web Sites content source, do not include
start addresses that begin with file:// or that point to some other type
of content that is not Web based.
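Content sources can also be created from PowerShell. The sketch below creates a hypothetical Web Sites content source with two start addresses; the content source name and URLs are placeholders.

    # Create a Web Sites content source with two start addresses.
    # -StartAddresses takes a comma-separated list; names and URLs are placeholders.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Corporate Web Sites" `
        -Type Web `
        -StartAddresses "http://www.contoso.com,http://partners.contoso.com"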
The crawl settings
configuration depends on the content source type selected and
controls the depth of the crawl from the starting address. For a Web
Sites content source, this includes how many hops to follow within the
server and how many hops to follow to other servers. Choosing Unlimited
for both would, in effect, attempt to crawl the Internet, starting at the
specified addresses.
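For Web content sources, these depth settings are exposed to scripts as page and server-hop enumeration depths. A hedged sketch, reusing the hypothetical "Corporate Web Sites" content source from the previous example:

    # Limit crawl depth for the hypothetical "Corporate Web Sites" content source.
    # MaxPageEnumerationDepth = page hops within a server; MaxSiteEnumerationDepth = hops to other servers.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Corporate Web Sites" `
        -MaxPageEnumerationDepth 3 `
        -MaxSiteEnumerationDepth 0    # 0 = do not follow links to other servers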
You can create Full
Crawl schedules, Incremental Crawl schedules, or both. Crawl schedules are not
required for content sources for which crawls
will be managed manually. It is not necessary to schedule a full crawl,
because starting an incremental crawl on a content source that has
never been crawled will result in a full crawl.
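Crawl schedules can also be configured from PowerShell. A sketch, again using the hypothetical content source, that defines a daily incremental crawl and a weekly full crawl; the intervals shown are arbitrary examples.

    # Define a daily incremental crawl and a weekly full crawl for the hypothetical content source.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Corporate Web Sites" `
        -ScheduleType Incremental -DailyCrawlSchedule -CrawlScheduleRunEveryInterval 1
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Corporate Web Sites" `
        -ScheduleType Full -WeeklyCrawlSchedule -CrawlScheduleDaysOfWeek Saturday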
The ability to choose Normal
or High crawl priority for a content source is new in SharePoint 2010.
Items in the crawl queue are always crawled in queue order, but items
from high-priority content sources are placed in the queue ahead of
those from normal-priority content sources, regardless of when the
crawls began on the content sources.
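The crawl priority setting is exposed as a property on the content source object. A minimal sketch that raises the hypothetical content source to High priority:

    # Raise the crawl priority of the hypothetical content source to High.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Corporate Web Sites"
    $cs.CrawlPriority = [Microsoft.Office.Server.Search.Administration.CrawlPriority]::High
    $cs.Update()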
Finally, when creating a
content source, you may choose to begin a full crawl as soon as the
content source is created. Because full crawls consume significant
resources, they are usually scheduled for times when resources are available.
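Starting a crawl manually, rather than at creation time, can likewise be scripted. A sketch against the same hypothetical content source:

    # Start a full crawl manually and check its status.
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Corporate Web Sites"
    $cs.StartFullCrawl()
    $cs.CrawlStatus    # for example, CrawlingFull while running and Idle when finished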
By default, a Local
SharePoint sites content source is always created for a search service,
and all Web applications associated with the service are placed in that
content source, as shown in Figure 3.
Although this does ensure that all content will be crawled, it is
seldom an optimum configuration because the crawls of content in
different locations will generally need to be managed independently.
You can gain an understanding of the start addresses by examining the example shown in Figure 3. First, the address sps3://mysites
is a special crawl of the user profiles using the profile pages of the
My Site host. Profiles need to be crawled after they are imported and
before audiences are compiled. Incremental crawls of profile data
may not be required throughout the day at all, but profile
data certainly needs to be crawled on a different schedule and priority than other
content. Remember that profile data can be manually entered by users in
addition to being imported from databases.
Depending on how personal sites are used in your organization, the content of the http://mysites
application will probably have a different schedule and priority than
other content. The content of a portal can be more or less dynamic than
that of a team collaborative site. The crawl scheduling for any content
source must reflect two main considerations: how fast changes in the
content source need to be reflected in your index and the rate at which
changes occur at the content source. This is why, for example, you might
crawl the company announcements folder every 4 hours but crawl http://archive only once every quarter.
So, crawl scheduling and
priority are two reasons for creating a unique content source. Other
reasons could include the need to define a scope based on a content
source, unique crawl settings for depth and hops to off-site links, and
crawl management.
In environments where
security is important, content is sometimes inadvertently
placed on websites with wider access than the material should be given.
When this “spillage” material is removed, a full crawl must be run to
prove that the material is not duplicated elsewhere and that the
information is completely removed from the index. Dividing the content
to be crawled into smaller content sources reduces the overhead of a
full crawl.