SharePoint 2010's crawler communicates with the defined content sources in a standardized manner. It indexes content as the crawl account it is configured to use and collects information from all of the links it is given. If subfolders are set to be indexed, it navigates to those folders, collects the links, and gathers the content. It is not always desirable or possible, however, to have SharePoint crawl every content source in the same way or with the same accounts. Therefore, SharePoint 2010 provides a powerful feature for defining rules for given paths that may be encountered during crawling. These rules can include or exclude specific content as well as pass special user credentials to specific items so that they are gathered correctly.
Crawl rules are applied in the Search service application on the Crawl Rules page, which is under the Crawler section of the left-hand navigation. Adding a new crawl rule is as easy as navigating to the Crawl Rules page and selecting New Crawl Rule. Because regular expressions and wildcard rules can be applied, a testing feature is available on the Crawl Rules page. This feature allows a particular address to be entered and tested to see whether an existing rule affects the crawling of that address. Since many rules can be applied and their combined effect is not always obvious, this testing feature is very useful (Figure 1). If a page is not being crawled, administrators are encouraged to check for conflicting rules.
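The existing rules can also be reviewed from the SharePoint 2010 Management Shell, which is a quick way to look for conflicting or overlapping rules. The following is a minimal sketch; the service application name "Search Service Application" is an assumption and should be replaced with the name used in the farm.

```powershell
# Sketch: list the crawl rules defined for a Search service application,
# assuming the default name "Search Service Application".
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Show each rule's path, type (inclusion/exclusion), and evaluation order.
Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa |
    Select-Object Path, Type, Priority |
    Format-Table -AutoSize
```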
To add a crawl rule, navigate to the Search service application and choose Crawl Rules in the left-hand navigation under Crawler. On the Crawl Rules page, select New Crawl Rule. On the Add Crawl Rule page, paths can be added and explicitly included or excluded. Wildcards or regular expressions can be used to create complex inclusion or exclusion rules. This provides a powerful way to identify desirable or undesirable content and make sure it is or isn't crawled.
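The same kind of rule can be created from the SharePoint 2010 Management Shell with New-SPEnterpriseSearchCrawlRule. This is a minimal sketch, assuming a single Search service application in the farm; the http://intranet paths are hypothetical examples.

```powershell
# Sketch: explicitly include one path and exclude another (hypothetical URLs).
# Assumes a single Search service application in the farm.
$ssa = Get-SPEnterpriseSearchServiceApplication

# Explicitly include a path that should be crawled.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://intranet/projects/*" `
    -Type InclusionRule

# Explicitly exclude a path that should not appear in search results.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://intranet/archive/*" `
    -Type ExclusionRule
```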
Adjusting the crawler with crawl rules can go a long way toward improving the relevance and quality of the search result set. All too often, search result lists are polluted with unnecessary or irrelevant content. Excluding this content from the crawl with crawl rules can help remove unnecessary documents from the crawl database and, consequently, from the result lists. Typical examples are documents of a certain type or in a certain location. Although many scenarios can be imagined where documents with a certain file name or in a certain path need to be excluded, one of the most common situations is crawling a public web site that has a print version of each page. A crawl rule that excludes the print versions (e.g., a print=true pattern in the URL) can easily remove these pages from the crawled content and eliminate this noise. Some simple inspection of the search results and of the URL patterns on the content source sites will help determine what kinds of rules are appropriate.
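As a sketch of the print-version scenario, a wildcard exclusion rule can keep such pages out of the index. The site URL and query-string pattern below are hypothetical examples.

```powershell
# Sketch: exclude print-friendly page versions that carry print=true in the URL.
# Assumes a single Search service application; the site URL is hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://www.example.com/*print=true*" `
    -Type ExclusionRule
```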
1. Using Regular Expressions in Crawl Rules
SharePoint 2010 has the
added feature of supporting regular expressions in crawl rules. The
administrator must be sure to select the "Match regular expressions"
check box and formulate the expressions properly, but this feature opens
vast new possibilities for controlling what is crawled and what isn't.
SharePoint 2010 supports the regular expression operators listed in Tables 1 through 3.
Table 1. Acceptable Grouping Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Group | () | Parentheses group sets of characters. Operators for the group are applied to the entire group. | | | |
| Disjunction | \| | The pipe operator is applied between two expressions and matches when either expression is valid. It is a logical OR. | \\prosharepointshare\((share1)\|(share2))\.* | \\prosharepointshare\share1\<files> OR \\prosharepointshare\share2\<files> | \\myshare\share1share2\<files> |
Table 2. Acceptable Match Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Disjunction | \| | The pipe operator is applied between two expressions and matches when either expression is valid. It is a logical OR. | \\prosharepointshare\((share1)\|(share2))\.* | \\prosharepointshare\share1\<files> OR \\prosharepointshare\share2\<files> | \\myshare\share1share2\<files> |
| List | [<list of chars>] | This operator is a list of characters inside square brackets "[]". It matches any character in the list. A range of characters can be specified by placing the hyphen "-" operator between two characters. | http://prosharepointsearch/page[1-9].htm | http://prosharepointsearch/page1.htm OR http://prosharepointsearch/page2.htm OR http://prosharepointsearch/page3.htm OR ... | |

Table 3. Acceptable Count Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Exact count | {num} | This operator is a number inside curly brackets "{}", e.g., {1}. It specifies exactly how many times the preceding match must occur. | http://prosharepointsearch/(1){5}-(0){3}.aspx | http://prosharepointsearch/11111-000.aspx | http://prosharepointsearch/111-00.aspx |
| Min count | {num,} | This operator is a number inside curly brackets "{}" followed by a comma ",", e.g., {1,}. It places a minimum on the number of times the preceding match must occur. | http://prosharepointsearch/(1){5,}-(0){2}.aspx | http://prosharepointsearch/11111-00.aspx AND http://prosharepointsearch/111111-00.aspx | http://prosharepointsearch/1111-00.aspx |
| Range count | {num1,num2} | This operator holds two numbers inside curly brackets "{}" separated by a comma ",", e.g., {4,5}. The first number defines a lower limit and the second an upper limit on the number of repetitions. The first number should always be lower than the second for the rule to be valid. | http://prosharepointsearch/(1){4}-(0){2,3}.aspx | http://prosharepointsearch/1111-00.aspx AND http://prosharepointsearch/1111-000.aspx | http://prosharepointsearch/9999-0000.aspx |
When adding regular expressions to match crawl paths, it is important to know that the protocol part of the path (e.g., http://) cannot contain regular expressions. Only the parts of the path after the defined protocol may contain regular expressions. If the protocol is omitted, SharePoint will prepend http:// to the host name and to any attempted regular expression.
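Below is a minimal sketch of creating one of the regular-expression rules from the tables above in PowerShell. Note that the protocol (http://) is written literally; only the part after it is treated as a regular expression. The http://prosharepointsearch address is the example host used in the tables, and a single Search service application is assumed.

```powershell
# Sketch: include only pages page1.htm through page9.htm, using the list operator.
$ssa = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://prosharepointsearch/page[1-9].htm" `
    -Type InclusionRule `
    -IsAdvancedRegularExpression $true
```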
By default, regular expression matches are not case-sensitive. Additionally, SharePoint 2010's crawler normalizes all discovered links by converting them to lowercase. If it is necessary to match case, or to use regular expressions to exclude documents based on character case in the path, the "Match case" check box should be selected; otherwise, leave it empty. Matching case may be necessary when crawling Apache-driven web sites where page addresses are case-sensitive, Linux-based file shares, or content from Business Connectivity Services that preserves case. Creating crawl rules for case-sensitive content allows such items to be crawled and recognized as unique.
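When a rule must distinguish case, the "Match case" option corresponds to a property on the crawl rule object. The following sketch assumes that the CrawlRule object returned by New-SPEnterpriseSearchCrawlRule exposes a CaseSensitiveURL property; verify the property name in the target environment before relying on it. The Apache site URL is a hypothetical example.

```powershell
# Sketch: create a rule for a case-sensitive Apache site and turn on case matching.
# CaseSensitiveURL is an assumed property name; the site URL is hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication

$rule = New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://apache.example.com/Docs/*" `
    -Type InclusionRule

$rule.CaseSensitiveURL = $true   # assumed equivalent of the "Match case" check box
$rule.Update()
```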
2. Using Crawl Rules to Grant Access
Crawl rules can also be used to grant access to specific content, or to parts of content, by defining the user that will crawl that content. Generally, the crawler should be given full read access to content, and SharePoint's permission filtering should be left to determine what users can see.
NOTE
Be careful when applying blanket permissions across large document repositories. Although giving the SharePoint crawler read access to everything is usually a good idea in well-managed SharePoint sites, doing so on other systems can expose security risks, such as documents without correct permissions that have gone unnoticed only because of obscurity. A search engine is a great tool for finding things, even those best left hidden.
It is also possible and sometimes
necessary to define a special user for indexing external sites or
independent systems such as file shares or Exchange. In these cases, a
special user with read access to the content can be defined in the crawl
rules. For example, if indexing Exchange public folders, a separate
user can be defined to allow read-only access to those folders. This
user can be set in crawl rules to be the user to index that content,
thereby protecting other Exchange content from unauthorized crawling (Figure 2).
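A dedicated crawl account for an external system can also be assigned through PowerShell. This is a minimal sketch; the file share path and the DOMAIN\svc_crawl account are hypothetical, the password prompt keeps the credential out of the script, and the rule's authentication type may need to be adjusted for the target system.

```powershell
# Sketch: crawl an external file share with a dedicated read-only account.
# Assumes a single Search service application; path and account are hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication
$password = Read-Host -Prompt "Password for DOMAIN\svc_crawl" -AsSecureString

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "file://fileserver/share/*" `
    -Type InclusionRule `
    -AccountName "DOMAIN\svc_crawl" `
    -AccountPassword $password
```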