SharePoint 2010's crawler communicates with the defined content sources in a standardized manner. It indexes content as the crawl account it is configured to use and collects information from all of the links it is given. If subfolders are set to be indexed, it navigates to those folders, collects the links, and gathers the content. It is not always desirable or possible, however, to have SharePoint crawl every content source in the same way or with the same accounts. Therefore, SharePoint 2010 provides a powerful feature for defining rules for given paths that may be encountered during crawling. These rules can include or exclude specific content as well as pass special user credentials to specific items so that they are gathered correctly.
Crawl rules are applied in the Search service application on the Crawl Rules page, which is under the Crawler section of the left-hand navigation. Adding a new crawl rule is as easy as navigating to the Crawl Rules page and selecting New Crawl Rule. Because regular expressions and wildcard rules can be applied, a testing feature is available on the Crawl Rules page. This feature allows a particular address to be entered and tested to see whether an existing rule affects the crawling of that address. Since many rules can be applied and their combined effect is not always obvious, this testing feature is very useful (Figure 1). If a page is not being crawled, administrators are encouraged to check for conflicting rules.
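The existing rules can also be reviewed from the SharePoint 2010 Management Shell, which is a quick way to look for conflicting or overlapping rules. The following is a minimal sketch; the service application name "Search Service Application" is an assumption and should be replaced with the name used in the farm.

```powershell
# Sketch: list the crawl rules defined for a Search service application,
# assuming the default name "Search Service Application".
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Show each rule's path, type (inclusion/exclusion), and evaluation order.
Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa |
    Select-Object Path, Type, Priority |
    Format-Table -AutoSize
```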
To add a crawl rule, navigate to the Search service application and choose Crawl Rules in the left-hand navigation under Crawler. On the Crawl Rules page, select New Crawl Rule. On the Add Crawl Rule page, paths can be added and explicitly included or excluded. Wildcards or regular expressions can be used to create complex inclusion or exclusion rules. This provides a powerful way to identify desirable or undesirable content and make sure it is or isn't crawled.
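The same kind of rule can be created from the SharePoint 2010 Management Shell with New-SPEnterpriseSearchCrawlRule. This is a minimal sketch, assuming a single Search service application in the farm; the http://intranet paths are hypothetical examples.

```powershell
# Sketch: explicitly include one path and exclude another (hypothetical URLs).
# Assumes a single Search service application in the farm.
$ssa = Get-SPEnterpriseSearchServiceApplication

# Explicitly include a path that should be crawled.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://intranet/projects/*" `
    -Type InclusionRule

# Explicitly exclude a path that should not appear in search results.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://intranet/archive/*" `
    -Type ExclusionRule
```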
Adjusting the crawler with crawl rules can go a long way toward improving the relevance and quality of the search result set. All too often, search result lists are polluted with unnecessary or irrelevant content. Excluding this content from the crawl with crawl rules can help remove unnecessary documents from the crawl database and, consequently, from the result lists. Typical examples are documents of a certain type or in a certain location. Although many scenarios can be imagined where documents with a certain file name or in a certain path need to be excluded, one of the most common situations is crawling a public web site that has a print version of each page. A crawl rule that excludes the print versions (e.g., a print=true pattern in the URL) can easily remove these pages from the crawled content and eliminate this noise. Some simple inspection of the search results and of the URL patterns on the content source sites will help determine what kinds of rules are appropriate.
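As a sketch of the print-version scenario, a wildcard exclusion rule can keep such pages out of the index. The site URL and query-string pattern below are hypothetical examples.

```powershell
# Sketch: exclude print-friendly page versions that carry print=true in the URL.
# Assumes a single Search service application; the site URL is hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://www.example.com/*print=true*" `
    -Type ExclusionRule
```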
1. Using Regular Expressions in Crawl Rules
SharePoint 2010 has the
added feature of supporting regular expressions in crawl rules. The
administrator must be sure to select the "Match regular expressions"
check box and formulate the expressions properly, but this feature opens
vast new possibilities for controlling what is crawled and what isn't.
SharePoint 2010 supports the regular expression operators listed in Tables 1 through 3.
Table 1. Acceptable Grouping Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Group | () | Parentheses group sets of characters. Operators for the group are applied to the entire group. | | | |
| Disjunction | \| | The pipe operator is applied between two expressions and matches when either expression is valid. It is a logical OR. | \\prosharepointshare\((share1)\|(share2))\.* | \\prosharepointshare\share1\<files> OR \\prosharepointshare\share2\<files> | \\myshare\share1share2\<files> |
Table 2. Acceptable Match Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Disjunction | \| | The pipe operator is applied between two expressions and matches when either expression is valid. It is a logical OR. | \\prosharepointshare\((share1)\|(share2))\.* | \\prosharepointshare\share1\<files> OR \\prosharepointshare\share2\<files> | \\myshare\share1share2\<files> |
| List | [<list of chars>] | This operator is a list of characters inside square brackets "[]". It matches any character in the list. A range of characters can be specified by placing the hyphen "-" operator between two characters. | http://prosharepointsearch/page[1-9].htm | http://prosharepointsearch/page1.htm OR http://prosharepointsearch/page2.htm OR http://prosharepointsearch/page3.htm OR ... | |

Table 3. Acceptable Count Operators in SharePoint 2010

| Operator | Symbol | Description | Example | Valid match | Invalid match |
|---|---|---|---|---|---|
| Exact count | {num} | This operator is a number inside curly brackets "{}", e.g., {1}. It specifies exactly how many times the preceding match must occur. | http://prosharepointsearch/(1){5}-(0){3}.aspx | http://prosharepointsearch/11111-000.aspx | http://prosharepointsearch/111-00.aspx |
| Min count | {num,} | This operator is a number inside curly brackets "{}" followed by a comma ",", e.g., {1,}. It places a minimum on the number of times the preceding match must occur. | http://prosharepointsearch/(1){5,}-(0){2}.aspx | http://prosharepointsearch/11111-00.aspx AND http://prosharepointsearch/111111-00.aspx | http://prosharepointsearch/1111-00.aspx |
| Range count | {num1,num2} | This operator holds two numbers inside curly brackets "{}" separated by a comma ",", e.g., {4,5}. The first number defines a lower limit and the second an upper limit on the number of repetitions. The first number should always be lower than the second for the rule to be valid. | http://prosharepointsearch/(1){4}-(0){2,3}.aspx | http://prosharepointsearch/1111-00.aspx AND http://prosharepointsearch/1111-000.aspx | http://prosharepointsearch/9999-0000.aspx |
When adding regular expressions to match crawl paths, it is important to know that the protocol part of the path (e.g., http://) cannot contain regular expressions. Only the parts of the path after the defined protocol may contain regular expressions. If the protocol is omitted, SharePoint will prepend http:// to the host name and to any attempted regular expression.
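Below is a minimal sketch of creating one of the regular-expression rules from the tables above in PowerShell. Note that the protocol (http://) is written literally; only the part after it is treated as a regular expression. The http://prosharepointsearch address is the example host used in the tables, and a single Search service application is assumed.

```powershell
# Sketch: include only pages page1.htm through page9.htm, using the list operator.
$ssa = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://prosharepointsearch/page[1-9].htm" `
    -Type InclusionRule `
    -IsAdvancedRegularExpression $true
```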
By default, regular expression matches are not case-sensitive. Additionally, SharePoint 2010's crawler normalizes all discovered links by converting them to lowercase. If it is necessary to match case, or to use regular expressions to exclude documents based on character case in the path, the "Match case" check box should be selected; otherwise, leave it empty. Matching case may be necessary when crawling Apache-driven web sites where page addresses are case-sensitive, Linux-based file shares, or content from Business Connectivity Services that preserves case. Creating crawl rules for case-sensitive content allows such items to be crawled and recognized as unique.
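When a rule must distinguish case, the "Match case" option corresponds to a property on the crawl rule object. The following sketch assumes that the CrawlRule object returned by New-SPEnterpriseSearchCrawlRule exposes a CaseSensitiveURL property; verify the property name in the target environment before relying on it. The Apache site URL is a hypothetical example.

```powershell
# Sketch: create a rule for a case-sensitive Apache site and turn on case matching.
# CaseSensitiveURL is an assumed property name; the site URL is hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication

$rule = New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://apache.example.com/Docs/*" `
    -Type InclusionRule

$rule.CaseSensitiveURL = $true   # assumed equivalent of the "Match case" check box
$rule.Update()
```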
2. Using Crawl Rules to Grant Access
Crawl rules can also be used to grant access to specific content, or to parts of content, by defining the user that will crawl that content. Generally, the crawler should be given full read access to content, and SharePoint's permission filtering should be left to determine what users can see.
NOTE
Be careful when applying blanket permissions across large document repositories. Although giving the SharePoint crawler read access to everything is usually a good idea in well-managed SharePoint sites, doing so on other systems can expose security risks, such as documents without correct permissions that have gone unnoticed only because of obscurity. A search engine is a great tool for finding things, even those best left hidden.
It is also possible and sometimes
necessary to define a special user for indexing external sites or
independent systems such as file shares or Exchange. In these cases, a
special user with read access to the content can be defined in the crawl
rules. For example, if indexing Exchange public folders, a separate
user can be defined to allow read-only access to those folders. This
user can be set in crawl rules to be the user to index that content,
thereby protecting other Exchange content from unauthorized crawling (Figure 2).
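A dedicated crawl account for an external system can also be assigned through PowerShell. This is a minimal sketch; the file share path and the DOMAIN\svc_crawl account are hypothetical, the password prompt keeps the credential out of the script, and the rule's authentication type may need to be adjusted for the target system.

```powershell
# Sketch: crawl an external file share with a dedicated read-only account.
# Assumes a single Search service application; path and account are hypothetical.
$ssa = Get-SPEnterpriseSearchServiceApplication
$password = Read-Host -Prompt "Password for DOMAIN\svc_crawl" -AsSecureString

New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "file://fileserver/share/*" `
    -Type InclusionRule `
    -AccountName "DOMAIN\svc_crawl" `
    -AccountPassword $password
```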