SharePoint 2010 Search : Tuning Search (part 1) - Authoritative Pages & Result Removal

8/2/2011 3:06:44 PM

Really, the best way to improve the quality of search is to improve the quality of the content being indexed. The computer adage "Garbage In, Garbage Out" applies profoundly to search engines. Poor-quality content will, without exception, result in poor-quality search results. However, improving huge sets of legacy documents imported from a file share or other document storage can be daunting if not impossible. In these cases, it is wise to be critical of the need for such documents and brutal when it comes to trimming old and questionably valuable content from import to SharePoint or crawling by the SharePoint crawler.

The best way to manage document quality moving forward is to have an active training program for SharePoint end users and a coherent tagging and document generation strategy. Word and PDF documents, among others, are often if not usually mistitled, have poor or no metadata, and are not well formatted for search. Having a policy for document authoring and metadata use can lead to a much better search experience and better long-term knowledge management.

Here are some hints on how to improve the quality of content in SharePoint.

Make good titles. Encourage document authors to title their documents and fill in any metadata on the documents.
Convert to .pdf wisely. Almost everyone has seen the .pdf document titled Word document.pdf, or some equally useless .pdf title like untitled.pdf. This is because no title was given to the document when it was converted from its original format to .pdf. By requiring authors to add meaningful titles, administrators and managers can help the findability of information in an organization.
Add properties. Use the Managed Metadata service to devise a taxonomy and teach authors how to use it.
Remove old or unnecessary content.
Ask users to identify and flag noise documents and click, rate, and share useful ones.
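
As a rough illustration of the title advice above, the following Python sketch flags documents whose titles are empty or placeholders such as untitled.pdf. It is only a sketch under stated assumptions: the placeholder list, the function names, and the (path, title) input format are hypothetical, and nothing here reads from SharePoint itself.

```python
import re

# Hypothetical list of titles that carry no information; extend to suit your content.
PLACEHOLDER_TITLES = {"", "untitled", "word document", "document", "new document"}


def is_poor_title(title):
    """Return True when a title is missing or is a known placeholder."""
    cleaned = (title or "").strip().lower()
    # Drop a trailing file extension such as ".pdf" or ".docx".
    cleaned = re.sub(r"\.[a-z0-9]{2,4}$", "", cleaned)
    return cleaned in PLACEHOLDER_TITLES


def flag_poor_titles(documents):
    """Given (path, title) pairs, return the paths whose titles need attention."""
    return [path for path, title in documents if is_poor_title(title)]


# Made-up sample data for illustration only.
documents = [
    ("/share/hr/a.pdf", "Word document.pdf"),
    ("/share/hr/b.pdf", "untitled.pdf"),
    ("/share/hr/c.docx", "Q3 Travel Expense Policy"),
    ("/share/hr/d.docx", None),
]
flagged = flag_poor_titles(documents)
```

A report like this could be handed to document owners before the content is imported or crawled, so that titles are fixed at the source.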

1. Authoritative Pages

Probably the easiest administrative way to modify the ranking of the search results in SharePoint 2010 is by the use of authoritative pages. Pages can have additional ranking boost applied to them based on their click distance from an authoritative page. Sites are usually built in a pyramid structure, with an entry page at the site collection level and then a list of navigation links that link sites and subsites. Those sites link to lists and libraries that link to documents. Sometimes links can be provided across sites. SharePoint analyzes this structure and the number of clicks that exist between two documents. It then adds to a page's or document's ranking value based on that item's location relative to an authoritative page.

The default authoritative page is the "home page" to your site collection. Therefore, ranking is boosted based on the click depth of the sites, lists, libraries, pages, and documents. As expected, a deep page can have its ranking boosted by linking to it from the main page. The authoritative page setting includes lists and library pages, but individual documents cannot be added.
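
The idea of click distance can be sketched with a breadth-first search over a link graph. The Python example below is a conceptual illustration only: the site structure is made up, and the `illustrative_boost` decay function is a hypothetical stand-in, not the formula SharePoint actually uses.

```python
from collections import deque


def click_distances(links, authoritative_pages):
    """Breadth-first search: fewest clicks from any authoritative page to each page."""
    distances = {page: 0 for page in authoritative_pages}
    queue = deque(authoritative_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


def illustrative_boost(distance):
    """Hypothetical decay: closer pages get a larger boost. Not SharePoint's formula."""
    return 1.0 / (1 + distance)


# Made-up pyramid structure: home page -> sites -> library -> document.
links = {
    "/": ["/sites/hr", "/sites/it"],
    "/sites/hr": ["/sites/hr/policies"],
    "/sites/hr/policies": ["/sites/hr/policies/travel.docx"],
    "/sites/it": [],
}
deep_page = "/sites/hr/policies/travel.docx"

before = click_distances(links, ["/"])[deep_page]

# Linking to the deep page from the home page shortens its click distance.
links["/"].append(deep_page)
after = click_distances(links, ["/"])[deep_page]
```

In this sketch the deep document starts three clicks from the home page and ends up one click away once the home page links to it, which is the effect described above.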

The Authoritative Pages page can be accessed from the Search service application in Central Administration. The top link under Queries and Results is to the Authoritative Pages page. See Figure 1.

Figure 1. Link to Authoritative Pages page

There are three separate levels of authoritative pages (Figure 2) that can be set:

Most authoritative pages: Pages close to this page will increase in ranking depending on the number of clicks from the set page to the page returned in the result set.
Second-level authoritative pages: These work the same way as most authoritative pages but boost ranking less.
Third-level authoritative pages: These work the same way as most authoritative pages and second-level authoritative pages but boost ranking the least of the three.

And there is one non-authoritative level, which will reduce ranking of entire sites:

Sites to demote: Adding sites to this section will demote all content on that site in the search results.

Figure 2. Authoritative Pages page

2. Result Removal

In some cases, especially where very large document repositories are indexed, undesirable results may appear in the result list. For example, documents buried on a file share may contain sensitive information that can be returned in the search results. Searches for "password" or "salary" can often surface these documents. The best way to deal with these documents is, of course, to remove them from the location where they are being indexed or to restrict them with permissions. However, SharePoint also offers a simple mechanism for removing undesirable results. The Search Result Removal feature in the Search service application has a simple field where the administrator can add the URLs of undesired documents.

To add documents to the result removal list, do the following:

Navigate to the Search service application.
On the left-hand menu under Queries and Results, choose Search Result Removal (Figure 3).

Figure 3. Search Result Removal menu item

Add the URLs of the documents you want removed from search (Figure 4).

Figure 4. Search result removal

Documents will automatically be removed, and crawl rules will be applied to avoid crawling the documents in future crawls.
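
The two effects described above, dropping a URL from results and skipping it on future crawls, can be modeled with a small Python sketch. This is a conceptual model only and does not call any SharePoint API: the `ResultRemovalList` class, its method names, the URL normalization, and the sample URLs are all assumptions made for illustration.

```python
def normalize(url):
    """Assumption: compare URLs case-insensitively and ignore a trailing slash."""
    return url.strip().rstrip("/").lower()


class ResultRemovalList:
    """Hypothetical model of a list of URLs removed from search."""

    def __init__(self):
        self._removed = set()

    def add(self, url):
        self._removed.add(normalize(url))

    def should_crawl(self, url):
        """Stand-in for an exclusion crawl rule: skip removed URLs on future crawls."""
        return normalize(url) not in self._removed

    def filter_results(self, results):
        """Drop removed URLs from a list of result URLs."""
        return [url for url in results if normalize(url) not in self._removed]


# Made-up URLs for illustration only.
removal = ResultRemovalList()
removal.add("http://fileserver.example/share/Salaries.xls")

results = [
    "http://fileserver.example/share/salaries.xls",
    "http://fileserver.example/share/handbook.pdf",
]
visible = removal.filter_results(results)
```

Keeping both checks behind one list mirrors the behavior described above, where a single entry affects both the current results and later crawls.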

3. Stop Words

As we saw with Heaps' Law, the most common terms in any corpus do not carry any value for searching. Grammatical terms and helper language can convey meaning when used in context, but they do not themselves help us find the information we want. To keep the search engine from ranking documents highly merely because they have a high density of terms such as "the", these most common terms are set as stop words, or noise words, in SharePoint's search engine. SharePoint 2010 has one stop word file for each language it supports, as well as a neutral file, which is empty by default. All other files contain a few of the most common terms for that language. The administrator can easily add more.

The files are located in C:\Program Files\Microsoft Office Servers\14.0\Data\Office Server\Config, where there is a pristine set of files that are copied to the Search service application's specific config folder when a new Search service application is created. This path is C:\Program Files\Microsoft Office Servers\14.0\Data\Applications\GUID\Config. It is best to edit the files in this path on each query server.

The default stop word file for English is called noiseeng.txt and contains the following terms.

a
and
is
in
it
of
the
to

The list can be extended by placing one additional term on each line of the file. Common words in the language may be freely added as long as searching for them would not provide useful results. Some useful additions might be "this", "that", "these", "those", and "they".
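
To make the effect of a stop word file concrete, here is a minimal Python sketch that parses a one-term-per-line list and strips those terms from a query. It only illustrates the concept: the function names are hypothetical, the file content is inlined rather than read from disk, and the simple whitespace tokenization is an assumption, not how SharePoint's word breakers work.

```python
# Content of the default English list shown above, one term per line.
DEFAULT_ENGLISH = "a\nand\nis\nin\nit\nof\nthe\nto\n"


def parse_noise_file(text):
    """One term per line; blank lines are ignored; terms are compared in lower case."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(query, stop_words):
    """Split the query on whitespace and drop any term found in the stop word set."""
    return [term for term in query.lower().split() if term not in stop_words]


stop_words = parse_noise_file(DEFAULT_ENGLISH)
default_terms = remove_stop_words("the salary of the director", stop_words)

# Adding the suggested extra terms, as if they had been appended to the file.
extended = stop_words | {"this", "that", "these", "those", "they"}
extended_terms = remove_stop_words("those are the travel policies", extended)
```

With the default list, only "salary" and "director" remain from the first query. With the extended list, "those" is also dropped from the second query, which shows why additions should be limited to words that never help a search.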
