Really, the best way to improve the quality of search
is to improve the quality of the content being indexed. The computer
adage "Garbage In, Garbage Out" applies profoundly to search engines.
Poor-quality content will, without exception, result in poor-quality
search results. However, improving huge sets of legacy documents
imported from a file share or other document storage can be daunting if
not impossible. In these cases, it is wise to be critical of the need
for such documents and brutal when it comes to trimming old and
questionably valuable content from import to SharePoint or crawling by
the SharePoint crawler.
The best way to manage
document quality moving forward is to have an active training program
for SharePoint end users and a coherent tagging and document generation
strategy. Word and PDF documents, among others, are often if not usually
mistitled, have poor or no metadata, and are not well formatted for
search. Having a policy for document authoring and metadata use can lead
to a much better search experience and better long-term knowledge
management.
Here are some hints on how to improve the quality of content in SharePoint.
Make good titles. Encourage document authors to title their documents and fill in any metadata on the documents.
Convert to .pdf wisely. Almost everyone has seen the .pdf document titled Word document.pdf, or some equally useless .pdf title like untitled.pdf. This is because no title was given to the document when it was converted from its original format to .pdf.
By requiring authors to add meaningful titles, administrators and
managers can help the findability of information in an organization.
Add properties. Use the Managed Metadata Service to devise a taxonomy and teach authors how to use it.
Remove old or unnecessary content.
Ask users to identify and flag noise documents and click, rate, and share useful ones.
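As a starting point for a title cleanup, placeholder names such as untitled.pdf can be found with a short script before content is imported or crawled. The following is a minimal sketch in Python, not a SharePoint feature: the pattern list and the function names are hypothetical, and it inspects only file names, not the title property stored inside each document.

```python
import re
from pathlib import Path

# Hypothetical patterns for names that suggest no real title was ever given.
PLACEHOLDER_PATTERNS = [
    r"^untitled",
    r"^word document",
    r"^new document",
    r"^document\d*$",
]


def looks_untitled(filename: str) -> bool:
    """Return True if the file name (without extension) matches a placeholder pattern."""
    stem = Path(filename).stem.strip().lower()
    return any(re.search(pattern, stem) for pattern in PLACEHOLDER_PATTERNS)


def find_untitled(root: str) -> list:
    """Walk a folder tree and list the files whose names look like placeholders."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and looks_untitled(path.name)
    )
```

Calling find_untitled on the root of a file share (the path is whatever share is about to be imported) yields a list that can be handed back to document owners for retitling.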
1. Authoritative Pages
Probably the easiest
administrative way to modify the ranking of the search results in
SharePoint 2010 is by the use of authoritative pages. Pages can have
additional ranking boost applied to them based on their click distance
from an authoritative page. Sites are usually built in a pyramid
structure, with an entry page at the site collection level and then a
list of navigation links that link sites and subsites. Those sites link
to lists and libraries that link to documents. Sometimes links can be
provided across sites. SharePoint analyzes this structure and the number
of clicks that exist between two documents. Then it applies added value
to a page or document's ranking value based on that document's relative
location to an authoritative page.
The default authoritative
page is the "home page" to your site collection. Therefore, ranking is
boosted based on the click depth of the sites, lists, libraries, pages,
and documents. As expected, a deep page can have its ranking boosted by
linking to it from the main page. The authoritative page setting
includes lists and library pages, but individual documents cannot be
added.
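The notion of click distance can be illustrated with a breadth-first search over the link structure. This is only a sketch of the idea in Python: the site map and URLs below are invented, and the way SharePoint converts a distance into a ranking boost is not shown because the text does not specify it.

```python
from collections import deque


def click_distances(links, authoritative):
    """Fewest clicks from any authoritative page to each reachable page (breadth-first search)."""
    dist = {page: 0 for page in authoritative}
    queue = deque(authoritative)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:  # first visit is the shortest path in BFS
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


# Hypothetical pyramid: home page -> sites -> library -> document.
links = {
    "/": ["/sites/hr", "/sites/it"],
    "/sites/hr": ["/sites/hr/policies"],
    "/sites/hr/policies": ["/sites/hr/policies/leave.docx"],
}
before = click_distances(links, ["/"])

# Linking the deep document from the home page shortens its click distance.
links["/"].append("/sites/hr/policies/leave.docx")
after = click_distances(links, ["/"])
```

In this invented example the document starts three clicks from the home page and ends up one click away once the home page links to it directly, which is the situation in which a deep page would be expected to gain ranking.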
The Authoritative Pages page
can be accessed from the Search service application in Central
Administration. The top link under Queries and Results is to the
Authoritative Pages page. See Figure 1.
There are three separate levels of authoritative pages (Figure 2) that can be set:
Most authoritative pages:
Pages close to these pages receive a ranking boost that depends on the
number of clicks from the authoritative page to the page returned in the
result set.
Second-level authoritative pages: These work the same way as most authoritative pages but boost ranking less.
Third-level authoritative pages:
These work the same way as most authoritative and second-level
authoritative pages but boost ranking the least of the three.
There is also one non-authoritative level, which reduces the ranking of entire sites.
2. Result Removal
In some cases,
especially those where very large document repositories are indexed,
undesirable results may appear in the result list. For example, a
document or documents that are buried on a file share may have sensitive
information that can be returned in the search results. Searches for
"password" or "salary" can often surface these documents. The best way
to deal with these documents is, of course, to remove them from the
location they are being indexed or restrict them with permissions.
However, SharePoint offers a simple mechanism for removing results that
are deemed undesirable in the search results. The Search Result Removal
feature in the Search service application has a simple field where the
administrator can add the URLs of undesired documents.
To add documents to the result removal, do the following:
Navigate to the Search service application.
On the left-hand menu under Queries and Results, choose Search Result Removal (Figure 3).
Add the URLs of the documents you want removed from search (Figure 4).
Documents will automatically be removed, and crawl rules will be applied to avoid crawling the documents in future crawls.
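When many URLs have to be removed at once, it can help to tidy the list before pasting it into the field. The sketch below is a hypothetical Python helper, not part of SharePoint: the set of accepted schemes and the case-insensitive duplicate check are assumptions, and the example URLs are invented.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes are worth submitting for removal.
ALLOWED_SCHEMES = {"http", "https", "file"}


def prepare_removal_list(raw_urls):
    """Trim, validate, and de-duplicate URLs; return (cleaned, rejected)."""
    seen = set()
    cleaned, rejected = [], []
    for raw in raw_urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines
        parts = urlsplit(url)
        if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
            rejected.append(url)  # not an absolute URL with a host
            continue
        key = url.lower()  # assumption: URLs are compared without regard to case
        if key not in seen:
            seen.add(key)
            cleaned.append(url)
    return cleaned, rejected


cleaned, rejected = prepare_removal_list([
    "http://intranet/docs/salary.xlsx",
    "  http://intranet/docs/Salary.xlsx ",
    "",
    "passwords.xlsx",
    "file://fileserver/share/passwords.txt",
])
```

The cleaned list can then be joined with line breaks and pasted into the Search Result Removal field, while the rejected entries are checked by hand.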
3. Stop Words
As we saw with Heaps' Law,
the most common terms in any corpus do not carry any value for
searching. Grammatical terms and helper words can convey meaning in
context, but on their own they do not help us find the information we
want. To keep the search engine from ranking documents highly merely
because they have a high density of terms such as "the", these most
common terms are set as stop words, or noise words, in SharePoint's
search engine. SharePoint 2010 has one stop word file for
each language it supports as well as a neutral file. However, the
neutral file is empty by default. All other files have a few of the most
common terms for that language in them. More can be added easily by the
administrator.
The files are located in C:\Program Files\Microsoft Office Servers\14.0\Data\Office Server\Config, which holds a pristine set of files. When a new Search service application is created, these files are copied to that application's own config folder, C:\Program Files\Microsoft Office Servers\14.0\Data\Applications\GUID\Config, where GUID is a placeholder for an application-specific identifier. It is best to edit the files in this second path, on each query server.
The default stop word file for English is called noiseeng.txt and contains the following terms.
a
and
is
in
it
of
the
to
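Because the file is simply a list with one term per line, extending it can be scripted. The following Python sketch is an illustration under stated assumptions rather than a documented procedure: the function names are hypothetical, the encoding default is a guess that should be checked against the existing file, and the last function only imitates the general idea of dropping stop words from a query, not SharePoint's actual processing. Back up the file before changing it.

```python
from pathlib import Path


def load_stop_words(path, encoding="utf-16"):
    """Read a stop word file with one term per line, ignoring blank lines."""
    text = Path(path).read_text(encoding=encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]


def add_stop_words(path, new_terms, encoding="utf-16"):
    """Append terms that are not already present; return the terms actually added."""
    existing = load_stop_words(path, encoding)
    known = {term.lower() for term in existing}
    added = []
    for term in new_terms:
        term = term.strip()
        if term and term.lower() not in known:
            known.add(term.lower())
            added.append(term)
    if added:
        Path(path).write_text("\n".join(existing + added) + "\n", encoding=encoding)
    return added


def strip_stop_words(query, stop_words):
    """Illustrative only: drop stop words from a whitespace-tokenized query."""
    stops = {word.lower() for word in stop_words}
    return [token for token in query.lower().split() if token not in stops]
```

For example, after adding "this" to the default English list, strip_stop_words("the cost of this server", ...) leaves only "cost" and "server", which shows why such terms contribute nothing to matching.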
Terms are added by placing each additional term on its own line of the
file. Common words in the language may be freely added as long as
searching for them would not provide useful results. Some useful
additions might be "this", "that", "these", "those", and "they".