Because the menus on our templates display not only the name of the current posting but also links to other postings, they add noise to the search results. The way to overcome this is to determine when a page is requested by a crawler and then hide the listing controls. However, because search engines traverse sites by following links within the page, we cannot simply remove the navigational controls and leave only the main posting content: we must also create hyperlinks to the channel items (postings and channels) within the current channel.
Interrogating the User Agent
To determine if the site is being crawled, we will create a helper method that checks the Request.Headers["User-Agent"] value and compares it to a list of known search agents stored in the web.config file.
First, we have to set up the list of search agents in the web.config file. Under the <configuration> | <appSettings> element, we will insert an <add> element:
<!-- SharePoint, Google, MSN Search (Separated by | )-->
<add key="SearchUserAgents" value="MS Search|GoogleBot|msnbot" />
Next, we will create a helper class in the Tropical Green project. To do this, follow these steps:
1. Open the TropicalGreen solution in Visual Studio .NET.
2. In the TropicalGreen project, create a new folder called Classes.
3. Right-click on the Classes folder and choose Add | Add Class.
4. Enter the name SearchHelper.cs and click OK.
5. Import the System.Web, System.Configuration, and System.Text.RegularExpressions namespaces:
using System;
using System.Text.RegularExpressions;
using System.Web;
using System.Configuration;

namespace TropicalGreen.Classes
{
    /// <summary>
    /// Summary description for SearchHelper.
    /// </summary>
    public class SearchHelper
    {
        ... code continues ...
    }
}
Let’s add a static method called IsCrawler(), which returns true if the Request.Headers["User-Agent"] value matches a value specified in the web.config:
public static bool IsCrawler()
{
    // Get the list of search agent names from the web.config
    string strUserAgents =
        ConfigurationSettings.AppSettings.Get("SearchUserAgents");

    // Only proceed if the list is neither null nor empty
    if(strUserAgents != null && strUserAgents != "")
    {
        // Regular expression to identify all robots;
        // robot strings must be separated with "|" in the web.config
        Regex reAllSearch = new Regex("(" + strUserAgents + ")",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Get the current user agent
        string currentUserAgent = HttpContext.Current.Request.UserAgent;

        // Some requests do not carry a user agent at all
        if(currentUserAgent == null)
        {
            return false;
        }

        return reAllSearch.Match(currentUserAgent).Success;
    }

    // Agents are not specified in the web.config
    return false;
}
Hiding Navigational Elements
For each of the navigation controls in the Tropical Green solution, we will need to add a check during their loading or rendering to see if the page is being crawled. The easiest place to do this is in the Page_Load() or Render() method of the control: check whether the current request comes from a crawler, hide the control if it does, and otherwise load it as normal:
if (!TropicalGreen.Classes.SearchHelper.IsCrawler())
{
    // An ordinary visitor: bind the data as usual
}
else
{
    // The page is being crawled, so hide the user control
    this.Visible = false;
}
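If you prefer not to touch the data-binding logic in Page_Load(), the same check can be applied when the control renders instead. The following is a minimal sketch of such a Render() override for a navigation user control (it assumes the code-behind already imports System.Web.UI for HtmlTextWriter); when the request comes from a crawler, the control simply emits no markup:

protected override void Render(HtmlTextWriter writer)
{
    // Render the menu for ordinary visitors only; for crawlers we emit
    // nothing, keeping the menu links out of the search index
    if (!TropicalGreen.Classes.SearchHelper.IsCrawler())
    {
        base.Render(writer);
    }
}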
Creating a Posting/Channel Listing User Control
Now that we have hidden the navigational controls, we must still provide a mechanism for the crawlers to traverse the site. The simplest way is to build a user control that lists the postings and channels in the current channel but does not include their display names.
Let’s now build the CrawlingNavigationControl user control:
1. Open the TropicalGreen project in Visual Studio .NET.
2. Under the user controls folder, create a new user control and give it the name CrawlingNavigationControl.
3. Switch to the code-behind file (CrawlingNavigationControl.ascx.cs).
4. Import the following namespaces:

using System.Text;
using Microsoft.ContentManagement.Publishing;
In the user control's Page_Load(), we will add logic to create a new literal control that contains the hyperlinks; the control itself will later be added to all templates and channel rendering scripts. The hyperlink labels will be text that is ignored by the indexer (noise words). In this case, we use the word “and”.
private void Page_Load(object sender, System.EventArgs e)
{
    // Is the site being crawled?
    if(TropicalGreen.Classes.SearchHelper.IsCrawler())
    {
        // Declare and instantiate a StringBuilder to hold the hyperlinks
        StringBuilder sb = new StringBuilder();

        // Loop through all the channel items in the current channel
        foreach(ChannelItem item in CmsHttpContext.Current.Channel.AllChildren)
        {
            // Append a hyperlink, using a noise word as the label
            sb.Append("<a href=\"" + item.Url + "\">and</a>");
        }

        // Instantiate a new literal control
        Literal litLinks = new Literal();

        // Set the literal control's text to the text from the string builder
        litLinks.Text = sb.ToString();

        // Add the literal control to the control collection
        this.Controls.Add(litLinks);
    }
}
To enable the CrawlingNavigationControl, you should add it to the bottom of every template file. Alternatively, if your site uses a footer user control or a header user control on every template, you could place the CrawlingNavigationControl in there.
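As a rough sketch, assuming the user control sits in a UserControls folder within the project (adjust the Src path and TagPrefix to match your own layout), registering and placing the control near the bottom of a template file would look something like this:

<%@ Register TagPrefix="tg" TagName="CrawlingNavigation"
    Src="UserControls/CrawlingNavigationControl.ascx" %>
...
<tg:CrawlingNavigation id="crawlingNavigation" runat="server" />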