Because postings are constructed dynamically, the last modified
date reported for a posting always reflects the current date and time
rather than the date its content actually changed. Therefore, search
engines can’t perform incremental crawls of MCMS sites and always run a
full crawl instead.
Consider the case of a website
that uses the search component of SharePoint Portal Server. When
SharePoint Portal Server performs an incremental index crawl, it sends
an HTTP GET request for every page it encounters on the site.
If it finds a page that has previously been indexed, SharePoint will
update its entry in the index only if the page was modified since the
last index job. To decide whether or not a page has been modified,
SharePoint sends a conditional HTTP GET request, which is a request that
includes an If-Modified-Since HTTP
header with the date and time the page was last modified. If the page
has not been modified, the response will be an HTTP status code 304 (not
modified) and SharePoint will not index the page. Should the response
be HTTP status code 200 (OK), it means that the page has been updated
and SharePoint will proceed to index it. This concept works well for
static websites made up of physical files, where the last modified date
reflects the actual date the page was modified.
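To make the exchange concrete, the sketch below shows, from the crawler’s side, how a conditional GET is prepared in .NET. The URL is just a placeholder and the request is not actually sent here; the helper name BuildConditionalGet is invented for this illustration.

```csharp
using System;
using System.Net;

public class ConditionalGetDemo
{
    // Builds a conditional GET request for the given URL. Setting the
    // IfModifiedSince property causes .NET to emit the If-Modified-Since
    // header, asking the server to reply with 304 (Not Modified) if the
    // page hasn't changed since lastIndexed.
    public static HttpWebRequest BuildConditionalGet(string url,
                                                     DateTime lastIndexed)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.IfModifiedSince = lastIndexed;
        return request;
    }

    public static void Main()
    {
        // Placeholder URL; substitute a page on the site being crawled.
        HttpWebRequest request = BuildConditionalGet(
            "http://localhost/tropicalgreen/columns.htm",
            new DateTime(2005, 1, 1, 12, 0, 0));

        // On GetResponse(), a 200 (OK) means the page changed and should
        // be re-indexed. Note that a 304 surfaces in .NET as a WebException
        // whose Response has StatusCode == HttpStatusCode.NotModified.
        Console.WriteLine(request.IfModifiedSince);
    }
}
```

The fact that .NET raises a WebException for a 304 rather than returning normally is worth remembering if you test this against your own site.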
However, because MCMS pages are assembled on the fly, IIS always
assumes that the last modified date of a posting is the current date
and time. Since the date
stored in the If-Modified-Since HTTP header
will always be an earlier date, IIS always returns a status code 200
(OK) without sending a last modified date. Therefore, the site will
never be incrementally crawled by SharePoint Portal Server, and it will
effectively undergo a full index every time. This behavior occurs with
any search engine that uses the If-Modified-Since HTTP header to perform incremental crawls.
Obviously this isn’t an ideal
situation for the performance of a search engine’s indexing process, nor
is it performance-friendly for the MCMS site. In the case of our
Tropical Green site, the impact is minimal because the overall number of
postings is very low. However, consider a site containing 10,000
postings in which only 10% have changed since the last crawl. Instead of
crawling and indexing just the 1,000 changed pages, SharePoint Portal
Server indexes all 10,000, ten times more content than necessary.
To solve this issue, we
need to modify our templates to return the actual time the posting was
last modified along with an HTTP status code of 200; if the posting
hasn’t changed since the last crawl, we should return an HTTP status
code of 304.
Here’s how we’ll do this:

1. Get the date the posting was last modified.
2. Get the time from the If-Modified-Since HTTP header and convert it to UTC to negate any effect of time zones.
3. Compare both dates to determine whether the posting was modified since SharePoint last indexed it. If it wasn’t, return HTTP status code 304; otherwise return HTTP status code 200.
Let’s see what the code would look like. Open the code-behind file of our Columns.aspx template in the TropicalGreen project.
Add the following to the Page_Load() event handler:
private void Page_Load(object sender, System.EventArgs e)
{
    bool returnCode304 = false;

    // Get the If-Modified-Since HTTP header
    string myString =
        HttpContext.Current.Request.Headers.Get("If-Modified-Since");

    // If this is a conditional HTTP GET...
    if (myString != null)
    {
        // ...compare the dates
        try
        {
            DateTime incrementalIndexTime =
                Convert.ToDateTime(myString).ToUniversalTime();
            // If the conditional date sent by SharePoint is
            // the same as the date the posting was last updated,
            // return HTTP status code 304
            if (incrementalIndexTime.ToString() ==
                CmsHttpContext.Current.Posting.LastModifiedDate.ToString())
            {
                returnCode304 = true;
            }
        }
        catch {}

        // If the content didn't change,
        // return 304 and stop processing the page
        if (returnCode304)
        {
            Response.StatusCode = 304;
            Response.End();
        }
    }
}
As a final step to get the
above method working, we need to ensure that the posting’s last
modified date is sent back to the client in the Last-Modified HTTP header. This can be done with the following code in Page_Load():
private void Page_Load(object sender, System.EventArgs e)
{
    // This is the code that causes ASP.NET to send the Last-Modified header.
    Response.Cache.SetLastModified(
        CmsHttpContext.Current.Posting.LastModifiedDate.ToLocalTime());

    // Put user code to initialize the page here
    . . . code continues . . .
}
Now let’s build our
Tropical Green project and make sure that it compiles correctly. When
the build finishes, open a browser and navigate through the site to one
of the postings in the plant catalog. If the posting is rendered as
expected, we’re in good shape and can proceed. Otherwise, double-check
the code you added for any inconsistencies.
With
our Plant template updated, you should go through the rest of the
templates and add the same code as above. An alternative to copying this
code into every template’s code-behind would be to create a helper
class, which can be called in each Page_Load() event to determine if the 304 status code should be returned.
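As a sketch of that approach (the names SearchCrawlHelper and ShouldReturn304 are invented here; the date comparison mirrors the template code above):

```csharp
using System;

public class SearchCrawlHelper
{
    // Returns true when the response should be a 304 (Not Modified):
    // that is, when the If-Modified-Since date sent by the crawler matches
    // the posting's last modified date. Pass null when the header is absent.
    public static bool ShouldReturn304(string ifModifiedSinceHeader,
                                       DateTime lastModifiedDate)
    {
        if (ifModifiedSinceHeader == null)
        {
            return false; // not a conditional GET
        }
        try
        {
            DateTime incrementalIndexTime =
                Convert.ToDateTime(ifModifiedSinceHeader).ToUniversalTime();
            // Second-level comparison, as in the template code
            return incrementalIndexTime.ToString() ==
                   lastModifiedDate.ToString();
        }
        catch (FormatException)
        {
            return false; // unparsable date; treat as a normal GET
        }
    }
}
```

Each template’s Page_Load() could then reduce to something like:

```csharp
if (SearchCrawlHelper.ShouldReturn304(
        Request.Headers.Get("If-Modified-Since"),
        CmsHttpContext.Current.Posting.LastModifiedDate))
{
    Response.StatusCode = 304;
    Response.End();
}
```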
Dealing with Output Caching
Another thing to take into account is caching. If you are using the @OutputCache
page directive, you’ll need to remove it from your template files. You
can’t use standard output caching in combination with HTTP 304 handling
because the Page_Load() event, and all
the code that we have just written within it to enable incremental
crawls, would not be executed if the item is already in the cache.
When cached content is
delivered, you will always get back an HTTP 200 response. There are two
ways to get around this. If output caching is required, you would need
to implement it manually, as discussed in MCMS and RSS.
Or, you can set the
cache headers in the HTTP response to ensure that the pages are cached
in the browser and in downstream proxy servers. This can be achieved by
adding the Cache-Control header to the response and also setting the cache expiry time in the Page_Load() event handler immediately after the code above:
private void Page_Load(object sender, System.EventArgs e)
{
    . . . code continues . . .

    // Public header to allow caching in proxy servers and in the browser cache
    Response.Cache.SetCacheability(System.Web.HttpCacheability.Public);
    // Set how long the content should be cached in a downlevel cache
    Response.Cache.SetExpires(System.DateTime.Now.AddMinutes(5));
    Response.Cache.SetValidUntilExpires(true);
}
As this code is
likely to be used in several templates, in practice, it’d be a good idea
to put it in a helper class to eliminate code duplication.