SQL Server 2008 R2 : Row Estimation and Index Selection (part 1) - Evaluating SARG and Join Selectivity

9/10/2011 4:13:15 PM

When the query analysis phase of optimization is complete and all SARGs, OR clauses, and join clauses have been identified, the next step is to determine the selectivity of the expressions (that is, the estimated number of matching rows) and to determine the cost of finding the rows. The costs are measured primarily in terms of logical and physical I/O, with the goal of generating a query plan that results in the lowest estimated I/O and processing cost. Primarily, the Query Optimizer attempts to identify whether an index exists that can be used to locate the matching rows. If multiple indexes or search strategies can be considered, their costs are compared with each other and also against the cost of a table or clustered index scan to determine the least expensive access method.

An index is typically considered useful for an expression if the first column in the index is used in the expression and the search argument in the expression provides a means to effectively limit the search. If no useful indexes are found for an expression, typically a table or clustered index scan is performed on the table. A table or clustered index scan is the fallback tactic for the Query Optimizer to use if no lower-cost method exists for returning the matching rows from a table.

Evaluating SARG and Join Selectivity

To determine selectivity of a SARG, which helps in determining the most efficient query plan, the Query Optimizer uses the statistical information stored for the index or column, if any. If no statistics are available for a column or index, SQL Server automatically creates statistics on nonindexed columns specified in a SARG if the AUTO_CREATE_STATISTICS option is enabled for the database. SQL Server also automatically generates and updates the statistics for any indexed columns referenced in a SARG if the AUTO_UPDATE_STATISTICS option is enabled. In addition, you can explicitly create statistics for a column or set of columns in a table or an indexed view by using the CREATE STATISTICS command. Both index statistics and column statistics (whether created automatically or manually with the CREATE STATISTICS command) are maintained and kept up-to-date, as needed, if the AUTO_UPDATE_STATISTICS option is enabled or if the UPDATE STATISTICS command is explicitly run for a table, index, or column statistics. Available and up-to-date statistics allow the Query Optimizer to more accurately assess the cost of different query plans and choose a high-quality plan.

If no statistics are available for a column or an index and the AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS options have been disabled for the database or table, SQL Server cannot make an informed estimate of the number of matching rows for a SARG and resorts to using some built-in percentages for the number of matching rows for various types of expressions. These percentages currently are as follows:

Operator	Row Estimate
=	(# of rows in table)^.75
between, > and <	9% (closed-range search)
>, <, >=, <=	30% (open-range search)

Using these default percentages almost certainly results in inappropriate query execution plans being chosen. You should always try to ensure that you have up-to-date statistics available for any columns referenced in your SARGs and join clauses.

When the value of a SARG can be determined at the time of query optimization, the Query Optimizer uses the statistics histogram to estimate the number of matching rows for the SARG. The histogram contains a sampling of the data values in the column and stores information on the number of matching rows for the sampled values, as well as for values that fall between the sampled values. If the statistics are up-to-date, this is the most accurate estimate of the number of matching rows for a SARG.

If the SARG contains an expression that cannot be evaluated until runtime (for example, a local variable or scalar function) but is an equality expression (=), the Query Optimizer uses the density information from the statistics to estimate the number of matching rows. The density value reflects the overall uniqueness of the data values in the column or index. Density information does not estimate the number of matching rows as accurately as the histogram because its value is determined across the entire range of values in a column or an index key and can be skewed higher by one or more values that have a high number of duplicates. Expressions that cannot be evaluated until runtime include comparisons against local variables or function expressions that cannot be evaluated until query execution.

If an expression cannot be evaluated at the time of optimization and the SARG is not an equality search but a closed- or open-range search, the density information cannot be used. The same percentages are used for the row estimates as when no statistics are available (9% for a closed-range search and 30% for an open-range search).

As a special case, if a SARG contains the equality (=) operator and a unique index exists that matches the SARG, based on the nature of a unique index, the Query Optimizer knows, without having to analyze the index statistics, that one and only one row can match the SARG.

If the query contains a join clause, SQL Server determines whether any usable indexes or column statistics exist that match the column(s) in the join clause. Because the Query Optimizer has no way of determining what value(s) will join between rows in the table at optimization time, it can’t use the statistics histogram on the join column to estimate the number of matching rows. Instead, it uses the density information, as it does for SARGs that are unknown during optimization.

A lower density value indicates a more selective index. As the density approaches 1, the join condition becomes less selective. For example, if a nonclustered index has a high density value, it will likely be more expensive in terms of I/O to retrieve the matching rows using the nonclustered index than to perform a table scan or clustered index scan and the index likely will not be used.

SARGs and Inequality Operators

In previous versions of SQL Server, when a SARG contained an inequality operator (!= or <>), the selectivity of the SARG could not be determined effectively for the simple reason that index or column statistics can help you estimate only the number of matching rows for a specific value, not the number of nonmatching rows. However, for some SARGs with inequality operators, if index or column statistics are available, SQL Server 2008 is able to estimate the number of matching rows. For example, consider the following SARG:

WHERE qty <> 1000

Without any available index or column statistics on the qty column, SQL Server would treat the inequality SARG as a SARG with no available statistics. Potentially every row in the table could satisfy the search criteria, so it would estimate the number of matching rows as all rows in the table.

However, if index or column statistics were available for the qty column, the Query Optimizer would look up the search value (1000) in the statistics and estimate the number of matching rows for the search value and then determine the number of matching rows for the query as the total number of rows in the table minus the estimated number of matching rows for the search value. For example, if there are 150,000 rows in the table and the statistics indicate that 1,570 rows match, where qty = 1000, the number of matching rows would be calculated as follows:

150,000	rows
− 1,570	rows (where qty = 1000)
= 148,430	rows (where qty <> 1000)

In this example, with the large number of estimated rows where qty <> 1000, SQL Server would likely end up performing a table scan to resolve the query. However, if the Query Optimizer estimates that there is a very small number of rows where qty <> 1000, the Query Optimizer might determine that it would be more efficient to use an index to find the nonmatching rows. You may be wondering how SQL Server efficiently searches the index for the rows where qty <> 1000 without having to look at every row. In this case, internally, it converts the inequality SARG into two range retrievals by using an OR condition:

WHERE qty < 1000 OR qty > 1000

Note

Even if an inequality SARG is optimizable, that doesn’t necessarily mean an index will be used. It simply allows the Query Optimizer to make a more accurate estimate of the number of rows that will match a given SARG. More often than not, an inequality SARG will result in a table or clustered index scan. You should try to avoid using inequality SARGs whenever possible.

SARGs and LIKE Clauses

In SQL Server versions prior to SQL Server 2005, the Query Optimizer would estimate the selectivity of a LIKE clause only if the first character in the string was a constant. Every row would have to be examined to determine if it was a match. SQL Server 2008 uses string summary statistics, which were introduced in SQL Server 2005, for estimating the selectivity of LIKE conditions.

String summary statistics provide a statistical summary of substring frequency distribution for character columns. String summary statistics can be created on columns of type text, ntext, char, varchar, and nvarchar. String summary statistics allow SQL Server to estimate the selectivity of LIKE conditions where the search string may have any number of wildcards in any combination, including when the first character is a wildcard. In versions of SQL Server prior to 2005, row estimates could not be accurately obtained when the leading character of a search string was a wildcard character. SQL Server 2008 can estimate the selectivity of LIKE predicates similar to the following:

au_lname LIKE 'Smith%'
stor_name LIKE '%Books'
title LIKE '%Cook%'
title_id LIKE 'BU[1234567]001'
title LIKE '%Cook%Chicken'

The string summary statistics result in fairly accurate row estimates. However, if there is a user-specified escape character in a LIKE pattern (for example, stor_name LIKE '%abc#_%' ESCAPE '#'), SQL Server 2008 has to guess at the selectivity of the SARG.

The values generated for string summary statistics are not visible via DBCC SHOW_STATISTICS. However, DBCC SHOW_STATISTICS does indicate if string summary statistics have been calculated; if the value YES is specified in the String Index field in the first rowset returned by DBCC SHOW_STATISTICS, the statistics also include a string summary. Also, if the strings are more than 80 characters in length, only the first and last 40 characters are used for creating the string summary statistics. Accurate frequency estimates cannot be determined for substrings that do not appear in the first and last 40 characters of a string.

SARGS on Computed Columns

In versions of SQL Server prior to 2005, for a SARG to be optimizable, there had to be no computations on the column itself in the SARG. In SQL Server 2008, expressions involving computations on a column might be treated as SARGs during optimization if SQL Server can simplify the expression into a SARG. For example, the SARG

ytd_sales/12 = 1000

can be simplified to this:

ytd_sales = 12000

The simplified expression is used only during optimization to determine an estimate of the number of matching rows and the usefulness of the index. During actual execution, the conversion is not done while traversing the index tree because it won’t be able to do the repeated division by 12 for each row while searching through the tree. However, doing the conversion during optimization and getting a row estimate from the statistics helps the Query Optimizer decide on other strategies to consider, such as index scanning versus table scanning, or it might help to determine an optimal join order if it’s a multitable query.

SQL Server 2008 supports the creation, update, and use of statistics on computed columns. The Query Optimizer can make use of the computed column statistics even when a query doesn’t reference the computed column by name but rather contains an expression that matches the computed column expression. This feature avoids the need to rewrite the SARGs in queries with expressions that match a computed column expression to SARGs that explicitly contain the computed column itself.

When the SARG has a more complex operation performed on it, such as a function, it can potentially prevent effective optimization of the SARG. If you cannot avoid using a function or complex expression on a column in the search expression, you should consider creating a computed column on the table and creating an index on the computed column. This materializes the function result into an additional column on the table that can be indexed for faster searching, and the index statistics can be used to better estimate the number of matching rows for the SARG expression that references the function.

An example of using this approach would be for a query that has to find the number of orders placed in a certain month, regardless of the year. The following is a possible solution:

select distinct stor_id
   from sales
   where datepart(month, ord_date) = 6

This query gets the correct result set but ends up having to do so with a full table or index scan because the function on the ord_date column prevents the Query Optimizer from using an index seek against any index that might exist on the ord_date column.

If this query is used frequently in the system and quick response time is critical, you could create a computed column on the function and index it as follows:

alter table sales add ord_month as datepart(month, ord_date)
create index nc_sales_ordmonth on sales(ord_month)

Now, when you run the query on the table again, if you specify the computed column in the WHERE clause, the Query Optimizer can use the index on the computed column to accurately estimate the number of matching rows and possibly use the nonclustered index to find the matching rows and avoid a table scan, as it does for the following query:

select distinct stor_id
   from sales
   where ord_month = 6

Even if the query still ends up using a table scan, it at least has statistics available to know how many rows it can expect to match where the month matches the value specified. In addition, if a computed column exists that exactly matches the SARG expression, SQL Server 2008 can still use the statistics and index on the computed column to optimize the query, even if the computed column is not specified in the query itself. For example, with the ord_month column defined on the sales table and an index created on it, the following query can also use the statistics and index to optimize the query:

select distinct stor_id
   from sales
   where datepart(month, ord_date) = 6

Tip

The automatic matching of computed columns in SQL Server 2008 enables you to create and exploit computed columns without having to change the queries in your application. Be aware, though, that computed column matching is based on identical comparison. For example, a computed column of the form A + B + C does not match an expression of the form A + C + B.