When the query analysis phase of optimization is complete and all SARGs, OR
clauses, and join clauses have been identified, the next step is to
determine the selectivity of the expressions (that is, the estimated
number of matching rows) and to determine the cost of finding the rows.
The costs are measured primarily in terms of logical and physical I/O,
with the goal of generating a query plan that results in the lowest
estimated I/O and processing cost. Primarily, the Query Optimizer
attempts to identify whether an index exists that can be used to locate
the matching rows. If multiple indexes or search strategies can be
considered, their costs are compared with each other and also against
the cost of a table or clustered index scan to determine the least
expensive access method.
An index is
typically considered useful for an expression if the first column in the
index is used in the expression and the search argument in the
expression provides a means to effectively limit the search. If no
useful indexes are found for an expression, typically a table or
clustered index scan is performed on the table. A table or clustered
index scan is the fallback tactic for the Query Optimizer to use if no
lower-cost method exists for returning the matching rows from a table.
Evaluating SARG and Join Selectivity
To determine selectivity of a
SARG, which helps in determining the most efficient query plan, the
Query Optimizer uses the statistical information stored for the index or
column, if any. If no statistics are available for a column or index,
SQL Server automatically creates statistics on nonindexed columns
specified in a SARG if the AUTO_CREATE_STATISTICS
option is enabled for the database. SQL Server also automatically
generates and updates the statistics for any indexed columns referenced
in a SARG if the AUTO_UPDATE_STATISTICS
option is enabled. In addition, you can explicitly create statistics for
a column or set of columns in a table or an indexed view by using the CREATE STATISTICS command. Both index statistics and column statistics (whether created automatically or manually with the CREATE STATISTICS command) are maintained and kept up-to-date, as needed, if the AUTO_UPDATE_STATISTICS option is enabled or if the UPDATE STATISTICS
command is explicitly run for a table, index, or column statistics.
Available and up-to-date statistics allow the Query Optimizer to more
accurately assess the cost of different query plans and choose a
high-quality plan.
If no statistics are available for a column or an index and the AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS
options have been disabled for the database or table, SQL Server cannot
make an informed estimate of the number of matching rows for a SARG and
resorts to using some built-in percentages for the number of matching
rows for various types of expressions. These percentages currently are
as follows:
Operator | Row Estimate |
---|
= | (# of rows in table).75 |
between, > and < | 9% (closed-range search) |
>, <, >=, <= | 30% (open-range search) |
Using these default
percentages almost certainly results in inappropriate query execution
plans being chosen. You should always try to ensure that you have
up-to-date statistics available for any columns referenced in your SARGs
and join clauses.
When the value of a SARG can be
determined at the time of query optimization, the Query Optimizer uses
the statistics histogram to estimate the number of matching rows for the
SARG. The histogram contains a sampling of the data values in the
column and stores information on the number of matching rows for the
sampled values, as well as for values that fall between the sampled
values. If the statistics are up-to-date, this is the most accurate
estimate of the number of matching rows for a SARG.
If the SARG contains an
expression that cannot be evaluated until runtime (for example, a local
variable or scalar function) but is an equality expression (=),
the Query Optimizer uses the density information from the statistics to
estimate the number of matching rows. The density value reflects the
overall uniqueness of the data values in the column or index. Density
information does not estimate the number of matching rows as accurately
as the histogram because its value is determined across the entire range
of values in a column or an index key and can be skewed higher by one
or more values that have a high number of duplicates. Expressions that
cannot be evaluated until runtime include comparisons against local
variables or function expressions that cannot be evaluated until query
execution.
If an expression cannot be
evaluated at the time of optimization and the SARG is not an equality
search but a closed- or open-range search, the density information
cannot be used. The same percentages are used for the row estimates as
when no statistics are available (9% for a closed-range search and 30%
for an open-range search).
As a special case, if a SARG contains the equality (=)
operator and a unique index exists that matches the SARG, based on the
nature of a unique index, the Query Optimizer knows, without having to
analyze the index statistics, that one and only one row can match the
SARG.
If the query contains
a join clause, SQL Server determines whether any usable indexes or
column statistics exist that match the column(s) in the join clause.
Because the Query Optimizer has no way of determining what value(s) will
join between rows in the table at optimization
time, it can’t use the statistics histogram on the join column to
estimate the number of matching rows. Instead, it uses the density
information, as it does for SARGs that are unknown during optimization.
A lower density value
indicates a more selective index. As the density approaches 1, the join
condition becomes less selective. For example, if a nonclustered index
has a high density value, it will likely be more expensive in terms of
I/O to retrieve the matching rows using the nonclustered index than to
perform a table scan or clustered index scan and the index likely will
not be used.
SARGs and Inequality Operators
In previous versions of SQL Server, when a SARG contained an inequality operator (!= or <>),
the selectivity of the SARG could not be determined effectively for the
simple reason that index or column statistics can help you estimate
only the number of matching rows for a specific value, not the number of
nonmatching rows. However, for some SARGs with inequality operators, if
index or column statistics are available, SQL Server 2008 is able to
estimate the number of matching rows. For example, consider the
following SARG:
Without any available index or column statistics on the qty
column, SQL Server would treat the inequality SARG as a SARG with no
available statistics. Potentially every row in the table could satisfy
the search criteria, so it would estimate the number of matching rows as
all rows in the table.
However, if index or column statistics were available for the qty column, the Query Optimizer would look up the search value (1000)
in the statistics and estimate the number of matching rows for the
search value and then determine the number of matching rows for the
query as the total number of rows in the table minus the estimated
number of matching rows for the search value. For example, if there are
150,000 rows in the table and the statistics indicate that 1,570 rows
match, where qty = 1000, the number of matching rows would be calculated as follows:
150,000 | rows |
− 1,570 | rows (where qty = 1000) |
= 148,430 | rows (where qty <> 1000) |
In this example, with the large number of estimated rows where qty <> 1000,
SQL Server would likely end up performing a table scan to resolve the
query. However, if the Query Optimizer estimates that there is a very
small number of rows where qty <> 1000, the Query Optimizer might determine that it would be more efficient to use an index to find the nonmatching rows. You may be wondering how SQL Server efficiently searches the index for the rows where qty <> 1000
without having to look at every row. In this case, internally, it
converts the inequality SARG into two range retrievals by using an OR condition:
WHERE qty < 1000 OR qty > 1000
Note
Even if an inequality
SARG is optimizable, that doesn’t necessarily mean an index will be
used. It simply allows the Query Optimizer to make a more accurate
estimate of the number of rows that will match a given SARG. More often
than not, an inequality SARG will result in a table or clustered index
scan. You should try to avoid using inequality SARGs whenever possible.
SARGs and LIKE Clauses
In SQL Server versions prior to SQL Server 2005, the Query Optimizer would estimate the selectivity of a LIKE
clause only if the first character in the string was a constant. Every
row would have to be examined to determine if it was a match. SQL Server
2008 uses string summary statistics, which were introduced in SQL
Server 2005, for estimating the selectivity of LIKE conditions.
String summary
statistics provide a statistical summary of substring frequency
distribution for character columns. String summary statistics can be
created on columns of type text, ntext, char, varchar, and nvarchar. String summary statistics allow SQL Server to estimate the selectivity of LIKE
conditions where the search string may have any number of wildcards in
any combination, including when the first character is a wildcard. In
versions of SQL Server prior to 2005, row estimates could not be
accurately obtained when the leading character of a search string was a
wildcard character. SQL Server 2008 can estimate the selectivity of LIKE predicates similar to the following:
The string summary statistics
result in fairly accurate row estimates. However, if there is a
user-specified escape character in a LIKE pattern (for example, stor_name LIKE '%abc#_%' ESCAPE '#'), SQL Server 2008 has to guess at the selectivity of the SARG.
The values generated for string summary statistics are not visible via DBCC SHOW_STATISTICS. However, DBCC SHOW_STATISTICS does indicate if string summary statistics have been calculated; if the value YES is specified in the String Index field in the first rowset returned by DBCC SHOW_STATISTICS, the statistics also include a string summary. Also, if the
strings are more than 80 characters in length, only the first and last
40 characters are used for creating the string summary statistics.
Accurate frequency estimates cannot be determined for substrings that do
not appear in the first and last 40 characters of a string.
SARGS on Computed Columns
In versions of SQL Server
prior to 2005, for a SARG to be optimizable, there had to be no
computations on the column itself in the SARG. In SQL Server 2008,
expressions involving computations on a column might be treated as SARGs
during optimization if SQL Server can simplify the expression into a
SARG. For example, the SARG
can be simplified to this:
The simplified expression is
used only during optimization to determine an estimate of the number of
matching rows and the usefulness of the index. During actual execution,
the conversion is not done while traversing the index tree because it
won’t be able to do the repeated division by 12 for each row while
searching through the tree. However, doing the conversion during
optimization and getting a row estimate from the statistics helps the
Query Optimizer decide on other strategies to consider, such as index
scanning versus table scanning, or it might help to determine an optimal
join order if it’s a multitable query.
SQL Server 2008 supports the
creation, update, and use of statistics on computed columns. The Query
Optimizer can make use of the computed column statistics even when a
query doesn’t reference the computed column by name but rather contains
an expression that matches the computed column expression. This feature
avoids the need to rewrite the SARGs in queries with expressions that
match a computed column expression to SARGs that explicitly contain the
computed column itself.
When the SARG has a more complex
operation performed on it, such as a function, it can potentially
prevent effective optimization of the SARG. If you cannot avoid using a
function or complex expression on a column in the search expression, you
should consider creating a computed column on the table and creating an
index on the computed column. This materializes the function result
into an additional column on the table that can be indexed for faster
searching, and the index statistics can be used to better estimate the
number of matching rows for the SARG expression that references the
function.
An example of using this approach
would be for a query that has to find the number of orders placed in a
certain month, regardless of the year. The following is a possible
solution:
select distinct stor_id
from sales
where datepart(month, ord_date) = 6
This query gets the correct result set but ends up having to do so with a full table or index scan because the function on the ord_date column prevents the Query Optimizer from using an index seek against any index that might exist on the ord_date column.
If this query is used
frequently in the system and quick response time is critical, you could
create a computed column on the function and index it as follows:
alter table sales add ord_month as datepart(month, ord_date)
create index nc_sales_ordmonth on sales(ord_month)
Now, when you run the query on the table again, if you specify the computed column in the WHERE
clause, the Query Optimizer can use the index on the computed column to
accurately estimate the number of matching rows and possibly use the
nonclustered index to find the matching rows and avoid a table scan, as
it does for the following query:
select distinct stor_id
from sales
where ord_month = 6
Even if the query still
ends up using a table scan, it at least has statistics available to know
how many rows it can expect to match where the month matches the value
specified. In addition, if a computed column exists that exactly matches
the SARG expression, SQL Server 2008 can still use the statistics and
index on the computed column to optimize the query, even if the computed
column is not specified in the query itself. For example, with the ord_month column defined on the sales table and an index created on it, the following query can also use the statistics and index to optimize the query:
select distinct stor_id
from sales
where datepart(month, ord_date) = 6
Tip
The automatic matching
of computed columns in SQL Server 2008 enables you to create and
exploit computed columns without having to change the queries in your
application. Be aware, though, that computed column matching is based on
identical comparison. For example, a computed column of the form A + B +
C does not match an expression of the form A + C + B.