SQL Server 2008 R2 : Join Selection (part 2) - Determining the Optimal Join Order & Subquery Processing

9/20/2011 5:32:49 PM

Determining the Optimal Join Order

In addition to determining the best join strategies, the Query Optimizer also evaluates and determines the optimal join order that would result in the most efficient query plan. In the query’s execution plan, you might find that the order of the tables in the execution plan is a different order than specified in the query. Regardless of the join strategy used, the Query Optimizer needs to determine which table is the outer input and which is the inner input to the join strategy chosen. For example, consider the following query:

select a.au_lname, t.title, pubdate
   from authors a
   join titleauthor ta on a.au_id = ta.au_id
   join titles t on ta.title_id = t.title_id

In addition to the possible access paths and join strategies available, the server can consider the following pool of possible join orders:

authors → titleauthor → titles
titles → titleauthor → authors
titleauthor → titles → authors
titleauthor → authors → titles
authors → titles → titleauthor
titles → authors → titleauthor

For each of these join orders, the Query Optimizer considers the various access paths available for each table as well as the different join strategies available. For example, the Query Optimizer could consider the following possible join strategies:

Perform a table scan on the authors table and for each row perform an index seek against the auidind index on titleauthor to find the matching rows by au_id. And for each matching row in titleauthor, perform an index seek against the primary key of the titles table to find the matching rows in titles by title_id.
Perform a table scan on the titles table and for each row perform an index seek against the titleidind index on titleauthor to find the matching rows by title_id. And for each matching row in titleauthor, perform an index seek against the primary key of the authors table to find the matching rows in authors by au_id.
Perform an index scan of the titleidind of the titleauthor table and use a hash join to match it with a clustered index scan of the titles table. And for each of the qualifying rows from this hash join, perform another hash join with an index scan of the aunmind index of the authors table.

Note

If you run this query yourself and examine the query plan, you’ll likely see that the third alternative is the one chosen by the Query Optimizer. Index scans are performed on the authors and titleauthor tables because the nonclustered indexes on those tables cover the join query. That is, the nonclustered indexes contain all the columns necessary to satisfy the join conditions as well as the requested result columns.

These are just three of the possibilities. There are many more options for the Query Optimizer to consider as execution plans for processing this join. For example, for each of the three options, there are other indexes to consider, and there are other possible join orders and strategies to consider as well.

As you can see, there can be a large number of execution plan options for the Query Optimizer to consider for processing a join, and this example is a relatively simple three-table join. The number of options increases exponentially as the number of tables involved in the query increases.

Subquery Processing

SQL Server optimizes subqueries differently, depending on how they are written. For example, SQL Server attempts to flatten some subqueries into joins when possible, to allow the Query Optimizer to select the optimal join order rather than be forced to process the query inside-out. The following sections examine the different types of subqueries and how SQL Server optimizes them.

IN, ANY, and EXISTS Subqueries

In SQL Server, any query that contains a subquery introduced with an IN, = ANY, or EXISTS predicate is usually flattened into an existence join unless the outer query also contains an OR clause or unless the subquery is correlated or contains one or more aggregates.

An existence join is optimized the same way as a regular join, with one exception: With an existence join, as soon as a matching row is found in the inner table, the value TRUE is returned, and SQL Server stops looking for further matches for that row in the outer table and moves on to the next row. A normal join would continue processing to find all matching rows. The following query provides an example of an existence join and a quantified predicate subquery that will be converted to an existence join:

select pub_name from publishers p
   where exists (select 1 from titles t
                    where type = 'business'
                    and t.pub_id = p.pub_id)

select pub_name from publishers
   where pub_id in (select pub_id from titles
                     where type = 'business')

Figure 5 shows an example of the execution plan for both of these queries. You can see that the query plans are the same, providing proof that the quantified predicate subquery is being flattened into an existence join.

Figure 5. An execution plan for an existence join and a quantified predicate subquery flattened into an existence join.

Materialized Subqueries

If an outer query is comparing a column against the result of a subquery using any of the comparison operators (=, >, <, >=, <=, !=), and the subquery is not correlated, the results of the subquery are often resolved—that is, materialized—before comparison against the outer table column. For these types of queries, the Query Optimizer processes the query inside-out.

An example of this type of query is as follows:

select title from titles
   where ytd_sales = (select max(qty) from sales)

In this example, the subquery is resolved first to find the maximum qty value from the sales table to compare against ytd_sales in the outer query. Figure 6 shows an example of a query plan for this materialized subquery.

Figure 6. An execution plan for a materialized subquery.

The following query is an interesting case in which the subquery is not materialized first:

select title from titles
   where ytd_sales = (select max(ytd_sales) from titles)

In this example, with no index on the ytd_sales column, the Query Optimizer recognizes that a table scan is required on the titles table to find the maximum ytd_sales value. Rather than run the subquery first using a table scan and then use the value returned to perform another lookup against the titles table, it simply scans the titles table and returns and sorts the ytd_sales value in descending order. It then simply returns the rows with the top matching values because these rows are the ones where the ytd_sales value is the maximum. Figure 7 shows an example of a query plan for this subquery processing strategy.

Figure 7. An execution plan for a subquery flattened into a table scan with sort.

Correlated Subqueries

A correlated subquery contains a reference to an outer table in a join clause in the subquery. Following is an example of a correlated subquery:

SELECT title_id, price
FROM titles
WHERE ytd_sales IN
    (SELECT qty
    FROM sales
    WHERE titles.title_id = sales.title_id)

Because correlated subqueries depend on values from the outer query for resolution, they cannot be processed independently. Instead, SQL Server usually processes correlated subqueries repeatedly, once for each qualifying outer row. Often, a correlated subquery looks like a nested loop join. A sample execution plan for the preceding correlated subquery example is shown in Figure 8. Notice that an inner join using a left semi join is performed. Semi joins are joins that return rows from one table based on the existence of related rows in the other table. A left semi join operation returns each row from the first (top or left) input when there is a matching row in the second (bottom or right) input. If the attributes are returned from the bottom (or right) table, it’s referred to as a right semi join.

Figure 8. An execution plan for a correlated subquery.

Note

The inverse of a semi join is an anti–semi join. An anti–semi join looks for rows in one table based on their nonexistence in the other, such as for a NOT IN or NOT EXISTS type subquery, or for some outer join queries.

However, if there is no useful index on the correlated columns to find the matching rows, the Query Optimizer may choose to perform a single search against each table separately and perform a hash join against the results.

Related -----------------

- SQL Server 2008 R2 : Join Selection (part 2) - Determining the Optimal Join Order & Subquery Processing

- SQL Server 2008 R2 : Join Selection (part 1) - Join Processing Strategies

Other -----------------

- Managing Microsoft Windows Server 2003 Disk Storage : Configuring Disks and Volumes

- Managing Microsoft Windows Server 2003 Disk Storage : Understanding Disk Storage Options

- Microsoft Lync Server 2010 Edge : Reverse Proxy

- Microsoft Lync Server 2010 Edge : Edge Configuration

- Managing Exchange Server 2010 Clients : Checking Private and Public Folders with IMAP4 and UNIX Mail Servers

- Managing Exchange Server 2010 Clients : Leaving Mail on the Server with POP3

- Securing Windows Server 2008 R2 : Encrypting File System

- Securing Windows Server 2008 R2 : Auditing

- Microsoft Dynamics NAV : Business Intelligence - Reporting capabilities in NAV

- Microsoft Dynamics NAV and Business Intelligence