Query Path 5 — Bookmark Lookup
This bookmark lookup query path is a
two-edged sword. For queries returning a small number of records, it's
an acceptable query path, but for the queries that return a significant
amount of records , this query path can significantly hinder
performance.
To demonstrate a bookmark lookup query path, the following query filters by ProductID while returning all the base table's columns:
SELECT *
FROM Production.WorkOrder
WHERE ProductID = 757;
To rephrase the query in pseudo-code, find the rows for Product 757 and give me all the columns for those rows.
There is an index on the ProductID column, so the query optimizer has two possible options:
- Scan the entire clustered index to access all the columns, and then
filter the results to find the right rows. Essentially, this would be
the same as Query Path 4.
Or:
- Perform an index seek on the IX_Workload_ProductID index to fetch the 11 rows. In the process it learns the WorkOrderID
values for those 11 rows because the clustered index key columns are in
the leaf level of the nonclustered index. Then it can index seek those
11 rows from the clustered index to fetch the other columns.
This jump, from the nonclustered index
used to find the rows to the clustered index to complete the columns
needed for the query, is called a bookmark lookup as shown in Figure 6.
The real cost of the bookmark lookup is finding
the rows that are typically scattered throughout the base table, which
is a clustered index in this case. Locating the 11 rows in the
nonclustered index was a single page hit, but those 11 rows might be on
11 different pages in the clustered index. With a larger number of
selected rows, the problem intensifies. Selecting 1,000 rows with a
bookmark lookup might mean reading three to four pages from the
nonclustered index and then reading more than 1,000 pages from the
clustered index and leaf level. Eventually, SQL Server decides that the
bookmark lookup is more expensive than just scanning the clustered
index.
The query execution plan for a bookmark lookup shows the two indexes as data sources for a nested loop join (as shown in Figure 7).
For each row that comes from the seek of the nonclustered index, the
nested loop join is requesting the matching rows from the clustered
index by calling the key lookup.
It's a common saying that Select *
is a poor practice because it returns too many, and often unnecessary,
columns in the result set — the extra data is considered wasteful if it
is not needed. I agree that Select *
is a poor practice, but often the reason isn't due to the extra network
traffic; it's the bookmark lookup that is almost always generated by a Select *.
However, the extra network traffic associated with bringing back more
data than is needed can significantly hinder application throughput.
This query builds on the last bookmark lookup
query and sheds a little more light on the bookmark lookup problem; the
difference is that this query requests only one column that's not
available from the nonclustered index.
SELECT WorkOrderID, StartDate
FROM Production.WorkOrder
WHERE ProductID = 757;
Consider the performance difference (again, refer to Table 1) between this query path and the select * bookmark lookup query path. Their performance is nearly identical.
It doesn't take many columns to force a bookmark
lookup; a single column missing from the nonclustered index means SQL
Server must also look to the clustered index to solve the query.
There are only two ways to avoid the bookmark lookup problem:
- Filter by the clustered index key columns, so the query can be satisfied using the clustered index (Query Path 2 or 3).
- Design a covering index (the next query path).
Query Path 6 — Covering Index
If a nonclustered index includes every column required by the query (and that means every column referenced by the query: select columns, join on condition columns, group by columns, where
clause columns, and windowing columns), SQL Server's query optimizer
can choose to execute the query using only that nonclustered index.
When this occurs the index is said to cover the needs of the query, in
other words, it's a covering index.
This is an important concept to grasp to
understand how nonclustered indexes operate. A nonclustered index is a
separate data structure than the base table. By default, nonclustered
indexes have the same number of rows as the base table. In this aspect,
you can think of a nonclustered index as a smaller sorted table — one
that you can access quickly. When it makes sense, the query optimizer
can choose to use ONLY this separate nonclustered index
structure to satisfy a query. If this happens, the base table isn't
even used in the query — only the nonclustered index.
A covering index is a concept that applies only
to nonclustered indexes and only in the context of a query. There is no
such thing as a covering index as a single entity; it is applicable
only in the context of a query that uses the index.
Query Path 5's second query selected the StartDate column. Because StartDate isn't part of the IX_WorkOrder_ProductID index, SQL Server was forced to use a bookmark lookup. To solve the problem, the following code adds StartDate to the IX_WorkOrder_ProductID index so that the index can cover the query.
DROP INDEX Production.WorkOrder.IX_WorkOrder_ProductID
CREATE INDEX IX_WorkOrder_ProductID
ON Production.WorkOrder (ProductID)
INCLUDE (StartDate);
The include option (added in SQL Server 2005) adds the StartDate column to the leaf level of the IX_WorkOrder_ProductID
index — but not to the list of keys of the index. This enables you to
define additional columns that cover queries without hitting the index
column or size limit. Included columns are stored at the leaf levels of
the nonclustered index, but not at the intermediate levels. The query
optimizer can now solve the queries with an index seek (as show in Figure 8):
SELECT WorkOrderID, StartDate
FROM Production.WorkOrder
WHERE ProductID = 757; -- 9 rows
SELECT WorkOrderID, StartDate
FROM Production.WorkOrder
WHERE ProductID = 945; –- 1,105 rows
As mentioned earlier, when a nonclustered index
is defined on a clustered base table, the clustered keys are stored in
the nonclustered index to be used as a pointer back to the base table.
Because of this, the nonclustered index can satisfy queries that
include the clustered key columns. The following query filters by the
nonclustered index key and returns the clustered index key value:
SELECT WorkOrderID
FROM Production.WorkOrder
WHERE ProductID = 757;
The Ix_WorkOrder_ProductID nonclustered index has the ProductID
column as the key column, and the clustered index is defined on the
WorkOrderID column which is available in the nonclustered index.
The next query is a rare example of a covering index. Compared to the previous query path, this query adds the StartDate column in the where clause. Conventional wisdom would say that this query requires an index scan because it filters by a nonkey column. (StartDate is an included column in the index and not a key column.)
SELECT WorkOrderID
FROM Production.WorkOrder
WHERE ProductID = 945
AND StartDate = ‘2006-01-04';
In this case the index seek operator uses the index (keyed by ProductID) to seek the rows matching ProductID = 945.
Then the index seek operator continues to select the correct rows by filtering the rows by the included column (AND StartDate = ‘2006-01-04'). In the index seek properties (see Figure 9), the predicate is filtering by the StartDate column.
The performance difference between the
bookmark lookup solution and the covering index is dramatic. When
comparing the query optimizer cost and the logical reads (refer to Table 1),
the query paths that use a covering index are approximately 12 times
more efficient. (The duration appears less in the figure due to my
limited hardware.)