SQL Server indexes are mostly transparent to end
users and T-SQL developers. Indexes are typically not specified in
queries unless you use table hints to force the Query Optimizer to use a
particular index. Normally, based on the index key histogram or density values, the SQL
Server cost-based Query Optimizer automatically chooses the index that
is least expensive from an I/O standpoint.
With this in mind, the following
are some of the main guidelines for creating useful indexes
that the Query Optimizer can use effectively:
- For composite
indexes, try to keep the more selective columns leftmost in the index.
The first element in the index should be the most unique (if possible),
and index column order in general should be from most to least unique.
However, remember that selectivity doesn’t help if the first ordered
index column is not specified in your SARGs or join clauses. To ensure
that the index is used for the largest number of queries, be sure the
first ordered column is the column used most often in your queries.
- Be
sure to index columns used in joins. Joins are processed inefficiently
if no index exists on the columns specified in the join condition. Remember that a PRIMARY KEY constraint automatically creates an index on the key column(s), but a FOREIGN KEY
constraint does not. You should create indexes on your foreign key
columns if your queries commonly join between the primary key and
foreign key tables. (A brief sketch illustrating this and the preceding
guideline follows this list.)
- Tailor your indexes for your most
critical queries and transactions. You cannot index for every possible
query that might be run against your tables. However, your applications
will perform better if you can identify your critical and most
frequently executed queries and design indexes to support them. SQL
Server Profiler is a useful tool for identifying the most frequently executed queries.
SQL Server Profiler can also help identify slow-running queries that
might benefit from improved index design.
- Avoid indexes
on columns that have poor selectivity. The Query Optimizer is not likely
to use the indexes, so they would simply take up space and add
unnecessary overhead during inserts, updates, and deletes. One possible
exception occurs when the index can be used to cover a query.
- Choose
your clustered and nonclustered indexes carefully. The next two
sections discuss tips and guidelines for choosing between clustered or
nonclustered indexes, based on the data contained in the columns and the
types of queries executed against the columns.
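As a brief illustration of the first two guidelines, the following sketch
uses a hypothetical orders table (all table, column, and index names here
are illustrative only, not part of any sample database):
-- order_number is assumed to be more selective than order_status,
-- so it is placed leftmost in the composite index
CREATE INDEX NC_orders_number_status ON orders (order_number, order_status)
-- The FOREIGN KEY constraint on orders.customer_id does not create an
-- index, so create one explicitly to support joins to the customers table
CREATE INDEX NC_orders_customer_id ON orders (customer_id)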
Clustered Index Indications
Searching
for rows via a clustered index is almost always faster than searching
for rows via a nonclustered index—for two reasons. One reason is that a
clustered index contains only pointers to pages rather than pointers to
individual data rows; therefore, a clustered index is more compact than a
nonclustered index. Because a clustered index is smaller and doesn’t
require an additional lookup via the row locator to find the matching
rows, the rows can be found with fewer page reads than with a similarly
defined nonclustered index. The second reason is that because the data
in a table with a clustered index is physically sorted on the clustered
key, searching for duplicate values or for a range of clustered key
values is faster; the rows are adjacent to each other, and SQL Server
can simply locate the first qualifying row and then search the rows in
sequence until the last qualifying row is found. However, because you
are allowed to create only one clustered index per table, you must
judiciously choose the column or columns on which to define the
clustered index.
If you require only a single index on a table, it’s
typically advantageous to make it a clustered index; the resulting
overhead of maintaining clustered indexes during updates, inserts, and
deletes can be considerably less than the overhead incurred by
nonclustered indexes.
By default, the primary key on a table is defined as a
clustered unique index. In most applications, the primary key column on
a table is almost always retrieved in single-row lookups. For
single-row lookups, a nonclustered index usually costs you only a few
more I/Os than a similar clustered index. Are you or the users really
going to notice a difference between three page reads to retrieve a
single data row versus four- to six-page reads to retrieve a single data
row? Not at all. However, if you have to perform a range retrieval,
such as a lookup on last name, will you notice a difference between
scanning 10% of the table versus having to find the rows using a full
table scan? Most definitely. With this in mind, you might want to
consider creating your primary key as a unique nonclustered index and
choosing another candidate for your clustered index.
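For example, if a last name column is a better clustered index candidate
than the primary key, the table might be indexed along these lines (a
minimal sketch; the customers table and all constraint and index names
are hypothetical):
-- Define the primary key as a nonclustered unique index...
ALTER TABLE customers
    ADD CONSTRAINT PK_customers PRIMARY KEY NONCLUSTERED (customer_id)
-- ...and reserve the single allowed clustered index for the column used
-- in duplicate and range searches
CREATE CLUSTERED INDEX CI_customers_lastname ON customers (last_name)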
Following are guidelines to consider for other potential candidates for clustered indexes:
- Columns with a number of duplicate values searched frequently (for example, WHERE last_name = 'Smith')—Because
the data is physically sorted, all the duplicate values are kept
together. Any query that tries to fetch records against such keys finds
all the values, using a minimum of I/O. SQL Server locates the first row
that matches the SARG and then scans the data rows in order until it
finds the last row matching the SARG.
- Columns often specified in the ORDER BY clause—Because the data is already sorted, SQL Server can avoid having to re-sort the data if the ORDER BY
is on the clustered index key and the data is retrieved in clustered
key order. Remember that even for a table scan, the data is retrieved in
clustered key order because the data in the table is in clustered key
order. The only exception is if a parallel query operation is used to
retrieve the data rows; in that case, the results need to be re-sorted
when the result sets from each parallel thread are merged.
- Columns often searched for within a range of values (for example, WHERE price between $10 and $20)—A
clustered index can be used to locate the first qualifying row in the
range of values. Because the rows in the table are in sorted order, SQL
Server can simply scan the data pages in order until it finds the last
qualifying row within the range. When the result set within the range of
values is large, a clustered index scan is significantly more efficient
in terms of total logical I/O performed than repeated row locator
lookups via a nonclustered index.
- Columns, other than the primary key, frequently used in join clauses—Clustered
indexes tend to be smaller than nonclustered indexes; the amount of
page I/O required per lookup is generally less for a clustered index
than for a nonclustered index. This can make a significant difference when
joining many rows. An extra page read or two might not seem like much
for a single-row retrieval, but add those additional page reads to
100,000 join iterations, and you're looking at a total of 100,000 to
200,000 additional page reads. (A short example illustrating some of
these candidates follows this list.)
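The following sketch ties several of these candidates together, again
using the hypothetical orders table (names are illustrative only). With
the clustered index on order_date, the range query can locate the first
qualifying row and scan forward through adjacent rows, and the ORDER BY
requires no additional sort:
CREATE CLUSTERED INDEX CI_orders_orderdate ON orders (order_date)
select order_id, customer_id, order_date
from orders
where order_date between '20080101' and '20080331'
order by order_date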
When you consider columns for a clustered index, you
might want to try to keep your clustered indexes on relatively static
columns to minimize the re-sorting of data rows when an indexed column
is updated. Any time a clustered index key value changes, the entire
data row has to be moved to keep the clustered data values in physical
sort order. In addition, all nonclustered indexes using the clustered
key as the row locator to that row also need to be updated.
You should also avoid creating clustered indexes on
wide keys that are made up of several columns, especially several
large-size columns. The reason is that the clustered key values are
incorporated in all nonclustered indexes as the row locator. Because the
nonclustered index entries contain the clustering key in addition to
the key columns defined for that nonclustered index, the nonclustered
indexes end up being significantly larger and less efficient in terms of
I/O.
Because you can physically sort the data in a table
in only one way, you can have only one clustered index per table. Any
other columns you want to index have to be defined with nonclustered
indexes.
Nonclustered Index Indications
SQL Server allows you to create a maximum of 999
nonclustered indexes on a table. Until tables become extremely large,
the actual space taken by a nonclustered index is a minor expense
compared to the improved access performance it provides. You need to keep in mind,
however, that as you add more indexes to the system, database
modification statements get slower due to the index maintenance
overhead.
Also, when defining nonclustered indexes, you
typically want to define indexes on columns that are more selective
(that is, columns with low density values) so that they can be used
effectively by the Query Optimizer. A high number of duplicate values in
a nonclustered index can often make it more expensive (in terms of I/O)
to process the query using the nonclustered index than a table scan.
Let’s look at a hypothetical example:
select title from titles
where price between $5.00 and $10.00
Assume that you have 1 million rows within the range;
those 1 million rows could be randomly scattered throughout the table.
Although the index leaf level has all the index rows in sorted order,
reading all data rows one at a time would require a separate lookup via
the row locator for each row in the worst-case scenario.
Thus, the worst-case I/O estimate for range retrievals using a nonclustered index is as follows:
Number of levels in the nonclustered index
+ Number of index pages scanned to find all matching rows
+ (Number of matching rows × Number of pages per lookup via the row locator)
If you have no clustered index on the table, the row
locator is simply a page and row pointer and requires one data page read
to find the matching data row. If 1 million rows are in the range, the
worst-case cost estimate to search via the nonclustered index with no
clustered index on the table would be as follows:
Number of index page reads to find all the row locators
+ (1 million matching rows × 1 data page read)
= 1 million+ I/Os
If you have a clustered index on the table, the row
locator is a clustered index key for the data row. Using the row locator
to find the matching row requires searching the clustered index tree to
locate the data row. Assuming that the clustered index has two nonleaf
levels, it would cost three page reads to find each qualifying row on a data
page. If the range has 1 million rows, the worst-case cost estimate to
search via the nonclustered index with a clustered index on the table
would be as follows:
Number of index page reads to find all the row locators
+ (1 million matching rows × 3 pages per lookup via the row locator)
= 3 million+ I/Os
Contrast each of these scenarios with the cost of a
table scan. If the entire table takes up 50,000 pages, a full table scan
would cost only 50,000 in terms of I/O. Therefore, in this example,
doing a table scan would actually be more efficient than using the
nonclustered index.
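If you want to verify this type of comparison in your own environment, one
approach (a sketch only; the index name NC_titles_price is hypothetical) is
to compare the logical reads reported by SET STATISTICS IO for the plan the
Query Optimizer chooses against a plan forced to use the nonclustered index:
SET STATISTICS IO ON
-- Let the Query Optimizer choose the access path
select title from titles
where price between $5.00 and $10.00
-- Force the nonclustered index to observe the cost of the row locator lookups
select title from titles with (index (NC_titles_price))
where price between $5.00 and $10.00
SET STATISTICS IO OFF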
The following guidelines help you identify potential candidates for nonclustered indexes for your environment:
- Columns referenced in SARGs or join clauses that have a relatively high selectivity (the density value is low).
- Columns referenced in both the WHERE clause and the ORDER BY
clause. When the data rows are retrieved using a nonclustered index,
they are retrieved in nonclustered index key order. If the result set is
to be ordered by the nonclustered index key(s) as well, SQL Server can
avoid having to re-sort the result set, resulting in a more efficient
query. In the following sample query, SQL Server can avoid the extra
step of sorting the result set if a nonclustered index is on state and the index is used to retrieve the matching rows:
select * from authors
where state like 'C%'
order by state
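The nonclustered index assumed by the preceding query might be defined as
follows (the index name is hypothetical):
CREATE NONCLUSTERED INDEX NC_authors_state ON authors (state)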
In general, nonclustered indexes are useful for
single-row lookups, joins, queries on columns that are highly selective,
or queries with small range retrievals. Also, when considering your
nonclustered index design, you should not overlook the benefits of index
covering, as described in the following section.
Index Covering
Index covering is a situation in which all the information required by the query in the SELECT and WHERE
clauses can be found entirely within the nonclustered index itself.
Because the nonclustered index contains a leaf row corresponding to
every data row in the table, SQL Server can satisfy the query from the
leaf rows of the nonclustered index. This results in faster retrieval of
data because all the information can come directly from the index page,
and SQL Server avoids lookups of the data pages.
Because the leaf pages in a nonclustered index are
linked together, the leaf level of the index can be scanned just like
the data pages in a table. Because the leaf index rows are typically
much smaller than the data rows, a nonclustered index that covers a
query will be faster than a clustered index on the same columns because
fewer pages would need to be read.
In the following example, a nonclustered index on the au_lname and au_fname columns of the authors table would cover the query because the result columns and the SARGs can all be derived from the index itself:
select au_lname, au_fname
from authors
where au_lname like 'M%'
go
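The covering index assumed by this example might be defined as follows (the
index name is hypothetical); because au_lname and au_fname are both index key
columns, the query can be satisfied entirely from the index leaf level:
CREATE NONCLUSTERED INDEX NC_authors_lname_fname ON authors (au_lname, au_fname)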
Many other queries that use an aggregate function (such as MIN, MAX, AVG, SUM, and COUNT)
or simply check for existence of criteria also benefit from index
covering. The following aggregate query samples can take advantage of
index covering:
select count(au_lname) from authors where au_lname like 'M%'
select count(*) from authors where au_lname like 'M%'
select count(*) from authors
You might wonder how the last query, which doesn’t
even specify a SARG, can use an index. SQL Server knows that by its
nature, a nonclustered index contains a row for every data row in the
table; it can simply count all the rows in any of the nonclustered
indexes instead of scanning the whole table. For the last query, SQL
Server chooses the smallest nonclustered index—that is, the one with the
smallest number of leaf pages.
Index covering can sometimes occur when you are not
expecting it. When you have a
clustered index defined on a table, the clustered key is carried into
all the nonclustered indexes to be used as the row locator to locate the
actual data row. Having the additional clustered key column values in
the nonclustered index provides more data values that can be used in
index covering.
For example, assume that the authors table has a clustered index on au_lname and au_fname and a nonclustered primary key defined on au_id. Each row in the nonclustered index on au_id would contain the clustered key values for au_lname and au_fname for its corresponding data row. Because of this, the following query would actually be covered by the nonclustered index on au_id:
select au_lname, au_fname
from authors
where au_id like '123%'
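The index arrangement assumed by this example might be set up as follows (a
sketch only, assuming the authors table does not already have a clustered
index or primary key defined; the constraint and index names are hypothetical):
-- Clustered index on the name columns...
CREATE CLUSTERED INDEX CI_authors_name ON authors (au_lname, au_fname)
-- ...and a nonclustered primary key on au_id. Each row of this nonclustered
-- index carries au_lname and au_fname as the row locator, which is what
-- allows it to cover the query above
ALTER TABLE authors
    ADD CONSTRAINT PK_authors PRIMARY KEY NONCLUSTERED (au_id)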
Explicitly adding additional columns to nonclustered
indexes to promote the occurrence of index covering has historically
been a common method of improving query response time. Consider the
following query:
select royalty from titles
where price between $10 and $20
If you create an index on only the price column, SQL Server can find the rows in the index where price is between $10 and $20, but it has to access the data rows to retrieve royalty. With 100 rows in the range, the worst-case I/O cost to retrieve the data rows would be as follows:
Number of index levels
+ Number of index pages to find the 100 matching rows
+ (100 × Number of pages per lookup via the row locator)
If the royalty column were added to the index on the price
column, SQL Server could scan the index to retrieve the results instead
of having to perform the lookups via the row locator against the table,
resulting in faster query response. The I/O cost using index covering
would be lower, as follows:
Number of index levels
+ Number of index pages to scan to find the 100 matching rows
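Historically, this padding was accomplished simply by adding royalty as an
additional index key column (a sketch; the index name is hypothetical); the
included columns feature discussed shortly offers a better alternative:
CREATE INDEX NC_titles_price_royalty ON titles (price, royalty)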
If
you are considering padding your indexes to take advantage of index
covering, beware of making an index too wide. As index row width
approaches data row width, the benefits of covering are lost as the
number of pages in the leaf level increases. As the number of leaf-level
index pages approaches the number of pages in the table, the number of
index levels also increases, increasing the I/O cost of using the index
to locate data.
You should also avoid adding to the index columns
that are frequently updated. Remember that any changes to the columns in
the data rows cascade into the indexes as well. This increases the
index maintenance overhead, which can adversely affect update
performance.
As an alternative to adding columns to the
nonclustered index key to encourage index covering, you might want to
consider taking advantage of the included columns feature in SQL Server
2008.
Included Columns
A feature available for nonclustered indexes in SQL
Server 2008 is included columns. Included columns allow you to add
nonkey columns to the leaf level of a nonclustered index for the purpose
of index covering.
One advantage of included columns is that because the
nonkey columns are stored only in the leaf level of the index, the
nonleaf rows of the index are smaller, which helps reduce the overall
size of the index, thereby helping reduce the I/O cost of using the
index. Another advantage is that this feature allows you to exceed the
SQL Server maximum limits of 16 index key columns and 900-byte index key
size. The included nonkey columns are not factored in when calculating
the number of index key columns or index key size. All data types are
allowed as included columns except for the text, ntext, and image data types. To add included columns to an index, specify the INCLUDE clause in the CREATE INDEX statement:
CREATE INDEX NC_titles_price on titles (price) INCLUDE (royalty)
An additional advantage of included columns is that
you can add columns to a unique index for index covering purposes
without affecting the uniqueness of the actual index key(s) and without
having to create a second index on the unique key column(s) and the
additional covering columns. For example, consider that you have a large
number of queries that search titles by title_id to retrieve the price value. Creating a covering index on title_id and price could improve performance of these queries. However, creating a unique index on title_id and price would not enforce uniqueness on title_id alone (it would allow the insertion of multiple rows with the same title_id as long as they had different prices). Without using included columns, you would have to create a unique index on title_id and an additional nonunique index on title_id and price to enforce uniqueness on title_id and also have a covering index on title_id and price. However, with the included column feature, you can create just a single unique index on title_id with price as an included column:
CREATE UNIQUE INDEX UQ_titleid_price on titles (title_id) INCLUDE (price)
Tip
If
you have existing nonclustered indexes with a large index key size, you
might want to consider redesigning them so that only columns used for
searching and lookups are key columns. You should make all other columns
that were added for index covering into included columns. This way, you
still have all columns needed to cover your queries, but the index key
itself is smaller and more efficient.
You still should be careful to avoid adding
unnecessary columns as included columns of an index. Adding too many
index columns, key or nonkey, can adversely affect performance for the
following reasons:
- Fewer index leaf rows fit on a page, which
can increase I/O costs to search the leaf level of the index and also
reduce data cache efficiency.
- Because of the increased leaf row size, more disk space is required to store the index, especially if you are adding varchar(max), nvarchar(max), varbinary(max), or xml
data types as nonkey index columns. Because the column values are also
copied into the index leaf level, you are essentially storing the data
values twice.
- Changes to the included
columns in the data rows cascade into the leaf rows of the index as
well. This increases the index maintenance overhead, which can adversely
affect performance of data modifications.
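If you want to gauge how wide the leaf level of an existing index has become
after adding included columns, one possible approach (a sketch only; the index
name is hypothetical) is to examine its page count and average row size via
sys.dm_db_index_physical_stats:
select i.name, ps.index_level, ps.page_count, ps.avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('titles'), NULL, NULL, 'DETAILED') as ps
join sys.indexes as i
    on i.object_id = ps.object_id
    and i.index_id = ps.index_id
where i.name = 'NC_titles_price'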
Wide Indexes Versus Multiple Indexes
As an index key gets wider, the selectivity of the
key generally becomes higher as well. It might seem that creating wide
indexes would result in better performance. This is not necessarily
true. The reason is that the wider the key, the fewer rows SQL Server
stores on the index pages, requiring more pages at each level; this
results in a higher number of levels in the index B-tree. To get to
specific rows, SQL Server must perform more I/O.
To get better performance from queries, instead of
creating a few wide indexes, you should consider creating multiple
narrower indexes. The advantage here is that with smaller keys, the
Query Optimizer can quickly scan through multiple indexes to determine
the most efficient access plan. SQL Server has the option of performing
multiple index lookups within a single query and merging the result sets
together to generate an intersection of the indexes. Also, with more
indexes, the Query Optimizer can choose from a wider variety of query
plan alternatives.
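For example, instead of a single wide composite index on two columns, you
might create two narrower single-column indexes (a sketch against the titles
table; the index names are hypothetical) and let the Query Optimizer intersect
them when both predicates are present:
CREATE INDEX NC_titles_type ON titles (type)
CREATE INDEX NC_titles_pubid ON titles (pub_id)
-- The Query Optimizer can potentially seek each index separately and
-- intersect the resulting row locators for a query such as this:
select title_id, title
from titles
where type = 'business'
  and pub_id = '0736'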
If you are considering creating a wide key, you
should individually check the distribution of values for each member of
the composite key. If the selectivity on the individual columns is high,
you might want to break up the index into multiple indexes. If the
selectivity of individual columns is low but is high for combined
columns, it makes sense to have wider keys on the table. To get to the
right combination, you can populate your table with real-world data,
experiment with creating multiple indexes, and check the distribution
of values for each column. Based on the histogram steps and index
density, you can make the decisions for an index design that works best
for your environment.
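One way to examine the individual and combined column densities and the
histogram when making these decisions (a sketch; the index name is
hypothetical) is DBCC SHOW_STATISTICS:
-- Displays the statistics header, density vector, and histogram for the index
DBCC SHOW_STATISTICS ('titles', NC_titles_type)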