Tip
If,
due to query response time requirements, you must have an index in
place when a query is run, consider creating the index only when you run
the query and then dropping the index for the remainder of the month.
This approach is feasible as long as the time it takes to create the
index and run the query that uses the index doesn’t exceed the time it
takes to simply run the query without the index in place.
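For example, the pattern might look something like the following sketch. The sales_history table, the order_date column, and the monthly report query shown here are hypothetical placeholders; substitute your own table, column, and query:
-- build the index just before the monthly report runs
create index idx_saleshist_orderdate on sales_history (order_date)
go
-- run the report query that needs the index
select order_date, count(*) as num_orders
from sales_history
where order_date between '20080601' and '20080630'
group by order_date
go
-- remove the index for the rest of the month
drop index idx_saleshist_orderdate on sales_history
go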
2. Evaluating Index Usefulness
SQL Server provides indexes for two primary reasons:
as a method to enforce the uniqueness of the data in the database tables
and to provide faster access to data in the tables. Creating the
appropriate indexes for a database is one of the most important aspects
of physical database design. Because it isn’t feasible to create an
unlimited number of indexes on a table, you should
create indexes on columns that have high selectivity so that your
queries will use the indexes. The selectivity of an index can be defined
as follows:
Selectivity ratio = Number of unique index values / Number of rows in table
If the selectivity ratio is high—that is, if a large
number of rows can be uniquely identified by the key—the index is highly
selective and useful to the Query Optimizer. The optimum selectivity
would be 1, meaning that there is a unique value for each row. A low
selectivity means that there are many duplicate values and the index
would be less useful. The SQL Server Query Optimizer decides whether to
use any indexes for a query based on the selectivity of the index. The
higher the selectivity, the faster and more efficiently SQL Server can
retrieve the result set.
For example, say that you are evaluating useful indexes on the authors table in the bigpubs2008
database. Assume that most of the queries access the table either by
author’s last name or by state. Because a large number of concurrent
users modify data in this table, you want to keep index maintenance
overhead to a minimum, so you can choose only one
index: one on author’s last name or one on state. Which one should you choose? Let’s
perform some analysis to see which one is a more useful, or selective,
index.
First, you need to determine the selectivity based on the author’s last name with a query on the authors table in the bigpubs2008 database:
select count(distinct au_lname) as '# unique',
count(*) as '# rows',
str(count(distinct au_lname) / cast (count(*) as real),4,2) as 'selectivity'
from authors
go
# unique # rows selectivity
----------- ----------- -----------
160 172 0.93
The selectivity ratio calculated for the au_lname column on the authors table, 0.93, indicates that the au_lname
column is highly selective and a good candidate for an index. All but 12
rows in the table contain a unique value for last name.
Now, look at the selectivity of the state column:
select count(distinct state) as '# unique',
count(*) as '# rows',
str(count(distinct state) / cast (count(*) as real),4,2) as 'selectivity'
from authors
go
# unique # rows selectivity
----------- ----------- -----------
38 172 0.22
As you can see, an index on the state column would be much less selective (0.22) than an index on the au_lname column and possibly not as useful.
One of the questions to ask at this point is whether a few values in the state
column that have a high number of duplicates are skewing the
selectivity or whether there are just a few unique values in the table.
You can determine this with a query similar to the following:
select state,
count(*) as numrows,
count(*)/b.totalrows * 100 as percentage
from authors a,
(select convert(numeric(6,2), count(*)) as totalrows from authors) as b
group by state, b.totalrows
having count(*) > 1
order by 2 desc
go
state numrows percentage
----- ----------- -------------------------------------
CA 37 21.5116200
NY 18 10.4651100
TX 15 8.7209300
OH 9 5.2325500
FL 8 4.6511600
IL 7 4.0697600
NJ 7 4.0697600
WA 6 3.4883700
PA 6 3.4883700
CO 5 2.9069700
LA 5 2.9069700
MI 5 2.9069700
MN 3 1.7441800
MO 3 1.7441800
OK 3 1.7441800
AZ 3 1.7441800
AK 2 1.1627900
IN 2 1.1627900
GA 2 1.1627900
MA 2 1.1627900
NC 2 1.1627900
NE 2 1.1627900
SD 2 1.1627900
VA 2 1.1627900
WI 2 1.1627900
WV 2 1.1627900
As you can see, most of the state values occur relatively few times, but one value, 'CA', accounts for more than 20% of the rows in the table. Therefore, state
is probably not a good candidate for an indexed column, especially if
most of the time you are searching for authors from the state of
California. SQL Server would generally find it more efficient to scan
the whole table rather than search via the index.
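If you want to confirm which access path the Query Optimizer actually chooses for a given search argument, one simple option (not part of the original example) is to display the estimated execution plan with SET SHOWPLAN_TEXT. SQL Server returns the plan without executing the query, and the output shows either a scan or an index seek, depending on the indexes that exist on the authors table:
set showplan_text on
go
select * from authors where state = 'CA'
go
set showplan_text off
go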
As a general rule of thumb, if the selectivity ratio
for a nonclustered index key is less than 0.85 (in other words, if the
Query Optimizer cannot discard at least 85% of the rows based on the key
value), the Query Optimizer generally chooses a table scan to process
the query rather than a nonclustered index. In such cases, performing a
table scan to find all the qualifying rows is more efficient than
seeking through the B-tree to locate a large number of data rows.
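Applying this rule of thumb to the earlier examples, the au_lname column (selectivity 0.93) easily clears the threshold, whereas the state column (0.22) falls far short of it, so the Query Optimizer would be unlikely to choose a nonclustered index on state over a table scan.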
Note
You can relate the concept of selectivity to a hypothetical example. Suppose
you had to find every reference to the word SQL in this book. Would it be easier to do it by using the index and going
back and forth from the index to all the pages that contain the word, or
would it be easier just to scan each page from beginning to end to
locate every occurrence? Because SQL appears on nearly every page, the
index wouldn’t save you much work. What if you had to find all references to the
word squonk, if any? Squonk
would definitely be easier to find via the index (actually, the index
would help you determine that it doesn’t even exist). Therefore, the
selectivity for squonk would be high, and the selectivity for SQL would be much lower.
How
does SQL Server determine whether an index is selective and which
index, if it has more than one to choose from, would be the most
efficient to use? For example, how would SQL Server know how many rows
the following query might return?
select * from table
where key between 1000000 and 2000000
If the table contains 10,000,000 rows with values
ranging between 0 and 20,000,000, how does the Query Optimizer know
whether to use an index or a table scan? There could be 10 rows in the
range, or 900,000. How does SQL Server estimate how many rows are
between 1,000,000 and 2,000,000? The Query Optimizer gets this
information from the index statistics, as described in the next section.
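As a quick preview, you can examine the statistics the Query Optimizer relies on with the DBCC SHOW_STATISTICS command. The index name used here, aunmind, is an assumption based on the last-name index found in the pubs-style sample databases; substitute whatever index or statistics name exists on your copy of the authors table:
dbcc show_statistics ('authors', 'aunmind')
go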