Tip
If,
due to query response time requirements, you must have an index in
place when a query is run, consider creating the index only when you run
the query and then dropping the index for the remainder of the month.
This approach is feasible as long as the time it takes to create the
index and run the query that uses the index doesn’t exceed the time it
takes to simply run the query without the index in place.
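For example, the pattern might look something like the following sketch. The sales_history table, the order_date column, and the monthly report query shown here are hypothetical placeholders; substitute your own table, column, and query:
-- build the index just before the monthly report runs
create index idx_saleshist_orderdate on sales_history (order_date)
go
-- run the report query that needs the index
select order_date, count(*) as num_orders
from sales_history
where order_date between '20080601' and '20080630'
group by order_date
go
-- remove the index for the rest of the month
drop index idx_saleshist_orderdate on sales_history
go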
2. Evaluating Index Usefulness
SQL Server provides indexes for two primary reasons:
as a method to enforce the uniqueness of the data in the database tables
and to provide faster access to data in the tables. Creating the
appropriate indexes for a database is one of the most important aspects
of physical database design. Because it isn’t feasible to create an
unlimited number of indexes on a table, you should
create indexes on columns that have high selectivity so that your
queries will use the indexes. The selectivity of an index can be defined
as follows:
Selectivity ratio = Number of unique index values / Number of rows in table
If the selectivity ratio is high—that is, if a large
number of rows can be uniquely identified by the key—the index is highly
selective and useful to the Query Optimizer. The optimum selectivity
would be 1, meaning that there is a unique value for each row. A low
selectivity means that there are many duplicate values and the index
would be less useful. The SQL Server Query Optimizer decides whether to
use any indexes for a query based on the selectivity of the index. The
higher the selectivity, the faster and more efficiently SQL Server can
retrieve the result set.
For example, say that you are evaluating useful indexes on the authors table in the bigpubs2008
database. Assume that most of the queries access the table either by
author’s last name or by state. Because a large number of concurrent
users modify data in this table, you want to keep index maintenance
overhead to a minimum, so you can choose only one
index: one on author’s last name or one on state. Which one should you choose? Let’s
perform some analysis to see which one is a more useful, or selective,
index.
First, you need to determine the selectivity based on the author’s last name with a query on the authors table in the bigpubs2008 database:
select count(distinct au_lname) as '# unique',
count(*) as '# rows',
str(count(distinct au_lname) / cast (count(*) as real),4,2) as 'selectivity'
from authors
go
# unique # rows selectivity
----------- ----------- -----------
160 172 0.93
The selectivity ratio calculated for the au_lname column on the authors table, 0.93, indicates that the au_lname
column is highly selective and a good candidate for an index. All but 12
rows in the table contain a unique value for last name.
Now, look at the selectivity of the state column:
select count(distinct state) as '# unique',
count(*) as '# rows',
str(count(distinct state) / cast (count(*) as real),4,2) as 'selectivity'
from authors
go
# unique # rows selectivity
----------- ----------- -----------
38 172 0.22
As you can see, an index on the state column would be much less selective (0.22) than an index on the au_lname column and possibly not as useful.
One of the questions to ask at this point is whether a few values in the state
column that have a high number of duplicates are skewing the
selectivity or whether there are just a few unique values in the table.
You can determine this with a query similar to the following:
select state,
count(*) as numrows,
count(*)/b.totalrows * 100 as percentage
from authors a,
(select convert(numeric(6,2), count(*)) as totalrows from authors) as b
group by state, b.totalrows
having count(*) > 1
order by 2 desc
go
state numrows percentage
----- ----------- -------------------------------------
CA 37 21.5116200
NY 18 10.4651100
TX 15 8.7209300
OH 9 5.2325500
FL 8 4.6511600
IL 7 4.0697600
NJ 7 4.0697600
WA 6 3.4883700
PA 6 3.4883700
CO 5 2.9069700
LA 5 2.9069700
MI 5 2.9069700
MN 3 1.7441800
MO 3 1.7441800
OK 3 1.7441800
AZ 3 1.7441800
AK 2 1.1627900
IN 2 1.1627900
GA 2 1.1627900
MA 2 1.1627900
NC 2 1.1627900
NE 2 1.1627900
SD 2 1.1627900
VA 2 1.1627900
WI 2 1.1627900
WV 2 1.1627900
As you can see, most of the state values occur relatively few times, but one value, 'CA', accounts for more than 20% of the rows in the table. Therefore, state
is probably not a good candidate for an indexed column, especially if
most of the time you are searching for authors from the state of
California. SQL Server would generally find it more efficient to scan
the whole table rather than search via the index.
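If you want to confirm which access path the Query Optimizer actually chooses for a given search argument, one simple option (not part of the original example) is to display the estimated execution plan with SET SHOWPLAN_TEXT. SQL Server returns the plan without executing the query, and the output shows either a scan or an index seek, depending on the indexes that exist on the authors table:
set showplan_text on
go
select * from authors where state = 'CA'
go
set showplan_text off
go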
As a general rule of thumb, if the selectivity ratio
for a nonclustered index key is less than 0.85 (in other words, if the
Query Optimizer cannot discard at least 85% of the rows based on the key
value), the Query Optimizer generally chooses a table scan to process
the query rather than a nonclustered index. In such cases, performing a
table scan to find all the qualifying rows is more efficient than
seeking through the B-tree to locate a large number of data rows.
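Applying this rule of thumb to the earlier examples, the au_lname column (selectivity 0.93) easily clears the threshold, whereas the state column (0.22) falls far short of it, so the Query Optimizer would be unlikely to choose a nonclustered index on state over a table scan.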
Note
You can relate the concept of selectivity to a hypothetical example. Suppose
you had to find every reference to the word SQL in this book. Would it be easier to do it by using the index and going
back and forth from the index to all the pages that contain the word, or
would it be easier just to scan each page from beginning to end to
locate every occurrence? Because SQL appears on nearly every page, the
index wouldn’t save you much work. What if you had to find all references to the
word squonk, if any? Squonk
would definitely be easier to find via the index (actually, the index
would help you determine that it doesn’t even exist). Therefore, the
selectivity for squonk would be high, and the selectivity for SQL would be much lower.
How
does SQL Server determine whether an index is selective and which
index, if it has more than one to choose from, would be the most
efficient to use? For example, how would SQL Server know how many rows
the following query might return?
select * from table
where key between 1000000 and 2000000
If the table contains 10,000,000 rows with values
ranging between 0 and 20,000,000, how does the Query Optimizer know
whether to use an index or a table scan? There could be 10 rows in the
range, or 900,000. How does SQL Server estimate how many rows are
between 1,000,000 and 2,000,000? The Query Optimizer gets this
information from the index statistics, as described in the next section.
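As a quick preview, you can examine the statistics the Query Optimizer relies on with the DBCC SHOW_STATISTICS command. The index name used here, aunmind, is an assumption based on the last-name index found in the pubs-style sample databases; substitute whatever index or statistics name exists on your copy of the authors table:
dbcc show_statistics ('authors', 'aunmind')
go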