2.5 Unique Indexes and Constraints
Primary keys and unique constraints
are the methods you use to uniquely identify a row. Indexes and primary
keys are intertwined, and a primary key must always be indexed. By
default, creating a primary key automatically creates a unique
clustered index, but it can optionally create a unique nonclustered
index instead.
A unique index, as its name suggests, limits data
to being unique; in other words, a unique index constrains the data it
indexes. A unique constraint builds a unique index to quickly check the
data, so a unique constraint and a unique index are effectively the same
thing: creating either one builds a unique constraint/index. The only
difference between a unique constraint/index and a primary key is that
a primary key cannot allow nulls, whereas a unique constraint/index can
permit a single null value.
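To make this concrete, the following is a minimal sketch (the table and constraint names are hypothetical) showing how each option is declared in T-SQL:
-- PK_ConstraintDemo defaults to a unique clustered index;
-- UQ_ConstraintDemo_Email builds a unique nonclustered index and permits one NULL.
CREATE TABLE dbo.ConstraintDemo (
    CustomerID int NOT NULL
        CONSTRAINT PK_ConstraintDemo PRIMARY KEY,
    Email varchar(255) NULL
        CONSTRAINT UQ_ConstraintDemo_Email UNIQUE
);
-- Alternatively, back the primary key with a nonclustered index instead:
-- CONSTRAINT PK_ConstraintDemo PRIMARY KEY NONCLUSTERED (CustomerID)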
2.6 The Page Split Problem
Every index must maintain its key
column data in the correct sort order, and inserts, updates, and deletes
affect that data. As data is inserted or modified, if the index
page to which a value must be added is full, SQL Server must split
the page into two half-full pages so that it can insert the value in
the correct position. Again using the telephone book example, if
several new Chapmans moved into the area and page 515 (the Cha page)
now had to accommodate 20 additions, a simulated page split would take several
steps:
1. Cut page 515 in half, making two pages; call them 515a and 515b.
2. Print out the new Chapman listings and tape them to page 515a.
3. Tape page 515b inside the back cover of the telephone book.
4. Make a note
on page 515a that the Cha listing continues on page 515b located at the
end of the book, and a note on page 515b that the listing continues on
page 515a.
Page splits may cause several performance-related problems:
- The page split operation is expensive because it involves several steps and moving data.
- If there still isn't enough room after the page split, the page
is split again; this can repeat until enough space exists for the
inserted values.
- The data structure is left fragmented and can no longer be read in a single contiguous pass.
- Page splits are also logged operations and can have a significant impact on the transaction log.
After the split, each of the two pages has more empty space.
This means less data is read with every page read, less data is
stored in the buffer pool per page, and additional disk space is
required to store the same amount of data.
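One way to observe the fragmentation that page splits leave behind is the sys.dm_db_index_physical_stats dynamic management function; this is a minimal sketch against the AdventureWorks2012 sample database used later in this chapter:
USE AdventureWorks2012;
-- DETAILED mode reports both fragmentation and page fullness
-- for every level of every index on the table.
SELECT index_id, index_level,
    avg_fragmentation_in_percent,
    avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
    DB_ID('AdventureWorks2012'),
    OBJECT_ID('Person.Person'),
    NULL, NULL, 'DETAILED');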
2.7 Index Selectivity
Another aspect of an indexing strategy
is determining the selectivity of the index. A selective index has
many distinct index values relative to the number of rows. A primary
key or unique index has the highest possible selectivity, because
every value in the constraint is defined as unique.
An index with only a few distinct values spread
across a large table is less selective. Indexes that are less selective
may not be useful for searching. A column with three values spread
throughout the table is potentially a poor candidate for an index.
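As a rough check, you can estimate a column's selectivity directly; this is a minimal sketch against the AdventureWorks2012 sample database:
USE AdventureWorks2012;
-- Distinct values divided by total rows: a result near 1.0 is highly
-- selective, while a value near 0 (such as a three-value status
-- column) suggests a poor index candidate.
SELECT COUNT(DISTINCT LastName) * 1.0 / COUNT(*) AS Selectivity
FROM Person.Person;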
SQL Server uses its internal index statistics to track the selectivity of an index. DBCC SHOW_STATISTICS
reports the last date on which the statistics were updated and the
basic information about the index statistics, including the potential
usefulness of the statistic (see Figure 3).
A low density indicates that the index is highly selective, whereas a
high density indicates low selectivity; the terms are inverses of
each other. A high-density index may be less useful, as shown in this
code sample:
USE AdventureWorks2012;
DBCC SHOW_STATISTICS ('Person.Person',
    IX_Person_LastName_FirstName_MiddleName);
Changing the order of the key columns may improve
the selectivity of an index and improve its performance for certain
queries. Be careful however, because other queries may depend on the
order for their performance.
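As a sketch of that trade-off (the index names here are hypothetical), the leading key column determines which predicates an index can seek on:
USE AdventureWorks2012;
-- Seeks efficiently on WHERE LastName = 'Chapman',
-- optionally narrowed further by FirstName:
CREATE NONCLUSTERED INDEX IX_Demo_Last_First
    ON Person.Person (LastName, FirstName);
-- Seeks efficiently on WHERE FirstName = 'John',
-- but not on LastName alone:
CREATE NONCLUSTERED INDEX IX_Demo_First_Last
    ON Person.Person (FirstName, LastName);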
Unordered Heaps
You can create a table without a clustered index, in which case the data is stored in an unordered heap.
Instead of being stored in sorted order as defined by the clustered
index key columns, the rows are identified internally using the heap's
row identifier. The row identifier is an actual physical location
composed of three values, FileID:PageNum:SlotNum, and cannot be
directly queried. As mentioned earlier, all nonclustered indexes
contain the clustered key if the base table is clustered. The clustered
keys are used in the nonclustered index to navigate back to the
clustered index; they're basically used as a pointer in the
nonclustered index, so the base table can be used if additional columns
are needed for a given query. If the base table is not clustered, then
every nonclustered index instead stores the heap's row identifier,
which is used to point back to the full row in the base table.
In a sense, you can think of SQL Server as having
two different types of tables: a clustered (index) table and a heap
table, which are mutually exclusive. A table can never be a heap and a
clustered table at the same time.
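A minimal sketch (the table names are hypothetical) of the two mutually exclusive forms:
-- A heap: no clustered index is declared.
CREATE TABLE dbo.HeapDemo (ID int NOT NULL, Name varchar(50));
-- A clustered table: the primary key creates the clustered index.
CREATE TABLE dbo.ClusteredDemo (
    ID int NOT NULL PRIMARY KEY CLUSTERED,
    Name varchar(50)
);
-- Adding a clustered index later converts the heap into a clustered table:
CREATE CLUSTERED INDEX CIX_HeapDemo ON dbo.HeapDemo (ID);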
Query Operators
Although there are dozens of logical
and physical query execution operations, SQL Server uses three primary
operators to access data. These are also known as access methods.
- Table Scan: Reads the entire heap and, most likely, passes all the data to a secondary filter operation.
- Index Scan: Reads the entire leaf level (every row) of the
clustered index or nonclustered index. The index scan operation might
filter the rows and return only those rows that meet the criteria, or
it might pass all the rows to another filter operation depending on the
complexity of the criteria. The data may or may not be ordered.
- Index Seek: Locates specific row(s) of data using the B-tree and returns only the selected rows in an ordered list (see Figure 4).
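To see which access method the optimizer picks, compare the actual execution plans of queries like this minimal sketch; plan choice depends on your data and statistics, so the operators noted in the comments are only the likely outcomes:
USE AdventureWorks2012;
-- Likely an index seek: the predicate matches the leading key column
-- of IX_Person_LastName_FirstName_MiddleName.
SELECT LastName, FirstName
FROM Person.Person
WHERE LastName = 'Chapman';
-- Likely an index scan: the leading wildcard prevents a seek.
SELECT LastName, FirstName
FROM Person.Person
WHERE LastName LIKE '%man';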
The query optimizer chooses the access
method with the least overall cost. Sequentially reading data is an
efficient operation, so an index scan plus a filter operation may
actually be cheaper than an index seek with a bookmark lookup (see
Query Path 5 in the next section) that involves hundreds or thousands
of random I/O operations. SQL Server relies heavily on statistics to
determine the number of
rows touched and returned by each operation in the query execution
plan. If statistics are accurate, SQL Server has a great opportunity to
choose the appropriate access method to most efficiently return the
requested data. On the other hand, if statistics are skewed or
out-of-date, the likelihood that SQL Server chooses the correct access
method decreases significantly. I've seen hundreds of performance
issues over the years caused by skewed statistics.
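When skewed or stale statistics are the suspect, refreshing them is straightforward; a minimal sketch:
USE AdventureWorks2012;
-- Rebuild statistics for one table from every row rather than a sample:
UPDATE STATISTICS Person.Person WITH FULLSCAN;
-- Or refresh out-of-date statistics across the whole database:
EXEC sp_updatestats;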