2.5 Unique Indexes and Constraints
Primary keys and unique constraints
are the methods you use to uniquely identify a row. Indexes and primary
keys are intertwined, and a primary key must always be indexed. By
default, creating a primary key automatically creates a unique
clustered index, but it can optionally create a unique nonclustered
index instead.
A unique index, as its name suggests, limits data
to being unique; in other words, a unique index constrains the data it
indexes. A unique constraint builds a unique index to quickly check the
data, so a unique constraint and a unique index are effectively the same
thing: creating either one builds a unique constraint/index. The only
difference between a unique constraint/index and a primary key is that
a primary key cannot allow nulls, whereas a unique constraint/index can
permit a single null value.
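To make this concrete, the following is a minimal sketch (the table and constraint names are hypothetical) showing how each option is declared in T-SQL:
-- PK_ConstraintDemo defaults to a unique clustered index;
-- UQ_ConstraintDemo_Email builds a unique nonclustered index and permits one NULL.
CREATE TABLE dbo.ConstraintDemo (
    CustomerID int NOT NULL
        CONSTRAINT PK_ConstraintDemo PRIMARY KEY,
    Email varchar(255) NULL
        CONSTRAINT UQ_ConstraintDemo_Email UNIQUE
);
-- Alternatively, back the primary key with a nonclustered index instead:
-- CONSTRAINT PK_ConstraintDemo PRIMARY KEY NONCLUSTERED (CustomerID)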
2.6 The Page Split Problem
Every index must maintain its key
column data in the correct sort order, and inserts, updates, and deletes
affect that data. As data is inserted or modified, if the index
page to which a value must be added is full, SQL Server must split
the page into two half-full pages so that it can insert the value in
the correct position. Again using the telephone book example, if
several new Chapmans moved into the area and page 515 (the Cha page)
now had to accommodate 20 additions, a simulated page split would take several
steps:
1. Cut page 515 in half, making two pages; call them 515a and 515b.
2. Print out the new Chapman listings and tape them to page 515a.
3. Tape page 515b inside the back cover of the telephone book.
4. Make a note
on page 515a that the Cha listing continues on page 515b located at the
end of the book, and a note on page 515b that the listing continues on
page 515a.
Page splits may cause several performance-related problems:
- The page split operation is expensive because it involves several steps and moving data.
- If there still isn't enough room after the page split, the page
is split again; this can repeat until enough space exists for the
inserted values.
- The data structure is left fragmented and can no longer be read in a single contiguous pass.
- Page splits are also logged operations and can have a significant impact on the transaction log.
After the split, each of the two pages has more empty space.
This means less data is read with every page read, less data is
stored in the buffer pool per page, and additional disk space is
required to store the same amount of data.
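One way to observe the fragmentation that page splits leave behind is the sys.dm_db_index_physical_stats dynamic management function; this is a minimal sketch against the AdventureWorks2012 sample database used later in this chapter:
USE AdventureWorks2012;
-- DETAILED mode reports both fragmentation and page fullness
-- for every level of every index on the table.
SELECT index_id, index_level,
    avg_fragmentation_in_percent,
    avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
    DB_ID('AdventureWorks2012'),
    OBJECT_ID('Person.Person'),
    NULL, NULL, 'DETAILED');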
2.7 Index Selectivity
Another aspect of an indexing strategy
is determining the selectivity of the index. A selective index has
many distinct index values relative to the number of rows. A primary
key or unique index has the highest possible selectivity, because
every value in the constraint is defined as unique.
An index with only a few distinct values spread
across a large table is less selective. Indexes that are less selective
may not be useful for searching. A column with three values spread
throughout the table is potentially a poor candidate for an index.
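As a rough check, you can estimate a column's selectivity directly; this is a minimal sketch against the AdventureWorks2012 sample database:
USE AdventureWorks2012;
-- Distinct values divided by total rows: a result near 1.0 is highly
-- selective, while a value near 0 (such as a three-value status
-- column) suggests a poor index candidate.
SELECT COUNT(DISTINCT LastName) * 1.0 / COUNT(*) AS Selectivity
FROM Person.Person;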
SQL Server uses its internal index statistics to track the selectivity of an index. DBCC SHOW_STATISTICS
reports the last date on which the statistics were updated and the
basic information about the index statistics, including the potential
usefulness of the statistic (see Figure 3).
A low density indicates that the index is highly selective, whereas a
high density indicates low selectivity; the terms are inverses of
each other. A high-density index may be less useful, as shown in this
code sample:
USE AdventureWorks2012;
DBCC SHOW_STATISTICS ('Person.Person',
    IX_Person_LastName_FirstName_MiddleName);
Changing the order of the key columns may improve
the selectivity of an index and improve its performance for certain
queries. Be careful however, because other queries may depend on the
order for their performance.
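As a sketch of that trade-off (the index names here are hypothetical), the leading key column determines which predicates an index can seek on:
USE AdventureWorks2012;
-- Seeks efficiently on WHERE LastName = 'Chapman',
-- optionally narrowed further by FirstName:
CREATE NONCLUSTERED INDEX IX_Demo_Last_First
    ON Person.Person (LastName, FirstName);
-- Seeks efficiently on WHERE FirstName = 'John',
-- but not on LastName alone:
CREATE NONCLUSTERED INDEX IX_Demo_First_Last
    ON Person.Person (FirstName, LastName);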
Unordered Heaps
You can create a table without a clustered index, in which case the data is stored in an unordered heap.
Instead of being stored in sorted order as defined by the clustered
index key columns, the rows are identified internally using the heap's
row identifier. The row identifier is an actual physical location
composed of three values, FileID:PageNum:SlotNum, and cannot be
directly queried. As mentioned earlier, all nonclustered indexes
contain the clustered key if the base table is clustered. The clustered
keys are used in the nonclustered index to navigate back to the
clustered index; they're basically used as a pointer in the
nonclustered index, so the base table can be used if additional columns
are needed for a given query. If the base table is not clustered, then
every nonclustered index instead stores the heap's row identifier,
which is used to point back to the full row in the base table.
In a sense, you can think of SQL Server as having
two different types of tables: a clustered (index) table and a heap
table, which are mutually exclusive. A table can never be a heap and a
clustered table at the same time.
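A minimal sketch (the table names are hypothetical) of the two mutually exclusive forms:
-- A heap: no clustered index is declared.
CREATE TABLE dbo.HeapDemo (ID int NOT NULL, Name varchar(50));
-- A clustered table: the primary key creates the clustered index.
CREATE TABLE dbo.ClusteredDemo (
    ID int NOT NULL PRIMARY KEY CLUSTERED,
    Name varchar(50)
);
-- Adding a clustered index later converts the heap into a clustered table:
CREATE CLUSTERED INDEX CIX_HeapDemo ON dbo.HeapDemo (ID);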
Query Operators
Although there are dozens of logical
and physical query execution operations, SQL Server uses three primary
operators to access data. These are also known as access methods.
- Table Scan: Reads the entire heap and, most likely, passes all the data to a secondary filter operation.
- Index Scan: Reads the entire leaf level (every row) of the
clustered index or nonclustered index. The index scan operation might
filter the rows and return only those rows that meet the criteria, or
it might pass all the rows to another filter operation depending on the
complexity of the criteria. The data may or may not be ordered.
- Index Seek: Locates specific row(s) of data using the B-tree and returns only the selected rows in an ordered list (see Figure 4).
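To see which access method the optimizer picks, compare the actual execution plans of queries like this minimal sketch; plan choice depends on your data and statistics, so the operators noted in the comments are only the likely outcomes:
USE AdventureWorks2012;
-- Likely an index seek: the predicate matches the leading key column
-- of IX_Person_LastName_FirstName_MiddleName.
SELECT LastName, FirstName
FROM Person.Person
WHERE LastName = 'Chapman';
-- Likely an index scan: the leading wildcard prevents a seek.
SELECT LastName, FirstName
FROM Person.Person
WHERE LastName LIKE '%man';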
The query optimizer chooses the access
method with the least overall cost. Sequentially reading data is an
efficient operation, so an index scan plus a filter operation may
actually be cheaper than an index seek with a bookmark lookup (see
Query Path 5 in the next section) that involves hundreds or thousands
of random I/O operations. SQL Server relies heavily on statistics to
determine the number of
rows touched and returned by each operation in the query execution
plan. If statistics are accurate, SQL Server has a great opportunity to
choose the appropriate access method to most efficiently return the
requested data. On the other hand, if statistics are skewed or
out-of-date, the likelihood that SQL Server chooses the correct access
method decreases significantly. I've seen hundreds of performance
issues over the years caused by skewed statistics.
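When skewed or stale statistics are the suspect, refreshing them is straightforward; a minimal sketch:
USE AdventureWorks2012;
-- Rebuild statistics for one table from every row rather than a sample:
UPDATE STATISTICS Person.Person WITH FULLSCAN;
-- Or refresh out-of-date statistics across the whole database:
EXEC sp_updatestats;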