SQL Server 2008 introduced a new data compression
feature that is available in Enterprise and Datacenter Editions. Data
compression helps to reduce both storage and memory requirements as the
data is compressed both on disk and when brought into the SQL Server
data cache.
When compression is enabled and data is written to
disk, it is compressed and stored in the designated compressed format.
When the data is read from disk into the buffer cache, it remains in its compressed format; this is what reduces both the storage and memory requirements. It also reduces I/O because more data can be stored
on a data page when it’s compressed. When the data is passed to another
component of SQL Server, however, the Database Engine then has to
uncompress the data on the fly. In other words, every time data has to be passed to or from the buffer cache, it has to be compressed or uncompressed, which requires extra CPU overhead. However,
in most cases, the amount of I/O and buffer cache saved by compression
more than makes up for the CPU costs, boosting the overall performance
of SQL Server.
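Before enabling compression, you can get a rough idea of the potential space savings by running the sp_estimate_data_compression_savings system stored procedure, which samples the object into tempdb and compares its current size with its size under the requested compression setting. The following is a minimal sketch that estimates row compression savings for the titles table in the bigpubs2008 database used later in this section (it assumes the table lives in the dbo schema):

EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'titles',
    @index_id = NULL,            -- NULL estimates all indexes on the table
    @partition_number = NULL,    -- NULL estimates all partitions
    @data_compression = 'ROW'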
Data compression can be applied to the following database objects:
Tables stored as heaps
Tables stored as clustered indexes
Nonclustered indexes
Indexed views
Individual partitions of partitioned tables and indexes
As the DBA, you need to evaluate which of the
preceding objects in your database could benefit from compression and
then decide whether you want to compress them using either row-level or page-level compression. Compression is enabled or disabled at the object level; there is no single option you can enable that turns compression on or off for all objects in the database. Fortunately, other than
turning compression on or off for the preceding objects, you don’t have
to do anything else to use data compression. SQL Server handles data
compression transparently without your having to re-architect your
database or your applications.
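Because the setting is applied per object (and, as discussed later, per partition), a quick way to see what is already compressed is to query the sys.partitions catalog view, which exposes a data_compression_desc column for every heap, clustered index, and nonclustered index partition. A minimal sketch:

SELECT o.name AS table_name,
       i.name AS index_name,            -- NULL for a heap
       p.partition_number,
       p.data_compression_desc          -- NONE, ROW, or PAGE
FROM sys.partitions AS p
JOIN sys.objects AS o ON o.object_id = p.object_id
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE o.is_ms_shipped = 0
ORDER BY o.name, i.index_id, p.partition_number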
Row-Level Compression
Row-level compression isn’t true data compression.
Instead, space savings are achieved by using a more efficient storage format that allows fixed-length data to use only the minimum amount of space
required. For example, the int data type uses 4 bytes of storage regardless of the value stored, even NULL. However, only a single byte is required to store a value of 100. Row-level compression allows fixed-length values to use only the amount of storage space required.
Row-level compression saves space and reduces I/O by
Reducing the amount of metadata required to store data rows
Storing
fixed-length numeric data types as if they were variable-length data
types, using only as many bytes as necessary to store the actual value
Storing CHAR data types as variable-length data types
Not storing NULL or 0 values
Row-level data compression provides less compression
than page-level data compression, but it also incurs less overhead,
reducing the amount of CPU resources required to implement it.
Row-level compression can be enabled when creating a table or index, or with the ALTER TABLE or ALTER INDEX commands, by specifying the WITH (DATA_COMPRESSION = ROW) option. The following example enables row compression on the titles table in the bigpubs2008 database:
ALTER TABLE titles REBUILD WITH (DATA_COMPRESSION=ROW)
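The same option can be specified when the table or index is first created, and ALTER INDEX accepts it when rebuilding existing indexes. In the following sketch, the sales_archive table and its index are hypothetical names used only to illustrate the syntax:

-- Create a table with row compression from the start
CREATE TABLE dbo.sales_archive
(
    sale_id   INT NOT NULL,
    sale_date DATETIME NOT NULL,
    amount    MONEY NOT NULL
)
WITH (DATA_COMPRESSION = ROW)

-- A nonclustered index is compressed independently of its base table
CREATE NONCLUSTERED INDEX ix_sales_archive_date
    ON dbo.sales_archive (sale_date)
    WITH (DATA_COMPRESSION = ROW)

-- Rebuild the existing indexes on the titles table with row compression
ALTER INDEX ALL ON titles REBUILD WITH (DATA_COMPRESSION = ROW)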
Additionally, if a table or index is partitioned, you can apply compression at the partition level.
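Each partition can use a different compression setting, or none at all. The following sketch assumes a hypothetical partitioned table named dbo.sales_history; the table and partition numbers are illustrative, but the syntax is standard:

-- Compress a single partition
ALTER TABLE dbo.sales_history
    REBUILD PARTITION = 3
    WITH (DATA_COMPRESSION = ROW)

-- Or assign different settings to ranges of partitions in one rebuild
ALTER TABLE dbo.sales_history
    REBUILD PARTITION = ALL
    WITH (DATA_COMPRESSION = PAGE ON PARTITIONS (1 TO 2),
          DATA_COMPRESSION = ROW ON PARTITIONS (3))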
When row-level compression is applied to a table, a new row format is used that is unlike the standard data row format discussed previously, which has a fixed-length data section separate from a variable-length data section.
This new row format is referred to as column descriptor, or CD, format.
The name of this row format refers to the fact that every column has descriptive information contained in the row itself. Figure 1
illustrates a representative view of the CD format (a definitive view
is difficult because, except for the header, the number of bytes in each
region is completely dependent on the values in the data row).
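There is no documented interface for viewing the CD format directly, but if you want to examine it yourself, the undocumented DBCC IND and DBCC PAGE commands can dump the raw rows of a compressed table. The page ID below is only a placeholder for one of the values DBCC IND returns:

DBCC TRACEON(3604)                       -- route DBCC output to the client session
DBCC IND('bigpubs2008', 'titles', -1)    -- list the pages allocated to the table and its indexes
DBCC PAGE('bigpubs2008', 1, 280, 3)      -- dump one of those data pages (file 1, page 280 is a placeholder)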
The row header is always 1 byte in length and contains information similar to Status Bits A in a normal data row:
Bit 0— This bit indicates the type of record (1 = CD record format).
Bit 1— This bit indicates whether the row contains versioning information.
Bits 2–4—
This three-bit value indicates what kind of information is stored in
the row (such as primary record, ghost record, forwarding record, index
record).
Bit 5— This bit indicates whether the row contains a Long data region (with values greater than 8 bytes in length).
Bits 6 and 7— These bits are not used.
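As a quick illustration of this layout, the following sketch decodes a hypothetical header byte using the bit positions just described (the value 0x21 is an example only, not taken from an actual page):

DECLARE @header INT = 0x21               -- hypothetical CD record header byte

SELECT @header & 0x01         AS is_cd_record,    -- bit 0: 1 = CD record format
       (@header & 0x02) / 2   AS has_versioning,  -- bit 1: versioning information present
       (@header & 0x1C) / 4   AS record_kind,     -- bits 2-4: primary, ghost, forwarding, or index record
       (@header & 0x20) / 32  AS has_long_data    -- bit 5: row contains a long data region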
The CD region consists of two parts. The first is
either a 1- or 2-byte value indicating the number of short columns (8
bytes or less). If the most significant bit of the first byte is set to 0, it’s a 1-byte field representing up to 127 columns; if it’s 1,
it’s a 2-byte field representing up to 32,767 columns. Following the
first 1 or 2 bytes is the CD array. The CD array uses 4 bits for each
column in the table to represent information about the length of the column. A bit representation of 0 indicates the column is NULL. A bit representation of the values 1 to 9 indicates the column is 0 to 8 bytes in length, respectively. A bit representation of 10 (0xa)
indicates that the corresponding column value is a long data value and
uses no space in the short data region. A bit representation of 11 (0xb) represents a bit column with a value of 1, and a bit representation of 12 (0xc)
indicates that the corresponding value is a 1-byte symbol representing a
value in the page compression dictionary (the page compression
dictionary is discussed next in the page-level compression section).
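To make the encoding concrete, here is a worked example (as comments) for a hypothetical four-column row stored with row compression, applying the CD codes just described:

-- Hypothetical row: c1 INT = 100, c2 CHAR(10) = 'abc', c3 BIT = 1, c4 VARCHAR(500) holding a 20-byte string
-- CD array entries (one 4-bit value per column):
--   c1 -> 2    value 100 fits in 1 byte (codes 1 to 9 mean 0 to 8 bytes)
--   c2 -> 4    stored as variable length: 3 bytes after trimming trailing blanks
--   c3 -> 0xb  a bit column containing the value 1
--   c4 -> 0xa  long value; it occupies the long data region and no space in the short data region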
The short data region contains each of the short data
values. However, because accessing the last columns can be expensive if
there are hundreds of columns in the table, columns are grouped into
clusters of 30 columns. At the beginning of the short data region, there
is an area called the short data cluster array.
Each entry in the array is a single byte, which indicates the sum of
the sizes of all the data in the previous cluster in the short data
region; the value is essentially a pointer to the first column of the
cluster (no row offset is needed for the first cluster because it starts
immediately after the CD region).
Any data value in the row longer than 8 bytes is
stored in the long data region. This can include LOB and row-overflow
pointers. Long data needs an actual offset value to allow SQL Server to
locate each value. This offset array looks similar to the offset array
used in the standard data row structure. The long data region
consists of three parts: an offset array, a long data cluster array, and
the long data. The long data cluster array is similar to the short data
cluster array; it has one entry for each 30-column cluster (except for
the last one) and serves to limit the cost of locating columns near the
end of a long list of columns.
The special information section at the end of the row
contains three optional pieces of information. The existence of any or
all of this information is indicated by bits in the first 1-byte header
at the beginning of the row. The three special pieces of information are
Forwarding pointer— This pointer is used in a heap when a row is forwarded due to an update.
Back pointer— If the row is a forwarded row, it contains a pointer back to the original row location.
Versioning information— If snapshot isolation is being used, 14 bytes of versioning information are appended to the row.