SQL Server 2008 : Data compression (part 1) - Row compression, Page compression

10/7/2013 3:30:44 AM

Predicting the future is a risky business, but if there's one thing that's guaranteed in future databases, it's an ever-growing volume of data to store, manage, and back up. Growing regulatory requirements, the plummeting cost of disk storage, and new types of information such as digitized video are converging to create what some call an information storage explosion. Managing this information while making it readily accessible to a wide variety of client applications is a challenge for all involved, particularly the DBA. Fortunately, SQL Server 2008 delivers a number of new features and enhancements to assist DBAs in this regard.

In the previous section we looked at the FileStream data type, which enhances the storage options for BLOB data. One of the great things about FileStream is the ability to have BLOBs stored outside the database on compressed NTFS volumes. Until SQL Server 2008, compressing data inside the database was limited to basic options such as variable-length columns, or complex options such as custom-developed compression routines.

In this section, we'll focus on a new feature in SQL Server 2008, data compression, which enables us to natively compress data inside the database without requiring any application changes. We'll begin with an overview of data compression and its advantages before looking at the two main ways in which SQL Server implements it. We'll finish with a number of important considerations in designing and implementing a compressed database.

1. Data compression overview

Data compression, available only in the Enterprise edition of SQL Server 2008, allows you to compress individual tables and indexes using either page compression or row compression, both of which we'll cover shortly. Due to its potentially adverse impact on performance, there's no option to compress the entire database in one action.

As you can see in figure 1, you can manage a table's compression by right-clicking it and choosing Storage > Manage Compression.

When considering compression in a broad sense, lossy and lossless are terms used to categorize the compression method used. Lossy compression techniques are used in situations where a certain level of data loss between the compressed and uncompressed file is accepted as a consequence of gaining higher and faster compression rates. JPEG images are a good example of lossy compression, where a reduction in data quality between the original image and the compressed JPEG is acceptable. Video and audio streaming are other common applications for lossy compression. It goes without saying that lossy compression is unacceptable in a database environment.

Figure 1. Individual tables can be selected for compression using SQL Server Management Studio.

SQL Server implements its own custom lossless compression algorithm and attempts to strike a balance between high compression rates and low CPU overhead. Compression rates and overhead will vary, and are dependent on a number of factors that we'll discuss, including fragmentation levels, the compression method chosen, and the nature of the data being compressed.

Arguably the most powerful aspect of SQL Server's implementation of data compression is the fact that the compressed data remains compressed, on disk and in the buffer cache, until it's actually required, at which point only the individual columns that are required are uncompressed. Compared to a file system-based compression solution, this results in the lowest CPU overhead while maximizing cache efficiency, and is clearly tailored toward the unique needs of a database management system.

Let's consider some of the benefits of data compression:

Lower storage costs—Despite the rapidly decreasing cost of retail storage, storage found in high-end SANs, typically used by SQL Server implementations, is certainly not cheap, particularly when considering actual:usable RAID ratios and duplication of data for various purposes, as figure 2 shows.
Lower administration costs—As databases grow in size, disk I/O-bound administration tasks such as backups, DBCC checks, and index maintenance take longer and longer. By compressing certain parts of the database, we're able to reduce the administration impact. For example, a database that's compressed to half its size will take roughly half the time to back up.

Figure 2. Compared to retail disk, enterprise SAN storage is expensive, a cost magnified by RAID protection and data duplication such as that shown here.
RAM and disk efficiency—As mentioned earlier, compressed data read into the buffer cache will remain compressed until required, effectively boosting the buffer size. Further, as the data is compressed on disk, the same quantity of disk time will effectively read more data, thus boosting disk performance as well.

SQL Server 2008 implements two different methods of data compression: page compression and row compression. The makeup of the data in the table or index determines which of these two will yield the best outcome. As we'll see shortly, we can use supplied tools to estimate the effectiveness of each method before proceeding.

2. Row compression

Row compression extends the variable-length storage format found in previous versions of SQL Server to all fixed-length data types. For example, in the same manner that the varchar data type is used to reduce the storage footprint of variable length strings, SQL Server 2008 can compress integer, char, and float data in the same manner. Crucially, the compression of fixed-length data doesn't expose the data type any differently to applications, so the benefits of compression are gained without requiring any application changes.

As an example, consider a table with millions of rows containing an integer column with a maximum value of 19. We could convert the column to tinyint, but not if we need to support the possibility of much larger values. In this example, significant disk savings could be derived through row compression, without requiring any application changes.

An alternative to row compression is page compression, our next topic.

3. Page compression

In contrast to row compression, page compression, as the name suggests, operates at the page level and uses techniques known as page-dictionary and column-prefix to identify common data patterns in rows on each page of the compressed object. When common patterns are found, they're stored once on the page, with references made to the common storage in place of the original data pattern. In addition to these methods, page compression includes row compression, therefore delivering the highest compression rate of the two options.

Page compression removes redundant storage of data. Consider an example of a large table containing columns with a default value specification. If a large percentage of the table's rows have the default value, there's obviously a good opportunity to store this value once on each page and refer all instances of that value to the common store.

Compressing a table, using either the page or row technique, involves a considerable amount of work by the SQL Server engine. To ensure the benefits outweigh the costs, you must take a number of factors into account.

Others

- SQL Server 2008 : BLOB storage with FileStream (part 2) - FileStream data

- SQL Server 2008 : BLOB storage with FileStream (part 1) - BLOBS in the database, BLOBS in the file system

- Windows Small Business Server 2011 : Managing Software Updates - Using SBS Software Updates (part 2) - Deploying Updates

- Windows Small Business Server 2011 : Managing Software Updates - Using SBS Software Updates (part 1) - Configuring Software Update Settings

- Windows Small Business Server 2011 : Managing Software Updates - The Patching Cycle

- Windows Phone 7 : Creating a Game Framework - Benchmarking and Performance (part 2)

- Windows Phone 7 : Creating a Game Framework - Benchmarking and Performance (part 1)

- Operating and Monitoring Exchange Server 2013 : Monitoring Enhancements in Exchange 2013

- Operating and Monitoring Exchange Server 2013 : Reporting

- Operating and Monitoring Exchange Server 2013 : Monitoring,Alerting, Inventory

SQL Server 2008 : Data compression (part 1) - Row compression, Page compression

1. Data compression overview

Figure 1. Individual tables can be selected for compression using SQL Server Management Studio.

Figure 2. Compared to retail disk, enterprise SAN storage is expensive, a cost magnified by RAID protection and data duplication such as that shown here.

2. Row compression

3. Page compression