After a database has been normalized to
third normal form, database designers often intentionally back off from full
normalization to improve the performance of the system. This technique
of selectively rolling back normalization is called denormalization.
Denormalization allows you to keep redundant data in the system,
reducing the number of tables in the schema and the number of
joins required to retrieve data.
Tip
Duplicating data is most helpful when the data does
not change very much, such as in data warehouses. If the data changes
often, keeping all the “copies” of the data in sync can create significant
performance overhead, including long transactions and excessive write
operations.
1. Denormalization Guidelines
When should you denormalize a database? Consider the following points:
Be sure you have a good overall
understanding of the logical design of the system. This knowledge helps
in determining how other parts of the application are going to be
affected when you change one part of the system.
Don’t
attempt to denormalize the entire database at once. Instead, focus on
the specific areas and queries that are accessed most frequently and
are suffering from performance problems.
Understand
the types of transactions and the volume of data associated with
specific areas of the application that are having performance problems.
You can resolve many such issues by tuning the queries without
denormalizing the tables.
Determine
whether you need virtual (computed) columns. Virtual columns can be
computed from other columns of the table. Although this violates third
normal form, computed columns can provide a decent compromise because
they do not actually store another exact copy of the data in the same
table.
Understand data integrity issues.
With more redundant data in the system, maintaining data integrity is
more difficult, and data modifications are slower.
Understand
storage techniques for the data. You may be able to improve performance
without denormalization by using RAID, SQL Server filegroups, and table
partitioning.
Determine the frequency
with which data changes. If data is changing too often, the cost of
maintaining data and referential integrity might outweigh the benefits
provided by redundant data.
Use the
performance tools that come with SQL Server (such as SQL Server
Profiler) to assess performance. These tools can help isolate
performance issues and give you possible targets for denormalization.
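For example, a quick way to find such candidates is to look at the execution statistics SQL Server already collects. The following query is a minimal sketch, assuming SQL Server 2005 or later and the standard dynamic management views; it lists the ten statements with the highest cumulative logical reads:
-- A sketch: identify the most expensive frequently run statements using
-- sys.dm_exec_query_stats (cumulative statistics since the last restart)
select top (10)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_worker_time,
       substring(st.text, (qs.statement_start_offset / 2) + 1,
                 ((case qs.statement_end_offset
                      when -1 then datalength(st.text)
                      else qs.statement_end_offset
                   end - qs.statement_start_offset) / 2) + 1) as statement_text
from   sys.dm_exec_query_stats as qs
       cross apply sys.dm_exec_sql_text(qs.sql_handle) as st
order by qs.total_logical_reads desc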
Tip
If you are experiencing severe performance problems, denormalization should not
be the first step you take to rectify the problem. You need to identify
specific issues that are causing performance problems. Usually, you
discover factors such as poorly written queries, poor index design,
inefficient application code, or poorly configured hardware. You should
try to fix these types of issues before taking steps to denormalize database tables.
2. Essential Denormalization Techniques
You can use various methods to denormalize a
database table and achieve desired performance goals. Some of the
more useful denormalization techniques include the following:
Keeping redundant data and summary data
Using virtual columns
Performing horizontal data partitioning
Performing vertical data partitioning
Redundant Data
From an I/O standpoint, joins in a relational
database are inherently expensive. To avoid common joins, you can add
redundancy to a table by keeping exact copies of the data in multiple
tables. The following example demonstrates this point with a three-table
join that retrieves the title of a book and the primary author’s name:
select c.title,
       a.au_lname,
       a.au_fname
from   authors a
       join titleauthor b on a.au_id = b.au_id
       join titles c on b.title_id = c.title_id
where  b.au_ord = 1
order by c.title
You could improve the performance of this query by adding the columns for the first and last names of the primary author to the titles table and storing the information in the titles
table directly. This would eliminate the joins altogether. Here is what
the revised query would look like if this denormalization technique
were implemented:
select title,
       au_lname,
       au_fname
from   titles
order by title
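The following statements are a rough sketch of the denormalization step itself, assuming the pubs data types of varchar(40) and varchar(20) for au_lname and au_fname:
-- Add the redundant name columns to titles and populate them with the
-- primary author's name (au_ord = 1) from the authors table
alter table titles add au_lname varchar(40) null,
                       au_fname varchar(20) null
go
update t
set    t.au_lname = a.au_lname,
       t.au_fname = a.au_fname
from   titles t
       join titleauthor ta on t.title_id = ta.title_id
       join authors a on ta.au_id = a.au_id
where  ta.au_ord = 1
go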
As you can see, the au_lname and au_fname columns are now redundantly stored in two places: the titles table and the authors
table. With more redundant data in the system,
maintaining referential integrity and data integrity is more difficult.
For example, if an author’s last name changed in the authors table, you would also have to change the corresponding au_lname value in the titles
table to preserve data integrity. You could use SQL Server triggers
to maintain data integrity, but you should recognize that update
performance could suffer dramatically. For this reason, it is best to limit
redundant data to columns whose values are relatively
static and are not modified often.
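As a rough illustration of that maintenance cost, the following sketch shows one possible trigger on the authors table (the trigger name is illustrative, and error handling is omitted) that pushes name changes into the redundant columns in titles:
-- Keep the redundant au_lname/au_fname columns in titles synchronized
-- whenever a primary author's name changes in the authors table
create trigger trg_authors_sync_titles on authors
after update
as
begin
    set nocount on
    if update(au_lname) or update(au_fname)
    begin
        update t
        set    t.au_lname = i.au_lname,
               t.au_fname = i.au_fname
        from   titles t
               join titleauthor ta on t.title_id = ta.title_id
               join inserted i on ta.au_id = i.au_id
        where  ta.au_ord = 1
    end
end
go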
Computed Columns
A number of queries calculate aggregate values
derived from one or more columns of a table. Such computations can be
CPU intensive and can have an adverse impact on performance if they are
performed frequently. One technique for handling such situations
is to create an additional column that represents the computed value. Such
columns are called virtual columns, or computed columns. Computed columns have been natively supported since SQL Server 7.0, and you can specify them in create table or alter table commands. The following example demonstrates the use of computed columns:
create table emp (
empid int not null primary key,
salary money not null,
bonus money not null default 0,
total_salary as ( salary+bonus )
)
go
insert emp (empid, salary, bonus) values (100, $150000.00, $15000)
go
select * from emp
go
empid       salary                bonus                 total_salary
----------- --------------------- --------------------- ---------------------
100         150000.0000           15000.0000            165000.0000
By default, virtual columns are not physically
stored in SQL Server tables. SQL Server internally maintains a column
property, exposed as is_computed in the sys.columns system view, that it
uses to determine whether a column is
computed. The value of a virtual column is calculated at the time the
query is run. All columns referenced
in the computed column expression must come from the table on which the
computed column is created. You can, however, reference a column from
another table by having the computed column expression call a function,
and that function can contain a reference to the other table.
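For example, the following simple query against the emp table created earlier shows which of its columns are computed:
-- List the columns of emp and flag the computed ones
select name, is_computed
from   sys.columns
where  object_id = object_id('emp')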
Since SQL Server 2000, computed columns have been
able to participate in joins to other tables, and they can be indexed.
Creating an index that contains a computed column creates a physical
copy of the computed column in the index tree. Whenever a base column
participating in the computed column changes, the index must also be
updated, which adds overhead and can slow down update
performance.
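For example, the total_salary computed column from the earlier emp example could be indexed as follows (a sketch; the index name is illustrative, and the usual requirements for indexing computed columns, such as a deterministic expression and the proper SET options, apply):
-- Index the computed column; the computed values are physically stored
-- in the index tree and must be maintained on updates to salary or bonus
create index idx_emp_total_salary on emp (total_salary)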
In SQL Server 2008, you also have the option of
defining a computed column so that its value is physically stored. You
accomplish this with the ADD PERSISTED option, as shown in the following example:
--Alter the computed SetRate column to be PERSISTED
ALTER TABLE Sales.CurrencyRate
ALTER COLUMN SetRate ADD PERSISTED
SQL Server automatically updates the persisted
column values whenever one of the columns that the computed column
references is changed. Indexes can be created on these columns, and
they can be used just like nonpersisted computed columns. One advantage of
a persisted computed column is that it has fewer indexing restrictions
than a nonpersisted one: an index can be created on a persisted computed
column even when its expression is imprecise (as long as it is deterministic),
which is not possible for a nonpersisted computed column. Any float or real
expressions are considered imprecise. To determine whether a computed column expression is precise, you can check the IsPrecise property with the COLUMNPROPERTY function.
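For example, the following checks the total_salary computed column from the earlier emp example; a result of 1 means the expression is precise:
-- Check whether the computed column expression is precise
select columnproperty(object_id('emp'), 'total_salary', 'IsPrecise') as IsPrecise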
Summary Data
Summary data is most helpful in a decision support
environment where you need to satisfy reporting requirements by calculating
sums, row counts, or other summary information and storing it in a separate
table. You can create summary data in a number of ways:
Real-time—
Every time your base data is modified, you can recalculate the summary
data, using the base data as a source. This is typically done using
stored procedures or triggers.
Real-time incremental—
Every time your base data is modified, you can recalculate the summary
data, using the old summary value and the new data. This approach is
more complex than the real-time option, but it could save time if the
increments are relatively small compared to the entire dataset. This,
too, is typically done using stored procedures or triggers.
Delayed—
You can use a scheduled job or custom service application to
recalculate summary data on a regular basis. This is the recommended
method to use in an OLTP system to keep update performance optimal.
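The following stored procedure is a minimal sketch of the delayed approach; the sales and sales_summary tables and their columns are hypothetical, and the procedure could be run from a scheduled SQL Server Agent job:
-- Rebuild a summary table from the base data on a schedule
-- (table and column names are illustrative)
create procedure dbo.refresh_sales_summary
as
begin
    set nocount on
    begin transaction
        truncate table dbo.sales_summary
        insert dbo.sales_summary (customer_id, order_count, total_amount)
        select customer_id, count(*), sum(amount)
        from   dbo.sales
        group by customer_id
    commit transaction
end
go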
Horizontal Data Partitioning
As
tables grow larger, data access time also tends to increase. For
queries that need to perform table scans, the query time is
proportional to the number of rows in the table. Even when you have
proper indexes on such tables, access time slows as the depth of the
index trees increases. One solution is to split the table into
multiple tables, each with the same structure as
the original but storing a different subset of the data. Figure 1
shows a billing table with 90 million records. You can split this table
into 12 monthly tables (all with identical structure), each storing
the billing records for one month.
You should carefully weigh the options when
performing horizontal splitting. Although a query that needs data from
only a single month gets much faster, other queries that need a full
year’s worth of data become more complex. Also, queries that are
self-referencing do not benefit much from horizontal partitioning. For
example, the business logic might dictate that each time you add a new
billing record to the billing table, you need to check any outstanding
account balance for previous billing dates. In such cases, before you
do an insert in the current monthly billing table, you must check the
data for all the other months to find any outstanding balance.
Tip
Horizontal
splitting of data is useful where a subset of data might see more
activity than the rest of the data. For example, say that in a
healthcare provider setting, 98% of the patients are inpatients, and
only 2% are outpatients. In spite of the small percentage involved, the
system for outpatient records sees a lot of activity. In this scenario,
it makes sense to split the patient table into two tables—one for the
inpatients and one for the outpatients.
When splitting tables horizontally, you must perform
some analysis to determine the optimal way to split the table. You need
to find a logical dimension along which to split the data. The best
choice takes into account the way your users use your data. In the
example that involves splitting the data among 12 tables, date was
mentioned as the optimal split candidate. However, if the users often
did ad hoc queries against the billing table for a full year’s worth of
data, they would be unhappy with the choice to split that data among 12
different tables. Perhaps splitting based on a customer type or another
attribute would be more useful.
Note
You can use partitioned views to hide the horizontal
splitting of tables. The benefit of using partitioned views is that
multiple horizontally split tables appear to the end users and
applications as a single large table. When this is properly defined,
the optimizer automatically determines which tables in the partitioned
view need to be accessed, and it avoids searching all tables in the
view. The query runs as quickly as if it were run only against the
necessary tables directly.
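As a rough sketch of this approach for the monthly billing example, two of the member tables and the view might look like the following (table names, columns, and date ranges are illustrative; each member table needs a CHECK constraint on the partitioning column so the optimizer can eliminate tables):
-- Two of the 12 monthly member tables, each constrained to its month
create table billing_2008_01 (
    billing_id   int      not null,
    billing_date datetime not null
        check (billing_date >= '20080101' and billing_date < '20080201'),
    amount       money    not null,
    constraint pk_billing_2008_01 primary key (billing_id, billing_date)
)
create table billing_2008_02 (
    billing_id   int      not null,
    billing_date datetime not null
        check (billing_date >= '20080201' and billing_date < '20080301'),
    amount       money    not null,
    constraint pk_billing_2008_02 primary key (billing_id, billing_date)
)
go
-- The partitioned view presents the member tables as a single table
create view billing_all as
    select billing_id, billing_date, amount from billing_2008_01
    union all
    select billing_id, billing_date, amount from billing_2008_02
go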
In SQL Server 2008, you also have the option of
physically splitting the rows in a single table over more than one
partition. This feature, called partitioned tables,
uses a partition function that splits the data horizontally and
a partition scheme that assigns the horizontally partitioned data to
different filegroups. When the table is created, it references the
partition scheme, which causes the rows of data to be physically
stored on different filegroups. No additional tables are needed, and
the table is still referenced with the original table name. The
horizontal partitioning happens at the physical storage level and is
transparent to the user.
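A minimal sketch of a partitioned table for the same billing example might look like the following; the filegroup names are hypothetical, and only the first few monthly boundaries are shown:
-- The partition function defines the monthly boundary values
create partition function pf_billing_monthly (datetime)
as range right for values ('20080201', '20080301', '20080401')
go
-- The partition scheme maps each partition to a filegroup
-- (fg_jan through fg_apr must already exist in the database)
create partition scheme ps_billing_monthly
as partition pf_billing_monthly
to (fg_jan, fg_feb, fg_mar, fg_apr)
go
-- The table is created on the partition scheme, partitioned by billing_date
create table billing (
    billing_id   int      not null,
    billing_date datetime not null,
    amount       money    not null
) on ps_billing_monthly (billing_date)
go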
Vertical Data Partitioning
As you know, SQL Server stores data on
8KB pages, and a single row cannot span multiple pages. Therefore, the total
number of rows on a page depends on the width of the table: the wider
the table, the fewer rows fit on each page. You can
achieve significant performance gains by increasing the number of rows
per page, which in turn reduces the number of I/Os on the table.
Vertical splitting is a method of reducing the width of a table by
splitting the columns of the table into multiple tables. Usually, all
frequently used columns are kept in one table, and others are kept in
the other table. This way, more records can be accommodated per page,
fewer I/Os are generated, and more data can be cached into SQL Server
memory. Figure 2 illustrates a vertically partitioned table. The frequently accessed columns of the authors table are stored in the author_primary table, whereas less frequently used columns are stored in the author_secondary table.
Tip
Make the decision to split data very carefully,
especially when the system is already in production. Changing the data
structure might have a system-wide impact on a large number of queries
that reference the old definition of the object. In such cases, to
minimize risks, you might want to use SQL Server views to hide the
vertical partitioning of data. Also, if you find that users and
developers are frequently joining between the vertically split tables
because they need to pull data together from the two tables, you might
want to reconsider the split point or the splitting of the table
itself. Doing frequent joins between split tables with smaller rows
requires more I/Os to retrieve the same data than if the data resided
in a single table with wider rows.
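As a rough sketch of the split shown in Figure 2, and of a view that hides it, consider the following; the choice of columns for each table is illustrative, and the pubs authors column definitions are assumed:
-- Frequently accessed columns stay in author_primary
create table author_primary (
    au_id    varchar(11) not null primary key,
    au_lname varchar(40) not null,
    au_fname varchar(20) not null,
    contract bit         not null
)
-- Less frequently used columns move to author_secondary
create table author_secondary (
    au_id   varchar(11) not null primary key
            references author_primary (au_id),
    phone   char(12)    null,
    address varchar(40) null,
    city    varchar(20) null,
    state   char(2)     null,
    zip     char(5)     null
)
go
-- A view that hides the vertical split from existing queries
create view authors_all as
    select p.au_id, p.au_lname, p.au_fname, p.contract,
           s.phone, s.address, s.city, s.state, s.zip
    from   author_primary p
           left outer join author_secondary s on p.au_id = s.au_id
go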
Performance Implications of Zero-to-One Relationships
Suppose
that one of the development managers in your company, Bob, approaches
you to discuss some database schema changes. He is one of several
managers whose groups all use the central User table in your database. Bob’s application makes use of about 5% of the users in the User
table. Bob has a requirement to track five yes/no/undecided flags
associated with those users. He would like you to add five
one-character columns to the User table to track this information. What do you tell Bob?
Bob has a classic zero-to-one problem. He has some
data he needs to track, but it applies to only a small subset of the
data in the table. You can approach this problem in one of three ways:
Option 1: Add the columns to the User table—In this case, 95% of your users will have NULL values in those columns, and the table will become wider for everybody.
Option 2: Create a new table with a vertical partition of the User table—The new table will contain the User primary key and Bob’s five flags. In this case, 95% of your users will still have NULL data in the new table, but the User
table is protected against these effects. Because other groups don’t
need to use the new partition table, this is a nice compromise.
Option 3: Create a new vertically partitioned table as in Option 2 but populate it only with rows that have at least one non-NULL
value for the columns in the new partition—This option is great for
database performance, and searches in the new table will be wonderfully
fast. The only drawback to this approach is that Bob’s developers will
have to add additional logic to their applications to determine whether
a row exists during updates. Bob’s folks will need to use an outer join
to the table to cover the possibility that a row doesn’t exist.
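For example, assuming the new table is named user_flags and is keyed on user_id (the table and column names here are hypothetical), Bob’s queries would pull the flags with an outer join like this:
-- A left outer join covers users that have no row in user_flags;
-- their flag columns simply come back as NULL
select u.user_id,
       u.user_name,
       f.flag1, f.flag2, f.flag3, f.flag4, f.flag5
from   [User] u
       left outer join user_flags f on u.user_id = f.user_id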
Depending on the goals of the project, any
one of these options can be appropriate. Option 1 is simple and is the
easiest to code for and understand. Option 2 is a good compromise
between performance and simplicity. Option 3 gives the best performance
in certain circumstances, but it hurts performance in others (such as
updates that must first determine whether a row exists) and definitely
requires more coding work.