1. Basic Tenets of Designing for Performance
Designing for performance requires making
trade-offs. For example, to get the best write performance out of a
database, you must sacrifice read performance. Before you tackle
database design issues for an application, it is critical to understand
your goals. Do you want faster read performance? Faster write
performance? A more understandable design?
Following are some basic truths about physical database design for SQL Server 2008 and the performance implications of each:
It’s important to keep table row sizes as
small as possible. Doing so is not about saving disk space. Having
smaller rows means more rows fit on a single 8KB page, which means
fewer physical disk reads are required to read a given number of rows.
You should use indexes to speed up read access. However, the more indexes a table has, the longer it takes to insert, update, and delete rows from the table (see the index sketch following this list).
Using triggers to perform any
kind of work during an insert, an update, or a delete exacts a
performance toll and decreases concurrency by lengthening transaction
duration.
Implementing declarative
referential integrity (via primary and foreign keys) helps maintain
data integrity, but enforcing foreign key constraints requires extra
lookups on the primary key table to ensure existence.
Using ON DELETE CASCADE referential integrity constraints helps maintain data integrity but requires extra work on the server’s part.
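For instance, the index trade-off mentioned above can be sketched in T-SQL roughly as follows. The table and index names here are hypothetical, used only to illustrate the point:

-- A hypothetical table and a nonclustered index that speeds up reads by last name
CREATE TABLE dbo.customers (
    customer_id int         NOT NULL PRIMARY KEY,
    last_name   varchar(40) NOT NULL,
    first_name  varchar(40) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_customers_last_name
    ON dbo.customers (last_name);

-- This query can seek on the index instead of scanning the whole table...
SELECT customer_id, last_name
FROM dbo.customers
WHERE last_name = 'Smith';

-- ...but every INSERT, UPDATE, and DELETE against dbo.customers must now
-- also maintain IX_customers_last_name, which lengthens write operations.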
Keeping tables as narrow as possible—that is,
ensuring that the row size is as small as possible—is one of the most
important things you can do to ensure that a database performs well. To
keep your tables narrow, you should choose column data types with size
in mind. You shouldn’t use the bigint data type if the int data type
will do. If you have zero-to-one relationships in tables, you should
consider vertically partitioning the tables.
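The following sketch illustrates both ideas with a hypothetical example; the table names, column sizes, and data types are assumptions chosen only to show the technique:

-- A narrow table: data types sized to the data they actually hold
CREATE TABLE dbo.products (
    product_id   int         NOT NULL PRIMARY KEY, -- int rather than bigint when the key range allows it
    product_name varchar(60) NOT NULL,             -- sized to the data rather than varchar(max)
    list_price   smallmoney  NOT NULL,
    date_added   date        NOT NULL              -- date (3 bytes) instead of datetime (8 bytes)
);

-- A zero-to-one attribute vertically partitioned into its own table so the
-- main table's rows stay small and more of them fit on each 8KB page
CREATE TABLE dbo.product_descriptions (
    product_id  int          NOT NULL PRIMARY KEY
        REFERENCES dbo.products (product_id),
    description varchar(max) NULL
);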
Cascading deletes (and updates) cause extra
lookups to be done whenever a delete runs against the parent table. In
many cases, the optimizer uses worktables to resolve delete and update
queries. Enforcing these constraints manually, from within stored
procedures, for example, can give better performance. This is not a blanket argument against referential integrity constraints. In most cases, the extra performance hit is worth the aggravation saved by not having to code everything by hand. However, you should be aware of the cost of this convenience.
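A manual cascading delete coded in a stored procedure might look something like the following minimal sketch. The customers and orders tables and their columns are hypothetical, and real code would also need error handling:

CREATE PROCEDURE dbo.delete_customer
    @customer_id int
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        -- Delete the child rows first so the foreign key is not violated,
        -- then delete the parent row.
        DELETE FROM dbo.orders    WHERE customer_id = @customer_id;
        DELETE FROM dbo.customers WHERE customer_id = @customer_id;
    COMMIT TRANSACTION;
END;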
2. Logical Database Design Issues
A good database design is fundamental to the success
of any application. Logical database design for relational databases
follows rules of normalization. As a result of normalization, you
create a data model that is usually, but not necessarily, translated
into a physical data model. A logical database design does not depend
on the relational database you intend to use. The same data model can
be applied to Oracle, Sybase, SQL Server, or any other relational
database. On the other hand, a physical data model makes extensive use
of the features of the underlying database engine to yield optimal
performance for the application. Physical models are much less portable
than logical models.
Tip
If portability is a big concern to you, consider
using a third-party data modeling tool, such as ERwin or ERStudio.
These tools have features that make it easier to migrate your logical
data models to physical data models on different database platforms. Of
course, using these tools just gets you started; to get the best
performance out of your design, you need to tweak the physical design
for the platform you have chosen.
Normalization Conditions
Any database designer must address two fundamental issues:
Designing the database in a simple, understandable way that is maintainable and makes sense to its developers and users
Designing the database such that data is fetched and saved with the fastest response time, resulting in high performance
Normalization is a
technique used on relational databases to organize data across many
tables so that related data is kept together based on certain
guidelines. Normalization results in controlled redundancy of data;
therefore, it provides a good balance between disk space usage and
performance. Normalization helps people understand the relationships
between data and enforces rules to ensure that the data is meaningful.
Tip
Normalization rules exist, among other reasons, to
make it easier for people to understand the relationships between data.
But under certain circumstances, a perfectly normalized database doesn’t perform well, and it may be difficult to understand.
There are good reasons to deviate from a perfectly normalized database.
Normalization Forms
Five normalization forms exist, represented by the symbols 1NF for first normal form, 2NF for second normal form, and so on. If you follow the first rule of normalization, your database can be described as “in first normal form.”
Each rule of normalization depends on the previous
rule for successful implementation, so to be in second normal form
(2NF), your database must also follow the rules for first normal form.
A typical relational database used in a business
environment falls somewhere between second and third normal forms. It
is rare to progress past the third normal form because fourth and fifth
normal forms are more academic than practical in real-world
environments.
Following is a brief description of the first three rules of normalization.
First Normal Form
The first rule of normalization requires removing
repeating data values and specifies that no two rows in a table can be
identical. This means that each table must have a logical primary key
that uniquely identifies a row in the table.
Consider a table that has four columns—PublisherName, Title1, Title2, and Title3—for storing up to three titles for each publisher. This table is not in first normal form due to the repeating Title columns. The main problem with this design is that it limits the number of titles associated with a publisher to three.
Removing the repeating columns so there is just a PublisherName column and a single Title
column puts the table in first normal form. A separate data row is
stored in the table for each title published by each publisher. The
combination of PublisherName and Title becomes the primary key that uniquely identifies each row and prevents duplicates.
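In T-SQL, the two designs might be sketched as follows. The column names come from the example above; the data type sizes are assumptions:

-- Not in first normal form: the repeating Title columns limit each
-- publisher to three titles.
CREATE TABLE dbo.publisher_titles_flat (
    PublisherName varchar(100) NOT NULL PRIMARY KEY,
    Title1        varchar(200) NULL,
    Title2        varchar(200) NULL,
    Title3        varchar(200) NULL
);

-- In first normal form: one row per publisher/title combination; the
-- composite primary key prevents duplicate rows.
CREATE TABLE dbo.publisher_titles (
    PublisherName varchar(100) NOT NULL,
    Title         varchar(200) NOT NULL,
    CONSTRAINT PK_publisher_titles PRIMARY KEY (PublisherName, Title)
);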
Second Normal Form
A table is considered to be in second normal form if
it conforms to the first normal form and all nonkey attributes of the
table are fully dependent on the entire primary key. If the primary key
consists of multiple columns, nonkey columns should depend on the
entire key and not just on a part of the key. A table with a single
column as the primary key is automatically in second normal form if it
satisfies first normal form as well.
Assume that you need to add the publisher address to the database. Adding it to the table with the PublisherName and Title columns would violate second normal form. The primary key consists of both PublisherName and Title, but the PublisherAddress attribute is an attribute of the publisher only. It does not depend on the entire primary key.
To put the database in second normal form requires adding an additional table for storing publisher information. One table consists of the PublisherName and PublisherAddress columns. The second table contains the PublisherName and Title columns. To retrieve the PublisherName, Title, and PublisherAddress information in a single result would require a join between the two tables on the PublisherName column.
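Continuing the publisher example, a sketch of the second normal form design and the join needed to reassemble the information might look like this (the data type sizes are again assumptions):

-- Publisher attributes keyed on PublisherName alone
CREATE TABLE dbo.publishers (
    PublisherName    varchar(100) NOT NULL PRIMARY KEY,
    PublisherAddress varchar(200) NULL
);

CREATE TABLE dbo.publisher_titles (
    PublisherName varchar(100) NOT NULL
        REFERENCES dbo.publishers (PublisherName),
    Title         varchar(200) NOT NULL,
    CONSTRAINT PK_publisher_titles PRIMARY KEY (PublisherName, Title)
);

-- Retrieving the name, title, and address in a single result requires a join.
SELECT pt.PublisherName, pt.Title, p.PublisherAddress
FROM dbo.publisher_titles AS pt
JOIN dbo.publishers AS p
    ON p.PublisherName = pt.PublisherName;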
Third Normal Form
A table is considered to be in third normal form if
it already conforms to the first two normal forms and if none of the
nonkey columns are dependent on any other nonkey columns. All such
attributes should be removed from the table.
Let’s look at an example that comes up often during database architecture. Suppose that an employee table has four columns: EmployeeID (the primary key), salary, bonus, and total_salary, where total_salary = salary + bonus. Existence of the total_salary column in the table violates the third normal form because a nonkey column (total_salary) is dependent on two other nonkey columns (salary and bonus). Therefore, for the table to conform to the third rule of normalization, you must remove the total_salary column from the employee table.
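A sketch of the normalized employee table, with total_salary computed at query time rather than stored, might look like this (the data types are assumptions):

-- In third normal form: total_salary is not stored, because it depends on
-- the nonkey columns salary and bonus; it is computed when queried instead.
CREATE TABLE dbo.employee (
    EmployeeID int   NOT NULL PRIMARY KEY,
    salary     money NOT NULL,
    bonus      money NOT NULL
);

SELECT EmployeeID,
       salary,
       bonus,
       salary + bonus AS total_salary
FROM dbo.employee;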
Benefits of Normalization
Following are the major advantages of normalization:
Because information is logically kept together, normalization provides improved overall understanding of the system.
Because of controlled redundancy of data, normalization can result in faster table scans and searches (because less physical data has to be processed).
Because tables are smaller with normalization, index creation and data sorts are much faster.
With less redundant data, it is easier to maintain referential integrity for the system.
Normalization
results in narrower tables. Because you can store more rows per page,
more rows can be read and cached for each I/O performed on the table.
This results in better I/O performance.
Drawbacks of Normalization
One result of normalization is that data is stored
in multiple tables. To retrieve or modify information, you usually have
to establish joins across multiple tables. Joins are expensive from an
I/O standpoint. Multitable joins can have an adverse impact on the
performance of the system. The following sections discuss some of the
denormalization techniques you can use to improve the performance of a
system.
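For example, in a fully normalized book database modeled loosely on the familiar pubs sample tables, listing each author's titles along with their publishers requires a multitable join; a denormalized design might avoid some of these joins by carrying redundant columns:

-- Hypothetical query against a normalized design: four tables must be joined
-- to list each author's titles and their publishers.
SELECT a.au_lname, t.title, p.pub_name
FROM dbo.authors AS a
JOIN dbo.titleauthor AS ta ON ta.au_id   = a.au_id
JOIN dbo.titles      AS t  ON t.title_id = ta.title_id
JOIN dbo.publishers  AS p  ON p.pub_id   = t.pub_id;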
Tip
An
adage for normalization is “Normalize ’til it hurts; denormalize ’til
it works.” To put this maxim into use, try to put your database in
third normal form initially. Then, when you’re ready to implement the
physical structure, drop back from third normal form, where excessive
table joins are hurting performance. A common mistake is that developers make too many assumptions and over-denormalize the database design before a single line of code has been written to begin assessing the database's performance.