Normalization and Normal Forms
Some of the most difficult decisions that you face
as a developer are what tables to create and what fields to place in
each table, and how to relate the tables that you create. Normalization is the process of applying a series of rules to ensure that a database achieves optimal structure. Normal forms
are a progression of these rules. Each successive normal form achieves
a better database design than the previous form. Although there are
several levels of normal forms, it is generally sufficient to apply only the first three levels of normal forms. The following sections describe the first three levels of normal forms.
First Normal Form
To achieve first normal form, all columns in a table must be atomic.
This means, for example, that you cannot store first name and last name
in the same field. The reason for this rule is that data becomes very
difficult to manipulate and retrieve if you store multiple values in a
single field. Let’s use the full name as an example. It would be
impossible to sort by first name or last name independently if you
stored both values in the same field. Furthermore, you or the user
would have to perform extra work to extract just the first name or just
the last name from the field.
Another requirement for first normal form is that
the table must not contain repeating values. An example of repeating
values is a scenario in which Item1, Quantity1, Item2, Quantity2,
Item3, and Quantity3 fields are all found within the Orders table (see Figure 1).
This design introduces several problems. What if the user wants to add
a fourth item to the order? Furthermore, finding the total ordered for
a product requires searching several columns. In fact, all numeric and
statistical calculations on the table are extremely cumbersome.
Repeating groups make it difficult to summarize and manipulate table
data. The alternative, shown in Figure 2,
achieves first normal form. Notice that each item ordered is located in
a separate row. All fields are atomic, and the table contains no
repeating groups.
Second Normal Form
For a table to achieve second normal form, all
nonkey columns must be fully dependent on all the fields that make up
the primary key. This rule only applies to tables that have a composite
key: a key made up of two or more fields. For example, the table shown
in Figure 2
has a primary key made up of the OrderID and CustomerID. The
combination of those two keys ensures that the primary key is unique
for every row in the table, which is the purpose of the primary key.
However, this table is not in second normal form
because some of the information in the table depends only on part of
the primary key. For instance, the CustInfo field depends only on the
CustomerId field: If you were to change the CustomerId field you’d also
have to change the CustInfo field. In real life, of course, there is
every possibility that the CustomerId field would be changed and the
CustInfo would not, leading to problems in the application.
To achieve second normal form, you must break this
data into two tables—an order table and a customer table. The customer
table would contain the CustomerId field from the primary key and the
fields that depend on it (the CustInfo field, in this case); the order
table would consist of the OrderId field and the fields that depend
upon it. The order table would still have the CustomerId field (so
you’d know which customer to ship the order to) but wouldn’t have any
other customer information (all other customer information would be in
the customer table). If you assign the order to another customer, you’d
only have to change the CustomerId field.
The process of breaking the data into two tables is called decomposition. Decomposition is considered to be nonloss decomposition because no data is lost during the decomposition
process. After you separate the data into two tables, you can easily
bring the data back together by joining the two tables via a query. Figure 3
shows the data separated into two tables. These two tables achieve
second normal form for two reasons. First, neither table has a
composite key—their primary keys now have only a single field (second
normal form only applies to tables with a composite primary key).
Second, the fields in each table depend on the whole of the primary key
rather than on just part of the key.
Third Normal Form
To attain third normal form, a table must meet all
the requirements for first and second normal forms, and all nonkey
columns dependent only on the primary key field and not dependent on
each other—all the fields are independent of each other. This means
that you must eliminate any calculations, and you must break out the
data into lookup tables. Lookup tables include tables such as Inventory
tables, Course tables, State tables, and any other table where we look
up a set of values from which we select the entry that we store in the
foreign key field. For example, from our Customer table, we look up
within the set of states in the state table to select the state
associated with the customer.
An example of a calculation stored in a table is the
product of price multiplied by quantity; the extended price is
dependent on two other fields in the table—the price and the quantity.
If either of those fields is changed, then the extended price would
also have to be changed. As with the example in second normal form,
there is every possibility of changing one or two of these fields
without changing the third one, leading to problems in the application.
Instead of storing the result of this calculation in the table, you
would generate the calculation in a query or in the control source of a
control on a form or a report.
The example in Figure 3
does not achieve third normal form because the description of the
inventory items is stored in the Order Details table. If the
description changes, all rows with that inventory item need to be
modified. The Order Details table, shown in Figure 4,
shows the item descriptions broken into an Inventory table. This design
achieves third normal form. We have moved the description of the
inventory items to an Inventory table, and ItemID is stored in the
Order Details table. All fields are mutually independent. You can
modify the description of an inventory item in one place.
Denormalization: Purposely Violating the Rules
Although a developer’s goal is normalization, sometimes it makes sense to deviate from normal forms. This process is called denormalization. The primary reason for applying denormalization is to enhance performance.
An example of when denormalization might be the
preferred tact could involve an open invoices table and a summarized
accounting table. It might be impractical to calculate summarized
accounting information for a customer when you need it. Instead, you
can maintain the summary calculations in a summarized accounting table
so that you can easily retrieve them as needed. Although the upside of
this scenario is improved performance, the downside is that you must
update the summary table whenever you make changes to the open
invoices. This imposes a definite trade-off between performance and
maintainability. You must decide whether the trade-off is worth it.
If you decide to denormalize, you should document
your decision. You should make sure that you make the necessary
application adjustments to ensure that you properly maintain the
denormalized fields. Finally, you need to test to ensure that the
denormalization process actually improves performance.
Integrity Rules
Although
integrity rules are not part of normal forms, they are definitely part
of the database design process. Integrity rules are broken into two
categories: overall integrity rules and database-specific integrity
rules.
Overall Integrity Rules
The two types of overall integrity rules are referential integrity rules and entity integrity rules. Referential integrity rules dictate that a database does not contain any orphan foreign key values. This means that
Child rows cannot be added for parent rows
that do not exist. In other words, an order cannot be added for a
nonexistent customer.
A primary key
value cannot be modified if the value is used as a foreign key in a
child table. This means that a CustomerID in the customers table cannot
be changed if the Orders table contains rows with that CustomerID.
A
parent row cannot be deleted if child rows have that foreign key value.
For example, a customer cannot be deleted if the customer has orders in
the Orders table.
Entity integrity dictates that the primary key value cannot be Null.
This rule applies not only to single-column primary keys, but also to
multicolumn primary keys. In fact, in a multicolumn primary key, no
field in the primary key can be Null. This makes sense because if any part of the primary key can be Null,
the primary key can no longer act as a unique identifier for the row.
Fortunately, the Jet Engine does not allow a field in a primary key to
be Null.
Database-Specific Integrity Rules
Database-specific integrity rules are not
applicable to all databases, but are, instead, dictated by business
rules that apply to a specific application. Database-specific rules are
as important as overall integrity rules. They ensure that the user
enters only valid data into a database. An example of a
database-specific integrity rule is requiring the delivery date for an
order to fall after the order date.