4. Normalization
In 1970, Dr. Edgar F. Codd published “A
Relational Model of Data for Large Shared Data Banks” and became the
father of the relational database. During the 1970s Codd wrote a series
of papers that defined the concept of database normalization. He wrote
his famous “Codd's 12 Rules” in 1985 to define what constitutes a
relational database and to defend the relational database from software
vendors who were falsely claiming to be relational. Since that time,
others have amended and refined the concept of normalization.
The primary purpose of normalization is to
improve the data integrity of the database by reducing or eliminating
modification anomalies that can occur when the same fact is stored in
multiple locations within the database. In other words, the process of
normalization attempts to reduce redundant data that causes unnecessary
updates.
Duplicate data raises all sorts of interesting
problems for inserts, updates, and deletes. For example, if the product
name is stored in the order detail table, and the product name is
edited, should every order detail row be updated? If so, is there a
mechanism to ensure that the edit to the product name propagates down
to every duplicate entry of the product name? If data is stored in
multiple locations, is it safe to read just one of those locations
without double-checking other locations? Normalization prevents these
kinds of modification anomalies.
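To make the anomaly concrete, here is a minimal SQL sketch of the normalized alternative; the Product and OrderDetail table and column names are illustrative, not taken from the text. The product name is stored exactly once, and order rows reference it by key, so an edit to the name is a single update.

```sql
-- Normalized: the product name is stored exactly once.
CREATE TABLE Product (
    ProductID   INT         NOT NULL PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL
);

CREATE TABLE OrderDetail (
    OrderID   INT NOT NULL,
    ProductID INT NOT NULL REFERENCES Product(ProductID),
    Quantity  INT NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
    -- No ProductName column here; the name is reached through ProductID.
);

-- Renaming the product touches a single row, so there is no risk of
-- stale duplicates scattered across the order detail rows.
UPDATE Product
   SET ProductName = 'Widget, Large'
 WHERE ProductID = 101;
```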
In addition to the primary goal of consistency and data integrity, there are several other good reasons to normalize an OLTP relational database:
- Performance: Duplicate data requires extra code to perform
extra writes, maintain consistency, and manipulate data into a set when
reading data. Addressing these issues is even more problematic in large, highly transactional databases.
Imagine a 2-terabyte database that averages approximately 30K
transactions per second. Moreover, assume that the database is the back
end for a large retail store. If a change were made to one product or
item, that change would need to be propagated across every table that
referenced that product. This could be tens or even hundreds of tables,
resulting in a 10–15 percent system performance degradation.
Normalization also reduces locking contention and improves multiple-user concurrency because you need fewer updates.
- Development costs: Although it may take longer to design a normalized database, a normalized database is easier to work with and reduces development costs.
- Usability: By placing columns in the correct table, it's easier to understand the database and easier to write correct queries.
- Extensibility: A non-normalized database is often more
complex and therefore more difficult to modify. This can be directly
attributed to the distribution of the redundant data across several
tables within the database.
5. The Three “Rules of One”
Normalization is well defined as a series of normal forms, each addressing a specific potential error in the design. Therefore, when designing databases, you should implement normalization design principles from the outset. This approach helps minimize design errors and produces a highly stable, well-performing database.
You should follow three rules, known as the “Rules of One,” when designing a database. The key to designing a schema that avoids update anomalies is to ensure that each single fact in real life is modeled by a single data point in the database. Three principles define a single data point:
- One group of similar things is represented by one entity (table).
- One thing is represented by one tuple (row).
- One descriptive fact about the thing is represented by one attribute (column).
Learn these three simple rules to help you design a properly normalized database.
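As a hedged illustration of the three rules, consider the following sketch; the Customer table and its columns are invented for this example and are not from the text.

```sql
-- One group of similar things (customers) is one table.
CREATE TABLE Customer (
    CustomerID INT         NOT NULL PRIMARY KEY,  -- one row per customer
    FirstName  VARCHAR(50) NOT NULL,              -- one fact per column
    LastName   VARCHAR(50) NOT NULL,
    City       VARCHAR(50) NULL
);
```

Violations of the rules look like a single table that mixes customers and vendors, a column that packs name and city into one value, or Phone1/Phone2/Phone3 columns repeating the same kind of fact.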
6. Identifying Entities
The first step in designing a conceptual database diagram is to identify the entities (tables). Because any
entity represents only one type of thing, it takes several entities
together to represent an entire process or organization.
Entities are usually discovered from several sources:
- Examining existing documents (order forms, registration forms, patient files, and reports)
- Interviews with subject-matter experts
- Diagramming the process flow
At this early stage the goal is to simply collect
a list of possible entities and their facts. Some of the entities will
be obvious nouns, such as customers, products, flights, materials, and
machines.
Other entities will be verbs: shipping,
processing, assembling parts to build a product. Verbs may be entities,
or they may indicate a relationship between two entities.
The goal is to simply collect all the
possible entities and their attributes. At this early stage, it's also
useful to document as many known relationships as possible, even if
those relationships will be edited several times.
Generalization
Normalization has a reputation for creating complex and unwieldy databases. It's true that some database
schemas are far too complex, but normalization, by itself, isn't the
root cause.
The difference between elegant databases that are
a joy to query and overly complex designs that make you want to polish
your resume is the data modeler's view of entities.
When identifying entities, there's a continuum, as illustrated in Figure 1, ranging from a broad all-inclusive view to a specific, narrow definition of the entity.
The overly simple view groups together entities that are different types of things, for example, storing machines, products, and processes in a single entity. This approach risks data integrity for two reasons. First, it's difficult to enforce referential integrity (foreign key constraints) because the primary key attempts to represent multiple types of items. Second, these designs tend to merge entities with different attributes, which means that many of the attributes (columns) won't apply to various rows and will simply be left null. Many nullable columns mean the data will probably be sparsely filled and inconsistent.
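A brief sketch of what such an over-broad entity might look like; the Item table and its columns are hypothetical and only serve to show the pattern.

```sql
-- Over-generalized: one table trying to hold machines, products, and processes.
CREATE TABLE Item (
    ItemID       INT           NOT NULL PRIMARY KEY,
    ItemType     VARCHAR(20)   NOT NULL,  -- 'Machine', 'Product', or 'Process'
    SerialNumber VARCHAR(30)   NULL,      -- applies only to machines
    ListPrice    DECIMAL(10,2) NULL,      -- applies only to products
    CycleTimeMin INT           NULL       -- applies only to processes
);
-- Any table that references ItemID cannot constrain which kind of item it
-- points at, and most rows carry nulls in the columns that don't apply.
```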
At the other extreme, the overly specific view
segments entities that could be represented by a single entity into
multiple entities, for example, splitting different types of
subassemblies and finished products into multiple different entities.
This type of design risks flexibility and usability:
- The additional tables create additional work at every layer of the software.
- Database relationships become more complex because what could have been a single relationship is now multiple relationships. For example, instead of a single relationship between an assembly process and any part, the assembly must now relate to multiple types of parts.
- The database now hard-codes the specific types of similar entities, making it difficult to add another similar type of entity.
Using the manufacturing example again, if there's an entity for every
type of subassembly, adding another type of subassembly means changes
at every level of the software.
- Writing a query that extracts the proper set of data to meet reporting requirements becomes difficult, and sometimes daunting, because of the sheer number of tables needed to fulfill the requirement.
The sweet spot in the middle generalizes, or
combines, similar entities into single entities. This approach creates
a more flexible and elegant database design that is easier to query and
extend:
- Look for entities with similar attributes, or entities that share some attributes.
- Look for types of entities that might have an additional similar entity added in the future.
- Look for entities that might be summarized together in reports.
When designing a generalized entity, two techniques are essential:
- Use a lookup entity to organize the types of entities. For the manufacturing example, a subassemblytype attribute would serve the purpose of organizing the parts by subassembly type. Typically, this would be a foreign key to a subassemblytype entity.
- Typically, the different entity types that could be generalized together do have some differences, which is why a purist view would want to segment them. Employing the supertype/subtype pattern (discussed in the “Data Design Patterns” section) solves this dilemma perfectly; a sketch follows this list.
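Here is a minimal sketch of the two techniques using the manufacturing example; every table and column name (SubassemblyType, Part, MotorPart, and so on) is hypothetical rather than taken from the text.

```sql
-- Lookup entity that organizes the generalized parts by type.
CREATE TABLE SubassemblyType (
    SubassemblyTypeID INT         NOT NULL PRIMARY KEY,
    TypeName          VARCHAR(50) NOT NULL
);

-- Supertype: one generalized entity for every kind of part.
CREATE TABLE Part (
    PartID            INT         NOT NULL PRIMARY KEY,
    PartName          VARCHAR(50) NOT NULL,
    SubassemblyTypeID INT         NOT NULL
        REFERENCES SubassemblyType(SubassemblyTypeID)
);

-- Subtype: attributes that apply only to, say, motor subassemblies.
CREATE TABLE MotorPart (
    PartID     INT           NOT NULL PRIMARY KEY REFERENCES Part(PartID),
    Horsepower DECIMAL(6, 2) NOT NULL
);
```

Adding another kind of subassembly is then a new SubassemblyType row (and, if its attributes differ, one new subtype table) rather than a change at every layer of the software.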
Although generalization sounds like
denormalization — it's not. When generalizing, it's critical that the
entities comply with all the rules of normalization.
Generalized databases tend to be data-driven,
have fewer tables, and are easier to extend. For example, an
advertising company allowed the application architect to develop the
database. As a result, writing a query that returned customer
information (first name, last name, address, phone, city, state, and so
on) required accessing more than 40 tables in one query. To mitigate the problem, the developer wrote a process that transformed and loaded the data into a database containing one-third the number of tables of the original. The same customer query could then be written against the new database using only 10 tables. For which database
would you rather write a stored procedure?
On the other hand, be careful to merge entities only when they actually do share a root meaning in the data. Don't merge unlike entities just to save programming; the result will be more complex programming.
Best Practice
Granted, knowing when to generalize
and when to segment can be an art form and requires a repertoire of
database experience, but generalization is the buffer against database
over-complexity, and consciously working at understanding generalization is the key to becoming an excellent data modeler.