4. Normalization
In 1970, Dr. Edgar F. Codd published “A
Relational Model of Data for Large Shared Data Banks” and became the
father of the relational database. During the 1970s Codd wrote a series
of papers that defined the concept of database normalization. He wrote
his famous “Codd's 12 Rules” in 1985 to define what constitutes a
relational database and to defend the relational database from software
vendors who were falsely claiming to be relational. Since that time,
others have amended and refined the concept of normalization.
The primary purpose of normalization is to
improve the data integrity of the database by reducing or eliminating
modification anomalies that can occur when the same fact is stored in
multiple locations within the database. In other words, the process of
normalization attempts to reduce redundant data that causes unnecessary
updates.
Duplicate data raises all sorts of interesting
problems for inserts, updates, and deletes. For example, if the product
name is stored in the order detail table, and the product name is
edited, should every order detail row be updated? If so, is there a
mechanism to ensure that the edit to the product name propagates down
to every duplicate entry of the product name? If data is stored in
multiple locations, is it safe to read just one of those locations
without double-checking other locations? Normalization prevents these
kinds of modification anomalies.
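To make the anomaly concrete, here is a minimal SQL sketch of the normalized alternative; the Product and OrderDetail table and column names are illustrative, not taken from the text. The product name is stored exactly once, and order rows reference it by key, so an edit to the name is a single update.

```sql
-- Normalized: the product name is stored exactly once.
CREATE TABLE Product (
    ProductID   INT         NOT NULL PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL
);

CREATE TABLE OrderDetail (
    OrderID   INT NOT NULL,
    ProductID INT NOT NULL REFERENCES Product(ProductID),
    Quantity  INT NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
    -- No ProductName column here; the name is reached through ProductID.
);

-- Renaming the product touches a single row, so there is no risk of
-- stale duplicates scattered across the order detail rows.
UPDATE Product
   SET ProductName = 'Widget, Large'
 WHERE ProductID = 101;
```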
In addition to the primary goal of consistency and data integrity, there are several other good reasons to normalize an OLTP relational database:
- Performance: Duplicate data requires extra code to perform
extra writes, maintain consistency, and manipulate data into a set when
reading data. Addressing these issues is even more problematic in large, highly transactional databases.
Imagine a 2-terabyte database that averages approximately 30K
transactions per second. Moreover, assume that the database is the back
end for a large retail store. If a change were made to one product or
item, that change would need to be propagated across every table that
referenced that product. This could be tens or even hundreds of tables,
resulting in a 10–15 percent system performance degradation.
Normalization also reduces locking contention and improves multiple-user concurrency because you need fewer updates.
- Development costs: Although it may take longer to design a normalized database, a normalized database is easier to work with and reduces development costs.
- Usability: By placing columns in the correct table, it's easier to understand the database and easier to write correct queries.
- Extensibility: A non-normalized database is often more
complex and therefore more difficult to modify. This can be directly
attributed to the distribution of the redundant data across several
tables within the database.
5. The Three “Rules of One”
Normalization is well defined as a series of normal forms, each addressing a specific potential error in the design. Therefore, when designing databases, you should implement normalization design principles from the outset. This approach helps minimize design errors and produces a highly stable, well-performing database.
You should follow three rules, known as the “Rules of One,” when designing a database. The key to designing a schema that avoids update anomalies is to ensure that each single fact in real life is modeled by a single data point in the database. Three principles define a single data point:
- One group of similar things is represented by one entity (table).
- One thing is represented by one tuple (row).
- One descriptive fact about the thing is represented by one attribute (column).
Learn these three simple rules to help you design a properly normalized database.
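As a hedged illustration of the three rules, consider the following sketch; the Customer table and its columns are invented for this example and are not from the text.

```sql
-- One group of similar things (customers) is one table.
CREATE TABLE Customer (
    CustomerID INT         NOT NULL PRIMARY KEY,  -- one row per customer
    FirstName  VARCHAR(50) NOT NULL,              -- one fact per column
    LastName   VARCHAR(50) NOT NULL,
    City       VARCHAR(50) NULL
);
```

Violations of the rules look like a single table that mixes customers and vendors, a column that packs name and city into one value, or Phone1/Phone2/Phone3 columns repeating the same kind of fact.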
6. Identifying Entities
The first step in designing a conceptual database diagram is to identify the entities (tables). Because any
entity represents only one type of thing, it takes several entities
together to represent an entire process or organization.
Entities are usually discovered from several sources:
- Examining existing documents (order forms, registration forms, patient files, and reports)
- Interviews with subject-matter experts
- Diagramming the process flow
At this early stage the goal is to simply collect
a list of possible entities and their facts. Some of the entities will
be obvious nouns, such as customers, products, flights, materials, and
machines.
Other entities will be verbs: shipping,
processing, assembling parts to build a product. Verbs may be entities,
or they may indicate a relationship between two entities.
The goal is to simply collect all the
possible entities and their attributes. At this early stage, it's also
useful to document as many known relationships as possible, even if
those relationships will be edited several times.
Generalization
Normalization has a reputation for creating complex and unwieldy databases. It's true that some database
schemas are far too complex, but normalization, by itself, isn't the
root cause.
The difference between elegant databases that are
a joy to query and overly complex designs that make you want to polish
your resume is the data modeler's view of entities.
When identifying entities, there's a continuum, as illustrated in Figure 1, ranging from a broad all-inclusive view to a specific, narrow definition of the entity.
The overly simple view groups together entities that are different types of things, for example, storing machines, products, and processes in a single entity. This approach risks data integrity for two reasons. First, it's difficult to enforce referential integrity (foreign key constraints) because the primary key attempts to represent multiple types of items. Second, these designs tend to merge entities with different attributes, which means that many of the attributes (columns) won't apply to various rows and will simply be left null. Many nullable columns mean the data will probably be sparsely filled and inconsistent.
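A brief sketch of what such an over-broad entity might look like; the Item table and its columns are hypothetical and only serve to show the pattern.

```sql
-- Over-generalized: one table trying to hold machines, products, and processes.
CREATE TABLE Item (
    ItemID       INT           NOT NULL PRIMARY KEY,
    ItemType     VARCHAR(20)   NOT NULL,  -- 'Machine', 'Product', or 'Process'
    SerialNumber VARCHAR(30)   NULL,      -- applies only to machines
    ListPrice    DECIMAL(10,2) NULL,      -- applies only to products
    CycleTimeMin INT           NULL       -- applies only to processes
);
-- Any table that references ItemID cannot constrain which kind of item it
-- points at, and most rows carry nulls in the columns that don't apply.
```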
At the other extreme, the overly specific view
segments entities that could be represented by a single entity into
multiple entities, for example, splitting different types of
subassemblies and finished products into multiple different entities.
This type of design risks flexibility and usability:
- The additional tables create additional work at every layer of the software.
- Database relationships become more complex because what could have been a single relationship is now multiple relationships. For example, instead of a single relationship between an assembly process and any part, the assembly must now relate to multiple types of parts.
- The database now hard-codes the specific types of similar entities, making it difficult to add another similar type of entity.
Using the manufacturing example again, if there's an entity for every
type of subassembly, adding another type of subassembly means changes
at every level of the software.
- Writing a query that extracts the proper set of data to meet reporting requirements becomes difficult, and sometimes daunting, because of the sheer number of tables needed to fulfill the requirement.
The sweet spot in the middle generalizes, or
combines, similar entities into single entities. This approach creates
a more flexible and elegant database design that is easier to query and
extend:
- Look for entities with similar attributes, or entities that share some attributes.
- Look for types of entities that might have an additional similar entity added in the future.
- Look for entities that might be summarized together in reports.
When designing a generalized entity, two techniques are essential:
- Use a lookup entity to organize the types of entities. For the manufacturing example, a subassemblytype attribute would serve the purpose of organizing the parts by subassembly type. Typically, this would be a foreign key to a subassemblytype entity.
- Typically, the different entity types that could be generalized together do have some differences, which is why a purist view would want to segment them. Employing the supertype/subtype pattern (discussed in the “Data Design Patterns” section) solves this dilemma perfectly; a sketch follows this list.
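Here is a minimal sketch of the two techniques using the manufacturing example; every table and column name (SubassemblyType, Part, MotorPart, and so on) is hypothetical rather than taken from the text.

```sql
-- Lookup entity that organizes the generalized parts by type.
CREATE TABLE SubassemblyType (
    SubassemblyTypeID INT         NOT NULL PRIMARY KEY,
    TypeName          VARCHAR(50) NOT NULL
);

-- Supertype: one generalized entity for every kind of part.
CREATE TABLE Part (
    PartID            INT         NOT NULL PRIMARY KEY,
    PartName          VARCHAR(50) NOT NULL,
    SubassemblyTypeID INT         NOT NULL
        REFERENCES SubassemblyType(SubassemblyTypeID)
);

-- Subtype: attributes that apply only to, say, motor subassemblies.
CREATE TABLE MotorPart (
    PartID     INT           NOT NULL PRIMARY KEY REFERENCES Part(PartID),
    Horsepower DECIMAL(6, 2) NOT NULL
);
```

Adding another kind of subassembly is then a new SubassemblyType row (and, if its attributes differ, one new subtype table) rather than a change at every layer of the software.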
Although generalization sounds like
denormalization — it's not. When generalizing, it's critical that the
entities comply with all the rules of normalization.
Generalized databases tend to be data-driven,
have fewer tables, and are easier to extend. For example, an
advertising company allowed the application architect to develop the
database. As a result, writing a query that returned customer
information (first name, last name, address, phone, city, state, and so
on) required accessing more than 40 tables in one query. To mitigate the problem, the developer wrote a process that transformed and loaded the data into a database containing one-third the number of tables of the original. The same customer query could then be written against the new database using only 10 tables. For which database
would you rather write a stored procedure?
On the other hand, be careful to merge entities only when they actually do share a root meaning in the data. Don't merge unlike entities just to save programming; the result will be more complex programming.
Best Practice
Granted, knowing when to generalize
and when to segment can be an art form and requires a repertoire of
database experience, but generalization is the buffer against database
over-complexity, and consciously working at understanding generalization is the key to becoming an excellent data modeler.