1. Defining the Cost of Downtime
The cost of downtime may appear to be an emotional, somewhat intangible measure. In this section, we will look at turning some of those intangibles into definable measures.
Often the actual cost of downtime may be
difficult to measure without an understanding of the business in
question and how the services provided by Exchange fit into the
processes deemed mission critical. We will examine a number of examples
of businesses in different vertical markets:
- A financial brokerage may email customers throughout the day about
stock evaluations. Customers may then interact with brokers, giving
instructions to buy or sell.
- A law firm may use email to receive instructions from its clients,
send and receive documents for approval, and store and file email.
While legal representatives are in court, they may call on their
colleagues to forward urgent documentation or correspondence via email
to their phones, tablets, or laptops.
- Retailers nowadays conduct much of their business online. Customers
order products online and are notified via email about the progress of
their shipments. If the products ordered are perishable, deliveries may
be time-critical.
Each of these businesses has a critical
path enabled by email and an average cost or value associated with it.
For example, if an average day's worth of transactions is $500,000, and
this amount represents 1,000 transactions, the average transaction
value is easily calculated as $500. Each financial transaction may have multiple email interactions associated with it, so there is little point in trying to calculate the average cost of a single email. If the bulk of the transactions occur over a 10-hour period, and assuming that customers' transactions are spread evenly throughout the day, then each hour's transactions may be worth $50,000.
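As a rough sketch, the arithmetic above can be expressed as follows; all of the figures are the illustrative assumptions from the example.

```python
# Sketch of the worked example above; all figures are illustrative assumptions.
daily_revenue = 500_000      # value of an average day's transactions ($)
daily_transactions = 1_000   # number of transactions in that day
business_hours = 10          # hours over which the bulk of transactions occur

average_transaction_value = daily_revenue / daily_transactions  # $500
hourly_transaction_value = daily_revenue / business_hours       # $50,000

print(f"Average transaction value: ${average_transaction_value:,.0f}")
print(f"Value of one hour of transactions: ${hourly_transaction_value:,.0f}")
```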
Continuing with our example, assuming a
two-hour outage during which email-based transactions cannot occur, the
business may face a loss of at least $100,000. Very often, however,
this represents only the tip of the iceberg in terms of financial
losses, because the damage to the business's reputation may multiply
this number in terms of transactions that will not occur in the future.
That is, customers may elect to move their business to another company
that they deem more reliable. This is particularly prevalent in the
retail, online, and premium brand segments.
If email is part of the critical transaction path, then the hourly cost of the workers who are attempting to facilitate the transaction, as well as that of in-house or third-party personnel who are endeavoring to rectify the failure, is added to the cost of the downtime. With this in mind, we are able to measure in part some of the more tangible losses attributable to downtime.
We say “in part” because reputational and confidence losses may cascade
for months or even years to come. In the following
equation, we include the percentage impact, since rarely is the outage
100 percent, with the exception of an actual disaster, of course.
lost revenue = (gross revenue/business hours) × percentage impact × hours of outage
cost of the outage = personnel costs + lost revenue + new equipment
From these equations, we are able to
compute a value. It is worth restating, however, that this value may
represent only a portion of the total attributable loss due to
confidence and reputational loss.
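As a minimal sketch, the two equations can be applied to the brokerage example; the percentage impact, personnel cost, and equipment cost below are assumed figures for illustration only.

```python
# Implements the two equations above; input values are illustrative assumptions.
def lost_revenue(gross_revenue, business_hours, percentage_impact, outage_hours):
    # lost revenue = (gross revenue / business hours) × percentage impact × hours of outage
    return (gross_revenue / business_hours) * percentage_impact * outage_hours

def cost_of_outage(personnel_costs, revenue_lost, new_equipment):
    # cost of the outage = personnel costs + lost revenue + new equipment
    return personnel_costs + revenue_lost + new_equipment

# A 2-hour outage affecting 100% of email-based transactions (assumed),
# plus assumed personnel and replacement-hardware costs.
revenue_lost = lost_revenue(500_000, 10, 1.0, 2)           # $100,000
total_cost = cost_of_outage(5_000, revenue_lost, 10_000)   # $115,000
print(f"Lost revenue: ${revenue_lost:,.0f}")
print(f"Cost of the outage: ${total_cost:,.0f}")
```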
A bank may argue that email is a
non-mission-critical system, since the equipment and software required
to process transactions during the day run on mainframes and traditional banking systems. Nonetheless, confidence loss comes into play here as well. Since email is ubiquitous, no bank is immune from the loss of confidence that may occur if its customers or partners are unable to email it for two or three days at a time. This time frame is
quite realistic when faced with even a small outage without redundancy
in place.
2. Planning for Failure
Failure of systems or components is inevitable.
However, you will be better able to plan for failure once you understand that failure is not a random event that should take you by surprise. Rather, it is a scenario for which you can carefully plan. We will
use drives as our topic for discussion of failure, because there will
be more drives in your deployment than there are servers, racks, or
cooling units. Nevertheless, the logic presented here applies equally
to all of these components.
Hardware components have a published annual failure rate (AFR),
which is simply the rate at which the component is expected to fail.
SATA drives have an approximate AFR of 5 percent, so one hundred drives assembled in a system can be expected to suffer around five failures per year. Does that mean that for every 100 drives, we need to keep 5 spares available on a shelf? Not quite.
It does, however, lead us down the road of planning for failure, as we
consider factors such as database distribution within a database
availability group (DAG) or how many servers to deploy in order to
satisfy a services dependency.
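To illustrate why "not quite," a simple model can be sketched; it assumes independent failures at a constant 5 percent annual rate (a binomial simplification, not a precise model of how drives fail in practice).

```python
# Expected annual drive failures from a published AFR, assuming independent
# failures at a constant 5% annual rate (a simplifying assumption).
from math import comb

drives = 100
afr = 0.05

expected_failures = drives * afr  # on average, 5 drives per year
print(f"Expected failures per year: {expected_failures:.1f}")

# Probability that MORE than 5 drives fail in a year (binomial model),
# i.e. the chance that keeping exactly 5 spares on the shelf is not enough.
p_more_than_5 = 1 - sum(
    comb(drives, k) * afr**k * (1 - afr)**(drives - k) for k in range(6)
)
print(f"Probability of needing more than 5 spares: {p_more_than_5:.1%}")
```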
Earlier, we defined total availability as
the product of component availability. However, you will recall that
the three components with an availability of 99.9 percent had a total
availability of only 99.7 percent. A similar mathematical model is
available when we add multiple components to a given system.
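A quick sketch confirms that earlier series figure:

```python
# Total availability of components in series is the product of their availabilities.
component_availability = 0.999             # 99.9% per component
total = component_availability ** 3        # three components in series
print(f"Total availability: {total:.4%}")  # ≈ 99.70%
```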
The logic used for calculating the probability of failure when using identical redundant components is to take the probability of failure of a single component and raise it to the power of the number of components. The probability of failure of a system
consisting of several identical redundant components is a product of
the probabilities of failure of each component. Because all
probabilities are identical, the product becomes a power:
Pₙ = P × P × . . . × P = Pⁿ
The difference between the availability
and probability of failure is that in the first case, each component
must be available in order for the entire system to be available,
whereas in the second case, each component must fail in order for the
entire system to fail. We will use probability of failure of identical
SATA drives to demonstrate this principle.
Remembering that SATA drives have an expected failure rate of 5 percent, a single drive has a probability of failure of 5 percent: P₁ = P = 5%. In other words, with only one drive in the system, the probability of the system failing is simply the probability of that drive failing.
Things become more interesting as soon as we add a second identical component into the same system. For two identical drives, we will use the notation P₂, for three identical drives P₃, and so on. Assuming up to four components in a system, you will note that the probability of failure drops considerably:
P₁ = P = 5%
P₂ = P² = 0.25%
P₃ = P³ = 0.0125%
P₄ = P⁴ = 0.000625%
As the probability of failure becomes smaller, the availability value of the system increases. Availability (A) is calculated as Aₙ = 1 – Pₙ. Using our one-drive example, this becomes A₁ = 1 – 5% = 95%. As we add multiple drives or identical components to a system, system availability increases radically:
A₁ = 1 – 5% = 95%
A₂ = 1 – 0.25% = 99.75%
A₃ = 1 – 0.0125% = 99.9875%
A₄ = 1 – 0.000625% = 99.9994%
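Both tables can be reproduced with a short sketch; the only input is the assumed 5 percent failure rate.

```python
# Probability of failure (P) and availability (A) for n identical redundant
# drives, assuming independent failures at a 5% annual failure rate.
p = 0.05
for n in range(1, 5):
    p_n = p ** n   # all n copies must fail for the system to fail
    a_n = 1 - p_n  # availability is the complement of the failure probability
    print(f"n={n}: P = {p_n:.6%}  A = {a_n:.6%}")
```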
While we have used this logic to
demonstrate failure rates with drives, we can also apply the identical
logic to calculate the probability and possible availability of
individual systems or system components such as switches, servers,
datacenters, and so forth.
When thinking about Exchange 2013,
however, the logic is even simpler. Two CAS servers per location are
better than one, and four database copies on SATA drives have a
probability of failure of 0.000625 percent and, correspondingly, an availability of 99.9994 percent.
Taking it as a given that failures are an event for which you must plan, you can look for and mitigate failure domains. Failure domains
are service interdependencies or shared components of a system that can reduce the overall availability of the system or that can have a significant impact on the overall system should a failure occur.
An obvious example of a failure domain in
a datacenter is a single power source. Should that power source fail,
then the entire datacenter fails. Similarly, we can think of power to a
rack, shared-blade chassis, non-redundant switches, and cooling systems
as other obvious examples. A not-so-obvious example might be a storage area network (SAN).
SANs tend to be designed with redundancy in mind. However, as we
observed when we calculated the probability of failure of a single
component, it makes sense to use multiple database copies. The benefit
of those multiple database copies is nonetheless negated if they are all stored on the same SAN, because the probability of failure is now identical to that of the SAN itself, rather than the small fraction it could be when the copies are placed on multiple, identical, cheap drives.
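A short sketch makes the point; the 2 percent SAN failure probability is purely an assumed figure for illustration.

```python
# Four database copies: independent cheap drives versus a single shared SAN.
# The SAN failure probability below is an assumed, illustrative figure.
drive_failure = 0.05   # per-drive annual failure probability
san_failure = 0.02     # assumed annual failure probability of the shared SAN
copies = 4

p_lost_independent = drive_failure ** copies  # all four drives must fail
p_lost_shared_san = san_failure               # one SAN failure loses every copy

print(f"Independent drives: {p_lost_independent:.6%} chance of losing all copies")
print(f"Single shared SAN:  {p_lost_shared_san:.2%} chance of losing all copies")
```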
Virtualization introduces similar failure
domains. If you elect to virtualize, then know that each host
represents a failure domain. It makes sense to distribute your Exchange
components over different hosts, as opposed to centralizing them onto a
single host.
When planning for
failure, isolation and separation are great concepts to use in your
datacenters. Isolation of components—for example, using multiple cheap
drives as opposed to shared storage—increases availability
considerably. Separation significantly reduces the number of shared
failure domains. When Exchange servers are distributed across multiple racks, servers, or even virtualization hosts, the probability of overall failure decreases dramatically as the number of service interdependencies, or possible failure domains, decreases.
Well-publicized failures of
public cloud-based services have taught us that as systems scale,
complexity and interdependencies increase and failure becomes
inevitable. Operational efficiency is a significant factor in
maintaining availability. How you respond to a failure can
significantly increase or decrease outage times.