1. Defining the Cost of Downtime
The cost of downtime may appear to be an emotional, somewhat intangible measure. In this section, we will look at turning some of those intangibles into definable measures.
Often the actual cost of downtime may be
difficult to measure without an understanding of the business in
question and how the services provided by Exchange fit into the
processes deemed mission critical. We will examine a number of examples
of businesses in different vertical markets:
- A financial brokerage may email customers throughout the day about
stock evaluations. Customers may then interact with brokers, giving
instructions to buy or sell.
- A law firm may use email to receive instructions from its clients,
send and receive documents for approval, and store and file email.
While legal representatives are in court, they may call on their
colleagues to forward urgent documentation or correspondence via email
to their phones, tablets, or laptops.
- Retailers nowadays conduct much of their business online. Customers
order products online and are notified via email about the progress of
their shipments. If the products ordered are perishable, deliveries may
be time-critical.
Each of these businesses has a critical
path enabled by email and an average cost or value associated with it.
For example, if an average day's worth of transactions is $500,000, and
this amount represents 1,000 transactions, the average transaction
value is easily calculated as $500. Each financial transaction may have multiple email interactions associated with it, so there is little point in trying to calculate the average cost of a single email. If the bulk of the transactions occur over a 10-hour period, and assuming that customers' transactions are spread evenly throughout the day, then each hour's transactions may be worth $50,000.
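As a rough sketch, the arithmetic above can be expressed as follows; all of the figures are the illustrative assumptions from the example.

```python
# Sketch of the worked example above; all figures are illustrative assumptions.
daily_revenue = 500_000      # value of an average day's transactions ($)
daily_transactions = 1_000   # number of transactions in that day
business_hours = 10          # hours over which the bulk of transactions occur

average_transaction_value = daily_revenue / daily_transactions  # $500
hourly_transaction_value = daily_revenue / business_hours       # $50,000

print(f"Average transaction value: ${average_transaction_value:,.0f}")
print(f"Value of one hour of transactions: ${hourly_transaction_value:,.0f}")
```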
Continuing with our example, assuming a
two-hour outage during which email-based transactions cannot occur, the
business may face a loss of at least $100,000. Very often, however,
this represents only the tip of the iceberg in terms of financial
losses, because the damage to the business's reputation may multiply
this number in terms of transactions that will not occur in the future.
That is, customers may elect to move their business to another company
that they deem more reliable. This is particularly prevalent in the
retail, online, and premium brand segments.
If email is part of the critical transaction path, then the hourly cost of the workers who are attempting to facilitate the transaction, as well as that of in-house or third-party personnel who are endeavoring to rectify the failure, is added to the cost of the downtime. With this in mind, we are able to measure in part some of the more tangible losses attributable to downtime.
We say “in part” because reputational and confidence losses may cascade
for months or even years to come. In the following
equation, we include the percentage impact, since rarely is the outage
100 percent, with the exception of an actual disaster, of course.
lost revenue = (gross revenue/business hours) × percentage impact × hours of outage
cost of the outage = personnel costs + lost revenue + new equipment
From these equations, we are able to
compute a value. It is worth restating, however, that this value may
represent only a portion of the total attributable loss due to
confidence and reputational loss.
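As a minimal sketch, the two equations can be applied to the brokerage example; the percentage impact, personnel cost, and equipment cost below are assumed figures for illustration only.

```python
# Implements the two equations above; input values are illustrative assumptions.
def lost_revenue(gross_revenue, business_hours, percentage_impact, outage_hours):
    # lost revenue = (gross revenue / business hours) × percentage impact × hours of outage
    return (gross_revenue / business_hours) * percentage_impact * outage_hours

def cost_of_outage(personnel_costs, revenue_lost, new_equipment):
    # cost of the outage = personnel costs + lost revenue + new equipment
    return personnel_costs + revenue_lost + new_equipment

# A 2-hour outage affecting 100% of email-based transactions (assumed),
# plus assumed personnel and replacement-hardware costs.
revenue_lost = lost_revenue(500_000, 10, 1.0, 2)           # $100,000
total_cost = cost_of_outage(5_000, revenue_lost, 10_000)   # $115,000
print(f"Lost revenue: ${revenue_lost:,.0f}")
print(f"Cost of the outage: ${total_cost:,.0f}")
```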
A bank may argue that email is a
non-mission-critical system, since the equipment and software required
to process transactions during the day run on mainframes and traditional banking systems. Nonetheless, confidence loss comes into play here as well. Since email is ubiquitous, no bank is immune from the loss of confidence that may occur if its customers or partners are unable to email it for two or three days at a time. This time frame is
quite realistic when faced with even a small outage without redundancy
in place.
2. Planning for Failure
Failure of systems or components is inevitable.
However, you will be better able to plan for failure once you understand that failure is not a random event that should take you by surprise. Rather, it is a scenario for which you can carefully plan. We will
use drives as our topic for discussion of failure, because there will
be more drives in your deployment than there are servers, racks, or
cooling units. Nevertheless, the logic presented here applies equally
to all of these components.
Hardware components have a published annual failure rate (AFR),
which is simply the rate at which the component is expected to fail.
SATA drives have an approximate AFR of 5 percent, so one hundred drives assembled in a system can be expected to suffer around five failures per year. Does that mean that for every 100 drives, we need to keep 5 spares available on a shelf? Not quite.
It does, however, lead us down the road of planning for failure, as we
consider factors such as database distribution within a database
availability group (DAG) or how many servers to deploy in order to
satisfy a services dependency.
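To illustrate why "not quite," a simple model can be sketched; it assumes independent failures at a constant 5 percent annual rate (a binomial simplification, not a precise model of how drives fail in practice).

```python
# Expected annual drive failures from a published AFR, assuming independent
# failures at a constant 5% annual rate (a simplifying assumption).
from math import comb

drives = 100
afr = 0.05

expected_failures = drives * afr  # on average, 5 drives per year
print(f"Expected failures per year: {expected_failures:.1f}")

# Probability that MORE than 5 drives fail in a year (binomial model),
# i.e. the chance that keeping exactly 5 spares on the shelf is not enough.
p_more_than_5 = 1 - sum(
    comb(drives, k) * afr**k * (1 - afr)**(drives - k) for k in range(6)
)
print(f"Probability of needing more than 5 spares: {p_more_than_5:.1%}")
```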
Earlier, we defined total availability as
the product of component availability. However, you will recall that
the three components with an availability of 99.9 percent had a total
availability of only 99.7 percent. A similar mathematical model is
available when we add multiple components to a given system.
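A quick sketch confirms that earlier series figure:

```python
# Total availability of components in series is the product of their availabilities.
component_availability = 0.999             # 99.9% per component
total = component_availability ** 3        # three components in series
print(f"Total availability: {total:.4%}")  # ≈ 99.70%
```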
The logic used for calculating the probability of failure when using identical redundant components is to take the probability of failure of a single component and raise it to the power of the number of components. The probability of failure of a system
consisting of several identical redundant components is a product of
the probabilities of failure of each component. Because all
probabilities are identical, the product becomes a power:
Pₙ = P × P × . . . × P = Pⁿ
The difference between the availability
and probability of failure is that in the first case, each component
must be available in order for the entire system to be available,
whereas in the second case, each component must fail in order for the
entire system to fail. We will use probability of failure of identical
SATA drives to demonstrate this principle.
Remembering that SATA drives have an expected failure rate of 5 percent, a single drive has a probability of failure of 5 percent: P₁ = P = 5%. In other words, with only one drive in the system, the probability of the system failing is simply the probability of that drive failing.
Things become more interesting as soon as we add a second identical component into the same system. For two identical drives, we will use the notation P₂, for three identical drives P₃, and so on. Assuming up to four components in a system, you will note that the probability of failure drops considerably:
P₁ = P = 5%
P₂ = P² = 0.25%
P₃ = P³ = 0.0125%
P₄ = P⁴ = 0.000625%
As the probability of failure becomes smaller, the availability value of the system increases. Availability (A) is calculated as Aₙ = 1 – Pₙ. Using our one-drive example, this becomes A₁ = 1 – 5% = 95%. As we add multiple drives or identical components to a system, system availability increases radically:
A₁ = 1 – 5% = 95%
A₂ = 1 – 0.25% = 99.75%
A₃ = 1 – 0.0125% = 99.9875%
A₄ = 1 – 0.000625% = 99.9994%
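Both tables can be reproduced with a short sketch; the only input is the assumed 5 percent failure rate.

```python
# Probability of failure (P) and availability (A) for n identical redundant
# drives, assuming independent failures at a 5% annual failure rate.
p = 0.05
for n in range(1, 5):
    p_n = p ** n   # all n copies must fail for the system to fail
    a_n = 1 - p_n  # availability is the complement of the failure probability
    print(f"n={n}: P = {p_n:.6%}  A = {a_n:.6%}")
```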
While we have used this logic to
demonstrate failure rates with drives, we can also apply the identical
logic to calculate the probability and possible availability of
individual systems or system components such as switches, servers,
datacenters, and so forth.
When thinking about Exchange 2013,
however, the logic is even simpler. Two CAS servers per location are
better than one, and four database copies on SATA drives have a
probability of failure of 0.000625 percent and, correspondingly, an availability of 99.9994 percent.
Taking it as a given that failures are an event for which you must plan, you can look for and mitigate failure domains. Failure domains
are service interdependencies or shared components of a system that can reduce the overall availability of the system or that can have a significant impact on the overall system should a failure occur.
An obvious example of a failure domain in
a datacenter is a single power source. Should that power source fail,
then the entire datacenter fails. Similarly, we can think of power to a
rack, shared-blade chassis, non-redundant switches, and cooling systems
as other obvious examples. A not-so-obvious example might be a storage area network (SAN).
SANs tend to be designed with redundancy in mind. However, as we
observed when we calculated the probability of failure of a single
component, it makes sense to use multiple database copies. The benefit
of those multiple database copies is nonetheless negated if they are all stored on the same SAN, because the probability of failure is now identical to that of the SAN itself, rather than the small fraction it could be when the copies are placed on multiple, identical, cheap drives.
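A short sketch makes the point; the 2 percent SAN failure probability is purely an assumed figure for illustration.

```python
# Four database copies: independent cheap drives versus a single shared SAN.
# The SAN failure probability below is an assumed, illustrative figure.
drive_failure = 0.05   # per-drive annual failure probability
san_failure = 0.02     # assumed annual failure probability of the shared SAN
copies = 4

p_lost_independent = drive_failure ** copies  # all four drives must fail
p_lost_shared_san = san_failure               # one SAN failure loses every copy

print(f"Independent drives: {p_lost_independent:.6%} chance of losing all copies")
print(f"Single shared SAN:  {p_lost_shared_san:.2%} chance of losing all copies")
```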
Virtualization introduces similar failure
domains. If you elect to virtualize, then know that each host
represents a failure domain. It makes sense to distribute your Exchange
components over different hosts, as opposed to centralizing them onto a
single host.
When planning for
failure, isolation and separation are great concepts to use in your
datacenters. Isolation of components—for example, using multiple cheap
drives as opposed to shared storage—increases availability
considerably. Separation significantly reduces the number of shared
failure domains. When Exchange servers are distributed across multiple racks, servers, or even virtualization hosts, the probability of overall failure decreases dramatically as the number of service interdependencies, or possible failure domains, decreases.
Well-publicized failures of
public cloud-based services have taught us that as systems scale,
complexity and interdependencies increase and failure becomes
inevitable. Operational efficiency is a significant factor in
maintaining availability. How you respond to a failure can
significantly increase or decrease outage times.