Raising the availability of any system has a direct cost
implication. Exchange is in a league of its own in terms of
interdependency with other systems.
Following are a number of factors to consider that will dramatically
influence both cost and complexity when planning for high availability:
Process Existing processes for a
decentralized, non-highly available system will not suffice when
considering a move to a highly centralized or a higher level of
availability. Increased hardware cost is only one of the factors to
consider with high availability. Another is the cost of a larger number
of processes, often interdependent with each other.
User Locations Users may be centralized in
one well-connected campus or located in many different geographies
across the globe, all with varying connectivity and power, which may or
may not be guaranteed.
Servers Server construction may or may not
have its own availability factors to consider. Memory bank redundancy,
power supply redundancy, processor redundancy, and backplane failure in
the case of blade servers are some of the factors that influence total
server availability.
Network End users may have varying amounts
of bandwidth available to access email as well as different network
topologies, which themselves may present points of failure. Do the
datacenters hosting the Exchange servers have multiple connection
points to the Internet as well as redundant routes to the rest of the
network? Are the routing and the switch fabric redundant? Are firewalls
and reverse proxy/hygiene solutions redundant?
Power to the Datacenter Power availability
is often taken for granted. Events have shown, however, that power may
be compromised during extreme weather or may not be guaranteed in some
parts of the world at all.
Power to the Racks Is the power to the
racks themselves wired so that loss of any one power source or power
distribution point within a rack does not affect the entire contents of
the rack?
Cooling Cooling is another critical
measure of datacenter availability. Do the datacenters housing your
servers have redundant cooling available in the event of an outage?
Cloud Solutions You
may rely on an external cloud solution for some of the availability of
your infrastructure. Does your vendor have a published availability
strategy, and does that strategy map to your desired availability goals
in a compatible manner?
Virtualization While virtualization
carries with it the promise of on-demand capacity and higher levels of
availability, Exchange may not fit into your current virtualization
strategy. You may be increasing risk and lowering availability by
virtualizing Exchange, as opposed to deploying on physical hardware.
Capacity All of the points mentioned thus
far have a measure of available capacity that may be overwhelmed or
compromised during an outage or a denial-of-service attack.
Single Points of Failure When increasing
availability, redundancy of components is a given. However, one not so
obvious factor is a single point of failure or, as you learned earlier,
a failure domain. A failure domain could include an individual server,
power supply, the network, the rack itself, or any datacenter
component, including the datacenter itself.
This is not meant as an exhaustive list of
all possible factors. Your own analysis of your environment may yield a
number of other factors that may be relevant.
Once you identify the factors influencing
availability, you can evaluate each in an attempt to mitigate them. For
example, let's say your analysis has shown that cooling and the
centralization of all IT into a single datacenter represent a single
point of failure. The business will remedy the lack of cooling
redundancy, but it will not build or rent another datacenter. You may
want to capture this and other potential factors as demonstrated in Table 1.
TABLE 1: Availability factors and mitigation
When
calculating availability (remember that total availability is a product
of all the availability factors), the biggest factor influencing total
availability is the component(s) in the entire chain that is most
likely to fail. When calculating total availability, a single machine is not very
redundant, and it is able to drop the total availability of a single
factor, such as networking or mail flow, quite significantly.