We've already talked about disaster
recovery and how it can be confused with general data protection
(backup and recovery) and business continuity. Perhaps an even more
common confusion, though, is the distinction between high availability
and disaster recovery.
High availability (HA) is a design strategy. The
strategy is simple: try to ensure that users keep access to services,
such as their Exchange mailboxes or Unified Messaging servers, during
periods of outage or downtime. These outages could be the result of any
sort of event:
Hardware failure, such as the loss of a power supply, a memory module, or the server motherboard
Storage failure, such as the loss of a disk, disk controller, or data-level corruption
Network failure, such as the cutting of a network cable or a router or a switch losing configuration
Some other service failure, such as the loss of an Active Directory domain controller or a DNS server
HA technologies and strategies are designed to allow
a given service to continue to be available to users (or other
services) in the event of these kinds of failures. No matter which
technology is involved, there are two main approaches, one or both of
which are used by each HA technology and strategy:
Fault Tolerance and Redundancy
This involves placing resources into a pool so
that one member can take up the load when another member of the pool fails.
This strategy removes any single point of failure. Fault
tolerance needs to be accompanied by some mechanism for selecting which
of the redundant resources is to be used. These mechanisms are typically either round-robin or load balancing.
In the former, each resource in the pool is used in turn, regardless of
its current state or load. In the latter, additional mechanisms are
used to direct users to the least loaded member of the resource pool.
Many higher-end hardware systems use redundant parts to make the
overall server system more resilient to many common types of hardware
failures. Exchange Server 2010 can use database availability groups
(DAGs) to replicate copies of data from one Mailbox server to another
and to provide failover in the event that the node hosting the data
fails.
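To make the two selection mechanisms concrete, here is a minimal PowerShell sketch, not tied to any Exchange component, that picks a member from a resource pool either in turn (round-robin) or by current load (load balancing). The pool members and load figures are invented for illustration.

    # A pool of redundant resources with invented current-load figures.
    $pool = @(
        @{ Name = 'NODE1'; Load = 12 },
        @{ Name = 'NODE2'; Load = 48 },
        @{ Name = 'NODE3'; Load = 31 }
    )

    # Round-robin: use each member in turn, regardless of state or load.
    $script:next = 0
    function Select-RoundRobin {
        $member = $pool[$script:next % $pool.Count]
        $script:next++
        $member.Name
    }

    # Load balancing: direct the request to the least-loaded member.
    function Select-LeastLoaded {
        ($pool | Sort-Object { $_.Load } | Select-Object -First 1).Name
    }

    Select-RoundRobin   # NODE1, then NODE2, NODE3, NODE1, ...
    Select-LeastLoaded  # NODE1 (lowest current load)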
Replication
This process involves making copies of critical
data between multiple members of the resource pool. If replication
happens quickly enough and with a small enough time interval, when one
member of the resource pool becomes unavailable, another member can take
over the load. Most replication strategies, including Exchange's
database replication features, are based on a single-master
strategy, where all updates happen to the master (or active) copy and
are replicated to the additional copies. Some technologies, such as
Active Directory, are designed to allow multimaster replication, where updates can be directed to the closest member.
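As a toy illustration of the single-master model, the following PowerShell sketch applies every update to the active copy first and then ships it to each passive copy. It is purely conceptual and is not how Exchange implements its replication internally.

    # Toy single-master replication: all writes go to the active copy,
    # then are shipped to every passive copy in the pool.
    $active   = [System.Collections.ArrayList]@()
    $passives = @{
        'COPY2' = [System.Collections.ArrayList]@()
        'COPY3' = [System.Collections.ArrayList]@()
    }

    function Write-Update([string]$change) {
        [void]$active.Add($change)          # update the master (active) copy
        foreach ($copy in $passives.Values) {
            [void]$copy.Add($change)        # replicate to the passive copies
        }
    }

    Write-Update 'deliver message 0001 to mailbox A'
    $passives['COPY2']                      # same contents as the active copy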
To achieve complete availability with Exchange,
you'll use both strategies. However, you also need to think of the
different levels of availability that you'll need to ensure.
It is not uncommon to find that the availability of a
system is measured differently depending on the organization.
Typically, to report the percentage of availability, you take the
total amount of time in a measurement period, subtract the total
downtime during that period, and then divide that number by the
total elapsed time.
So, let's say that during a 30-day measurement period there was no scheduled
downtime, but there was a 4-hour window when patches were
applied to the system. Four hours is roughly 0.17 days, so 30 days – 0.17 days = 29.83 days of total
uptime, and 29.83/30 ≈ 99.4 percent availability.
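If you'd rather check that arithmetic in hours, here is a quick PowerShell calculation of the same scenario; the figures are simply the ones from the example above.

    # Availability = (total time - downtime) / total time
    $totalHours    = 30 * 24    # a 30-day measurement period, in hours
    $downtimeHours = 4          # the patching window
    $availability  = ($totalHours - $downtimeHours) / $totalHours
    '{0:P2}' -f $availability   # displays roughly 99.44 percent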
This is just a sample calculation, of course. In the
real world, you would probably have a maintenance window during your
operations that would not count against your availability numbers. You
want to do your very best to minimize the amount of unplanned downtime,
but you also have to take into consideration scheduled maintenance and
planned downtime.
1. Service Availability
When we have discussions with people about high
availability in Exchange organizations, we find that the level of high
availability most of them are actually thinking about is service availability.
That is, they think of the Exchange deployment as an overall service
and think of how to ensure that users can get access to the whole
shebang (either that, or they think solely of hardware clusters,
storage replication, and the other low-level technologies). Keep in
mind that when people discuss service availability, the term may mean
different things to different people.
Service availability is an important consideration
for your overall availability strategy. It doesn't make a lot of sense
to plan for redundant server hardware if you forget to deploy
sufficient numbers of those servers with the right Exchange roles in
the appropriate locations.
The other aspect of service availability is thinking about what other services Exchange depends on:
The obvious dependency is Active Directory.
Each Exchange server requires access to a domain controller as well as
global catalog services. The more Exchange servers in the site, the
more of each Active Directory role that site requires. If your domain
controllers are also DNS servers, you need enough DNS servers to
survive the loss of one or two. If Exchange loses access to the DNS
servers or domain controllers in a site, it will fail. (A quick way to
see which directory servers Exchange has found is shown in the sketch
after this list.)
What
type of network services do you need? Do you assign static IP addresses
and default gateways or do you use DHCP and dynamic routing? Do you
have extra router or switching capacity? What about your firewall
configurations — do you have only a single firewall between different
network zones, or are those redundant as well?
What
other applications do you deploy as part of your Exchange deployment?
Do you rely on a monitoring system such as Microsoft System Center
Operations Manager? What happens if something takes down your
monitoring server? Is there a redundant or backup system that takes
over, or will additional faults and failures go unnoticed and be
allowed to take down the Exchange system? Do you have enough backup
agents and servers to protect your mailbox servers?
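Because Active Directory is the dependency you'll trip over most often, it's worth knowing how to see which directory servers Exchange has actually discovered. Here is a minimal Exchange Management Shell sketch; EX01 is a placeholder for one of your own server names.

    # Show which domain controllers and global catalogs this server has
    # discovered; the -Status switch populates these fields.
    Get-ExchangeServer -Identity EX01 -Status |
        Format-List Name, CurrentDomainControllers, CurrentGlobalCatalogs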
Service availability typically requires a
combination of redundancy and replication strategies. For example, you
deploy multiple Active Directory domain controllers in a site for
redundancy, but they replicate the directory data between each other.
2. Network Availability
The next layer we want to talk about is network
availability. By this, we don't mean the types of network services we
mentioned in the previous section. Instead, we mean the ability
to ensure that you can receive new connections from clients and other
servers, whether those are other Exchange servers, PBX systems
and telephony gateways, or external mail servers. Network availability
is a key part of your Exchange infrastructure and should therefore be
considered part of your overall service availability.
The typical strategy for network availability is
load balancing, which is network-level redundancy. Simple network load
balancers use a round-robin mechanism to distribute incoming
connections alternately and evenly (by connection count) across the
members of the resource pool. Other solutions use more sophisticated
mechanisms, such as monitoring each member of the pool for overall load
and assigning incoming connections to the least-loaded member.
For larger organizations and complex Exchange
deployments, it's common to use hardware load balancers. Hardware
systems are typically more expensive and represent yet more systems to
manage and maintain, so they add a degree of complexity that is often
undesirable for smaller organizations, which instead prefer
software-based load-balancing solutions like Windows Network Load
Balancing (WNLB).
Unfortunately, WNLB isn't generally suitable for
Exchange 2010 deployments. This is the official recommendation of both
the Exchange product group and the Windows product group, the folks who
develop the WNLB component. WNLB has a few characteristics that render
it unsuitable for use with Exchange in all but the smallest deployments
or test environments:
WNLB simply performs round-robin balancing
of incoming connections. It doesn't detect whether members of the
load-balance cluster are down, so it will keep sending connections to
the downed member. This could result in intermittent and confusing
behavior for clients and loss or delay of messages from external
systems.
WNLB is incompatible with the
Windows Failover Clustering components. This means that a small shop
can't deploy a pair of servers holding the Mailbox, Client Access, and
Hub Transport roles, use WNLB to balance the Client Access and Hub
Transport traffic, and use continuous replication to replicate the
mailbox databases, all at the same time. It would have to deploy four
servers at a minimum.
Even when using hardware network load-balancing, there are a number of things to remember and best practices to follow.
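One widely used practice in Exchange 2010, for example, is to define a Client Access array behind the load balancer and point your mailbox databases at the array's FQDN rather than at any individual server. The following Exchange Management Shell sketch assumes placeholder names (outlook.contoso.com, the default site name, and DB01).

    # Create a Client Access array for the AD site, published at the
    # load-balanced FQDN (the virtual IP on your load balancer).
    New-ClientAccessArray -Name "outlook.contoso.com" `
        -Fqdn "outlook.contoso.com" -Site "Default-First-Site-Name"

    # Point a mailbox database at the array so Outlook clients connect
    # through the load balancer instead of a single Client Access server.
    Set-MailboxDatabase -Identity "DB01" `
        -RpcClientAccessServer "outlook.contoso.com"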
3. Data Availability
We've seen many Exchange organization designs and
deployment plans. Most of them spend a lot of time ensuring that the
mailbox data will be available.
In versions of Exchange prior to Exchange 2007, data
availability meant using failover clustering. Failover clustering used
a feature of the Enterprise Edition of Windows, the Windows Cluster
Service (now called Windows Failover Clustering in Windows Server
2008), to create groups of servers that shared a single storage source.
Within such a cluster, one or more clustered Exchange server instances
were activated and controlled the corresponding mailbox databases. When
an underlying hardware node failed, the active virtual server instance
failed over to another node.
Failover clustering is a common HA strategy, and
Windows clustering is a proven technology, so this turned out to be a
good approach for many Exchange organizations. However, failover
clustering has some drawbacks. For clusters that rely on a shared
quorum, the biggest is the reliance on shared storage, typically a
storage area network. Shared storage increases the cost and complexity
of the clustering solution, yet it doesn't guard against the most
common cause of Exchange outage: data corruption.
Exchange Server 2007 used failover clustering to
implement the Single Copy Cluster (SCC) feature, but it also introduced
a new data availability solution, called continuous replication,
to help overcome some of the weaknesses associated with failover
clustering and allow more organizations to take advantage of highly
available deployments. Continuous replication, also known as log
shipping, copies the transaction logs for a mailbox
database from one Mailbox server to another. The target then replays
the logs into its own separate copy of the database, re-creating the
latest changes.
Exchange 2007 offered three types of continuous replication:
Local Continuous Replication (LCR)
Protects a server from local data corruption and
disk failure by creating a second copy of mailbox databases on separate
disks. Because both copies are on the same server, LCR doesn't protect
against server or site failure, and activation is manual, making it
less than ideal for availability designs.
Clustered Continuous Replication (CCR)
Protects against server failure (and site
failure if the CCR cluster is stretched across sites) by using log
shipping and Windows failover clustering components to copy mailbox
databases to a second server known as the passive node.
Standby Continuous Replication (SCR)
Protects from site failure by allowing one or
more copies of mailbox databases to be created. One target can host
replicated copies from multiple servers, making SCR ideal for disaster
recovery strategies. SCR isn't really an availability option because it
requires not only manual activation, but also the activation of
dependent services.
Exchange 2010 makes some sweeping changes in the
data availability offerings. First, the bad news is the SCC feature is
gone. Yes, that's correct; Exchange 2010 no longer supports SCC
clusters. The LCR feature was discontinued as well. These features have
been replaced with a solution that provides significantly better
service and data availability.
What Microsoft has done instead is combine CCR and
SCR into a single continuous replication offering. You now join servers
into a database availability group (DAG);
members of that group can replicate one or more of their mailbox
databases to the other servers in the group. Each database can be
replicated independently of the others and can have one or more
replicas. A DAG can cross Active Directory site boundaries, thus
providing site resiliency. And activation of a passive copy is
automatic, avoiding some of the pitfalls of the Exchange 2007 SCR
solution.
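To make that concrete, here is a hedged Exchange Management Shell sketch that builds a two-member DAG and adds a second copy of a database; DAG1, EX01, EX02, HUB01, and DB01 are all placeholder names for your own environment.

    # Create the DAG, with a witness server for quorum.
    New-DatabaseAvailabilityGroup -Name "DAG1" `
        -WitnessServer "HUB01" -WitnessDirectory "C:\DAG1"

    # Join two Mailbox servers to the group.
    Add-DatabaseAvailabilityGroupServer -Identity "DAG1" -MailboxServer "EX01"
    Add-DatabaseAvailabilityGroupServer -Identity "DAG1" -MailboxServer "EX02"

    # Replicate a database that is active on EX01 over to EX02.
    Add-MailboxDatabaseCopy -Identity "DB01" -MailboxServer "EX02" `
        -ActivationPreference 2

    # Check the health and copy queue of every copy of the database.
    Get-MailboxDatabaseCopyStatus -Identity "DB01"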
We'll provide a quick comparison between the typical
Exchange HA deployment and DR deployment. If you think that by having
disaster recovery you have availability, or vice versa, think again.
In an HA Exchange environment, the focus is usually
on keeping user mailboxes online, mail flowing to and from external
systems, and Exchange services running. In a DR environment, the
focus is usually on restoring a bare minimum of services, often for a smaller portion of the overall user population. In short, the difference is that of abundance vs. triage.
For Exchange, an HA design can provide several
advantages beyond the obvious availability goals. A highly available
Exchange environment often enables server consolidation; the same
technologies that permit mailbox data to be replicated between servers
or keep multiple instances of key Exchange services running also permit
greater user mailbox density, or force the upgrading of key
infrastructure (like network bandwidth) so that a greater number of
users can be handled. This increased density can make proper DR
planning more difficult by increasing the requirements for a DR
solution and making it harder to identify and target the appropriate
user populations.
That's not to say that HA and DR are
incompatible. Far from it; you can and should design your Exchange 2010
deployment for both. To do that effectively, though, you need to have a
clear understanding of what each technology and feature actually
provides you, so you can avoid design errors. For example, if you have
separate groups of users who will need their mailboxes replicated to a
DR site, set them aside in separate mailbox databases, rather than
mingling them in with users whose mailboxes won't be replicated.
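As a hedged illustration of that advice, the following Exchange Management Shell sketch creates a dedicated database for the users who need DR protection, replicates only that database to a DAG member in the DR site, and moves one such mailbox into it; DR-DB01, EX01, DREX01, and kim.akers are placeholder names.

    # Create a dedicated database for mailboxes that must survive a site loss.
    New-MailboxDatabase -Name "DR-DB01" -Server "EX01"

    # Mount the new database before using it.
    Mount-Database -Identity "DR-DB01"

    # Replicate only this database to the DAG member in the DR site.
    Add-MailboxDatabaseCopy -Identity "DR-DB01" -MailboxServer "DREX01"

    # Move an affected user into the replicated database (an online move
    # request in Exchange 2010).
    New-MoveRequest -Identity "kim.akers" -TargetDatabase "DR-DB01"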