We've already talked about disaster
recovery and how it can be confused with general data protection
(backup and recovery) and business continuity. Perhaps an even more
common confusion, though, is the distinction between high availability
and disaster recovery.
High availability (HA) is a design strategy. The
strategy is simple: try to ensure that users keep access to services,
such as their Exchange mailboxes or Unified Messaging servers, during
periods of outage or downtime. These outages could be the result of any
sort of event:
Hardware failure, such as the loss of a power supply, a memory module, or the server motherboard
Storage failure, such as the loss of a disk, disk controller, or data-level corruption
Network failure, such as the cutting of a network cable or a router or a switch losing configuration
Some other service failure, such as the loss of an Active Directory domain controller or a DNS server
HA technologies and strategies are designed to allow
a given service to continue to be available to users (or other
services) in the event of these kinds of failures. No matter which
technology is involved, there are two main approaches, one or both of
which are used by each HA technology and strategy:
Fault Tolerance and Redundancy
This involves placing resources into a pool so
that one member can take up the load when another member of the pool fails.
This strategy removes any single point of failure. Fault
tolerance needs to be accompanied by some mechanism for selecting which
of the redundant resources is to be used. These mechanisms are typically either round-robin or load balancing.
In the former, each resource in the pool is used in turn, regardless of
its current state or load. In the latter, additional mechanisms are
used to direct users to the least loaded member of the resource pool.
Many higher-end hardware systems use redundant parts to make the
overall server system more resilient to many common types of hardware
failures. Exchange Server 2010 can use database availability groups
(DAGs) to replicate copies of data from one Mailbox server to another
and to provide failover in the event that the node hosting the data
fails.
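To make the two selection mechanisms concrete, here is a minimal PowerShell sketch, not tied to any Exchange component, that picks a member from a resource pool either in turn (round-robin) or by current load (load balancing). The pool members and load figures are invented for illustration.

    # A pool of redundant resources with invented current-load figures.
    $pool = @(
        @{ Name = 'NODE1'; Load = 12 },
        @{ Name = 'NODE2'; Load = 48 },
        @{ Name = 'NODE3'; Load = 31 }
    )

    # Round-robin: use each member in turn, regardless of state or load.
    $script:next = 0
    function Select-RoundRobin {
        $member = $pool[$script:next % $pool.Count]
        $script:next++
        $member.Name
    }

    # Load balancing: direct the request to the least-loaded member.
    function Select-LeastLoaded {
        ($pool | Sort-Object { $_.Load } | Select-Object -First 1).Name
    }

    Select-RoundRobin   # NODE1, then NODE2, NODE3, NODE1, ...
    Select-LeastLoaded  # NODE1 (lowest current load)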
Replication
This process involves making copies of critical
data between multiple members of the resource pool. If replication
happens quickly enough and with a small enough time interval, when one
member of the resource pool becomes unavailable, another member can take
over the load. Most replication strategies, including Exchange's
database replication features, are based on a single-master
strategy, where all updates happen to the master (or active) copy and
are replicated to the additional copies. Some technologies, such as
Active Directory, are designed to allow multimaster replication, where updates can be directed to the closest member.
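As a toy illustration of the single-master model, the following PowerShell sketch applies every update to the active copy first and then ships it to each passive copy. It is purely conceptual and is not how Exchange implements its replication internally.

    # Toy single-master replication: all writes go to the active copy,
    # then are shipped to every passive copy in the pool.
    $active   = [System.Collections.ArrayList]@()
    $passives = @{
        'COPY2' = [System.Collections.ArrayList]@()
        'COPY3' = [System.Collections.ArrayList]@()
    }

    function Write-Update([string]$change) {
        [void]$active.Add($change)          # update the master (active) copy
        foreach ($copy in $passives.Values) {
            [void]$copy.Add($change)        # replicate to the passive copies
        }
    }

    Write-Update 'deliver message 0001 to mailbox A'
    $passives['COPY2']                      # same contents as the active copy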
To achieve complete availability with Exchange,
you'll use both strategies. However, you also need to think of the
different levels of availability that you'll need to ensure.
It is not uncommon to find that the availability of a
system is measured differently depending on the organization.
Typically, to report the percentage of availability, you take the
total amount of time in a measurement period, subtract the total
downtime during that period, and then divide that number by the
total elapsed time.
So, let's say that during a 30-day measurement period there was no scheduled
downtime, but there was a 4-hour window when patches were
applied to the system. Four hours is roughly 0.17 days, so 30 days – 0.17 days = 29.83 days of total
uptime, and 29.83/30 ≈ 99.4 percent availability.
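If you'd rather check that arithmetic in hours, here is a quick PowerShell calculation of the same scenario; the figures are simply the ones from the example above.

    # Availability = (total time - downtime) / total time
    $totalHours    = 30 * 24    # a 30-day measurement period, in hours
    $downtimeHours = 4          # the patching window
    $availability  = ($totalHours - $downtimeHours) / $totalHours
    '{0:P2}' -f $availability   # displays roughly 99.44 percent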
This is just a sample calculation, of course. In the
real world, you would probably have a maintenance window during your
operations that would not count against your availability numbers. You
want to do your very best to minimize the amount of unplanned downtime,
but you also have to take into consideration scheduled maintenance and
planned downtime.
1. Service Availability
When we have discussions with people about high
availability in Exchange organizations, we find that the level of high
availability most of them are actually thinking about is service availability.
That is, they think of the Exchange deployment as an overall service
and think of how to ensure that users can get access to the whole
shebang (either that, or they think solely of hardware clusters,
storage replication, and the other low-level technologies). Keep in
mind that when people discuss service availability, the term may mean
different things to different people.
Service availability is an important consideration
for your overall availability strategy. It doesn't make a lot of sense
to plan for redundant server hardware if you forget to deploy
sufficient numbers of those servers with the right Exchange roles in
the appropriate locations.
The other aspect of service availability is thinking about what other services Exchange depends on:
The obvious dependency is Active Directory.
Each Exchange server requires access to a domain controller as well as
global catalog services. The more Exchange servers in the site, the
more of each Active Directory role that site requires. If your domain
controllers are also DNS servers, you need enough DNS servers to
survive the loss of one or two. If Exchange loses access to the DNS
servers or domain controllers in a site, it will fail. (A quick way to
see which directory servers Exchange has found is shown in the sketch
after this list.)
What
type of network services do you need? Do you assign static IP addresses
and default gateways or do you use DHCP and dynamic routing? Do you
have extra router or switching capacity? What about your firewall
configurations — do you have only a single firewall between different
network zones, or are those redundant as well?
What
other applications do you deploy as part of your Exchange deployment?
Do you rely on a monitoring system such as Microsoft System Center
Operations Manager? What happens if something takes down your
monitoring server? Is there a redundant or backup system that takes
over, or will additional faults and failures go unnoticed and be
allowed to take down the Exchange system? Do you have enough backup
agents and servers to protect your mailbox servers?
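Because Active Directory is the dependency you'll trip over most often, it's worth knowing how to see which directory servers Exchange has actually discovered. Here is a minimal Exchange Management Shell sketch; EX01 is a placeholder for one of your own server names.

    # Show which domain controllers and global catalogs this server has
    # discovered; the -Status switch populates these fields.
    Get-ExchangeServer -Identity EX01 -Status |
        Format-List Name, CurrentDomainControllers, CurrentGlobalCatalogs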
Service availability typically requires a
combination of redundancy and replication strategies. For example, you
deploy multiple Active Directory domain controllers in a site for
redundancy, but they replicate the directory data between each other.
2. Network Availability
The next layer we want to talk about is network
availability. By this, we don't mean the types of network services we
mentioned in the previous section. Instead, we mean the ability
to ensure that you can receive new connections from clients and other
servers, whether those are other Exchange servers, PBX systems
and telephony gateways, or external mail servers. Network availability
is a key part of your Exchange infrastructure and should therefore be
considered part of your overall service availability.
The typical strategy for network availability is
load balancing, which is network-level redundancy. Simple network load
balancers use a round-robin mechanism to distribute incoming
connections alternately and evenly (by connection count) across the
members of the resource pool. Other solutions use more sophisticated
mechanisms, such as monitoring each member of the pool for overall load
and assigning incoming connections to the least-loaded member.
For larger organizations and complex Exchange
deployments, it's common to use hardware load balancers. Hardware
systems are typically more expensive and represent yet more systems to
manage and maintain, so they add a degree of complexity that is often
undesirable for smaller organizations, which instead prefer
software-based load-balancing solutions like Windows Network Load
Balancing (WNLB).
Unfortunately, WNLB isn't generally suitable for
Exchange 2010 deployments. This is the official recommendation of both
the Exchange product group and the Windows product group, the folks who
develop the WNLB component. WNLB has a few characteristics that render
it unsuitable for use with Exchange in all but the smallest deployments
or test environments:
WNLB simply performs round-robin balancing
of incoming connections. It doesn't detect whether members of the
load-balance cluster are down, so it will keep sending connections to
the downed member. This could result in intermittent and confusing
behavior for clients and loss or delay of messages from external
systems.
WNLB is incompatible with the
Windows Failover Clustering components. This means that a small shop
can't deploy a pair of servers holding the Mailbox, Client Access, and
Hub Transport roles, use WNLB to balance the Client Access and Hub
Transport traffic, and use continuous replication to replicate the
mailbox databases, all at the same time. It would have to deploy four
servers at a minimum.
Even when using hardware network load-balancing, there are a number of things to remember and best practices to follow.
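One widely used practice in Exchange 2010, for example, is to define a Client Access array behind the load balancer and point your mailbox databases at the array's FQDN rather than at any individual server. The following Exchange Management Shell sketch assumes placeholder names (outlook.contoso.com, the default site name, and DB01).

    # Create a Client Access array for the AD site, published at the
    # load-balanced FQDN (the virtual IP on your load balancer).
    New-ClientAccessArray -Name "outlook.contoso.com" `
        -Fqdn "outlook.contoso.com" -Site "Default-First-Site-Name"

    # Point a mailbox database at the array so Outlook clients connect
    # through the load balancer instead of a single Client Access server.
    Set-MailboxDatabase -Identity "DB01" `
        -RpcClientAccessServer "outlook.contoso.com"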
3. Data Availability
We've seen many Exchange organization designs and
deployment plans. Most of them spend a lot of time ensuring that the
mailbox data will be available.
In versions of Exchange prior to Exchange 2007, data
availability meant using failover clustering. Failover clustering used
a feature of the Enterprise Edition of Windows, the Windows Cluster
Service (now called Windows Failover Clustering in Windows Server
2008), to create groups of servers that shared a single storage source.
Within such a cluster, one or more clustered Exchange server instances
were activated and controlled the corresponding mailbox databases. When
an underlying hardware node failed, the active virtual server instance
failed over to another node.
Failover clustering is a common HA strategy, and
Windows clustering is a proven technology, so this turned out to be a
good approach for many Exchange organizations. However, failover
clustering has some drawbacks. For clusters that rely on a shared
quorum, the biggest is the reliance on shared storage, typically a
storage area network. Shared storage increases the cost and complexity
of the clustering solution, yet it doesn't guard against the most
common cause of Exchange outage: data corruption.
Exchange Server 2007 used failover clustering to
implement the Single Copy Cluster (SCC) feature, but it also introduced
a new data availability solution, called continuous replication,
to help overcome some of the weaknesses associated with failover
clustering and allow more organizations to take advantage of highly
available deployments. Continuous replication, also known as log
shipping, copies the transaction logs for a mailbox
database from one Mailbox server to another. The target then replays
the logs into its own separate copy of the database, re-creating the
latest changes.
Exchange 2007 offered three types of continuous replication:
Local Continuous Replication (LCR)
Protects a server from local data corruption and
disk failure by creating a second copy of mailbox databases on separate
disks. Because both copies are on the same server, LCR doesn't protect
against server or site failure, and activation is manual, making it
less than ideal for availability designs.
Clustered Continuous Replication (CCR)
Protects against server failure (and site
failure if the CCR cluster is stretched across sites) by using log
shipping and Windows failover clustering components to copy mailbox
databases to a second server known as the passive node.
Standby Continuous Replication (SCR)
Protects from site failure by allowing one or
more copies of mailbox databases to be created. One target can host
replicated copies from multiple servers, making SCR ideal for disaster
recovery strategies. SCR isn't really an availability option because it
requires not only manual activation, but also the activation of
dependent services.
Exchange 2010 makes some sweeping changes in the
data availability offerings. First, the bad news is the SCC feature is
gone. Yes, that's correct; Exchange 2010 no longer supports SCC
clusters. The LCR feature was discontinued as well. These features have
been replaced with a solution that provides significantly better
service and data availability.
What Microsoft has done instead is combine CCR and
SCR into a single continuous replication offering. You now join servers
into a database availability group (DAG);
members of that group can replicate one or more of their mailbox
databases to the other servers in the group. Each database can be
replicated independently of the others and can have one or more
replicas. A DAG can cross Active Directory site boundaries, thus
providing site resiliency. And activation of a passive copy is
automatic, avoiding some of the pitfalls of the Exchange 2007 SCR
solution.
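To make that concrete, here is a hedged Exchange Management Shell sketch that builds a two-member DAG and adds a second copy of a database; DAG1, EX01, EX02, HUB01, and DB01 are all placeholder names for your own environment.

    # Create the DAG, with a witness server for quorum.
    New-DatabaseAvailabilityGroup -Name "DAG1" `
        -WitnessServer "HUB01" -WitnessDirectory "C:\DAG1"

    # Join two Mailbox servers to the group.
    Add-DatabaseAvailabilityGroupServer -Identity "DAG1" -MailboxServer "EX01"
    Add-DatabaseAvailabilityGroupServer -Identity "DAG1" -MailboxServer "EX02"

    # Replicate a database that is active on EX01 over to EX02.
    Add-MailboxDatabaseCopy -Identity "DB01" -MailboxServer "EX02" `
        -ActivationPreference 2

    # Check the health and copy queue of every copy of the database.
    Get-MailboxDatabaseCopyStatus -Identity "DB01"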
We'll provide a quick comparison between the typical
Exchange HA deployment and DR deployment. If you think that by having
disaster recovery you have availability, or vice versa, think again.
In an HA Exchange environment, the focus is usually
on keeping user mailboxes online, mail flowing to and from external
systems, and Exchange services running. In a DR environment, the
focus is usually on restoring a bare minimum of services, often for a smaller portion of the overall user population. In short, the difference is that of abundance vs. triage.
For Exchange, an HA design can provide several
advantages beyond the obvious availability goals. A highly available
Exchange environment often enables server consolidation; the same
technologies that permit mailbox data to be replicated between servers
or keep multiple instances of key Exchange services running also permit
greater user mailbox density, or force the upgrading of key
infrastructure (like network bandwidth) so that a greater number of
users can be handled. This increased density can make proper DR
planning more difficult by increasing the requirements for a DR
solution and making it harder to identify and target the appropriate
user populations.
That's not to say that HA and DR are
incompatible. Far from it; you can and should design your Exchange 2010
deployment for both. To do that effectively, though, you need to have a
clear understanding of what each technology and feature actually
provides you, so you can avoid design errors. For example, if you have
separate groups of users who will need their mailboxes replicated to a
DR site, set them aside in separate mailbox databases, rather than
mingling them in with users whose mailboxes won't be replicated.
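As a hedged illustration of that advice, the following Exchange Management Shell sketch creates a dedicated database for the users who need DR protection, replicates only that database to a DAG member in the DR site, and moves one such mailbox into it; DR-DB01, EX01, DREX01, and kim.akers are placeholder names.

    # Create a dedicated database for mailboxes that must survive a site loss.
    New-MailboxDatabase -Name "DR-DB01" -Server "EX01"

    # Mount the new database before using it.
    Mount-Database -Identity "DR-DB01"

    # Replicate only this database to the DAG member in the DR site.
    Add-MailboxDatabaseCopy -Identity "DR-DB01" -MailboxServer "DREX01"

    # Move an affected user into the replicated database (an online move
    # request in Exchange 2010).
    New-MoveRequest -Identity "kim.akers" -TargetDatabase "DR-DB01"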