High Availability in Exchange Server 2010 : Developments in High Availability (part 1) - Exchange database replication, Database Availability Group and Continuous Replication

4/16/2013 2:51:39 AM

Ever since Exchange Server 5.5, Microsoft has offered the option to use Windows Clustering to create a highly available Exchange Mailbox environment. A typical shared-storage cluster has two server nodes, both running Exchange Server and both connected to a shared storage solution. In the early days this shared storage was built on a shared SCSI bus; later, SANs with a Fibre Channel or iSCSI network connection were used. The important part is the shared storage, where the Exchange Server databases are located.

At any given point in time only one server node is the "owner" of this shared data, and it is this node that provides the client services; it is known as the active node. The other node cannot access this data and is therefore the passive node. A private network between the two nodes carries intra-cluster communications, such as a heartbeat signal, which lets each node determine the state of the cluster and whether the other node is still alive.

In addition to the two nodes, an "Exchange Virtual Server" is created as a cluster resource (note that this has nothing to do with virtual machines!). This is the resource that (Outlook) clients connect to in order to get access to their mailboxes. When the active node fails, the passive node takes over the Exchange Virtual Server, which then continues to run. Users will notice a short downtime during the fail-over, but otherwise the experience is seamless and no action is needed from the end user.
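The heartbeat and take-over behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Windows Clustering is implemented: the class name `TwoNodeCluster`, the node names, and the five-second timeout are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TwoNodeCluster:
    """Toy model of an active/passive cluster hosting one virtual server."""

    def __init__(self, active, passive, heartbeat_timeout):
        self.owner = active                # node currently running the virtual server
        self.standby = passive
        self.timeout = heartbeat_timeout   # seconds of silence tolerated (assumed value)
        self.last_heartbeat = {active: 0.0, passive: 0.0}

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` over the private network."""
        self.last_heartbeat[node] = now

    def is_alive(self, node, now):
        return now - self.last_heartbeat[node] <= self.timeout

    def check(self, now):
        """Move the virtual server to the standby node if the owner went silent."""
        if not self.is_alive(self.owner, now) and self.is_alive(self.standby, now):
            self.owner, self.standby = self.standby, self.owner
        return self.owner


cluster = TwoNodeCluster("NODE1", "NODE2", heartbeat_timeout=5.0)
cluster.heartbeat("NODE1", now=1.0)
cluster.heartbeat("NODE2", now=1.0)
owner_before = cluster.check(now=4.0)    # both nodes were heard from recently

cluster.heartbeat("NODE2", now=9.0)      # NODE1 has stopped sending heartbeats
owner_after = cluster.check(now=10.0)
print(owner_before, owner_after)
```

Clients keep addressing the virtual server rather than a node, which is why the change of owner needs no action on their side.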

Although this solution offers redundancy, there's still a single point of failure: the shared database of the Exchange server. In a typical environment this database is stored on a SAN, and by its nature a SAN is a highly available environment. But when something does happen to the database, a logical failure for example, the database is unavailable for both nodes, resulting in total unavailability.

Figure 1. A two node cluster with shared storage.

1. Exchange database replication

Microsoft offered a new solution in Exchange Server 2007 to create highly available Exchange environments: database replication. With database replication, a copy of a database is created, resulting in database redundancy. This technology was available in three flavors:

Local Continuous Replication (LCR) – a copy of the database is created on the same server.
Cluster Continuous Replication (CCR) – a copy of the database is created on another node in a Windows failover cluster (there can only be two nodes in a CCR cluster).
Stand-by Continuous Replication (SCR) – this came with Exchange Server 2007 SP1. A copy of a database is created on any other Exchange server (i.e. not necessarily in the cluster). This is not meant as a high availability solution, but more as a disaster recovery solution.

This is how database replication works in a CCR clustered environment:

Exchange Server 2007 is installed on a Windows Server 2003 or Windows Server 2008 Fail-over cluster. There's no shared storage in use within the cluster, but each node has its own storage. This can be either on a SAN (fibre channel or iSCSI) or Direct Attached Storage (DAS) – i.e. local physical disks.

As mentioned earlier, the active node in the cluster is servicing client requests, and Exchange Server uses the standard database technology with a database, log files, and a checkpoint file. When Exchange Server is finished with a log file, the log file is sent immediately to the passive node of the cluster. This can either be via a normal network connection or via a dedicated replication network.

The passive node receives the log file and checks it for errors. If none are found, the data in the log file is replayed into the passive copy of the database. This is an asynchronous process, meaning the passive copy is always a couple of log files behind the active copy, and so information is "missing" in the passive copy.
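The ship, inspect and replay cycle can be sketched in a few lines. This is a simplified illustration and not the actual ESE log format or replication protocol: `LogFile`, `close_log`, `PassiveCopy` and the SHA-256 checksum are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LogFile:
    """A closed transaction log: generation number, records, and a checksum."""
    generation: int
    records: list
    checksum: str


def close_log(generation, records):
    """Active node: seal a log file and stamp it with a checksum before shipping."""
    digest = hashlib.sha256(repr(records).encode()).hexdigest()
    return LogFile(generation, list(records), digest)


@dataclass
class PassiveCopy:
    """Passive node: inspects shipped logs and replays them in order."""
    database: dict = field(default_factory=dict)
    last_replayed: int = 0

    def receive(self, log):
        # Inspect: reject a log whose contents do not match its checksum.
        expected = hashlib.sha256(repr(log.records).encode()).hexdigest()
        if expected != log.checksum:
            return False
        # Replay strictly in sequence; a gap means an earlier log is missing.
        if log.generation != self.last_replayed + 1:
            return False
        for key, value in log.records:
            self.database[key] = value
        self.last_replayed = log.generation
        return True


# The active node has closed three logs, but only two have been shipped so far.
logs = [
    close_log(1, [("alice/inbox/1", "hello")]),
    close_log(2, [("bob/inbox/1", "report")]),
    close_log(3, [("alice/inbox/2", "follow-up")]),
]
passive = PassiveCopy()
shipped = [passive.receive(log) for log in logs[:2]]
lag = logs[-1].generation - passive.last_replayed
print(shipped, lag)
```

The `lag` value is the "missing" information mentioned above: whatever is in log files the passive copy has not yet replayed.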

In this environment, all messages are sent via a Hub Transport Server, even internal messages. The Hub Transport Server keeps track of these messages in a CCR environment. After a cluster fail-over, the passive node requests the missing information and the Hub Transport Server resends it to what was the passive copy of the database. This feature of the Hub Transport Server is called the "Transport Dumpster".
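As a rough illustration of the idea, the sketch below keeps recently delivered messages in a bounded queue and resubmits any that the newly activated copy does not have. The class name, the `max_messages` cap and the dictionary standing in for a mailbox database are assumptions for the example. The real Transport Dumpster has its own size and retention settings, which are not modeled here.

```python
from collections import deque


class TransportDumpster:
    """Hub Transport side: keep a bounded queue of recently delivered messages."""

    def __init__(self, max_messages=1000):
        # max_messages is an illustrative cap, not a real Exchange setting.
        self._recent = deque(maxlen=max_messages)

    def deliver(self, mailbox_db, message_id, body):
        """Deliver a message and retain a copy for possible redelivery."""
        self._recent.append((message_id, body))
        mailbox_db[message_id] = body

    def redeliver(self, mailbox_db):
        """Resubmit retained messages; skip any the database already has."""
        restored = []
        for message_id, body in self._recent:
            if message_id not in mailbox_db:
                mailbox_db[message_id] = body
                restored.append(message_id)
        return restored


active_db, passive_db = {}, {}
dumpster = TransportDumpster()
for n in (1, 2, 3):
    dumpster.deliver(active_db, f"msg-{n}", f"body {n}")

# Replication had only caught up with the first two messages when the node failed.
passive_db.update({k: active_db[k] for k in ("msg-1", "msg-2")})

restored = dumpster.redeliver(passive_db)   # the passive copy is being activated
print(restored)
```

Skipping messages that are already present is what keeps redelivery from creating duplicates in the mailbox.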

This kind of replication works very well; a lot of System Administrators are using CCR replication and are very satisfied with it. There are a couple of drawbacks, though:

An Exchange Server 2007 CCR environment is running on Windows Server 2003 or Windows Server 2008 clustering. For many Exchange administrators this brings a lot of additional complexity to the environment.
Windows Server 2003 clustering in a multi-subnet environment is nearly impossible, although this has improved (but is still not perfect) in Windows Server 2008 failover-clustering.
Site Resilience is not seamless.
CCR clustering is only possible in a two node environment.
All three kinds of replication (LCR, CCR and SCR) are managed differently.

Figure 2. A fail-over cluster with Exchange Server 2007 Cluster Continuous Replication.

To overcome these issues, Microsoft has dramatically improved the replication technology, and reduced the administrative overhead at the same time. This is achieved by completely hiding the cluster components behind the implementation of Exchange Server 2010. The cluster components are still there, but the administration is completely done with the Exchange Management Console or the Exchange Management Shell.

2. Database Availability Group and Continuous Replication

In Exchange Server 2010, Microsoft introduces the concept of a Database Availability Group or DAG, which is a logical unit of Exchange Server 2010 Mailbox Servers. All Mailbox Servers within a DAG can replicate databases to each other, and a single DAG can hold up to 16 Mailbox Servers and up to 16 copies of a database. The idea of multiple copies of a database in one Exchange organization is called Database Mobility; one database exists on multiple servers, and each instance is 100% identical and thus has the same GUID.

With a DAG in place, clients connect to an Active Database, which is the database where all data is initially stored. Also, new SMTP messages that arrive, either from outside or inside the organization, are stored in this database first. When the Exchange Server has finished processing information in the database's log file, the file is replicated to other servers (you can assign which servers should have a copy of the database). The log file is inspected upon receipt and, if everything is all right, the information contained in the log file is dropped into the local copy of the database.

Figure 3. A Database Availability Group with three servers; each server holds one Active Database and two Passive Databases.

In Exchange Server 2010, all clients connect to the Client Access Server, including all MAPI clients like Microsoft Outlook. Supported Outlook clients in Exchange Server 2010 include Outlook 2003, Outlook 2007 and Outlook 2010. So, the Outlook client connects to the Client Access Server which, in turn, connects to the mailbox in the Active Copy of the database, as you can see in Figure 6. Unfortunately, this is only true for Mailbox Databases. When an Outlook client needs to access a Public Folder Database, the client still accesses the Mailbox Server directly.

When the active copy of a database or its server fails, one of the passive copies of the database becomes active. The order of fail-over is configurable during the configuration of database copies. The Client Access Server automatically notices the fail-over, and starts using the new Active Database. Since the Outlook client is connected to the Client Access Server and not directly to the database, a database fail-over is fully transparent. Messages like "The connection to the server was lost" and "The connection to the server is restored" simply do not appear any more.
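A minimal sketch of choosing the next active copy from a configured order follows. The `DatabaseCopy` fields and the `fail_over` function are hypothetical names for illustration. The sketch considers only the configured order and a simple healthy flag, which is a simplification of what Exchange actually evaluates.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    preference: int        # lower number = tried first (the configured order)
    healthy: bool = True
    active: bool = False


def fail_over(copies):
    """Deactivate the current active copy and activate the first healthy passive one."""
    candidates = sorted(
        (c for c in copies if not c.active and c.healthy),
        key=lambda c: c.preference,
    )
    if not candidates:
        raise RuntimeError("no healthy passive copy available")
    for c in copies:
        c.active = False
    candidates[0].active = True
    return candidates[0]


copies = [
    DatabaseCopy("EX1", preference=1, active=True),
    DatabaseCopy("EX2", preference=2, healthy=False),   # assumed to be unavailable
    DatabaseCopy("EX3", preference=3),
]
copies[0].healthy = False          # the server holding the active copy fails
new_active = fail_over(copies)
print(new_active.server)
```

Because the Outlook client talks to the Client Access Server, only the Client Access Server needs to learn which server `fail_over` selected.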

When building a highly available Mailbox Server environment in a DAG, there's no need to build a fail-over cluster in advance, as additional Mailbox Servers can be added to the DAG on the fly. For the DAG to function properly, some fail-over clustering components are still used, but these are installed during the DAG's configuration. All management of the DAG and the database copies is performed via the Exchange Management Console or the Exchange Management Shell; the Windows Cluster Manager is no longer used.

NOTE

The Database Availability Group with Database Copies is the only high availability technology used in Exchange Server 2010. Older technologies like LCR, CCR and SCR are no longer available. The traditional Single Copy Cluster (SCC) with shared storage is also no longer supported.

Configuration of a Database Availability Group is no longer limited to a server holding just the Mailbox Server Role. It is possible to create a two-server situation with the Hub Transport, Client Access and Mailbox Server role on both servers, and then create a Database Availability Group and configure Database Copies. However, it isn't a High Availability configuration for the Client Access or Hub Transport servers unless you've put load balancers in front of them, since it's not possible to use the default Windows Network Load Balancing (NLB) in combination with the fail-over clustering components. Regardless, this is a great improvement for smaller deployments of Exchange Server 2010 where high availability is still required.

2.1 Active Manager

In Exchange Server 2007, Cluster Continuous Replication uses the cluster resource management model to install and manage the High Availability solution. First the Windows cluster is built, and then Exchange setup is run in clustered mode, which registers EXRES.DLL in the failover cluster and creates the Clustered Mailbox Server (CMS). For a highly available Exchange Server 2007 environment it is always necessary to build a fail-over cluster in advance, even if it's just a one-node cluster!

The cluster components are now hidden in Exchange Server 2010, and a new component named the Active Manager has been introduced. The Active Manager replaces the resource model and fail-over management features offered in previous versions of Exchange Server.

The fail-over clustering components have not been completely removed, though, and some of them are actually still used. If you open the Fail-over Cluster Manager in Administrative Tools, you'll find the DAG, cluster networks, etc. Do not try to manage the DAG using the Failover Cluster Manager, as this is not supported. The only way to manage the DAG is using the Exchange Management Console or the Exchange Management Shell!

The Active Manager runs on all Mailbox Servers that are members of a DAG, and it has two roles: the Primary Active Manager (PAM) and the Standby Active Manager (SAM). The PAM runs on the Mailbox Server that also holds the cluster quorum, and this is the server that decides which databases are active and which are passive in a DAG. The SAM is responsible for detecting server or database failures (the PAM does this on its own server for its own local databases) and, when it detects one, communicates with the PAM to initiate a failover.

The replication service monitors the health of the mounted databases in a DAG, and monitors the ESE engine for any I/O issues or failures. If anything goes wrong here, the replication service immediately contacts the Active Manager. In the case of a failover, the Active Manager determines which database should become the Active Copy of the database (depending on the fail-over order you've specified during configuration).
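The division of labor between the two roles can be sketched as follows. This is a conceptual model only: the class and method names are hypothetical, communication is reduced to a direct method call, and the decision uses nothing but the configured fail-over order.

```python
class PrimaryActiveManager:
    """Runs on one DAG member; the only component that decides which copy is active."""

    def __init__(self, failover_order):
        # failover_order maps a database name to its servers in configured order.
        self.failover_order = failover_order
        self.active_on = {db: servers[0] for db, servers in failover_order.items()}
        self.failed_copies = set()

    def report_failure(self, database, server):
        """Handle a failure report and return the server that now holds the active copy."""
        self.failed_copies.add((database, server))
        if self.active_on[database] != server:
            return self.active_on[database]      # a passive copy failed; nothing moves
        for candidate in self.failover_order[database]:
            if (database, candidate) not in self.failed_copies:
                self.active_on[database] = candidate
                return candidate
        raise RuntimeError(f"no copy left to activate for {database}")


class StandbyActiveManager:
    """Runs on the other DAG members; detects local failures and notifies the PAM."""

    def __init__(self, server, pam):
        self.server = server
        self.pam = pam

    def on_replication_alert(self, database):
        # Called when the replication service reports a problem with a local copy.
        return self.pam.report_failure(database, self.server)


pam = PrimaryActiveManager({
    "DB1": ["EX1", "EX2", "EX3"],
    "DB2": ["EX2", "EX3", "EX1"],
})
sam_ex1 = StandbyActiveManager("EX1", pam)
new_owner = sam_ex1.on_replication_alert("DB1")
print(new_owner, pam.active_on)
```

Keeping the decision in one place means two servers never activate the same database independently; the SAM only reports, and the PAM decides.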
