Windows Server 2008 R2 high-availability and recovery features : Planning for High Availability

2/11/2014 8:29:06 PM

1. New High-availability and Recovery Features

Windows Server 2008 R2 includes several features that further enhance HA and backup services. These include new features such as PowerShell support for clustering and the ability to back up individual files and folders with Windows Backup.

Failover Cluster PowerShell support

Failover Clusters can now be set up and administered using PowerShell 2.0. This includes not only the new cmdlets for Failover Clustering but also the ability to remotely send commands to cluster services via PowerShell 2.0. With the added support for PowerShell, the cluster.exe command line utility is being deemphasized and may not be available in future releases of Windows.

Cluster-Shared Volumes

Failover Clustering supports the use of Cluster-Shared Volumes (CSVs). These are volumes that can be accessed by multiple nodes of the cluster at the same time. This brings new benefits to Hyper-V deployments by enabling Live Migration and reducing the number of LUNs required. Live Migration allows you to move virtual machines between two hosts in a Failover Cluster with no downtime. CSVs make this process possible.

Because previous versions of Windows allowed only one host to actively access a LUN, a failover would cause all VMs stored on that LUN to fail over. Prior to Windows Server 2008 R2, Microsoft recommended that each VM in a Failover Cluster be assigned its own LUN to ensure that a single VM could fail over. For many deployments, this resulted in a large number of LUNs being assigned to each Hyper-V host. Windows Server 2008 R2 removes this restriction with CSVs, which allow both hosts to access the volume at the same time. A single VM on a LUN can therefore fail over without requiring the other VMs on that same LUN to do the same.

Improved Cluster Validation

Windows Server 2008 introduced the Cluster Validation Wizard. By using this wizard, administrators could easily verify and set up a cluster, ensuring that it was in a supported configuration. If the cluster passed the validation wizard, it was considered to be in a correct configuration. Windows Server 2008 R2 adds tests to the wizard to further verify that a cluster is in a supported configuration.

Support for additional cluster aware services

The Remote Desktop Connection Broker and DFS Replication (DFSR) can both be configured on a Failover Cluster to provide HA and redundancy to these services.

Ability to back up individual files and folders

Windows Server 2008 R1 (RTM) backup did not have the ability to select individual files and folders to be backed up. This was a feature offered in previous versions of Windows such as Windows Server 2003. Windows Server 2008 R1, however, could back up only a full volume. Windows Server 2008 R2 brings back the ability for administrators to choose which files and folders to include in a backup set.

2. Planning for High Availability

Deploying HA features on your network requires adequate planning and testing prior to production use of the solution. One of the first planning steps is to determine the expected uptime requirements for the system. You may find that the actual business need for the system does not require HA features at all. This depends on how long it takes to restore the system and how long the organization can work without the system being online. These requirements need to be reviewed from a business standpoint and should have buy-in from those in charge of the business process that the system supports.

Additionally, you will need to determine whether the particular system is supported using Windows Server 2008 R2 HA features. For example, Microsoft SQL Server is a cluster-aware application and can therefore be supported using Failover Clustering features. IIS web servers can be configured using NLB features. A third-party database server may not be cluster aware, and therefore you may not be able to provide an HA solution for that application using Windows Server 2008 R2 Failover Clustering. There are Generic Application and Generic Service options for setting up applications and services that are not cluster aware. These, however, provide only basic Failover Clustering features.

The following standard Microsoft applications and services are cluster aware, meaning that they can be deployed on a Windows Server 2008 R2 Failover Cluster to provide HA:

Microsoft SQL Server
Microsoft Exchange Server
DHCP Server
File Server
DFS Server
Distributed Transaction Coordinator
iSNS Server
Message Queuing
Print Server
Remote Desktop Connection Broker
Hyper-V Host
WINS

After you have determined that HA features are required and that they can be supported by Windows Server 2008 R2 Failover Clustering or NLB, you can begin planning your HA solution.
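The uptime question above can be made concrete with a little arithmetic. The following Python sketch is only an illustration: the function names, the availability targets, and the restore-time and failure-rate figures are assumptions chosen for the example and do not come from any Microsoft tool. It converts an availability target into a yearly downtime budget and compares that budget with the downtime you would expect if every failure required a full restore.

```python
# Hypothetical helpers for illustration; names and figures are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def needs_ha(availability_percent: float, restore_hours: float,
             expected_failures_per_year: float) -> bool:
    """True when the expected yearly restore time exceeds the downtime budget."""
    expected_downtime = restore_hours * expected_failures_per_year
    return expected_downtime > allowed_downtime_hours(availability_percent)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% allows {allowed_downtime_hours(target):.2f} hours/year")
    # Assumed figures: a 6-hour restore, expected twice per year.
    print("HA needed at 99.9%:", needs_ha(99.9, 6, 2))
```

With these assumed figures, a system that takes six hours to restore and fails twice a year fits comfortably within a 99% target but not within a 99.9% target, which is the kind of result that should be reviewed with the business owners before committing to an HA design.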

Understanding how Failover Clustering works

As you previously learned, Windows Failover Clusters provide HA by deploying multiple servers in a cluster. The cluster hides the fact that multiple servers are deployed, meaning that client computers see all servers in the cluster as a single server. Each server in the cluster is referred to as a node. Windows clustering uses an active/passive concept to support HA services. This means that the active node is online and performs all processing requested by the installed application. In the event that the active node fails, the cluster fails over to the passive node, which then becomes active. The new active node continues to handle processing of the application.
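The active/passive behavior can be pictured with a small toy model. The Python sketch below is a simplification written for this discussion, not how the Windows cluster service is implemented: the class and method names are hypothetical, and it deliberately ignores heartbeat and quorum. It only shows that clients always talk to one logical service while the node behind it changes on failure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    online: bool = True


@dataclass
class ActivePassiveCluster:
    """Toy model: the first online node in the list is the active node."""
    nodes: List[Node] = field(default_factory=list)

    def active_node(self) -> Optional[Node]:
        for node in self.nodes:
            if node.online:
                return node
        return None  # every node is offline, so the service is down

    def fail(self, name: str) -> Optional[Node]:
        """Mark a node offline and return whichever node is active afterwards."""
        for node in self.nodes:
            if node.name == name:
                node.online = False
        return self.active_node()


if __name__ == "__main__":
    cluster = ActivePassiveCluster([Node("node1"), Node("node2")])
    print("Active:", cluster.active_node().name)
    survivor = cluster.fail("node1")
    print("Active after failure:", survivor.name if survivor else "none")
```

In this model the passive node takes over as soon as the active node is marked offline. A real cluster must first detect the failure and confirm that the surviving node is allowed to take over.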

Cluster nodes use heartbeat and quorum to determine which node is online and active and to initiate a failover in the event of a node failure. The heartbeat is used to determine whether nodes of the cluster are online. Each node communicates over the heartbeat network continuously to determine whether the other nodes are online. If an active node fails to respond to a heartbeat request, the cluster will fail over to a passive node. Quorum is used to ensure that the cluster can continue to function and that nodes can recover in the event of failure. The quorum also helps ensure that clusters do not experience “split brain,” a condition in which an active and a passive node both believe they should be the active node. For a node in an active/passive cluster to become active, it must be able to communicate with the quorum. If a node cannot communicate with the quorum, it cannot become active. Windows Server 2008 R2 allows you to use a quorum disk or a file share, known as a file share witness. Failover Clusters can use any of the following quorum configurations:

Node Majority: This quorum setting is used when there is an odd number of nodes in the cluster. With this setting, a cluster can tolerate failure of half of the nodes (rounded up) minus one.
Node and Disk Majority: This quorum setting should be used for clusters with an even number of nodes. Using this setting, the cluster can tolerate failure of half of the nodes (rounded up) if the quorum disk remains online. If the quorum disk goes offline, the cluster can tolerate failure of half of the nodes (rounded up) minus one. For example, a four-node cluster can remain online if two nodes fail and the quorum disk remains online, or if one node fails and the quorum disk also fails.
Node and File Share Majority: This quorum setting is used for clusters that require special configuration using a file share instead of a quorum disk. For example, Exchange Server 2007 Cluster Continuous Replication (CCR) uses a file share witness.
No Majority-Disk Only: This setting is not recommended because the quorum disk becomes a single point of failure. With this setting, the cluster can tolerate failure of all nodes except one as long as the quorum disk remains online.
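The tolerances listed above follow from simple vote counting, which can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the cluster service's actual algorithm: the mode strings and function names are hypothetical, each node and the witness (disk or file share) are assumed to carry one vote, quorum is assumed to require more than half of all votes, and at least one node is assumed to be needed online to host the service.

```python
# Hypothetical mode names used only in this sketch.
NODE_MAJORITY = "node_majority"
NODE_AND_DISK = "node_and_disk_majority"
NODE_AND_FILE_SHARE = "node_and_file_share_majority"
DISK_ONLY = "no_majority_disk_only"


def has_quorum(total_nodes: int, online_nodes: int, mode: str,
               witness_online: bool = True) -> bool:
    """Return True if the surviving votes still form a quorum."""
    if online_nodes < 1:
        return False  # no node is left to host the clustered service
    if mode == DISK_ONLY:
        return witness_online  # only the quorum disk counts
    if mode == NODE_MAJORITY:
        total_votes, votes = total_nodes, online_nodes
    elif mode in (NODE_AND_DISK, NODE_AND_FILE_SHARE):
        total_votes = total_nodes + 1  # the witness adds one vote
        votes = online_nodes + (1 if witness_online else 0)
    else:
        raise ValueError(f"unknown quorum mode: {mode}")
    return votes * 2 > total_votes  # strictly more than half of all votes


def tolerated_node_failures(total_nodes: int, mode: str,
                            witness_online: bool = True) -> int:
    """Count how many nodes can fail before quorum is lost.

    Returns 0 as well when the cluster has no quorum to begin with.
    """
    failures = 0
    while has_quorum(total_nodes, total_nodes - failures - 1, mode, witness_online):
        failures += 1
    return failures


if __name__ == "__main__":
    print(tolerated_node_failures(3, NODE_MAJORITY))
    print(tolerated_node_failures(4, NODE_AND_DISK, witness_online=True))
    print(tolerated_node_failures(4, NODE_AND_DISK, witness_online=False))
```

Under these assumptions, the four-node Node and Disk Majority example works out to five votes in total, so any three surviving votes keep the cluster online: two nodes plus the disk, or three nodes without it.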

We will explore setting up quorum later in this chapter when we discuss administering Failover Clusters.

Planning for a Failover Cluster

When planning to implement a Failover Cluster, you need to answer the following preliminary questions:

How many node (server) failures should be tolerated? Windows Server 2008 R2 Failover Clusters can be configured with multiple nodes. For example, you could provide an active node with two passive nodes. In the event that the active node failed, one of the passive nodes would become active. In the event that the second node failed, the other passive node would then become active. This allows a Failover Cluster to support failure of multiple nodes.
Does the cluster need to support geographic resiliency? With the release of Windows Server 2008 R1, Failover Clusters gained the ability to span a wide area network. Using a geo-cluster, you can have an active node in one datacenter and the failover node in a datacenter in another geographic location. In the event of complete datacenter loss, the cluster could fail over to the node in the second datacenter. Figure 1 depicts a Windows Server 2008 R2 geo-cluster.

Figure 1. A file server supported by a Windows Server 2008 R2 geo-cluster.
Can the application sustain the brief time required to fail over to another node in the cluster? Failover Clusters require a very brief period of time to fail over in the event of a node going offline. You will want to ensure that your system, including front-end applications, can support this very brief outage. For example, you may deploy SQL Server on a Failover Cluster. During the failover process, there will be a very brief period of time during which the front-end application cannot talk to the SQL back-end. You need to verify whether the application can easily reconnect after the brief outage occurs. This outage is usually just a few seconds.
How will you be notified in the event of a node failure? The beauty of Failover Clustering is that the application remains online when a server fails. However, what if you as the administrator are not aware that a failure has occurred? The application is still online after all; thus, helpdesk phone lines probably are not ringing. This does not negate the fact that you need to know that a node has gone down and the cluster has failed over. You need to be able to troubleshoot and resolve the issue that caused the failover to begin with. You also need to restore failover capabilities; otherwise, a second node failure could cause a service outage depending on how many failover nodes are available.
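One simple way to think about failure notification is to compare successive snapshots of node state and raise an alert for any node that has just gone down. The Python sketch below illustrates that idea only. The function name and the "Up" and "Down" state strings are assumptions for the example, and how the snapshots are collected and how the alert is delivered (e-mail, monitoring system, and so on) is left out.

```python
from typing import Dict, List


def newly_failed_nodes(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the nodes that were "Up" in the previous snapshot but are not now."""
    failed = []
    for name, old_state in previous.items():
        # A node missing from the current snapshot is treated as down.
        new_state = current.get(name, "Down")
        if old_state == "Up" and new_state != "Up":
            failed.append(name)
    return sorted(failed)


if __name__ == "__main__":
    before = {"node1": "Up", "node2": "Up"}
    after = {"node1": "Down", "node2": "Up"}
    for name in newly_failed_nodes(before, after):
        print(f"ALERT: {name} is no longer up; the cluster may have failed over")
```

Comparing snapshots, rather than checking whether the application responds, matters here because the application stays online after a failover and would not reveal the lost node.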