4. Planning for availability, scalability, and manageability
The enterprise depends on highly available, scalable, and manageable systems. High availability
refers to the ability of a system to withstand hardware, application,
or service outages while continuing to deliver its services. High scalability refers to the ability of the system to expand processor and memory capacity as business needs demand. High manageability
refers to the ability of the system to be managed locally and remotely
and the ease with which components, services, and applications can be
administered.
Planning for high availability, scalability, and manageability is critical to the success of using Windows Server
2012 in the enterprise, and you need a solid understanding of the
recommendations and operating principles for deploying and maintaining
high-availability servers before you deploy servers running these
editions. You should also understand the types of hardware, software, and support facilities needed for enterprise computing.
Note
The discussion that follows focuses on achieving high availability,
scalability, and manageability in the enterprise. Smaller organizations
or business units can adopt similar approaches to meet business
objectives, but they should determine the scope appropriately with
budgets and available resources in mind.
Planning for software needs
Software should be chosen for its ability to support the
high-availability needs of the business system. Not all software is
compatible with high-availability solutions like clustering or load
balancing. Not all software must be compatible, either. Instead of
making an arbitrary decision, you should let the uptime needs of the
application determine the level of availability required.
An availability goal of 99
percent uptime is typical for most noncritical business systems. If an
application must have 99 percent uptime, the application might not need
to support clustering or load balancing. Achieving 99 percent uptime
means that the application can have about 88 hours of downtime in an
entire year, or roughly 100 minutes of downtime a week.
An availability goal of 99.9
percent uptime is typical for highly available business
systems. If an application must have 99.9 percent uptime, the
application must support some type of high-availability solution, such
as clustering or load balancing. Achieving 99.9 percent uptime means
that the application has less than 9 hours of downtime in an entire year
or, put another way, less than 10 minutes of downtime a week.
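To put these targets in perspective, a downtime budget is simply the complement of the uptime percentage multiplied by the period of interest. The following Python sketch, included only for illustration, reproduces the figures quoted above:

```python
# Illustrative only: convert an availability target into a downtime budget.
HOURS_PER_YEAR = 365 * 24        # 8,760 hours
MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080 minutes

def downtime_budget(availability_pct):
    """Return (hours of downtime per year, minutes of downtime per week)."""
    downtime_fraction = 1 - (availability_pct / 100)
    return (HOURS_PER_YEAR * downtime_fraction,
            MINUTES_PER_WEEK * downtime_fraction)

for target in (99.0, 99.9):
    per_year, per_week = downtime_budget(target)
    print(f"{target}% uptime allows about {per_year:.1f} hours/year "
          f"or {per_week:.1f} minutes/week of downtime")

# 99.0% -> about 87.6 hours/year, or about 100.8 minutes/week
# 99.9% -> about 8.8 hours/year, or about 10.1 minutes/week
```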
To evaluate the real-world environment prior to deployment, you should perform integration testing on applications that will be used together. The purpose of integration
testing is to ensure that disparate applications interact as expected
and to uncover problem areas if they don’t. During integration testing,
testers should look at system performance and overall system
utilization, as well as compatibility. Testing should be repeated prior
to releasing system or application changes to a production environment.
You should standardize the software components needed to provide
system services. The goal of standardization is to set guidelines for
software components and technologies that will be used in the
enterprise. Standardization accomplishes the following:
-
Reduces the total cost of maintaining and updating software
-
Reduces the amount of integration and compatibility testing needed for upgrades
-
Improves recovery time because problems are easier to troubleshoot
-
Reduces the amount of training needed for administration support
Software standardization isn’t meant to limit the organization to a
single specification. Over the life of a data center, new application
versions, software components, and technologies will be introduced, and
the organization can implement new standards and specifications as
necessary. The key to success lies in ensuring that there is a standard
process for deploying software updates and new technologies. The
standard process must include the following:
-
Software compatibility and integration testing
-
Software support training for personnel
-
Predeployment planning
-
Step-by-step software deployment checklists
-
Postdeployment monitoring and maintenance
The following checklist summarizes the recommendations for designing and planning software for high availability:
-
Choose software that meets the availability needs of the solution or service.
-
Choose software that supports online backups.
-
Test software for compatibility with other applications.
-
Test software integration with other applications.
-
Repeat testing prior to releasing updates.
-
Create and enforce software standards.
-
Define a standard process for deploying software updates.
Planning for hardware needs
A sound hardware strategy helps increase system availability while
reducing total cost of ownership and improving recovery times. Windows
Server 2012 is designed and tested for use with high-performance
hardware, applications, and services. To ensure that hardware components
are compatible, choose only components that are certified as
compatible, such as those that are listed as certified for Windows
Server 2012 in the Windows Server Catalog (http://www.windowsservercatalog.com/).
Note
All certified components undergo rigorous testing, with a retest for
firmware revisions, service pack updates, and other minor revisions.
After a component is certified through testing, hardware vendors must
maintain the configuration through updates and resubmit the component
for testing and certification. The program requirements and the tight
coordination with vendors greatly improve the reliability and
availability of Windows Server 2012. All hardware certified for Windows
Server 2012 also is fully supported in Hyper-V environments.
You should standardize on a hardware platform, and this platform should have standardized components. Standardization accomplishes the following:
-
Reduces the amount of training needed for support
-
Reduces the amount of testing needed for upgrades
-
Requires fewer spare parts because subcomponents are the same
-
Improves recovery time because problems are easier to troubleshoot
Standardization isn’t meant to restrict a data center to a single type of server. In an n-tier
environment, standardization typically means choosing a standard server
configuration for the front-end servers, a standard server
configuration for middle-tier business logic, and a standard server
configuration for back-end data services. The reason for this is that
web servers, application servers, and database servers all have
different resource needs. For example, although a web server might need
to run on a dual-processor system with limited hardware RAID control and
4 gigabytes (GB) of random access memory (RAM), a database server
might need to run on an eight-way system with dual-channel RAID control
and 64 GB of RAM.
Standardization isn’t meant to limit the organization to a single
hardware specification either. Over the life of a data center, new
equipment will be introduced and old equipment likely will become
unavailable. To keep up with the pace of change, new standards and
specifications should be implemented when necessary. These standards and
specifications, like the previous ones, should be published and made
available to support personnel.
Redundancy and fault tolerance must be built into the hardware design at all levels to improve availability. You can improve hardware redundancy by using the following components:
-
Clusters. Clusters provide failover support for critical applications and services.
-
Standby systems. Standby systems provide backup systems in case of total failure of a primary system.
-
Spare parts. Spare parts ensure that replacement parts are available in case of failure.
-
Fault-tolerant components. Fault-tolerant components improve the internal redundancy of the system.
Storage devices, network components, cooling fans, and power supplies
all can be configured for fault tolerance. For storage devices, you
should be sure to use multiple disk controllers, hot-swappable drives,
and redundant drive arrays. For network components, you should look well
beyond the network adapter and also consider whether fault tolerance is
needed for routers, switches, firewalls, load balancers, and other
network equipment.
A standard process for deploying hardware must be defined and distributed to all support personnel. The standard process must include the following:
-
Hardware compatibility and integration testing
-
Hardware support training for personnel
-
Predeployment planning
-
Step-by-step hardware deployment checklists
-
Postdeployment monitoring and maintenance
The following checklist summarizes the recommendations for designing and planning hardware for high availability:
-
Choose hardware that is listed on the Hardware Compatibility List (HCL).
-
Create and enforce hardware standards.
-
Use redundant hardware whenever possible.
-
Use fault-tolerant hardware whenever possible.
-
Provide a secure physical environment for hardware.
-
Define a standard process for deploying hardware.
If possible, add these recommendations to the preceding checklist:
-
Use fully redundant internal networks, from servers to border routers.
-
Use direct peering to major tier-1 telecommunications carriers.
-
Use redundant external connections for data and telephony.
-
Use a direct connection with high-speed lines.
Planning for support structures and facilities
The physical structures and facilities supporting your server room are critically important. Without adequate support
structures and facilities, even well-designed systems are vulnerable to avoidable outages. The primary
considerations for support structures and facilities have to do with the
physical environment of the servers. These considerations also extend
to the physical security of the server environment.
Just as hardware and software have availability requirements, so should support structures and facilities. The following factors affect the physical environment:
Temperature and humidity should be carefully controlled at all times.
Processors, memory, hard drives, and other pieces of physical equipment
operate most efficiently when they are kept cool; between 65 and 70
degrees Fahrenheit is the ideal temperature in most situations.
Equipment that overheats can malfunction or cease to operate altogether.
Servers should have multiple redundant internal fans to ensure that
these and other internal hardware devices are kept cool.
Important
You should pay particular attention to fast-running processors and
hard drives. Typically, fast processors and hard drives can become
overheated and need additional cooling fans—even if the surrounding
environment is cool.
Humidity should be kept low to prevent condensation, but the
environment shouldn’t be dry. A dry climate can contribute to static
electricity problems. Antistatic devices and static guards should be
used in most environments.
Dust and other contaminants can cause hardware components to overheat
or short out. Servers should be protected from these contaminants
whenever possible. You should ensure that an air-filtration system is in
place in the server room or hosting facility that is used. The regular
preventive maintenance cycle on the servers should include checking
servers and their cabinets for dust and other contaminants. If dust is
found, the servers and cabinets should be carefully cleaned.
Few things affect the physical environment more than wiring
and cabling. All electrical wires and network cables should be tested
and certified by qualified technicians. Electrical wiring should be
configured to ensure that servers and other equipment have adequate
power available for peak usage times. Ideally, multiple dedicated
circuits should be used to provide power.
Improperly installed network cables
are a common cause of communications problems. Network cables should be
tested to ensure that their operation meets manufacturer
specifications. Redundant cables should be installed to ensure the availability of the network. All wiring
and cabling should be labeled and well maintained. Whenever possible,
use cable management systems and tie wraps to prevent physical damage to wiring.
Ensuring that servers and their components have power is also important. Servers should have hot-swappable, redundant power supplies. Being hot-swappable ensures that the power
supply can be replaced without having to turn off the server.
Redundancy ensures that one power supply can malfunction and the other
will still deliver power to the server. You should be aware that having
multiple power supplies doesn’t mean that a server or hardware
component has redundancy. Some hardware components require multiple
power supplies to operate. In this case, an additional (third or fourth)
power supply is needed to provide redundancy.
The redundant power supplies should be plugged into separate power
strips, and these power strips should be plugged into separate local
uninterruptible power supply (UPS)
units if other backup power sources aren’t available. Some facilities
have enterprise UPS units that provide power for an entire room or
facility. If this is the case, redundant UPS systems should be
installed. To protect against long-term outages, gas-powered or
diesel-powered generators should be installed. Most hosting and
colocation facilities have generators. But having a generator isn’t
enough; the generator must be rated to support
the peak power needs of all installed equipment. If the generator
cannot support the installed equipment, brownouts (periods of reduced power)
or outright outages can occur.
Caution
A fire-suppression system should be installed to protect against fire.
Dual gas-based systems are preferred because these systems do not harm
hardware when they go off. Water-based sprinkler systems, on the other
hand, can destroy hardware.
In addition, access controls should be used to restrict physical
access to the server room or facility. Use locks, key cards, access
codes, or biometric scanners to ensure that only designated individuals
can gain entry to the secure area. If possible, use surveillance cameras
and maintain recorded tapes for at least a week. When the servers are
deployed in a hosting or colocation facility, ensure that locked cages
are used and that fencing extends from the floor to the ceiling.
The following checklist summarizes the recommendations for designing and planning structures and facilities:
-
Maintain the temperature at 65 to 70 degrees Fahrenheit.
-
Maintain low humidity (but not dry).
-
Install redundant internal cooling fans.
-
Use an air-filtration system.
-
Check for dust and other contaminants periodically.
-
Install hot-swappable, redundant power supplies.
-
Test and certify wiring and cabling.
-
Use wire management to protect cables from damage.
-
Label hardware and cables.
-
Install backup power sources, such as UPS and generators.
-
Install seismic protection and bracing.
-
Install dual, gas-based, fire-suppression systems.
-
Restrict physical access by using locks, key cards, access codes, and so forth.
-
Use surveillance cameras, and maintain recorded tapes (if possible).
-
Use locked cages, cabinets, and racks at offsite facilities.
-
Use floor-to-ceiling fencing with cages at offsite facilities.
Planning for day-to-day operations
Day-to-day operations and support
procedures must be in place before you deploy mission-critical systems.
The most critical procedures for day-to-day operations involve the
following activities:
-
Monitoring and analysis
-
Resources, training, and documentation
-
Change control
-
Problem escalation procedures
-
Backup and recovery procedures
-
Postmortem after recovery
-
Auditing and intrusion detection
Monitoring
is critical to the success of business system deployments. You must
have the necessary equipment to monitor the status of the business
system. Monitoring allows you to be proactive in system support rather
than reactive. Monitoring should extend to the hardware,
software, and network components but shouldn’t interfere with normal
systems operations—that is, the monitoring tools chosen should require
limited system and network resources to operate.
Note
Keep in mind that having too much data is just as bad as not
collecting any data. The monitoring tools should gather only the data
required for meaningful analysis.
Without careful analysis, the data collected from monitoring is
useless. Procedures should be put in place to ensure that personnel know
how to analyze the data they collect. The network infrastructure is a
support area that is often overlooked. Be sure you allocate the
appropriate resources for network monitoring.
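As an illustration of the principle of collecting only what you need, the following Python sketch samples a handful of key metrics at a modest interval and flags only threshold violations. It is a minimal sketch, not a production monitoring tool; it assumes the third-party psutil package is installed, and the thresholds and monitored volume are placeholders to be tuned for your environment.

```python
# Minimal monitoring sketch (not a production tool). Assumes the third-party
# psutil package is installed; thresholds and the monitored volume are
# illustrative placeholders.
import time
import psutil

DATA_VOLUME = "C:\\"   # adjust to the volume being monitored
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 90.0}

def sample():
    """Collect only the few metrics needed for meaningful analysis."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage(DATA_VOLUME).percent,
    }

def check(metrics):
    """Return only the metrics that exceed their thresholds."""
    return {name: value for name, value in metrics.items()
            if value > THRESHOLDS[name]}

if __name__ == "__main__":
    while True:
        alerts = check(sample())
        if alerts:
            # In practice, this would feed a monitoring system or send a page.
            print("ALERT:", alerts)
        time.sleep(60)  # sample once a minute to keep overhead low
```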
Resources, training, and documentation are essential to ensuring that
you can manage and maintain mission-critical systems. Many
organizations cripple the operations
team by staffing it minimally. Minimally staffed teams will have marginal
response times and limited effectiveness. The organization must take the
following steps:
-
Staff for success rather than for minimal coverage.
-
Conduct training before deploying new technologies.
-
Keep the training up to date with what’s deployed.
-
Document essential operations procedures.
Every change to hardware,
software, and the network must be planned and executed deliberately. To
do this, you must have established change-control procedures and
well-documented execution plans. Change-control procedures should be
designed to ensure that everyone knows what changes have been made.
Execution plans should be designed to ensure that everyone knows the
exact steps that were or should be performed to make a change.
Change logs are a key part of change control. Each piece of physical hardware deployed in the operational
environment should have a change log. The change log should be stored
in a text document or spreadsheet that is readily accessible to support
personnel. The change log should show the following information, as illustrated in the sketch after this list:
-
Who changed the hardware
-
What change was made
-
When the change was made
-
Why the change was made
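As a minimal illustration of recording these four facts, the following Python sketch appends an entry to a CSV file that can be opened in a spreadsheet. The file path and column names are assumptions for the example, not a prescribed format.

```python
# Illustrative change-log helper. The path and column names are assumptions;
# any text or spreadsheet format that captures who/what/when/why will do.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("change_log.csv")  # hypothetical location
FIELDS = ["when", "who", "what", "why"]

def record_change(who, what, why):
    """Append a single change record, creating the file and header if needed."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "when": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "who": who,
            "what": what,
            "why": why,
        })

record_change("jdoe", "Replaced failed disk in bay 3", "Predictive failure alert")
```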
You should have well-defined backup and recovery plans. The backup plan should specifically state the following information:
-
When full, incremental, differential, and log backups are used
-
How often and at what time backups are performed
-
Whether the backups must be conducted online or offline
-
The amount of data being backed up, as well as how critical the data is
-
The tools used to perform the backups
-
The maximum time allowed for backup and restore
-
How backup media is labeled, recorded, and rotated
Backups should be monitored daily to ensure that they are running
correctly and that the media is good. Any problems with backups should
be corrected immediately. Multiple media sets should be used for
backups, and these media sets should be rotated on a specific schedule.
With a four-set rotation, there is one set for daily, weekly, monthly,
and quarterly backups. By rotating one media set offsite, support staff can help ensure that the organization is protected in case of a disaster.
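The rotation itself is simple date arithmetic. The sketch below shows one possible way to decide which of the four sets a given day's backup belongs to; the promotion rules used (first of the quarter, first of the month, Sundays) are a common convention assumed for illustration, not a standard.

```python
# Illustrative four-set rotation: daily, weekly, monthly, quarterly.
# The promotion rules used here are one common convention, not a standard.
from datetime import date

def media_set_for(day: date) -> str:
    """Pick the media set to use for a given calendar day."""
    if day.month in (1, 4, 7, 10) and day.day == 1:
        return "quarterly"          # first day of each quarter
    if day.day == 1:
        return "monthly"            # first day of the other months
    if day.weekday() == 6:
        return "weekly"             # Sundays
    return "daily"                  # everything else

for d in (date(2012, 10, 1), date(2012, 11, 1), date(2012, 11, 4), date(2012, 11, 5)):
    print(d, "->", media_set_for(d))
# 2012-10-01 -> quarterly, 2012-11-01 -> monthly,
# 2012-11-04 -> weekly (a Sunday), 2012-11-05 -> daily
```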
The recovery plan should provide detailed step-by-step procedures for recovering the system
under various conditions, such as procedures for recovering from hard
disk drive failure or troubleshooting problems with connectivity to the
back-end database. The recovery plan should also include system design
and architecture documentation that details the configuration of
physical hardware,
application-logic components, and back-end data. Along with this
information, support staff should provide a media set containing all
software, drivers, and operating system files needed to recover the
system.
Note
One thing administrators often forget about is spare
parts. Spare parts for key components—such as processors, drives, and
memory—should also be maintained as part of the recovery plan.
You should practice restoring critical business systems using the
recovery plan. Practice shouldn’t be conducted on the production
servers. Instead, the team should practice on test equipment with a
configuration similar to the real production servers. Practicing once a quarter or semiannually is highly recommended.
You should have well-defined, problem-escalation procedures that
document how to handle problems and emergency changes that might be
needed. Some organizations use a three-tiered help desk structure for handling problems:
-
Level 1 support staff forms the front line for handling basic problems. They typically have hands-on access to the hardware,
software, and network components they manage. Their main job is to
clarify and prioritize a problem. If the problem has occurred before and
there is a documented resolution procedure, they can resolve the
problem without escalation. If the problem is new or not recognized,
they must understand how, when, and to whom to escalate it.
-
Level 2 support staff includes more specialized personnel who can
diagnose a particular type of problem and work with others to resolve a
problem, such as system
administrators and network engineers. They usually have remote access
to the hardware, software, and network components they manage. This
allows them to troubleshoot problems remotely and to send out
technicians after they’ve pinpointed the problem.
-
Level 3 support staff includes highly technical personnel who are
subject matter experts, team leaders, or team supervisors. The level 3
team can include support personnel from vendors as well as
representatives from the user community. Together, they form the
emergency-response or crisis-resolution team that is responsible for
resolving crisis situations and planning emergency changes.
All crisis situations and emergencies should be responded to decisively and resolved methodically. A single person on the emergency
response team should be responsible for coordinating all changes and
executing the recovery plan. This same person should be responsible for
writing an after-action report that details the emergency
response and resolution process used. The after-action report should
analyze how the emergency was resolved and what the root cause of the
problem was.
In addition, you should establish procedures for auditing system usage and detecting intrusion. In Windows Server 2012, auditing policies are used to track the successful or failed execution of the following activities (a hedged example of enabling these categories follows the list):
-
Account logon events. Tracks events related to user logon and logoff.
-
Account management. Tracks tasks involved with handling user accounts, such as creating or deleting accounts and resetting passwords.
-
Directory service access. Tracks access to Active Directory Domain Services (AD DS).
-
Object access. Tracks system resource usage for files, directories, and objects.
-
Policy change. Tracks changes to user rights, auditing, and trust relationships.
-
Privilege use. Tracks the use of user rights and privileges.
-
Process tracking. Tracks system processes and resource usage.
-
System events. Tracks system startup, shutdown, restart, and actions that affect system security or the security log.
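These categories are typically configured through Group Policy, but they can also be set from the command line with the auditpol utility. As a hedged illustration, the following Python sketch only prints candidate auditpol commands for success-and-failure auditing; the category names listed are assumptions and should be verified against the output of auditpol /list /category on the target system before anything is run.

```python
# Hedged illustration: emit auditpol commands to enable success/failure auditing.
# The category names are assumptions; verify them with "auditpol /list /category"
# on the target system, or configure auditing through Group Policy instead.
CATEGORIES = [
    "Account Logon",
    "Account Management",
    "DS Access",
    "Object Access",
    "Policy Change",
    "Privilege Use",
    "Detailed Tracking",
    "System",
]

for category in CATEGORIES:
    print(f'auditpol /set /category:"{category}" /success:enable /failure:enable')
```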
You should have an incident-response plan that includes priority
escalation of suspected intrusion to senior team members and provides
step-by-step details on how to handle the intrusion. The
incident-response team should gather information from all network
systems that might be affected. The information should include event
logs, application logs, database logs, and any other pertinent files and
data. The incident-response team should take immediate action to lock
out accounts, change passwords, and physically disconnect the system if
necessary. All team members participating in the response should write a
postmortem report that details the following information:
-
What date and time they were notified and what immediate actions they took
-
Who they notified and what the response was from the notified individual
-
What their assessment of the issue is and the actions necessary to resolve and prevent similar incidents
The team leader should write an executive summary of the incident and forward this to senior management.
The following checklist summarizes the recommendations for operational support of high-availability systems:
-
Monitor hardware, software, and network components 24/7.
-
Ensure that monitoring doesn’t interfere with normal systems operations.
-
Gather only the data required for meaningful analysis.
-
Establish procedures that let personnel know what to look for in the data.
-
Use outside-in monitoring any time systems are externally accessible.
-
Provide adequate resources, training, and documentation.
-
Establish change-control procedures that include change logs.
-
Establish execution plans that detail the change implementation.
-
Create a solid backup plan that includes onsite and offsite tape rotation.
-
Monitor backups, and test backup media.
-
Create a recovery plan for all critical systems.
-
Test the recovery plan on a routine basis.
-
Document how to handle problems and make emergency changes.
-
Use a three-tier support structure to coordinate problem escalation.
-
Form an emergency-response or crisis-resolution team.
-
Write after-action reports that detail the process used.
-
Establish procedures for auditing system usage and detecting intrusion.
-
Create an intrusion response plan with priority escalation.
-
Take immediate action to handle suspected or actual intrusion.
-
Write postmortem reports detailing team reactions to the intrusion.
Planning for deploying highly available servers
You should always create a plan before deploying a business
system. The plan should show everything that must be done before the
system is transitioned into the production environment.
The deployment plan should include the following items:
-
Checklists
-
Contact lists
-
Test plans
-
Deployment schedules
Checklists are a key part of the deployment plan. The purpose of a
checklist is to ensure that the entire deployment team understands the
steps they need to perform. Checklists should list the tasks that must
be performed and designate individuals to handle the tasks during each
phase of the deployment—from planning
to testing to installation. Prior to executing a checklist, the
deployment team should meet to ensure that all items are covered and
that the necessary interactions among team members are clearly
understood. After deployment, the preliminary checklists should become a
part of the system documentation and new checklists should be created
any time the system is updated.
The deployment plan should include a contact list. The contact list
should provide the name, role, telephone number, and email address of
all team members, vendors, and solution-provider representatives.
Alternative numbers for cell phones and pagers should be provided as
well.
The deployment plan should include a test plan. An ideal test plan
has several phases. In Phase I, the deployment team builds the business system and support structures in a test lab. Building the system means accomplishing the following tasks:
-
Creating a test network on which to run the system
-
Putting together the hardware and storage components
-
Installing the operating system and application software
-
Adjusting basic system settings to suit the test environment
-
Configuring clustering, network load balancing, or another high-availability solution, if necessary
The deployment team can conduct any necessary testing and
troubleshooting in the isolated lab environment. The entire system
should undergo burn-in
testing to guard against faulty components. If a component is flawed,
it usually fails in the first few days of operation. Testing doesn’t
stop with burn-in. Web and application servers should be stress
tested. Database servers should be load tested. The results of the
stress and load tests should be analyzed to ensure that the system meets
the performance requirements and expectations of the customer.
Adjustments to the configuration should be made to improve performance
and optimize the configuration for the expected load.
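As a minimal illustration of load testing a web tier, the following Python sketch issues a burst of concurrent HTTP requests and reports the latency distribution. It uses only the standard library; the URL, worker count, and request count are placeholders that would be tuned to approximate the expected production load.

```python
# Minimal load-test sketch using only the standard library.
# URL, worker count, and request count are placeholders for illustration.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://testserver.example.com/"   # hypothetical test endpoint
REQUESTS = 200
WORKERS = 20

def timed_request(_):
    """Issue one request and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"requests: {len(latencies)}")
print(f"median latency: {statistics.median(latencies):.3f}s")
print(f"95th percentile: {latencies[int(len(latencies) * 0.95)]:.3f}s")
```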
In Phase II, the deployment team tests the business system and
support equipment in the deployment location. They conduct similar tests
as before, but in the real-world environment. Again, the results of
these tests should be analyzed to ensure that the system meets the
performance requirements and expectations of the customer. Afterward,
adjustments should be made to improve performance and optimize as
necessary. The team can then deploy the business system.
In Phase III, after the server or servers are deployed, the team performs
limited, nonintrusive testing to ensure that the system is operating
normally. After Phase III testing is completed, the team can use the
operational plans for monitoring and maintenance.
The following checklist summarizes the recommendations for predeployment planning of mission-critical systems:
-
Create a plan that covers the entire testing-to-operations cycle.
-
Use checklists to ensure that the deployment team understands the procedures.
-
Provide a contact list for the team, vendors, and solution providers.
-
Conduct burn-in testing in the lab.
-
Conduct stress and load testing in the lab.
-
Use the test data to optimize and adjust the configuration.
-
Provide follow-on testing in the deployment location.
-
Follow a specific deployment schedule.
-
Use operational plans once final tests are completed.