4. Planning for availability, scalability, and manageability
The enterprise depends on highly available, scalable, and manageable systems. High availability
refers to the ability of a system to withstand hardware, application,
or service outages while continuing to deliver its services. High scalability refers to the ability of the system to expand processor and memory capacity as business needs demand. High manageability
refers to the ability of the system to be managed locally and remotely
and the ease with which components, services, and applications can be
administered.
Planning for high availability, scalability, and manageability is critical to the success of using Windows Server
2012 in the enterprise, and you need a solid understanding of the
recommendations and operating principles for deploying and maintaining
high-availability servers before you deploy servers running these
editions. You should also understand the types of hardware, software, and support facilities needed for enterprise computing.
Note
The discussion that follows focuses on achieving high availability,
scalability, and manageability in the enterprise. Smaller organizations
or business units can adopt similar approaches to meet business
objectives, but they should determine the scope appropriately with
budgets and available resources in mind.
Planning for software needs
Software should be chosen for its ability to support the
high-availability needs of the business system. Not all software is
compatible with high-availability solutions like clustering or load
balancing. Not all software must be compatible, either. Instead of
making an arbitrary decision, you should let the uptime needs of the
application determine the level of availability required.
An availability goal of 99
percent uptime is typical for most noncritical business systems. If an
application must have 99 percent uptime, the application might not need
to support clustering or load balancing. Achieving 99 percent uptime
means that the application can have about 88 hours of downtime in an
entire year, or roughly 100 minutes of downtime a week.
An availability goal of 99.9
percent uptime is typical for highly available business
systems. If an application must have 99.9 percent uptime, the
application must support some type of high-availability solution, such
as clustering or load balancing. Achieving 99.9 percent uptime means
that the application has less than 9 hours of downtime in an entire year
or, put another way, less than 10 minutes of downtime a week.
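To put these targets in perspective, a downtime budget is simply the complement of the uptime percentage multiplied by the period of interest. The following Python sketch, included only for illustration, reproduces the figures quoted above:

```python
# Illustrative only: convert an availability target into a downtime budget.
HOURS_PER_YEAR = 365 * 24        # 8,760 hours
MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080 minutes

def downtime_budget(availability_pct):
    """Return (hours of downtime per year, minutes of downtime per week)."""
    downtime_fraction = 1 - (availability_pct / 100)
    return (HOURS_PER_YEAR * downtime_fraction,
            MINUTES_PER_WEEK * downtime_fraction)

for target in (99.0, 99.9):
    per_year, per_week = downtime_budget(target)
    print(f"{target}% uptime allows about {per_year:.1f} hours/year "
          f"or {per_week:.1f} minutes/week of downtime")

# 99.0% -> about 87.6 hours/year, or about 100.8 minutes/week
# 99.9% -> about 8.8 hours/year, or about 10.1 minutes/week
```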
To evaluate the real-world environment prior to deployment, you should perform integration testing on applications that will be used together. The purpose of integration
testing is to ensure that disparate applications interact as expected
and to uncover problem areas if they don’t. During integration testing,
testers should look at system performance and overall system
utilization, as well as compatibility. Testing should be repeated prior
to releasing system or application changes to a production environment.
You should standardize the software components needed to provide
system services. The goal of standardization is to set guidelines for
software components and technologies that will be used in the
enterprise. Standardization accomplishes the following:
-
Reduces the total cost of maintaining and updating software
-
Reduces the amount of integration and compatibility testing needed for upgrades
-
Improves recovery time because problems are easier to troubleshoot
-
Reduces the amount of training needed for administration support
Software standardization isn’t meant to limit the organization to a
single specification. Over the life of a data center, new application
versions, software components, and technologies will be introduced, and
the organization can implement new standards and specifications as
necessary. The key to success lies in ensuring that there is a standard
process for deploying software updates and new technologies. The
standard process must include the following:
-
Software compatibility and integration testing
-
Software support training for personnel
-
Predeployment planning
-
Step-by-step software deployment checklists
-
Postdeployment monitoring and maintenance
The following checklist summarizes the recommendations for designing and planning software for high availability:
-
Choose software that meets the availability needs of the solution or service.
-
Choose software that supports online backups.
-
Test software for compatibility with other applications.
-
Test software integration with other applications.
-
Repeat testing prior to releasing updates.
-
Create and enforce software standards.
-
Define a standard process for deploying software updates.
Planning for hardware needs
A sound hardware strategy helps increase system availability while
reducing total cost of ownership and improving recovery times. Windows
Server 2012 is designed and tested for use with high-performance
hardware, applications, and services. To ensure that hardware components
are compatible, choose only components that are certified as
compatible, such as those that are listed as certified for Windows
Server 2012 in the Windows Server Catalog (http://www.windowsservercatalog.com/).
Note
All certified components undergo rigorous testing, with a retest for
firmware revisions, service pack updates, and other minor revisions.
After a component is certified through testing, hardware vendors must
maintain the configuration through updates and resubmit the component
for testing and certification. The program requirements and the tight
coordination with vendors greatly improve the reliability and
availability of Windows Server 2012. All hardware certified for Windows
Server 2012 also is fully supported in Hyper-V environments.
You should standardize on a hardware platform, and this platform should have standardized components. Standardization accomplishes the following:
-
Reduces the amount of training needed for support
-
Reduces the amount of testing needed for upgrades
-
Requires fewer spare parts because subcomponents are the same
-
Improves recovery time because problems are easier to troubleshoot
Standardization isn’t meant to restrict a data center to a single type of server. In an n-tier
environment, standardization typically means choosing a standard server
configuration for the front-end servers, a standard server
configuration for middle-tier business logic, and a standard server
configuration for back-end data services. The reason for this is that
web servers, application servers, and database servers all have
different resource needs. For example, although a web server might need
to run on a dual-processor system with limited hardware RAID control and
4 gigabytes (GB) of random access memory (RAM), a database server
might need to run on an eight-way system with dual-channel RAID control
and 64 GB of RAM.
Standardization isn’t meant to limit the organization to a single
hardware specification either. Over the life of a data center, new
equipment will be introduced and old equipment likely will become
unavailable. To keep up with the pace of change, new standards and
specifications should be implemented when necessary. These standards and
specifications, like the previous ones, should be published and made
available to support personnel.
Redundancy and fault tolerance must be built into the hardware design at all levels to improve availability. You can improve hardware redundancy by using the following components:
-
Clusters. Clusters provide failover support for critical applications and services.
-
Standby systems. Standby systems provide backup systems in case of total failure of a primary system.
-
Spare parts. Spare parts ensure that replacement parts are available in case of failure.
-
Fault-tolerant components. Fault-tolerant components improve the internal redundancy of the system.
Storage devices, network components, cooling fans, and power supplies
all can be configured for fault tolerance. For storage devices, you
should be sure to use multiple disk controllers, hot-swappable drives,
and redundant drive arrays. For network components, you should look well
beyond the network adapter and also consider whether fault tolerance is
needed for routers, switches, firewalls, load balancers, and other
network equipment.
A standard process for deploying hardware must be defined and distributed to all support personnel. The standard process must include the following:
-
Hardware compatibility and integration testing
-
Hardware support training for personnel
-
Predeployment planning
-
Step-by-step hardware deployment checklists
-
Postdeployment monitoring and maintenance
The following checklist summarizes the recommendations for designing and planning hardware for high availability:
-
Choose hardware that is listed on the Hardware Compatibility List (HCL).
-
Create and enforce hardware standards.
-
Use redundant hardware whenever possible.
-
Use fault-tolerant hardware whenever possible.
-
Provide a secure physical environment for hardware.
-
Define a standard process for deploying hardware.
If possible, add these recommendations to the preceding checklist:
-
Use fully redundant internal networks, from servers to border routers.
-
Use direct peering to major tier-1 telecommunications carriers.
-
Use redundant external connections for data and telephony.
-
Use a direct connection with high-speed lines.
Planning for support structures and facilities
The physical structures and facilities supporting your server room are critically important. Without adequate support
structures and facilities, even well-designed systems are vulnerable to avoidable outages. The primary
considerations for support structures and facilities have to do with the
physical environment of the servers. These considerations also extend
to the physical security of the server environment.
Just as hardware and software have availability requirements, so should support structures and facilities. The following factors affect the physical environment:
Temperature and humidity should be carefully controlled at all times.
Processors, memory, hard drives, and other pieces of physical equipment
operate most efficiently when they are kept cool; between 65 and 70
degrees Fahrenheit is the ideal temperature in most situations.
Equipment that overheats can malfunction or cease to operate altogether.
Servers should have multiple redundant internal fans to ensure that
these and other internal hardware devices are kept cool.
Important
You should pay particular attention to fast-running processors and
hard drives. Typically, fast processors and hard drives can become
overheated and need additional cooling fans—even if the surrounding
environment is cool.
Humidity should be kept low to prevent condensation, but the
environment shouldn’t be dry. A dry climate can contribute to static
electricity problems. Antistatic devices and static guards should be
used in most environments.
Dust and other contaminants can cause hardware components to overheat
or short out. Servers should be protected from these contaminants
whenever possible. You should ensure that an air-filtration system is in
place in the server room or hosting facility that is used. The regular
preventive maintenance cycle on the servers should include checking
servers and their cabinets for dust and other contaminants. If dust is
found, the servers and cabinets should be carefully cleaned.
Few things affect the physical environment more than wiring
and cabling. All electrical wires and network cables should be tested
and certified by qualified technicians. Electrical wiring should be
configured to ensure that servers and other equipment have adequate
power available for peak usage times. Ideally, multiple dedicated
circuits should be used to provide power.
Improperly installed network cables
are a common cause of communications problems. Network cables should be
tested to ensure that their operation meets manufacturer
specifications. Redundant cables should be installed to ensure the availability of the network. All wiring
and cabling should be labeled and well maintained. Whenever possible,
use cable management systems and tie wraps to prevent physical damage to wiring.
Ensuring that servers and their components have power is also important. Servers should have hot-swappable, redundant power supplies. Being hot-swappable ensures that the power
supply can be replaced without having to turn off the server.
Redundancy ensures that one power supply can malfunction and the other
will still deliver power to the server. You should be aware that having
multiple power supplies doesn’t mean that a server or hardware
component has redundancy. Some hardware components require multiple
power supplies to operate. In this case, an additional (third or fourth)
power supply is needed to provide redundancy.
The redundant power supplies should be plugged into separate power
strips, and these power strips should be plugged into separate local
uninterruptible power supply (UPS)
units if other backup power sources aren’t available. Some facilities
have enterprise UPS units that provide power for an entire room or
facility. If this is the case, redundant UPS systems should be
installed. To protect against long-term outages, gas-powered or
diesel-powered generators should be installed. Most hosting and
colocation facilities have generators. But having a generator isn’t
enough; the generator must be rated to support
the peak power needs of all installed equipment. If the generator
cannot support the installed equipment, brownouts (periods of reduced power)
or outright outages can occur.
Caution
A fire-suppression system should be installed to protect against fire.
Dual gas-based systems are preferred because these systems do not harm
hardware when they go off. Water-based sprinkler systems, on the other
hand, can destroy hardware.
In addition, access controls should be used to restrict physical
access to the server room or facility. Use locks, key cards, access
codes, or biometric scanners to ensure that only designated individuals
can gain entry to the secure area. If possible, use surveillance cameras
and maintain recorded tapes for at least a week. When the servers are
deployed in a hosting or colocation facility, ensure that locked cages
are used and that fencing extends from the floor to the ceiling.
The following checklist summarizes the recommendations for designing and planning structures and facilities:
-
Maintain the temperature at 65 to 70 degrees Fahrenheit.
-
Maintain low humidity (but not dry).
-
Install redundant internal cooling fans.
-
Use an air-filtration system.
-
Check for dust and other contaminants periodically.
-
Install hot-swappable, redundant power supplies.
-
Test and certify wiring and cabling.
-
Use wire management to protect cables from damage.
-
Label hardware and cables.
-
Install backup power sources, such as UPS and generators.
-
Install seismic protection and bracing.
-
Install dual, gas-based, fire-suppression systems.
-
Restrict physical access by using locks, key cards, access codes, and so forth.
-
Use surveillance cameras, and maintain recorded tapes (if possible).
-
Use locked cages, cabinets, and racks at offsite facilities.
-
Use floor-to-ceiling fencing with cages at offsite facilities.
Planning for day-to-day operations
Day-to-day operations and support
procedures must be in place before you deploy mission-critical systems.
The most critical procedures for day-to-day operations involve the
following activities:
-
Monitoring and analysis
-
Resources, training, and documentation
-
Change control
-
Problem escalation procedures
-
Backup and recovery procedures
-
Postmortem after recovery
-
Auditing and intrusion detection
Monitoring
is critical to the success of business system deployments. You must
have the necessary equipment to monitor the status of the business
system. Monitoring allows you to be proactive in system support rather
than reactive. Monitoring should extend to the hardware,
software, and network components but shouldn’t interfere with normal
systems operations—that is, the monitoring tools chosen should require
limited system and network resources to operate.
Note
Keep in mind that having too much data is just as bad as not
collecting any data. The monitoring tools should gather only the data
required for meaningful analysis.
Without careful analysis, the data collected from monitoring is
useless. Procedures should be put in place to ensure that personnel know
how to analyze the data they collect. The network infrastructure is a
support area that is often overlooked. Be sure you allocate the
appropriate resources for network monitoring.
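As an illustration of the principle of collecting only what you need, the following Python sketch samples a handful of key metrics at a modest interval and flags only threshold violations. It is a minimal sketch, not a production monitoring tool; it assumes the third-party psutil package is installed, and the thresholds and monitored volume are placeholders to be tuned for your environment.

```python
# Minimal monitoring sketch (not a production tool). Assumes the third-party
# psutil package is installed; thresholds and the monitored volume are
# illustrative placeholders.
import time
import psutil

DATA_VOLUME = "C:\\"   # adjust to the volume being monitored
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 90.0}

def sample():
    """Collect only the few metrics needed for meaningful analysis."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage(DATA_VOLUME).percent,
    }

def check(metrics):
    """Return only the metrics that exceed their thresholds."""
    return {name: value for name, value in metrics.items()
            if value > THRESHOLDS[name]}

if __name__ == "__main__":
    while True:
        alerts = check(sample())
        if alerts:
            # In practice, this would feed a monitoring system or send a page.
            print("ALERT:", alerts)
        time.sleep(60)  # sample once a minute to keep overhead low
```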
Resources, training, and documentation are essential to ensuring that
you can manage and maintain mission-critical systems. Many
organizations cripple the operations
team by staffing it minimally. Minimally staffed teams will have marginal
response times and limited effectiveness. The organization must take the
following steps:
-
Staff for success rather than for minimal coverage.
-
Conduct training before deploying new technologies.
-
Keep the training up to date with what’s deployed.
-
Document essential operations procedures.
Every change to hardware,
software, and the network must be planned and executed deliberately. To
do this, you must have established change-control procedures and
well-documented execution plans. Change-control procedures should be
designed to ensure that everyone knows what changes have been made.
Execution plans should be designed to ensure that everyone knows the
exact steps that were or should be performed to make a change.
Change logs are a key part of change control. Each piece of physical hardware deployed in the operational
environment should have a change log. The change log should be stored
in a text document or spreadsheet that is readily accessible to support
personnel. The change log should show the following information, as illustrated in the sketch after this list:
-
Who changed the hardware
-
What change was made
-
When the change was made
-
Why the change was made
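As a minimal illustration of recording these four facts, the following Python sketch appends an entry to a CSV file that can be opened in a spreadsheet. The file path and column names are assumptions for the example, not a prescribed format.

```python
# Illustrative change-log helper. The path and column names are assumptions;
# any text or spreadsheet format that captures who/what/when/why will do.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("change_log.csv")  # hypothetical location
FIELDS = ["when", "who", "what", "why"]

def record_change(who, what, why):
    """Append a single change record, creating the file and header if needed."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "when": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "who": who,
            "what": what,
            "why": why,
        })

record_change("jdoe", "Replaced failed disk in bay 3", "Predictive failure alert")
```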
You should have well-defined backup and recovery plans. The backup plan should specifically state the following information:
-
When full, incremental, differential, and log backups are used
-
How often and at what time backups are performed
-
Whether the backups must be conducted online or offline
-
The amount of data being backed up, as well as how critical the data is
-
The tools used to perform the backups
-
The maximum time allowed for backup and restore
-
How backup media is labeled, recorded, and rotated
Backups should be monitored daily to ensure that they are running
correctly and that the media is good. Any problems with backups should
be corrected immediately. Multiple media sets should be used for
backups, and these media sets should be rotated on a specific schedule.
With a four-set rotation, there is one set for daily, weekly, monthly,
and quarterly backups. By rotating one media set offsite, support staff can help ensure that the organization is protected in case of a disaster.
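The rotation itself is simple date arithmetic. The sketch below shows one possible way to decide which of the four sets a given day's backup belongs to; the promotion rules used (first of the quarter, first of the month, Sundays) are a common convention assumed for illustration, not a standard.

```python
# Illustrative four-set rotation: daily, weekly, monthly, quarterly.
# The promotion rules used here are one common convention, not a standard.
from datetime import date

def media_set_for(day: date) -> str:
    """Pick the media set to use for a given calendar day."""
    if day.month in (1, 4, 7, 10) and day.day == 1:
        return "quarterly"          # first day of each quarter
    if day.day == 1:
        return "monthly"            # first day of the other months
    if day.weekday() == 6:
        return "weekly"             # Sundays
    return "daily"                  # everything else

for d in (date(2012, 10, 1), date(2012, 11, 1), date(2012, 11, 4), date(2012, 11, 5)):
    print(d, "->", media_set_for(d))
# 2012-10-01 -> quarterly, 2012-11-01 -> monthly,
# 2012-11-04 -> weekly (a Sunday), 2012-11-05 -> daily
```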
The recovery plan should provide detailed step-by-step procedures for recovering the system
under various conditions, such as procedures for recovering from hard
disk drive failure or troubleshooting problems with connectivity to the
back-end database. The recovery plan should also include system design
and architecture documentation that details the configuration of
physical hardware,
application-logic components, and back-end data. Along with this
information, support staff should provide a media set containing all
software, drivers, and operating system files needed to recover the
system.
Note
One thing administrators often forget about is spare
parts. Spare parts for key components—such as processors, drives, and
memory—should also be maintained as part of the recovery plan.
You should practice restoring critical business systems using the
recovery plan. Practice shouldn’t be conducted on the production
servers. Instead, the team should practice on test equipment with a
configuration similar to the real production servers. Practicing once a quarter or semiannually is highly recommended.
You should have well-defined, problem-escalation procedures that
document how to handle problems and emergency changes that might be
needed. Some organizations use a three-tiered help desk structure for handling problems:
-
Level 1 support staff forms the front line for handling basic problems. They typically have hands-on access to the hardware,
software, and network components they manage. Their main job is to
clarify and prioritize a problem. If the problem has occurred before and
there is a documented resolution procedure, they can resolve the
problem without escalation. If the problem is new or not recognized,
they must understand how, when, and to whom to escalate it.
-
Level 2 support staff includes more specialized personnel who can
diagnose a particular type of problem and work with others to resolve a
problem, such as system
administrators and network engineers. They usually have remote access
to the hardware, software, and network components they manage. This
allows them to troubleshoot problems remotely and to send out
technicians after they’ve pinpointed the problem.
-
Level 3 support staff includes highly technical personnel who are
subject matter experts, team leaders, or team supervisors. The level 3
team can include support personnel from vendors as well as
representatives from the user community. Together, they form the
emergency-response or crisis-resolution team that is responsible for
resolving crisis situations and planning emergency changes.
All crisis situations and emergencies should be responded to decisively and resolved methodically. A single person on the emergency
response team should be responsible for coordinating all changes and
executing the recovery plan. This same person should be responsible for
writing an after-action report that details the emergency
response and resolution process used. The after-action report should
analyze how the emergency was resolved and what the root cause of the
problem was.
In addition, you should establish procedures for auditing system usage and detecting intrusion. In Windows Server 2012, auditing policies are used to track the successful or failed execution of the following activities (a hedged example of enabling these categories follows the list):
-
Account logon events. Tracks events related to user logon and logoff.
-
Account management. Tracks tasks involved with handling user accounts, such as creating or deleting accounts and resetting passwords.
-
Directory service access. Tracks access to Active Directory Domain Services (AD DS).
-
Object access. Tracks system resource usage for files, directories, and objects.
-
Policy change. Tracks changes to user rights, auditing, and trust relationships.
-
Privilege use. Tracks the use of user rights and privileges.
-
Process tracking. Tracks system processes and resource usage.
-
System events. Tracks system startup, shutdown, restart, and actions that affect system security or the security log.
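These categories are typically configured through Group Policy, but they can also be set from the command line with the auditpol utility. As a hedged illustration, the following Python sketch only prints candidate auditpol commands for success-and-failure auditing; the category names listed are assumptions and should be verified against the output of auditpol /list /category on the target system before anything is run.

```python
# Hedged illustration: emit auditpol commands to enable success/failure auditing.
# The category names are assumptions; verify them with "auditpol /list /category"
# on the target system, or configure auditing through Group Policy instead.
CATEGORIES = [
    "Account Logon",
    "Account Management",
    "DS Access",
    "Object Access",
    "Policy Change",
    "Privilege Use",
    "Detailed Tracking",
    "System",
]

for category in CATEGORIES:
    print(f'auditpol /set /category:"{category}" /success:enable /failure:enable')
```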
You should have an incident-response plan that includes priority
escalation of suspected intrusion to senior team members and provides
step-by-step details on how to handle the intrusion. The
incident-response team should gather information from all network
systems that might be affected. The information should include event
logs, application logs, database logs, and any other pertinent files and
data. The incident-response team should take immediate action to lock
out accounts, change passwords, and physically disconnect the system if
necessary. All team members participating in the response should write a
postmortem report that details the following information:
-
What date and time they were notified and what immediate actions they took
-
Who they notified and what the response was from the notified individual
-
What their assessment of the issue is and the actions necessary to resolve and prevent similar incidents
The team leader should write an executive summary of the incident and forward this to senior management.
The following checklist summarizes the recommendations for operational support of high-availability systems:
-
Monitor hardware, software, and network components 24/7.
-
Ensure that monitoring doesn’t interfere with normal systems operations.
-
Gather only the data required for meaningful analysis.
-
Establish procedures that let personnel know what to look for in the data.
-
Use outside-in monitoring any time systems are externally accessible.
-
Provide adequate resources, training, and documentation.
-
Establish change-control procedures that include change logs.
-
Establish execution plans that detail the change implementation.
-
Create a solid backup plan that includes onsite and offsite tape rotation.
-
Monitor backups, and test backup media.
-
Create a recovery plan for all critical systems.
-
Test the recovery plan on a routine basis.
-
Document how to handle problems and make emergency changes.
-
Use a three-tier support structure to coordinate problem escalation.
-
Form an emergency-response or crisis-resolution team.
-
Write after-action reports that detail the process used.
-
Establish procedures for auditing system usage and detecting intrusion.
-
Create an intrusion response plan with priority escalation.
-
Take immediate action to handle suspected or actual intrusion.
-
Write postmortem reports detailing team reactions to the intrusion.
Planning for deploying highly available servers
You should always create a plan before deploying a business
system. The plan should show everything that must be done before the
system is transitioned into the production environment.
The deployment plan should include the following items:
-
Checklists
-
Contact lists
-
Test plans
-
Deployment schedules
Checklists are a key part of the deployment plan. The purpose of a
checklist is to ensure that the entire deployment team understands the
steps they need to perform. Checklists should list the tasks that must
be performed and designate individuals to handle the tasks during each
phase of the deployment—from planning
to testing to installation. Prior to executing a checklist, the
deployment team should meet to ensure that all items are covered and
that the necessary interactions among team members are clearly
understood. After deployment, the preliminary checklists should become a
part of the system documentation and new checklists should be created
any time the system is updated.
The deployment plan should include a contact list. The contact list
should provide the name, role, telephone number, and email address of
all team members, vendors, and solution-provider representatives.
Alternative numbers for cell phones and pagers should be provided as
well.
The deployment plan should include a test plan. An ideal test plan
has several phases. In Phase I, the deployment team builds the business system and support structures in a test lab. Building the system means accomplishing the following tasks:
-
Creating a test network on which to run the system
-
Putting together the hardware and storage components
-
Installing the operating system and application software
-
Adjusting basic system settings to suit the test environment
-
Configuring clustering, network load balancing, or another high-availability solution, if necessary
The deployment team can conduct any necessary testing and
troubleshooting in the isolated lab environment. The entire system
should undergo burn-in
testing to guard against faulty components. If a component is flawed,
it usually fails in the first few days of operation. Testing doesn’t
stop with burn-in. Web and application servers should be stress
tested. Database servers should be load tested. The results of the
stress and load tests should be analyzed to ensure that the system meets
the performance requirements and expectations of the customer.
Adjustments to the configuration should be made to improve performance
and optimize the configuration for the expected load.
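As a minimal illustration of load testing a web tier, the following Python sketch issues a burst of concurrent HTTP requests and reports the latency distribution. It uses only the standard library; the URL, worker count, and request count are placeholders that would be tuned to approximate the expected production load.

```python
# Minimal load-test sketch using only the standard library.
# URL, worker count, and request count are placeholders for illustration.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://testserver.example.com/"   # hypothetical test endpoint
REQUESTS = 200
WORKERS = 20

def timed_request(_):
    """Issue one request and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"requests: {len(latencies)}")
print(f"median latency: {statistics.median(latencies):.3f}s")
print(f"95th percentile: {latencies[int(len(latencies) * 0.95)]:.3f}s")
```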
In Phase II, the deployment team tests the business system and
support equipment in the deployment location. They conduct similar tests
as before, but in the real-world environment. Again, the results of
these tests should be analyzed to ensure that the system meets the
performance requirements and expectations of the customer. Afterward,
adjustments should be made to improve performance and optimize as
necessary. The team can then deploy the business system.
In Phase III, after the server or servers are deployed, the team performs
limited, nonintrusive testing to ensure that the system is operating
normally. After Phase III testing is completed, the team can use the
operational plans for monitoring and maintenance.
The following checklist summarizes the recommendations for predeployment planning of mission-critical systems:
-
Create a plan that covers the entire testing-to-operations cycle.
-
Use checklists to ensure that the deployment team understands the procedures.
-
Provide a contact list for the team, vendors, and solution providers.
-
Conduct burn-in testing in the lab.
-
Conduct stress and load testing in the lab.
-
Use the test data to optimize and adjust the configuration.
-
Provide follow-on testing in the deployment location.
-
Follow a specific deployment schedule.
-
Use operational plans once final tests are completed.