Developing and implementing a
SharePoint disaster recovery plan is not easy, because there are so
many components integrated with SharePoint. A medium-scale or larger
SharePoint Server installation has many infrastructure dependencies as
well as core components like Web front-end servers, search servers, and
database servers. Nearly all of your SharePoint information is stored
in SQL Server databases, but there are several other areas of concern
when developing a disaster recovery plan for SharePoint. For instance,
you have information stored in other applications, such as IIS and DNS,
and even at the file system level. You also have to be concerned about
hardware—hard drives, routers, switches, cables, and so on.
Disaster planning should
also encompass the implementation of best practices to avoid or
minimize the chance of a catastrophic event occurring in the first
place. If you take the time to plan carefully, using a three-step
process involving education, documentation, and preparation, you can
build a comprehensive—and successful—disaster recovery plan that will
benefit your organization in the short term and the long run.
1. Education
The education phase of disaster
recovery planning involves the process of familiarizing yourself with
all the integrated SharePoint components, so you know what you will need
to do in the recovery process to minimize the disruption of your
business infrastructure. Not all of the components you must be concerned
with are contained within SharePoint, but because SharePoint is
integrated so tightly with so many other applications, it is dependent
on many of them to function.
1.1. Server Operating System
The most obvious
component that SharePoint 2010 depends on is the Windows Server
operating system. You should create a new operating system image of all
non-database servers, and the image should contain all the service packs
and patches that you have applied. You can use this image to quickly
restore the operating system before SharePoint is reinstalled. However,
be cautious; you should keep a different image for each farm server
role, because changing the SID (Security
Identifier) of a single image to create multiple SharePoint 2010 WFEs
and application servers is not a recommended practice. Be sure to update
your images every time you add a service pack or patch and when the SharePoint Root changes.
Note:
The SharePoint Root is located at
C:\Program Files\Common Files\Microsoft Shared\web server extensions\14\.
The term “SharePoint Root” replaces the phrase “14 Hive.”
You should create a network
drive with all of your system images, installation sources, patches, and
third-party software additions. Schedule backups for this drive at
least once a week. This practice will allow you to rapidly restore the
server while retaining your SharePoint 2010 farm consistency.
1.2. SQL Server
If you could back up only one
server in your SharePoint 2010 farm, it would have to be your SQL
Server. SQL Server contains more than 95 percent of your SharePoint
information. SQL Server stores configuration information about your
entire farm, the site collection content of your Web applications, your
Web application settings, service application information, performance
information, and several other important bits of SharePoint information.
If you aren’t also the SQL
Server database administrator (DBA), you should introduce yourself to
the database administrator(s) who are managing the SQL Server instance
or instances that are hosting your SharePoint content. Take the time to
become familiar with their schedules, backup strategies, database
failover options, and anything else they are willing to share with you
about the SharePoint databases.
1.3. Internet Information Services
All SharePoint content is
accessed through a Web service hosted by Internet Information Services
(IIS). The Web application and application pool configuration that you
create from within SharePoint is stored in the farm configuration database.
However, any changes you make directly in the IIS Manager are not stored
in the SharePoint farm configuration database; they are stored in an
IIS configuration file. For instance, if you add an additional host
header to a Web application using IIS Manager, it is not stored in the
farm configuration database—it is stored in an IIS 7 configuration file.
The foundation of your
Web application information stored in IIS is the configuration file.
This is a repository for your IIS configuration information located in
the directory C:\Windows\System32\inetsrv\config. This IIS configuration
file is an XML file called Applicationhost.config,
and you should update it only by using the IIS Manager application or
the Appcmd.exe command-line tool. You should back up your IIS
configuration file regularly so that you have an up-to-date version if
you lose the IIS server hosting your SharePoint Web applications.
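A regular backup of the IIS configuration file can be as simple as a timestamped copy. The following is a minimal sketch in Python, not a prescribed method. The function name `backup_iis_config` and the backup directory are hypothetical, and the source path is the default location named above.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional

# Default location named in the text; adjust it if your installation differs.
IIS_CONFIG = Path(r"C:\Windows\System32\inetsrv\config\applicationHost.config")


def backup_iis_config(source: Path, backup_dir: Path,
                      now: Optional[datetime] = None) -> Path:
    """Copy the IIS configuration file into backup_dir under a timestamped name."""
    now = now or datetime.now()
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Example result: applicationHost-20240102-030405.config
    target = backup_dir / f"{source.stem}-{now:%Y%m%d-%H%M%S}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves the file's timestamps
    return target
```

You could call `backup_iis_config(IIS_CONFIG, Path(r"D:\Backups\IIS"))` from a scheduled task, where `D:\Backups\IIS` stands in for whatever central repository you use. Keeping one file per run makes it possible to roll back to the configuration from a specific date.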
1.4. Third-Party Software
Most organizations
have third-party solutions running on their SharePoint 2010 server
farms. This might include backup software, Web parts, language packs,
antivirus software, and custom code. Become familiar with this software
and document how it is installed. Document any installation keys that
are required and keep the installation media in a central location that
is easily accessible during a recovery process. Be sure to reinstall any
third-party Web Parts and custom code before redeploying your Web
front-end (WFE) servers. Forgetting to do so on a load-balanced WFE will
result in page errors and an inconsistent experience for the end user.
As part of your disaster
recovery planning, you should be cautious about installing products
that extend the time required to recover your farm. Choose third-party
solutions that can be reinstalled quickly, so that they do not delay the
restoration of your farm.
1.5. Network Components
Since SharePoint 2010
hosts its content through a Web service and is network dependent, being
familiar with all of the connection components is crucial to recovery or
continuity of services. Be sure to include your network team in your disaster
recovery planning process at an early stage to discuss and document all
connecting pieces. The following list provides some examples of
components you should discuss with your network team.
Switches: Redundancy, virtual LANs, Network Interface Card (NIC) teaming, port speed, duplex, dedicated backup LANs
Routers: Redundant paths, latency, hardware load balancing
Firewalls: Rules, redundancy, OS version
SAN (Storage Area Network): Compatibility, capacity, speed, Host Bus Adapter
Cabling and electrical topology: Redundant cabling, processes for working in your raised floor, redundant power, uninterruptible power supplies, generators
Note:
If you are using an
Internet Service Provider (ISP), be sure to get a service level
agreement (SLA) that defines their strategies and obligations regarding
the services they provide.
1.6. Central Administration
With the exception of the SQL
Server and the operating system, the server hosting the Central
Administration Web application is the most important component in the recovery
of a SharePoint installation. If you experience a complete loss of
service, you will need to bring up the Central Administration server
first and use it to re-establish connections to your SharePoint
databases. You can use your Central Administration server Web
application console to access the Backup And Restore user interface
(UI), or optionally, use the STSADM or Windows PowerShell command-line
tools. You can restore this server from a system image or by using the
Windows Server Backup utility. After completing your restores, be sure
to verify that your SharePoint installation–specific services are
running using Central Administration.
1.7. Web Front-End Servers
In an out-of-the-box
SharePoint 2010 implementation, Web front-end servers (WFEs) are
stateless servers, meaning that they don’t track client access, and any
WFE can serve your SharePoint data. This eases restoration of a WFE by
allowing you to install the application binaries and then connect to an
existing SQL Server configuration database. The SQL Server configuration
database populates any required information on the WFE to serve
SharePoint content. The exception to this is when you are customizing
Web application content. As an example, many WFEs will have branded
images, custom pages, excluded managed paths, Web Parts, and specialized
authentication mechanisms. All of these must be reinstalled after a WFE
system rebuild, which reinforces the need to carefully document
customized environments.
1.8. Search Server
If your indexes are not large,
rebuilding the index after a system image restore is an efficient way to
restore current search and query functionality. Alternatively, you can
reinstall SharePoint to an existing farm and enable it as a Search
server in Central Administration. Conversely, if your index sizes are
measured in gigabytes or terabytes, you will want to back up your
indexes so they can be restored, providing a reasonably timed return to
service. If you don’t back up large content indexes, your search results
can be incomplete for hours or even days, depending on the size of your
content sources and the speed of your hardware.
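To decide between rebuilding and backing up an index, a rough estimate of the rebuild time can help. The sketch below is simple arithmetic under a strong assumption: that the crawl proceeds at a constant rate. The function name and the sample figures are hypothetical. Measure the crawl rate on your own hardware before relying on the result.

```python
def estimated_crawl_hours(item_count: int, items_per_second: float) -> float:
    """Estimate hours needed to recrawl item_count items at a constant rate."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    seconds = item_count / items_per_second
    return seconds / 3600


# Hypothetical figures: 5 million items crawled at 50 items per second.
hours = estimated_crawl_hours(5_000_000, 50)
print(f"Estimated rebuild time: {hours:.1f} hours")
```

With these sample figures the estimate comes to a little under 28 hours, which would argue for backing up the index rather than rebuilding it.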
Note:
MORE INFO A good source for more information about Search servers and indexing is the Microsoft Office SharePoint Server 2007 Best Practices (Microsoft Press, 2008).
1.9. Service Applications
You can use any of the
SharePoint backup tools to back up your service applications,
or you can perform a full farm backup that includes all service
applications. Don’t forget that the flexibility of SharePoint 2010
allows for an easy reinstallation of your service applications should
one of them fail. Also, if your organization relies heavily on a
particular service application, you may benefit from having multiple
instances of that service application hosted on your farm.
2. Documentation
Documentation ensures that
you have identified and defined the remedies necessary to recover all
components of your SharePoint farm. There are two categories of
documentation: SharePoint-dependent documentation and SharePoint component
documentation. The SharePoint-dependent items you need to document
include all dependent software, hardware, and network components
supporting your SharePoint installation.
You also should document
all SharePoint-specific components, including Central Administration
settings, search and index settings, WFE settings, and service application
settings. By documenting the SharePoint components and their
dependencies, you will be able to recover your entire SharePoint farm or
a subset of the farm. Organizations that document and prepare for
disaster can swiftly react and stay operational after any type of
catastrophe.
You should
have detailed installation documentation that defines every setting and
keystroke required to completely rebuild each server. Document every
nuance of your servers, including items like WFE SharePoint Root
customizations, and you won’t have to worry about missed configuration
options and forgotten software when rebuilding servers. Create a
separate document for each server and include all relevant hardware
information—the server name, BIOS and backplane versions, network
interface cards, RAID controllers, and so on. Documenting your hardware
configuration makes it easier to troubleshoot, download correct drivers,
and effectively communicate with technical support in the event of
failure.
You should also document all
service packs, hotfixes, antivirus programs, and other software
additions. When you have servers in a load-balanced cluster, it is very
important that all machines have an identical configuration. If months
have passed since a server build and you haven’t documented additions,
you will almost certainly forget a Web Part or similar piece of software
when you have to restore the server. This sort of omission can create
an inconsistent, negative user experience that can be very difficult to
troubleshoot.
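If you keep the documented software list for each load-balanced server in a structured form, you can check for omissions mechanically. The following Python sketch assumes that each server's inventory is recorded as a set of component names. The function name, server names, and component names are all hypothetical.

```python
from typing import Dict, Set


def find_missing_components(inventories: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each server, list components found elsewhere in the cluster but not on it."""
    # The union of all inventories is what an identically configured server should have.
    everything = set().union(*inventories.values())
    return {
        server: everything - installed
        for server, installed in inventories.items()
        if everything - installed
    }


# Hypothetical inventories taken from the server documents.
cluster = {
    "WFE1": {"Service Pack 1", "Antivirus", "Custom Web Part"},
    "WFE2": {"Service Pack 1", "Antivirus"},
}
print(find_missing_components(cluster))  # {'WFE2': {'Custom Web Part'}}
```

A server that does not appear in the result already has every component documented for the rest of the cluster.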
Note:
BEST PRACTICES Back up your disaster recovery documents to a location that is readily accessible and easily restored in the event of a disaster.
If you have your documentation
stored only on your SharePoint site and SharePoint fails, you will not
be able to use this documentation. Store hard copies of all of your disaster
recovery documents onsite and offsite. In addition, versioning your
server documentation can be an invaluable aid for rolling back changes
when patches or third-party software affect usability and performance.
After you have thoroughly
documented your farm installation, continually update your server
documents. This creates a “living” document set that is always current,
and it will be worth all the time it took to keep it current when you
need the documents for restoring services. Create an appendix in your
server documentation with version history and note the reason for
changing your specific installation. If possible, verify any changes you
make with your peers.
Note:
ON THE COMPANION MEDIA Use the Disaster Recovery Template on the companion media as a guide to completing your organization’s disaster recovery plan.
2.1. SharePoint-Dependent Documentation
This category should
contain all of the information that you gathered during your meetings,
lunches, and water-cooler conversations with network and SQL Server
administrators. It is specific to those components that are outside of
SharePoint but are required for SharePoint to function. Have the
administrators of your network and SQL Server create the documentation
for items in this category to make sure it contains everything necessary
to recover from a disaster.
2.1.1. Operating System
Because there are several
versions of operating systems in widespread use, you must document your
specific installation, and keep the installation media easily available
as well. Update your documentation whenever you apply service packs,
patches, hotfixes, and any other changes or additions to the operating
system to ensure that it is consistent on all servers in the farm.
2.1.2. SQL Server
Document the version of SQL
Server you are using, along with the service packs, patches, hotfixes,
and so on that have been applied to your SharePoint SQL Server instance
or instances. Also, if you are performing SQL backups of your SharePoint
databases, document the backup strategies and methods you are using,
the backup schedule, and the location of backup copies, as well as any
other information that will help you quickly recover your SharePoint
databases.
2.1.3. Internet Information Services
Document any modifications
made to your Web applications through IIS Manager. Also document your
backup schedule and the location of the IIS backups.
After talking to the
administrators of these systems and becoming familiar with how they are
integrated with SharePoint, you should identify any scheduled outages,
such as maintenance windows, that you need to take into account during
the planning stage for disaster recovery. Your disaster
recovery plan will only be as good as its weakest link, so don’t forget
to involve the stakeholders early and convince your peers that a good
disaster recovery plan is a solid investment.
2.2. SharePoint Component Documentation
This category of
documentation contains the information specific to SharePoint, and it
focuses on the different components within SharePoint. Your source for
SharePoint component information should be your SharePoint farm
administrators, who are the best people to write and maintain this
critical documentation.
2.2.1. Central Administration
It is important to
completely document the installation of all servers, but especially your
Central Administration Web application server. This document should be
secured and only be accessible to farm administrators. It should contain
the following information.
Farm account name and password
Farm passphrase specified during initial creation of farm
Port number of Central Administration
SQL Server server name
SQL instance name on SQL Server
SQL Server account name and password
Configuration database name
Location of binaries (if not the default)
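The list above can double as a completeness check for the Central Administration document. The sketch below, in Python, only verifies that each item has been filled in. The function name and the record format are hypothetical. The real values belong in the secured document, not in a script.

```python
from typing import Dict, List

# Items the list above says the document should contain. The location of the
# binaries is left out because it is only needed when it is not the default.
REQUIRED_ITEMS = [
    "Farm account name and password",
    "Farm passphrase",
    "Port number of Central Administration",
    "SQL Server server name",
    "SQL instance name",
    "SQL Server account name and password",
    "Configuration database name",
]


def missing_items(record: Dict[str, str]) -> List[str]:
    """Return the required items that are absent or blank in the record."""
    return [item for item in REQUIRED_ITEMS if not record.get(item, "").strip()]
```

Calling `missing_items` on a record that maps each item to a short note, such as where the value is kept, returns the items that still need to be documented.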
2.2.2. Web Front-End
The following is a list of items that must be documented to successfully back up and restore a customized WFE.
IIS Configuration
Customized authentication software
TCP ports on Web applications and extended Web applications
IIS excluded managed paths and associated content
Centrally located repository for IIS configuration backups
SSL certificate backups
IIS Logs at %SystemRoot%\system32\LogFiles\w3svc<IIS Virt Server ID>
Web Parts installed into the Global Assembly Cache (GAC)
Customized code located in the SharePoint Root
2.2.3. Search Service
Document your index
file locations, the backup schedule for these indexes, and the location
of these backups. Also be sure to include a list of the database names,
the backup schedule for the search databases, the backup method, and
the location of backup copies.
2.2.4. Service Applications
Document
service application configuration information, associated Web
application information, and database names, as well as the backup
schedule, method, and location of the backups.
3. Preparation
Preparation involves testing the remedies that you identified in the documentation
process so that when a disaster occurs, you will know exactly what
steps to take to recover from it and how long the recovery
will take.
Having a plan that won’t work is of little use, so it makes sense to execute a simulation of your disaster
recovery plan often, making sure to coordinate with your peers and
stakeholders. Executing a disaster recovery plan on a production farm is
generally a bad idea, but you can test the plan on secondary server
farms and on system image restores in a development environment. If your
organization has the resources to build a lab with a mock-up of your
production environment, you can use it to test your disaster recovery
plan. To minimize costs and overhead, a mock-up environment can be
simulated using a virtual environment.
When you are testing your
disaster recovery plan, try to test the plan with real-world scenarios
involving Search server failures, SQL content database corruption, IIS
corruption, network card failures, hard drive failures, and any other
common issues you might face. This will provide you with valuable
knowledge about how to bring back a failed SharePoint farm.
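Recording how long each test scenario takes shows which recoveries run longer than your organization can tolerate. The following Python sketch is one way to keep such a record. The class name, field names, and the sample targets are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrillResult:
    scenario: str          # for example, "SQL content database corruption"
    minutes_taken: float   # measured during the test
    target_minutes: float  # recovery time your organization has agreed to


def over_target(results: List[DrillResult]) -> List[str]:
    """Return the scenarios whose measured recovery time exceeded the target."""
    return [r.scenario for r in results if r.minutes_taken > r.target_minutes]


# Hypothetical results from one round of testing.
results = [
    DrillResult("Search server failure", 95, 120),
    DrillResult("SQL content database corruption", 210, 180),
]
print(over_target(results))  # ['SQL content database corruption']
```

Scenarios that appear in the output are the ones whose documented remedy needs to be revisited.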
Many disaster recovery
plans adequately cover all hardware, software, and system components,
but leave out what may be the most important part of the equation—you
and your associates. As an example, if the network administrator is on
vacation when a disaster occurs, you may be able to quickly restore your
SharePoint 2010 server farm, but it will be of little value if the
network is still down. Make sure you and the other system administrators
have a list of all administrators. This list should include shift
schedules, home and cell phone numbers, vacation schedules and contact
information, and any other relevant information you may need to round up
the personnel you will need to implement your plan for restoration of
service. Having this information available also will help make sure that
you are meeting the defined service level agreements (SLAs) for your
clients.
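If the administrator list is kept in a structured form, finding out who can be reached on a given day becomes a simple lookup. The Python sketch below assumes that each entry records a role, a phone number, and vacation periods as inclusive date ranges. All names and values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class Administrator:
    name: str
    role: str   # for example, "network" or "sql"
    phone: str
    # Vacation periods as (first day, last day), both inclusive.
    vacations: List[Tuple[date, date]] = field(default_factory=list)


def available(admins: List[Administrator], role: str, on: date) -> List[Administrator]:
    """Return administrators with the given role who are not on vacation that day."""
    return [
        a for a in admins
        if a.role == role and not any(start <= on <= end for start, end in a.vacations)
    ]
```

A call such as `available(admins, "network", date.today())` returns the network administrators you can contact right away. Shift schedules could be added to the record in the same way.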
It is no accident that large banks and brokerage firms rarely lose data after major disasters: their disaster
recovery plans are well documented and carefully executed when needed.
It is nearly impossible to execute a disaster recovery plan successfully
if you do not know all of the dependencies in your environment, haven’t
accurately documented the steps required to perform disaster recovery
at different levels, and haven’t tested the success of your disaster
recovery plan. Education, documentation, and preparation:
remember that these are the three steps to creating a disaster recovery
plan that will allow your organization to recover quickly and
efficiently when calamity strikes.
But
don’t just file away that great plan you’ve created after you’ve
finished testing it. You have to keep it current and viable. Perform
tests monthly to remain familiar with exactly what has to be completed
to recover any level of your SharePoint farm.
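A small check can flag when the monthly test has been skipped. The following Python sketch treats a test as overdue when more than 31 days have passed since the last one. The function name and the 31-day threshold are assumptions chosen to match a monthly schedule.

```python
from datetime import date


def drill_overdue(last_drill: date, today: date, max_days: int = 31) -> bool:
    """Return True when more than max_days have passed since the last test."""
    return (today - last_drill).days > max_days


# Hypothetical dates: the last test ran 45 days before the check.
print(drill_overdue(date(2024, 1, 1), date(2024, 2, 15)))  # True
```

You could run a check like this as part of a routine status report so that an overdue test is noticed before a disaster makes it matter.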