If you ask DBAs at a few randomly chosen companies how
they manage their databases, chances are you will get a variety of
answers. Some organizations have their DBAs focus solely on the
database and its performance and stay hands-off with the backend
storage. Other DBAs are closely involved in the SAN architecture and
understand its limitations. Despite these variations in the role
of the DBA, one concern is common to everyone in
the IT world: uptime. The IT organization makes a commitment to the
business that the business applications (and databases) will be
available, or up. This commitment is usually referred to as a service
level agreement (SLA).
The “Nines”
When referring to the amount of uptime, or
downtime, people refer to the nines. For example, if your customer
allows only about 8 hours of downtime a year, this is
considered three nines (99.9%). In other words, the application has to
be up 99.9% of the time over a 365-day year, which leaves roughly 8.76 hours of downtime. Table 1 shows a list of common nines and their corresponding allowable downtimes per year.
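The arithmetic behind the nines is easy to reproduce. The following is a minimal sketch in Python, not part of SQL Server, that converts an availability percentage into allowable downtime per year. The function name `allowed_downtime_hours` is made up for this illustration, and the calculation assumes a 365-day year.

```python
# Allowable downtime per year for a given availability percentage.
# Assumption: a 365-day year (8,760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted at this availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Two nines through five nines.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For three nines this works out to 8.76 hours a year, which matches the "about 8 hours" figure in the example above.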
In general, the more nines you are asked to
provide, the more expensive the solution will be. Most groups I have
encountered in my travels commit to three nines, because the cost rises
substantially beyond this point.
Other Metrics
When thinking about SLAs, there are other
discussions to have and decisions to make. First, you should think
through the current size of your databases, their expected growth
patterns, and the workloads for these databases. The workload analysis
should include how much transaction log is created during peak times.
The Performance Data Collector feature within SQL Server can help you
capture historical server performance.
Within the topic of uptime, you may hear two
additional metrics used to satisfy SLAs. The first, recovery point
objective (RPO), is how much data can be lost, expressed as a span of
time. In the database world, think of a solution where your disaster
recovery plan is to perform log shipping to a remote server. If you log
ship every 5 minutes and your RPO is 5 minutes, your log shipping will
not satisfy this requirement. Remember that for log shipping, you must
also include the time it takes to copy the log file to the remote
server. Thus the RPO would have to be more than 5 minutes for this
configuration to satisfy it.
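The log shipping reasoning above can be checked with a few lines of arithmetic. This is a rough sketch in Python under a simplified model: the worst case is assumed to be a failure just before the next log backup finishes copying, so the exposure is one full backup interval plus the copy time. The function names and the 1-minute copy time are hypothetical values chosen for illustration, not measurements or SQL Server settings.

```python
def worst_case_data_loss_minutes(backup_interval_min: float,
                                 copy_time_min: float) -> float:
    """Upper bound on data loss under the simplified model:
    one full log backup interval plus the time to copy the log file."""
    return backup_interval_min + copy_time_min


def meets_rpo(backup_interval_min: float,
              copy_time_min: float,
              rpo_min: float) -> bool:
    """True when the worst-case data loss fits within the RPO."""
    return worst_case_data_loss_minutes(backup_interval_min, copy_time_min) <= rpo_min


if __name__ == "__main__":
    # Log backups every 5 minutes, with an assumed 1-minute file copy.
    print(meets_rpo(5, 1, rpo_min=5))   # RPO equal to the backup interval
    print(meets_rpo(5, 1, rpo_min=10))  # a more relaxed RPO
```

With a 5-minute backup interval, any nonzero copy time pushes the worst case past a 5-minute RPO, which is why the interval alone cannot be used as the RPO.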
The next metric is recovery time objective
(RTO), which defines how much time is allowed to pass before the
database is restored after a complete failure. “Complete
failure,” in this case, means a failure that requires recovering from a backup.
When databases go down,
the outage can be planned or unplanned. Planned outages are usually
known in advance, and the end users' expectations are set accordingly. These
outages can occur because of deployment of operating system or database
system patches, upgrades, or migrations. Unplanned downtime is a
failure that generally occurs without warning. It can occur because of
hardware or software failure or user error. Users can cause outages to
database applications by accidentally deleting data that the
application needs to function properly. It is a best practice to grant users
the fewest privileges possible. If a user needs to perform a specific
action that requires elevated privileges, consider writing the
functionality within a stored procedure, having the stored procedure
execute as a more privileged account (for example, with the EXECUTE AS
clause), and granting the user only EXECUTE
permission on that stored procedure.
Preparing an effective recovery plan
ahead of time eases the work necessary to recover from a failure. SQL
Server has a number of features that help you architect a highly
available and fault-tolerant solution.