SQL Server 2012: Fault Tolerance - Defining a Service Level Agreement

1/13/2014 8:33:13 PM

If you ask DBAs at a few randomly chosen companies how they manage their databases, chances are you will get a variety of answers. Some companies have their DBAs focus solely on the database and its performance and keep them hands-off when it comes to the back-end storage. Other DBAs are closely involved in the SAN architecture and understand its limitations. Despite these variations in the DBA role, one concept is common to everyone in the IT world: uptime. The IT organization guarantees the business that its applications (and databases) will be available, or up. This commitment is usually referred to as a service level agreement (SLA).

The “Nines”

When describing the amount of uptime, or downtime, people refer to the nines. For example, if your customer allows only about 8 hours of downtime a year, the requirement is three nines (99.9%). In other words, the application has to be up 99.9% of the 8,760 hours in a 365-day year, which leaves roughly 8.76 hours of downtime. Table 1 shows a list of common nines and their corresponding allowable downtimes per year.

[Table 1: Common nines and their corresponding allowable downtimes per year]
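The arithmetic behind the nines is simple enough to sketch in a few lines of Python. The function name and the list of targets below are illustrative choices, not part of any SQL Server tooling, and the calculation assumes a 365-day year with no allowance for planned maintenance windows.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for a few commonly quoted targets (illustrative list).
for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

For three nines, this works out to roughly 8.76 hours a year, which matches the "about 8 hours" figure used above.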

In general, the more nines you are asked to provide, the more expensive the solution will be. Most groups I have encountered in my travels commit to three nines, because the cost from this point upward is substantial.

Other Metrics

When thinking about SLAs, there are other discussions to have and decisions to make. First, you should think through the current size of your databases, their expected growth patterns, and the workloads for these databases. The workload analysis should include how much transaction log is created during peak times. The Performance Data Collector feature within SQL Server can help you capture historical server performance.

Within the topic of uptime, you may hear two additional metrics used to satisfy SLAs. The first, recovery point objective (RPO), is how much data can be lost, usually expressed as a span of time. In the database world, think of a solution where your disaster recovery plan is to perform log shipping to a remote server. If you ship a log backup every 5 minutes and your RPO is 5 minutes, log shipping will not satisfy the requirement, because you must also include the time it takes to copy the log backup file to the remote server. The RPO would have to be longer than 5 minutes for this configuration to satisfy it.
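The log shipping reasoning above can be checked with a small calculation. This is a minimal sketch, not a model of SQL Server's log shipping jobs: the function names are hypothetical, and the 1-minute copy time in the example is an assumed figure that you would replace with a measurement from your own environment.

```python
def worst_case_data_loss_minutes(backup_interval_min: float, copy_time_min: float) -> float:
    """Worst-case data loss for log shipping.

    If the primary fails just before the newest log backup finishes copying,
    the remote server only holds data up to the previous backup, so the
    exposure is one full backup interval plus the copy time.
    """
    return backup_interval_min + copy_time_min


def meets_rpo(backup_interval_min: float, copy_time_min: float, rpo_min: float) -> bool:
    """Return True if the worst-case data loss fits within the RPO."""
    return worst_case_data_loss_minutes(backup_interval_min, copy_time_min) <= rpo_min


# Example: log backups every 5 minutes, an assumed 1-minute file copy, 5-minute RPO.
print(meets_rpo(backup_interval_min=5, copy_time_min=1, rpo_min=5))
```

Under these assumptions the worst-case exposure is 6 minutes, so a 5-minute RPO is not met. Either the backup interval has to shrink or the RPO has to be renegotiated.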

The next metric is recovery time objective (RTO), which defines how much time is allowed to pass while the database is restored after a complete failure. “Complete failure,” in this case, means recovering from a backup.

When databases go down, the outage can be planned or unplanned. Planned outages are usually known in advance, and the end users' expectations are set accordingly. These outages can occur because of the deployment of operating system or database system patches, upgrades, or migrations. An unplanned outage is a failure that generally occurs without warning. It can be caused by hardware failure, software failure, or user error. Users can cause outages to database applications by accidentally deleting data that the application needs to function properly. It is a best practice to grant users the fewest privileges possible. If a user needs to perform a specific action that requires elevated privileges, consider writing the functionality within a stored procedure, having the stored procedure run under an administrator's security context, and granting the user only EXECUTE permission on that stored procedure.

Preparing an effective recovery plan ahead of time eases the work necessary to recover from failure. SQL Server has a number of features that help you architect a highly available and fault-tolerant solution.
