1. Choosing and Configuring Hardware for Redundancy
This section describes the most
important items that you should consider from a hardware perspective
when you are trying to increase the basic resiliency and availability
of an individual database server. These are some of the first steps you
would take as part of designing a high-availability solution for your
data tier. The basic goal here is to eliminate as many single points of
failure as possible at the hardware and configuration level. Therefore,
when choosing components for a database server (as opposed to a web
server, for example) and including them as part of the server
configuration, you should consider these aspects regardless of any other
high-availability techniques you decide to use.
You should always get two internal drives in a
RAID 1 (mirrored) configuration for the operating system and the SQL
Server binaries. These drives should use the integrated hardware
RAID controller that is available on most new rack-mounted servers.
Using an integrated hardware RAID controller (which usually has a
256MB–512MB cache) provides better performance than using software RAID
through Windows. Having two drives in RAID 1 offers a basic level of
redundancy for the operating system and the SQL Server binaries, so the
server will not stop functioning if one of the drives fails.
Try to get at least 146GB, 15K 2.5″ drives for
this purpose. Using 15K drives helps Windows Server boot a little
faster, and it will help SQL Server load a bit faster when the service
first starts up. Using 146GB (or larger) drives provides more room to
accommodate things like the Windows page file, SQL Server Error Log
files, dump files, and so on, without worrying about drive space.
As SSD prices continue to fall, you might want to consider using two
SSDs for your mirrored boot drive. Reducing your boot time and reducing
the time it takes for SQL Server to start up using SSDs could help you
meet your recovery time objective (RTO) goals.
Ensure that you have dual power supplies for the
database server, each plugged into separate circuits in your server
room or data center. You should also be plugged into an uninterruptable
power supply (UPS) on each circuit, and ideally have a backup power
source, such as a diesel generator for your data center. The idea here
is to protect against an internal power supply failure, a cord being
kicked out of an electrical socket, a circuit breaker tripping, or loss
of electrical power from the utility grid. Adding a second power supply
is relatively inexpensive insurance, typically less than $300. Despite
this, we have seen many battles with economizing bosses about this item
over the years. Power supplies do fail, cords are accidentally
unplugged, and circuit breakers do get tripped. Therefore, stick to
your guns about dual power supplies for a database server.

You should
have multiple network ports in the server, with Ethernet connections
into at least two different network switches. These network switches
(which should also have dual power supplies) should be plugged into
different electrical circuits in your data center. Most new
rack-mounted servers have at least four gigabit Ethernet ports embedded
on the motherboard. All of this is designed to prevent an outage caused
by the loss of a single network port or a single network switch.
You should have multiple RAID controller cards
(if you are using direct-attached or internal storage), multiple host
bus adapters (HBAs) (if you are using a Fibre Channel SAN), or multiple
PCIe Gigabit (or faster) Ethernet cards (if you are using an iSCSI SAN). This will
give you better redundancy and better throughput, depending on your
configuration. Again, the idea here is to try to avoid an outage caused
by the loss of a single component.
Wherever your SQL Server data files, log files,
tempdb files, and SQL Server backup files are located, they should be
protected by an appropriate RAID level, depending on your budget and
performance needs. You want to prevent your databases from going down
due to the loss of a single drive. Keep in mind that RAID is not a
substitute for an appropriate SQL Server backup and restore strategy! Never
let anyone, whether it is a SAN vendor, a server administrator from
your operations team, or your boss, talk you into not doing SQL Server
backups as appropriate for your recovery point objective (RPO) and
recovery time objective (RTO) requirements. This cannot be emphasized
enough! There is absolutely no substitute for having SQL Server backup
files, although you will undoubtedly be pressured throughout your
career, by different people, into not running SQL Server database
backups. Stand your ground. The old saying is true: “If you don’t have
backups, you don’t have a database.”
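If you want a quick way to verify that backups are actually keeping up with your RPO, a short script against the msdb backup history works well. The following Python sketch is only an illustration: the pyodbc module, the connection string, the driver name, and the 24-hour RPO value are all assumptions to adjust for your own environment.

    # Sketch only: flag databases whose most recent full backup is older than
    # the RPO allows. Connection string, driver, and RPO are placeholders.
    import datetime
    import pyodbc

    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=YOURSERVER;Trusted_Connection=yes")
    RPO_HOURS = 24  # example value, not a recommendation

    QUERY = """
    SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b
        ON b.database_name = d.name AND b.type = 'D'  -- 'D' = full database backup
    WHERE d.name <> 'tempdb'
    GROUP BY d.name
    """

    with pyodbc.connect(CONN_STR) as conn:
        for name, last_backup in conn.cursor().execute(QUERY):
            if last_backup is None:
                print(f"{name}: no full backup recorded in msdb")
            elif datetime.datetime.now() - last_backup > datetime.timedelta(hours=RPO_HOURS):
                print(f"{name}: last full backup {last_backup} is outside the {RPO_HOURS}-hour RPO")

A check like this does not replace proper backup monitoring, but it makes it very easy to prove, on demand, whether your backup history matches your stated RPO.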
To reduce the boot and SQL Server startup time on
your database servers, consider the following BIOS configuration change.
For a standalone database server, reducing your total reboot time has a
direct effect on your high-availability numbers. Therefore, go into the
BIOS setup for the server and disable the memory testing that normally
occurs during the POST sequence. This shaves a significant amount of
time off the boot process (often several minutes, depending on how much
RAM is installed), so the server will boot faster. Disabling the test
carries little risk, as it occurs only during the POST sequence; it has nothing to
do with detecting a memory problem while the server is running later,
which is the job of your hardware or system-monitoring software.
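To put some rough numbers on that, the following sketch shows how reboot duration feeds directly into the availability percentage for a standalone server. The reboot counts and boot times here are hypothetical figures for illustration, not measurements.

    # Hypothetical figures only: how reboot time feeds into availability
    # for a standalone database server.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def availability_pct(downtime_minutes_per_year):
        return 100.0 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)

    reboots_per_year = 12           # e.g., monthly patching
    boot_minutes_with_post = 10     # POST memory test plus Windows and SQL Server startup
    boot_minutes_without_post = 4   # POST memory test disabled in the BIOS

    print(availability_pct(reboots_per_year * boot_minutes_with_post))     # ~99.977
    print(availability_pct(reboots_per_year * boot_minutes_without_post))  # ~99.991

Even a few minutes saved per reboot shows up in the yearly availability figure once you count every planned and unplanned restart.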
While you are in the BIOS setup, also access the
Power Management section and either disable the power management
settings or set them to OS control. By default, Windows Server 2008 and
Windows Server 2008 R2 use the Windows Balanced Power Plan. This reduces
electrical power usage by lowering the multiplier setting for the
processors, which reduces their clock speed when the system is not
under a heavy load. This sounds like a good idea, but it can actually
have a very significant negative effect on performance, as some
processors do not react quickly enough to an increase in workload. This
is particularly important if you have an Intel Nehalem or Westmere
family processor. The latest Intel Sandy Bridge and Ivy Bridge family
processors react to power state changes much more quickly than Nehalem
or Westmere did, which makes them much less sensitive to those changes
from a performance perspective.
Regardless of what processor you have, power
management can have other negative effects on your database server. One
example is when you are using Fusion-io cards in your server. Some
forms of power management can affect the PCIe slots in the server,
so Fusion-io specifically recommends that you disable power management
settings in your main BIOS setup and in Windows. The easy solution to
all of this is to ensure that you are using the High Performance
Windows Power Plan, and that you disable the power management settings
in your BIOS.
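If you want to verify the active plan from a script rather than clicking through the Power Options applet, something like the following sketch works on Windows. It simply shells out to the built-in powercfg utility and does a loose string check on its output.

    # Sketch only: check the active Windows power plan via the built-in
    # powercfg utility (Windows only; loose string match on its output).
    import subprocess

    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True
    )

    if "High performance" in result.stdout:
        print("High Performance power plan is active.")
    else:
        print("WARNING: High Performance power plan is not active:")
        print(result.stdout.strip())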
Finally, after ensuring that you have
followed all the guidelines described thus far, you still are not done.
Depending on your RPO and RTO requirements, you should be planning and
hopefully implementing some sort of overall high-availability and
disaster-recovery (HA/DR) strategy to provide you with an even more
robust system that will be able to handle as many different types of
issues and “disasters” as possible. This strategy could include
technologies such as Windows failover clustering, database mirroring,
log shipping, transactional replication, and SQL Server 2012 AlwaysOn
Availability Groups, along with an actual plan that outlines the
policies and procedures needed to successfully handle a disaster.
2. Hardware Comparison Tools
We are firm proponents of using readily
available benchmark tools and some common sense and analysis as a means
of comparing different hardware types and configurations. Rather than
simply guess about the relative and absolute performance of different
systems, you can use the results of standardized database benchmarks
and specific component benchmarks to more accurately evaluate and
compare different systems and components. This section discusses two
such benchmarking tools: the TPC-E OLTP benchmark and the Geekbench
processor and memory performance benchmark.
TPC-E Benchmark
The TPC Benchmark E (TPC-E) is an OLTP
performance benchmark that was introduced in early 2007. TPC-E is not
a replacement for the old TPC-C benchmark, but rather a completely new
OLTP benchmark. Even though this newer benchmark has been available for
over five years, there are still no posted results for any RDBMS other
than SQL Server. Fortunately, many results are posted for SQL Server,
which makes it a very useful benchmark when assessing SQL Server
hardware. At the time of writing, there are 54 published TPC-E results,
using SQL Server 2005, 2008, 2008 R2, and SQL Server 2012. This gives
you many different systems and configurations from which to choose as
you look for a system resembling one that you want to evaluate.
The TPC-E benchmark is an OLTP, database-centric
workload that is meant to reduce the cost and complexity of running the
benchmark compared to the older TPC-C benchmark. Unlike TPC-C, the
storage media for TPC-E must be fault tolerant (which means no RAID 0
arrays). Overall, the TPC-E benchmark is designed to have reduced I/O
requirements compared to the old TPC-C benchmark, which makes it both
less expensive and more realistic because the sponsoring hardware
vendors will not feel as much pressure to equip their test systems with
disproportionately large, expensive disk subsystems in order to get the
best test results. The TPC-E benchmark is also more CPU intensive than
the old TPC-C benchmark, which means that the results tend to correlate
fairly well to CPU performance, as long as the I/O subsystem can drive
the workload effectively.
It simulates the OLTP workload of a brokerage
firm that interacts with customers using synchronous transactions and
with a financial market using asynchronous transactions. The TPC-E
database is populated with pseudo-real data, including customer names
from the year 2000 U.S. Census, and company listings from the NYSE and
NASDAQ. Having realistic data introduces data skew, and makes the data
compressible. The business model of the brokerage firm is organized by
customers, accounts, and securities. The data model for TPC-E is
significantly more complex, and more realistic, than TPC-C, with 33
tables and many different data types. The data model for the TPC-E
database also enforces referential integrity, unlike the older TPC-C
data model.
The TPC-E implementation is broken down into a
Driver and a System Under Test (SUT), separated by a network. The
Driver represents the various client devices that would use an N-tier
client-server system, abstracted into a load generation system. The SUT
has multiple application servers (Tier A) that communicate with the
database server and its associated storage subsystem (Tier B). The TPC
provides a transaction harness component that runs in Tier A, while the
test sponsor provides the other components in the SUT. The performance
metric for TPC-E is transactions per second, tpsE. The actual tpsE
score represents the average number of Trade Result transactions
executed within one second. To be fully compliant with the TPC-E
standard, all references to tpsE results must include the tpsE rate,
the associated price per tpsE, and the availability date of the priced
configuration. Published TPC-E scores currently range from
a low of 144.88 tpsE to a high of 4614.22 tpsE. There are scores for
two-socket, four-socket, eight-socket and 16-socket systems, using
several different processor families from Intel and AMD. Reflecting the
performance deficit of recent AMD processors, only four AMD results
have been published out of the 54 total submissions.
When assessing the OLTP performance of different
server platforms using different processor families and models, you
want to look for a TPC-E result that uses the same type and number of
processors as the system you are considering. If you cannot find an
exact match, look for the closest equivalent system as a starting
point, and then adjust the results upward or downward using component
benchmark results and common sense.
For example, let’s say that you are considering
the potential performance of a new two-socket, 2.6GHz Intel Xeon
E5-2670 system. After looking at the published TPC-E results, the
nearest match that you can find is a two-socket, 2.9GHz Intel Xeon
E5-2690 system that has a tpsE score of 1863.23. After looking at other
component-level benchmarks for CPU and memory performance, you might
feel relatively safe reducing that score by about 10% to account for
the clock speed difference on the same generation and family
processor (with the same number of cores, cache sizes, and memory
bandwidth), coming up with an adjusted score of about 1676 tpsE.
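That adjustment is simple enough to sanity-check in a couple of lines. The flat 10% figure is the rough adjustment used above; the pure clock-speed ratio gives a very similar number.

    # The clock-speed adjustment from the example above.
    published_tpsE = 1863.23                    # two-socket Xeon E5-2690 result at 2.9GHz

    print(round(published_tpsE * 0.90))         # ~1677, the "about 1676" rough adjustment
    print(round(published_tpsE * (2.6 / 2.9)))  # ~1670, using the exact clock-speed ratio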
You want to compare the potential performance of
this system to an older four-socket system that uses the 2.66GHz Intel
Xeon X7460 processor, and you find a TPC-E benchmark for a similar
system that has a score of 671.35 tpsE. Just looking at these raw
scores, you could be relatively confident that you could replace the
old four-socket system with that new two-socket system and see better
performance with more scalability headroom. You should also drill into
the actual TPC-E submissions to better understand the details of each
system that was tested. For each tested system, you want to know things
such as operating system version, SQL Server version, the amount of RAM
in the database server, the initial database size, the type of storage,
and the number of spindles. All of this gives you a better idea of the
validity of the comparison between the two systems.
When assessing the relative OLTP performance of
different processors, take the raw TPC-E tpsE score for a system using
the processor and divide it by the number of physical cores in the
system to get an idea of the relative “per physical core performance.”
Using the preceding example, the proposed new two-socket Xeon E5-2670
system would have 16 physical cores. Taking your adjusted score of 1676
and dividing by 16 would give you a figure of 104.75. The old
four-socket Xeon X7460 system has 24 physical cores, so taking the
actual raw score of 671.35 and dividing it by 24 gives you a figure of
27.97, which is a pretty dramatic difference between the two processors
for single-threaded OLTP performance.
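Expressed as a couple of lines of arithmetic, using the figures from this example:

    # Per-physical-core comparison using the figures from this example.
    new_adjusted_tpsE, new_cores = 1676, 16    # two-socket Xeon E5-2670 (8 cores per socket)
    old_raw_tpsE, old_cores = 671.35, 24       # four-socket Xeon X7460 (6 cores per socket)

    print(round(new_adjusted_tpsE / new_cores, 2))  # 104.75
    print(round(old_raw_tpsE / old_cores, 2))       # 27.97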
Geekbench Benchmark
Geekbench is a cross-platform,
synthetic benchmark tool from a company called Primate Labs. It offers
a rather comprehensive set of benchmarks designed to measure the
processor and memory performance of a system, whether it is a laptop or
a multi-processor database server. There is no measurement of I/O
performance in this benchmark. One convenient feature of Geekbench is
that there are no configuration options to worry about. You simply
install it and run it, and within about three minutes you will see the
scores for the system you have tested. These are broken down into an
overall Geekbench score and a number of scores for processor and memory
performance. This is very useful for comparing the relative processor
and memory performance of different processors and different model
servers that may be configured in a variety of ways.
This test can be a very reliable and useful gauge
of processor and memory performance. Thousands of Geekbench score
reports have been submitted to the online Geekbench database, which is
available at http://browser.primatelabs.com.
It is highly likely that you can find a score in their database for
nearly any processor or model server that you want to compare. This is
very handy, especially if you don’t have a large dedicated testing lab
with a lot of different model servers and processors.
For example, suppose you have an older
Dell PowerEdge 2950 server with two Intel Xeon E5440 processors and
32GB of RAM. It turns out that a system like this has a Geekbench score
of around 7950. You are trying to justify the purchase of a new Dell
PowerEdge R720 server with two Intel Xeon E5-2690 processors and 128GB
of RAM, and you discover a result in the online database that shows a
Geekbench score of about 41,000. That’s a rather dramatic increase
compared to a score of 7950. Using Geekbench scores in conjunction with
TPC-E scores is a fairly reliable way to compare relative processor and
memory performance, especially for OLTP workloads. Using these two
benchmarks together is a very useful technique that will likely serve
you well.
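As a rough sanity check, you can express those two Geekbench scores as a simple ratio, keeping in mind that Geekbench measures only processor and memory performance, not I/O.

    # Rough ratio of the two Geekbench scores from the example above.
    old_score = 7950       # Dell PowerEdge 2950, two Xeon E5440 processors
    new_score = 41000      # Dell PowerEdge R720, two Xeon E5-2690 processors

    print(round(new_score / old_score, 1))  # ~5.2x the processor and memory performance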