Selecting the Right Storage Hardware
Now we can move on to selecting appropriate
hardware for our requirements. There are many aspects to this decision,
and the requirements that we have identified so far represent only a
few of them. Often there may be a hardware constraint in your
requirements, which mandates that you use a specific storage technology
or vendor. This technology may not be ideal for use with Exchange but
may align with the organization's overall storage strategy. In some
cases, this strategy has been mandated for all services, regardless of
their requirements or function, and it will be necessary to use a
particular storage platform. The other scenario is that you have free
choice of the storage platform for Exchange, in which case you will
need a way to narrow down your options. This section discusses both
scenarios and how to deal with them.
COMPANY-MANDATED STORAGE PLATFORM
First, let's take on the scenario where your
storage platform choice is fixed. Even though this decision has been
made for you, you still have work to do. To start, you need to research
the platform that has been designated. Begin this process by looking at
the Exchange Solution Reviewed Program (ESRP) - Storage. Search for a submission from the same storage vendor and, hopefully, for the same platform that is to be used.
ESRP submissions vary in quality and the
usefulness of the information provided, but they are almost always a
great place to find specific configuration details for running Exchange
on a particular platform. ESRP is grouped into versions that relate to
the Exchange platform to which they apply:
- Exchange Server 2007: ESRP - Storage v2.1
- Exchange Server 2010: ESRP - Storage v3.0
- Exchange Server 2013: ESRP - Storage v4.0
If you cannot find an ESRP submission, look on the manufacturer's website for Exchange-specific configuration recommendations.
Next, evaluate your disk type options and
RAID group configurations. Not all storage platforms allow all
configurations, and so it is vital that you understand what you can and
cannot do. Then map the requirements that you obtained from the
Exchange 2013 Server Role Requirements Calculator onto the platform.
Often, the best way to do this is to run a combined storage design
workshop with the storage team to evaluate the options, and then
adjust the calculator inputs accordingly.
Once this process is
complete, try to define a validation approach. This is generally a
small deployment of servers and storage on representative hardware that
can be used for Jetstress testing. The goal of the Jetstress testing
will be to validate that the proposed solution is capable of meeting
the requirements identified by the calculator. This is where the
marketing nonsense stops and the fun begins!
FREE CHOICE OF STORAGE PLATFORM
Recall that, in the second scenario, you have
full control over the storage platform. This is often more challenging
for a design team. Now you have to come up with a process for
evaluating storage platforms and their ability to meet your
requirements. To this end, first define your specification
requirements. The following are common areas of comparison; most teams
grade each platform in each area from 1 to 10 (where 1 is very poor
and 10 is perfect for the task).
Cost: This is obviously a key aspect.
However, it is vital to consider the total cost of the platform and not
just the purchase price. What are the support costs? What about
operator training expenses? What about installation and configuration
costs? If possible, calculate the total cost of the platform over a
period of time, for example, two or three years, and use this to
compare the real costs of each platform.
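The total-cost comparison described above can be sketched as follows. This is a minimal illustration, not a costing model: the class name, field names, and all figures are hypothetical, and it assumes that training, installation, and configuration are one-time costs while support recurs annually.

```python
from dataclasses import dataclass


@dataclass
class PlatformCost:
    name: str
    purchase: float            # one-time purchase price
    install_and_config: float  # one-time installation and configuration (assumed)
    training: float            # one-time operator training (assumed)
    support_per_year: float    # recurring support cost (assumed annual)


def total_cost(platform: PlatformCost, years: int) -> float:
    """Total cost of the platform over the comparison period."""
    one_time = platform.purchase + platform.install_and_config + platform.training
    return one_time + platform.support_per_year * years


# Illustrative figures only; they are not taken from any real platform.
platforms = [
    PlatformCost("Platform A", 100_000, 10_000, 5_000, 20_000),
    PlatformCost("Platform B", 80_000, 25_000, 15_000, 30_000),
]

for p in sorted(platforms, key=lambda p: total_cost(p, 3)):
    print(f"{p.name}: {total_cost(p, 3):,.0f} over 3 years")
```

With these made-up numbers, the platform with the lower purchase price turns out to be the more expensive one over three years, which is the reason for comparing total cost rather than purchase price.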
Operations: How easy or difficult is this
platform going to be to operate? Can it be easily upgraded? Can parts
be swapped out without affecting service? Try to determine a common set
of operational processes that will be required, and grade each platform
on a 1 to 10 scale for the ease with which these tasks can be completed.
Space: Datacenter space is a primary
concern for many customers. Space is an expensive commodity in most
datacenters, and it should be taken into consideration for any new
platform. Try to determine the rack space required per GB or per
mailbox for each platform to aid in making a comparison.
Power: This is another area of increasing
concern for recent deployments. The more power that a device draws, the
more heat it usually generates, which leads to more demand for
datacenter cooling. When possible, calculate the energy consumption in
kWh per mailbox or per GB for comparison.
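The per-mailbox normalization suggested for space and power can be sketched as below. The function names and sample figures are hypothetical, and the sketch assumes that you know each platform's rack units and average power draw in watts, and that energy is compared over one year of continuous operation.

```python
def per_mailbox(total: float, mailboxes: int) -> float:
    """Normalize a platform-wide total to a per-mailbox figure."""
    if mailboxes <= 0:
        raise ValueError("mailboxes must be positive")
    return total / mailboxes


def annual_energy_kwh(average_draw_watts: float) -> float:
    """Energy used in one year at a constant average draw (assumed 24x365)."""
    hours_per_year = 24 * 365
    return average_draw_watts * hours_per_year / 1000


# Illustrative figures only.
rack_units = 12
average_draw_watts = 2_400
mailboxes = 10_000

ru_per_mailbox = per_mailbox(rack_units, mailboxes)
kwh_per_mailbox_year = per_mailbox(annual_energy_kwh(average_draw_watts), mailboxes)

print(f"Rack units per mailbox: {ru_per_mailbox:.4f}")
print(f"kWh per mailbox per year: {kwh_per_mailbox_year:.4f}")
```

The same helper works per GB: pass the usable capacity in GB instead of the mailbox count.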
Performance: From an Exchange perspective,
your performance requirements are defined in the Exchange 2013 Server
Role Requirements Calculator. Can the platform meet your IOPS,
throughput, and capacity requirements while remaining under your
recommended I/O latency thresholds when tested with Jetstress? We
generally suggest that you record a pass/fail for each platform here,
where failed systems are either redesigned and retested or discarded in
the process.
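One way to combine the 1 to 10 grades with the performance pass/fail is sketched below. The text does not prescribe how to aggregate grades, so the unweighted mean used here is an assumption, as are the function name, the dictionary layout, and the sample grades. Platforms that failed Jetstress are excluded rather than scored.

```python
from statistics import mean

# Areas graded 1 to 10; performance is recorded separately as pass/fail.
GRADED_AREAS = ("cost", "operations", "space", "power")


def rank_platforms(evaluations):
    """Return (name, mean grade) for platforms that passed Jetstress, best first."""
    ranked = []
    for name, evaluation in evaluations.items():
        if not evaluation["jetstress_passed"]:
            continue  # failed platforms are redesigned and retested, or discarded
        grades = [evaluation["grades"][area] for area in GRADED_AREAS]
        if any(not 1 <= grade <= 10 for grade in grades):
            raise ValueError(f"{name}: grades must be between 1 and 10")
        ranked.append((name, mean(grades)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative grades only.
evaluations = {
    "Platform A": {"jetstress_passed": True,
                   "grades": {"cost": 7, "operations": 8, "space": 6, "power": 7}},
    "Platform B": {"jetstress_passed": False,
                   "grades": {"cost": 9, "operations": 9, "space": 9, "power": 9}},
    "Platform C": {"jetstress_passed": True,
                   "grades": {"cost": 8, "operations": 6, "space": 8, "power": 8}},
}

for name, score in rank_platforms(evaluations):
    print(f"{name}: {score:.2f}")
```

Treating performance as a gate rather than a grade means a platform that cannot meet the requirements never wins on cost or density alone. If some areas matter more to your organization, a weighted mean is a natural variation.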
Storage Validation Using Jetstress
What is storage validation? Simply put,
the goal of this process is to ensure that the storage platform is
capable of meeting the demands of Exchange Server to service end-user requests
in a timely manner. If the storage platform is incapable of meeting
these demands, then the end-user experience will suffer. We know this
from experience in the early days with Exchange Server 2003, where poor
storage performance equaled poor Exchange Server performance.
There is an important aspect to the
validation process that is rarely discussed, however, and that is that
it must take place with a calibrated workload. A calibrated workload
is one that the Exchange product group has approved (calibrated) as
being not only representative of, but also equal to, the workload
generated by Exchange Server. This point is important because it
separates tools that simply generate workload, such as Iometer and
LoadGen, from tools that generate a defined and calibrated workload,
such as Jetstress.
Sometimes, in a project where the design
has been completed and the storage is failing to pass the Jetstress
test, a storage team member will insist that Jetstress is not a good
test because the requirements can be met with Iometer, and that's proof
that Jetstress is broken. A slight variation occurs when a team uses
LoadGen to simulate the expected production workload, finds that it
passes where Jetstress fails, and comes to the same conclusion. Both
situations are difficult to address because the explanation of the
results is complex. By far the most compelling explanation is that
Jetstress applies a calibrated workload, and when used with the values
derived from the Exchange 2013 Server Role Requirements Calculator, it
represents the peak two hours of a working day as accurately as
possible.
When it comes to storage validation, Jetstress is the only real tool for the job. Now let's see how it works.
JETSTRESS TEST PROCESS
The Jetstress test process itself is documented in the Jetstress Field Guide.
A version of the guide does not yet exist for Jetstress 2013; however,
the general process outlined for Jetstress 2010 still applies. For the
testing to be considered successful, it must meet all of the following
criteria:
- Meets or exceeds the database IOPS requirements identified within the calculator in normal conditions
- Meets or exceeds the database IOPS requirements identified within the calculator in degraded (rebuild) conditions
- Runs for a duration of two hours (strict mode test)
- Runs for a duration of 24 hours (lenient mode test)
- Completes all test runs with a status of Passed
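The criteria above can be expressed as a simple checklist over recorded test runs. This is a sketch under stated assumptions: the `JetstressRun` record and `validation_gaps` function are hypothetical names, the values are entered by hand from your own test reports, and nothing here parses Jetstress output.

```python
from dataclasses import dataclass


@dataclass
class JetstressRun:
    condition: str        # "normal" or "degraded" (rebuild)
    mode: str             # "strict" or "lenient"
    hours: float          # test duration in hours
    achieved_iops: float  # database IOPS achieved in the run
    status: str           # overall status recorded for the run


def validation_gaps(runs, required_iops):
    """Return the unmet criteria; an empty list means all criteria are met."""
    gaps = []
    for condition in ("normal", "degraded"):
        matching = [r for r in runs if r.condition == condition]
        if not matching:
            gaps.append(f"no run under {condition} conditions")
        elif any(r.achieved_iops < required_iops for r in matching):
            gaps.append(f"IOPS below requirement under {condition} conditions")
    if not any(r.mode == "strict" and r.hours >= 2 for r in runs):
        gaps.append("no two-hour strict mode run")
    if not any(r.mode == "lenient" and r.hours >= 24 for r in runs):
        gaps.append("no 24-hour lenient mode run")
    if any(r.status != "Passed" for r in runs):
        gaps.append("at least one run did not report Passed")
    return gaps


# Illustrative results only; required_iops would come from the calculator.
runs = [
    JetstressRun("normal", "strict", 2, 1200, "Passed"),
    JetstressRun("degraded", "strict", 2, 1050, "Passed"),
    JetstressRun("normal", "lenient", 24, 1150, "Passed"),
]
print(validation_gaps(runs, required_iops=1000))
```

Returning the list of gaps, rather than a single pass/fail flag, makes it clear which test still needs to be run or repeated.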
A common area of confusion about these
tests is the 2-hour vs. 24-hour test recommendation. Jetstress runs in
strict mode when the duration is less than six hours. A completed test
run in strict mode is required to be sure that the storage is meeting
the performance requirements. The lenient mode relaxes some of the peak
latency spike requirements, and it is intended for longer duration
testing. The 24-hour test is recommended to ensure that the storage
platform is capable of operating at peak workload for an extended
duration, since several cases have been logged where performance
deteriorates over time when a storage platform is operating at or near
its limits. If a storage platform passes all of these Jetstress tests,
experience shows that the design is then good to go.
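The relationship between test duration and mode described above can be captured in a few lines. The function name is hypothetical, and the handling of a duration of exactly six hours follows the wording "less than six hours" in the text, which is an assumption about the boundary.

```python
STRICT_MODE_LIMIT_HOURS = 6  # strict mode applies below this duration, per the text


def jetstress_mode(duration_hours: float) -> str:
    """Return the mode that applies to a test of the given duration."""
    if duration_hours <= 0:
        raise ValueError("duration must be positive")
    return "strict" if duration_hours < STRICT_MODE_LIMIT_HOURS else "lenient"


for hours in (2, 24):
    print(f"{hours}-hour test: {jetstress_mode(hours)} mode")
```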
BUILD-TIME VALIDATION
There is one more aspect to Jetstress validation
work that is sometimes disregarded or overlooked, and that is
build-time validation testing. Build-time validation testing involves
running a Jetstress test on each production Mailbox server before it is
accepted into production. When discussing this type of testing, the
question often asked is, why bother with this test when we have already
tested an identical solution in the test lab and it has passed? The
answer is that, although the tests are the same, the purpose is
different. Validation tests in the lab were designed to corroborate
design assumptions and decisions about the storage platform. The
build-time validation is designed to ensure that the hardware has been
deployed and configured appropriately to meet the requirements, and
that it is operating according to expectations; that is, it is not faulty.
It is not unheard of for a
storage platform to pass Jetstress with flying colors in the test lab,
where it receives TLC from the vendor presales team, only to fail the
same Jetstress test in a production environment, where it has been
deployed and configured by a completely different team. This can be
addressed by adopting build and
configuration standards. However, these are still not foolproof, and so
adopting an automated Jetstress validation test prior to installing
Exchange Server into a production environment is highly recommended.
This recommendation is even stronger when a complex storage solution
has been deployed. One thing to remember is that it is much easier to
fix a problem when Jetstress is the only user of the service. If you
first become aware of a problem when an end user reports it, your job
becomes significantly harder.
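A build-time validation gate can be as simple as refusing to accept any server whose Jetstress run has not been recorded as passed. The sketch below assumes that the status of each server's run is collected by some other means; the function name and server names are hypothetical, and it does not invoke Jetstress itself.

```python
def split_by_build_validation(results):
    """Split servers into those cleared for production and those blocked.

    results maps a server name to the recorded status of its build-time
    Jetstress run, or to None if no run has been recorded yet.
    """
    cleared = sorted(name for name, status in results.items() if status == "Passed")
    blocked = sorted(name for name in results if name not in cleared)
    return cleared, blocked


# Illustrative data only.
results = {"MBX01": "Passed", "MBX02": "Failed", "MBX03": None}
cleared, blocked = split_by_build_validation(results)
print("Cleared for production:", cleared)
print("Blocked pending a passing run:", blocked)
```

Treating a missing result the same as a failure keeps an untested server from slipping into production by default.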