2. Acting on Performance Issues
Acting on performance issues is as important as
identifying them. There are a few key areas where performance can be
improved based on data from the health reports.
High query latency
- Separate the query server, web server, and crawl server roles onto
different servers, depending on where the bottleneck is or whether another
service is consuming too many resources.
- Scale up by adding index partitions and, if necessary, RAM; SharePoint 2010 expects roughly a third (33%) of each index partition to be held in memory on the query server.
- Add query servers, and place additional index partitions on the new query servers to balance the query load.
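The 33% in-memory guideline above can be turned into a rough sizing sketch. Apart from that rule itself, the function name, partition sizes, and OS overhead below are illustrative assumptions, not product figures:

```python
# Rough sizing sketch for a SharePoint 2010 query server: guidance is that
# roughly a third (33%) of the index partitions a server holds should fit
# in RAM. The 4 GB OS overhead and the partition sizes are assumptions.

def min_query_server_ram_gb(partition_sizes_gb, in_memory_fraction=0.33,
                            os_overhead_gb=4):
    """Estimate minimum RAM for a query server hosting the given partitions."""
    index_ram = sum(partition_sizes_gb) * in_memory_fraction
    return index_ram + os_overhead_gb

# Example: a query server holding two 60 GB index partitions.
print(round(min_query_server_ram_gb([60, 60]), 1))  # 43.6
```

A result like this suggests that once partitions grow past what the server's RAM can reasonably cover, the partitions should be spread across additional query servers rather than scaling up further.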
Query latency growing over time
- Add index partitions to spread load and check the performance trend.
- Separate roles onto separate servers.
- Add servers to handle query load as demand grows.
3. Scaling
Scaling is the process of adding resources to
IT systems to improve performance or handle additional load. Scaling is
an important topic for mitigating poor performance and eventual
end-user dissatisfaction. Scaling can prevent downtime, allow resources to
be managed as demand dictates, and keep budget overruns in check by
permitting the progressive addition of infrastructure as demand and
corpus size grow.
There are two basic methods for scaling search:
- Scaling up: Adding more resources to a physical machine that is under heavy load
- Scaling out: Adding additional machines or segmenting tasks to better utilize existing resources
SharePoint 2010 is especially good at managing existing resources and scaling out to handle demand and alleviate bottlenecks.
Why and When to Scale
The most critical time to consider scaling is
when users begin to complain about missing content and slow response
times from search. Missing content can indicate a problem with crawling
that may be the result of a lack of resources (either in the crawler or
in the target system). Slow response times usually indicate too heavy
of a load on the query servers.
Basic scaling can be as simple as adding
servers to move from one level of deployment to the next as content
grows (as was outlined in the previous section). Alternatively, it can
take the form of a more reactive approach that addresses specific
performance or capacity issues as they arise. To implement the latter
technique, one should be aware of the triggers that identify an issue
and the steps to address it.
The following are some typical scaling trigger scenarios and their resolutions:
- Scenario 1—Missing or stale content: Content is missing from
the repository, and the crawl rate is less than 15 documents per
second from the Crawl Rate per Content Source health report. This
health report can be found in Central Administration => Administrative Report Library => Search Administration Reports => Crawl Rate per Content Source.
- Scenario 2—Slow search responses: Users complain of slow
response times on search. Either SharePoint 2010 is performing well on
basic page loads and search is not returning fast enough, or the entire
search and result pages load slowly. First check that there aren't any
poorly performing custom design elements or Web Parts that may be
experiencing problems on the search result page. Check the Search
health reports to identify slow query response times and latency, and
consider scaling out the query servers by partitioning the index
and/or adding query servers to hold index partitions.
- Scenario 3—No disk space on database server:
A good SharePoint administrator always keeps an eye on Central
Administration and never ignores warning messages from SharePoint 2010.
SharePoint 2010 is especially good at notifying the administrator of
any issues, including database size issues. Should the database
server run out of space, SharePoint 2010 will alert the administrator.
This scenario most often happens during crawling, when the databases
have been set to grow automatically. The natural fix for database
disk space issues is to add disks to the array or, if the solution is
not using RAID or another redundant storage solution, to increase the
size of the disks on the server.
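Scenario 1's 15 documents-per-second threshold becomes more concrete when translated into full-crawl time. The corpus size below is a hypothetical example, not a figure from the text:

```python
# Back-of-the-envelope check for the stale-content scenario: at a given
# crawl rate, how long does a full crawl of the corpus take? The corpus
# size is illustrative; 15 docs/sec is the warning threshold discussed in
# the Crawl Rate per Content Source health report scenario.

def full_crawl_hours(corpus_docs, docs_per_second):
    return corpus_docs / docs_per_second / 3600

# A 10-million-document corpus crawled at the 15 docs/sec threshold:
print(round(full_crawl_hours(10_000_000, 15), 1))  # 185.2
```

At that rate a full crawl takes well over a week, which explains why a crawl rate below the threshold shows up to users as missing or stale content.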
Disk Recommendations
Search is an input/output–intensive process.
Basically, a search engine indexes all the content from a particular
source (e.g., SharePoint site, web site, database, file share, etc.)
and builds databases with the words indexed, the links to the
documents, metadata found on the documents, metadata associated with
the documents, time, size, type, etc. The crawling process writes all
this information to databases. SharePoint stores some of these
databases in SQL and some on the file structure in index partitions.
The query components then access these databases, look for the terms
that were searched for (these may be free text or property-based), and
match the terms with the documents and their properties. Doing this
requires a great deal of reading from and writing to hard drives.
Therefore, hardware that performs well is an important aspect of
improving search performance.
Generally speaking, to support search well,
databases will need to be in some kind of disk array. For
write-intensive databases, such as a crawl database, RAID 10 is
recommended. For added performance, the temp database should be
separated to a RAID 10 array. (See more in the following sections, and
consult external resources for more information on RAID.) In addition,
there should be a redundant array to avoid downtime. Greater redundancy
generally costs some write performance, so finding the
right balance for the requirements is key.
RAID stands for redundant array of independent
disks. It is the technology of combining several disks into a single
logical array; in a mirrored configuration, if one disk fails, another
holds a copy of the data and takes over the load. Different RAID levels
make different trade-offs among capacity, redundancy, and performance:
levels such as RAID 10, for example, combine mirroring with striping to
provide both fault tolerance and improved throughput.
Search queries require a lot of throughput, so
having well-performing databases can help. On the database servers, the
recommended input/output operations per second (IOPS) capability is
3,500 to 7,000 IOPS for the crawl database and at least 2,000 IOPS for
the property database.
Therefore, at minimum, disk speeds of 7,200 RPM or better are required
in a RAID configuration. Highly performant search implementations also
make use of storage area networks and sometimes solid state drives
(which may support up to 1,000,000 IOPS in a storage array vs. 90 IOPS
for a single 7,200 RPM SATA drive).
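The gap between those IOPS targets and a single spindle's throughput can be made concrete with simple arithmetic. The 90 IOPS figure for a 7,200 RPM SATA drive comes from the text above; the 15K SAS and SSD per-drive figures are rough assumptions, and the calculation ignores RAID write penalties and controller caching:

```python
# Illustrative spindle-count arithmetic for the IOPS targets above.
# Only the 90 IOPS SATA figure comes from the text; the SAS and SSD
# values are assumed ballpark numbers. Real arrays need a proper sizing
# exercise that accounts for RAID write penalties and caching.
import math

IOPS_PER_DRIVE = {"7200_sata": 90, "15000_sas": 180, "ssd": 5000}

def drives_needed(target_iops, drive_type):
    return math.ceil(target_iops / IOPS_PER_DRIVE[drive_type])

# Reaching the 3,500 IOPS crawl-database target:
print(drives_needed(3500, "7200_sata"))  # 39
print(drives_needed(3500, "15000_sas"))  # 20
print(drives_needed(3500, "ssd"))        # 1
```

The point of the sketch is the order of magnitude: meeting the crawl-database target on commodity SATA spindles alone takes dozens of drives, which is why SANs and solid state storage appear in high-performance search deployments.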
4. Availability
Availability in IT systems is the measure of
how often servers and their services are available for users to access.
Of course, all organizations would like to have 100% availability. This
level of availability is either not possible, not practical, or too
expensive for most organizations. Therefore, most organizations endeavor
to achieve the highest availability possible for a reasonable cost.
Generally, availability is measured by a number of nines, referring to
how close to 100% availability a system comes in increments of
additional nines, from 90% up to 99.999…%. A common
expression is “5 nines of uptime,” or 99.999%. This represents roughly
5 minutes and 15 seconds of allowable downtime per year.
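The downtime allowance for each level of nines follows directly from the percentage, as the short calculation below shows (using a 365-day year):

```python
# Allowable downtime per 365-day year at each availability level.

def downtime_minutes_per_year(availability_pct):
    return (100 - availability_pct) / 100 * 365 * 24 * 60

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
# The last line printed is: 99.999% -> 5.3 min/year
```

Each additional nine cuts the allowable downtime by a factor of ten, which is why each one is progressively more expensive to achieve.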
The main method to achieve high availability in
IT systems is by making them redundant—that is, adding copies (or
mirrors) of the systems that will take over the tasks should hardware
or software fail (which it inevitably does). Redundancy in SharePoint
deployments can ensure uninterrupted service for end users. It can be
implemented as a solution for failover scenarios (a server dies), but
it can also ease maintenance tasks and allow for performance and server
management without adversely affecting the uptime of the solution.
SharePoint 2010 handles redundancy in a number
of ways by allowing for mirrored services over multiple machines. The
farm architecture of SharePoint also allows for one server to
compensate for another if one of the servers in the farm fails or is
taken down for maintenance.
Each component in SharePoint 2010 Search
represents a possible failure point. Therefore, the components are made
capable of being mirrored. The most common areas to mirror are as
follows:
- Web servers: Although not strictly part of the search
engine, web servers deliver the content that SharePoint 2010 Search is
indexing. If these servers fail and there is no redundancy, then the
index will not refresh. However, if the web servers fail, there will be
no site, so users will probably not be primarily concerned with the
lack of search.
- Query servers: Multiple query servers can improve
performance but also provide for redundancy by providing an alternative
source for the queries to be processed. Mirroring index partitions
across query servers ensures that all portions of the index are
available if one server should fail.
- Crawl servers: If constant index freshness is a critical
business need, mirroring crawl servers can help the database and
indexes stay up to date should one fail.
- Database servers: One of the most critical areas to provide
redundancy in is the database servers. Database servers provide the
data for the query components and the repository for the indexing
components to write to. Database failure will cause search to fail
completely, so it is essential to have databases constantly available.
Why and When to Consider Making Services Redundant
Any service that is considered critical should
have a failover component. Being able to search and receive results is
a critical feature of SharePoint. Therefore, the services that allow
users to query the index and receive results should have redundancy.
This includes query servers and any index partitions on those query
servers.
All servers with the query role should have
some level of redundancy. Ideally, all index partitions should have a
redundant instance. So, if there is one server with the query role and
two index partitions on that server, a redundant query server with a
second instance of both index partitions will be necessary.
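The mirroring rule just described can be sketched as a simple placement function. The server and partition names below are hypothetical, and the round-robin assignment is one illustrative strategy, not SharePoint's own placement logic:

```python
# A minimal sketch of the redundancy rule above: every index partition
# gets a primary and a mirror instance on two different query servers.
# Names are hypothetical; the round-robin layout is an assumption.

def place_mirrors(partitions, servers):
    """Assign each partition a primary and a mirror on distinct servers."""
    if len(servers) < 2:
        raise ValueError("mirroring requires at least two query servers")
    layout = {}
    for i, part in enumerate(partitions):
        primary = servers[i % len(servers)]
        mirror = servers[(i + 1) % len(servers)]
        layout[part] = (primary, mirror)
    return layout

# Two partitions, two query servers: each server holds one primary and
# one mirror, so either server can fail without losing index coverage.
print(place_mirrors(["P1", "P2"], ["QS1", "QS2"]))
# {'P1': ('QS1', 'QS2'), 'P2': ('QS2', 'QS1')}
```

With this layout, the loss of either query server still leaves a complete copy of the index available to serve queries.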
This is considered the minimum redundancy for
SharePoint Search, as it ensures that, given a hardware failure on a
query server, search will not fail with an error for users.
Additionally, database servers should have a
redundancy element as they hold the crawl and property databases
essential to searching as well as the Administration database, without
which search cannot be performed. Most organizations already have
redundant database implementations, such as database
clusters. Either the search databases should be installed in these
existing database clusters, or a similar model should be employed to
ensure redundancy.
If freshness of content is considered critical,
making crawl servers redundant is required. If freshness of content is
not a critical factor, having a less strict redundancy model for
indexing can be acceptable. However, should a failure happen, the crawl
components must be re-provisioned on new or existing servers.
Server Downtime Impact Chart
Although no one likes downtime, avoiding all
downtime through full redundancy may be too costly, too cumbersome,
or too maintenance-intensive for some organizations. Therefore, it is
valuable to understand the impact of downtime for the individual
components of SharePoint 2010 Search and determine which may be
acceptable to the organization. See Table 1.
Note
A dedicated search farm need not have the web server role if this role
is provided by a content farm, which connects to the search farm.
Table 1. Server Downtime Impact Chart

Server Role | Severity | Impact of Downtime
Web server role | Critical | Users cannot see the search page (and possibly not SharePoint content, if the web server role is shared).
Query server role | High | Users will not be able to search for content. The search page may still work but will return an error.
Crawl server role | Medium | Search will still work and results will be returned, but the content will not be up to date and indexes will not be refreshed.
Database server role | Critical | Users will not be able to search for content. The search page may still work but will return an error (SharePoint content may also not be returned if the database server role is shared).