4.2 System Center Operations Manager
System Center Operations Manager
(SCOM) is Microsoft’s enterprise monitoring tool and part of
the System Center suite. SCOM provides a powerful, flexible, and
highly configurable platform for building a monitoring solution, but
it takes considerable effort to configure and tune. In addition, the
management packs for SQL Server provided by Microsoft have been updated
(rather than rewritten) across several versions of SQL Server. As a
result, the management packs often use legacy technologies and don’t
provide optimal feature coverage for new releases.
The bottom line is that you need to make a
significant investment in terms of designing, deploying, configuring,
tuning, and developing in order to create a meaningful monitoring
solution with SCOM.
Design and Implementation
The System Center Operations Manager
solution consists of a number of key components (some of which are
shared with the technology used in System Center Advisor): the
Agent, which must be installed on each server to be monitored; the
Gateway, which collects monitoring data; the Root Management Server
(RMS), where the data is stored and aggregated and alerts are
generated; and the Console, where DBAs and systems engineers
manage the environment. Figure 12 shows a typical SCOM deployment scenario.
The Agent must be installed on each target
server that will be monitored, and communication must be enabled with
its gateway. If the target server and gateway are not in the same
security zone (i.e., they are not in the same domain, or one of them is
in a workgroup), then certificates must be used to provide
authentication between the target server and gateway. Each server can
report to up to six management groups.
The Gateway role is both a security boundary and
an architectural scalability point. Given that the SCOM platform is
designed to scale to monitor many thousands of devices, the RMS could
become a point of contention if all devices were set up to report
directly to this host. Instead, the Gateway servers provide a point of
scale-out for the monitoring infrastructure. Additionally, in scenarios
in which organizations operate from multiple locations or use different
security zones, gateway servers can be used as a security boundary and
as a point of aggregation for data flowing to the RMS. Agents are
“homed” to a given Gateway, and a PowerShell script can be used to
assign a failover Gateway, which makes the solution fault tolerant.
The top tier in the hierarchy is the Root
Management Server (RMS), which is the central point for configuration
and changes (such as new agents, rules, or monitors). The RMS must be
able to communicate with all Gateway servers; if no Active
Directory trust exists, certificate authentication must be configured.
Rules and Monitors
Two types of checks are carried out by
SCOM: rules and monitors. Both collect data, and understanding the
difference between them is crucial for determining which should be used.
A monitor operates in near real time and is the
only way to alter the health state of a managed object. The health
state also resets automatically once the condition is resolved.
Low disk space is an example: once space is released, the monitor
returns to a healthy state by itself. The data a monitor collects is
not stored.
A rule is typically used to collect data about a
specific object (e.g., Avg Disk Transfer/sec for a storage performance
baseline). Rules can also be used to raise an alert without
affecting health state; these alerts must be resolved manually.
Data collected by a rule is stored in the data warehouse.
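The difference between the two check types can be illustrated with a small conceptual model. This is a minimal sketch only, not SCOM code: the class names DiskSpaceMonitor and CollectionRule, their fields, and the thresholds are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiskSpaceMonitor:
    """Toy monitor: owns a health state and resets it when the condition clears."""
    threshold_pct_free: float
    healthy: bool = True

    def evaluate(self, pct_free: float) -> bool:
        # The sample is evaluated and discarded; only the state is kept.
        self.healthy = pct_free >= self.threshold_pct_free
        return self.healthy


@dataclass
class CollectionRule:
    """Toy rule: stores samples and may raise alerts, never touches health state."""
    alert_above: float
    samples: List[float] = field(default_factory=list)
    open_alerts: List[str] = field(default_factory=list)

    def collect(self, value: float) -> None:
        self.samples.append(value)  # retained, standing in for the data warehouse
        if value > self.alert_above:
            self.open_alerts.append(f"value {value} exceeded {self.alert_above}")

    def close_alert(self, index: int) -> None:
        # Rule-generated alerts stay open until someone closes them.
        del self.open_alerts[index]


monitor = DiskSpaceMonitor(threshold_pct_free=10.0)
monitor.evaluate(5.0)    # low disk space: state becomes unhealthy
monitor.evaluate(25.0)   # space released: state resets without intervention

rule = CollectionRule(alert_above=0.025)
for latency in (0.010, 0.040, 0.012):
    rule.collect(latency)
```

In this model the monitor ends up healthy again with nothing stored, while the rule keeps all three samples and leaves one alert open until close_alert is called.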
Alerts
The final fundamental SCOM concept to
understand is alerts. An alert is not an e-mail or page notification,
but an event that can be triggered by a monitor or rule. Alerts are
displayed in the SCOM Console, on the Alerts tab, where they are
sorted in order of priority by default. A notification is a method of
communication (such as e-mail, SMS, or pager) that is fired on an alert.
Calibration is the process of tuning alerts to
ensure the correct level of sensitivity. An environment can contain
vastly different database workloads, Windows and SQL Server
configuration settings, and levels of optimization, so the definition
of a healthy server can also vary. Alert calibration refines thresholds
on a per-server basis to ensure that alerts are meaningful.
Alert tuning takes the form of overrides, which
replace the default thresholds of a given rule or monitor with custom
values for a specific server or group (e.g., All Windows 2008
Logical Disks or All SQL Server 2008 databases).
When creating overrides, store
them outside the “sealed” management packs that are provided by
Microsoft. This isolates the pre-packaged, downloaded
management packs from anything that is organization or server specific.
Define an organization standard for naming the management packs where
overrides are saved. For example, you could create a new MP for the
Windows Server 2008 R2 customizations and name it Windows Server 2008
R2 — Overrides. This clearly delimits the in-box and custom
functionality.
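The idea of keeping overrides apart from sealed defaults can be sketched as a simple lookup. This is a conceptual model, not the SCOM override engine: the setting names, the server name SQLPROD01, the threshold values, and the precedence order (server override, then group override, then default) are all assumptions made for illustration.

```python
from types import MappingProxyType
from typing import Dict, Mapping, Optional, Tuple

# Read-only defaults stand in for a sealed management pack.
SEALED_DEFAULTS: Mapping[str, float] = MappingProxyType({
    "logical_disk_free_pct": 10.0,
    "db_log_free_pct": 20.0,
})

# Overrides live in a separate structure keyed by (target, setting).
# A target is either a server name or a group name (hypothetical values).
overrides: Dict[Tuple[str, str], float] = {
    ("All Windows 2008 Logical Disks", "logical_disk_free_pct"): 15.0,
    ("SQLPROD01", "logical_disk_free_pct"): 5.0,
}


def effective_threshold(
    setting: str, server: str, group: Optional[str] = None
) -> float:
    """Resolve a threshold: server override, then group override, then default."""
    if (server, setting) in overrides:
        return overrides[(server, setting)]
    if group is not None and (group, setting) in overrides:
        return overrides[(group, setting)]
    return SEALED_DEFAULTS[setting]
```

Because the defaults are never modified, discarding the overrides dictionary restores the in-box behavior, which mirrors the isolation that a separate overrides management pack provides.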
Importing Management Packs
The Windows and SQL Server management
packs (MPs) are published by Microsoft, version controlled, and
released for public consumption free of charge. Download the latest
version and import it into SCOM. Any dependencies between management
packs are indicated at the time of import. The MP download includes a
guide in Word format that describes the setup process, rules, and
monitors, and lists any last-minute breaking changes.
The import/export functionality can also be used
as a backup and recovery method for custom management packs in case a
management pack rollback is required.
SCOM AND SQL AGENT
By default, SCOM raises an alert
only on job failure. If a step fails but the “On failure” action for
that job step is set to continue, then no alert is raised. This is the
out-of-the-box behavior and may be changed if required.
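The default behavior described above can be expressed as a small decision function. This is a conceptual sketch rather than SCOM or SQL Agent code: the JobRun and StepResult types and the alert_on_step_failure switch are hypothetical, and the sketch assumes the job reports overall success when a failed step is set to continue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    name: str
    succeeded: bool


@dataclass
class JobRun:
    name: str
    job_succeeded: bool      # overall outcome reported for the job
    steps: List[StepResult]


def should_alert(run: JobRun, alert_on_step_failure: bool = False) -> bool:
    """Return True if this run should raise an alert."""
    if not run.job_succeeded:
        return True          # default behavior: alert on job failure only
    if alert_on_step_failure:
        # Stricter, customized behavior: any failed step raises an alert.
        return any(not step.succeeded for step in run.steps)
    return False


# A run where one step failed but was configured to continue (assumed data).
nightly = JobRun(
    name="Nightly maintenance",
    job_succeeded=True,
    steps=[
        StepResult("Check integrity", False),
        StepResult("Rebuild indexes", True),
    ],
)
```

With the default setting, should_alert(nightly) stays silent even though a step failed; passing alert_on_step_failure=True models the changed behavior.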
Management Pack Authoring
The greatest value derived from any
monitoring process comes from health checks that identify key
aspects of the application platform and provide detailed data
collection. SCOM supports the development of this kind of custom
monitoring through management pack authoring.
One such example for SQL Server is
checking for the most recent full backup, a feature that isn’t included
out-of-the-box. SCOM can alert based on
SQL Agent job failures; however, in some situations SQL Agent is
disabled, the database maintenance job schedule becomes disabled, or
the backup job does not run for some other reason. Without proactive
monitoring that checks for the last good backup, situations like these
could go unnoticed for some time. This is a scenario in
which authoring a custom monitor to check for the backup event would be
useful.
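The logic behind such a backup check can be sketched independently of the management pack tooling. The following is a minimal illustration, not an authored SCOM monitor: the function name stale_backups, the 24-hour default, and the idea of feeding it the results of the embedded query are assumptions. The query itself relies on msdb.dbo.backupset, where type 'D' denotes a full database backup.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Query a collection script could run to find the last full backup per database.
LAST_FULL_BACKUP_SQL = """
SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name AND b.type = 'D'
WHERE d.name <> 'tempdb'
GROUP BY d.name;
"""


def stale_backups(
    last_full: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return databases with no full backup, or none newer than max_age."""
    return sorted(
        name
        for name, finished in last_full.items()
        if finished is None or now - finished > max_age
    )
```

Because the check looks at the backup history rather than at job outcomes, it still flags a database when SQL Agent is stopped or the schedule is disabled, which is the gap described above.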