SQL Server 2012 : Delivering Manageability and Performance (part 8) - OTHER MICROSOFT TOOLS FOR MANAGING SQL SERVER - System Center Operations Manager

11/18/2013 2:43:19 AM

4.2 System Center Operations Manager

System Center Operations Manager (SCOM) is Microsoft’s enterprise monitoring tool and part of the System Center suite. SCOM provides a powerful, flexible, and highly configurable platform for building a monitoring solution, but it requires a lot of work. In addition, the management packs for SQL Server provided by Microsoft have been updated (rather than rewritten) across several versions of SQL Server. As a result, the management packs often use legacy technologies and don’t provide optimal feature coverage for new releases.

The bottom line is that you need to make a significant investment in terms of designing, deploying, configuring, tuning, and developing in order to create a meaningful monitoring solution with SCOM.

Design and Implementation

The System Center Operations Manager solution consists of a number of key components (some of which are shared with the technology used in System Center Advisor), including an Agent, which must be installed on each server to be monitored; the Gateway, which collects monitoring data; the Root Management Server (RMS), where the data is stored and aggregated and alerts are generated; and the Console, which is where DBAs and systems engineers can manage an environment. Figure 12 shows a typical SCOM deployment scenario.

FIGURE 12

The Agent must be installed onto each target server that will be monitored, and communication must be enabled with its gateway. If the target server and gateway are not in the same security zone (i.e., not in the same domain or in a workgroup), then certificates must be used to provide authentication between the target server and gateway. Each server can report to up to six management groups.

The Gateway role is both a security boundary and an architectural scalability point. Given that the SCOM platform is designed to scale to monitor many thousands of devices, the RMS could become a point of contention if all devices reported directly to it. Instead, the Gateway servers provide a point of scale-out for the monitoring infrastructure. Additionally, where organizations operate from multiple locations or use different security zones, Gateway servers can act as a security boundary and as a point of aggregation for data flowing to the RMS. Agents are “homed” to a given Gateway, and a PowerShell script can be used to assign a failover Gateway, which makes the solution fault tolerant.

The top tier in the hierarchy is the Root Management Server (RMS), which is the central point for configuration and changes (new agents and rules or monitors). The RMS must be able to communicate with all Gateway servers; if no Active Directory trust exists, certificate authentication must be configured.

Rules and Monitors

Two types of checks are carried out by SCOM: rules and monitors. Both collect data, and understanding the difference between them is crucial for determining which should be used.

A monitor is a near real-time operation, and the only way to alter the health state of a managed object. Additionally, the health state changes automatically once the condition is resolved. An example is low disk space; once space is released, the monitor will resolve automatically. Collected data is not stored.

A rule is typically used to collect data about a specific object (e.g., Avg Disk Transfer/sec for a storage performance baseline). Rules may also be useful to create an alert without affecting health state. These alerts must be resolved manually. Collected data is stored in the data warehouse.
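The distinction between the two checks can be illustrated with a small Python model. This is only a conceptual sketch, not SCOM code: the class names `DiskSpaceMonitor` and `PerfCollectionRule`, their fields, and the threshold values are all hypothetical and chosen purely to mirror the behavior described above.

```python
from dataclasses import dataclass, field


@dataclass
class DiskSpaceMonitor:
    """Toy monitor (hypothetical): drives health state and keeps no history."""
    threshold_pct_free: float
    healthy: bool = True

    def evaluate(self, pct_free: float) -> bool:
        # Health state simply follows the current condition, so it returns
        # to healthy by itself once enough space is released.
        self.healthy = pct_free >= self.threshold_pct_free
        return self.healthy


@dataclass
class PerfCollectionRule:
    """Toy rule (hypothetical): stores samples, may alert, never changes health."""
    alert_above: float
    samples: list = field(default_factory=list)
    open_alerts: list = field(default_factory=list)

    def collect(self, value: float) -> None:
        # Every sample is kept, standing in for the data warehouse.
        self.samples.append(value)
        if value > self.alert_above:
            # The alert stays open even if later samples are back to normal.
            self.open_alerts.append(value)

    def close_alerts(self) -> None:
        # Stands in for an operator resolving the alerts by hand.
        self.open_alerts.clear()
```

In this model a monitor that sees 5 percent free space and then 20 percent free space ends up healthy again with no record of the dip, whereas a rule that sees one slow sample keeps both the sample and the alert until `close_alerts()` is called.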

Alerts

The final fundamental SCOM concept to understand is alerts. An alert is not an e-mail or page notification, but an event that can be triggered by a monitor or rule. Alerts are displayed in the SCOM Console under the Alerts tab, where they are sorted in order of priority by default. A notification is a method of communication (such as e-mail, SMS, or pager) that is fired on an alert.

Calibration is the process of tuning alerts to ensure the correct level of sensitivity. An environment can contain vastly different database workloads, Windows and SQL Server configuration settings, and optimization, so the concept of a healthy server can also vary. Alert calibration refines thresholds on a per-server basis to ensure that alerts are meaningful.

Alert tuning takes the form of overrides, which modify thresholds from the standard to customize the values of a given rule or monitor for a specific server or group (e.g., All Windows 2008 Logical Disks or All SQL Server 2008 databases).
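One way to picture how an override changes the threshold that applies to a given server is the lookup below. It is a simplified sketch under stated assumptions: the function name `effective_threshold`, the dictionary-based storage, and the precedence order (server override first, then group override, then the standard value) are illustrative and are not taken from SCOM itself, whose real override resolution has more levels.

```python
def effective_threshold(default, server, groups, server_overrides, group_overrides):
    """Return the threshold that applies to one server (illustrative only).

    Assumed precedence: a server-specific override wins, then the first
    matching group override, then the management pack's standard value.
    """
    if server in server_overrides:
        return server_overrides[server]
    for group in groups:
        if group in group_overrides:
            return group_overrides[group]
    return default


# Hypothetical values: a 10% free-space standard, relaxed for one group
# and tightened for a single busy server.
group_overrides = {"All Windows 2008 Logical Disks": 5}
server_overrides = {"SQLPROD01": 20}
```

With these sample values, `SQLPROD01` is checked against 20, any other member of the group against 5, and a server outside the group against the standard 10.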

When creating overrides, it is useful to store these outside the “sealed” management packs that are provided by Microsoft. This provides isolation between the pre-packaged, downloaded management packs and anything that is organization or server specific. Define an organization standard for naming the management packs where overrides are saved — for example, you could create a new MP for the Windows Server 2008 R2 customizations and name it Windows Server 2008 R2 — Overrides. This clearly delimits the in-box and custom functionality.

Importing Management Packs

The Windows and SQL Server management packs (MPs) are published by Microsoft, version controlled, and released for public consumption free of charge. Download the latest version and import it into SCOM. Any dependencies between management packs are indicated at the time of import. The MP download includes a Word document that is a guide to describe the setup process, rules, and monitors, and contains any last-minute breaking changes.

The import/export functionality can also be used as a backup and recovery method for custom management packs in case a management pack rollback is required.

SCOM AND SQL AGENT

By default, SCOM raises an alert only when a SQL Agent job fails. If a step fails but its “On failure” action is set to continue, and the job as a whole still succeeds, no alert is raised. This is the out-of-the-box behavior and may be changed if required.
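The difference between job-level and step-level failure can be expressed as a small decision function. This is a conceptual sketch only: `job_raises_alert` and its `alert_on_step_failure` switch are hypothetical names standing in for the default behavior and a customized one, not SCOM settings.

```python
def job_raises_alert(step_outcomes, job_succeeded, alert_on_step_failure=False):
    """Decide whether a SQL Agent job run would produce an alert (illustrative).

    step_outcomes: list of booleans, True for each step that succeeded.
    job_succeeded: overall job outcome as reported by SQL Agent.
    alert_on_step_failure: hypothetical switch modelling a customized check.
    """
    if not job_succeeded:
        # Default behavior: a failed job always alerts.
        return True
    if alert_on_step_failure and not all(step_outcomes):
        # Customized behavior: a failed step alerts even if the job continued.
        return True
    return False
```

A job whose second step failed but continued to a successful finish returns `False` with the default setting and `True` once step-level alerting is switched on.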

Management Pack Authoring

The greatest value derived from any monitoring process is the creation of health checks that identify key aspects of the application platform and provide detailed data collection. As such, SCOM is a great platform to develop this custom monitoring in the form of management pack authoring.

One such example for SQL Server is checking for the most recent full backup, a feature that isn’t included out-of-the-box. SCOM can alert on SQL Agent job failures, but that does not cover every case: SQL Agent may be disabled, the database maintenance job schedule may be disabled, or the backup job may not run for some other reason. Without proactive monitoring to check for the last good backup, situations like these could continue unnoticed for some time. This is a good scenario in which authoring a custom monitor to check for the backup event would be useful.
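The logic such a custom monitor needs is simple, and a minimal Python sketch of it follows. It is not a management pack: the function name `stale_backups` and the 24-hour limit are assumptions for illustration. The query string shows one way the input could be gathered from the `msdb.dbo.backupset` history table (where type `'D'` marks a full database backup), but the function itself works on plain data so that the comparison can be tested without a server.

```python
from datetime import datetime, timedelta

# One possible query to feed the check; tempdb is excluded because it
# cannot be backed up. Running it is left to whatever client is in use.
LAST_FULL_BACKUP_SQL = """
SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name AND b.type = 'D'
WHERE d.name <> 'tempdb'
GROUP BY d.name;
"""


def stale_backups(last_full_backup, now, max_age=timedelta(hours=24)):
    """Return database names whose last full backup is missing or too old.

    last_full_backup: dict of database name -> datetime, or None if the
    database has never had a full backup.
    max_age: assumed limit; 24 hours is only an example value.
    """
    stale = []
    for db, finished in sorted(last_full_backup.items()):
        if finished is None or now - finished > max_age:
            stale.append(db)
    return stale
```

Because the check looks at the backup history rather than at job outcomes, it still fires when SQL Agent is stopped or the schedule is disabled, which is exactly the gap described above.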
