SQL Server 2012 : Delivering A SQL Server Health Check (part 5)

12/7/2013 8:21:23 PM

Next, the results of the query shown in Listing 17 indicate which database files are seeing the most I/O stalls.

LISTING 17: I/O stall information by database file

-- Calculates average stalls per read, per write, 
-- and per total input/output for each database file. 
SELECT DB_NAME(fs.database_id) AS [Database Name], mf.physical_name, 
io_stall_read_ms, num_of_reads,
CAST(io_stall_read_ms/(1.0 + num_of_reads) AS NUMERIC(10,1)) AS 
[avg_read_stall_ms],io_stall_write_ms, 
num_of_writes,CAST(io_stall_write_ms/(1.0+num_of_writes) AS NUMERIC(10,1)) AS 
[avg_write_stall_ms],
io_stall_read_ms + io_stall_write_ms AS [io_stalls], num_of_reads + num_of_writes 
AS [total_io],
CAST((io_stall_read_ms + io_stall_write_ms)/(1.0 + num_of_reads + num_of_writes) AS
NUMERIC(10,1)) 
AS [avg_io_stall_ms]
FROM sys.dm_io_virtual_file_stats(null,null) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id= mf.database_id
AND fs.[file_id] = mf.[file_id]
ORDER BY avg_io_stall_ms DESC OPTION (RECOMPILE);
 
-- Helps determine which database files on 
-- the entire instance have the most I/O bottlenecks

This query lists each database file (data and log) on the instance, ordered by the average I/O stall time in milliseconds. This is one way of determining which database files are spending the most time waiting on I/O. It also gives you a better idea of the read/write activity for each database file, which helps you characterize your workload by database file. If you see a lot of database files on the same drive that are at the top of the list for this query, that could be an indication that you are seeing disk I/O bottlenecks on that drive. You would want to investigate this issue further, using Windows Performance Monitor metrics such as Avg Disk Sec/Write and Avg Disk Sec/Read for that logical disk. After you have gathered more metrics and evidence, talk to your system administrator or storage administrator about this issue. Depending on what type of storage you are using , it might be possible to improve the I/O performance situation by adding more spindles, changing the RAID controller cache policy, or changing the RAID level. You also might consider moving some of your database files to other drives if possible.

Now, using the query shown in Listing 18, you are going to see which user databases on the instance are using the most memory.

LISTING 18: Total buffer usage by database

-- Get total buffer usage by database for current instance
SELECT DB_NAME(database_id) AS [Database Name],
COUNT(*) * 8/1024.0 AS [Cached Size (MB)]
FROM sys.dm_os_buffer_descriptors WITH (NOLOCK)
WHERE database_id > 4 -- system databases
AND database_id <> 32767 -- ResourceDB
GROUP BY DB_NAME(database_id)
ORDER BY [Cached Size (MB)] DESC OPTION (RECOMPILE);
 
-- Tells you how much memory (in the buffer pool) 
-- is being used by each database on the instance

This query will list the total buffer usage for each user database running on the current instance. Especially if you are seeing signs of internal memory pressure, you are going to be interested in knowing which database(s) are using the most space in the buffer pool. One way to reduce memory usage in a particular database is to ensure that you don’t have a lot of missing indexes on large tables that are causing a large number of index or table scans. Another way, if you have SQL Server Enterprise Edition, is to start using SQL Server data compression on some of your larger indexes (if they are good candidates for data compression). The ideal candidate for data compression is a large static table that is highly compressible because of the data types and actual data in the table. A bad candidate for data compression is a small, highly volatile table that does not compress well. A compressed index will stay compressed in the buffer pool, unless any data is updated. This means that you may be able to save many gigabytes of space in your buffer pool under ideal circumstances.

Next, you will take a look at which user databases on the instance are using the most processor time by using the query shown in Listing 19.

LISTING 19: CPU usage by database

-- Get CPU utilization by database
WITH DB_CPU_Stats
AS
(SELECT DatabaseID, DB_Name(DatabaseID) AS [DatabaseName], 
 SUM(total_worker_time) AS [CPU_Time_Ms]
 FROM sys.dm_exec_query_stats AS qs
 CROSS APPLY (SELECT CONVERT(int, value) AS [DatabaseID] 
              FROM sys.dm_exec_plan_attributes(qs.plan_handle)
              WHERE attribute = N’dbid’) AS F_DB
 GROUP BY DatabaseID)
SELECT ROW_NUMBER() OVER(ORDER BY [CPU_Time_Ms] DESC) AS [row_num],
       DatabaseName, [CPU_Time_Ms], 
       CAST([CPU_Time_Ms] * 1.0 / SUM([CPU_Time_Ms]) 
       OVER() * 100.0 AS DECIMAL(5, 2)) AS [CPUPercent]
FROM DB_CPU_Stats
WHERE DatabaseID > 4 -- system databases
AND DatabaseID <> 32767 -- ResourceDB
ORDER BY row_num OPTION (RECOMPILE);
 
-- Helps determine which database is 
-- using the most CPU resources on the instance

This query shows you the CPU utilization time by database for the entire instance. It can help you characterize your workload, but you need to take the results with a bit of caution. If you have recently cleared the plan cache for a particular database, using the DBCC FLUSHPROCINDB (database_id) command, it will throw off the overall CPU utilization by database numbers for the query. Still, this query can be useful for getting a rough idea of which database(s) are using the most CPU on your instance.

The next query, shown in Listing 20, is extremely useful. It rolls up the top cumulative wait statistics since SQL Server was last restarted or since the wait statistics were cleared by using the DBCC SQLPERF (’sys.dm_os_wait_stats’, CLEAR) command.

LISTING 20: Top cumulative wait types for the instance

-- Isolate top waits for server instance since last restart or statistics clear
WITH Waits AS
(SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,
100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
FROM sys.dm_os_wait_stats WITH (NOLOCK)
WHERE wait_type NOT IN (N'CLR_SEMAPHORE',N'LAZYWRITER_SLEEP',N'RESOURCE_QUEUE',
N'SLEEP_TASK',N'SLEEP_SYSTEMTASK',N'SQLTRACE_BUFFER_FLUSH',N'WAITFOR', 
N'LOGMGR_QUEUE',N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
N'XE_TIMER_EVENT',N'BROKER_TO_FLUSH',N'BROKER_TASK_STOP',N'CLR_MANUAL_EVENT',
N'CLR_AUTO_EVENT',N'DISPATCHER_QUEUE_SEMAPHORE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',
N'XE_DISPATCHER_WAIT', N'XE_DISPATCHER_JOIN', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'ONDEMAND_TASK_QUEUE', N'BROKER_EVENTHANDLER', N'SLEEP_BPOOL_FLUSH',
N'DIRTY_PAGE_POLL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 
N'SP_SERVER_DIAGNOSTICS_SLEEP'))
 
SELECT W1.wait_type, 
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2
ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 99 OPTION (RECOMPILE); -- percentage threshold
 
-- Clear Wait Stats 
-- DBCC SQLPERF(’sys.dm_os_wait_stats’, CLEAR);

This query will help you zero in on what your SQL Server instance is spending the most time waiting for. Especially if your SQL Server instance is under stress or having performance problems, this can be very valuable information. Knowing that your top cumulative wait types are all I/O related can point you in the right direction for doing further evidence gathering and investigation of your I/O subsystem. However, be aware of several important caveats when using and interpreting the results of this query.

First, this is only a rollup of wait types since the last time your SQL Server instance was restarted, or the last time your wait statistics were cleared. If your SQL Server instance has been running for several months and something important was recently changed, the cumulative wait stats will not show the current actual top wait types, but will instead be weighted toward the overall top wait types over the entire time the instance has been running. This will give you a false picture of the current situation.

Second, there are literally hundreds of different wait types (with more being added in each new version of SQL Server). There is a lot of bad information on the Internet about what many wait types mean, and how you should consider addressing them. Bob Ward, who works for Microsoft Support, is a very reliable source for SQL Server wait type information. He has a SQL Server Wait Type Repository available online at http://blogs.msdn.com/b/psssql/archive/2009/11/03/the-sql-server-wait-type-repository.aspx that documents many SQL Server wait types, including what action you might want to take to alleviate that wait type.

Finally, many common wait types are called benign wait types, meaning you can safely ignore them in most situations. The most common benign wait types are filtered out in the NOT IN clause of the health check query to make the results more relevant. Even so, I constantly get questions from DBAs who are obsessing over a particular wait type that shows up in this query. My answer is basically that if your database instance is running well, with no other signs of stress, you probably don’t need to worry too much about your top wait type, particularly if it is an uncommon wait type. SQL Server is always waiting on something; but if the server is running well, with no other warning signs, you should relax a little.

Next, using the query shown in Listing 21, you are going to look at the cumulative signal (CPU) waits for the instance.

LISTING 21: Signal waits for the instance

-- Signal Waits for instance
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2))
 AS [%signal (cpu) waits],
CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM (wait_time_ms) AS 
NUMERIC(20,2)) AS [%resource waits]
FROM sys.dm_os_wait_stats WITH (NOLOCK) OPTION (RECOMPILE);
 
-- Signal Waits above 15-20% is usually a sign of CPU pressure

Signal waits are CPU-related waits. If you are seeing other signs of CPU pressure on your SQL Server instance, this query can help confirm or deny the fact that you are seeing sustained cumulative CPU pressure. Usually, seeing signal waits above 15–20% is a sign of CPU pressure.

Others