Soft NUMA
In some scenarios, you may be able to
work with an SMP server and still get the benefit of having a NUMA-type
structure with SQL Server. You can achieve this by using soft NUMA.
This enables you to use Registry settings to tell SQL Server that it
should configure itself as a NUMA system, using the CPU-to-memory-node
mapping that you specify.
As with anything that requires Registry changes,
you need to take exceptional care, and be sure you have backup and
rollback options at every step of the process.
One common use for soft NUMA is when a SQL Server
is hosting an application that has several different groups of users
with very different query requirements. After configuring your
theoretical 16-processor server for soft NUMA, assigning 4 CPUs to each of
the first two NUMA nodes and the remaining 8 CPUs to a third NUMA node, you would next
configure connection affinity for the three nodes to different ports,
and then change the connection settings for each class of workload, so
that workload A is “affinitized” to port x, which connects to the first
NUMA node; workload B is affinitized to port y, which connects to the
second NUMA node; and all other workloads are affinitized to port z,
which is set to connect to the third NUMA node.
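For reference, the node layout itself is defined under the SQL Server NodeConfiguration Registry key, with one CPUMask value (a DWORD bitmap of CPUs) per soft NUMA node. The following sketch shows what the 16-processor split described above might look like; the version number in the key path (110 corresponds to SQL Server 2012) and the mask values are illustrative, so check them against your own version and hardware before making any changes:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\110\NodeConfiguration
    Node0    CPUMask = 0x0000000F    (CPUs 0-3)
    Node1    CPUMask = 0x000000F0    (CPUs 4-7)
    Node2    CPUMask = 0x0000FF00    (CPUs 8-15)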
CPU Nodes
A CPU node is a logical collection of
CPUs that share some common resource, such as a cache or memory. CPU
nodes live below memory nodes in the SQLOS object hierarchy.
Whereas a memory node may have one or more CPU
nodes associated with it, a CPU node can be associated with only a
single memory node. However, in practice, nearly all configurations
have a 1:1 relationship between memory nodes and CPU nodes.
CPU nodes can be seen in the DMV sys.dm_os_nodes. Use the following query to return select columns from this DMV:
select node_id, node_state_desc, memory_node_id, cpu_affinity_mask
from sys.dm_os_nodes
The results from this query, when run on a single-CPU system, are as follows:
NODE_ID   NODE_STATE_DESC   MEMORY_NODE_ID   CPU_AFFINITY_MASK
0         ONLINE            0                1
32        ONLINE DAC        0                0
The results from the previous query, when run on a 96-processor NUMA system
comprising four NUMA nodes, each containing four six-core sockets (24 cores
per NUMA node, and 96 cores across the whole server), are as follows:
NODE_ID   NODE_STATE_DESC   MEMORY_NODE_ID   CPU_AFFINITY_MASK
0         ONLINE            1                16777215
1         ONLINE            0                281474959933440
2         ONLINE            2                16777215
3         ONLINE            3                281474959933440
64        ONLINE DAC        0                16777215
NOTE
The hex equivalents of the cpu_affinity_mask values in this table are as follows:
16777215 = 0x0000000000FFFFFF
281474959933440 = 0x0000FFFFFF000000
This indicates which processor cores each CPU node can use.
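If you would rather see the affinity masks in hex directly, rather than converting the decimal values by hand, a small variation on the earlier query (shown here as a sketch) converts the bigint mask to varbinary:
select node_id, memory_node_id,
       convert(varbinary(8), cpu_affinity_mask) as cpu_affinity_mask_hex
from sys.dm_os_nodes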
Processor Affinity
CPU affinity is a way to force a
workload to use specific CPUs. It’s another way that you can affect
scheduling and SQL Server SQLOS configuration.
CPU affinity can be managed at several levels.
Outside SQL Server, you can use the operating system’s CPU affinity
settings to restrict the CPUs that SQL Server as a process can use.
Within SQL Server’s configuration settings, you can specify that SQL
Server should use only certain CPUs. This is done using the affinity mask and affinity64 mask
configuration options. Changes to these two options are applied
dynamically, which means that schedulers on CPUs that are either
enabled or disabled while SQL Server is running are affected immediately.
Schedulers associated with CPUs that are disabled will be drained and
set to offline. Schedulers associated with CPUs that are enabled will
be set to online, and will be available for scheduling workers and
executing new tasks.
You can also set SQL Server I/O affinity using
the affinity I/O mask option. This option enables you to force any
I/O-related activities to run only on a specified set of CPUs. Using
connection affinity as described earlier in the section “Soft NUMA,”
you can affinitize network connections to a specific memory node.
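As a sketch of how the instance-level options are set, both the affinity mask and the affinity I/O mask options are changed through sp_configure. The mask values below are purely illustrative (they restrict an 8-CPU instance to CPUs 0-3 for scheduling and CPUs 4-5 for I/O) and would need to be adjusted for your own hardware:
exec sp_configure 'show advanced options', 1;
reconfigure;
-- illustrative masks: CPUs 0-3 for schedulers, CPUs 4-5 for I/O
exec sp_configure 'affinity mask', 15;        -- 0x0000000F
exec sp_configure 'affinity I/O mask', 48;    -- 0x00000030
reconfigure;
Note that while affinity mask changes are applied dynamically, a change to affinity I/O mask takes effect only after the instance is restarted.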
Schedulers
The scheduler node is where the work of scheduling activity occurs. Scheduling occurs against tasks,
which are the requests to do some work handled by the scheduler. One
task may be the optimized query plan that represents the T-SQL you want
to execute; or, in the case of a batch with multiple T-SQL statements,
the task would represent a single optimized query from within the
larger batch.
When SQL Server starts up, it creates one
scheduler for each CPU that it finds on the server, and some additional
schedulers to run other system tasks. If processor affinity is set such
that some CPUs are not enabled for this instance, then the schedulers
associated with those CPUs will be set to a disabled state. This
enables SQL Server to support dynamic affinity settings.
While there is one scheduler per CPU, schedulers
are not bound to a specific CPU, except in the case where CPU affinity
has been set.
Each scheduler is identified by its own unique
scheduler_id. Values from 0–254 are reserved for schedulers running
user requests. Scheduler_id 255 is reserved for the scheduler for the
dedicated administrator connection (DAC). Schedulers with a
scheduler_id > 255 are reserved for system use, and each is typically
dedicated to a specific system task.
The following code sample shows select columns from the DMV sys.dm_os_schedulers:
select parent_node_id, scheduler_id, cpu_id, status, scheduler_address
from sys.dm_os_schedulers
order by scheduler_id
The following results from the preceding
query indicate that scheduler_id 0 is the only scheduler with an id
< 255, which implies that these results came from a single-core
machine. You can also see a scheduler with an ID of 255, which has a
status of VISIBLE ONLINE (DAC),
indicating that this is the scheduler for the DAC. Also shown are three
additional schedulers with IDs > 255. These are the schedulers
reserved for system use.
PARENT_NODE_ID   SCHEDULER_ID   CPU_ID   STATUS                 SCHEDULER_ADDRESS
0                0              0        VISIBLE ONLINE         0x00480040
32               255            0        VISIBLE ONLINE (DAC)   0x03792040
0                257            0        HIDDEN ONLINE          0x006A4040
0                258            0        HIDDEN ONLINE          0x64260040
0                259            0        HIDDEN ONLINE          0x642F0040
Tasks
A task is a request to do some unit of
work. The task itself doesn’t do anything, as it’s just a container for
the unit of work to be done. To actually do something, the task has to
be scheduled by one of the schedulers, and associated with a particular
worker. It’s the worker that actually does something, and you will
learn about workers in the next section.
Tasks can be seen using the DMV sys.dm_os_tasks. The following example shows a query of this DMV:
Select *
from sys.dm_os_tasks
The task is the container for the work that’s being done, but if you look into sys.dm_os_tasks,
there is no indication of exactly what work that is. Figuring out what
each task is doing takes a little more digging. First, dig out the
request_id. This is the key into the DMV sys.dm_exec_requests. Within sys.dm_exec_requests you will find some familiar fields — namely, sql_handle, along with statement_start_offset, statement_end_offset, and plan_handle. You can take either sql_handle or plan_handle and feed them into sys.dm_exec_sql_text (plan_handle | sql_handle) and get back the original T-SQL that is being executed:
Select t.task_address, s.text
From sys.dm_os_tasks as t inner join sys.dm_exec_requests as r
on t.task_address = r.task_address
Cross apply sys.dm_exec_sql_text (r.plan_handle) as s
where r.plan_handle is not null
Workers
A worker is where the work is actually
done, and the work it does is contained within the task. Workers can be
seen using the DMV sys.dm_os_workers:
Select *
From sys.dm_os_workers
Some of the more interesting columns in this DMV are as follows:
- Task_address — Enables you to join back to the task, and from there back to the actual request, and get the text that is being executed
- State — Shows the current state of the worker
- Last_wait_type — Shows the last wait type that this worker was waiting on
- Scheduler_address — Joins back to sys.dm_os_schedulers
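To see those columns in context, the following sketch (patterned on the earlier task query) walks from each worker back through its task to the request it is servicing and the T-SQL text being executed:
select w.worker_address, w.state, w.last_wait_type, s.text
from sys.dm_os_workers as w
inner join sys.dm_os_tasks as t on w.task_address = t.task_address
inner join sys.dm_exec_requests as r on t.task_address = r.task_address
cross apply sys.dm_exec_sql_text(r.sql_handle) as s
where r.sql_handle is not null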
Threads
To complete the picture, SQLOS also
contains objects for the operating system threads it is using. OS
threads can be seen in the DMV sys.dm_os_threads:
Select *
From sys.dm_os_threads
Interesting columns in this DMV include the following:
- Scheduler_address — Address of the scheduler with which the thread is associated
- Worker_address — Address of the worker currently associated with the thread
- Kernel_time — Amount of kernel time that the thread has used since it was started
- Usermode_time — Amount of user time that the thread has used since it was started
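As a quick illustration of how these columns tie a thread back into the rest of the hierarchy, the following sketch joins each thread to the worker it is currently running and reports the CPU time the thread has consumed:
select th.os_thread_id, th.kernel_time, th.usermode_time,
       w.state, w.last_wait_type
from sys.dm_os_threads as th
inner join sys.dm_os_workers as w
on th.worker_address = w.worker_address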
Scheduling
Now that you have seen all the objects
that SQLOS uses to manage scheduling, and understand how to examine
what’s going on within these structures, it’s time to look at how
SQLOS actually schedules work.
One of the main things to understand about
scheduling within SQL Server is that it uses a nonpreemptive scheduling
model, unless the task being run is not SQL Server code. In that case,
SQL Server marks the task to indicate that it needs to be scheduled
preemptively. An example of code that would be scheduled preemptively
is any code not written by SQL Server itself that runs inside the SQL
Server process, which applies to any CLR code.
PREEMPTIVE VS. NONPREEMPTIVE SCHEDULING
With preemptive scheduling, the
scheduling code manages how long the code can run before interrupting
it, giving some other task a chance to run.
The advantage of preemptive scheduling
is that the developer doesn’t need to think about yielding; the
scheduler takes care of it. The disadvantage is that the code can be
interrupted at any arbitrary point, which may result in the task
running more slowly than it otherwise would. In addition,
providing an environment that offers preemptive scheduling features
requires a lot of work.
With nonpreemptive scheduling, the
code that’s being run is written to yield control at key points. At
these yield points, the scheduler can determine whether a different
task should be run.
The advantage of nonpreemptive
scheduling is that the code running can best determine when it should
be interrupted. The disadvantage is that if the developer doesn’t yield
at the appropriate points, then the task may run for an excessive
amount of time, retaining control of a CPU even while it is waiting on
a resource. In this case, the task blocks other tasks from running and
wastes CPU resources.
SQL Server begins to schedule a task when a new
request is received, after the Query Optimizer has completed its work
to find the best plan. A task object is created for this user request,
and the scheduling starts from there.
The newly created task object has to be
associated with a free worker in order to actually do anything. When
the worker is associated with the new task, the worker’s status is set
to init. When the initial setup has been done, the status changes to runnable.
At this point, the worker is ready to go but there is no free scheduler
to allow this worker to run. The worker state remains as runnable until
a scheduler is available. When the scheduler is available, the worker
is associated with that scheduler, and the status changes to running.
It remains running until either the task completes or the worker
releases control while it waits for something. When it releases
control of the scheduler, its state moves to suspended (the reason it
released control is logged as a wait_type). When the item it was
waiting on is available again, the status of the worker is
changed to runnable. Now it’s back to waiting for a free scheduler
again, and the cycle repeats until the task is complete.
At that point, the task is released, the worker
is released, and the scheduler is available to be associated with the
next worker that needs to run. The state diagram for scheduling workers
is shown in Figure 6.
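You can watch these states on a live system: the following sketch simply counts the workers in each state, and on a busy server you would typically see a mix of RUNNING, RUNNABLE, and SUSPENDED workers at any given moment:
select state, count(*) as workers
from sys.dm_os_workers
group by state
order by workers desc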