The job of the Query Optimizer
is to take the query tree that was output from the algebrizer and find
a “good” way to retrieve the data (results) needed. Note the use of
“good” here, rather than “best,” as for any nontrivial query, there may
be hundreds, or even thousands, of different ways to achieve the same
results, so finding the absolutely best one can be an extremely
time-consuming process. Therefore, in order to provide results in a
timely manner, the Query Optimizer looks for a “good enough” plan, and
uses that. This approach means that you may very well be able to do
better when you manually inspect the query plan; and in the section
“Influencing Optimization” you will look at different ways you can
affect the decisions that SQL Server makes during optimization.
The query optimization process is based on a principle of cost,
which is an abstract measure of work that is used to evaluate different
query plan options. The exact nature of these costs is a closely
guarded secret, with some people suggesting that they are a reflection
of the time, in seconds, that the query is expected to take. They also
take into account I/O and CPU resources. However, users should consider
cost to be a dimensionless value that doesn’t have any units — its
value is derived from comparisons to the cost of other plans in order
to find the cheapest one. Therefore, there are no true units for cost
values.
Although the exact details of what SQL Server
does within the optimization phase are secret, it’s possible to get a
glimpse at some of what goes on. For one thing, there is nothing you
can do to alter this process; moreover, with each new service pack or
hotfix, the SQL Server team tunes the internal algorithms, thereby
changing the exact behavior. If you were to know too much about what
was occurring, you could build in dependencies that would break with
every new version of SQL Server.
Rather than know all the details, you need only
understand the bigger picture. Even this bigger picture is often too
much information, as it doesn’t offer any real visibility into what the
Query Optimizer is doing. All you can see of this secretive process is
what is exposed in the Dynamic Management View (DMV) sys.dm_exec_query_optimizer_info.
This can be interesting, but it’s not a great deal of help in
understanding why a given T-SQL statement is assigned a particular
plan, or how you can “fix” what you think may be a non-optimal plan.
The current model provided by the SQL Server team works something like this:
- Is a valid plan cached? If yes, then use the cached plan. If no plan exists, then continue.
- Is this a trivial plan? If yes, then use the trivial plan. If no, then continue.
- Apply simplification. Simplification is a process of normalizing
the query tree and applying some basic transformations to additionally
“simplify” the tree.
- Is the plan cheap enough? If yes, then use this. If no, then start optimization.
- Start cost-based optimization.
- Phase 0 — Explore basic rules, and hash and nested join options.
- Does the plan have a cost of less than 0.2? If yes, then use this. If no, then continue.
- Phase 1 — Explore more rules, and
alternate join ordering. If the best (cheapest) plan costs less than
1.0, then use this plan. If not, then if MAXDOP
> 0 and this is an SMP system, and the min cost > cost threshold
for parallelism, then use a parallel plan. Compare the cost of the
parallel plan with the best serial plan, and pass the cheaper of the
two to phase 2.
- Phase 2 — Explore all options, and opt for the cheapest plan after a limited number of explorations.
You can view the inner workings of the optimization process via the DMV sys.dm_exec_query_optimizer_info.
This DMV contains a set of optimization attributes, each with an
occurrence and a value.
select *
from sys.dm_exec_query_optimizer_info
where counter in (
'optimizations'
, 'trivial plan'
, 'search 0'
, 'search 1'
, 'search 2'
)
order by [counter]
The preceding will return the same
number of rows as follows, but the counters and values will be
different. Note that the value for optimizations matches the sum of the
trivial plan, search 0, search 1, and search 2 counters (2328 + 8559 +
3 + 17484 = 28374):
Counter occurrencevalue
Optimizations 28374 1
search 0 2328 1
search 1 8559 1
search 2 3 1
trivial plan 17484 1
1. Parallel Plans
A parallel plan is any plan for which
the Optimizer has chosen to split an applicable operator into multiple
threads that are run in parallel.
Not all operators are suitable to be used in a parallel plan. The Optimizer will only choose a parallel plan if:
- the server has multiple processors,
- the maximum degree of parallelism setting allows parallel plans, and
- the cost threshold for parallelism
sql server configuration option is set to a value lower than the lowest
cost estimate for the current plan. Note that the value set here is the
time in seconds estimated to run the serial plan on a specific hardware
configuration chosen by the Query Optimizer team.
- The cost of the parallel plan is cheaper than the serial plan.
If all these criteria are met, then the Optimizer will choose to parallelize the operation.
An example that illustrates how this works is
trying to count all the values in a table that match particular search
criteria. If the set of rows in the table is large enough, the cost of
the query is high enough, and the other criteria are met, then the
Optimizer might parallelize the operation by dividing the total set of
rows in the table into equal chunks, one for each processor core. The
operation is then executed in parallel, with each processor core
executing one thread, and dealing with one/number of cores of the total
set of rows. This enables the operation to complete in a lot less time
than using a single thread to scan the whole table. One thing to be
aware of when dealing with parallel plans is that SQL Server doesn’t
always do a great job of distributing the data between threads, and so
your parallel plan may well end up with one or two of the parallel
threads taking considerably longer to complete.
2. Algebrizer Trees
As mentioned earlier, the output of the
parser is a parse tree. This isn’t stored anywhere permanently, so you
can’t see what this looks like. The output from the algebrizer is an
algebrizer tree, which isn’t stored for any T-SQL queries either, but
some algebrizer output is stored —
namely, views, defaults, and constraints. This is stored because these
objects are frequently reused in other queries, so caching this
information can be a big performance optimization. The algebrizer trees
for these objects are stored in the cache store, where type = CACHESTORE_PHDR:
select *
from sys.dm_os_memory_cache_entries
where type = 'CACHESTORE_PHDR'
It’s only at the next stage (i.e., when
you have the output from optimization) that things start to get really
interesting, and here you can see quite a bit of information. This very
useful data provides details about each optimized plan.
sql_handle or plan_handle
In the various execution-related DMVs, some contain a sql_handle, while others contain the plan_handle. Both are hashed values: sql_handle is the hash of the original T-SQL source, whereas plan_handle is the hash of the cached plan. Because the SQL queries are auto-parameterized, the relationship between these means that many sql_handles can map to a single plan_handle.
You can see the original T-SQL for either using the dynamic management function (DMF) sys.dm_exec_sql_text (sql_handle | Plan_handle).
To see the XML showplan for the plan, use the DMF sys.dm_exec_query_plan (plan_handle).