Next-Level Parallel Plan Forcing: An Alternative to 8649

"Even experts with decades of SQL Server experience and detailed internal knowledge will want to be careful with this trace flag.  I cannot recommend you use it directly in production unless advised by Microsoft, but you might like to use it on a test system as an extreme last resort, perhaps to generate a plan guide or USE PLAN hint for use in production (after careful review)."

So wrote Paul White in his often referenced article, "Forcing a Parallel Query Execution Plan." His article focuses on the various reasons that you might not get a parallel query plan, chief among them the optimizer simply not doing a great job with its own cost model. (My session from the 2012 PASS conference, available on YouTube, also discusses the issue in some detail, and from a different perspective. You might want to both watch it and read Paul's article, prior to reading any further here.)

The trace flag Paul mentions, 8649, is incredibly useful. It allows us to tell the query optimizer to disregard its cost-based comparison of potential serial and parallel versions of our plans, thereby skipping right to the good stuff: a parallel plan (when it's possible to generate one). Alas, the flag is undocumented and little known, and its full impact is difficult to understand. And for those cowboys who are willing to take the plunge and use it in a production environment, there are still other issues lying in wait:

  • This flag is not something you want to enable on a system-wide basis. You can use DBCC TRACEON on a case-by-case basis, but as Paul shows in his post it's much nicer to use the QUERYTRACEON hint. Unfortunately, this hint was, just like the flag, entirely undocumented until only recently. It is now documented only for a very small number of flags; 8649 isn't one of them.
  • Just like DBCC TRACEON, QUERYTRACEON requires significant system-level privileges: membership in the sysadmin fixed server role, to be exact. This means that you're forced to either give that permission to all of your users, or do some module signing. Clearly the latter approach is far superior, but it's still a pain.
  • QUERYTRACEON, being a query-level hint, can't be encapsulated in a view or inline table-valued function. So if we want to force these to go parallel, we're pretty much out of luck.
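For reference, here is a minimal sketch of what the query-level form looks like. TableA and TableB are placeholder tables (the same ones used in the example further down), and because the flag is undocumented, this is something to try on a test system only:

```sql
-- Test systems only: 8649 is an undocumented trace flag.
-- TableA and TableB are placeholder names, not real objects.
SELECT
    a.Col1,
    b.Col2
FROM TableA AS a
INNER JOIN TableB AS b ON
    a.PK = b.PK
OPTION (QUERYTRACEON 8649);
```

The OPTION clause belongs to the outermost statement, which is why there is no place to put it inside a view or inline table-valued function definition.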

Clearly, while invaluable for testing, 8649 just isn't the parallel panacea we need.

Before we get into the details of the solution I'll be presenting here, it’s probably a good idea to do a quick review of the situation.

Here’s the problem, in brief:

[Screenshot: expensive_plan_no_parallel]

Big plan, no parallelism. Ouch. Or maybe I should say that more slowly. OOOOOOOOuuuuuuuucccccccchhhhhhhh. Because you’re going to be waiting for quite a while for this plan to finish.

Why did this happen? We can find out by forcing a parallel version of the plan using 8649 and evaluating the metadata.

[Screenshot: parallel_cost]

When the query optimizer decides whether or not to select a parallel plan, it first comes up with a serial plan cost, then compares it to the parallel plan cost. If the parallel plan cost is lower, that's the one that gets used. This parallel plan cost is based on the serial cost, but with a few key modifications:

  • CPU costs for each iterator are divided by the degree of parallelism...except when they sit beneath a parallel nested loop. (Except when the input to the parallel nested loop is guaranteed to be one row. That’s a very special parallelism edge case.) Notice the nested loop iterator above? Strike one. (Note: I/O costs aren’t divided by anything.)
  • Sometimes, the parallel plan will have slightly different iterators than the serial version, due to the optimizer doing its job. Notice, above, that the iterator feeding the nested loop has transitioned from a Merge Join to a Hash Match? That’s going to change the cost. In this case, it’s more expensive. Strike two.
  • Parallel iterators are added into the plan as necessary. See that Gather Streams? It’s not free. Strike three.
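To make the first bullet concrete, here is a deliberately simplified sketch of the arithmetic. The iterator names and cost figures are invented for illustration, the degree of parallelism is assumed to be 8, and the nested loop exceptions and added exchange iterators described above are not modeled:

```sql
-- Simplified illustration only: invented costs, assumed DOP of 8.
DECLARE @dop INT = 8;

SELECT
    s.iterator,
    s.cpu_cost + s.io_cost AS serial_cost,
    -- CPU is divided by the degree of parallelism; I/O is left alone
    s.cpu_cost / @dop + s.io_cost AS parallel_cost
FROM
(
    VALUES
        ('CPU-heavy iterator', 400.0, 10.0),
        ('I/O-heavy iterator', 50.0, 300.0)
) AS s(iterator, cpu_cost, io_cost);
```

Under these assumptions the CPU-heavy row shrinks dramatically when costed in parallel, while the I/O-heavy row barely moves.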

Adding all of these modifications together, our parallel plan has a higher cost than our serial plan. Cost is the query optimizer’s way of weighing the relative merits of one plan against another, and just like you might choose the less expensive option at a store, so has the query optimizer. A plan with a cost 10 less than some other plan must perform better, right? Well…no. Not in the real world, and especially not when we're dealing with plans that have estimated costs in the millions.

Unfortunately, as I mention in the video linked above, the costing model is rather broken. And the workaround I suggest in the video--using a TOP with a variable in conjunction with OPTIMIZE FOR--can work, but it has some problems. The biggest issue, as far as I’m concerned? It requires use of a local variable. Which, just like 8649, means that it can’t be used in a view or inline TVF.

So what’s a SQL developer to do?

Recently it hit me. If I could only create a query fragment that had certain properties, I could apply it as needed, just like 8649. Here’s what I set out to create:

  • High CPU cost, low I/O cost. This is key. The query fragment had to be able to benefit from the query optimizer’s math.
  • No variables. See above.
  • A single, self-contained unit. No tables or other outside objects. As much as possible, avoidance of future maintenance issues seemed like a good approach.
  • No impact on estimates in the core parts of the plan. This query fragment had to impact plan selection purely on the basis of cost, and without causing the optimizer to make strange choices about join types, execution order, and so on.

After quite a bit of trial and error I arrived at my solution, which I encapsulated into the following table-valued function:

CREATE FUNCTION dbo.make_parallel()
RETURNS TABLE AS
RETURN
(
    WITH
    a(x) AS
    (
        SELECT
            a0.*
        FROM
        (
            VALUES
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1),
                (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)
        ) AS a0(x)
    ),
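    b(x) AS
    (
        -- TOP with the maximum BIGINT value never limits the rows returned; it adds a Top iterator to the plan
        SELECT TOP(9223372036854775807)
            1
        FROM
            -- Four copies of the 1024-row set: an estimated 1024^4 row cross product
            a AS a1,
            a AS a2,
            a AS a3,
            a AS a4
        WHERE
            -- Never true (every x is 1), but the optimizer can't know that at compile time
            a1.x % 2 = 0
    )
    -- Scalar aggregate: at most one row out, and its value is always NULL
    SELECT
        SUM(b1.x) AS x
    FROM
        b AS b1
    HAVING
        SUM(b1.x) IS NULL
)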
GO

What this does: The function starts with a set of 1024 rows, defined in the row value constructor in CTE [a]. Four copies of this set are cross-joined in CTE [b]. The resultant Cartesian set contains 1,099,511,627,776 rows (1024 to the fourth power), all of which are forced to pass through a Top iterator as well as a Stream Aggregate. This is, naturally, a hugely expensive operation that generates a very high estimated cost.

Except that in reality there are only 1024 rows. Notice the predicate in CTE [b]: (a1.x % 2 = 0). If you've studied a bit of math you know that 1 divided by 2 has a remainder of 1, not 0. But luckily for us, the query optimizer has no way of evaluating that at compile time. It instead asks its sibling, the query processor, to do that work. The plan involves scanning each of the four cross joined sets for a match, but of course no match is ever found. And since cross joining an empty set to any other set results in an empty set, the query processor has no need to scan any but the first set of rows it encounters. So at run time 1024 rows are touched, and that's that.
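Both numbers are easy to check for yourself. This standalone snippet needs no tables; the column aliases are just illustrative names:

```sql
SELECT
    -- Size of the four-way cross product of a 1024-row set
    POWER(CAST(1024 AS BIGINT), 4) AS estimated_cross_product_rows, -- 1099511627776
    -- The remainder that the predicate in CTE [b] compares to 0
    1 % 2 AS remainder_of_one_over_two;                             -- 1
```

The CAST to BIGINT matters: POWER returns the type of its first argument, and the result does not fit in an INT.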

Here's the catch (and it's a good one, for us): Since the query optimizer doesn't know that our predicate will never return true, it also doesn't know how many rows will actually be touched. And it's forced to assume the worst, that the full cross product will be processed. Therefore, this function delivers a rather large cost estimate of just under 1,000,000. (Serial cost; parallel will be less, due to the aforementioned math.) This number, as a cost goal, is somewhat arbitrary. I originally came up with a function that delivered a cost in the billions, but it added a lot of complexity to the query plans I was working with. So I scaled it back a bit. 1,000,000 should be fine in just about every case. The example above is fairly typical; I usually see these cost problems crop up with very small relative differentials between the serial and parallel plan cost. The really important thing is that 100% of the estimated cost from this function is CPU time. That means that we can take full advantage of the way the optimizer works.

Of course, almost none of the cost is real. This UDF will add a couple dozen milliseconds to your query's run time. It will also add around 100ms to the compile time. In my opinion, that doesn’t matter. If you’re playing in the big parallel workload arena you’re not doing thousands of batch requests a second. You’re trying to get your two hour query down to a reasonable amount of time. No one is going to care about a couple dozen milliseconds.

This function is also engineered so that the output number of rows is guaranteed to be no greater than one. See that SUM, with no GROUP BY? The query optimizer knows that that can’t return more than one row. And that’s a good thing. It gives us that parallelism edge case I mentioned above. (Nested loop with an input guaranteed to be exactly one row.) Another thing? No rows will ever actually be aggregated in that SUM. Its result will always be NULL. But the query optimizer has no way of knowing that, and it comes up with a plan where the entire backing tree needs to be evaluated. That’s why the HAVING clause is there.
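The same pattern can be seen in miniature, without the function. This is a scaled-down stand-in with three rows instead of the full cross product, not the function itself:

```sql
-- The WHERE clause removes every row, so SUM aggregates nothing and yields NULL.
-- A scalar aggregate (no GROUP BY) still produces a row for empty input,
-- and the HAVING predicate keeps that row because the SUM is NULL.
SELECT
    SUM(v.x) AS x
FROM
(
    VALUES (1), (1), (1)
) AS v(x)
WHERE
    v.x % 2 = 0
HAVING
    SUM(v.x) IS NULL;
```

This returns a single row whose x is NULL, which is the one row that the uncorrelated CROSS APPLY in the next example needs on its outer side.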

Using this function is quite simple. You take your query, wrap it in a derived table expression, and CROSS APPLY into it. No correlation required. Let’s pretend that the query we want to force parallel looks something like this:

SELECT
    a.Col1,
    b.Col2
FROM TableA AS a
INNER JOIN TableB AS b ON
    a.PK = b.PK

Using the make_parallel function to force this to go parallel is as simple as:

SELECT
    x.*
FROM dbo.make_parallel() AS mp
CROSS APPLY
(
    SELECT
        a.Col1,
        b.Col2
    FROM TableA AS a
    INNER JOIN TableB AS b ON
        a.PK = b.PK
) AS x

The reason we CROSS APPLY from the function into the query is to keep the high cost on the outer side of any parallel nested loops. This way, the query optimizer’s parallelism math will work its magic the right way, yielding the parallel plan we’re after. An uncorrelated CROSS APPLY like this one can only be optimized as a nested loop itself, and that loop only makes sense if there is at least one row feeding it. Therefore, this query is logically forced to process the TVF first, followed by the inside of table expression [x].
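Because there is no hint and no variable involved, the same pattern can be encapsulated in a view, which was one of the goals. A minimal sketch, assuming the placeholder tables from above and a hypothetical view name:

```sql
-- dbo.TableA_TableB_parallel is a hypothetical name; TableA and TableB are placeholders.
CREATE VIEW dbo.TableA_TableB_parallel
AS
SELECT
    x.Col1,
    x.Col2
FROM dbo.make_parallel() AS mp
CROSS APPLY
(
    SELECT
        a.Col1,
        b.Col2
    FROM TableA AS a
    INNER JOIN TableB AS b ON
        a.PK = b.PK
) AS x;
GO

-- Callers query the view like any other, with no hints or special privileges.
SELECT
    v.Col1,
    v.Col2
FROM dbo.TableA_TableB_parallel AS v;
```

Whether a given query against the view actually goes parallel still depends on the rest of that query, so it is worth checking the plan rather than assuming.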

Note that just as with trace flag 8649, there are a number of parallel inhibitors that are still going to keep this from working. And note that unlike when using the trace flag, the base cost of just under 1,000,000 means that even if you have predicates in certain cases that make a parallel plan less than ideal, you’re still going to get a parallel plan. Using this function is effectively applying a big huge hammer to a problematic nail. Use it with caution, make sure it’s appropriate, and don't bash your thumb.

So what does a query plan look like, once this has been applied? Here’s the same query from the screen shots above, with the function in play:

[Screenshot: now_parallel]

Hopefully (without squinting too much) you can see the key attributes: There is a new subtree on top of the original query. That’s thanks to the TVF. This subtree is evaluated, and feeds the top Distribute Streams iterator, which then feeds the Nested Loops iterator. Since the input to that Distribute Streams is guaranteed to be one row, it uses Broadcast partitioning. But there is no correlation here, so there is really nothing to broadcast; the net effect of this action is to prepare a set of threads to do the work of processing your query in parallel.

Under the Nested Loops? The same exact plan shape that was produced when I used the trace flag. Same estimates, same iterators, and so on and so forth. The UDF necessarily impacts the plan as a whole, but it works as intended and does not impact any of the parts of the plan we actually care about -- the parts we need to make faster in order to get data to our end users.

I’m really excited about this technique. It means that I can play fewer games, worry less about privileges and the potentially negative impact of using undocumented hints, and concentrate more on the data I need the query to return than the physical manner in which it’s being processed.

Bear in mind that there are myriad other problems with SQL Server’s parallel processing capabilities. This technique merely solves one of them, and in all honesty it’s neither the cleanest nor best solution that I can imagine. It’s a start. My hope is that these issues will eventually be addressed by the query optimizer and query processor teams. In the meantime, we as end-users have no choice but to continue to push the product in an effort to get the best possible performance, today, for the queries that we need to run today. Tomorrow, perhaps, will hold a better story.

Disclaimer: This technique works for me. It may not work for you. It may cause some problem that I haven't yet considered. It may cause your server to literally explode, sending fragments of plastic and metal all over your data center. The function and technique are provided as-is, with no guarantees or warranties, and I take no responsibility for what you do with them. That said, if you DO try this out I'd love to hear about your experience, in the comments section below!

Enjoy!

Special thanks to Gokhan Varol for helping me test this technique.

