
Joe Chang

Queries barely over the Cost Threshold for Parallelism

I discussed SQL Server parallelism in Oct 2010, with my thoughts on the best settings for Cost Threshold for Parallelism (CTP) and Max Degree of Parallelism (MAXDOP), in Parallelism Strategy and Comments. At the time, I had intended to follow up with detailed measurements. So now, a mere 2 years later, here it is. The general thought was that CTP should be raised from the default value of 5, and MAXDOP should be changed from unrestricted, on modern systems with very many cores, and most especially on systems with Hyper-Threading. However reasonable each person's ideas and suggestions are, nothing is better than hard data.
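For reference, the current instance-level values of these two settings can be checked with a simple query. This is just a convenience sketch, not part of the original tests:

```sql
-- Current values of the two settings discussed in this post.
SELECT name, value_in_use
FROM   sys.configurations
WHERE  name IN (N'cost threshold for parallelism', N'max degree of parallelism');
```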

The interest is in the smaller queries that can have a parallel execution plan. With the default Cost Threshold for Parallelism setting, we are trying to find the queries with the lowest plan cost and the lowest actual cost (which point to different queries) that still get a parallel execution plan. (I will add details on the index seek later.)

The test system is now a 2-socket Intel Xeon X5670 six-core 2.93GHz (32nm Westmere-EP) with Hyper-Threading enabled (a total of 24 logical processors). Some references are made to test results on an earlier system with 2 x Xeon E5430 quad-core 2.66GHz (45nm Core 2) without Hyper-Threading.

The test data is built using the TPC-H data generator, initially employing derivatives of the Part table. At scale factor 10, the Part table is 2M rows, about 285MB, 55 rows per page or 149 bytes per row. The derived tables have 2 additional 4-byte integer columns, size 301MB, 52 rows per page or 157 bytes per row.
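A hedged sketch of how such a derivative of the TPC-H Part table might be built is below. The table name (PartM), the two added 4-byte integer columns (GroupKey, JoinKey), and the index key are my assumptions, not the author's exact schema; a second copy (PartM2) would be built the same way for the join tests.

```sql
-- Build a derivative of the TPC-H Part table with 2 extra 4-byte integer columns.
SELECT P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE,
       P_CONTAINER, P_RETAILPRICE, P_COMMENT,
       CONVERT(int, P_PARTKEY % 1000) AS GroupKey,   -- extra 4-byte integer column
       CONVERT(int, P_PARTKEY)        AS JoinKey     -- extra 4-byte integer column
INTO   dbo.PartM
FROM   dbo.PART;

CREATE UNIQUE CLUSTERED INDEX CI_PartM ON dbo.PartM (P_PARTKEY);
```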

Parallel Hash Join

The smallest hash join on the modified Part table that results in a parallel execution plan occurs at around 100,000 rows. Let's first examine the non-parallel plan.
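For concreteness, the ~100,000 row hash join test query might take a form like the sketch below; the table and column names, the join key, and the range predicates are all my assumptions rather than the author's actual test script.

```sql
-- Hypothetical form of the ~100,000-row hash join test; vary MAXDOP per test run.
SELECT SUM(a.P_RETAILPRICE), SUM(b.P_RETAILPRICE)
FROM   dbo.PartM  a
JOIN   dbo.PartM2 b ON b.JoinKey = a.JoinKey
WHERE  a.P_PARTKEY BETWEEN 1 AND 100000
  AND  b.P_PARTKEY BETWEEN 1 AND 100000   -- both sides are index seeks in the plan
OPTION (MAXDOP 1);                        -- change to 2, 4, ... for each DOP tested
```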

[Execution plan: hash join, DOP 1]

The three major components of this plan are the outer and inner source index seeks, and the hash match. The Stream Aggregate cost is small relative to these major components.

[Operator details: outer and inner source index seeks, DOP 1]

The IO cost of 1.42 corresponds roughly to a range scan/seek of 15MB (an IO cost of 1 corresponds to 10.5MB). The actual logical IO is 1931, very close to the value of (1.42446 - 0.003125) * 1350, plus a few upper level index LIOs.
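As a quick sanity check of that arithmetic, using the constants quoted above (0.003125 for the first page, 1350 additional pages per unit of IO cost):

```sql
-- Leaf pages implied by the IO cost, versus the 1931 actual logical IO.
SELECT (1.42446 - 0.003125) * 1350 AS estimated_leaf_pages;   -- ~1919
```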

Below are the Hash Match and Stream Aggregate details. Note that the Hash Match has the largest cost of all operations in this execution plan, and the entire cost is in the CPU element. The two index seek operations have their cost mostly in the IO element, with only 7.1% in the CPU.

[Operator details: Hash Match and Stream Aggregate, DOP 1]

Below is the parallel Hash Join execution plan at DOP 2.

[Execution plan: hash join, DOP 2]

The Outer and Inner Source index seeks have identical costs, so only the first is shown. As covered in Cost Based Optimizer Parallelism I, the CPU cost is reduced by the degree of parallelism, but the IO cost does not change, which I call a saturated IO model.

[Operator details: outer and inner source index seeks, DOP 2]

The Hash Match and Stream Aggregate details at DOP 2 are shown below. Both operations have their cost reduced by the degree of parallelism, as both have their costs entirely in the CPU element.

[Operator details: Hash Match and Stream Aggregate, DOP 2]

The Parallelism operation, as its name implies, occurs only in parallel execution plans, here adding a cost of 0.0285. The cost structure of this operation is proportional to the number of rows, which implies that high row count queries are less likely to have a parallel execution plan, as the cost of reconstituting the threads may outweigh the formula benefit of the parallel operations.

Parallel Loop Join

The second test query is a loop join. A parallel execution plan occurs at around 8000 rows for high degrees of parallelism. At DOP 2, this occurs around 10000 rows, which will be used for test purposes. The non-parallel loop join plan is shown below.

[Execution plan: loop join, DOP 1]

The outer and inner source operation details are shown below. Note the inner source subtree cost of 29.0264, based on a number of executions of 10000. The CPU component should be 10000 * 0.0001581 = 1.581. The IO component is then 29.0264 - 1.581 = 27.4454, which is approximately equal to 8782 * 0.003125. This is an estimate of the number of physical IOs for the 10000 executions. The assumption is that some of the pages would have been previously loaded into memory during the execution of this query, but not yet evicted.
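The same decomposition as arithmetic, using the constants quoted above:

```sql
-- Inner-source subtree cost split into CPU and IO components.
SELECT 10000 * 0.0001581                        AS cpu_component,          -- 1.581
       29.0264 - 10000 * 0.0001581              AS io_component,           -- 27.4454
       (29.0264 - 10000 * 0.0001581) / 0.003125 AS estimated_physical_io;  -- ~8782
```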

[Operator details: outer and inner source index seeks, loop join DOP 1]

There are 38465 leaf level pages in both tables. The alternate plan for this query is a hash join with a scan on the inner source table. In that plan, the table scan would have an IO component of 28.5 and a CPU component of 2.2, and the hash match would contribute another 9. So it would take about 35% more rows before the loop join plan would naturally shift to a hash join at DOP 1. At higher DOP, the hash join cost is reduced in proportion to DOP.
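Rough arithmetic for that crossover, using the component costs quoted above; the loop join cost scales roughly linearly with the row count (the inner subtree dominates), while the hash plan cost stays nearly flat:

```sql
-- Hash plan cost versus loop join cost at ~35% more rows.
SELECT 28.5 + 2.2 + 9  AS hash_plan_cost_estimate,             -- ~39.7
       29.0264 * 1.35  AS loop_join_cost_at_35pct_more_rows;   -- ~39.2
```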

The Nested Loops and Stream Aggregate details are shown below. The Nested Loops is only 0.14% of the overall plan cost, and the Stream Aggregate even less.

[Operator details: Nested Loops and Stream Aggregate, DOP 1]

Below is the parallel loop join query plan at DOP 2.

[Execution plan: loop join, DOP 2]

The parallel outer and inner details are below. Note the very minor reduction in outer source CPU from 0.155756 to 0.150178. There is no change in the inner source operation, not even the element attributed to CPU.

[Operator details: outer and inner source index seeks, loop join DOP 2]

The Nested Loops and Parallelism details are below. In the parallel plan, there is a CPU reduction of 0.005 from the outer source, 0.0209 from the Nested Loops, and 0.003 from the Stream Aggregate, to just ever so slightly overcome the Parallelism cost of 0.0285.

[Operator details: Nested Loops and Parallelism, DOP 2]

The total plan cost for the non-parallel loop join is 29.23 and the parallel plan cost is 29.229. If this model were even remotely accurate, we should ask why bother to switch to a parallel plan for such a minute gain. It turns out that the SQL Server parallel execution plan cost estimate is nothing close to the true cost structure, which raises a completely different set of questions, starting with: how can SQL Server be expected to generate good parallel (or not) execution plans if the cost model is completely different from the true cost structure?

Parallel Query Performance and Actual Cost - Hash Joins

Below is the hash join performance in rows per sec (left vertical scale), and (worker or CPU) cost in micro-sec per row (right vertical scale), both versus degree of parallelism.

[Figure: Hash Join 100K rows versus Degree of Parallelism]

At DOP 1, the performance is just over 1M rows/sec, scales well to DOP 4, levels off at just over 4M rows/sec for DOP 6-12, then jumps again at DOP 14. The peak gain from parallelism is a 6.646x speedup at DOP 20 over DOP 1. The cost is just under 1us per row at DOP 1, rising as high as 2.6us per row at high DOP, perhaps more so when HT is involved. At DOP 1, the query execution time for 100K rows is 95ms.

It appears that at DOP 2 and 4, the individual threads are running on logical processors in different physical cores, hence the excellent scaling. At DOP 6-12, some of the threads are on logical processors that share a physical core. Hyper-Threading does improve performance to a moderate degree in SQL Server, depending on the operation.

(I worked on a custom b-tree search engine with no locking protection. The scaling was essentially linear regardless of physical or logical cores from 1 to 24 threads. This may seem fantastic, but it does make sense because a b-tree search is just serialized memory accesses: fetch a memory location, which determines the next memory location to fetch. The first memory access must complete before the next can be issued. The clock cycle at 3.3GHz is 0.3ns. Memory access latency is 50ns for the local node and 100ns for a 1-hop remote node, corresponding to 150 and 300 CPU-cycles respectively.)

If this assessment is correct, then it would suggest the proper strategy for the SQL Server engine in parallel execution plans is to allocate one worker thread from a logical processor on each physical core, perhaps allocating from the cores on the same socket before proceeding to the next socket. Only if the desired degree of parallelism is greater than the number of physical cores should two logical processors be allocated from any single physical core, and this should only occur in unusual circumstances.
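As a side note, the scheduler-to-NUMA-node layout can at least be inspected from the DMVs; SQL Server does not expose which two logical processors share a physical core, so this only shows the node-level picture. A minimal sketch:

```sql
-- Count of visible online schedulers (logical processors) per NUMA node.
SELECT parent_node_id, COUNT(*) AS visible_online_schedulers
FROM   sys.dm_os_schedulers
WHERE  status = N'VISIBLE ONLINE'
GROUP  BY parent_node_id
ORDER  BY parent_node_id;
```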

The next level of sophistication would be to match thread assignment with memory locality, so that the majority of memory accesses are to the local node. This might require partitioned tables (with hash key partitioning?). This effort would be hugely complicated, and only applicable to special circumstances, but a true fanatic would not be deterred.

Below is the performance and cost structure for a hash join on the full table of 300MB and 2M rows. The cost per row is higher, 1.495us per row at DOP 1, rising to 3us at high DOP. However, scaling is better, with a peak speedup over DOP 1 of 11.35, and the peak rows per sec is also higher than in the previous query.

[Figure: Hash Join 2M rows versus Degree of Parallelism]

I am not sure why the cost per row in this query is higher for the full table than over the limited range of 100K rows. There was no tempdb activity in either case. One speculation is that in the 100K row query the hash join intermediate results remain in L3 cache (15M), resulting in lower access times than off-die memory accesses.

A test with small and large merge joins also exhibits the same behavior. Is it possible this is due to local and remote node memory locations? Life is not simple on NUMA systems with HT.

Just to be sure, the same full table hash join query was tested on a 3GB, 20M row table, as shown below. The DOP 1 cost and performance were about the same, slightly lower actually at 1.376us per row. The scaling was better, with a peak speedup of 14.95 and peak performance at nearly 11M rows/sec.

[Figure: Hash Join 20M rows versus Degree of Parallelism]

Both full table hash joins (300MB and 3GB) demonstrate that parallel execution plan scaling is better for larger queries.

Parallel Query Performance and Actual Cost - Loop Joins

Below is the Loop Join performance in rows per sec and cost in usec per row for a query of 10000 rows on the 300MB tables. Notice that performance peaks at DOP 6 and degrades at higher DOP. The peak speedup over DOP 1 was 4.4.

[Figure: Loop Join 10K rows versus Degree of Parallelism]

The cost at DOP 1 is 1.827usec per row. For 10000 rows, the execution time is a mere 18ms (corrected from 1.8ms in the original post). That this query benefits from parallel execution at all is amazing. Of course, it also raises the serious question of why SQL Server should go to the effort of a parallel execution plan for a query that runs in 18ms.

This is due to the antiquated IO-based cost model and the default CTP setting of 5. Based on the 100K row hash join, with a plan cost just over 5 and an actual query time of 95ms, we might consider setting Cost Threshold for Parallelism to around 25, putting the threshold at around 0.5 sec. However, the 10K row loop join, with a plan cost of 29, runs in only 18ms, so a CTP of 25 would still allow a parallel plan for queries in that range.
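The rough scaling behind that CTP suggestion, using the measurements above: a plan cost just over 5 maps to ~95ms of actual hash join run time, so a threshold of 25 lands near half a second for the hash join, while the loop join with plan cost 29 runs in only 18ms.

```sql
-- Scale the hash join's 95ms actual time from a plan cost of 5 to a CTP of 25.
SELECT 25.0 / 5.0 * 95 AS hash_join_threshold_ms;   -- ~475ms
```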

The disparity between the loop and hash join model costs and actual costs does not allow a good strategy for the Cost Threshold for Parallelism setting without a complete overhaul of the SQL Server cost model. And this is not something the SQL Server team seems to be willing to tackle. Another possibility is for the Cost Threshold for Parallelism setting to only consider the CPU element of the plan cost. Too bad we do not have access to the source code to investigate this.

The outer and inner source tables were populated and indexed in a manner such that for this loop join, each row from the outer source joins to a row in a different page of the inner source table. When the rows join to consecutive rows in the inner source table, the cost per row is even lower at 1.33 usec per row.

Below is the Loop Join performance and cost for 100K rows on the 3GB tables. Performance continues to increase for higher DOP. The peak speedup over DOP 1 was 8.26.

[Figure: Loop Join 100K rows versus Degree of Parallelism]

In examining the CPU usage during the tests, it was evident that in some cases, only one logical processor of a physical core was used. In other cases, both logical processors on some cores were used. This leads to the uneven scaling versus DOP.

Parallel Execution Plan Throughput Tests

Another point of interest is throughput with parallel execution plans when supporting concurrent sessions. The figure below shows performance in rows per sec at various DOP settings versus the number of concurrent sessions. The results are based on the number of queries completed in a 10-sec window. Since the single session DOP 1 query run-time is about 100ms, this should be reasonably accurate.
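A hedged sketch of this kind of throughput driver is below: each concurrent session runs the test query in a loop and counts completions in a 10-second window. The query, table, and column names are illustrative, not the author's actual harness.

```sql
-- Per-session throughput driver: count test query completions in a 10-second window.
DECLARE @start datetime2 = SYSDATETIME();
DECLARE @completed int = 0, @s1 float, @s2 float;

WHILE DATEDIFF(SECOND, @start, SYSDATETIME()) < 10
BEGIN
    SELECT @s1 = SUM(a.P_RETAILPRICE), @s2 = SUM(b.P_RETAILPRICE)
    FROM   dbo.PartM  a
    JOIN   dbo.PartM2 b ON b.JoinKey = a.JoinKey
    WHERE  a.P_PARTKEY BETWEEN 1 AND 100000
    OPTION (MAXDOP 2);          -- repeat the run at each DOP setting of interest

    SET @completed += 1;
END

SELECT @completed AS queries_completed_in_10_sec;
```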

[Figure: Hash Join Performance (rows/s) by DOP versus Number of Concurrent Sessions]

SQL Server parallel execution plans are great for a single session, even for queries that run in so short a time that parallelism should have never been considered. However there are definitely issues with parallel execution in throughput tests.

[Figure: Loop Join Performance (rows/s) by DOP versus Number of Concurrent Sessions]

It should be stressed that the queries used in the tests above run in 95ms and 18ms respectively at DOP 1. So it is not surprising that there are significant throughput issues in attempting concurrent parallel execution with such minuscule queries. I present these as examples because they are among the smallest queries for which SQL Server engages parallel execution with the default settings.

This is because the default settings are seriously obsolete. The original setting, which targeted a 5 sec threshold for parallel execution on a 166MHz Pentium Pro (I think this was a 0.8um, or 800nm, design), does not work well on modern microprocessors at 3GHz on a 32nm process.

That said, SQL Server does have issues with concurrent parallel execution throughput even on properly sized queries, so some effort here would be helpful. But in priority it ranks below indexing on hash keys and parallel write operations.

(The initial release of this post neglected the simple index seek test.)

Parallel Query Performance and Actual Cost - Index Seek

Below are the performance and cost characteristics for a simple index seek aggregating one float column over 350K rows. The cost per row is 0.1734usec at DOP 1. Considering that the hash join test above does 2 index seeks and sums 2 float columns, we could speculate that the cost of the bare hash join is 0.952 - 2 x 0.1734 = 0.605 usec per row.

[Figure: Index Seek (SUM of one float column) versus Degree of Parallelism]
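For reference, a hypothetical form of this index seek test is below; the float column name and the range predicate are illustrative assumptions, not the author's script. Replacing the SUM with COUNT(*) gives the bare-seek variant discussed next.

```sql
-- Hypothetical single index seek test: sum one float column over ~350K rows.
SELECT SUM(a.SomeFloatCol)        -- SomeFloatCol is a hypothetical column name
FROM   dbo.PartM a
WHERE  a.P_PARTKEY BETWEEN 1 AND 350000
OPTION (MAXDOP 1);
```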

Below are the characteristics for a simple index seek with only COUNT(*). The cost is now only 0.0783usec per row. This implies that the float column SUM costs 0.095usec per row.

[Figure: Index Seek (COUNT(*) only) versus Degree of Parallelism]

The bare index seek cost of 0.0783usec per row also amortizes the cost of the page access. At 52 rows per page, the full page cost including rows is 4.07usec. Previously, I had assessed that the bare page access cost was 1us on the Core 2 platform. The Westmere processor is more powerful and, most importantly, has significantly lower memory round-trip time. So I am expecting the bare page cost on Westmere to be less than 1usec.

It might seem that we are quibbling over fractions of 1usec. But we need to consider that 1us on a modern microprocessor is 2,000-3,000 CPU-cycles, and this on a processor core that can execute multiple instructions per cycle. Those of you with very good friends on the Microsoft SQL Server engine team might persuade them to show you the source code and compiled binary/assembly. Even without looking at the source code, it is clear that there are not nearly that many instructions needed to touch a page, a row, and a column.

The CPU-cycles are spent in the lock mechanism and in the memory round-trip sequences. This is why there are significant differences between a traditional database engine with all data in memory and the "in-memory" database engines that have a completely different data organization. It might be better to refer to traditional database engines as page-row structures and to the new engines as column-oriented structures.

See my blog SIMD Extensions for the Database Storage Engine for my thoughts on an alternative solution.

Parallel Execution Setting Summary

It is definitely clear that the default settings for Cost Threshold for Parallelism (5) and Max Degree of Parallelism (0 - unrestricted) are seriously obsolete and should be changed as standard practice on SQL Server installations.

It is not clear that there is a single universal best value for Cost Threshold for Parallelism. I think the more important criterion is that there should not be a high volume of concurrently running queries with parallel execution plans. Whatever the plan cost of the high volume queries is, the CTP should be set above it. Of course, the other important objective is to reduce the plan cost of these high volume queries, assuming this correlates to improved actual query execution time.

The strategy for Max Degree of Parallelism is also unclear. Before the advent of multi-core processors, a good strategy was to disable parallelism on transaction processing systems. It was also critical in the old days to ruthlessly enforce the restriction to only pure transactions. Today, we have 32-40 powerful cores on transaction systems, and almost everyone runs complex queries on them. So parallelism is good.

Ideally, I would like to restrict MAXDOP to the number of physical cores on a data warehouse system, and to less than that on a transaction system. I would also like to get great scaling with parallelism. But because SQL Server does not balance the threads to one logical processor per physical core, this strategy does not work. So this is something the SQL Server team needs to address.
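For completeness, both settings are instance-level sp_configure options. The values below (25 and 8) are only illustrative starting points in the spirit of the discussion above, not universal recommendations.

```sql
-- Change the two settings at the instance level; pick values for your workload.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;  -- illustrative value
EXEC sp_configure 'max degree of parallelism', 8;        -- illustrative value
RECONFIGURE;
```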

I will make continued updates on this on my website www.qdpma.com in the topic Parallelism.

Published Thursday, November 15, 2012 4:11 PM by jchang


Comments

 

tobi said:

The cost model is seriously outdated. It seems that changing it would not be a huge effort as just a few constants need to be tweaked. The fundamental formulae are still correct.

I guess the cost model could be parameterized easily on CPU throughput, sequential IO throughput and random IOs/sec. Just three numbers that are easily benchmarked.

November 16, 2012 10:16 AM
 

chris wood said:

Just wondering if this post was triggered by Adam's talk last week at the Pass Summit. He mentioned that the I/O costing was based on late 90's hardware and not today's.

Chris

November 16, 2012 10:26 AM
 

jchang said:

tobi: figuring out what the new cost model should be is not a big deal. What MS is afraid of is the implications of changing the old model that everyone has already grown accustomed to, however out of date.

I suggest leaving the existing model as the default, with the option of enabling a new model. There is trace flag 2301 (see http://blogs.msdn.com/b/ianjo/archive/2006/04/24/582219.aspx), but this is a special case for TPC-H and is not widely used. Still, I would like a new model that can be enabled.

Chris: I mapped the SQL Server query optimizer cost model in 2002 (for version 2000, and later versions 2005 and 2008) this link says 2004, but that was the 2nd revision, overwriting the original.

http://www.sql-server-performance.com/2004/quantative-analysis1/

see also

http://www.sql-server-performance.com/author/joechang/

and of course, my web site has the slidedecks from SQL Server Connections 2003, CMG 2004, and HP World

http://www.qdpma.com/

http://www.qdpma.com/ppt/ExecutionPlanCostFormula.ppt

http://www.qdpma.com/ppt/QuantitativePerformanceAnalysis_CMG2004.ppt

November 16, 2012 10:49 AM
 

Reiner said:

Joe,

Absolutely incredible! Thanks so much for investigating this.

One of the things I've always been wondering is how much impact the number of NUMA nodes has on query performance together with MaxDOP.

When we changed from G6 HP-Servers to G7 HP-Servers (4-Socket), the runtime for our jobs was nearly cut in half. Of course the G7 have more cores, but we stuck with MaxDOP=4 and we also don't have a great deal of parallelism included in our jobs.

Could you shed some light onto this situation, possibly with a post like this? It seems that going from 4 NUMA-Nodes to 8-NUMA-Nodes explodes performance.

Thanks lots.

January 4, 2013 8:00 AM
 

jchang said:

Reiner, sorry I did not see your comment; please email me if you need a reply to an old post.

In 4-socket, I think HP skipped G6, going from DL580 G5 to G7. The G5 would have been Core 2. The 300 series had a G6 for Nehalem-EP processors. By the time 4-socket Nehalem-EX was out, the 300 went to G7 on Westmere-EP, so I think HP just wanted people to consider the 500 series contemporary with the 300. There would have been a huge difference between the Nehalem and the Core 2 processors. There should have been only a small improvement in Westmere over Nehalem core-to-core.

Going above 4-socket, especially with 8 or more cores per socket, can yield gains, but I think it will require careful assessment of the ability to use high parallelism.

April 3, 2013 4:13 PM
 

Lonny Niederstadt said:

Lots of great detail!

You'll be glad to know that a number of folks are recommending changing the cost of parallelism on systems... speakers at PASS Summit and Microsoft PFEs :)

I'm a bit concerned with 3 things, though:

1. I see people suggesting changing the COP more often than I see building a distribution of query executions at the various cost ranges.  Pick a number that's too low for the system, and it just won't have an effect.  Probably more of a concern on DW/DSS systems than OLTP.

2. If the COP is changed, care must be taken to consider artificially low cost plans that may be forced to serial.  I'm thinking about something like queries that heavily use tables with ascending key clustered index, without trace flags 2389 and 2390, and maximum histogram range_hi key lower than maximum clustered index value AND lower than range of query interest.  If a really bad plan is forced due to estimatedrows=1 but it is executed in parallel... well, it'll suck but suck in parallel.  If COP is now above that poor cost estimate, the query would suck in serial - probably longer than it sucked in parallel. :)

3. I like to make sure that before changes to MAXDOP and COP, systems have been evaluated for spinlock contention and CMEMTHREAD waits within NUMA nodes/scheduler groups.  So many doggone cores in these new CPUs, and with the query memory allocation bottleneck by default at the NUMA level, it's not too hard for a workload to start racking up benefits from trace flag 8048.

October 26, 2013 6:37 PM

