THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Cardinality Estimation for Disjunctive Predicates in SQL Server 2014

Introduction

Back in January 2014, I wrote an article for SQLperformance.com describing the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators. The article describes the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags.

I was able to describe the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected by AND (conjunctive predicates), which was already relatively well-known. Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by OR. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.

I gave up on it until a few weeks ago, when I was invited to contribute to the a technical review of a White Paper Joe Sack was writing with assistance from Microsoft. One of the small contributions I was able to make was to point out that exponential backoff was used for disjunctions as well as conjunctions. The paper has now been released so I can now write about the detail behind the statement that appears there concerning cardinality estimation for disjunctive predicates (thank you Joe for pointing this out to me):

image

A Bit Of Boolean

The key bit of new information is the second sentence. As soon as I read it, a vague memory from my patchy education came back to me, as well as a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:

image

This gives us a way to turn a disjunction into a conjunction. It is a short step from there to understand that exponential backoff for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws. Converting the troublesome OR to an AND (with the appropriate negations) allows us to reuse the AND exponential backoff calculation for OR queries!

The Theory

Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:

image

The interesting transformation from disjunction to conjunction can be expressed as:

not (A or B) is the same as (not A) and (not B)

This does not quite give us a direct translation from OR to AND. We need an extra step of observing:

(A or B) is the same as not (not (A or B))

This trivial double-negative means we can now directly substitute the not (A or B) above, to get:

(A or B) is the same as not ((not A) and (not B))

This gives us OR expressed in terms of NOT and AND.

The very last bit of theory is to remember that A and B are probabilities between 0 and 1, so:

(not X) = (1 – X)

Worked Example

Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014:

SELECT 
    COUNT_BIG(*) AS NumRows
FROM Production.TransactionHistory AS TH
WHERE 
    TH.TransactionID BETWEEN 100000 AND 168412
    OR TH.TransactionDate BETWEEN '20070901' AND '20080313';

There are two predicates in this query, connected by OR.

The selectivity of the predicate on Transaction ID can be derived directly from the statistics histogram:

DBCC SHOW_STATISTICS
(
    'Production.TransactionHistory', 
    'PK_TransactionHistory_TransactionID'
);

image

This very compact histogram gives us all the information we need to say that the range of 68,413 Transaction ID values can be expected to produce 68,413 rows.

Expressed as a fraction of the 113,443 rows in the table, the selectivity of this predicate is 68413 / 113443 = 0.603061. We will call this value selectivity A.

The second predicate (on TransactionDate) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here). The histogram calculation results in a cardinality estimate for this predicate that is also 68,413 (because I deliberately chose the predicates to qualify exactly the same physical rows). The selectivity is likewise 0.603061.  We will call this value selectivity B.

Enter DeMorgan

We know from our work with DeMorgan that:

(A or B) is the same as not ((not A) and (not B))

So we can rewrite our (A or B) predicate immediately as not ((not A) and (not B)), enabling us to avoid worrying about the OR.

Note this transform is done purely for cardinality estimation purposes. The optimizer tree is not affected by this work.

Doing the Math

Now, given that selectivities A and B are both equal to 0.603061 in this contrived example, our expanded expression becomes:

not ((not 0.603061) and (not 0.603061))

We also know:

(not X) = (1 – X)

Expanding the inner nots means the computation becomes:

not ((1 - 0.603061) and (1 - 0.603061))

We can perform the numeric math inside the parentheses to get:

not (0.396939 and 0.396939)

Now we use exponential backoff to compute the and selectivity:

not (0.396939 * SQRT(0.396939))

And replace the outer not just as we did before:

1 - (0.396939 * SQRT(0.396939))

Now we have just numbers. The final selectivity is 0.749916.

The Result

The very final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:

0.749916 * 113443 = 85073 rounded up

The 2014 execution plan for this query, when using the new cardinality estimator, is:

image

The cardinality estimate is 85,073 rows, just as we predicted!

Sadly, the actual number of rows returned by this query is 68,413 rows (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.

 

© 2014 Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi (note the underscore!)

Published Tuesday, April 15, 2014 7:13 PM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

RichB said:

So.... do double notted ands get better results than old ors?

April 15, 2014 3:47 AM
 

Paul White said:

Hi RichB,

The pre-2014 formula was (S1 + S2) - (S1 * S2); see the linked article for details).The exponential backoff approach in 2014 might be "better" or "worse" for any particular query, it's hard to generalize.

April 15, 2014 1:49 PM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement