
Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

A creative use of IGNORE_DUP_KEY


Let’s say you have a big table with a clustered primary key, and an application that inserts batches of rows into it. The nature of the business is that the batch will inevitably sometimes contain rows that already exist in the table.

The default SQL Server INSERT behaviour for such a batch is to throw error 2627 (primary key violation), terminate the statement, roll back all the inserts (not just the rows that conflicted) and keep any active transaction open:

CREATE TABLE #Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO
-- Start with values 1 & 5 in the table
INSERT #Big (pk) VALUES (1), (5);
 
-- Our batch transaction
BEGIN TRANSACTION;
    -- Insert a batch containing some pre-existing rows
    INSERT #Big (pk) VALUES (1), (2), (3), (4), (5);
 
    -- Show the contents of the table after the insert statement
    SELECT pk FROM #Big;
 
    -- Show the transaction count
    SELECT tran_count = @@TRANCOUNT;
 
-- Rollback
ROLLBACK TRANSACTION;
 
-- Final table contents
SELECT pk FROM #Big;
GO
-- Tidy up
DROP TABLE #Big;

The output is:

INSERT with duplicate values

Ignoring Duplicates

Let us further imagine that the desired behaviour is that new rows in a batch should be inserted, and any duplicates silently rejected. Most importantly, no error messages should be returned to the client. Ideally, this would be achieved without immediate application changes, and without impacting concurrent users of the table (the instance is Enterprise Edition, so online operations are available).

This seems like an ideal use for the IGNORE_DUP_KEY option:

CREATE TABLE dbo.Big
(
    pk int NOT NULL,
    
    CONSTRAINT PK_Big 
    PRIMARY KEY (pk)
    WITH (IGNORE_DUP_KEY = ON)
);
GO
-- Unique values
INSERT dbo.Big (pk)
VALUES (1), (3), (5);
GO
-- key 3 already exists
INSERT dbo.Big (pk)
VALUES (2), (3), (4);

That script executes successfully with just a warning (not an error!) about the duplicate key:

IGNORE_DUP_KEY output

The problem

We would like to add the IGNORE_DUP_KEY option to the existing primary key constraint, but the ALTER TABLE command does not have an ALTER CONSTRAINT clause. We do not want to drop the existing primary key and recreate it with the new option, because the table is large and we want to avoid disrupting concurrent users. Dropping and recreating would result in rebuilding the entire table first as a heap and then as a clustered table again.
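
For reference, this is roughly what the drop-and-recreate approach we want to avoid would look like, using the dbo.Big / PK_Big names from these examples (a sketch only, not something we intend to run against the real, large table):

-- Sketch only: each step rebuilds the entire table
ALTER TABLE dbo.Big DROP CONSTRAINT PK_Big;   -- clustered index dropped, table rebuilt as a heap
 
ALTER TABLE dbo.Big
ADD CONSTRAINT PK_Big
PRIMARY KEY CLUSTERED (pk)
WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);      -- heap rebuilt as a clustered table again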

Although there is no ALTER CONSTRAINT syntax, we do know that certain constraint modifications can be performed using ALTER INDEX REBUILD on the index used to enforce the constraint. We can, for example, change the type of compression used, the index fill factor, and whether row locking is enabled:

CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO
INSERT dbo.Big (pk)
VALUES (1), (2), (3), (4), (5);
GO
ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
GO
ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (FILLFACTOR = 90, ALLOW_ROW_LOCKS = OFF, ONLINE = ON);
GO
DROP TABLE dbo.Big;

Unfortunately, we cannot use the same trick to add the IGNORE_DUP_KEY option to the underlying index for the primary key:

ALTER INDEX with IGNORE_DUP_KEY

The same error message results even if the ONLINE = ON option is not specified. There will be a range of views about how accurate the error message text is here, but we cannot avoid the fact that the operation we want to perform is not supported by SQL Server.
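
For completeness, the statement that fails is sketched below; it simply mirrors the earlier rebuilds with the IGNORE_DUP_KEY option added:

-- Fails: IGNORE_DUP_KEY cannot be changed on this index with ALTER INDEX REBUILD
ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);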

A creative workaround

One idea is to add a new UNIQUE constraint (or index) on the same columns as the primary key, but with the IGNORE_DUP_KEY option added. This operation can be performed ONLINE:

-- Existing table
CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO
-- Existing data
INSERT dbo.Big (pk) VALUES (1), (2), (3);
GO
-- New constraint (or index) with IGNORE_DUP_KEY, added ONLINE
ALTER TABLE dbo.Big
ADD CONSTRAINT UQ_idk
UNIQUE NONCLUSTERED (pk)
WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
 
--CREATE UNIQUE NONCLUSTERED INDEX UQ_idk 
--ON dbo.Big (pk)
--WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
GO
-- Key 3 is a duplicate, just a warning now
INSERT dbo.Big (pk) VALUES (3), (4), (5);
GO
SELECT pk FROM dbo.Big;

The new arrangement results in the correct final state of the database, without throwing an error. The effect is the same as if we had been able to modify the existing primary key constraint to add IGNORE_DUP_KEY:

Output with duplicate unique index

Execution plan analysis

There are some drawbacks to this idea, which we will explore in some detail. The drawbacks are significant, so adding the extra index with IGNORE_DUP_KEY is only really suitable as a temporary solution, as we will see.

The first INSERT in the previous script (without the extra constraint or index with IGNORE_DUP_KEY) is pretty trivial; it just shows the constants from the VALUES clause being inserted to the clustered index:

Trivial INSERT

The second INSERT (with the IGNORE_DUP_KEY index) is rather more complex (click to enlarge):

INSERT with IGNORE_DUP_KEY

An extra index to maintain

One fairly obvious consequence of adding the new index is that the Clustered Index Insert operator now has to maintain the new nonclustered index too. I used Plan Explorer above because it shows the per-row nonclustered index insert more explicitly than SSMS, where you have to dig into the Object node in the Properties window with the relevant graphical operator selected:

SSMS Object Node

Another way to see the nonclustered index maintenance explicitly is to run the query again with undocumented trace flag 8790 to produce a wide (per-index) update plan:

Wide update plan
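
If you want to reproduce the wide plan yourself, the trace flag can be enabled temporarily for the session (a sketch only; the flag is undocumented and unsupported, so keep it to a test instance):

-- Sketch: enable the undocumented trace flag, run an insert, then turn it off again
DBCC TRACEON (8790);
 
INSERT dbo.Big (pk) VALUES (6), (7), (8);
 
DBCC TRACEOFF (8790);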

Besides maintaining the extra index, the IGNORE_DUP_KEY plans seem to be doing a lot of extra work: there are lots of new operators compared with the simple insert. As you would expect, all the new operators are associated with ignoring duplicate keys, and there are two distinct cases to consider.

Rows that already exist

The first case relates to checking for INSERT rows that already exist in the base table. This checking is implemented in the execution plan by the Left Semi Join, Index Seek, and Assert operators:

Plan Right Fragment

The Index Seek looks for the key value of the current row we are looking to insert. The Semi Join cannot be an Inner Join because that would reject rows where a match was not found (and we need to insert a row in that case). Nevertheless, the query processor needs to know if an existing row was found with the same key value. The Semi Join sets a value to indicate this, which is stored in the Probe Column:

Nested Loops Operator Properties

The Assert operator (known internally as a Stream Check) tests a condition and raises an error if the test fails. Assert operators are often seen in plans that modify columns protected by CHECK constraints, for example. In this case, however, the Assert does not raise an error; it emits the ‘Duplicate key was ignored.’ message instead, and only passes a row on to its parent (the Sort operator) if the condition passes. The check performed by the Assert is based on the Probe value stored in the expression labelled [Expr1013] above:

Assert Operator Tooltip

The Assert passes on rows where the predicate evaluates to NULL. If a match was found by the Index Seek (so the Probe Column, Expr1013, is not NULL), the Assert does not pass the row on, and raises a warning instead. The following TF 8607 output fragment for the INSERT statement shows the option (there is nothing equivalent in regular showplan output, unfortunately):

WarnIgnoreDuplicate

Duplicates within the Insert Set

The first check only looks for rows from the insert set that already exist in the base table. The query processor also needs to check for duplicates within the set of values we are inserting. The Segment and Top operators in the plan combine to meet this requirement:

Plan Middle Fragment

The Segment requires a stream ordered by the index keys, and adds a flag to indicate the start of a new group of key values. The Top operator returns only one row from each group. The overall effect is to remove rows with duplicate index key values.
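
In loose T-SQL terms (an analogy only, not what the engine does internally), the Segment and Top pair has roughly the same effect as numbering the rows within each key value and keeping only the first:

-- Rough analogy for the Segment + Top pair: keep one row per key value
WITH InsertSet AS
(
    SELECT v.pk,
           rn = ROW_NUMBER() OVER (PARTITION BY v.pk ORDER BY v.pk)
    FROM (VALUES (3), (3), (4)) AS v (pk)
)
SELECT pk
FROM InsertSet
WHERE rn = 1;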

The Top and Segment execution operators together implement a single physical query processor operator called a “Group-By Top”. I mention this detail because we need to be familiar with the internal name to understand the TF 8607 output, which indicates that this operator also emits a ‘Duplicate key was ignored.’ warning when it encounters a group containing more than one row:

WARN-DUP

Locking

Adding the new index means that locks will now be taken on the nonclustered index when checking for existing rows:

Index Seek Locks

It might surprise you to learn that the index seek acquires Range S-U locks on the nonclustered index (the extract below is from Books Online):

Range Locks

Key-range locks like this are only taken under the SERIALIZABLE isolation level, but our INSERT statement uses them whatever isolation level the user runs the query under (for example, range locks are still taken if the session is running under the default READ COMMITTED isolation level).

SQL Server raises the effective isolation level to SERIALIZABLE for the Index Seek (and just that operator) because it needs to be sure that if it does not find a match in the index, that situation remains the same until it inserts the new row. The Index Seek is looking for a key value that does not exist, so a range lock is necessary to prevent a concurrent transaction inserting a row with that key value before we do. There is no existing key in the index to lock (because the row does not exist) so a range lock is the only option. If SQL Server did not do this, our INSERT query might fail with a duplicate key error despite the IGNORE_DUP_KEY setting (we check for a row, don’t find it, someone else inserts it, then we try to).
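
One way to see these key-range locks for yourself is to insert a key that does not yet exist inside an open transaction, and inspect sys.dm_tran_locks before rolling back (a sketch, using the dbo.Big table and UQ_idk index from earlier; the exact set of locks you see may vary by version):

BEGIN TRANSACTION;
 
    -- Key 6 does not exist, so the seek on the IGNORE_DUP_KEY index locks the 'gap'
    INSERT dbo.Big (pk) VALUES (6);
 
    -- Show the KEY locks currently held by this session
    SELECT resource_type, resource_description, request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = @@SPID
    AND resource_type = N'KEY';
 
ROLLBACK TRANSACTION;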

The optimizer adds a SERIALIZABLE hint for the Index Seek operator (internally, a physical Range operator) as we can see using another extract from the TF 8607 output:

Query Processor Hints

The execution plan is also forced to use the nonclustered index with the IGNORE_DUP_KEY setting for this seek (FORCEDINDEX), and an update lock (UPDLOCK) hint is given to help prevent conversion deadlocks. Nevertheless, you may find increased deadlocking if you choose to add an extra IGNORE_DUP_KEY index like this.

Permanent solutions

Adding a nonclustered unique index with the IGNORE_DUP_KEY option and the same key as the clustered primary key allowed us to solve an immediate problem without code changes, while keeping the table online, but it does come at a price. The execution plan is much more complex (and expensive) than the original INSERT, and there is a chance of deadlocks. The biggest performance impact of adding the extra index is of course the cost of maintaining it, meaning we need to look at a more permanent solution via a code change.

IGNORE_DUP_KEY
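
The tests below use a dbo.Numbers helper table of sequential integers. If you do not have one, a minimal version can be built along these lines (a sketch; any equivalent numbers table will do):

-- Sketch: a minimal numbers table holding the integers 1 to 5,000
SELECT TOP (5000)
    n = ISNULL(CONVERT(integer, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))), 0)
INTO dbo.Numbers
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;
 
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers
PRIMARY KEY (n);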

The (trivial) test I am going to run inserts 5000 unique rows from my table of numbers into the table we have been using so far. To establish a baseline, we will first look at the execution plan for an INSERT to the table with a nonclustered IGNORE_DUP_KEY index:

INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000;

IGNORE_DUP_KEY test

This plan is very similar to the one we analysed earlier, though a Merge join is used to perform the Left Semi Join, and a wide update plan has been chosen, making the nonclustered index insert easier to see. The other main feature is an Eager Table Spool, required for Halloween Protection. The estimated cost of this execution plan is 0.363269 cost units on my installation.

Modified INSERT

The first code alternative to the extra index is to modify the INSERT statement to check that rows do not already exist in the destination. There are a number of SQL syntaxes for this, each with different characteristics and performance in different circumstances. To keep things simple, and because I only want to make a couple of specific points here, I have chosen my simple example to produce the same execution plans for all the common syntax variants. The first thing to do is to drop our extra constraint:

ALTER TABLE dbo.Big
DROP CONSTRAINT UQ_idk;

Now we can evaluate the modified INSERT statement:

-- I prefer this syntax
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND NOT EXISTS (SELECT * FROM dbo.Big WHERE pk = n);
 
-- With EXCEPT
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
EXCEPT
SELECT pk FROM dbo.Big;
 
-- Not recommended
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND n NOT IN (SELECT pk FROM dbo.Big);

All three produce this execution plan:

Modified INSERT test

This has an estimated cost of 0.0981188 units – much cheaper than the 0.363269 cost seen previously. The Halloween Protection spool still features in the plan, but there is a weakness in our queries that you might have already spotted. We are not doing anything to protect against duplicate key violation errors in case someone else inserts a row after we have checked to see if it exists, but before we insert it. The query optimizer added a SERIALIZABLE hint when it added an existence check, so if avoiding errors is important to us, we need to do the same:

INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND NOT EXISTS (SELECT * FROM dbo.Big WITH (SERIALIZABLE) WHERE pk = n);

We do not need the UPDLOCK hint for two reasons. First, the engine automatically takes update locks for us when reading the source table. Second, we are reading from the same index we are inserting to (not reading from a nonclustered index and inserting to the clustered index) so the previous deadlock scenario is not applicable.

Using MERGE

Another option is to use the WHEN NOT MATCHED clause of the MERGE statement. This time we will add the necessary SERIALIZABLE hint up front:

MERGE dbo.Big WITH (SERIALIZABLE) AS b
USING (SELECT n FROM dbo.Numbers WHERE n BETWEEN 1 AND 5000) AS s
ON s.n = b.pk
WHEN NOT MATCHED THEN 
INSERT (pk) VALUES (s.n);

MERGE test

This plan has an estimated cost of 0.0950127 units – slightly less than the 0.0981188 units for the modified INSERT plans. Some of this improvement is due to the lack of a Halloween Protection spool, for interesting reasons I cover in depth in a short series of articles to be published shortly (update: part one is out).

These are not meant to be performance tests by any stretch of the imagination. There are any number of subtle factors that will affect the execution plans and run times for different numbers of rows, different distributions, and so on. I should stress that, more often than not, I find MERGE plans perform less well than separate INSERT/UPDATE/DELETE statements. Anyway, in case you are interested, typical performance results on my machine for this specific test are shown below (INSERT first, then MERGE), with timings in milliseconds:

Performance results
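
If you want to collect similar timings yourself (not necessarily how the figures above were produced), the simplest approach is probably SET STATISTICS TIME, along these lines:

-- Sketch: time the modified INSERT (repeat for the MERGE version)
SET STATISTICS TIME ON;
 
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND NOT EXISTS (SELECT * FROM dbo.Big WITH (SERIALIZABLE) WHERE pk = n);
 
SET STATISTICS TIME OFF;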

IGNORE_DUP_KEY and Clustered Indexes

When IGNORE_DUP_KEY is specified for a unique clustered index (primary key or otherwise), duplicates are handled by the storage engine rather than the query processor. For example:

CREATE TABLE dbo.T (pk int NOT NULL CONSTRAINT PK_idk PRIMARY KEY WITH (IGNORE_DUP_KEY = ON));
INSERT dbo.T (pk) VALUES (1), (1), (2), (3);

The execution plan shows none of the complexity of the nonclustered index case:

Clustered index IGNORE_DUP_KEY INSERT plan

The TF 8607 output contains none of the query processor hints to ignore duplicates and raise warnings:

TF 8607 output for the clustered index INSERT

The IGNORE_DUP_KEY side of things is handled entirely by the storage engine – if it finds a row in the unique index where it was going to insert one, it flags a warning but does not try to insert (which would cause an error). The estimated cost of this query plan is identical whether the unique clustered index has IGNORE_DUP_KEY or not – there is no extra work for the query processor.
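
A quick way to check that for yourself is to compare the estimated plans for inserts into two otherwise identical tables (a sketch; note the second INSERT would fail with error 2627 if actually executed, so compare estimated plans only):

CREATE TABLE dbo.T_idk   (pk int NOT NULL CONSTRAINT PK_T_idk   PRIMARY KEY WITH (IGNORE_DUP_KEY = ON));
CREATE TABLE dbo.T_plain (pk int NOT NULL CONSTRAINT PK_T_plain PRIMARY KEY);
GO
-- Compare the estimated execution plans (and costs) for these two statements
INSERT dbo.T_idk   (pk) VALUES (1), (1), (2), (3);
INSERT dbo.T_plain (pk) VALUES (1), (1), (2), (3);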

If you are wondering why the same mechanism is not used for a nonclustered unique index with the IGNORE_DUP_KEY option, my understanding is it is because the Clustered Index is always updated before the nonclustered ones (even in a narrow plan). By the time the storage engine detected a duplicate in the nonclustered index, it would have already added the row to the base table. So, the query processor handles IGNORE_DUP_KEY for nonclustered indexes.

This is my understanding of the mechanics of IGNORE_DUP_KEY; I don't know for sure, and will happily correct any details I may have wrong. Any storage engine experts reading this, please contact me if so.

Acknowledgement

This post was inspired by a post by Daniel Adeniji and our subsequent discussion of it. My thanks to him for permission to analyse the issue further here, and for allowing me to reference his original. The issue described here may not accurately reflect Daniel’s original problem – I used a certain amount of artistic licence. Even if the specific issue is not particularly interesting to you, I hope you enjoyed some aspects of the analysis and maybe picked up some new information along the way.

© 2013 Paul White – All rights reserved
twitter: @SQL_Kiwi
email: SQLkiwi@gmail.com

Published Friday, February 01, 2013 7:47 AM by Paul White


Comments

 

tobi said:

Amazing post, also looking forward to the post about Merge and Halloween Protection.

Another alternative to the "duplicate" UNIQUE index might be a trigger. It might be slower for DML but does not require storage space.

February 1, 2013 4:55 PM
 

wqw said:

The warning on IGNORE_DUP_KEY is raised as an *error* by current SQLOLEDB/ODBC drivers, so it is impossible to use it in client applications. I have never tested recent NCLI/.NET providers, but possibly the issue is fixed.

February 5, 2013 2:16 PM
 

Paul White said:

@tobi An INSTEAD OF trigger? I guess so, though the overheads are even worse than the extra operators in the plans I show here. Both INSTEAD OF and AFTER triggers require storage for the inserted and deleted tables.

@wqw Thanks, good to know.

February 5, 2013 5:45 PM
 

Paul White said:

The first part of the series on the Halloween Problem is now out:

http://www.sqlperformance.com/2013/02/t-sql-queries/halloween-problem-part-1

February 13, 2013 9:06 PM
 

wqw said:

FYI, on SQL Azure databases the `WITH IGNORE_DUP_KEY` option raises "Deprecated feature 'INDEX_OPTION' is not supported in this version of SQL Server."

February 15, 2013 11:22 AM
 

Paul White said:

wqw,

Interesting. Another thing: MERGE ignores IGNORE_DUP_KEY for insert operations too. Thanks for the Azure information.

February 16, 2013 5:23 AM
 

sqlscripter - Ed P said:

Neat article. I had to stuff 5 databases into 1 (of course, for a "special" reporting db) and got that same error years ago. Funny, no one could get it to work for a long time I was told; then I found the IGNORE_DUP_KEY option and bingo... everyone in the pool... problem solved.

March 12, 2013 12:50 PM
 

f byron said:

great article!  i've had some interesting adventures with IGNORE_DUP_KEY recently and would like to share the following observations (SQL 2008 R2):

1.) as far as i can tell, the only time you can get IGNORE_DUP_KEY=ON going for a clustered index is during initial PK creation (using the syntax you have above); specifying the IGNORE_DUP_KEY option WITH (DROP_EXISTING=ON, IGNORE_DUP_KEY=ON) will always fail, even if the PK/Clustered Index was created with IGNORE_DUP_KEY=ON in the first place.

2.) the above is aggravating because if you use DROP_EXISTING=ON to move your clustered index to a new filegroup/scheme/whatever, you will lose the IGNORE_DUP_KEY setting and cannot get it back without dropping to heap and recreating.

3.) however, if you create an empty, identical table with IGNORE_DUP_KEY=ON for your primary key and ALTER TABLE...SWITCH to it, the setting from the new table will take effect.  so as a workaround for huge tables, under the right circumstances you can ALTER TABLE SWITCH to an identical (empty) shell, drop and recreate your original table with IGNORE_DUP_KEY=ON, and then switch back, and in effect the IGNORE_DUP_KEY = ON will now be in force on the original huge table.

June 18, 2014 7:32 AM
