THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Data Warehouse modelling deliberations – foreign keys and unknown members

Earlier today I posted the following question on Twitter:

image

Foreign keys in a data warehouse. Yes or no? This discussion is looming at work, i know my position but am interested in what others think.
(https://twitter.com/jamiet/status/324215986729385986)

Specifically, I wanted to know if people were in favour of creating foreign keys from their fact to to their dimension tables or not.

To say it prompted a good response is a bit of an understatement, 38 responses so far and counting. Here are some of those responses:

image

image

image

image

image

image

image

image

image

image

image

It certainly seems to be an emotive subject and its clear (to me) that there’s no correct answer, just lots of opinions. That’s a good thing. The majority of responders appeared to be of the opinion that a data warehouse should contain foreign keys and that is my position too. In this blog post I want to outline why I believe that one should create foreign keys from a fact table to its dimension tables:

Maintain integrity

Of course, this is the main reason why foreign keys exist – to protect the integrity of your data. I see no reason not to use them for this purpose in a data warehouse. The main argument that I see going against is that with a sufficiently robust ETL solution it shouldn’t be necessary. That is true but I would counter with “how do you know that your ETL solution is suitably robust?” I don’t think its possible to anticipate every eventuality that may arise and for that reason I like the safety net that foreign keys provide. I liked Devin Knight’s response here, foreign keys breed confidence.

Communicate business logic

When I join a project that has a database in place the first thing I do is try and understand the data model – to do that I go and look at the foreign keys in that database. Understanding the dependencies between entities is crucial in any data model and the best means of communicating those is via foreign keys. If I encounter a database that is bereft of foreign keys then my heart sinks a little.

Performance

Foreign keys can, in some circumstances, be beneficial in improving query performance. Take a read of A Check and Foreign Key Constraint Improves Query Performance by Sarvesh Singh or Do Foreign Key Constraints Help Performance? by Grant Fritchey.

Code Generation

I am a big fan of generating ETL code where possible and foreign keys can be invaluable when doing so.

 

As I said there is no correct answer here so if you have any opinions, either agreeing or disagreeing, I look forward to reading your thoughts in the comments below.


Mark Stacey’s comment prompted an interesting digression into talking about surrogate keys for denoting unknown members and this is something I have strong opinions on too:

SNAGHTML1ce05180

In short, I don’t like the practice of using “–1” as the surrogate key for an unknown member. My reasoning is simple, I don’t like giving meaning to something that is supposed to be meaningless. How then should we indicate which is the unknown member? I propose a single-row table that includes a column for each dimension table, each with a foreign key to the unknown member in the respective dimension table.

Moreover I don’t like the practice of starting surrogate key counters from 1; the first value available for the integer datatype in SQL Server is –2147483648  so why not start from that?

I discuss both of these issues in much more depth at Considering surrogate keys for Unknown members.

Again if you have any thoughts on these subjects please put them in the comments. If nothing else I find it both fun and educational to debate this stuff.

@Jamiet

Update, Chris Adkin posted a comment below that contained a link to Microsoft's own guidance on building datawarehouses where it is stated:

"Many physical designs for data warehouses follow the star schema but do not completely specify the relationships between the fact and dimension tables, as mentioned earlier for foreign key constraints, for instance. Without the foreign key constraints explicitly specified, SQL Server must depend on heuristics to detect star schema query patterns."

Chris' take on this:

"So, assuming we are talking about a Kimball DW, there is most definitely value in using foreign key contstraints as this provides a fail safe for the heuristics getting it wrong."

Thanks Chris. 

Published Tuesday, April 16, 2013 11:36 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

@sqlslacker said:

I am most definitely NOT a fan of generated ETL code. All of the generated code I've ever had the joy of dealing with, no matter the language or purpose, has been a maintenance nightmare, and I'd rather have the ability to go in and fix or enhance it as I see fit long after the vendor drops off the face of the earth or stops providing us license files because we rebuilt the server (no seriously, that happened.)

I'm also not a fan of 'black box systems', which is almost what your data warehouse can turn into if nobody understands how your ETL process is supposed to work, and that's exactly what can and will happen when relying on the database/ETL tool/source system/users to take care of things for you.

I also see part of the function of the data warehouse as keeping the information that nobody else does, which means that sometimes you need to bring in or expect 'child' information that doesn't have 'parent' information (yet.) Knowing that customers in State A prefer Product B over Product C is a great asset a data warehouse can provide. Knowing that the sales people are capable of entering orders without a customer id is also a great asset that a data warehouse can provide - oh look, we need to improve our entry front end or train our sales people better! My point being, sometimes the data you DON'T have tells just as important a story as the data you do.

April 16, 2013 7:01 PM
 

Mike Henderson said:

I used to be in the "though shalt not create nullable columns" camp. That meant coming up with artificial values such as "-1" you speak of. Years ago I came off that high horse precisely because of the giving meaning to the meaningless notion you speak of.

There was a puzzle with nullable columns where application layer equivalent was a primitive type. For example a column with a data type of bit typically is programmed as a bool in the .Net code. A bool is never null, it's either on or off. But a bit column can be null. Programmers ended up writing extra code to reconcile the difference.

This conundrum has more less been resolved with nullable primitives (bool?), and I have stopped worrying about unknowns and learned to love nulls.

Mike

April 17, 2013 8:01 AM
 

Chris Adkin said:

Jamie,

From: http://technet.microsoft.com/en-us/magazine/2008.04.dwperformance.aspx

Many physical designs for data warehouses follow the star schema but do not completely specify the relationships between the fact and dimension tables, as mentioned earlier for foreign key constraints, for instance. Without the foreign key constraints explicitly specified, SQL Server must depend on heuristics to detect star schema query patterns.

the paper goes on to say:

Note that these are heuristic rules. There are few real circumstances that will lead the heuristics to pick a dimension table as the fact table.

So, assuming we are talking about a Kimball DW, there is most definitely value in using foreign key contstraints as this provides a fail safe for the heuristics getting it wrong. Also if you are using SQL 2008, you might want to check out trace flags 4199, 4137 and 4138 as these enable curtain optimizer hot fixes to kick in. we have used these and seen the use of star joins and few outer row optimisations kick in, this has reduced cube build times tremendously. These flasg can be applied at statement level using the QUERYTRACEON hint, I believe there was something on the SQL CSS engineers blog recently about this now being documented.

Chris

April 22, 2013 9:43 AM
 

jamiet said:

Fantastic link, thanks Chris. I'm devouring that as I write.

April 22, 2013 11:04 AM
 

Chris Adkin said:

Jamie to add further weight to this, the Microsoft sample data warehouse database uses foreign keys, also the use of foreign keys provides a safety net should anything go awry with the ETL. Connor Cunningham has a blog posting on how the optimizer can use the presense of foreign keys to eliminate joins.

That should provide you with enough to "Go up to the crease with and bat" ;-)

April 23, 2013 11:18 AM
 

Alejandro Mesa said:

Hi Jamie,

This is a great topic but I am affraid that we will use one more time the phrase "It depends".

I always start with a safe approach, using FK constraints and adding NC indexes to support them. I use them during DEV and QA to monitor the ETL process regarding this matter. If I know in advance that the magnitud of the loading is huge or significant, initial and subsequnt, then I tend to disable the FK constraints. I will continue monitoring the use of the NC indexes (on the fact table) and see if the benefits during query time overpass the friction during the loading. If it is the contrary then I will get rid of them too.

The risk of messing your data is present but we need to find a balance. The size of the load, the ETL window, how many datamarts, the pattern of querying them, how many dimensions are tied to your fact table, etc are things that we put toguether before determining the shape of the datamart.

I have used this approach for 8 years now and it has served me well. This doesn't mean I am right, and I will be very happy to be proved wrong since my users are the ones that will benefit at the end.

--

AMB

April 24, 2013 9:16 AM
 

jamiet said:

Hey Alejandro,

Good to hear from you, my friend!

I guess you can sum that up as "use FKs unless you can prove there is a good reason not to". that's pretty much my way of thinking too.

"Finding the balance" is damned difficult! that's what makes this fun! :)

cheers

Jamie

April 24, 2013 9:44 AM
 

Alejandro Mesa said:

<Jamie>

Finding the balance" is damned difficult!

</Jamie>

That is why they hire people like you, my friend.

--

AMB

April 24, 2013 2:51 PM
 

Andy said:

I found this blogpost when having a little nosey around the web for "FK or No FK in the DW" as up to now I've always used FKs no question.  However with the new SQL 2014 Clustered ColumnStore index I couldn't create one on a fact table with foreign keys...

http://msdn.microsoft.com/en-us/library/gg492088.aspx

"We view the clustered columnstore index as the standard for storing large data warehousing fact tables, and expect it will be used in most data warehousing scenarios."

Does this mean MS is telling us NOT to use foreign keys?

September 25, 2014 10:42 AM
 

jamiet said:

Hi Andy,

I don't think you should interpret it as that. I'd interpret it as "columnstore indexes are a good idea, the trade-off is that you won't be able to use FKs. The choice is yours, but we think the trade-off is probably worth it - your mileage may vary"

JT

September 25, 2014 10:54 AM
 

Andy said:

Yeah looks like it's more a decision based on what you want your architecture to be.  It's certainly prompted a few discussions in the office!

September 26, 2014 5:01 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement