THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Declarative ETL

T-SQL is a declarative programming language. This means that a T-SQL developer defines what (s)he wants the query to do, not how to do it. The secret sauce that turns the “what” into the “how” is a piece of software called the query optimiser and there are tomes dedicated to how one can make the query optimiser dance to their will. Generally the existence of a query optimiser is a good thing as in most cases it will do a better job of figuring out the “how” than a human being could*.

SSIS dataflows on the other hand are an imperative programming language**, the data integration developer dataflow builds a data pipeline to move data exactly as (s)he desires it to happen. In SSIS, There is no equivalent of the query optimiser.

I’ve often pondered whether there is an opportunity for someone to build a declarative data pipeline, that is, a method for a data integration developer to define what data should be moved rather than how to move it. Over the last few months I’ve come to the realisation that Power Query (formerly known as Data Explorer) actually fits that description pretty well. If you don’t believe me go and read Matt Masson’s blog post Filtering in Data Explorer in which he talks about query folding:

[Power Query] will automatically push filters directly to the source query

Even though we selected the full table in the UI before hiding the columns certain columns, we can see the source query only contains the columns we want

“filters” aren’t the only type of query folding that Data Explorer can do. If you watch the queries, you’ll see that other operations, such as column removal, renaming, joins, and type conversions are pushed as close to the source as possible.

Let’s not idly dismiss this. Query folding is very impressive and is (I think) analogous to predicate pushdown that is done by the query optimiser (for more on predicate pushdown read Nested Views, Performance, and Predicate Pushdown by Dave Wentzel). Does Power Query then have the equivalent of a query optimiser? I would say yes, it does although its more accurate to say that its the underlying query language called M for which this query optimiser exists!

So, we have a declarative data integration language which is surfaced in Power Query and hence can only (currently) push data into an Excel spreadsheet. Imagine if M were extended were extended into a fully-fledged data integration tool that could move data between any heterogeneous source, would that constitute “declarative ETL” and is there a need for such a tool?

I’ll leave that one out there for you to ponder. Comments would be most welcome.

@Jamiet 

*OK, you might debate whether the query optimiser can do a better job than a human but let’s save that for another day.

**We can debate whether or not SSIS dataflows are a programming language or not but again, let’s save that for another day. For the purposes of this discussion just go with me, OK.

Published Friday, August 16, 2013 4:05 PM by jamiet
Filed under: , ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Davide Mauri said:

I've come to have the idea the declarative programming is THE way for the future.

August 16, 2013 10:18 AM
 

Davide Mauri said:

For example I'd love to be able, in the future, to ask to a tool/system to "Add a dimension x to the DWH y, using the source view z".

While this is somehow doable right now, it requires A LOT of effort in defining and maintain metadata, templates and custom-made tools.

I'm really dreaming of such world since it will make IT MUCH more appealing to the business!

August 16, 2013 10:21 AM
 

jamiet said:

Interesting idea Davide. Its a bit of a digression from the post above but...I'm kinda doing something similar on my current project. Every time a new table gets introduced into our DWH we (effectively) click a button and generate some stored procedures/views/functions that are required to load data from the staging table into that table in the warehouse.

We generate a stored proc to take care of surrogate key generation (for dimension tables), surrogate key lookup (for facts and snowflaked dimensions) and SCD management.

We generate a "transform" stored proc with a big comment block in it saying /*IMPLEMENT THE MAPPING HERE*/. The code for everything else (lookups, SCD management) is generated based on known metadata.

Its been a great success. I never want to not implement data warehouse ETL in this way ever again.

Is that the sort of thing you had in mind?

JT

August 16, 2013 10:30 AM
 

Davide Mauri said:

Yeah, exactly. We also did something very similar in-house (something has been also released to the public):

http://www.solidq.com/sqj/Documents/2010_November_Issue/SQJ_005.pdf

but of course is just a small piece of work. Maybe M can be the way since is, by all means, a metalanguage? Power Query is nice but very limited IMHO...a deep integration with SSIS would be a very nice surprise :)

August 16, 2013 10:38 AM
 

jamiet said:

"a deep integration with SSIS would be a very nice surprise"

Indeed it would. I hope the right people get to read this :)

August 16, 2013 10:58 AM
 

Michael R. said:

This seems like a good idea to start something like an SSIS or even MDX query engine on top of Drill (or Hadoop for that matter).

August 18, 2013 10:51 AM
 

Koen Verbeeck said:

Declarative ETL would certainly be nice.

You can sort of build it now yourself, but it requires indeed a lot of coding effort upfront. For example, we build a flat file loader, where you just had to enter a line into a table specifying name, location, type of flat file, and it would get loaded automatically into the database. Required a lot of coding first, but it will pay off in future projects, where we will be able to do the import of source data in one day, instead of a few weeks.

August 21, 2013 2:09 AM
 

suhail ali said:

Jamie, the declarative ETL you are doing today via stored procedure would make a great blog post. I know you've talked about the merge statement and composable DML in past post so bringing it all together for a declarative ETL solution would be interesting.  I'm doing something very similar to I'm eager to compare notes.

August 21, 2013 12:58 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement