SQLblog.com – The SQL Server blog spot on the web

Jorg Klein

Jorg Klein, Microsoft Business Intelligence consultant/architect from the Netherlands

  • Implement SSAS MD cell security using dimension security with blazing performance

    SQL Server Analysis Services (SSAS) Multidimensional (MD) is a great product, and in my opinion it’s still the only real option to go for when building complex enterprise BI solutions. It’s still very fast when implemented correctly and it’s mature and therefore very stable.

     

    The only real downside is cell security, which, in my opinion, is useless. It makes performance drop dramatically because it evaluates security cell by cell. I have seen reports that run in a few seconds without cell security take 20 minutes with cell security implemented! Try to explain that to your customer; you obviously can’t.

     

    Several workarounds have existed for quite a while:

     

    • Make your measures invisible and create MDX calculations that either show or hide the measure value based on a dummy dimension. Drawbacks: the measure is only hidden, not really secured, and you need to create and maintain dummy dimensions/attributes. http://www.sqljason.com/2010/04/implementing-measure-security-in-ssas.html
    • SSAS offers the “Measures Dimension”, which gives you the possibility to secure measures like dimension members. Great, this is just what we need, but implementing it and creating MDX calculations based on the secured measures will give you errors for SSAS roles that do not have access to these measures. This is caused by the fact that the MDX script is executed after the security has been applied. So if a user who doesn’t have access to a measure (restricted by dimension security) connects to the cube while the MDX script contains a reference to this secured measure, an error is raised.
    • Create a hidden dimension on which you apply dimension security, as described here: http://cwebbbi.wordpress.com/2011/12/22/replacing-cell-security-with-dimension-security/ Unfortunately this doesn’t work for measures.

     

    For a few years I’ve implemented a workaround that uses the Measures dimension. To prevent errors in the MDX script, as described by the second workaround above, I’ve added IsError() checks around all parts of my calculations that could raise an error. For example, a simple calculation like Quantity * Price, where the Price measure could be secured, looks like this:

     

    CREATE MEMBER CURRENTCUBE.[Measures].[Amount] AS
        IIf (
            IsError ( [Measures].[Price] ),
            NULL,
            [Measures].[Quantity] * [Measures].[Price]
        ),
        NON_EMPTY_BEHAVIOR =
        {
            IIf (
                IsError ( [Measures].[Price] ),
                NULL,
                [Measures].[Price]
            )
        } ;

     

    This calculation returns NULL instead of raising an error if a user doesn’t have permission on the Price measure. Quite straightforward; the only trick here is the IsError() check in the NON_EMPTY_BEHAVIOR. Of course you have to do this in the script view of your SSAS calculations tab. In the form view it will look a bit strange, but it doesn’t cause any errors:

     

    clip_image001 

     

    Using this approach you are able to implement measure security using dimension security, while still being able to create MDX calculations with NON_EMPTY_BEHAVIOR based on these secured measures. This made the report I talked about before run in 5 seconds instead of 20 minutes. I’ve used this approach for quite some years now, and it has always been sufficient.

     

    At the moment I’m working on a very big enterprise cube that contains almost 100 dimensions, 25 measure groups and millions of records. I implemented measure security in the same way, and after a while I noticed that some really heavy queries (reports) took quite some time, for example 30 seconds or 1 minute. Reason enough to rethink this approach. When I was at the SQL Server Days in Belgium last year, I discussed this approach with Chris Webb. Of course he understood what I was doing here, and we agreed to email about it later on. Chris emailed me about a blog post of his from some time ago: http://cwebbbi.wordpress.com/2008/01/22/dimension-security-tips-tricks-and-problems/. In this post Chris explains how to create secured calculations while using dimension security, by using named sets and scope statements:

     

    CREATE MEMBER CURRENTCUBE.Measures.Test AS
        NULL ;

    CREATE SET myset1 AS
        IIf (
            IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ),
            { },
            { Measures.Test }
        ) ;

    SCOPE ( myset1 ) ;
        This = Measures.[Internet Sales Amount] ;
    END SCOPE ;

    CREATE SET myset2 AS
        IIf (
            IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ),
            { Measures.[Internet Tax Amount] },
            { Measures.[Internet Sales Amount], Measures.[Internet Tax Amount] }
        ) ;

    SCOPE ( myset2 ) ;
        This = Measures.CurrentMember * 2 ;
    END SCOPE ;

     

    I did not know about this approach, so I tried it out. To get good test results I used a very heavy query that hopefully nobody would ever try: a calculation over all sales data (millions of records) against all customers, all products and all time. Unfortunately, the results were not very pleasing:

     

    1. Calculation with IIf/IsError checks in both the calculation and the NON_EMPTY_BEHAVIOR: ran in 50 seconds.

    2. Calculation with the named set/scope approach: ran for longer than 5 minutes, after which I stopped the query.

     

    So Chris and I emailed again, and I wondered whether the lack of NON_EMPTY_BEHAVIOR (NEB) could be the cause of the slow performance of the named set/scope approach. Chris said that since SSAS 2008 the use of NEB is no longer necessary, something I had heard before. I always had the idea that adding NEB did make a difference in some cases, so I never stopped using it. So I more or less merged Chris’s approach and mine, adding the NEB using an IsError() function, and I replaced the StrToMember check with a direct reference to the measure in the named sets because in my opinion that part was not necessary:

    IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ) became: IsError ( Measures.[Internet Sales Amount] ).

     

    The result was almost unbelievable: the query now took not 5 minutes, not 50 seconds, but only 2 seconds! So adding NEB can still make a huge difference sometimes. After finding this out I contacted Chris again, and he was also very surprised by the result.

     

    I’ve created an MDX calculation template with some explanation added as comments. The big performance gain is achieved because the IsError() check doesn’t need to be executed every time the calculation is evaluated: it’s executed once, when the static named set is created, and that result is simply reused afterwards. I think the same applies to the IsError() check in the NEB; it also seems to be executed only once and then reused. The result is that your MDX calculations are just as fast with security applied as without, something that was not possible for a long time for me, and I guess for everybody else too. Just use the template below as a standard for creating your calculations and benefit from this solution:

     

    --Calculation Template:
    ----------------------------------------
    ----------1: CHECKED CALCULATION: The secured "end product" used in other calculations----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED] AS
        NULL
        , VISIBLE = 0 ;

    ----------2: CHECK NAMED SET: Actual permission check performed here----------
    CREATE SET [Quantity_CHECK] AS
        IIf (
            IsError ( [Measures].[Quantity] )
            , { }
            , { [Measures].[Quantity_CHECKED] }
        ) ;

    ----------3: SCOPE: Assignment of either the measure or "nothing" to CHECKED calculation (1)----------
    SCOPE ( [Quantity_CHECK] ) ;
        This = [Measures].[Quantity] ;
    END SCOPE ;

    ----------4: Second secure calculation created here----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Price_CHECKED] AS
        NULL
        , VISIBLE = 0 ;

    CREATE SET [Price_CHECK] AS
        IIf (
            IsError ( [Measures].[Price] )
            , { }
            , { [Measures].[Price_CHECKED] }
        ) ;

    SCOPE ( [Price_CHECK] ) ;
        This = [Measures].[Price] ;
    END SCOPE ;

    ----------5: Calculation based on the secure calculations.
    ----------Non_Empty_Behavior set with IIf/IsError on measure. Using calculations inside NEB is not possible----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Sales Amount] AS
        [Measures].[Quantity_CHECKED] * [Measures].[Price_CHECKED]
        , NON_EMPTY_BEHAVIOR =
        {
            IIf (
                IsError ( [Measures].[Price] )
                , NULL
                , { [Measures].[Price] }
            )
        }
        , VISIBLE = 1 ;
    -------------------------------------------

     

    Your calculations tab in SSAS will look like:
    clip_image003

     

     

    While investigating the inner workings of the first part of the script I added some comments to make it easy for myself to remember. I guess they are also useful for anyone who wants to know what’s really happening here:

    CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED] --> The calculation is initially NULL and will be filled based on user rights later on.
    AS NULL,
    VISIBLE = 0 ; --> Make it invisible, as this is a purely technical calculation that should not be used by end users. End users can use the measure this calculation is based on, if they have permission (set by dimension security on the Measures dimension).

    CREATE SET [Quantity_CHECK] AS --> The named set performs the actual check for user rights. Because this is a static named set, the check is executed only once, at initialization, and the result is reused afterwards.
    IIf ( IsError ( [Measures].[Quantity] ) --> An IsError on the measure that needs to be checked raises an error for a user without rights on it, because for that user the measure simply does not exist.
        , { } --> If an error was raised by the IsError function, set the value of this set to nothing: {}
        , { [Measures].[Quantity_CHECKED] } ) ; --> If no error was raised, the user has rights to access the measure; in that case set the value of the set to the CHECKED calculation created in the previous step.

    SCOPE ( [Quantity_CHECK] ) ; --> The scope is the CHECK set, which is either {} (nothing) or the CHECKED calculation, depending on user rights.
        This = [Measures].[Quantity] ; --> Assign the measure to This. If the user has rights, the measure is passed through to the initial CHECKED calculation; otherwise it is assigned to {}/nothing instead of to the CHECKED calculation.
    END SCOPE ; --> So this scope either passes the measure to the CHECKED calculation or to nothing (the empty set {}).

  • SSIS Denali CTP3 – What’s new?

    Last week Microsoft released CTP3 of SQL Server code-named Denali, including a new version of SQL Server Integration Services (SSIS). In this blog post I will look into the new key features and some of the minor improvements in SSIS Denali.

    1. Development

    Shared Connection Managers
    Connection Managers can now be shared on SSIS project level.
    You can create them in the solution explorer, in the folder Connection Managers:
    clip_image001

    Once created, they will appear automatically in all your SSIS packages. The names are in bold so you can distinguish them from your package-level connection managers:
    clip_image002

    You can also create Shared Cache Connection Managers for your cached lookups. This will be very useful for lookups that are performed multiple times, for example when you look up dimension tables multiple times from your fact table ETL. You can just convert the local cache connection manager by right clicking it:
    clip_image003

    What happened to the Shared Data Sources we knew from prior SSIS versions? Well they are no more. Shared Data Sources only lived at design time and not at runtime so they were not really useful anyway. Shared Connection Managers do live at runtime and offer even more possibilities as I will show later on, so they replace the old Shared Data Sources.

    Data Flow - Column mappings
    SSIS has always mapped columns from sources to transformations or destinations with the help of lineage IDs. Every column had a unique metadata ID that was known by all components in the data flow. If something changed in the source, this would break the lineage IDs and raise error messages like: The external metadata column collection is out of synchronization with the data source columns.
    To fix this error you would re-map all broken lineage IDs with the “Restore Invalid Column References Editor”.
    In Denali, lineage IDs are no longer used. Mappings are done on column names, which is great because you can now auto-map on column names and even copy/paste pieces of another data flow and connect them by mapping the corresponding column names.

    Data Flow - Flexible order of authoring
    This improvement helps you edit data flow components even when they don’t have an input attached. Theoretically you can build your data flow backwards; start with the destination and track back to the source.

    Data Flow - Groupings
    You can now group data flow components. Select the components you wish to group, right click and select Group:
    clip_image004 


    The result is some sort of a data flow sequence container:
    clip_image005

    By clicking the arrow it will collapse:
    clip_image006

    Data flow groups are 100% eye candy; you can’t set any properties on them.

    Data Flow - Data Quality Services Cleansing transformation
    With this transformation you can apply data quality rules in the data flow. This is done by using a Knowledge Base which can be created by yourself or downloaded from the Windows Azure Marketplace. For example you could apply a rule that checks if a given postal code column is valid for a particular town column in your record.

    Data Flow - Data Tap
    In Denali we have the possibility to attach a “tap” to a data flow path (arrow). This tap captures all data coming through and dumps it into CSV files.
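    The data tap itself is added with T-SQL against the SSIS catalog rather than in the designer. Below is a hedged sketch, assuming the catalog.add_data_tap, catalog.create_execution and catalog.start_execution procedures as they appear in the finished product (they may differ in CTP3); the folder, project, package, task path and data flow path identification string are hypothetical placeholders for your own.

    DECLARE @execution_id BIGINT, @data_tap_id BIGINT;

    -- Create an execution for the package we want to tap
    EXEC SSISDB.catalog.create_execution
         @folder_name     = N'TestFolder',
         @project_name    = N'Test',
         @package_name    = N'Package.dtsx',
         @use32bitruntime = 0,
         @execution_id    = @execution_id OUTPUT;

    -- Attach a tap to one data flow path; only a file name is given,
    -- the CSV should end up in the DataDumps folder of the SSIS installation
    EXEC SSISDB.catalog.add_data_tap
         @execution_id            = @execution_id,
         @task_package_path       = N'\Package\Data Flow Task',
         @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
         @data_filename           = N'DataTapOutput.csv',
         @data_tap_id             = @data_tap_id OUTPUT;

    EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;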

    2. Configuration

    SSIS Denali is fundamentally different from its predecessors when it comes to the concept and usage of configurations. SSIS package configurations are obsolete (they will still be available if you really want to use them ;-)) and parameters and environments are the new kids in town.

    Parameters
    SSIS Parameters look a lot like SSIS variables but there are some differences. There are two types of these parameters:
    1. Package Parameters:
    Think of Package Parameters as C# parameters, which are passed as input to a C# function (= your package). You can set them when executing (calling) a package, and the lifecycle of the parameters is limited to the SSIS package itself.

    The difference with SSIS variables? You can set the parameters while executing a package with SSMS or an Execute Package Task.

    Define your Package Parameters at the Package Parameters tab:
    clip_image007

    2. Project Parameters:
    Identical to Package Parameters, except for the scope, these Parameters exist on project level and can be referenced throughout all SSIS packages in your project. Package Parameters only live inside a single SSIS package.

    You can define Project Parameters in the solution explorer within your SSIS project:
    clip_image008

    Both Package and Project Parameters can be referenced from your SSIS packages; you can recognize them by their prefix, $Package or $Project:
    clip_image009

    Setting parameters in an Execute Package Task is done on the new Parameter bindings tab:
    clip_image010

    It’s also possible to parameterize SSIS tasks on the Control Flow by right-clicking them and choosing Parameterize:
    clip_image011

    Loads of capabilities here! I now realize that I have created a workaround for Package Parameters with my SSIS Package design pattern for loading a data warehouse where I (mis)used Package Configurations as Package Parameters. Creating a new package design pattern for Denali definitely goes on my TODO list!
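    Package and Project Parameters can also be supplied when a package is started with T-SQL against the SSIS catalog. This is a hedged sketch, assuming the catalog stored procedures as they appear in the finished product (they may differ in CTP3); the folder, project, package and the LoadDate package parameter are hypothetical examples.

    DECLARE @execution_id BIGINT;

    EXEC SSISDB.catalog.create_execution
         @folder_name     = N'TestFolder',
         @project_name    = N'Test',
         @package_name    = N'Package.dtsx',
         @use32bitruntime = 0,
         @execution_id    = @execution_id OUTPUT;

    -- object_type 20 = Project Parameter, 30 = Package Parameter
    DECLARE @value SQL_VARIANT = N'2011-07-20';
    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id    = @execution_id,
         @object_type     = 30,
         @parameter_name  = N'LoadDate',
         @parameter_value = @value;

    EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;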

    Environments
    Environments are collections of SSIS package settings that can be defined on the SSIS Server. At runtime, the environment overrides these settings in the SSIS packages. You can create multiple environments, and when you run your SSIS packages you can select which environment they should use. It’s also possible to let multiple SSIS projects run under one environment, so there is plenty of flexibility.

    To make you understand the principle of Environments right away I have created a simple example that you will commonly use: Create two Environments, one with development settings and one with production settings.

    I have deployed a Test SSIS project to SSMS which contains one SSIS package with one Shared Connection Manager. Notice the new Integration Services folder structure:
    clip_image012

    Next right click Environments and choose Create Environment:
    clip_image013

    Type Development as name for the Environment and click OK:
    clip_image014

    Now double click the Development Environment:
    clip_image015

    Click on the Variables tab and create an Environment Variable which will overwrite the Shared Connection Manager. Type in the connection string for the development server as the Value and click OK:
    clip_image017

    Next create another Environment with the name Production and also create an Environment Variable with the name SCM_EnvironmentVar. The only difference between these two variables should be the value of the variable; it contains the production server connection string instead of the development value.
    You now have two Environments with one Environment Variable for the Shared Connection Manager each:
    clip_image018

    We now need to reference the Development and Production Environments in the Test project so they can be used. Right click the Test project and choose Configure:
    clip_image019

    Go to the references page and click Add..
    clip_image021

    Add both Environments:
    clip_image023

    Both environments are now referenced from the Test SSIS project and are available. Let’s configure the Shared Connection Manager so it will use the value specified in one of the environments. Click on the Parameters page, then the Connection Managers tab, and then the browse (…) button next to the ConnectionString property of the Shared Connection Manager:
    clip_image025

    Now select Use Environment Variable and select SCM_EnvironmentVar. Click OK:
    clip_image026

    The name of our Environment Variable is shown in the Value box of the ConnectionString property:
    clip_image028

    We have now successfully attached the Environment Variable to override the ConnectionString property with a value from either the Development or the Production Environment. Executing the package under one of these Environments is very easy. To do so, right-click the package and choose Run:
    clip_image029

    The Run Package dialog box appears and you instantly see a message that reminds you to assign a value to the ConnectionString property of connection manager SharedConnectionManager:
    clip_image030

    At the bottom of the dialog box you can select the Environment your package should use. Select the .\Development Environment and click OK:
    clip_image031

    In the overview report you can see that the package has successfully run under the Development Environment:
    clip_image032

    If you change the package to use the Production Environment and you run the package again, you’ll get the following execution information:
    clip_image033

    Pretty straightforward and easy to use. The big difference? All the configuration work has been done in SQL Server Management Studio! Exactly the place where it should be done. In the current SSIS versions most of this work was done within Visual Studio and the configuration was done by editing tables. So unfriendly! Big improvement, great work MS!
    In this example I have set a property of a Shared Connection Manager but of course it is possible to set Package or Project Parameters with Environment Variables.
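    The same Environment setup can also be scripted with T-SQL instead of the SSMS dialogs. A hedged sketch, assuming the catalog procedures as they appear in the finished product (they may differ in CTP3); the folder name and the development connection string are hypothetical, while Development, SCM_EnvironmentVar and the Test project follow the example above.

    -- Create the Environment and its variable
    EXEC SSISDB.catalog.create_environment
         @folder_name      = N'TestFolder',
         @environment_name = N'Development';

    DECLARE @connstr SQL_VARIANT = N'Data Source=DEVSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;';
    EXEC SSISDB.catalog.create_environment_variable
         @folder_name      = N'TestFolder',
         @environment_name = N'Development',
         @variable_name    = N'SCM_EnvironmentVar',
         @data_type        = N'String',
         @sensitive        = 0,
         @value            = @connstr,
         @description      = N'Connection string for the Shared Connection Manager';

    -- Reference the Environment from the Test project ('R' = relative, same folder)
    DECLARE @reference_id BIGINT;
    EXEC SSISDB.catalog.create_environment_reference
         @folder_name      = N'TestFolder',
         @project_name     = N'Test',
         @environment_name = N'Development',
         @reference_type   = 'R',
         @reference_id     = @reference_id OUTPUT;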

    3. Deployment

    We have had the option to deploy SSIS packages from Visual Studio for a long time thanks to BIDS Helper. Microsoft finally added this functionality out of the box in Denali. They even did more: you can now also build your projects from within Visual Studio!

    There are two ways to deploy (push) your projects to the new SSIS Server: directly from Visual Studio, or by using an Integration Services Project Deployment File. Both use the new Integration Services Deployment Wizard.
    Deployment directly from Visual Studio can be done by right-clicking your project and choosing Deploy. The Deployment Wizard will then pop straight up:
    clip_image034

    The other way is to build your project first. This was already possible in SSIS versions prior to Denali, but it did not really serve a useful purpose.

    In Denali this is different. Building your project will generate an .ispac file which contains your entire project and this can be used by the Deployment Wizard. Again a very nice new feature.

    Handing over a new release of an SSIS solution to a customer’s IT department for deployment can now be done in a very nice, professional manner. .ispac files can also be opened by Visual Studio to import your SSIS project.
    Building your project can be done by right clicking your project and choosing Build:
    clip_image035

    The output window displays the following:
    clip_image036

    When looking at the bin\Development folder we see the new Test.ispac file:
    clip_image037

    clip_image038

    Double-clicking this file (or choosing Deploy in Visual Studio) will start the new Integration Services Deployment Wizard:

    You can now select the project that you wish to deploy. You can either use the .ispac project deployment file or deploy an existing project located in an Integration Services catalog. The last option is useful when you wish to deploy, for example, a TEST SSIS solution that is already located in a catalog to an ACCEPTANCE environment in another catalog.
    clip_image039

    The project is loaded and validated:
    clip_image040

    Next, you need to select the destination, which has to be an SSIS Server:
    clip_image041

    Review your selections and click on Deploy to start the actual deployment:
    clip_image042

    The results show a successful deployment. Notice that the protection level has been changed. What happened here?
    The wizard clears the protection level, as we know it from Visual Studio, to prevent “cannot decrypt password” errors.

    Instead of using protection levels, passwords are stored in clear text. The entire package (with the passwords in it) is stored encrypted in the SSIS Server tables you are deploying to.
    clip_image043

    The project has now been deployed to the server:
    clip_image044
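    For scripted or unattended releases, the same deployment can also be done with T-SQL, pushing the built .ispac straight into the catalog. A hedged sketch, assuming the catalog.deploy_project procedure as it appears in the finished product (it may differ in CTP3) and a hypothetical file path that must be readable by the SQL Server service account:

    DECLARE @project_binary VARBINARY(MAX), @operation_id BIGINT;

    -- Read the built .ispac file into a binary variable
    SELECT @project_binary = BulkColumn
    FROM OPENROWSET(BULK N'C:\Projects\Test\bin\Development\Test.ispac', SINGLE_BLOB) AS ispac;

    -- Deploy it into an existing catalog folder
    EXEC SSISDB.catalog.deploy_project
         @folder_name    = N'TestFolder',
         @project_name   = N'Test',
         @project_stream = @project_binary,
         @operation_id   = @operation_id OUTPUT;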

    When you right-click the Test project and choose Versions, you are able to see the current version of your project:
    clip_image046

    If you deploy a newer version later on, the existing version(s) stay on the server, and you can easily roll back to a previous version if you’ve made a bad deployment.

    The diagram below shows the entire deployment life cycle (source: TechNet):
    clip_image048

    4. Management

    The new SSIS Server is the central storage and administration point for your SSIS solutions. It is not a standalone server that you need to manage; basically it is a database with stored procedures.

    You now have an Integration Services node available in SSMS when you connect to the database engine:
    clip_image049

    Under the Integration Services node you will find your SSISDB catalog, which holds all your SSIS solutions with their packages, environments, etc. These objects are physically stored in a SQL Server database with the same name as the catalog:
    clip_image050

    This database also contains all the stored procedures containing all the programming code for the SSIS Server:
    clip_image051

    With the help of these stored procedures you can manage your SSIS Server: for example your parameter values and connection managers, and you can override properties by using environments.
    Next to the configuration functionality I discussed earlier, you can also implement security to control access to the catalog, both at folder level and at package level. When you right-click a folder/package in the catalog and choose Properties, you’ll get the following window where you can manage security access:
    clip_image052

    Finally you get an out-of-the-box reporting dashboard, built on the out-of-the-box logging functionality that SSIS offers. On this dashboard you get information about the execution history of your package and its sub-packages, can see which parameters were used, view specific messages, and get a performance report over time.
    All the information is logged automatically whenever a package runs on the server. The information is very detailed; you can even get the row counts between SSIS data flow transformations!

    When running a package you are able to select a logging level on the Advanced tab (a T-SQL sketch of the same setting follows the list below):
    clip_image053
    You can choose from:
    - None: turns logging off for performance reasons
    - Basic: error and warning logging
    - Performance: detailed trace information
    - Verbose: diagnostics and fault debugging
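    The same logging level can be chosen when starting a package from T-SQL. A hedged sketch, assuming the LOGGING_LEVEL system parameter (object_type 50) as it appears in the finished product; folder, project and package names are hypothetical.

    DECLARE @execution_id BIGINT;

    EXEC SSISDB.catalog.create_execution
         @folder_name     = N'TestFolder',
         @project_name    = N'Test',
         @package_name    = N'Package.dtsx',
         @use32bitruntime = 0,
         @execution_id    = @execution_id OUTPUT;

    -- 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose
    DECLARE @logging_level SMALLINT = 2;
    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id    = @execution_id,
         @object_type     = 50,
         @parameter_name  = N'LOGGING_LEVEL',
         @parameter_value = @logging_level;

    EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;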

    When I run my Test package I’ll get the following execution dashboard. There are some hyperlinks that navigate to more detailed reports.
    clip_image055

    So there is no need to create your own logging framework anymore; it’s all out of the box!
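    The logged information can also be queried directly instead of through the dashboard reports. A hedged sketch, assuming the catalog views of the finished product (catalog.executions, catalog.event_messages and catalog.execution_data_statistics) and a hypothetical execution_id of 12345:

    -- Latest executions and their status
    SELECT TOP (10) execution_id, folder_name, project_name, package_name, status, start_time, end_time
    FROM SSISDB.catalog.executions
    ORDER BY execution_id DESC;

    -- Errors, warnings and information messages of one execution
    SELECT message_time, message_type, message
    FROM SSISDB.catalog.event_messages
    WHERE operation_id = 12345;

    -- Row counts between data flow components of one execution
    SELECT source_component_name, destination_component_name, SUM(rows_sent) AS rows_sent
    FROM SSISDB.catalog.execution_data_statistics
    WHERE execution_id = 12345
    GROUP BY source_component_name, destination_component_name;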

  • SSIS - Connect to Oracle on a 64-bit machine (Updated for SSIS 2008 R2)

    We recently had a few customers where a connection to Oracle on a 64-bit machine was necessary. A quick search on the internet showed that this could be a big problem; I found all kinds of blog and forum posts from developers complaining about it. A lot of developers will recognize the following error message:

    Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation.
    Provider is unable to function until these components are installed.


    After a lot of searching, trying and debugging I think I found the right way to do it!

    Problems

    Because BIDS is a 32-bit application, on 32-bit as well as 64-bit machines, it cannot see the 64-bit driver for Oracle. Because of this, connecting to Oracle from BIDS on a 64-bit machine will never work when you only install the 64-bit Oracle client.

    Another problem is the "Microsoft Provider for Oracle"; this driver only exists in a 32-bit version, and Microsoft has no plans to create a 64-bit one in the near future.

    The last problem I know of is in the Oracle client itself: it seems that a connection will never work with the Instant Client, so always use the full client.
    There are also a lot of problems with the 10g client, one of them being the fact that this driver can't handle the "(x86)" in the path of SQL Server. So using the 10g client is not an option!

    Solution

    • Download the Oracle 11G full client.
    • Install the 32-bit AND the 64-bit version of the 11G full client (Installation Type: Administrator) and reboot the server afterwards. The 32-bit version is needed for development from BIDS, which is 32-bit; the 64-bit version is needed for production with the SQLAgent, which is 64-bit.
    • Configure the Oracle clients (both 32 and 64 bits) by editing  the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do this.
    • Use the "Oracle provider for OLE DB" from SSIS, don't use the "Microsoft Provider for Oracle" because a 64 bit version of it does not exist.
    • Schedule your packages with the SQLAgent.

    Background information

    • Visual Studio (BI Dev Studio) is a 32-bit application.
    • SQL Server Management Studio is a 32-bit application.
    • dtexecui.exe is a 32-bit application.
    • dtexec.exe has both 32-bit and 64-bit versions.
    • There are x64 and x86 versions of the Oracle provider available.
    • SQLAgent is a 64-bit process.

    My advice to BI consultants is to get an Oracle DBA or professional for the installation and configuration of the 2 full clients (32 and 64 bit). Tell the DBA to download the biggest client available, this way you are sure that they pick the right one ;-)

    Testing if the clients have been installed and configured in the right way can be done with Windows ODBC Data Source Administrator:
    Start...
    Programs...
    Administrative tools...
    Data Sources (ODBC)


    ADDITIONAL STEPS FOR SSIS 2008 R2

    It seems that, unfortunately, some additional steps are necessary for SQL Server 2008 R2 installations:

    1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 32-bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI
    Make sure the following values are entered:

    image

    2. Next, search for the following entry (for the 64-bit driver): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI
    Make sure the same values as above are entered.

    3. Reboot your server.

  • Replication Services as ETL extraction tool

    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
    “Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!”

    Well, I have tried it out and I must say it worked well. I was able to let Replication Services do the work in a fraction of the time it would have cost me to do the same in SSIS.
    What I did was the following:

    1. Configure snapshot replication for some AdventureWorks tables; this was quite simple and straightforward.
    2. Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
      This is something that you can’t do with out-of-the-box functionality. While configuring the snapshot replication, two SQL Agent jobs are created: one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous, which means that when you execute them they immediately report back whether the job started successfully or not; they do not wait for completion and report the result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

    Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

    Configure snapshot replication

    0 
    The first step is to create a publication on the database you want to replicate.
    Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication…

    1 
    The New Publication Wizard appears, click Next

    2
    Choose your “source” database and click Next

    3
    Choose Snapshot publication and click Next

    4 
    You can now select tables and other objects that you want to publish

    5
    Expand Tables and select the tables that are needed in your ETL process

    6
    In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

    7
    It’s possible to filter out rows and/or columns. In this example I did not apply any filters.

    8
    Schedule the Snapshot Agent to run at a desired time; doing this creates a SQL Agent job, which we need to execute from an SSIS package later on.

    9
    Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

    10 
    In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

    11 
    On the next screen choose to create the publication at the end of the wizard

    12
    Give the publication a name (SnapshotTest) and complete the wizard

    13 
    The publication is created and the articles (tables in this case) are added


    Now that the publication has been created successfully, it’s time to create a new subscription for it.

    14 
    Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions

    15 
    The New Subscription Wizard appears

    16 
    Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

    17 
    You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

    18
    Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database

    19 
    Give the database the desired name, set the desired options and click OK

    20
    You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

    21 
    You now need to set the security settings for the Distribution Agent. Click on the …. button

    22
    Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

    23 
    Click Next

    24 
    Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

    25
    Initialize the subscription at first synchronization

    26
    Select the first box to create the subscription when finishing this wizard

    27
    Complete the wizard by clicking Finish

    28
    The subscription will be created

    29
    In SSMS you see that a new database has been created: the subscriber. There are no tables or other objects available in the database yet, because the replication jobs have not run yet.

    Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

     30
    Rename this job to “CreateSnapshot”

    Now search for the job that distributes the snapshot:

     31
    Rename this job to “DistributeSnapshot”

    Create an SSIS package that executes the snapshot replication

    We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started, the package needs to wait until it is finished before the package execution completes.
    The Execute SQL Server Agent Job Task is designed to execute SQL Agent jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back whether the job started successfully or not; it does not report whether the job actually completed with success or failure. This is because these jobs are asynchronous.

    The SSIS package I’ve created does the following:

    1. It runs the CreateSnapshot job
    2. Using a For Loop, it checks every 5 seconds whether the job has completed
    3. When the CreateSnapshot job is completed, it starts the DistributeSnapshot job
    4. Again, it waits until the snapshot is delivered before the package finishes successfully (a T-SQL sketch of this start-and-wait logic follows below)
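    The T-SQL below is a hedged sketch of that start-and-wait logic, roughly what the package performs through its Execute SQL Server Agent Job Task and For Loop; it assumes the CreateSnapshot job name from this example and the standard msdb Agent objects (sp_start_job, sysjobactivity, sysjobs).

    DECLARE @job_name SYSNAME = N'CreateSnapshot';

    -- Start the (asynchronous) SQL Agent job
    EXEC msdb.dbo.sp_start_job @job_name = @job_name;

    -- Give the Agent a moment to register the activity, then poll every 5 seconds
    -- until the current run of the job has a stop date
    WAITFOR DELAY '00:00:05';
    WHILE EXISTS (
        SELECT 1
        FROM msdb.dbo.sysjobactivity AS ja
        INNER JOIN msdb.dbo.sysjobs AS j ON j.job_id = ja.job_id
        WHERE j.name = @job_name
          AND ja.start_execution_date IS NOT NULL
          AND ja.stop_execution_date IS NULL
    )
    BEGIN
        WAITFOR DELAY '00:00:05';
    END
    -- Repeat the same pattern for the DistributeSnapshot job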

    33

    Quite simple, and the package is ready to use as a standalone extract mechanism. After executing the package, the replicated tables are added to the subscriber database and are filled with data:

     32

    Download the SSIS package here (SSIS 2008)


    Conclusion

    In this example I only replicated 5 tables; I could create an SSIS package that does the same in approximately the same amount of time. But if I replicated all 70+ AdventureWorks tables, I would save a lot of time and boring work! With Replication Services you also benefit from the fact that schema changes are applied automatically, which means your extract phase won’t break. Because a snapshot is created using the bcp utility (bulk copy), it’s also quite fast, so the performance will be quite good.

    A disadvantage of using snapshot replication as an extraction tool is the limitation on source systems: you can only use SQL Server or Oracle databases as a publisher.

    So if you plan to build an extract phase for your ETL process that involves a lot of tables, think about Replication Services. It could save you a lot of time, and thanks to the Extract SSIS package I’ve created you can fit it perfectly into your usual SSIS ETL process.

  • Replication Services in a BI environment

    In this blog post I will explain the principles of SQL Server Replication Services without too much detail and I will take a look on the BI capabilities that Replication Services could offer in my opinion.

    SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformations, they do not offer capabilities to transform data or apply business rules, like ETL tools do.
    The only “transformations” Replication Services offers is to filter records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE statements like this:
    SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60


    There are three types of replication:

    Transactional Replication

    Transactional replication components and data flow

    This type replicates data on a transactional level. The Log Reader Agent reads directly from the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor); this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber.
    The Distribution Agent can either run at scheduled intervals or continuously, which offers near real-time replication of data!

    So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same.


    Snapshot Replication

    Snapshot replication components and data flow
    This type of replication makes an initial copy of the database objects that need to be replicated; this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication needs an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency.

    The Snapshot Agent copies the schemas of the tables that will be replicated to files that will be stored in the Snapshot Folder which is a normal folder on the file system. When all the schemas are ready, the data itself will be copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber, if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature, when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).


    Merge Replication
    Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.


    Replication Services in a BI environment
    Transactional Replication can be very useful in BI environments. In my opinion you never want to see users run custom (SSRS) reports or PowerPivot solutions directly on your production database; it can slow down the system and can cause deadlocks, which in turn cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.

    Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!

    Update:
    I got a question regarding the supported Replication Services features in the different editions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!

  • SSIS Denali as part of “Enterprise Information Management”

    When watching the SQL PASS session “What’s Coming Next in SSIS?” of Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

    Why is SSIS thought of as BI, when we use it so frequently for other sorts of data problems?

    The answer of Steve was that he breaks the world of data work into three parts:

    • Process of inputs

    • BI
       
    • Enterprise Information Management
      All the work you have to do when you have a lot of data to make it useful and clean and get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.
      image


    Next, Steve said that Microsoft is developing SSIS as part of a large push into all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.

    I’m interested in the different ways people use SSIS; I’ve basically used it for ETL, data migrations and processing inputs. In which ways do you use SSIS?

  • Analysis Services Roadmap for SQL Server “Denali” and Beyond

    Last week Microsoft announced the “BI Semantic Model” (BISM). I wrote a blog post about this and now the Analysis Services team wrote an article named: Analysis Services – Roadmap for SQL Server “Denali” and Beyond.

    BiSemanticModel

  • Will SSAS, Cubes and MDX be abandoned because of the BI Semantic Model?

    At the PASS Summit that is happening in Seattle at the moment Microsoft announced the “BI Semantic Model” (BISM).

    It looks like BISM is something like the UDM that we now know from SSAS. While the UDM was the bridge between relational data and multidimensional data, BISM is the bridge between relational data and the column-based VertiPaq engine. Some compare BISM to Business Objects universes.

    The next version of SSAS will be able to either run in the old “UDM” mode or in “BISM” mode, a combination is not possible. Of course this will have some radical consequences, because there are a few major differences between the two modes:

    • The switch from multidimensional cubes to the in-memory Vertipaq engine
    • The switch from MDX to DAX

    So will multidimensional cubes and MDX be deprecated? No, not really. SSAS as we know it now will still be a product in the future and will remain supported. But it looks like Microsoft will concentrate on BISM, mainly because multidimensional cubes and MDX are very difficult to learn. Microsoft wants to make BI more approachable and less difficult, just like with Self-Service BI.
    I would say that it’s really time to start learning PowerPivot and DAX right now, if you have not already started. If Microsoft focuses on the new BISM/VertiPaq technology, that will be the future if you ask me.

    Chris Webb wrote an interesting article about BISM and it looks like he is not very enthusiastic about the strategy Microsoft takes here because this could be the end of SSAS cubes within a few years: “while it’s not true to say that Analysis Services cubes as we know them today and MDX are dead, they have a terminal illness. I’d give them two, maybe three more releases before they’re properly dead, based on the roadmap that was announced yesterday.”

    What’s also very interesting is the comprehensive comment on this article from Amir Netz. He explains BISM and UDM will live together in Analysis Services in the future and MOLAP is here to stay: “Make no mistake about it – MOLAP is still the bread and butter basis of SSAS, now and for a very long time. MDX is mature, functional and will stay with us forever.”

    Read the article from Chris Webb here and make sure you don’t miss the comment from Amir!

  • SQL Server code-named 'Denali' - Community Technology Preview 1 (CTP1)

    SQL Server Denali (SQL Server 2011) CTP1 has been released!

    Download it here

    SQL Server 2011 is expected to be ready in the third quarter of 2011! I’ve already blogged about a few new SSIS features here

    I will keep you posted!

  • SQL Azure Reporting is announced!

    sql-azure-logo-lg

    With SQL Azure Reporting Services you can use SSRS as a service on the Azure platform, with all the benefits of Azure and most of the features and capabilities of the on-premise version. It’s also possible to embed your reports in your Windows or Azure applications.

    Benefits of the Azure platform for Azure Reporting Services are:

    • Highly available, the cloud services platform has built-in high availability and fault tolerance
    • Scalable, the cloud services platform automatically scales up and down
    • Secure, your reports and SQL Azure databases are stored in a safe place in the cloud
    • Cost effective, you don’t have to set up servers and you don’t have to invest in managing servers
    • Use the same tools you use today to develop your solutions. Just develop your reports in BIDS or Report Builder and deploy to Azure

    Disadvantages are:

    • SQL Azure databases are the only supported data sources in the first version, more data sources are expected to come
    • No developer extensibility in the first version, so no custom data sources, assemblies, report items or authentication
    • No subscriptions or scheduled delivery
    • No Windows Authentication, only SQL Azure username/password is supported in the first version, similar to SQL Azure database. When SQL Azure database gets Windows Authentication, Azure Reporting will follow

    Despite the disadvantages of the first version I think SQL Azure Reporting Services offers great capabilities and can be extremely useful for a lot of organizations.
    I’m really curious about the CTP, which will be available before the end of this year. You can sign up for the SQL Azure Reporting CTP here

    Read more about SQL Azure Reporting here

  • MCITP – I passed the 70-455 “Upgrade: Transition Your MCITP SQL Server 2005 BI Developer to MCITP SQL Server 2008 BI Developer” exam!

    Recently I passed the 70-455 exam. This exam upgrades your SQL 2005 MCTS and MCITP certifications to SQL 2008.

    image

    The exam contains 2 sections (basically separate exams), each with 25 questions:
    - A part which covers exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
    - A part which covers exam 70-452: PRO: Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008

    You need to pass both sections with a score of at least 700. If you fail one section, you fail the entire exam.

     

    How did I study

    I searched the internet, and the conclusion was that there is no preparation material available for the 70-452 exam. Fortunately there was a self-paced training kit for the 70-448 exam, which also covers this exam. So I bought the book, scanned it for subjects that needed attention, and fortunately that was enough for me to pass the exam.

    For the entire list of preparation materials for the 70-448 and 70-452 exams follow the links below:

    70-448 preparation materials

    70-452 preparation materials 

     

    My Current Transcript

    image

  • The next version of SSIS is coming!

    The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow and the cached lookup were the most thrilling new features. The release of SQL 2008 R2 only gave us the ability to use a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

    Fortunately Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

    - Undo/Redo support. Finally, this should have been added a long time ago ;-)

    - Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!

    - Data flow sequence container

    - New icons and rounded corners for tasks and transformations

    - Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow that it needs more data on the other input

    - The Toolbox window will automatically locate and show newly installed custom tasks

    I’m curious about the first CTP!

  • SSIS – Delete all files except for the most recent one

    Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip

    I currently work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or because somebody put a new zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

    Usage is quite simple, just copy/paste the code in your script task and create two SSIS variables:

    • SourceFolder (type String): The folder that contains the (zip) files
    • DateInFilename (type Boolean): A flag, set it to True if your filename ends with the date YYYYMMDD, set it to false if creation date of the files should be used

    Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task

    'Note: the code below uses DirectoryInfo, FileInfo and File, so the script needs "Imports System.IO" at the top of the ScriptMain class.
    Public Sub Main()
    
        'Use this piece of code to loop through a set of files in a directory
        'and delete all files except for the most recent one based on a date in the filename.
    
        'File name example:
        'DataExport_20100413.zip
    
        Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
        Dim mostRecentFile As String = ""
        Dim currentFileDate As Integer
        Dim mostRecentFileDate As Integer
        Dim currentFileCreationDate As Date
        Dim mostRecentFileCreationDate As Date
    
        Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not, set it to False.
    
    
        If dateInFilename Then
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)
    
                If currentFileDate > mostRecentFileDate Then
    
                    mostRecentFileDate = currentFileDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        Else 'Date is not in filename, use creation date
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileCreationDate = fi.CreationTime 'Get creation date of current file
    
                If currentFileCreationDate > mostRecentFileCreationDate Then
    
                    mostRecentFileCreationDate = currentFileCreationDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        End If
    
    
        'Delete all files except the most recent one
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
            If fi.Name <> mostRecentFile Then
    
                File.Delete(rootDirectory.ToString + "\" + fi.Name)
    
            End If
    
        Next
    
        Dts.TaskResult = ScriptResults.Success
    End Sub
  • SSIS - Package design pattern for loading a data warehouse - Part 2

    Since my last blog post about a SSIS package design pattern I’ve received quite some positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal which made it clear to me that there is quite some attention for this subject.

    The feedback I received was mainly about two things:
    1. Can you visualize the process, without the whole technical story, so it's easier to understand?
    2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?

    In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.


    SSIS package design pattern for loading a data warehouse

    Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite a few benefits and has been advised by Kimball since SSIS was released. I mentioned these benefits in my previous post and will not repeat them here.

    When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table over multiple packages. I have thought about a design pattern that keeps the benefits of a single modular package approach while still offering all the flexibility and debugging functionality developers need.

    If you have a little bit of programming knowledge you have probably heard about classes and functions. Now think of your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be nice functionality to have, but unfortunately it is not possible within SSIS by default.
    To realize this functionality in SSIS, I treat SSIS Sequence Containers as functions and SSIS packages as classes.
    I personally always use four Sequence Containers in my SSIS packages:
    - SEQ Extract (extract the necessary source tables to a staging database)
    - SEQ Transform (transform these source tables to a dimension or fact table)
    - SEQ Load (load this table into the data warehouse)
    - SEQ Process (process the data warehouse table to the cube)

    The technical trick that I performed - you can read about the inner workings in my previous post - makes it possible to execute only a single Sequence Container within a package, just like calling a function on a class in program code.
    The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

    1 - [Execute Package Task] DimCustomer.Extract
    2 - [Execute Package Task] DimCustomer.Transform
    3 - [Execute Package Task] DimCustomer.Load
    4 - [Execute Package Task] DimCustomer.Process

    The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.
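
    To make this "package as class, Sequence Container as function" idea concrete outside of BIDS, here is a minimal sketch that uses the SSIS object model (Microsoft.SqlServer.Dts.Runtime) to run only the Extract "function" of a child package by disabling the other Sequence Containers. This is not part of the actual solution (which uses Execute Package Tasks and parent package variable configurations); the file path and container names below are assumptions:

    'Minimal sketch; assumes a child package at D:\SSIS\DimCustomer.dtsx with Sequence Containers
    'named "SEQ Extract", "SEQ Transform", "SEQ Load" and "SEQ Process".
    'Requires a reference to the Microsoft.SqlServer.ManagedDTS assembly.
    Imports Microsoft.SqlServer.Dts.Runtime

    Module RunSingleFunction

        Sub Main()

            Dim app As New Application()
            Dim pkg As Package = app.LoadPackage("D:\SSIS\DimCustomer.dtsx", Nothing) 'Hypothetical path

            'Disable every phase except Extract; in the actual design pattern the parent
            'package variable configurations flip these Disable properties instead
            For Each containerName As String In New String() {"SEQ Transform", "SEQ Load", "SEQ Process"}
                CType(pkg.Executables(containerName), DtsContainer).Disable = True
            Next

            'Only the SEQ Extract container will run
            Dim result As DTSExecResult = pkg.Execute()
            Console.WriteLine("DimCustomer.Extract finished with result: " & result.ToString())

        End Sub

    End Module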

    If we look at this in a UML sequence diagram we see the following:
    ETL_Example1

    I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.


    How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

    One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.

    Coincidentally, I was reading the book “SQL Server 2008 Integration Services: Problem – Design - Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages has a very simple data flow from the source table to the destination table within the staging area; a sketch of the equivalent logic follows after this list.
    Of course this approach is more time consuming than using one big extract package with all table extracts in it, but fortunately it also gives you some benefits:
    - Debugging: sometimes a source changes, e.g. a column is renamed or deleted. The error that SSIS logs when this occurs points administrators straight to the right package and source table. Another benefit is that only one package fails and needs to be edited, while the other packages can still execute and remain unharmed.
    - Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).
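
    The sketch below shows, in plain VB.NET rather than an SSIS data flow, what such a single-table extract package boils down to: truncate the staging table and bulk copy the source table into it. The connection strings and the dbo.Customer table are assumptions used only for illustration:

    Imports System.Data.SqlClient

    Module ExtractCustomer

        Sub Main()

            'Hypothetical connection strings
            Dim sourceConnectionString As String = "Data Source=SRCSERVER;Initial Catalog=SourceDb;Integrated Security=SSPI"
            Dim stagingConnectionString As String = "Data Source=DWHSERVER;Initial Catalog=Staging;Integrated Security=SSPI"

            Using source As New SqlConnection(sourceConnectionString)
                Using staging As New SqlConnection(stagingConnectionString)

                    source.Open()
                    staging.Open()

                    'Empty the staging table first (full load, no incremental logic)
                    Using truncate As New SqlCommand("TRUNCATE TABLE dbo.Customer", staging)
                        truncate.ExecuteNonQuery()
                    End Using

                    'Straight copy from the source table to the staging table
                    Using cmd As New SqlCommand("SELECT * FROM dbo.Customer", source)
                        Using reader As SqlDataReader = cmd.ExecuteReader()
                            Using bulk As New SqlBulkCopy(staging)
                                bulk.DestinationTableName = "dbo.Customer"
                                bulk.WriteToServer(reader)
                            End Using
                        End Using
                    End Using

                End Using
            End Using

        End Sub

    End Module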

    I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
    - one with the dimension and fact table packages
    - one with only the extract packages
    I have used the following naming convention for the extract packages: Source_Table.dtsx, and deployed them to a separate SSIS folder. This way the packages won’t clutter the overview during development.
    A tip here is to use BIDS Helper; it has great functionality for deploying one or more packages from BIDS.

    Merging this approach in the design pattern will give the following result:
    - The extract Sequence Containers of the dimension and fact table packages will no longer contain Data Flow Tasks but Execute Package Tasks that point to the extract packages.
    - The Extract Sequence Container of the master package will execute all the necessary extract packages at once.

    This way a single source table is extracted only once when you execute your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
    Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
    ETL_Example2

    And like this with a run of a single Dimension package:
    ETL_Example2_Customer

    Overall, the design pattern will now always look like this when executed from a master package:
    ETL_Overview


    Conclusion
    I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

  • SSIS – Package design pattern for loading a data warehouse

    I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

    These developers even created multiple packages per single dimension/fact table:

    • One extract package where the extract(E) logic of all dim/fact tables is stored
    • One dim/fact package with the transform(T) logic of a single dim/fact table
    • One dim/fact package with the load(L) logic of a single dim/fact table

    I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
    I asked them why they chose this solution and there were multiple reasons:

    • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
    • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

    To me these are good reasons; running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
    Keeping the loading window on the source system as short as possible is something that’s critical in some projects.

    Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

    • All the logic is in one place
    • Increase understandability
    • Perform unit testing
    • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
    • You can see your packages as separate ETL “puzzle pieces” that are reusable
    • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
    • A better overview in BIDS; having an enormous number of packages does not make things clearer ;-)
    • Simplifies deployment after changes have been made
    • Changes are easier to track in source control systems
    • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

    So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.

    Solution:

    The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 
     ChildControlFlow 

    Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

    To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:

    clip_image003

    clip_image005

    Each of these configurations is set on the Disable property of one of the Sequence Containers:
    clip_image007

    Using this technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
    Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package without the need to change anything.

    Ok, so far, so good. But how do I execute only one phase of all the dimension and fact packages simultaneously? Well, it's quite simple:

    1. First add 4 Sequence Containers to the master package, one for each phase of the ETL, just like in the child packages.
    2. Add Execute Package Tasks for all your packages in every Sequence Container.

    clip_image008


    If you executed this master package now, every child package would run 4 times, because each of the 4 Sequence Containers contains an Execute Package Task that runs the same package.
    To get the required functionality I have created 4 variables inside each Sequence Container (scope). These are used as parent variables to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

    Variables for the EXTRACT Sequence Container (vDisableExtract False):
    clip_image009

    Variables for the TRANSFORM Sequence Container (vDisableTransform False):
    clip_image010

    The variables for the LOAD and PROCESS Sequence Containers are based on the same technique.
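
    To spell out what the screenshots show, the variable values per master Sequence Container presumably come down to the following (inferred from the pattern: only the phase that should run has its Disable variable set to False):

    Sequence Container   vDisableExtract   vDisableTransform   vDisableLoad   vDisableProcess
    EXTRACT              False             True                True           True
    TRANSFORM            True              False               True           True
    LOAD                 True              True                False          True
    PROCESS              True              True                True           False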

    Results:

    Run all phases of a standalone package: Just execute the package:
    clip_image011

    Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired Sequence Container in the master package:

    RunAllTransforms 

    Run a single phase of a single package from the master package:
    RunSinglePhaseOfOnePackage


    Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

    RunMultiplePhasesAndExcludeOthers

    Run all the child packages in the right order from the master package:
    When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
    RunCompleteMasterPackageBreakPoint


    When pressing Continue the package completes: 
    RunCompleteMasterPackageBreakPointCompleted


    Conclusion:

    This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

    With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

    Of course having a single package per dimension or fact table will not be the right choice in all cases, but I think it is a good standard approach.
    The same applies to the ETL phases (Sequence Containers): I use E/T/L/P, but if you have different phases, that's fine; you can still use the same technique.

    Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!


    Download for SSIS 2008

    Download for SSIS 2005


    If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!


    ATTENTION: See Part 2 on this subject for more background information!


    Backgrounds:

    How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx
