THE SQL Server Blog Spot on the Web


Rob Farley

- Owner/Principal with LobsterPot Solutions (an MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and runs training courses around the world in SQL Server and BI topics.

  • Heroes of SQL

    Every story has heroes. Some heroes distinguish themselves by their superpowers; others by extraordinary bravery or compassion; some are simply heroes because of what they do in their jobs.

    We picture the men and women who work in the emergency departments of hospitals, soldiers who go back into the line of fire to rescue their colleagues, and of course, those who have been bitten by radioactive spiders.

    We don’t tend to picture people who work with databases.

    But let me explain something – at the PASS Summit next month, you will come across a large number of heroes. The people who are presenting show extraordinary bravery to stand up in front of a room full of people who want to learn and who will write some of the nastiest things about them in evaluation forms. The members of the SQL Server Product Group (who you can see at the SQL Clinic) from Microsoft have incredible information about how SQL Server works on the inside. And then you have people like Paul White, Jon Kehayias and Ted Krueger, who have obviously spent too much time around arachnids with short half-lives.

    The amazing thing about the SQL Server community is their willingness to be heroes – not only by stepping up at conferences, but in helping people with their everyday problems. It’s one thing to be a hero to help those in your workplace, by making sure that backups are performed, and that your databases are checked for corruption regularly, but people in the SQL Server community help people they don’t know on forums, they write blog posts, and they attend (and organise) SQL Saturdays and other events so that they can sit and talk to strangers.

    The PASS Summit is the biggest gathering of SQL professionals in the world each year. So come along and see why people in the SQL community are different.

    They’re heroes.

    @rob_farley 

    PS: Thanks to another SQL Hero, Tracy McKibben (@realsqlguy), for his effort in hosting this month’s T-SQL Tuesday.

  • Less than a month away...

    The PASS Summit for 2014 is nearly upon us, and the MVP Summit is immediately prior, in the same week and the same city. This is my first MVP Summit since early 2008. I’ve been invited every year, but I simply haven’t prioritised it. I’ve been awarded MVP status every year since 2006 (just received my ninth award), but in 2009 and 2010 I attended SQLBits in the UK, and have been to every PASS Summit since then. This year, it’s great that I get to do both Summits in the same trip, but if I get to choose just one, then it’s an easy decision.

    So let me tell you why the PASS Summit is the bigger priority for me.

    Number of people

    Actually, the PASS Summit isn’t that much larger than the MVP Summit, but the MVP Summit has thousands of non-SQL MVPs, and only a few hundred in the SQL space. Because of this, the ‘average conversation with a stranger’ is very different. While it can be fascinating to meet someone who is an MVP for File System Storage, the PASS Summit has me surrounded by people who do what I do, and it makes for better conversations as I learn about who people are and what they do.

    Access to Microsoft

    The NDA content that MVPs learn at the MVP Summit is good, but the PASS Summit will have content about every-SQL-thing you could ever want. The same Microsoft people who present at the MVP Summit are also at the PASS Summit, and dedicate time to the SQL Clinic, which means that you can spend even more time working through ideas and problems with them. You don’t get this at the MVP Summit.

    Non-exclusivity

    Obviously not everyone can go to the MVP Summit, as it’s a privilege that comes as part of the MVP award each year (although it’s hardly ‘free’ when you have to fly there from Australia). While it may seem like an exclusive event is going to be, well, exclusive, most MVPs are all about the wider community, and thrive on being around non-MVPs. There are fewer than 400 SQL MVPs around the world, and ten times that number of SQL experts at the Summit. While some of the top experts might be MVPs, a lot of them are not, and the PASS Summit is a chance to meet those people each year.

    Content from the best

    The MVP Summit has presentations from people who work on the product. At my first MVP Summit, this was a huge deal. And it’s still good to hear what these guys are thinking, under NDA, when they can actually go into detail that they know won’t leave the room. But you don’t get to hear from Paul White at the MVP Summit, or Erin Stellato, or Julie Koesmarno, or any of the other non-Microsoft presenters. The PASS Summit gives the best of both worlds.

    I’m really looking forward to the MVP Summit. I’ve missed the last six, and it’s been too long. MVP Summits were when I met some of my oldest SQL friends, such as Kalen Delaney, Adam Machanic, Simon Sabin, Paul & Kimberly, and Jamie Thomson. The opportunities are excellent. But the PASS Summit is what the community is about.

    MVPs are MVPs because of the community – and that’s what the PASS Summit is about. That’s the one I’m looking forward to the most.

    @rob_farley

  • Passwords

    Another month, and another T-SQL Tuesday. I have some blog posts I’ve been meaning to write, but the scheduling of T-SQL Tuesday and my determination to keep my record of never having missed one keeps me going. This month is hosted by Sebastian Meine (@sqlity), and is on the topic of Passwords.


    Passwords are so often in the news. We read about how passwords are stolen through security breaches on a regular basis, and have plenty of suggestions on how using complex passwords can help (although the fact that tools such as 1Password put passwords on the clipboard must be an issue…), or that we should use passwords that are complex through length but simple in form such as a sentence – and we naturally see xkcd.com jump in on things with poignant commentary on life in a tech world.

    This post is actually not to tell you all to avoid using passwords more than once, or to use passwords sufficiently complex that you don’t put them onto your clipboard, or anything like that.

    Instead, I want you to think about what a password means.

    A password means that you have secret information that only you have. It’s what ‘secret’ means. As soon as you tell that secret information to multiple places, it’s not secret any more. Anyone who has seen my passport knows where I was born, and there are plenty of ways to work out my mother’s maiden name, yet these are considered ‘secret’ information that can be used to check that I’m me.

    These days, I carry multiple RSA tokens around with me, so that I can log into client sites, or connect to my bank’s internet banking. The codes on these devices are considered secret, but actually, they contain a secret piece of information that can be used to identify me, through the codes they generate. Combining a password and these codes is considered enough to identify me, but not in a way that can let someone else in a few seconds later when the numbers change.

    When I develop SSIS packages for clients, or just about anything that needs to connect to sensitive data, I don’t try to figure out what passwords need to be included. Where possible (frustratingly it’s not always), I don’t include passwords in database connections at all – it’s secret information that I shouldn’t have to know. Instead, I let the package run with credentials that are stored within the SQL instance. When the package is deployed, it can run with the appropriate permissions, according to the rights given to the user identified in the credential. The trust that is established by the credential is enough to let it do what it needs to, and all I need to tell the package is “Assume you have sufficient rights for this.” I don’t need to store the password anywhere in the package that way, and I’m separated from production data, as every developer should be.
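    As an illustration of that separation, a server-side credential plus a SQL Agent proxy is one way a package can run with the appropriate rights while the developer never sees the password. This is only a sketch – the account and object names here are hypothetical, and the secret would be typed in once by an administrator, never stored in the package:

    ```sql
    -- Sketch with hypothetical names: the secret lives inside the SQL instance,
    -- entered once by an administrator, and never appears in the SSIS package.
    CREATE CREDENTIAL ETL_ServiceAccount
        WITH IDENTITY = N'DOMAIN\etl_service',   -- hypothetical Windows account
             SECRET   = N'<password entered by an administrator>';

    -- A SQL Agent proxy lets SSIS job steps run under that credential
    EXEC msdb.dbo.sp_add_proxy
        @proxy_name = N'ETL_Proxy',
        @credential_name = N'ETL_ServiceAccount';

    EXEC msdb.dbo.sp_grant_proxy_to_subsystem
        @proxy_name = N'ETL_Proxy',
        @subsystem_name = N'SSIS';
    ```

    The package’s job step then runs as the proxy, and the developer only ever needs to know that sufficient rights exist – not what the password is.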

    I studied cryptography at university, although that was nearly twenty years ago and I hope things have moved on since then. I know various algorithms have been ‘cracked’, but the principles of providing secret information for identification carry on. I believe public/private key pairs are still excellent methods of proving that someone is who they say they are, so that I can generate something that you know comes from me, and you can generate something that only I can decrypt (and by using both my key pair and yours will allow us to have a secure conversation – until one of our private keys is compromised).

    Today we need to be able to identify ourselves through multiple devices and our ‘secret’ information is stored on servers, protected by passwords. Our passwords are secret, and anyone who knows any password we have used before could try to see if this is our secret information for other servers.

    I don’t know what the answer is, but I’m careful with my information. That said, I was the victim of credit-card skimming just recently; the bank detected it and cancelled my cards.

    Just be careful with your passwords. They are secret, and you should treat them that way. If you can make use of RSA tokens, or multi-factor authentication, or some other method that can trust you, then do so. Hopefully those places that you entrust your secret information will do the right thing by you…

    Be safe out there!

    @rob_farley

  • SQL Spatial: Getting “nearest” calculations working properly

    If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.

    You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!

    You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.

    CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation);

    And my actual query:

    WITH MyLocations AS
    (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                           ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
                   ) t (Name, Geo))
    SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
    FROM MyLocations AS l
    CROSS APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a
    JOIN Person.StateProvince AS s
        ON s.StateProvinceID = a.StateProvinceID
    JOIN Person.CountryRegion AS c
        ON c.CountryRegionCode = s.CountryRegionCode
    ;

    [image: query results]

    Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.

    But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.

    [image: execution plan]

    This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.

    The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.

    Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.

    [image: execution plan]

    I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.

    But I can persuade it with hints!

    If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:

    [image: execution plan fragment]

    It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.

    The way it works is to hook into the GeodeticTessellation function, which essentially finds where the point is, and works out which spatial index cells surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. It’s one of those times when we have a much more complex-looking plan, but simply because of the good that’s going on.

    This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.

    When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:

    Msg 8622, Level 16, State 1, Line 1
    Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

    And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.

    WITH MyLocations AS
    (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                           ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
                   ) t (Name, Geo))
    SELECT
        l.Name,
        COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1),
        COALESCE(a1.City,a2.City,a3.City),
        s.Name AS [State],
        c.Name AS Country
    FROM MyLocations AS l
    OUTER APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a1
    OUTER APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000
        AND a1.AddressID IS NULL
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a2
    OUTER APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000
        AND a2.AddressID IS NULL
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a3
    JOIN Person.StateProvince AS s
        ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID)
    JOIN Person.CountryRegion AS c
        ON c.CountryRegionCode = s.CountryRegionCode
    ;

    But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.

    It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query...

    WITH MyLocations AS
    (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                           ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
                   ) t (Name, Geo))
    SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
    FROM MyLocations AS l
    CROSS APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a
    JOIN Person.StateProvince AS s
        ON s.StateProvinceID = a.StateProvinceID
    JOIN Person.CountryRegion AS c
        ON c.CountryRegionCode = s.CountryRegionCode
    ;

    Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.

    The following requirements must be met for a Nearest Neighbor query to use a spatial index:

    1. A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.

    2. The TOP clause cannot contain a PERCENT statement.

    3. The WHERE clause must contain a STDistance() method.

    4. If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.

    5. The first expression in the ORDER BY clause must use the STDistance() method.

    6. Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.

    7. All the rows for which STDistance returns NULL must be filtered out.

    Let’s start from the top.

    1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.

    2. No ‘PERCENT’. Yeah, I don’t have that.

    3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.

    4. Yeah, I don’t have multiple predicates.

    5. The first expression in the ORDER BY is my distance, that’s fine.

    6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.

    7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.

    ...but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…
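    Putting that together, the only change needed to the original query is an STDistance-based filter inside the APPLY – here as a sketch against the same AdventureWorks2012 tables, with a predicate that satisfies #3 and also removes any NULL distances for #7:

    ```sql
    WITH MyLocations AS
    (SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                           ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
                   ) t (Name, Geo))
    SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
    FROM MyLocations AS l
    CROSS APPLY (
        SELECT TOP (1) *
        FROM Person.Address AS ad
        WHERE l.Geo.STDistance(ad.SpatialLocation) IS NOT NULL  -- #3 and #7
        ORDER BY l.Geo.STDistance(ad.SpatialLocation)
        ) AS a
    JOIN Person.StateProvince AS s
        ON s.StateProvinceID = a.StateProvinceID
    JOIN Person.CountryRegion AS c
        ON c.CountryRegionCode = s.CountryRegionCode
    ;
    ```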

    …then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.

    [image: execution plan]

    It just wasn’t overly intuitive, despite being documented.

    @rob_farley

  • Nepotism In The SQL Family

    There’s a bunch of sayings about nepotism. It’s unpopular, unless you’re the family member who is getting the opportunity.

    But of course, so much in life (and career) is about who you know.

    From the perspective of the person who doesn’t get promoted (when the family member is), nepotism is simply unfair; even more so when the promoted one seems less than qualified, or incompetent in some way. We definitely get a bit miffed about that.

    But let’s also look at it from the other side of the fence – the person who did the promoting. To them, their son/daughter/nephew/whoever is just another candidate, but one in whom they have more faith. They’ve spent longer getting to know that person. They know their weaknesses and their strengths, and have seen them in all kinds of situations. They expect them to stay around in the company longer. And yes, they may have plans for that person to inherit one day. Sure, they have a vested interest, because they’d like their family members to have strong careers, but it’s not just about that – it’s often best for the company as well.

    I’m not announcing that the next LobsterPot employee is one of my sons (although I wouldn’t be opposed to the idea of getting them involved), but actually, admitting that almost all the LobsterPot employees are SQLFamily members… …which makes this post good for T-SQL Tuesday, this month hosted by Jeffrey Verheul (@DevJef).

    You see, SQLFamily is the concept that the people in the SQL Server community are close. We have something in common that goes beyond ordinary friendship. We might only see each other a few times a year, at events like the PASS Summit and SQLSaturdays, but the bonds that are formed are strong, going far beyond typical professional relationships.

    And these are the people that I am prepared to hire. People that I have got to know. I get to know their skill level, how well they explain things, how confident people are in their expertise, and what their values are. Of course there are people that I wouldn’t hire, but I’m a lot more comfortable hiring someone that I’ve already developed a feel for. I need to trust the LobsterPot brand to people, and that means they need to have a similar value system to me. They need to have a passion for helping people and doing what they can to make a difference. Above all, they need to have integrity.

    Therefore, I believe in nepotism. All the people I’ve hired so far are people from the SQL community. I don’t know whether I’ll always be able to hire that way, but I have no qualms admitting that the things I look for in an employee are things that I can recognise best in those that are referred to as SQLFamily.

    …like Ted Krueger (@onpnt), LobsterPot’s newest employee and the guy who is representing our brand in America. I’m completely proud of this guy. He’s everything I want in an employee. He’s an experienced consultant (even wrote a book on it!), loving husband and father, genuine expert, and incredibly respected by his peers.

    It’s not favouritism, it’s just choosing someone I’ve been interviewing for years.

    @rob_farley

  • LobsterPot Solutions in the USA

    We’re expanding!

    I’m thrilled to announce that Microsoft Gold Partner LobsterPot Solutions has started another branch, appointing the amazing Ted Krueger (5-time SQL MVP awardee) as the US lead. Ted is well-known in the SQL Server world, having written books on indexing, consulting and on being a DBA (not to mention contributing chapters to both MVP Deep Dives books). He is an expert on replication and high availability, and strong in the Business Intelligence space – vast experience which is both broad and deep.

    Ted is based in the south east corner of Wisconsin, just north of Chicago. He has been a consultant for eons and has helped many clients with their projects and problems, taking the role as both technical lead and consulting lead. He is also tireless in supporting and developing the SQL Server community, presenting at conferences across America, and helping people through his blog, Twitter and more.

    Despite all this – it’s neither his technical excellence with SQL Server nor his consulting skill that made me want him to lead LobsterPot’s US venture. I wanted Ted because of his values. In the time I’ve known Ted, I’ve found his integrity to be excellent, and found him to be morally beyond reproach. This is the biggest priority I have when finding people to represent the LobsterPot brand. I have no qualms in recommending Ted’s character or work ethic. It’s not just my thoughts on him – all my trusted friends that know Ted agree about this.

    So last week, LobsterPot Solutions LLC was formed in the United States, and in a couple of weeks, we will be open for business!

    LobsterPot Solutions can be contacted via email at contact@lobsterpotsolutions.com, on the web at either www.lobsterpot.com.au or www.lobsterpotsolutions.com, and on Twitter as @lobsterpot_au and @lobsterpot_us.

    Ted Krueger blogs at LessThanDot, and can also be found on Twitter and LinkedIn.

    This post is cross-posted from http://lobsterpotsolutions.com/lobsterpot-solutions-in-the-usa

  • SSIS Lookup transformation in T-SQL

    There is no equivalent to the SSIS Lookup transformation in T-SQL – but there is a workaround if you’re careful.

    The big issue that you face is about the number of rows that you connect to in the Lookup. SQL Books Online (BOL) says:

    • If there is no matching entry in the reference dataset, no join occurs. By default, the Lookup transformation treats rows without matching entries as errors. However, you can configure the Lookup transformation to redirect such rows to a no match output. For more information, see Lookup Transformation Editor (General Page) and Lookup Transformation Editor (Error Output Page).
    • If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache.

    This is very important. It means that every row that enters the Lookup transformation comes out. This could be coming out of the transformation as an error, or through a ‘No Match’ output, with an ignored failure, or having found a row. But it will never return multiple copies of the row, even if it has matched two rows. This last point is inherently different to what happens in T-SQL. In T-SQL, any time you do a join, whether an INNER JOIN or an OUTER JOIN, if you match multiple rows on the right-hand side, you get a copy of the left-hand row for each match. When doing Lookups in the world of ETL (as you would with SSIS), this is a VeryBadThing.
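    To see the difference concretely, here’s a tiny self-contained illustration (using hypothetical table variables) of a T-SQL join multiplying rows where the SSIS Lookup would have returned just one:

    ```sql
    -- One fact row, two matching dimension rows
    DECLARE @fact TABLE (FactID int, DimBK char(3), Amount money);
    DECLARE @dim  TABLE (DimKey int, BusinessKey char(3));
    INSERT @fact VALUES (1, 'ADL', 100);
    INSERT @dim  VALUES (10, 'ADL'), (11, 'ADL');

    SELECT f.FactID, f.Amount, d.DimKey
    FROM @fact AS f
    JOIN @dim AS d ON d.BusinessKey = f.DimBK;
    -- Returns two rows for FactID 1; the SSIS Lookup would return only one,
    -- matching the first row found by the lookup query.
    ```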

    You see, there’s an assumption with ETL systems that things are under control in your data warehouse. It’s this assumption that I want to look at in this post. I do actually think it’s quite a reasonable one, but I also recognise that a lot of people don’t feel that it’s something they can rely on. Either way, I’ll show you a couple of ways that you can implement some workarounds, and it also qualifies this post for this month’s T-SQL Tuesday, hosted by Dev Nambi.

    Consider that you have a fact row, and you need to do a lookup into a dimension table to find the appropriate key value (I might know that the fact row corresponds to the Adelaide office, but having moved recently, I would want to know whether it’s the new version of the office or the old one). I know that ‘ADL’ is unique in my source system – quite probably because of a unique constraint in my OLTP environment – but I don’t have that guarantee in my warehouse. Actually, I know that I will have multiple rows for ADL. Only one is current at any point in time, but can I be sure that if I try to find the ADL record for a particular point in time, I will only find one row?

    A typical method for versioning dimension records (a Type 2 scenario) is to have a StartDate and EndDate for each version. But implementing logic to make sure there can never be an overlap is tricky. It’s easy enough to test, particularly since LAG/LEAD functions became available, but putting an actual constraint in there is harder – even more so if you’re dealing with something like Microsoft’s Parallel Data Warehouse, which doesn’t support unique constraints (this is totally fair enough, when you consider that the rows for a single table can be spread across hundreds of sub-tables).
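    As an aside, a LAG-based overlap test over such a dimension table might look like this – a sketch, assuming the BusinessKey/StartDate/EndDate columns just described, with < or <= depending on which EndDate convention is in play:

    ```sql
    WITH v AS (
        SELECT BusinessKey, StartDate, EndDate,
               LAG(EndDate) OVER (PARTITION BY BusinessKey
                                  ORDER BY StartDate) AS PrevEndDate
        FROM dimtable
    )
    SELECT BusinessKey, StartDate, EndDate, PrevEndDate
    FROM v
    WHERE StartDate < PrevEndDate;  -- any rows returned are overlapping versions
    ```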

    If we know that we have contiguous StartDate/EndDate ranges, with no gaps and no overlaps, then we can confidently write a query like:

    FROM facttable f
    LEFT JOIN dimtable d
    ON d.BusinessKey = f.DimBK
    AND d.StartDate <= f.FactDate
    AND f.FactDate < d.EndDate

    By doing a LEFT JOIN, we know that we’re never going to eliminate a fact by failing to match it (and can introduce an inferred dimension member), but if we have somehow managed to have overlapping records, then we could inadvertently get a second copy of our fact row. That’s going to wreck our aggregates, and the business will lose faith in the environment that has been put in.

    Of course, your dimension management is sound. You will never have this problem. Really. But what happens if someone has broken the rules and manually tweaked something? What if there is disagreement amongst BI developers about the logic that should be used for EndDate values (some prefer to have a gap of a day, as in “Jan 1 to Jan 31, Feb 1 to Feb 28”, whereas others prefer to have the EndDate value the same as the next StartDate)? There’s definitely potential for inconsistency between developers.

    Whatever the reason, if you suddenly find yourself with the potential for two rows to be returned by a ‘lookup join’ like this, you have a problem. Clearly the SSIS Lookup transform ensures that there is never a second row considered to match, but T-SQL doesn’t offer a join like that.

    But it does give us APPLY.

    We can use APPLY to reproduce the same functionality as a join, by using code such as:

    FROM facttable f
    OUTER APPLY (SELECT * FROM dimtable d
                 WHERE d.BusinessKey = f.DimBK
                 AND d.StartDate <= f.FactDate
                 AND f.FactDate < d.EndDate) d1

    But because we now have a fully-fledged correlated table expression, we can be a little more tricky, and tweak it with TOP,

    FROM facttable f
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d.BusinessKey = f.DimBK
                 AND d.StartDate <= f.FactDate
                 AND f.FactDate < d.EndDate) d1

    , which leaves us being confident that the number of rows in the set produced by our FROM clause is exactly the same number as we have in our fact table. The OUTER APPLY (rather than CROSS APPLY) makes sure we don’t lose rows, and the TOP (1) ensures that we never match more than one.

    But still I feel like we have a better option than having to consider which method of StartDate/EndDate logic is used.

    What we want is the most recent version of the dimension member at the time of the fact record. To me, this sounds like a TOP query with an ORDER BY and a filter,

    FROM facttable f
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d.BusinessKey = f.DimBK
                 AND d.StartDate <= f.FactDate
                 ORDER BY d.StartDate DESC) d1

    , and you will notice that I’m no longer using the EndDate at all. In fact, I don’t need to bother having it in the table at all.

    Now, the worst scenario that I can imagine is that I have a fact record that has been backdated to before the dimension member appeared in the system. I’m sure you can imagine it, such as when someone books vacation time before they’ve actually started with a company. The dimension member StartDate might be populated with when they actually start with the company, but they have activity before their record becomes ‘current’.

    Well, I solve that with a second APPLY.

    FROM facttable f
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d.BusinessKey = f.DimBK
                 AND d.StartDate <= f.FactDate
                 ORDER BY d.StartDate DESC) d1
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d1.BusinessKey IS NULL
                 AND d.BusinessKey = f.DimBK
                 AND d.StartDate > f.FactDate
                 ORDER BY d.StartDate ASC) d1a

    Notice that I correlate the second APPLY to the first one, with the predicate “d1.BusinessKey IS NULL”. This is very important, and addresses a common misconception, as many people will look at this query and assume that the second APPLY will be executed for every row. Let’s look at the plan that would come about here.


    [execution plan screenshot]

    I don’t have any indexes on facttable – I’m happy enough to scan the whole table, but I want you to notice the two Nested Loop operators and the lower branches for them. A Nested Loop operator pulls data from its top branch, and for every row that comes in, requests any matching rows from the lower one.

    We already established that the APPLY with TOP is not going to change the number of rows, so the number of rows that the left-most Nested Loop is pulling from its top branch is the same as the one on its right, which also matches the rows from the Table Scan. And we know that we do want to check dimtable for every row that’s coming from facttable.

    But we don’t want to be doing a Seek in dimtable a second time for every row that the Nested Loop pulls from facttable.

    Luckily, that’s another poor assumption. People misread execution plans this way all the time.

    When taught how to read an execution plan, many will head straight to the top-right, and whenever they hit a join operator, head to the right of that branch. And it’s true that the data streams do start there. It’s not the full story though, and it’s shown quite clearly here, through that Filter operator.

    That Filter operator is no ordinary one, but has a Startup Expression Predicate property.

    [Filter operator properties, showing the Startup Expression Predicate]

    This means that the operator only requests rows from its right if that predicate is satisfied. In this case, it means only if the first lookup into dimtable didn’t find a matching row. Therefore, the second Index Seek won’t get executed except in very rare situations. And we know (though the Query Optimizer doesn’t) that typically it won’t run at all, and that the actual cost is not going to be 33%, but much closer to 0%.

    So now you have a way of being able to do lookups that will not only guarantee that one row (at most) will be picked up, but you also have a pattern that will let you do a second lookup for those times when you don’t have the first.
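    Putting the pattern together, the final query can take the row from whichever APPLY found one. (The DimKey column here is a hypothetical surrogate key; swap in whatever columns your dimension actually carries.)

    ```sql
    SELECT f.*,
           -- take the key from the first lookup if it matched, otherwise the fallback
           COALESCE(d1.DimKey, d1a.DimKey) AS DimKey
    FROM facttable f
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d.BusinessKey = f.DimBK
                 AND d.StartDate <= f.FactDate
                 ORDER BY d.StartDate DESC) d1
    OUTER APPLY (SELECT TOP (1) * FROM dimtable d
                 WHERE d1.BusinessKey IS NULL
                 AND d.BusinessKey = f.DimBK
                 AND d.StartDate > f.FactDate
                 ORDER BY d.StartDate ASC) d1a;
    ```

    Because the second APPLY is correlated on d1.BusinessKey IS NULL, the COALESCE almost always resolves from d1, and d1a only does work for those rare backdated facts.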

    And keep your eye out for Startup Expression Predicates – they can be very useful for knowing which parts of your execution plan don’t need to get executed...

    @rob_farley

  • SQL 2014 does data the way developers want

    A post I’ve been meaning to write for a while; good that it fits with this month’s T-SQL Tuesday, hosted by Joey D’Antoni (@jdanton).

    Ever since I got into databases, I’ve been a fan. I studied Pure Maths at university (as well as Computer Science), and am very comfortable with Set Theory, which undergirds relational database concepts. But I’ve also spent a long time as a developer, and appreciate that databases don’t exactly fit within the stuff I learned in my first year of uni, particularly the “Algorithms and Data Structures” subject, in which we studied concepts like linked lists. Writing in languages like C, we used pointers to quickly move around data, without a database in sight. Of course, if we had a power failure all this data was lost, as it was only persisted in RAM. Perhaps it’s why I’m a fan of database internals, of indexes, latches, execution plans, and so on – the developer in me wants to be reassured that we’re getting to the data as efficiently as possible.

    Back when SQL Server 2005 was approaching, one of the big stories was around CLR. Many were saying that T-SQL stored procedures would be a thing of the past because we now had CLR, and that it was obviously going to be much faster than using the abstracted T-SQL. Around the same time, we were seeing technologies like Linq-to-SQL produce poor T-SQL equivalents, and developers had had a gutful. They wanted to move away from T-SQL, having lost trust in it. I was never one of those developers, because I’d looked under the covers and knew that despite being abstracted, T-SQL was still a good way of getting to data. It worked for me, appealing to both my Set Theory side and my Developer side.

    CLR hasn’t exactly become the default option for stored procedures, although there are plenty of situations where it can be useful for getting faster performance.

    SQL Server 2014 is different though, through Hekaton – its In-Memory OLTP environment.

    When you create a table using Hekaton (that is, a memory-optimized one), the table you create is the kind of thing you’d’ve made as a developer. It creates code in C leveraging structs and pointers and arrays, which it compiles into fast code. When you insert data into it, it creates a new instance of a struct in memory, and adds it to an array. When the insert is committed, a small write is made to the transaction log to make sure it’s durable, but none of the locking and latching behaviour that typifies transactional systems is needed. Indexes are done using hashes and Bw-trees (which avoid locking through the use of pointers), and each update is handled as a delete-and-insert.
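    As a minimal sketch of what this looks like in SQL Server 2014 (assuming the database already has a MEMORY_OPTIMIZED_DATA filegroup; the table and column names here are made up):

    ```sql
    CREATE TABLE dbo.SessionState
    (
        -- a hash index suits point lookups on the key
        SessionId INT NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        UserName NVARCHAR(100) NOT NULL,
        LastTouched DATETIME2 NOT NULL,
        -- a Bw-tree ("nonclustered") index suits range queries
        INDEX ixLastTouched NONCLUSTERED (LastTouched)
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
    ```

    Note that in SQL 2014 all the indexes must be declared inline at create time, which fits the compiled-structs model described above.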

    This is data the way that developers do it when they’re coding for performance – the way I was taught at university before I learned about databases. Being done in C, it compiles to very quick code, and although these tables don’t support every feature that regular SQL tables do, this is still an excellent direction that has been taken.

    @rob_farley

  • When is your interview?

    Sometimes it’s tough to evaluate someone – to figure out if you think they’d be worth hiring. These days, since starting LobsterPot Solutions, I have my share of interviews, on both sides of the desk. Sometimes I’m checking out potential staff members; sometimes I’m persuading someone else to get us on board for a project. Regardless of who is on which side of the desk, we’re both checking each other out.

    The world is not how it was some years ago. I’m pretty sure that every time I walk into a room for an interview, I’ve searched for them online, and they’ve searched for me. I suspect they usually have the easier time finding me, although there are obviously other Rob Farleys in the world. They may have even checked out some of my presentations from conferences, read my blog posts, maybe even heard me tell jokes or sing. I know some people need me to explain who I am, but for the most part, I think they’ve done plenty of research long before I’ve walked in the room.

    I remember when this was different (as it could be for you still). I remember a time when I dealt with recruitment agents, looking for work. I remember sitting in rooms having been given a test designed to find out if I knew my stuff or not, and then being pulled into interviews with managers who had to find out if I could communicate effectively. I’d need to explain who I was, what kind of person I was, what my value-system involved, and so on.

    I’m sure you understand what I’m getting at. (Oh, and in case you hadn’t realised, it’s a T-SQL Tuesday post, this month about interviews.)

    At TechEd Australia some years ago (either 2009 or 2010 – I forget which), I remember hearing a comment made during the ‘locknote’, the closing session. The presenter described a conversation he’d heard between two girls, discussing a guy that one of them had just started dating. The other girl expressed horror at the fact that her friend had met this guy in person, rather than through an online dating agency. The presenter pointed out that people realise that there’s a certain level of safety provided through the checks that those sites do. I’m not sure I completely trust this, but I’m sure it’s true for people’s technical profiles.

    If I interview someone, I hope they have a profile. I hope I can look at what they already know. I hope I can get samples of their work, and see how they communicate. I hope I can get a feel for their sense of humour. I hope I already know exactly what kind of person they are – their value system, their beliefs, their passions. Even their grammar. I can work out if the person is a good risk or not from who they are online. If they don’t have an online presence, then I don’t have this information, and the risk is higher.

    So if you’re interviewing with me, your interview started long before the conversation. I hope it started before I’d ever heard of you. I know the interview in which I’m being assessed started before I even knew there was a product called SQL Server. It’s reflected in what I write. It’s in the way I present. I have spent my life becoming me – so let’s talk!

    @rob_farley

  • Tricks in T-SQL and SSAS

    This past weekend saw the first SQL Saturday in Melbourne. Numbers were good – there were about 300 people registered, and the attendance rate seemed high (though I didn’t find out the actual numbers). Looking around during the keynote, I didn’t see many empty seats in the room, and I knew there were 300 seats, plus people continued to arrive as the day went on.

    My own session was fun. I’d been remarkably nervous (as I often am) beforehand, particularly as this was a talk I hadn’t given in about 3.5 years. There were elements of it that I teach often enough, but it was more about the structure of the talk, which ends up being so critical to how the session works. I may give the impression of talking completely off-the-cuff, but I do have most of it thoroughly planned – the lack of slides and firm agenda is primarily there to allow me the flexibility to match the audience.

    Anyway, as my demos were coming together, I found myself putting ‘GO’ between various lines, so that my CREATE statements didn’t get the red squiggly underlines in the SSMS window. I find it kinda frustrating when I’m just going to be running individual statements, but nevertheless, it’s good to avoid the squiggles. But of course, GO isn’t part of T-SQL, and I thought it was worth mentioning. I think only one person in the room (a former student) had heard me explain this before, so it worked out okay. And it fits in nicely with this month’s T-SQL Tuesday, which is on the topic of “Dirty Little Tricks”, and hosted by Matt Velic (@mvelic).

    So I’m going to show you two tricks, which are essentially harmless, but also help demonstrate potentially useful features of SQL Server – one in T-SQL, and one in Analysis Services.

    The T-SQL one, as I’ve already mentioned, is about the GO keyword, which isn’t actually part of T-SQL.

    You see, it’s a feature of SQL Server Management Studio, and of sqlcmd, but it’s not really a database engine thing. It’s the batch separator, and defines the point at which a bunch of T-SQL commands should be separated from the bunch that follow. It’s particularly useful for those times when the command you’re issuing needs to be the first command in the batch (such as a CREATE command), or even issued completely by itself (such as SET SHOWPLAN_XML ON).
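    A quick illustration of why the batch separator matters (dbo.Orders is a hypothetical table here):

    ```sql
    USE tempdb;
    GO   -- batch separator: CREATE VIEW must be the first statement in its batch

    CREATE VIEW dbo.TodaysOrders AS
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= CAST(GETDATE() AS date);
    GO   -- end the CREATE VIEW batch before running anything else

    SELECT * FROM dbo.TodaysOrders;
    ```

    Without the GO lines, SSMS would send all three statements as one batch, and the CREATE VIEW would fail because it wasn’t first.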

    …and it’s configurable.

    [SSMS Options dialog screenshot]

    This is the SSMS Options dialog, and you’ll see an option where you can change GO to be something else.

    I had thought at some point that you could change the Batch Separator to just about anything else, and then create a stored procedure called ‘go’, but of course, if you have more than one statement in your batch, then you must use EXEC to run a stored procedure. So hoping that ‘go’ might run a stored procedure by appearing at the end of your batch doesn’t work. Besides, that would be a BadThingToDo.

    Proper mischief involves changing it to a keyword, such as CREATE, DELETE or SELECT. If you make it SELECT, then all kinds of things will stop working in SSMS, and every SELECT query will come back with “A fatal scripting error occurred. Incorrect syntax was encountered while parsing SELECT.” Well, for new windows at least.

    The point at which it becomes really annoying for your unsuspecting colleague is that restarting SSMS only makes it worse. The setting is stored in C:\Users\YourName\AppData\Roaming\Microsoft\SQL Server Management Studio\12.0\SqlStudio.bin (the 12.0 means 2014 – it’s 11.0 for SQL2012), so even if you think installing a new copy of SSMS will fix it, it won’t.

    Sadly for you, reader, if they do an internet search, they’ll find posts like this one, and they will quickly realise who inflicted this pain on them.

    The other trick that I thought I’d mention is with SSAS translations, and again, demonstrates a nice feature of SQL Server.

    One rarely-used feature of Analysis Services (Multidimensional) is Translations.

    I say it’s rarely used, but if you have an environment that needs to cater for multiple languages, then you could well use them. They allow someone who has different language settings on their client machine to see localised data, dimension names, and so on. It’s really quite clever, and can be very influential in gaining acceptance of a system that must be used throughout all the worldwide branches of your organisation.

    But where you can have fun (so long as it doesn’t go into production) is when you have someone on your dev or test team who is originally from a different country (but with the same language), and likes to have their computer set to their home language. Like someone in Australia who likes to use English (New Zealand), or who likes to have English (Canada), despite the fact that they’ve been living in the United States for some years.

    The trick here is to introduce a translation in the language that they choose to use. They’ll be the only person who will notice it, and you can go as subtle or as blatant as you like.

    In the editor for the Cube file, you will see a Translations tab over on the right. It lets you enter the words in that language for the various concepts. So you could throw in the odd “eh” for Canadians, or mix up the vowels for Kiwis.

    Once you get into Dimension translations, you have so many more options! You can tell the data within attributes to come from a different column, even one that you’ve only made up for the DSV. That means that the reports they see can throw in the odd reference to hockey, or hobbits, or whatever else you might decide is appropriate to mess with their heads. Of course, when they see the report having the wrong names for things, they’ll tell someone else to fix it, but there won’t be anything to fix. It’s almost the ultimate “Doesn’t work on my machine” scenario, just to mess with that one person who doesn’t have their language settings the same as everyone else.

    …but please don’t let either of these go on in production. The last thing you need is to have someone think SQL is broken in production, or to have someone think you’re racist, when you’re just picking on New Zealanders.

    @rob_farley

  • Scans are better than Seeks. Really.

    There are quite a few reasons why an Index Scan is better than an Index Seek in the world of SQL Server. And yet we see lots of advice saying that Scans are bad and Seeks are good.

    Let’s explore why.

    Michael Swart (@MJSwart) is hosting T-SQL Tuesday this month, and wants people to argue against a popular opinion. Those who know me and have heard me present would realise that I often argue for things that are somewhat unconventional, and that I have good reason for doing so. (For example, in my Advanced T-SQL course, I teach people how to write GROUP BY statements. Because most people do it wrong most of the time.)


    So today I’m going to look at some of what’s going on with Scans and Seeks, and will demonstrate why the Seek operator is the one that has more to do.

    I’m not going to suggest that all your execution plans will be better if all the Seeks are replaced by Scans of those same indexes. That’s simply not the case. But the advice that you always hear is a generalisation. Some Seeks are better than some Scans, and some Scans are better than some Seeks. But best of all of them is a particular Scan, and hopefully this post will go some way to convincing you of that, and demonstrate ways that you can help your queries take advantage of this technique.

    From the user’s perspective, the big thing with Seeks is that the database engine can go straight to the required data for a particular query, whereas Scans search through the whole table for the data that’s needed. This is fairly true, and certainly, if it were the whole story, then it would be very hard to argue against Seeks. After all – if we can go straight to the required data, then that’s perfect! Hopefully you’re already thinking that it does sound too good to be true, and yet this is what we’re taught about Seeks.

    An index uses a tree-structure to store its data in a searchable format, with a root node at the ‘top’. The data itself is stored in an ordered list of pages at the ‘leaf level’ of the tree, with copies of the ‘key data’ in levels above. The ‘key data’ is anything that’s defined in the index key, plus enough extra data to make sure that each row is uniquely identifiable (if the index is a ‘unique index’, then it already has enough information, if not, then the clustered index key(s) are included – with uniquifier column if the CIX keys are not unique), and therefore searchable. This means that the data can be found quite quickly, but it still requires some searching. It’s not like we have the file, pageid and slot number ahead of time. Then we really could go straight to the data we needed, which is what happens when we do a RID Lookup against a heap. We might find that this address stores nothing more than a forwarding record to another RID, but still we’re getting to the data very quickly. With an Index Seek, or even a Key Lookup, we need to find the data by searching for it through the levels of the tree.

    I’ll also point out that a Seek takes two forms: Singleton and RangeScan, depending on whether the system knows that we’re looking for at most one record, or whether we’re looking for multiple records. The singleton form is only used when the system already has sufficient data to identify a unique record. If there is any chance that a second record could match, then a RangeScan is performed instead. For the sake of the post, let’s consider the singleton form a special case of the RangeScan form, because they both dive in to the index the same way, it’s just that the singleton only dives down, rather than looking around once there.
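    To make the two forms concrete (assuming a hypothetical dbo.Products table with a unique clustered index on ProductID):

    ```sql
    -- Singleton: equality on the full unique key, so at most one row can match;
    -- the seek just dives down the tree and stops
    SELECT * FROM dbo.Products WHERE ProductID = 42;

    -- RangeScan: an inequality can match many rows, so the seek dives to the
    -- start of the range and then traverses leaf pages, checking for the end
    SELECT * FROM dbo.Products WHERE ProductID BETWEEN 42 AND 50;
    ```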

    So the Seek operation works out that it can use the index to find some rows that satisfy a predicate – some condition in an ON, WHERE or HAVING clause. It works out a predicate that indicates the start of the range, and then looks for that row. The database engine starts at the top of the tree, at the root node, and checks the index key entries there to find out which row to go to at the next level down, where it repeats the operation, eventually reaching the leaf level, where it can find the start of the range. It then traverses the leaf pages of the index, until it reaches the end of the range – a condition which must be checked against each row it finds along the way.

    A Scan simply starts at the first page of the index and starts looking. Clearly, if only some of the rows are of interest, those rows might not be all clumped together (as they would be in an index on a useful key), and if they are clumped together, then a Seek would’ve been faster for the same operation. But there’s an important point here:

    Seeks are only faster when the index is not ideal.

    Seeks are able to locate the data of interest in a less-than-perfect index more quickly than simply starting at the first page and traversing through.

    But that search takes effort, both at the start, and on each record that must be checked in the RangeScan. I’m not just talking about any residual predicates that need to be applied – it needs to check each row to see if it’s found the end of the range. Granted, these checks are probably very quick, but it’s still work.

    What’s more, a Seek hides information more than a Scan.

    When you’re troubleshooting, and you look at a Scan operator, you can see what’s going on. You might not be able to see how many rows have actually been considered (ie, filtered using the Predicate) before returning the handful that you’ve asked for (particularly if the scan doesn’t run to completion), but other than that, it’s pretty simple. A Seek still has this (residual) Predicate property, but also has a Seek Predicate that shows the extents of the RangeScans – and we have no idea how big they are. At least with a Scan we can look in sys.partitions to see how many rows are in there.

    Wait – RangeScans? Plural?

    Yes. The execution plan does tell you that there are multiple RangeScans, if you look at the properties of the Seek operator. Obviously not in ‘Number of Executions’, or in ‘Actual’ anything. But in the Seek Predicates property, if you expand it and count how many entries there are (at least they’re numbered). Each of these entries indicates another RangeScan, each with its own cost.

    [Seek operator properties, showing the Seek Predicates list]

    And it’s not about the ‘Tipping Point’

    I’m not going to talk about the fact that a Seek will turn into a Scan if the Seek is not selective enough, because that’s just not true. A Seek of a non-covering index, one that then requires lookups to get the rest of the required information, will switch to using a covering index, even if that index is not ideal, if the number of lookups needed makes the ‘less ideal but covering’ index a less-costly option. This concept has nothing at all to do with Seeks and Scans. I can even make a Scan + Lookups turn into a Seek at a tipping point if you’re really keen... it’s entirely about the expense of Lookups.

    So, Seeks have slightly more work to do, but this work is to make up for the fact that indexes are typically ‘less-than-perfect’.

    Whenever you need just a subset of an index, where that subset is defined by a predicate, then a Seek is going to be useful. But in a perfect world, many of our indexes can be pre-filtered to the rows of interest. That might be “active tasks” or “orders from today”, or whatever. If a query hits the database looking for this set of things, then a Scan is ideal, because we can choose to use an index which has already been filtered to the stuff we want.
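    A filtered index sketch of the “active tasks” idea (the table and columns here are hypothetical):

    ```sql
    -- An index pre-filtered to just the rows of interest
    CREATE INDEX ixActiveTasks
    ON dbo.Tasks (DueDate)
    INCLUDE (TaskName, AssignedTo)
    WHERE Status = 'Active';

    -- This query's predicate matches the index filter, so the engine can
    -- simply scan the small filtered index, already ordered by DueDate,
    -- rather than seeking into a much larger index
    SELECT TaskName, AssignedTo, DueDate
    FROM dbo.Tasks
    WHERE Status = 'Active'
    ORDER BY DueDate;
    ```

    The Scan here reads only the handful of active rows; there is no range to locate and no end-of-range check per row.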

    So I don’t mind Scans. I don’t view them with the same level of suspicion as I do Seeks, and I often find myself looking for those common predicates that could be used in a filtered index, to potentially make indexes which are pre-filtered, and which are more likely to be scanned, because they have the 20 rows of interest (rather than seeking into a much larger index to get those 20 rows).

    There’s more to this – I’ve delivered whole presentations on this topic, where I show how Scans can often make Top queries run quite nicely, and also how Seeks can tend to be called too frequently.

    I don’t want you to start working to turn all your plans’ Seeks into Scans – but you should be aware that quite often, a Seek is only being done because your index strategy has space for improvement.

    @rob_farley

  • Victims of success

    I feel like every database project now involves major decisions that are remarkably fundamental to the direction that’s going to be taken. And it’s almost as if new options appear with ever-increasing frequency.

    Consider a typical database project, involving a transactional system to support an application, with extracts into a data warehouse environment for reporting, possibly with an analytical layer on top for aggregations.

    Not so long ago, the transactional system could be one of a small number of database systems, but if you were primarily in the Microsoft space you’d be looking at SQL Server, either Standard or Enterprise (and that decision would be relatively easy, based on the balance between cost and features), with extracts into another database, using Analysis Services for aggregations and Reporting Services for reports. Yes, there were plenty of decisions to make, but the space has definitely become more complex since then. If you’re thinking about a BI solution, you need to work out whether you should leverage the SharePoint platform for report delivery, figure out whether you want to use the Tabular or Multidimensional models within SSAS, Project or Package within SSIS, and of course, cloud or ‘on-premise’.

    This month’s T-SQL Tuesday topic, hosted by fellow MCM Jason Brimhall (@sqlrnnr) is on the times when a bet has had to be made, when you’ve had to make a decision about going one way rather than another, particularly when there’s been an element of risk about it. These decisions aren’t the kind of thing that could cause massive data loss, or cost someone their job, but nonetheless, they are significant decisions that need to be made, often before all the facts are known.

    As I mentioned before, one of the biggest questions at the moment is: Cloud or “On-Premise”?

    I’m not going to get into the “on-premise” v “on-premises” argument. The way I look at it, “on-premise” has become an expression that simply means “not in the cloud”, and doesn’t mean it’s actually on your premises at all. The question is not about whether you have a physical server that you can walk up to without leaving your office – plenty of organisations have servers hosted with an ISP, without being ‘in the cloud’. It also doesn’t mean that you’re avoiding virtual machines completely.

    So by ‘cloud’, I’m talking about a system like Windows Azure SQL Database. You’ve made the decision to adopt something like WASD, and are dealing with all the ramifications of such a system. Maintenance of it is being handled as part of your subscription. You’re not making decisions about what operating system you’re using, or what service accounts are being used. You’re spinning up a database in the cloud, because you’ve made a decision to take the project that way.

    WASD has a much smaller initial outlay than purchasing licenses, and the pricing model is obviously completely different – not only using a subscription basis, but considering data transfer (for outbound data) too. If you’re comparing the cost of getting your system up and running, then the fact that you’re not having to set up servers, install an operating system, have media for backups, and so on, means that choosing the cloud can seem very attractive.

    But there are often caveats (how often are ‘bets’ made more risky because of a caveat that was ignored or at least devalued?).

    For example, right now, the largest WASD database is limited to 150GB. That might seem a lot for your application, but you still need to have considered what might happen if that space runs out. You can’t simply provision a new chunk of storage and tell the database to start using that as well.

    You need to have considered what happens when the space runs out. Because it will.

    I would like to think that this question gets asked of every database system, but too frequently, it doesn’t get asked, or otherwise, the answer is disregarded. Many on-premise systems find it easy enough to throw extra storage at the problem, and this is a perfectly valid contingency plan. Other systems have a strict archiving procedure in place, which can also ensure that the storage stays small. But still, there are questions to ask, and a plan to work out.

    To me, it feels a lot like what happened to Twitter in its early days. The concept of Twitter is very simple – it’s like text messages sent to the world. But because the idea caught on, scaling become a bigger problem than they expected, much earlier than they expected. They were a victim of their own success. They worked things out, but there were definitely growing pains.

    In the 1990s, many of us in the IT industry spent a decent amount of time fixing code that no one imagined would still need to be running in the futuristic 21st century. After the fact, many claimed that the problem had been over-exaggerated, but those of us who had worked on those systems knew that a lot of things would have broken if we hadn’t invested that effort. It’s just that when a lot of software got written, no one expected it to still be in use in 2000. Those coders didn’t expect to be so successful.

    It’s too easy to become a victim of success. I tell people that if they have done a good job with their database application, they will probably have underestimated its popularity, and will have also underestimated the storage requirements, and so on. I’ve seen many environments where storage volumes were undersized, and volumes which had been intended for one type of use now serve a variety (such as a drive for user database data files now containing tempdb or log files, even the occasional backup). As a consultant I never judge, because I understand that people design systems for what they know at the time, not necessarily the future. And storage is typically cheap to add.

    But when it comes to Windows Azure SQL Databases, have a plan for what you do when you start to reach 150GB. Scaling out should be a question asked early, not late.

  • Converting Points to a Path

    Suppose your SQL table has a bunch of spatial points (geographies if you like) with an order in which they need to appear (such as time) and you want to convert them into a LineString, or path.

    One option is to convert the points into text, and do a bunch of string manipulation. I’m not so keen on that, even though it’s relatively straightforward if you use FOR XML PATH to do the heavy lifting.

    The way I’m going to show you today uses three features that were all introduced in SQL Server 2012, to make life quite easy, and I think quite elegant as well.

    Let’s start by getting some points. I’ve plotted some points around Adelaide. To help, I’m going to use Report Builder to show you the results of the queries – that way, I can put them on a map and you can get a feel for what’s going on, instead of just seeing a list of co-ordinates.

    First let’s populate our data, creating an index that will be helpful later on:

    select identity(int,1,1) as id, *
    into dbo.JourneyPoints
    from
    (values
        (geography::Point(-34.924269, 138.599252, 4326), 'Cnr Currie & KW Sts', cast('20140121 9:00' as datetime)),
        (geography::Point(-34.924344, 138.597544, 4326), 'Cnr Currie & Leigh Sts', '20140121 9:30'),
        (geography::Point(-34.923025, 138.597458, 4326), 'Cnr Leigh & Hindley Sts', '20140121 10:00'),
        (geography::Point(-34.923016, 138.597608, 4326), 'Cnr Bank and Hindley Sts', '20140121 10:30'),
        (geography::Point(-34.921775, 138.597533, 4326), 'Cnr Bank St and North Tce', '20140121 11:00'),
        (geography::Point(-34.921520, 138.601814, 4326), 'Cnr North Tce and Gawler Pl', '20140121 11:30'),
        (geography::Point(-34.924071, 138.601975, 4326), 'Cnr Gawler Pl and Grenfell St', '20140121 12:00'),
        (geography::Point(-34.923966, 138.605590, 4326), 'Cnr Grenfell and Pulteney Sts', '20140121 12:30'),
        (geography::Point(-34.921338, 138.605405, 4326), 'Cnr Pulteney St and North Tce', '20140121 13:00')
      ) p (geo, address, timeatlocation);

    create index ixTime on dbo.JourneyPoints(timeatlocation) include (geo);

    select * from dbo.JourneyPoints;

    Great. Starting at the corner of Currie and King William Streets, we wander through the streets, including Leigh St, where the LobsterPot Solutions office is (roughly where the ‘e’ is).

    [map of the points, labelled with times]

    I’ve labelled the points with the times, but it’s still not great viewing. Frankly, it’s a bit hard to see what route was taken.

    What we really want is to draw lines between each of them. For this, I’m going to find the next point in the set, using LEAD(), and use the spatial function ShortestLineTo to get the path from our current point to the next one.

    select geo,
           lead(geo) over (order by timeatlocation) as nextGeo,
           geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as lineToNext,
           timeatlocation
    from dbo.JourneyPoints;

    I didn’t need to pull back the geo and nextGeo fields, but I figure that the lineToNext column might be confusing at first glance, since it uses the subsequent row’s position as an argument to a function on the current row’s position. Anyway, hopefully you get the gist. Here’s what it looks like.

    [image: map with lines drawn between consecutive points]

    This is way better – you can see the path that was taken, and can easily tell that the route didn’t just go straight up North Tce, it ducked down Gawler Place instead.

    What’s more – with each part of the journey still being a separate row, I can colour each part differently. You know, in case I don’t like the “Tomato” colour in my last example (yes, that colour is called “Tomato”, no matter whether you say it “tomato”, or “tomato”, or even “tomato”).

    To colour it differently, I’m going to throw in an extra field, which is just the number of minutes since we started. I’ll use the old-fashioned OVER clause for that (no ORDER BY needed), counting the minutes since the earliest time.

    select geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as lineToNext,
           timeatlocation,
           datediff(minute, min(timeatlocation) over (), timeatlocation) as minutesSinceStart
    from dbo.JourneyPoints;

    [image: map with each route segment coloured by minutes since start]

    Cool – now I can easily tell which end it started at (the more tomatoey end), and where it ended (the paler end). Each segment is the same colour, but that’s okay.

    Now, I said I’d use three SQL 2012 features, and so far the only new ones have been LEAD and ShortestLineTo. But remember I still have several rows, and each section of the route is a separate line. Well, to join them up, I’m going to use 2012’s UnionAggregate function. To use this, I need to use a sub-query (I’ll go with a CTE), because I can’t put an OVER clause inside an aggregate function.

    with lines as (
    select geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as LineToNext
    from dbo.JourneyPoints
    )
    select geography::UnionAggregate(LineToNext) as WholeRoute
    from lines;

    Now I have my solution! I’ve converted points into lines, in the right order.

    [image: map showing the whole route as a single line]

    You may be wondering how this performs – what kind of execution plan is going to appear.

    Well, it’s this:

    [image: the execution plan]

    Look at this – there are Stream Aggregates (which just watch the data as it comes through, popping rows out when needed, but never holding onto anything except the aggregate as it grows), a Spool (which is used to do a bit of the windowing trickery, but also holding onto very little), and the Sequence Project & Segment operators which generate a row_number as a marker for the lead function. You might be interested to know that the right-most Stream Aggregate has the following “Defined Value” property:

    [Expr1005] = Scalar Operator(LAST_VALUE([spatial_test].[dbo].[JourneyPoints].[geo])),
    [[spatial_test].[dbo].[JourneyPoints].geo] = Scalar Operator(ANY([spatial_test].[dbo].[JourneyPoints].[geo]))

    For each group (which is defined as the row), it uses the LAST_VALUE of geo, and ANY of geo. ANY is the current one, and LAST_VALUE is the row after it. It’s the last row, because the Spool gives up two rows for each ‘window’ – the current row and the lead row. In this scenario, with 9 rows of data in the index, the Spool pulls in (from the right) 9 rows, and serves up (to the left) 17. That’s two per original row, except the last which doesn’t have a lead row.

    So the overhead on making this work is remarkably small. With an index in the right order, the amount of work to do is not much more than scanning over the ordered data.

    Finally, if I had wanted to do this for several routes, I could have put a RouteID field in the table, used PARTITION BY RouteID in each OVER clause, and GROUP BY RouteID in the final query. If you do this, then you should put routeid as the first key column in your index. That way, the execution plan can be almost identical (just with slightly more explicit grouping, but with identical performance characteristics) to before.

    with lines as (
    select routeid, geo.ShortestLineTo(lead(geo) over (partition by routeid order by timeatlocation)) as LineToNext
    from dbo.JourneyPoints
    )
    select routeid, geography::UnionAggregate(LineToNext) as WholeRoute
    from lines
    group by routeid
    ;

    But I don’t have a picture of that, because that wasn’t the query I wanted.
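
    Incidentally, the index change I described for the multi-route scenario might look like this – just a sketch, assuming the table has gained a routeid column:

    --Hypothetical: assumes dbo.JourneyPoints now has a routeid column
    create index ixRouteTime on dbo.JourneyPoints(routeid, timeatlocation) include (geo);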

  • Write-BlogPost

    A couple of years ago I was going to write a song about automation, in reggae style, which could maybe have been used by the Trinidad SQL community – particularly Nigel Sammy (@nigelsammy). The theme was going to be around the fact that you need automation because the sun shines and the beach is calling.

    But of course, automation is about so much more than freeing up time for the beach (even here in Adelaide, where every weekday this week is set to be over 40C). Automation helps you be consistent in what you do by removing manual steps, and lets you focus your attention on the things that require thought, rather than on the things that are the same every time.

    This month’s T-SQL Tuesday is about automation, and I thought I’d write about how a few of my favourite applications help me massively in the quest for better automation. The host’s post asks what has changed since the last time automation was a topic, but that time I mainly looked at Policy Based Management, which is great for being able to make sure that things happen. This time, I want to look particularly at the things I use to develop repeatable commands, thereby reducing how much I have to do compared to how much can be done by the machine.

    SQL Server Management Studio (SSMS)

    The Script button in dialogs! Oh how I love it. In fact, I wish that there were no OK button on dialog boxes in SSMS. I would be perfectly fine with a “Script and Close” button instead. I know I could have an Extended Events session or Trace running to be able to pick up what has just been run on the SQL box, but that doesn’t quite cut it. When I hit the OK button, I don’t actually know what commands are going to be run. I’ll have a good idea, of course, but if I’ve been tabbing through options and accidentally changed something, I might not have noticed (ok, I’m sure I will have, no one ever makes that mistake in real life). Even more significantly though, I might want to be able to run exactly the same command against another server. The Script button is amazingly useful and should be used by EVERYONE.

    gVim

    While I was at university, I used Unix a lot. My PC at home ran Linux, and I shuddered whenever I’d find out I had to use a Microsoft environment. It’s okay – I got over it – but one thing that remains is my appreciation for the text editor vi. I was pretty much forced to use it for a long while, and for a good year or more, I think I learned a new way of doing things almost every day. Just about every time you’d sit with someone else and work with them, you’d see something they’d do and go “Oh, how did you do that?” Of course, they’d reply with “Oh, that’s just pressing star”, or something like that. It was a good time, and I developed an appreciation for vi (and later, vim, and its Windows client gVim), which has stayed with me. Still I find myself opening Visual Studio and filling a row with ‘j’s as I hope to scroll down through the code.

    From an automation perspective, gVim is great. The whole environment is based on keystrokes, so there’s never any reliance on putting the mouse cursor somewhere and clicking. Furthermore, I can hit ‘q’ and then record a macro, playing it back with @ (ok, it’s actually q followed by another letter, in which you store the macro, and @ followed by the letter for the macro of interest). This makes it great not just for writing code, but editing all kinds of text. I like Excel for being able to use formulas which can be repeated across each row, but I also find myself leveraging gVim’s macros for doing things even more easily – and navigating multiple lines.

    PowerShell

    I so wish that Windows had the macro-recording concept of gVim, or the Script button of SSMS. It would be really nice to be able to go to some spot in the Registry, or some Control Panel dialog, make some change, and say “And please give me a Script for what I’ve just done!” (If someone knows how to do this, PLEASE let me know)

    But even so, PowerShell is tremendously useful. In my Linux days I would control everything through a shell environment (I preferred tcsh for some reason – I forget why – bash was good too, of course), and as such I could look back at what I’d just done, store scripts to repeat things another time, and so on. I don’t get that feeling with Windows, but PowerShell helps. I feel comfortable loading up a piece of XML in PowerShell (even an Execution Plan), and I love how easily I can move around XML in PowerShell.

    Of course, every month I write a post for T-SQL Tuesday, and it would be quite neat to have a script that would automate that for me. But there are plenty of things that I don’t have automated (and may never do), and putting blog posts together is probably going to remain one of those. I can’t see myself creating a fully-automated Write-BlogPost cmdlet any time soon.

  • Waiting, waiting…

    “It just runs slow these days”

    I’m sure you’ve heard this, or even said it, about a computer that’s a few years old. We remember the days when the computer was new, and it seemed to just fly – but that was then, and this is now. Change happens, things erode, and become slower. Cars, people, computers. I can accept that cars get slower. They lose horsepower over time as the precision components wear and become less precise. I also know that my youth is a thing of the past. But electronics? What happens there?

    Well, in my experience, computers don’t get slower. They just feel slower. I see two main reasons, and neither of them is ageing hardware.

    Your computer might even be slower than it was yesterday. In the world of databases, we might be investigating why the computer is slower than it was five minutes ago. Again, it’s probably not because of ageing hardware.

    One possible reason is that we’re simply asking systems to do more. If we’re comparing our laptops to when we bought them, we’re probably refreshing webpages more frequently (often in the background) and have installed too many utilities (hopefully not in the background, but you never know), and the system has more to get done in a given minute than when it was new. With a database server, the amount of data has probably grown, there may be more VLFs in the log file to deal with, and more users pushing more transactions. These are not things you want to uninstall like that annoying browser search bar on your aunt’s ageing computer, but they can be a very valid reason for things to be slower. Hopefully you are tuning your system to make sure that scalability is possible, and you’re very happy with the amount of extra work that’s being done, even if it does mean that some processes take a little longer than they once did.
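
    As an aside, the VLF count I mentioned is easy to check. DBCC LOGINFO is undocumented but widely used, and returns one row per virtual log file in the current database’s transaction log:

    --One row per VLF; a high row count suggests a fragmented log
    DBCC LOGINFO;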

    This problem can be summarised by the fact that the system is having to wait for resources to be free so that it can get its stuff done.

    Another reason for slowness is that the system is having to wait more for other reasons, things that you don’t want it having to wait for. An increase in busyness will cause slowness because of waiting, but you can easily make the argument that this is ‘acceptable’. It’s much more of a problem if the system is being slower without actually achieving any more than it was before.

    Waits are the topic of this month’s T-SQL Tuesday, hosted by Robert Davis (@sqlsoldier). Go and have a look at his post to see what other people have written about on this topic.

    In the SQL Server world, this kind of problem is identified by looking at wait stats. The system records what processes are waiting for, and you can see these by querying sys.dm_os_wait_stats. It’s very useful, but querying it in isolation isn’t as useful as taking snapshots of it. If you want to store copies of it over time, you may prefer to do something along the lines of:

    --A schema for monitoring data can be useful
    create schema monitoring;
    go

    --Create a table that has the structure of sys.dm_os_wait_stats
    select top (0) *
    into monitoring.waits
    from sys.dm_os_wait_stats;

    --Add a column to record when the stats were collected
    alter table monitoring.waits
    add snapshot_time datetime default sysdatetime();

    --Run this section regularly
    insert monitoring.waits (wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
    select * from sys.dm_os_wait_stats;
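
    To make that regular collection easy to schedule (from a SQL Server Agent job, say), the insert could be wrapped in a stored procedure – a sketch, with monitoring.CollectWaits being a name I’ve made up:

    create procedure monitoring.CollectWaits
    as
    begin
        insert monitoring.waits (wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
        select * from sys.dm_os_wait_stats;
    end;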

    Regularly collecting snapshots of wait_stats like this can give you a picture of what has occurred over time. You can pull this data into a report or into Excel, or grab the most recent interval quite easily, using a query such as:

    with numbered as (
    select *,
        wait_time_ms - lead(wait_time_ms) over (partition by wait_type order by snapshot_time desc) as diff_wait_time,
        waiting_tasks_count - lead(waiting_tasks_count) over (partition by wait_type order by snapshot_time desc) as diff_wait_count,
        1000 * datediff(second,lead(snapshot_time) over (partition by wait_type order by snapshot_time desc),snapshot_time) as diff_ms,
        row_number() over (partition by wait_type order by snapshot_time desc) as rownum
    from monitoring.waits
    )
    select wait_type, snapshot_time, diff_wait_count, diff_wait_time, diff_ms
    from numbered
    where rownum = 1
    order by diff_wait_time desc, wait_type;

    This query compares the amount of wait time for each type (which is frustratingly stored as a string) since the previous one, using the LEAD function that was introduced in SQL Server 2012 (LEAD rather than LAG because we’re looking at snapshot_time desc, not ASC). Using ROW_NUMBER(), we can easily pick out the latest snapshot by filtering to rownum = 1, but if you’re just wanting to chart them, the contents of the CTE will be enough.

    Make sure you keep an eye on the amount of data you’re storing, of course, and be careful of the impact of someone inadvertently clearing the stats (though as the query picks up deltas, you should be able to consider a filter that will ignore the deltas that might have spanned a period during which the stats were cleared).
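
    For example – and this is just a sketch – a delta that spans a clearing of the stats (someone running DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR), perhaps) will typically come out negative, so you could filter those intervals out in the final SELECT of the query above:

    --Negative deltas indicate the stats were cleared during that interval
    select wait_type, snapshot_time, diff_wait_count, diff_wait_time, diff_ms
    from numbered
    where rownum = 1
    and diff_wait_time >= 0
    order by diff_wait_time desc, wait_type;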

    This post is not going to go into all the different wait types to tell you which ones are worth worrying about and which ones are worth ignoring. But what I would suggest to you is that you track what’s going on with your environment and keep an eye out for things that seem unusual. When troubleshooting, you will find any history invaluable.
