
Linchi Shea

Checking out SQL Server via empirical data points

  • Clustering every SQL Server instance

    You may disagree, but I believe it is a good practice to cluster all your SQL Server instances. That is, even when you plan to run a SQL Server instance on a single machine, you should install it in a single-node cluster.

    The primary advantage is that you need only a single standard SQL Server build instead of one for stand-alone instances and one for clustered instances. This simplifies configuration tasks such as setting up network aliases, Kerberos, and multiple instances.

    If you need to add a second node later, that change will be totally transparent to your clients. You may question how often one would need to add a second node. True, that may not happen very often. But when it does happen, you’d be thankful that you already have the cluster in place. Moreover, consider a DR scenario where you don’t want a two-node cluster sitting idle just in case there is a disaster. If it turns out that you do need to run your production on DR for an extended period, you probably want to protect it with a cluster. In that case, adding another node is painless if your DR is already a single-node cluster.

    When it comes to using network aliases, I strongly prefer to expose and manage all the network aliases of my SQL Server instances explicitly as network resources. Network resources introduce additional constraints that make the network aliases less vulnerable to mistakes and help reduce chaos when we need to move them among hosts.

    Kerberos configuration can sometimes be finicky, and the procedures for a clustered instance differ from those for a stand-alone instance. Anything you can do to reduce the Kerberos configuration complexity is a plus for the robustness of your environment.

    When multiple instances are configured in a cluster, they must have separate IP addresses. This makes it trivially easy to create network aliases for a named instance, completely eliminates the need to reference any two-part name on any client, and makes the ‘server location’ completely transparent to all the clients. By 100% server location transparency, I mean the ability to move a SQL Server instance to a different machine or a different cluster without changing any configuration at all on the client side. This is a huge convenience, and in many cases an absolute necessity.
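    For instance, with the instance clustered, you can see the stable network name and the physical node it currently runs on directly from T-SQL. A minimal sketch (standard server properties, nothing specific to my environment):

    SELECT SERVERPROPERTY('IsClustered')                 AS is_clustered,
           SERVERPROPERTY('MachineName')                 AS network_name_clients_use,
           SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_physical_node;

    After a failover, or after a node is added and the instance moved, network_name_clients_use stays the same while current_physical_node changes, which is exactly the transparency described above.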

    From my personal experience, I have not seen any serious downside to this approach. Hope that’s your experience as well. As always, I’d like to hear your different views/opinions on this.

  • Performance impact: the cost of NUMA remote memory access

    These days if you get a new server-class machine to run SQL Server, you can almost be 100% sure that it’ll be running on NUMA hardware. The recent AMD Opteron and Intel Nehalem-based processors are all built on some form of NUMA architecture.

    The current consensus is that as the number of processors grows, their shared memory bus can easily get congested and become a major impediment to scalability. NUMA hardware solves this scalability challenge by dividing the processors into groups, with each group having a private memory bus (or memory controller). Such a group of processors is often built into a single physical package, known as a socket. By limiting the number of processors inside a group (or socket), the congestion on the private memory bus can be controlled. A NUMA-based server scales by increasing the number of such processor groups and connecting them via a specially designed high-speed interconnect. In AMD Opteron processors, the interconnect among the NUMA nodes is called HyperTransport, and in Intel Nehalem it’s called QuickPath Interconnect.

    Of course, there is a cost to connecting these sockets, whether it’s with HyperTransport or with QuickPath. That is, accessing memory on another NUMA node via the interconnect is slower than accessing memory local to the same NUMA node without going over an interconnect. Accessing memory on another NUMA node is called remote memory access, whereas accessing memory on the same NUMA node is called local memory access.
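    You can see the NUMA layout that SQL Server itself works with by querying sys.dm_os_nodes. A quick sketch:

    -- One row per NUMA node visible to SQLOS (the DAC node is filtered out).
    SELECT node_id,
           memory_node_id,
           cpu_affinity_mask,
           online_scheduler_count
    FROM sys.dm_os_nodes
    WHERE node_state_desc = 'ONLINE';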

    The exact cost differential between remote memory access and local memory access varies across processors and processor clock frequencies. For instance, this Dell whitepaper has some test results on the Xeon 5500 processors, showing that local memory access can have 40% higher bandwidth than remote memory access, and the latency of local memory access is around 70 nanoseconds whereas remote memory access has a latency of about 100 nanoseconds.

    On your system, you may not be able to run a memory benchmark to measure the local-to-remote memory access differential. But you can get a rough idea using a free Sysinternals tool called CoreInfo, which you can download from Microsoft. In particular, the CoreInfo output has a section on approximate cross-NUMA node access cost. On my test server, it looks like the following:

    Calculating Cross-NUMA Node Access Cost...
    Approximate Cross-NUMA Node Access Cost (relative to fastest):
         00  01
    00: 1.0 1.2
    01: 1.3 1.0

    You may get slightly different results across multiple runs of CoreInfo even on the same system. But the above result shows that, for instance, accessing memory node 00 from node 01 is about 30% more expensive than accessing memory local to node 00.

    Objective

    The question that a SQL Server professional may ask is: how should I expect this local-to-remote memory access cost differential to show up in a SQL Server workload?

    The exact impact, of course, depends on many factors. Nonetheless, it would be nice to get a feel for the worst case scenario.

    A worst case scenario is when SQL Server has to do 100% remote memory access in order to process a query. That is, the workers that process the queries from a workload are all running on one NUMA node, while the pages for the database are all cached in memory local to another NUMA node.

    I did some searching to see what might have been said on this subject, but could not find any published empirical data points showing the performance impact of remote memory access on a SQL Server workload. So I decided to construct a test and find out myself. This post reports the findings from the test.

    Test Environment

    A brief summary of the test environment is in order. The test server is an HP ProLiant DL360 G7 with two Intel Xeon X5690 (Westmere-EP) processors at 3.46GHz. Since hyperthreading is enabled, each socket has 12 logical processors. The total physical memory is 64GB in eight 8GB DDR3 modules, four modules (or 32GB in total) per socket. Each X5690 processor (or socket) has six cores and 12MB of L3 cache. The OS is Windows Server 2008 R2 Enterprise X64 Edition with Service Pack 1. The DBMS is Microsoft SQL Server 2008 R2 Enterprise X64 Edition (10.50.1600). Max server memory is set to 50,000MB for the SQL Server instance.

    Test Setup

    To prepare for the test, the steps outlined below are followed:

    1. Follow the BOL article How to: Configure the Database Engine to Listen on Multiple TCP Ports to configure the SQL Server instance to listen on two additional ports: 50001 and 50002.
    2. Follow the BOL article How to: Map TCP/IP Ports to NUMA Nodes to configure the SQL Server instance to affinitize connections on port 50001 to node 1, and connections on port 50002 to node 2.
    3. Restart the SQL Server instance, and verify that it’s listening on ports 50001 and 50002.
    4. Connect to the SQL Server instance on port 50001 by specifying NYCSQL01,50001 (assuming that the SQL Server instance name is NYCSQL01), and execute the attached SQL script to create the test tables and populate them with the test data. The main test table tran_stats is populated with 50,000,000 rows, and is about 9GB in size. Note that each NUMA node has 32GB of physical memory, so there is more than enough room to cache the entire test table on node 1.
    5. Use the perfmon counters under SQLServer:Buffer Node (primarily Database pages) to verify that the test table is entirely loaded into memory local to node 1 (a query-based alternative is sketched below).
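    For step 5, a simple alternative to watching perfmon interactively is to query the same counters through sys.dm_os_performance_counters. A minimal sketch (the object_name prefix varies with the instance name, hence the LIKE):

    -- Number of 8KB data pages cached on each buffer node, i.e., where the
    -- test table ended up in memory.
    SELECT instance_name          AS numa_node,
           cntr_value             AS database_pages,
           cntr_value * 8 / 1024  AS cached_mb
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Node%'
      AND counter_name = 'Database pages';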

    Two Test Scenarios

    With all the test data cached on node 1, we run two test scenarios:

    • 100% local memory access. Limit query processing to the cores on node 1 only. We accomplish this by connecting the workload driver to port 50001 only.
    • 100% remote memory access. Limit query processing to the cores on node 2 only. We accomplish this by connecting the workload driver to port 50002 only.
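    Before starting a run, it is worth double-checking that a given connection really landed on the intended node. The node_affinity column in sys.dm_exec_connections makes this a one-liner:

    -- Run this on each workload connection: local_tcp_port should be 50001 or
    -- 50002, and node_affinity should show the node that port is mapped to.
    SELECT session_id, local_tcp_port, node_affinity
    FROM sys.dm_exec_connections
    WHERE session_id = @@SPID;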

    Now, from my previous post and from Bob Dorr’s post, we know that we cannot have query parallelism if we want to have the control outlined above. So for this test, maxdop is set to 1 throughout. And while we are running the test, we watch the perfmon counters under SQLServer:Buffer Node to verify that indeed we are getting only local memory access or only remote memory access, depending on which scenario is being tested.

    The two key counters to watch are:

    • SQLServer:Buffer Node\Local node page lookups/sec, and
    • SQLServer:Buffer Node\Remote node page lookups/sec

    In the case of 100% local memory access, perfmon counter SQLServer:Buffer Node\Remote node page lookups/sec should have a zero value. And it should be zero for all the nodes because we don’t have any other activity on the SQL Server instance. In the case of 100% remote memory access, perfmon counter SQLServer:Buffer Node\Local node page lookups/sec is expected to have a zero value for all the nodes.
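    These two counters can also be sampled from T-SQL. Keep in mind that the DMV exposes them as cumulative counts, so you need two samples and a delta to get a per-second rate (or simply watch which node’s raw values grow). A sketch:

    SELECT instance_name AS numa_node,
           counter_name,
           cntr_value    AS cumulative_lookups
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Node%'
      AND counter_name IN ('Local node page lookups/sec',
                           'Remote node page lookups/sec')
    ORDER BY instance_name, counter_name;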

    Test Queries

    We use the same reporting query as we used in some of the previous posts (such as this one on hyperthreading). The query is reproduced as follows:

    select COUNT(*) from (
       select id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus,
              COUNT(*) as cnt
         from tran_stats t1 join dimensions t2
           on t1.server_config = t2.server_config and
              t1.thread_name   = t2.thread_name and
              t1.tran_type     = t2.tran_type and
              t1.tran_status   = t2.tran_status and
              t1.users         = t2.users and
              t1.cpus          = t2.cpus
        group by id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus
    ) as v
    option (maxdop 1)
    

    To ensure that the processors are sufficiently loaded, this query is run in streams over multiple concurrent connections to port 50001 for the 100% local memory access scenario, and to port 50002 for the 100% remote memory access scenario. Since we are interested in the worst case scenario, we need to test various numbers of concurrent streams and settle on the one that shows the largest local-to-remote memory access differential. In our case, that number is 32; i.e., with 32 concurrent query streams, we see the largest performance difference between 100% local memory access and 100% remote memory access.

    Test Results

    The following chart shows the largest throughput difference in terms of queries per hour between 100% local memory access and 100% remote memory access:

    NUMA_local_vs_remote_access

    As I mentioned in my previous post on SQL Server NUMA behavior, I ran into two surprises when playing with NUMA affinity. One is that NUMA affinity doesn’t work with query parallelism. The other surprise, to a lesser extent, is that the cost differential between local memory access and remote memory access is not as large as I expected from reading all the published materials on the Nehalem-based NUMA architecture. In fact, no matter where you read about it, you’d always be warned of the cost of remote memory access, and that is often the very first thing you’d be told.

    So it’s no surprise that I’d be surprised by the test result presented in the above chart. More specifically, the largest difference I could find between 100% local memory access and 100% remote memory access is about 5%, and I was definitely expecting more than that for this particular 100% cached workload.

    This in a way is good news because that means we probably shouldn’t worry too much about data load not being balanced across the NUMA nodes, especially when we have some other performance issues to contend with.

     

    Appendix. Code for loading the test data

    drop table tran_stats
    go
    create table dbo.tran_stats (
        server_config     varchar(100),
        users             int,
        cpus              smallint,
        thread_name       varchar(20),
        tran_count        int,
        tran_type         varchar(20),
        tran_status       varchar(20),
        tran_begin        datetime,
        tran_end          datetime,
        resultset_size    smallint,
        filler            char(100))
    go
    set nocount on
    go
    truncate table tran_stats
    go
    declare @i int = 1
    declare @batch_size int = 100000
    begin tran
    while @i <= 50000000
    begin
        insert into tran_stats with (tablock) (server_config,users,cpus,thread_name,tran_count,tran_type,tran_status,tran_begin,tran_end,resultset_size,filler)
        select 'X580 G' + right('0' + cast((@i % 40) as varchar(5)), 2),
               ((@i % 5) + 1) * 10,
               ((@i % 8) + 1) * 10,
               'VC' + cast(@i % 100 as varchar(15)),
               @i,
               case when @i % 3 = 0 then 'READ-ONLY'
                    when @i % 3 = 1 then 'WRITE-ONLY'
                    when @i % 3 = 2 then 'READ-WRITE'
               end,
               case when @i % 3 = 0 then 'SUCCESS'
                    when @i % 3 = 1 then 'FAILURE'
                    when @i % 3 = 2 then 'ROLLBACK'
               end,
               GETDATE(),
               DATEADD(second, @i % 50, getdate()),
               @i % 100,
               'abc'
        if @i % @batch_size = 0
        begin
            commit tran
            begin tran
        end
        select @i = @i + 1
    end
    if @@TRANCOUNT > 0
        commit tran
    go
    create clustered index cix_tran_stats on tran_stats(tran_begin)
    go
    drop table Dim_server_config
    go
    create table Dim_server_config (
       id_server_config int identity,
       server_config    varchar(100)
    )
    go
    drop table Dim_thread_name
    go
    create table Dim_thread_name (
       id_thread_name int identity,
       thread_name    varchar(20)
    )
    go
    drop table Dim_tran_type
    go
    create table Dim_tran_type (
       id_tran_type int identity,
       tran_type    varchar(20)
    )
    go
    drop table Dim_tran_status
    go
    create table Dim_tran_status (
       id_tran_status int identity,
       tran_status    varchar(20)
    )
    go
    drop table Dim_users
    go
    create table Dim_users (
       id_users int identity,
       users    int
    )
    go
    drop table Dim_cpus
    go
    create table Dim_cpus (
       id_cpus int identity,
       cpus    int
    )
    go
    insert Dim_server_config(server_config)
    select distinct server_config
      from (select top 100000 server_config from tran_stats with (nolock)) as v

    insert Dim_thread_name(thread_name)
    select distinct thread_name
      from (select top 100000 thread_name from tran_stats with (nolock)) as v

    insert Dim_tran_type(tran_type)
    select distinct tran_type
      from (select top 100000 tran_type from tran_stats with (nolock)) as v

    insert Dim_tran_status(tran_status)
    select distinct tran_status
      from (select top 100000 tran_status from tran_stats with (nolock)) as v

    insert Dim_users(users)
    select distinct users
      from (select top 100000 users from tran_stats with (nolock)) as v

    insert Dim_cpus(cpus)
    select distinct cpus
      from (select top 100000 cpus from tran_stats with (nolock)) as v
    go
    if OBJECT_ID('dbo.dimensions') is not NULL
       drop table dbo.dimensions
    go
    select distinct * into dbo.dimensions
      from Dim_server_config, Dim_thread_name, Dim_tran_type, Dim_tran_status, Dim_users, Dim_cpus
    go
    
                
  • No respect: NUMA affinity meets query parallelism

    What happens when NUMA affinity meets query parallelism? It gets no respect!

    SQL Server allows you to affinitize a TCP port to a specific NUMA node or a group of NUMA nodes. Books Online has an article on How to: Map TCP/IP Ports to NUMA Nodes, and another BOL article discusses various NUMA affinity scenarios.

    Recently, I have been playing with NUMA affinity on various servers with hardware NUMA, such as those with Intel X5690 and Intel E7-4870 processors, running SQL Server 2008 R2 RTM (10.50.1600) and SQL Server 2008 R2 SP1 (10.50.2500), both Enterprise X64 Edition. Although the feature largely behaved as advertised, I did run into some surprises! In this post and a future post, I’d like to document what surprised me. Perhaps it’s just a misunderstanding on my part. But I want to throw it out there in case someone has a good explanation.

    When I was running OLTP workloads with maxdop set to 1, everything behaved exactly as documented. In other words, when I affinitized port 50001 to NUMA node 1 and connected all the clients to the SQL Server 2008 R2 instance on port 50001, I was able to verify that only the processors on node 1 were kept busy. In addition, I loaded all the data from the test table via the connection to port 50001 on the server, and was able to confirm that all the pages were loaded into the memory local to node 1. This was expected, as the NUMA node had 64GB of physical memory, far more than enough to fit the entire table, which was about 10GB in size. Moreover, I was pleased to see that there was only local memory access. Again, this was expected because all the connections from the OLTP workload driver were made on port 50001, which was affinitized to node 1.

    Then, I ran into two surprises. I’ll report a big surprise in this post and report the other one in a follow-up post.

    When I set the reporting query maxdop to 2, 10, and 20 (or anything other than 1), things started to go beyond what I could find in SQL Server Books Online. For instance, with the query maxdop set to 2, I found that SQL Server appeared to avoid the affinitized NUMA node rather than favor it. In all my tests, I consistently observed that the query submitted to port 50001 was processed by nodes other than the affinitized node (in this case, node 1). The following charts offer some samples of the supporting evidence.

    DOP2_NUMA_affinity_run_1

    In the above chart, option(maxdop 2) was set, and the processors on node 2 were observed to be busy. The Remote node page lookups/sec counter shows that all the page lookups were remote page lookups. That’s because the pages were cached in memory local to node 1, but the processors on node 2 were doing the work.

    The next chart shows a second run of the same query still with option (maxdop 2). In this case, SQL Server decided to process the query with the processors on node 3.

    DOP2_NUMA_affinity_run_2

    Similarly, the next two charts show that the query with option (maxdop 2) was processed on a node other than node 1.

    DOP2_NUMA_affinity_run_3

    DOP2_NUMA_affinity_run_4

    The point is that although the query was submitted via port 50001 and port 50001 was affinitized to NUMA node 1, SQL Server consistently chose not to use node 1 in processing the query. This does not seem to be the expected behavior, because nowhere in BOL can I find a statement that the NUMA affinity feature applies only to non-parallel plans. If you spot it in BOL, please let me know.

    I also tested other degrees of parallelism, and the story is the same. For instance, with maxdop set to 10, the following chart shows that all the 10 cores on node 2 were used when I’d expect SQL Server to use the 10 cores on node 1.

     

    DOP10_NUMA_affinity_run_1

    With maxdop set to 20, most of the cores, including the cores on node 1, were used, as shown in the following chart. Note that on node 1 it was local node page lookups, whereas on the other nodes it was all remote page lookups.

    DOP20_NUMA_affinity_run_1

    This behavior came as a big surprise. To me, this is a serious bug because it defeats the very purpose of the NUMA affinity feature. But hey, you never know. After all, how many times have you run into a bug only to be told it’s a feature by design? In this case, I have no idea. Hopefully, someone in the know can shed some light.

  • T-SQL stored procedure for finding/replacing strings in a text file. Really?

    I know people have been doing all sorts of things with T-SQL, and I have absolutely no issue with people trying to push the limit of what T-SQL can do, or what you can use it to accomplish, especially when it’s for demonstration or pedagogical purposes, or as an intellectual exercise. But then I bumped into an article on writing a T-SQL stored procedure to find and replace strings in a text file. That really unsettled me!

    Sure, when you are in a hurry and need to grab a tool--any tool--to get the job done, and T-SQL happens to be the tool immediately within reach, go right ahead. But that doesn’t mean T-SQL is the right tool for manipulating strings in a text file, or something worth recommending for that task.

    In my opinion, T-SQL is not the right tool because there exist numerous other tools that are far better at dealing with strings in a text file.

    Okay, what if T-SQL is the only way you know? Well, in that case I’m sorry! But why not invest some time in learning a text processing tool?

    On the other hand, I’m always wary about saying something is terribly bad just because I’m ignorant or because of my narrow experience. I’m curious to learn if there is a good case out there for a T-SQL stored procedure for finding/replacing strings in a text file.

  • Performance impact: SQL2008 R2 audit and trace

    We are told that SQL Server 2008 R2 Audit (and SQL Server 2008 Audit) has much less performance overhead than SQL Trace when we try to capture the same information. Knowing how SQL Server 2008 R2 Audit is implemented (i.e. on top of the extended events infrastructure), I’ve always taken that as a given and never bothered to check it out.

    Recently, I had to capture some object access information, and it turned out that SQL Audit was not the most convenient tool for the task. I had to go back to SQL Trace. There, the question of how much cheaper an audit is relative to an equivalent trace came up. I thought it was time to explore that a bit.

    I wanted to explore two questions:

    • In the most extreme case (i.e. when the processors are heavily loaded), how much degradation in transaction throughput should I expect if I turn on SQL Audit vs. if I turn on SQL Trace?
    • If the processors are moderately loaded, should I expect a performance degradation if I turn on SQL Audit or SQL Trace?

    The first question represents a worst-case scenario you should probably never run into. But it’s always good to know where the boundary is. The second question is a realistic and more common scenario.

    The processor load levels were controlled with my trusty TPC-C workload driver as follows:

    • Heavy CPU load. This was generated by 100 simulated users submitting the standard TPC-C readonly queries without any wait between any two consecutive calls. The total processor time on the 12-core (or 24-logical-processor) server was sustained at ~92%, and the transaction throughput (as measured by one of the TPC-C transactions) was steady at ~16,000 per second or about 32,000 batch requests per second on the SQL Server 2008 R2 instance.
    • Medium CPU load. This was generated by 100 simulated users submitting the standard TPC-C readonly queries with a 10ms wait between two consecutive calls. Compared to the previous load scenario, this otherwise rather small wait caused the total processor time on the server to stay at a very low level of about 2%. In a way, it may be more appropriate to label this a light load level. But the sustained transaction throughput was about 3,100 per second or 6,200 batch requests per second, which still kept SQL Server busy looking up pages.

    For each of the above two load levels, three conditions were tested:

    • No Audit/No Trace. In this case, I did not enable any SQL Audit session or run any SQL Trace. This was our baseline.
    • With Audit. In this case, I turned on a standard audit with a 1000ms queue delay, with the output sent to files capped at 500MB each. The database audit specification included all the SELECT statements on all the TPC-C user tables. No SQL Trace was enabled.
    • With Trace. A SQL Trace that was more or less equivalent to the audit described above was started, with the trace results sent to the same drive folder as the audit output. The trace files were also set to roll over to a new file every 500MB. No SQL Audit was enabled.

    The SQL script to create the audit and the database audit specification is listed below:

    USE [master]
    GO
    CREATE SERVER AUDIT [Audit]
    TO FILE
    (   FILEPATH = N'E:\'
        ,MAXSIZE = 500 MB
        ,MAX_ROLLOVER_FILES = 2147483647
        ,RESERVE_DISK_SPACE = OFF
    )
    WITH
    (   QUEUE_DELAY = 1000
        ,ON_FAILURE = CONTINUE
        ,AUDIT_GUID = '5501f383-4357-42f5-a9c4-c87a66ecc08b'
    )
    GO
    USE [sqlTestC]
    GO
    CREATE DATABASE AUDIT SPECIFICATION [DBAudit]
    FOR SERVER AUDIT [Audit]
    ADD (SELECT ON OBJECT::[dbo].[warehouse] BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[district]  BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[customer]  BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[history]   BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[new_order] BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[orders]    BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[item]      BY [public]),
    ADD (SELECT ON OBJECT::[dbo].[stock]     BY [public])
    WITH (STATE = ON)
    GO
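    One thing to note: as scripted above, the server audit object itself is created in the default OFF state (only the database audit specification is created with STATE = ON). Before anything is captured, the server audit has to be enabled as well, presumably with something like:

    -- The server audit must be enabled separately; CREATE SERVER AUDIT leaves it OFF.
    ALTER SERVER AUDIT [Audit] WITH (STATE = ON);
    GO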

    And the SQL script to create the SQL trace is pasted below:

    declare @rc int
    declare @TraceID int
    declare @maxfilesize bigint
    set @maxfilesize = 500
    -- option 2 = TRACE_FILE_ROLLOVER: roll over to a new file at @maxfilesize (MB)
    exec @rc = sp_trace_create @TraceID output, 2, N'E:\Audit_Trace', @maxfilesize, NULL
    if (@rc != 0) goto error
    declare @on bit
    set @on = 1
    -- event 45 = SP:StmtCompleted
    exec sp_trace_setevent @TraceID, 45, 7, @on
    exec sp_trace_setevent @TraceID, 45, 55, @on
    exec sp_trace_setevent @TraceID, 45, 8, @on
    exec sp_trace_setevent @TraceID, 45, 16, @on
    exec sp_trace_setevent @TraceID, 45, 48, @on
    exec sp_trace_setevent @TraceID, 45, 64, @on
    exec sp_trace_setevent @TraceID, 45, 1, @on
    exec sp_trace_setevent @TraceID, 45, 9, @on
    exec sp_trace_setevent @TraceID, 45, 17, @on
    exec sp_trace_setevent @TraceID, 45, 25, @on
    exec sp_trace_setevent @TraceID, 45, 41, @on
    exec sp_trace_setevent @TraceID, 45, 49, @on
    exec sp_trace_setevent @TraceID, 45, 10, @on
    exec sp_trace_setevent @TraceID, 45, 18, @on
    exec sp_trace_setevent @TraceID, 45, 26, @on
    exec sp_trace_setevent @TraceID, 45, 34, @on
    exec sp_trace_setevent @TraceID, 45, 50, @on
    exec sp_trace_setevent @TraceID, 45, 66, @on
    exec sp_trace_setevent @TraceID, 45, 3, @on
    exec sp_trace_setevent @TraceID, 45, 11, @on
    exec sp_trace_setevent @TraceID, 45, 35, @on
    exec sp_trace_setevent @TraceID, 45, 51, @on
    exec sp_trace_setevent @TraceID, 45, 4, @on
    exec sp_trace_setevent @TraceID, 45, 12, @on
    exec sp_trace_setevent @TraceID, 45, 28, @on
    exec sp_trace_setevent @TraceID, 45, 60, @on
    exec sp_trace_setevent @TraceID, 45, 5, @on
    exec sp_trace_setevent @TraceID, 45, 13, @on
    exec sp_trace_setevent @TraceID, 45, 29, @on
    exec sp_trace_setevent @TraceID, 45, 61, @on
    exec sp_trace_setevent @TraceID, 45, 6, @on
    exec sp_trace_setevent @TraceID, 45, 14, @on
    exec sp_trace_setevent @TraceID, 45, 22, @on
    exec sp_trace_setevent @TraceID, 45, 62, @on
    exec sp_trace_setevent @TraceID, 45, 15, @on
    -- event 41 = SQL:StmtCompleted
    exec sp_trace_setevent @TraceID, 41, 7, @on
    exec sp_trace_setevent @TraceID, 41, 15, @on
    exec sp_trace_setevent @TraceID, 41, 55, @on
    exec sp_trace_setevent @TraceID, 41, 48, @on
    exec sp_trace_setevent @TraceID, 41, 64, @on
    exec sp_trace_setevent @TraceID, 41, 1, @on
    exec sp_trace_setevent @TraceID, 41, 9, @on
    exec sp_trace_setevent @TraceID, 41, 25, @on
    exec sp_trace_setevent @TraceID, 41, 41, @on
    exec sp_trace_setevent @TraceID, 41, 49, @on
    exec sp_trace_setevent @TraceID, 41, 26, @on
    exec sp_trace_setevent @TraceID, 41, 50, @on
    exec sp_trace_setevent @TraceID, 41, 66, @on
    exec sp_trace_setevent @TraceID, 41, 3, @on
    exec sp_trace_setevent @TraceID, 41, 11, @on
    exec sp_trace_setevent @TraceID, 41, 35, @on
    exec sp_trace_setevent @TraceID, 41, 51, @on
    exec sp_trace_setevent @TraceID, 41, 4, @on
    exec sp_trace_setevent @TraceID, 41, 12, @on
    exec sp_trace_setevent @TraceID, 41, 60, @on
    exec sp_trace_setevent @TraceID, 41, 5, @on
    exec sp_trace_setevent @TraceID, 41, 13, @on
    exec sp_trace_setevent @TraceID, 41, 29, @on
    exec sp_trace_setevent @TraceID, 41, 61, @on
    exec sp_trace_setevent @TraceID, 41, 6, @on
    exec sp_trace_setevent @TraceID, 41, 14, @on
    declare @intfilter int
    declare @bigintfilter bigint
    -- column 10 = ApplicationName, comparison 7 = NOT LIKE: exclude the Profiler session itself
    exec sp_trace_setfilter @TraceID, 10, 0, 7, N'SQL Server Profiler - 95a2ce5b-6ec1-4d27-a667-fb86601af54a'
    exec sp_trace_setstatus @TraceID, 1   -- start the trace
    select TraceID=@TraceID
    goto finish
    error:
    select ErrorCode=@rc
    finish:
    go
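    For completeness, here is a sketch of how such a trace would be stopped and removed after a test run, looking up the trace id by the file path used above:

    DECLARE @TraceID int;
    SELECT @TraceID = id FROM sys.traces WHERE path LIKE N'E:\Audit_Trace%';
    EXEC sp_trace_setstatus @TraceID, 0;  -- stop the trace
    EXEC sp_trace_setstatus @TraceID, 2;  -- close it and delete its definition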
    

    The test results are summarized in the next two charts.

    audit_performance_heavy_load

    With all the CPUs on the server heavily loaded, turning on SQL Audit to log all the SELECT statements caused the transaction throughput to drop from 16,043 per second to 13,825 per second. That was about a 14% drop. Given how heavy the load was and how busy the traffic was on the server, that’s a relatively small drop. The impact of turning on the trace was quite a different story. There, the transaction throughput went from 16,043 per second to a mere 5,012 per second, a drop of nearly 69%. Put another way, with 5,012 as the base reference, turning off the trace increased the transaction throughput to more than three times that level. The difference in performance impact between SQL Audit and SQL Trace was striking.

    Again, I should note that this was an extreme case. In a real-world situation, the system most likely wouldn’t be this busy. So the 14% drop in throughput is probably the ceiling, and you should expect a much smaller adverse impact from turning on a similar audit on a real production server. The next chart shows the result of turning on an audit (and separately a trace) when we don’t beat up the server with non-stop transaction traffic. As you can see, by just giving the server a little bit of breathing room with a 10ms wait between calls, the load on the server was reduced significantly, and turning on either the audit or the trace had no impact at all on the throughput. Yes, the total CPU consumption went up a little as a result of turning on the audit or the trace, but the difference was rather insignificant.

    audit_performance_medium_load

    These test results are good news for the SQL Server 2008 R2 audit feature. Now, if only it could be made a bit more useful!

    The test environment was a DL360 G7 with two X5690 (6 core) processors and 64GB of RAM. Hyperthreading was enabled. The OS was Windows 2008 R2 Enterprise x64 Edition with SP1 and DBMS was SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600).

  • Performance impact: hyperthreading on Intel Westmere-EP processors (X5690)

    Recently, I have been looking into the performance impact of enabling hyperthreading on various platforms with various SQL Server workloads. All the results I have shared so far are from a DL580 G7 with four Westmere-EX (E7-4870) processors. Overall, the results of enabling hyperthreading have been positive for both the tested reporting queries and the tested OLTP queries, although I did run into one exception where a reporting query workload performed better without hyperthreading. In all cases, when the system was pushed really hard, hyperthreading was found to either increase the throughput or at least have no negative impact.

    On a different platform, will we see similar patterns? Or will we see hyperthreading behave differently?

    In this post, I’d like to share some results I obtained on a DL360 G7 with two 3.46GHz six-core Westmere-EP (aka Gulftown) processors. With two X5690 sockets, the server had 12 cores, and with hyperthreading (HT) enabled, the OS and SQL Server see 24 processors. I repeated all the tests I reported about previously, namely:

    • Standard TPC-C readonly queries. This is the OLTP query workload described in an earlier post. It is the workload that I found did not scale well on a four-socket DL580 G7. As discussed in Evaluating server hardware: a sign of the times, when running this test workload, SQL Server was not able to push all the processors sufficiently. It does not have the same problem with the two-socket DL360 G7.
    • Augmented TPC-C readonly queries. This is the OLTP workload described in the same post, that is the standard TPC-C readonly workload plus an additional SELECT query, which causes the server to do more work and therefore scales better with high core count.
    • Reporting queries - P1. This is the test workload described in Performance impact: hyperthreading for reporting queries. To distinguish this reporting query workload from the next one, I’ll refer to it in this post as Reporting queries – P1.
    • Reporting queries – P2. This is the test workload described in Performance impact: not all is better with hyperthreading. This test workload uses a query suggested by Serguei Tarassov in a comment to my post.

    Note that since the two-socket DL360 G7 platform is less powerful than the four-socket DL580 G7 platform I tested previously, it would not make sense to calibrate the tests exactly as before. I did not want to limit our test results to overly saturated load levels. Nor did I want to use a workload whose load level was too light, since our primary interest was to understand how hyperthreading affects the processing capacity of these systems.

    To find a good test range, I first ran the above-mentioned test workloads over a rather wide range of load levels, the results of which gave me an overall picture of how these systems would respond. The tests were then calibrated as follows:

    • Standard TPC-C readonly queries and Augmented TPC-C readonly queries. With the query DOP set to 8, I tested 1, 4, 8, and 16 concurrent query streams. With the query DOP set to 1, I tested 4, 8, 16, 32, and 64 concurrent query streams. 64 streams proved to be too heavy for the system under test; the test could not reliably run to completion, and the results had to be discarded.
    • Reporting queries – P1 and Reporting queries – P2. I tested 50, 100, 200, and 400 simulated users.

    The following two system configurations were used to help check the impact of enabling hyperthreading on the two-socket DL360 G7 platform:

    • with HT:       12 cores with hyperthreading enabled. The OS and the SQL Server instance see 24 processors.
    • without HT: 12 cores with hyperthreading disabled. The OS and the SQL Server instance see 12 processors.

    To summarize the test environment, the server was a DL360 G7 with two X5690 (6-core) processors and 64GB of RAM. The OS was Windows 2008 R2 Enterprise x64 Edition with SP1, and the DBMS was SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600). I’d have preferred to run these tests on a 10.50.2500 instance. But you test with what you have, not what you wish you had. In all cases, the test data was fully cached in the buffer pool before test measures were taken.

    For the reporting query workloads, the tran_stats table DDL and the data load script can be found in the attachment of this previous post. And the table was populated with 50 million rows.

     

    Standard TPC-C readonly queries

    hyperthreding_OLTP_TPC_Standard_queries

    Overall, the impact of enabling hyperthreading was positive. The highest impact was at 100 users, where hyperthreading resulted in a ~19% boost in transaction throughput.

     

    Augmented TPC-C readonly queries

    hyperthreding_OLTP_TPC_Augmented_queries

    Hyperthreading did not have any significant impact with the augmented TPC-C readonly workload.

     

    Reporting queries – P1 with DOP = 8

    HT_8DOP_reporting_queries_P1_X5690

    With 4, 8, and 16 query streams, hyperthreading had a significant positive impact, as high as a ~25% increase in query throughput. With a single query stream, however, the workload performed consistently better without hyperthreading.

    I looked at the single stream case more closely. With the help of CoreInfo.exe and Task Manager’s NUMA Node view, SQL Server and Windows were seen to consistently favor the processors within the same NUMA node when scheduling threads for intra-query parallelism. In fact, I did not see a single case where the threads processing this reporting query with DOP set to 8 were spread across the two nodes. Note that since there are six cores in each node, some of the threads would have to share a core. This scheduling strategy generally makes sense because the threads can share the L3 cache, and it may help reduce remote memory access. Apparently, for this particular workload, the shared L3 cache and local memory offered no advantage. Without hyperthreading, the threads were scheduled on different cores, and that appeared to offer better performance.
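    Incidentally, the placement of the parallel threads can also be confirmed from T-SQL while a query is running. A sketch (the session id below is a hypothetical placeholder for the session running the reporting query):

    DECLARE @spid int = 55;  -- session id of the query under observation
    SELECT t.session_id,
           t.scheduler_id,
           s.parent_node_id,   -- NUMA node the scheduler belongs to
           s.cpu_id
    FROM sys.dm_os_tasks t
    JOIN sys.dm_os_schedulers s ON s.scheduler_id = t.scheduler_id
    WHERE t.session_id = @spid;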

     

    Reporting queries – P1 with DOP = 1

    HT_1DOP_reporting_queries_P1_X5690

    When the reporting queries were not processed with parallelism (i.e. DOP set to 1), the impact of enabling hyperthreading was generally positive, especially at a higher load level.

     

    Reporting queries – P2 with DOP = 8

    HT_8DOP_reporting_queries_P2_X5690

    The results here were mixed. At lower load levels (i.e. when the number of query streams was 4 or 8), higher query throughput was achieved without hyperthreading. At the higher load levels, however, the workload saw higher throughput with hyperthreading enabled. I do not have any explanation for this pattern. But I should reiterate that I did repeat the tests several times, and the pattern was consistent.

     

    Reporting queries – P2 with DOP = 1

    HT_1DOP_reporting_queries_P2_X5690

    When the reporting queries were not processed with parallelism, the impact of enabling hyperthreading was generally positive, especially at a higher load level. This exactly mirrors the observation with reporting queries – P1 when DOP was set to 1.

     

    16 streams with DOP set to 16

    What about the case I reported in the previous post, where the reporting test workload had 16 streams running queries with DOP set to 16, and hyperthreading was seen to have a negative impact?

    I did repeat that test on this two-socket DL360 G7 platform. And it turned out that the impact of enabling hyperthreading was different. As shown in the following chart, hyperthreading resulted in a ~14% increase in throughput.

    HT_16DOP_16streams_reporting_queries_P2_X5690

     

    Summary

    In the majority of the cases I studied, the impact of hyperthreading was either positive or negligible. In particular, I did not see a single negative case (so far) when the workloads were OLTP queries, when DOP was set to 1 for reporting queries, or when the system under test was heavily loaded. There were a few reporting-query cases where performance was better with hyperthreading disabled. It is worth noting that the same workload can indeed behave differently on different platforms with respect to the hyperthreading impact.

    I’d suggest the following:

    • Test your SQL workload with and without hyperthreading. This sounds like a platitude, and it is. But then, you see the test results.
    • If you can’t test your SQL workload (and in many cases you simply can’t, or can’t afford to, perform this type of controlled test), or you have no idea what your SQL workload may look like, by default I recommend enabling hyperthreading (for Westmere-based processors anyway, and to stretch it a bit, for the recent Nehalem-based processors).
  • Performance impact: not all is better with hyperthreading

    In the comments to my previous post on the performance impact of enabling hyperthreading on reporting queries, Serguei Tarassov indicated that it might be interesting to try different reporting queries, and suggested a specific parameterized test query. I happened to have some free time over this past long weekend, and I was curious to see how a different workload might behave: differently, similarly, or the same. So I modified the workload driver client I wrote earlier to work with Serguei’s query, ran the tests, and found a few interesting things in the process that I thought I should share.

    Although I ran the tests for various numbers of concurrent streams and various degrees of parallelism, for this post I’ll focus on the following two workloads:

    • Workload 1: 32 concurrent query streams with maxdop set to 32 for the test query (actually a stored procedure called p_report_32)
    • Workload 2: 16 concurrent query streams with maxdop set to 16 for the test query (actually a stored procedure called p_report_16)

    Obviously, Workload 2 was lighter than Workload 1. Two test environments were examined with the test workloads:

    • with HT:       40 cores with hyperthreading enabled. The OS and the SQL Server instance see 80 processors.
    • without HT: 40 cores with hyperthreading disabled. The OS and the SQL Server instance see 40 processors.

    The tran_stats table and the test environment were exactly the same as described in the previous post except that the tran_stats table was populated with 50 million rows (as opposed to 200 million rows) and that tran_stats was the only table used in this test.

    The code for p_report_16 and the code for p_report_32 are identical except for their maxdop settings:

    CREATE PROCEDURE dbo.p_report_16
          @tran_begin datetime
    WITH RECOMPILE
    AS
    SET NOCOUNT ON;
    select sum(cnt) as cnt_sum from
    (SELECT COUNT(1) as cnt,
             t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
        FROM tran_stats t1 INNER JOIN
             (SELECT AVG(resultset_size) AS avg_resultset_size, server_config
                FROM tran_stats
              GROUP BY server_config
             ) t2
             ON t1.server_config = t2.server_config AND
                t1.resultset_size <= t2.avg_resultset_size
       WHERE t1.tran_begin BETWEEN @tran_begin AND DATEADD(second, 1, @tran_begin)
       GROUP BY t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
    ) as v
    option (maxdop 16)
    go

    CREATE PROCEDURE dbo.p_report_32
          @tran_begin datetime
    WITH RECOMPILE
    AS
    SET NOCOUNT ON;
    select sum(cnt) as cnt_sum from
    (SELECT COUNT(1) as cnt,
             t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
        FROM tran_stats t1 INNER JOIN
             (SELECT AVG(resultset_size) AS avg_resultset_size, server_config
                FROM tran_stats
              GROUP BY server_config
             ) t2
             ON t1.server_config = t2.server_config AND
                t1.resultset_size <= t2.avg_resultset_size
       WHERE t1.tran_begin BETWEEN @tran_begin AND DATEADD(second, 1, @tran_begin)
       GROUP BY t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
    ) as v
    option (maxdop 32)
    go

    Every time the proc (p_report_16 or p_report_32) was called in the workload driver, the input parameter value was picked randomly between min(tran_begin) and max(tran_begin).
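    The actual parameter pick was done client-side in the workload driver, but the logic amounts to something like the following T-SQL sketch:

    -- Pick a random tran_begin within the data range and call the proc with it.
    DECLARE @min datetime, @max datetime, @tran_begin datetime;
    SELECT @min = MIN(tran_begin), @max = MAX(tran_begin) FROM tran_stats;
    SET @tran_begin = DATEADD(second,
                              CAST(RAND() * DATEDIFF(second, @min, @max) AS int),
                              @min);
    EXEC dbo.p_report_16 @tran_begin = @tran_begin;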

    Each stream of the stored proc calls was submitted from a separate thread in the workload driver without any wait between two consecutive calls. All streams started at the same time, and each stream ran for 20 minutes. The begin time and end time of each call were recorded on the client side and loaded into a table at the conclusion of the test for analysis. The total number of queries processed in this 20-minute test run was tallied, and the queries per hour metric was then calculated.

    Workload 1: 32 concurrent streams running p_report_32

    The following chart summarizes the result:

    hyperthreading_32users_32dop

    The chart shows that with this workload (i.e. 32 concurrent streams of the queries with maxdop set to 32), the 40-core server achieved higher query throughput with hyperthreading enabled than it did with hyperthreading disabled. Using the configuration without hyperthreading as the baseline, the throughput improvement was about 7%, that is (5848-5472)/5472.

    I didn’t expect to see the same or even similar numbers as we saw in the previous tests. But at least the result still shows that the performance impact of hyperthreading is positive.

    Workload 2: 16 concurrent streams running p_report_16

    The result of Workload 2 is a bit surprising and very different from what we have seen so far:

    hyperthreading_16users_16dop

    I had to re-run the workload several times to ensure that the result highlighted in the above chart was stable and repeatable. And it was! The 40-core server was able to support a much higher query throughput without hyperthreading than with it. If we use the throughput with hyperthreading as the baseline (4,460 queries per hour), disabling hyperthreading resulted in a ~23% gain, to a throughput of ~5,482 queries per hour!

    I do not have a root cause explanation for this difference. I did look around, but did not come up with anything. However, I should reiterate that I did repeat the test several times, and the pattern was consistent.

    Now, since I had the execution statistics of each call logged in a table, I looked at the response times more closely and found some interesting patterns. Take a look at the following summary table (the response times are in seconds):

    hyperthreading_p_report

    In the case of Workload 2 (16 streams with maxdop=16), when hyperthreading was enabled, the query response time fluctuated much more wildly than when it was disabled. The response time standard deviation was 0.56 without hyperthreading, compared to 7.56 with hyperthreading enabled.
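    The summary statistics themselves are straightforward to compute once the client-side log is loaded into a table. A sketch with hypothetical names (dbo.call_log, call_begin, and call_end stand in for whatever the log table actually looks like):

    SELECT COUNT(*)                                           AS calls,
           AVG(DATEDIFF(ms, call_begin, call_end) / 1000.0)   AS avg_response_sec,
           STDEV(DATEDIFF(ms, call_begin, call_end) / 1000.0) AS stdev_response_sec
    FROM dbo.call_log;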

    With both workloads, all the processors (as seen in the OS) were pushed to 100% or nearly 100% with or without hyperthreading.

    One more thing to note is that I did not test the above two workloads in isolation. Instead, as a standard practice, I’d run tests to cover a full spectrum of the load levels so that I can see the overall trends and patterns first before I zero in on certain specific scenarios. In this case, I did the same and then focused on the above two workloads. So what’s the story with the other workloads?

    With 32 concurrent query streams, the 40-core server consistently performed better with hyperthreading than without for maxdop = 16, 20, and 32. With 16 concurrent streams, however, the results were mixed. With maxdop set to 16 and 20, the server performed better without hyperthreading. With maxdop set to 32, the 16-stream workload performed better with hyperthreading enabled.

    Finally, I should point out that although the query plan was largely stable, it was possible to get a different plan when the input parameter value was near the extreme upper end. But since the number of calls was sufficiently high (1,400~1,900 total calls for each test run), I do not expect the change in the query plan to have had a systematic impact on the test outcome.

    I’ll upload the workload driver program with all the source code next time so that you can try it out in your own environment. I need to write up a little readme and can’t do it in this post.

    The test environment was a DL580 G7 with four E7-4870 (10 core) processors and 264GB of RAM. The OS was Windows 2008 R2 Enterprise x64 Edition with SP1 and DBMS was SQL Server 2008 R2 Enterprise x64 Edition (10.50.2500). The tran_stats table DDL and the data load script can be found in the attachment of this previous post.

  • Performance impact: driving up context switches/sec

    Too many context switches per second are considered bad for your database performance. But how many is too many has never been clear. With the core count of new servers going up rapidly, it has become even less clear how we should evaluate this counter to understand SQL Server behavior in the environments we support. Recognizing that any attempt to rehash what is already said/recommended out there would more likely be a disservice than a service, I’d like to look at it from a different angle, and hopefully contribute to its understanding with some data points.

    Personally, I subscribe to the belief that one of the best ways to understand a behavior is to be able to create and manipulate the behavior on demand. And it naturally follows to ask: how can we drive up the value of the System\Context Switches/sec counter with a SQL Server workload?

    Knowing how SQL Server schedules its tasks, I’d expect to be able to drive up context switches/sec by running a lot of very small tasks.

    And that is indeed the case. Here is how it goes.

    I first create two stored procedures that basically do nothing on the server side. These are just null transactions. (By the way, the parameters in these procs don’t mean anything. They are there because the client program I use expects them, and I’m too lazy to modify the client program. Plus, modifying the client code would add absolutely no value.)

    CREATE PROC spStockLevel    @w_id          int,
                                @d_id          tinyint,
                                @threshold     smallint
    AS
         SET NOCOUNT ON
         SELECT 5
    GO

    CREATE PROC spOrderStatus   @w_id  int,
                                @d_id  tinyint,
                                @c_id   int,
                                @c_last char(16) = ''
    AS
       SET NOCOUNT ON
       SELECT 1, 'Jones', 'John', 'M', '2012-01-01', 2, 21, 2
    GO
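    Each simulated user just calls the two procs back to back; the parameter values are irrelevant since the procs ignore them. Minus the looping and threading, which live in the client driver, each iteration amounts to:

    EXEC dbo.spStockLevel  @w_id = 1, @d_id = 1, @threshold = 10;
    EXEC dbo.spOrderStatus @w_id = 1, @d_id = 1, @c_id = 1;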

    Then, I simulate 200 concurrent users by starting 200 threads on a client, with each thread calling these two procs in an infinite loop with no wait between the calls. The following chart shows the sustained values of Context Switches/sec on a DL580 G7 with four E7-4870 processors when different numbers of cores are enabled. In all the cases, hyperthreading is enabled. And note that each E7-4870 has 10 cores.

    Context_switches

    With this approach, the value of the Context Switches/sec counter is driven into the 200,000 to 250,000 per second range. These are pretty high numbers. I have no idea whether they can be driven even higher with a different approach. But I know that I have not seen the counter approach this level in any real production environment. If you have, let us know what kind of values you have seen and with what kind of workload.

    I should also report that this null transaction workload fails to push the total processor time very high. See the following chart:

    CPU_time_privilegded_time

    The maximum %Processor Time (_Total) that can be reached by this workload is ~24%. And this is the case not only across the different core counts, but also no matter how many concurrent users (threads) are submitting the transactions.

    It is worth noting, and it is evident from the chart, that the %Privileged Time (_Total) accounts for a very large percentage of the %Processor Time (_Total). In a real production environment, this would spell trouble. With this null transaction workload, I don’t know whether this should be expected and is by design, or something is not behaving properly and the %Privileged Time should be much lower. But I do know that when the transactions are actually doing something useful (e.g. by including some non-trivial SELECT statements), we’ll see the %Privileged Time (_Total) value go down rather quickly. For instance, with the workload used in this previous post, the %Privileged Time (_Total) is typically around 1% while %Processor Time (_Total) is near 100%. And with that workload (which is doing a lot more useful work in its transactions), the Context Switches/sec counter is typically observed to be less than 49,000.

    How useful are these data points? I’m not really sure. Hey, at least we know that this particular workload can drive up the Context Switches/sec counter. And if this starts a conversation, it would be a plus.

  • Performance impact: hyperthreading for OLTP queries -- II

    This is in part a response to a comment by Paul White (@SQL_Kiwi) to my previous post on the performance impact of enabling hyperthreading (HT) on OLTP queries, and in part due to my desire to capture a more complete set of test data for future investigation on this very topic. I’m posting below the results of re-running the same exact test as described in the previous post but with the SQL Server instance bumped up to build 10.50.2500 from 10.50.1600. The former is SQL Server 2008 R2 with Service Pack 1, whereas the latter is SQL Server 2008 R2 RTM.

    In addition, I have included the core count as a formal test parameter, and tested the performance impact of enabling HT at the following core counts: 16, 20, 24, 32, and 40. The core count was controlled through the BIOS on machine reboot. The exact BIOS feature for controlling the number of cores is primarily under System Options –> Processor Options –> Enhanced Processor Core Disable (Intel Core Select). This allows one to enter the number of cores to be enabled per socket.
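    After each reboot, it is easy to confirm from T-SQL what the OS and SQL Server actually see:

    SELECT cpu_count,          -- logical processors visible to SQL Server
           hyperthread_ratio,  -- logical processors per physical socket
           scheduler_count
    FROM sys.dm_os_sys_info;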

    The results are as follows:

    HT_cores_E7-4870_SQL2008_10.50.2500

    In the previous post, enabling HT was reported to increase the throughput by 5~7.5%. In the above chart, the increase in throughput varies between 9% and 14%. So it may appear that SQL Server 2008 R2 SP1 responds slightly better to enabling HT than SQL Server 2008 R2 RTM does. However, because there is always a margin of error in any test, I’d assign more significance to the overall trends and patterns than to the exact numbers. And in both cases, the overall trends and patterns are similar in that this particular OLTP workload responded positively to enabling HT on the E7-4870 processors.

  • Elephant in the room: TPC-E

    Over the past few months, on several semi-formal occasions, I ran into folks from the well-known vendors (minus Microsoft). Some of them were from the vendors’ performance labs and were involved in conducting benchmark tests and publishing benchmark results. So naturally they were very happy and eager to talk about performance, I mean any database performance, until the topic of TPC-E came up. They might not have been squirming. But it was hard to miss that they really didn’t want to talk about it or go anywhere near it, especially not on any sort of record.

    That just felt weird!

    On one hand, you have this industry-standard OLTP database benchmark that all these firms, as members of the Transaction Processing Performance Council, were on board with when it was officially approved and released in February 2007. And it was supposed to eventually replace TPC-C. On the other hand, for all these years Microsoft has been the only database platform vendor to release any TPC-E benchmark results.

    It has been four years since the benchmark’s release, and I don’t think anyone can still claim they are evaluating it and trying to work out the wrinkles. So there must be very solid reasons for the other DBMS vendors not to touch it. And yet no one wanted to discuss those reasons publicly. Sure, you hear things through the grapevine. But the silence from the official channels is just deafening!

    Or have I missed something really obvious?

  • Performance impact: hyperthreading for OLTP queries

    My previous post focuses on the performance impact of enabling hyperthreading (HT) on reporting queries, on a machine with four Intel Westmere-EX processors. Let’s turn our attention to OLTP queries.

    To oversimplify it, reporting queries are generally processed by scanning a large number of pages, whereas quick index seeks are the hallmark of OLTP queries.

    The OLTP queries used to check out the hyperthreading impact are the two TPC-C read-only transactions (Order Status and Stock Level), slightly modified to work properly in the test environment. The need to modify these two TPC-C transactions is discussed in an earlier post. Briefly, in their original forms they don’t scale properly on this 80-cpu test machine running SQL Server 2008 R2. You can find the modified code of these two transactions in the attachment of this post.

    The benchmark described in that same post is used to drive the tests whose results are reported here. If you want to find out more about the benchmark, please check out that post; I won’t duplicate its description here.

    Essentially, with this OLTP benchmark (or workload) I want to see how it behaves when I increase the load level by increasing the number of concurrent users, with and without hyperthreading on a DL580 G7 with four E7-4870 processors (10 cores per processor) and 264GB of RAM. The OS is Windows 2008 R2 Enterprise x64 Edition with SP2 and DBMS is SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600). There is no intra-query parallelism; maxdop is set to 1 in all the cases.
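
    By the way, nothing here hinges on how maxdop is set; one way to pin it instance-wide is via sp_configure (a minimal sketch, assuming instance-level configuration rather than per-query hints):

    -- 'max degree of parallelism' is an advanced option, hence the first step
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 1;  -- no intra-query parallelism
    RECONFIGURE;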

    Four test configurations are examined:

    • 40 cores with HT. That is 80 logical cpus. HT is enabled in the BIOS.
    • 40 cores without HT. That is 40 logical cpus. HT is disabled in the BIOS.
    • 20 cores with HT. That is 40 logical cpus. HT is enabled in the BIOS.
    • 20 cores without HT. That is 20 logical cpus. HT is disabled in the BIOS.

    The following chart shows the results when 200 simulated users are driving the workload concurrently. Note that with 200 users, all the processors on this test system are pushed to ~100%.

    hyperthreding_OLTP_queries

    Compared with the results for the reporting queries (see the previous post), the performance gain from enabling HT on the 40 cores is much more marginal with this OLTP benchmark. We see an increase of about 5%, from ~7900 batch requests per second to ~8300 batch requests per second. Enabling HT on the 20 cores produces an improvement of similar magnitude (about 7.5%).

    Going from 20 cores with HT to 40 cores without HT, however, gives the OLTP queries a dramatic throughput gain of about 67%, from ~4730 batch requests per second to ~7900 batch requests per second. We saw similar behavior with the reporting queries. Knowing how HT works, I think it is intuitive to expect a big gain, whether or not it should be exactly 67%.

    So for both the reporting queries and the OLTP queries that I have tested, the hyperthreading results are positive. So far so good!

  • Performance impact: hyperthreading for reporting queries

    There are a lot of questions on hyperthreading, but not a lot of answers. There is no shortage of opinions, but very few are based on significant first-hand experience or solid test data.

    We know that the hyperthreading technology in the older generations of the Intel processors was not well received by the SQL Server community. Hyperthreading in the newer generations of Intel’s Nehalem-based Xeon processors, however, is decidedly better implemented, and it appears to be much better received by the SQL Server community.

    I have done some tests with the Intel Westmere-EX processors, exploring how their hyperthreading feature may influence various types of SQL Server workloads. And my experience from these tests is generally positive. In fact, I have not seen any significantly negative performance impact that could be directly attributed to enabling hyperthreading. I plan to post some of my empirical observations. The intent is not to settle any issues. Rather, in keeping with my own tradition, it is to contribute some data points to this very important subject.

    In this post, I focus on the performance impact of enabling hyperthreading on reporting queries. Since there is a large number of variables that could heavily influence the results of any hyperthreading related test, quite a bit of attention is paid to ensure repeatability of the test results.

    The tests are set up as follows:

    • The tests use a single common query (shown below). The focus is not on the single query performance, but on how streams of this query perform.
    • There are two key test variables:
      • Degree of parallelism. It is controlled with the option (maxdop) clause appended to the query.
      • Number of concurrent query streams. It is controlled through a C# program that spawns that many threads, each of which drives the query in a serial loop for a configurable amount of time.
    • There are three hyperthreading (HT) test configurations:
      • 40 cores with HT. That is 80 logical cpus. HT is enabled in the BIOS.
      • 40 cores without HT. That is 40 logical cpus. HT is disabled in the BIOS.
      • 20 cores with HT. That is 40 logical cpus. HT is enabled in the BIOS.
    • The query execution plans are logged by the test driver and verified to be constant across each test run and each query execution.
    • The performance statistics of each query execution, regardless of its maxdop setting or which stream it is in, are captured by the test driver and loaded into a table for analysis and overall metric calculation.
    • The test database consists of a 200-million-row main table tran_stats, exactly as described in this previous post. The SQL scripts for creating and populating this table are in the attached zip file. The test database also includes a dimension table which basically is the #tmp table in this same previous post. The scripts to create and populate this dimension table are also included in the attached zip file.
    • The test results are summarized with a query processing throughput metric, Queries per Hour, representing the number of queries processed in an hour for the given test configuration (a sketch of this calculation follows this list).
    • All tests are run while the data are fully cached. The storage subsystem is not a factor in these tests.
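
    For illustration, here is a minimal sketch of how such a Queries per Hour metric can be computed from the captured per-execution statistics; the table name and columns below are hypothetical stand-ins for what the test driver actually loads:

    -- Hypothetical table: one row per query execution, captured by the driver
    CREATE TABLE query_stats
    (
        test_config varchar(50),  -- e.g. '40 cores with HT'
        maxdop      int,          -- maxdop setting of the run
        streams     int,          -- number of concurrent query streams
        start_time  datetime,
        end_time    datetime
    );

    -- Queries per Hour: executions completed, normalized to an hourly rate
    SELECT test_config, maxdop, streams,
           COUNT(*) * 3600.0
             / NULLIF(DATEDIFF(second, MIN(start_time), MAX(end_time)), 0)
             AS queries_per_hour
      FROM query_stats
     GROUP BY test_config, maxdop, streams;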

    Essentially, with this test workload we want to see how it behaves when we increase the load level by increasing the number of concurrent query streams for a given maxdop setting, with and without hyperthreading on a DL580 G7 with four E7-4870 (10 core) processors and 264GB of RAM. The OS is Windows 2008 R2 Enterprise x64 Edition with SP2 and DBMS is SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600).

    The main test query (without the maxdop option clause) is as follows:

    select COUNT(*) from (
       select id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus,
              COUNT(*) as cnt
         from tran_stats t1 join dimension t2
           on t1.server_config = t2.server_config and
              t1.thread_name   = t2.thread_name   and
              t1.tran_type     = t2.tran_type     and
              t1.tran_status   = t2.tran_status   and
              t1.users         = t2.users         and
              t1.cpus          = t2.cpus
        group by id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus
    ) as v
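
    To be explicit about how the maxdop test variable is applied: the option clause is simply appended to the end of the query text above. Here is the same mechanism on a trivial query, for illustration:

    -- The maxdop hint caps the degree of parallelism for this query only
    select COUNT(*) from tran_stats option (maxdop 2)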

    I picked this particular query because it is a rather plain reporting query. Nothing exotic, and something you’d see often in the real world. I also happened to be playing with it recently for other purposes.

    By the way, in the spirit of full disclosure, if there is interest in the actual test driver program, I can post it here.

    So, what do the results look like?

    40 cores with HT vs. 40 cores without HT

    For maxdop = 2, the query throughput results are summarized in the following chart.

    maxdop2_40cores_HT_vs_40cores

    Most notably, when the load level is stressed with 64 concurrent streams, the system manages to push through ~478 queries per hour with HT. Compare that to ~374 queries per hour without HT at the same load level. That is about a 28% increase in processing power (for this specific configuration, always remember that!). That’s a substantial improvement.

    Also note that for streams 1 through 16, there is little to no difference whether or not HT is enabled. That is expected. After all, if the queries are not competing for processors, whether you have 40 or 80 processors should not make a big difference. This also means that if you just run your test with a single query, you probably won’t see any difference, one way or another. And if you draw a conclusion based on that single query result, the conclusion is likely to be misleading.

    The next three charts show the results for maxdop=4, 8, and 20, respectively. The results look similar to those for maxdop=2 in terms of the general trend and pattern, though the extent of the performance improvement varies. The improvement is generally between 8% and 25%.

    maxdop4_40cores_HT_vs_40cores

    maxdop8_40cores_HT_vs_40cores

    maxdop20_40cores_HT_vs_40cores

    40 cores without HT vs. 20 cores with HT

    In this case, both appear as 40 logical processors at the OS level and to the SQL Server instance, except that one has 40 cores and the other has 20. So which configuration should have better performance capacity?

    You probably would pick 40 cores without HT. And you’d be correct. The following charts are the supporting evidence.

    maxdop2_40cores_vs_20cores_HT

    maxdop4_40cores_vs_20cores_HT

    maxdop8_40cores_vs_20cores_HT

    maxdop20_40cores_vs_20cores_HT

    When pushed, 40 cores without HT outperforms 20 cores with HT substantially, by as much as ~48%.

    So overall in these tests, the performance impact of hyperthreading on the query throughput is positive, and in some cases very substantial.

    To reiterate, I’d be careful about extrapolating the observations here, if at all. Hopefully, though, as evidence accumulates from the community, we can gain a better comfort level with how to use this feature effectively.

  • Evaluating server hardware: a sign of the times

    For nearly ten years, I have had success in using a specifically modified version of the TPC-C benchmark for evaluating server hardware running SQL Server.

    Depending on your purpose, you can evaluate computer hardware in numerous ways with numerous benchmarks. Arbitrarily picking a generic benchmark to evaluate a server is of no interest to me. Rather, I’m interested in a piece of hardware only in terms of how SQL Server runs on that hardware or that specific configuration of hardware. Now in this day and age, you never really evaluate just the hardware. No matter how you approach it, you always end up evaluating a complete system under test, and that includes the hardware, the operating system, and the server software of interest, which in my case is SQL Server. You get to infer the hardware performance by trying to keep the other components in the system under test constant. But even that may not be feasible or desirable at times. Often, you are really just interested in the performance of the entire system under test.

    To have any meaningful results to talk about and a meaningful frame of reference, at the bare minimum you want to keep your test workload constant across the systems under test. In many of my server hardware evaluation efforts, the modified TPC-C benchmark mentioned earlier is what remains constant, and it is the constant frame of reference with which I compare various server hardware models and configurations. The good news is that the evaluation data I have collected over the years show this particular test benchmark to have both predictive and discriminating power.

    It’s predictive in that the results I get by running this benchmark on new processors or server models correlate well with the processing power seen by actual SQL Server production workloads. This of course does not mean that every production workload will see what the benchmark sees. After all, if your workload is strictly disk I/O bound, this benchmark has zero predictive power. But as an engineering tool, when you must make a prediction before you have a chance to test all your production workloads against a specific new server model/configuration (which, by the way, is never feasible), the benchmark feeds a useful set of data points into the decision-making process.

    The chart below shows an example of the benchmark’s predictive power. It contains real data from running the benchmark on a set of real servers (many of which were commonly used in the datacenters for running SQL Server), but with the server identities scrambled.

    benchmark_results

    Knowing how these servers generally behave, performance wise, with real production workloads, I can also interpret the above chart as an indication of the benchmark’s discriminating power in that the benchmark is able to tell these servers apart. But more than that, in repeated blind tests where we ran the same benchmark against two servers that were known to be different, with that fact withheld from the test administrator, the benchmark results consistently and correctly identified which server was the more powerful.

    Most importantly, however, the benchmark is practically useful beyond being a mere engineering tool. I was often called into cases where the application users complained that their databases appeared to behave differently on the two nodes within a cluster, even though the nodes were identical in terms of their hardware configuration. With this benchmark, I was able to determine whether the nodes were indeed different and by how much (as measured by the benchmark, of course), when other generic hardware benchmark tools and system diagnostics tools failed to tell the difference.

    Now you may ask how I could be so sure about what the benchmark was telling us. Well, that is not an easy question to answer in many cases. The beauty of this benchmark is that it’s relatively quick to run, and there are shortcuts to make it even more efficient to administer. Because of that, we could afford to go through a process of elimination in some cases to actually zero in on the root cause. For an example, see what I reported earlier on identifying a bad memory module.

    So what exactly is this modified TPC-C benchmark? It’s the following (at the risk of repeating myself because I did mention it several times in my previous posts):

    1. It uses a TPC-C compliant database populated for 100 warehouses. Note that the number of warehouses is the scaling factor in TPC-C. You control the size of the database by specifying the number of warehouses for the benchmark. That then determines how many rows each table would have and so on. In practice, a TPC-C database scaled to 100 warehouses is about 10GB in size in terms of its data and indexes.
    2. Its client program (i.e. the workload driver) calls only two out of the five official TPC-C transactions, and these two transactions are read-only. They are the Order Status transaction and the Stock Level transaction.
    3. The two transactions are called in a 50/50 split. In other words, before each call to the database, the client uses a random number generator to produce an integer between 1 and 100 in an even distribution; if the number is between 1 and 50, the Order Status transaction is called, and if the number is between 51 and 100, the Stock Level transaction is called. (A T-SQL sketch of this mix follows the list.)
    4. Unlike an official TPC-C benchmark, it doesn’t allow any wait time between two consecutive database calls. This of course makes the load much heavier than that of most real-world apps, but it makes it efficient to load up the server.
    5. I only take measurements when all the data pages are cached and no disk I/O traffic is observed. This effectively takes the storage subsystem out of the picture.
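
    To make item 3 concrete, here is a minimal T-SQL sketch of the 50/50 mix. The actual driver is a client program, and dbo.spOrderStatus with its parameters is a hypothetical stand-in for the Order Status transaction; dbo.spStockLevel is shown later in this post:

    -- A sketch of the 50/50 transaction mix; parameter values are placeholders
    DECLARE @r int = CAST(RAND() * 100 AS int) + 1;  -- uniform integer between 1 and 100
    IF @r <= 50
        EXEC dbo.spOrderStatus @w_id = 1, @d_id = 1, @c_id = 1;       -- Order Status (hypothetical name)
    ELSE
        EXEC dbo.spStockLevel  @w_id = 1, @d_id = 1, @threshold = 15; -- Stock Level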

    And that is it!

    Life is good, eh?

    Well, the bad news is that it doesn’t scale beyond a certain number of processors. This was never a problem until the 10-core processors came along. Lately, I’ve found this particular benchmark failing to fully utilize all the processor power on machines with a large number of processors, most notably those with 40 or more. Take a look at the following chart, which shows the maximum total CPU usage that can be obtained with the benchmark on a server with four Intel Xeon E7-4870 processors:

    total_cpu_usage

    Note that each of the E7-4870 processors has 10 cores, and each core can have hyperthreading enabled. So the total number of processors on the machine visible to Windows is 80. To help demonstrate the problem, I looked into four configurations in more detail:

    1. 20 cpus: Disabled half the cores and disabled hyperthreading from the BIOS
    2. 40 cpus: Disabled half the cores and enabled hyperthreading from the BIOS. That is, 20 cores with each core having two threads
    3. 40 cpus: Enabled all the cores and disabled hyperthreading from the BIOS. That is, 40 cores with each core having a single thread
    4. 80 cpus: Enabled all the cores and enabled hyperthreading from the BIOS. That is, 40 cores with each core having two threads

    When the server is configured to have 20 logical processors (20 cores without hyperthreading), the benchmark has no problem driving all the processors to 100%, and this is expected because all the pages are already cached and SQL Server should not be waiting for any resource.

    With 40 cpus, either 40 cores without hyperthreading or 20 cores with hyperthreading, the benchmark fails to drive all the processors to 100% no matter how many client users (i.e. threads) are fired up to submit the transactions. This is not expected. Obviously, SQL Server is not spending all its time doing useful work in this case. Checking the usual suspects (e.g. wait stats, latch stats, and spinlocks) reveals no clue as to what it may be waiting on. SQL Server doesn’t seem to expose any instrumentation beyond what it considers to be resources, and it’s beyond me to profile how SQL Server may be spending its time at a lower level, which I’m sure would shed light on what’s going on.
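
    For reference, by the usual suspects I mean checks along the following lines (sys.dm_os_spinlock_stats is undocumented on this build, but it is there):

    -- Top waits by accumulated wait time
    SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
      FROM sys.dm_os_wait_stats
     ORDER BY wait_time_ms DESC;

    -- Top latch classes by accumulated wait time
    SELECT TOP (10) latch_class, waiting_requests_count, wait_time_ms
      FROM sys.dm_os_latch_stats
     ORDER BY wait_time_ms DESC;

    -- Busiest spinlocks
    SELECT TOP (10) name, collisions, spins, backoffs
      FROM sys.dm_os_spinlock_stats
     ORDER BY spins DESC;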

    The 80 cpus case is the most interesting. Although the maximum total CPU usage is reported to be around 50% in the chart, often I can only drive the total usage to around 22%. When SQL Server does manage to use 50% of the total CPU time, all the processors are observed to be active, and the load is more or less evenly distributed across them. When the benchmark can’t drive the total CPU usage above 22%, half the processors show no activity in Task Manager. There is obviously a scheduling issue with SQL Server in this case. I have no idea exactly what triggers it.

    Whatever the underlying problem may be, the benchmark as configured is not effective with machines that have a large number of processors (say 40 cores or more).

    Fortunately, it’s not difficult to modify the benchmark so that it scales well with 80 processors. One modification is to have each transaction do more work than it currently does. Remember: I’m not doing any official TPC-C work, and I don’t publish any official TPC-C benchmark results. All I care about is using it as a tool to help evaluate SQL Server running on some hardware configuration. So to that end, I can change it in any way or shape I fancy, as long as I get useful results for what I want to accomplish.

    Take the Stock Level transaction as an example. Its code is very simple and is reproduced as follows:

    CREATE PROC dbo.spStockLevel
        @w_id       int,
        @d_id       tinyint,
        @threshold  smallint
    AS
    SET NOCOUNT ON

    DECLARE @o_id_low   int,
            @o_id_high  int

    -- Find the order-id range of the district's 20 most recent orders
    SELECT  @o_id_low   = (d_next_o_id - 20),
            @o_id_high  = (d_next_o_id - 1)
      FROM  district
     WHERE  d_w_id = @w_id
       AND  d_id   = @d_id

    -- Count the distinct items in those orders whose stock is below the threshold
    SELECT  COUNT(DISTINCT(s_i_id))
      FROM  stock, order_line
     WHERE  ol_w_id = @w_id
       AND  ol_d_id = @d_id
       AND  ol_o_id BETWEEN @o_id_low AND @o_id_high
       AND  s_w_id  = ol_w_id
       AND  s_i_id  = ol_i_id
       AND  s_quantity < @threshold
    GO

    One way to make it do more work is to put the main code segment--with the resultset suppressed--in a while loop. Unfortunately, this does not work out too well: I still see the total CPU usage up against a ceiling, and the benchmark is still unable to use all the processor power. Another approach is to include more index lookups, such as the following:

    declare @a float
    select @a = avg(i_price) from item with(nolock)
    where i_id < 15000

    This adds a moderate amount of work, but it turns out to be more effective. When I add this SELECT statement to each of the two read-only transactions, the benchmark scales well with any number of processors (tested up to 80). Of course, only more tests can tell whether the benchmark modified as such has sufficient predictive and discriminating power to be practically useful.
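
    For concreteness, here is a sketch of how the extra lookup slots into the Stock Level transaction (the Order Status transaction is modified in the same fashion; details may differ from my actual modified code):

    ALTER PROC dbo.spStockLevel
        @w_id       int,
        @d_id       tinyint,
        @threshold  smallint
    AS
    SET NOCOUNT ON

    -- Extra index lookups added purely to make the transaction do more work per call
    declare @a float
    select @a = avg(i_price) from item with(nolock)
    where i_id < 15000

    -- Original Stock Level body, unchanged
    DECLARE @o_id_low   int,
            @o_id_high  int

    SELECT  @o_id_low   = (d_next_o_id - 20),
            @o_id_high  = (d_next_o_id - 1)
      FROM  district
     WHERE  d_w_id = @w_id
       AND  d_id   = @d_id

    SELECT  COUNT(DISTINCT(s_i_id))
      FROM  stock, order_line
     WHERE  ol_w_id = @w_id
       AND  ol_d_id = @d_id
       AND  ol_o_id BETWEEN @o_id_low AND @o_id_high
       AND  s_w_id  = ol_w_id
       AND  s_i_id  = ol_i_id
       AND  s_quantity < @threshold
    GO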

    No doubt there are many other ways to salvage this benchmark. But that discussion is not the focus of this post.

    The purpose of this post is twofold. One is to leave a reference to one of the tools I use often in collecting empirical data, which helps the future posts that depend on the tool. The other is to serve as a small example highlighting the significant impact of many-core machines on what we do; there is work ahead for the Microsoft SQL Server team.

  • Where is SQL Server running in a cluster?

    There are numerous ways to find out where a SQL Server instance is running in a cluster. The most convenient tool was cluster.exe. Unfortunately, I have to say it was the most convenient tool, and no longer is, because no single version of cluster.exe works with all versions of Windows. You could also use the PowerShell cluster cmdlets such as Get-ClusterGroup, but they’re not ubiquitous.

    These days, if I have to quickly find out which node a SQL Server instance may be running on, I generally run the following command on my workstation:

    cmd>tasklist /S ServerNodeName | findstr sqlser

    You should run this command once for each node in the cluster. For me, this is by far the quickest way, primarily because it’s ubiquitous, requires no prior setup, and can be executed from anywhere as long as there is connectivity and permission.

    It works well with any cluster that hosts a single SQL Server instance. If you have two or more SQL Server instances in a cluster, you can see where all these instances are running, but you can’t tell one from another. That is, if you use the above command to check all the nodes and find more than one sqlservr.exe, you may not be able to tell which sqlservr.exe is the one you are looking for without further checking.
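
    Incidentally, if you can connect to the instance in question, you can sidestep that ambiguity by asking SQL Server itself, since you connect by instance name (this property has been available since SQL Server 2005):

    -- Returns the name of the node a (clustered) instance is currently running on
    SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node;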

  • Performance impact: a little business logic goes a long way

    I’ve run into this little performance tuning pattern enough times that it is worth a special mention.

    As often happens, the app folks complain about a proc call being very slow, and I track it down to a specific line in the proc. The line appears harmlessly simple, as simple as the following:

    SELECT MAX(BusinessDate) FROM BusinessTransactions;

    But it takes a long time to complete. Upon further inspection, it turns out that BusinessTransactions is actually a very complex view with multiple joins and a few union all’s across archive tables. Oh, how I hate these views! But that’s a different story.

    You can try to optimize the view, and that’s fine. But there is a much simpler way.

    A little bit of checking around reveals that BusinessTransactions is holding business transaction data that is streaming in 24x7 as long as the company is open for business, and the BusinessDate column stamps each row with the date on which the transaction is entered. Armed with that knowledge, we can re-write the query with a small change:

    SELECT MAX(BusinessDate) FROM BusinessTransactions
    WHERE BusinessDate > @today_minus_20;

    Variable @today_minus_20 is essentially DATEADD(DAY, -20, GETDATE()). The nature of the business dictates that there will always be data from the past 20 days in BusinessTransactions; in the unlikely scenario where there isn’t, we have a much bigger problem to worry about than this query not returning the correct result.
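
    Putting it together, the rewritten query might look like the following sketch, with @today_minus_20 computed inside the proc:

    DECLARE @today_minus_20 datetime = DATEADD(DAY, -20, GETDATE());

    SELECT MAX(BusinessDate)
      FROM BusinessTransactions
     WHERE BusinessDate > @today_minus_20;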

    In a recent case, after I made this little change, the query duration went from about 20 minutes to less than 15 seconds.
