
Joe Chang

Need for Queue Depth Control in SQL Server IO

I have complained about the lack of queue depth control in SQL Server IO before, and I am complaining again. I fully understand the desire to restrict the number of knobs, which, used incorrectly, can cause more harm than good. The main point about IO queue depth is that there is no single right answer for all circumstances; the right setting depends on both the hardware environment and the purpose.

One example is a transaction processing server that is also used for reports, or perhaps for queries that behave as a report. The transaction may involve a series of queries, each touching a single row or a few rows. These would be issued synchronously at queue depth 1.

A report-type query, one for which the estimate involves 25 or more rows, would issue IO asynchronously at high queue depth. In principle, this increases storage subsystem throughput by allowing the hard disks to employ elevator seeking, but it also increases latency.

The report query completes faster despite the higher latency because it issues multiple IOs at once. The transaction query issues one (sequence of) IO and waits for it to complete before issuing the next.

So why complain now? I am testing a storage system for a data warehouse environment. I noticed that for a table scan, SQL Server was not issuing IO at sufficiently high queue depth when the data files are distributed over 16 or more volumes (RAID groups). SQL Server does issue 512K IO at high queue depth for 8 or fewer volumes. The onset of high queue depth IO seems to occur when the degree of parallelism (DOP) is equal to or higher than the number of files.
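One way to watch this behavior is to snapshot the outstanding IO per data file while the scan runs. Below is a minimal sketch only, run repeatedly from another session; the column aliases and grouping are my own, and Performance Monitor's per-volume Current Disk Queue Length tells a similar story:

  -- Point-in-time snapshot of outstanding IO per data file
  -- (io_pending = 1 means the request is still queued at the Windows level)
  SELECT mf.physical_name,
         COUNT(*) AS outstanding_ios,
         SUM(pio.io_pending) AS pending_in_windows,
         MAX(pio.io_pending_ms_ticks) AS oldest_pending_ms
  FROM sys.dm_io_pending_io_requests AS pio
  JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
       ON pio.io_handle = vfs.file_handle
  JOIN sys.master_files AS mf
       ON mf.database_id = vfs.database_id
      AND mf.file_id = vfs.file_id
  GROUP BY mf.physical_name
  ORDER BY outstanding_ios DESC;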

The throughput on each volume can be up to 600MB/s, each volume comprising six 10K HDDs in one case and six 7.2K HDDs in another. With 8 volumes spread over 4 RAID controllers, it is possible for a table scan to generate IO at approximately 4GB/s.

Each RAID controller, an HP P812 (an older model) in this case, can support 2GB/s of IO. This is rather low considering the PCI-E slot is gen 2 and x8 wide, supporting 3.2GB/s. This might be because the memory on the RAID controller is 64 bits (8 bytes) wide and operates at 800MHz, for a nominal bandwidth of 6.4GB/s. That's a lot, right? Well, not if a disk read is written to memory first, then read back to be sent to the server (where the memory write-read sequence happens again).

SDRAM, including its DDR derivatives, writes at half the read rate, so the effective write-then-read throughput is one-third of the memory bandwidth; one-third of 6.4GB/s is roughly 2.1GB/s, which lines up with the observed 2GB/s per controller. LSI controllers have similar (or lower) bandwidth to external memory, yet can drive higher throughput. There may be some other effect, perhaps the controller's internal memory?

Anyway, with 4 controllers each capable of 2GB/s, the whole system should support 8GB/s. So why 4GB/s? It turns out there is a setting in Resource Governor named REQUEST_MAX_MEMORY_GRANT_PERCENT which can "improve I/O throughput" at high degree of parallelism (the applicable DMV is sys.dm_resource_governor_workload_groups). Simply issuing IO at sufficient queue depth so that all volumes are kept working would accomplish this. There is no reason to be cryptic.
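For reference, this is roughly what the workaround looks like. A sketch only, applied here to the default workload group; the 50 percent value is just an illustration (the default is 25), not a recommendation:

  -- Raise the per-request memory grant cap for the default workload group
  ALTER WORKLOAD GROUP [default]
      WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50);
  ALTER RESOURCE GOVERNOR RECONFIGURE;

  -- Verify the effective setting
  SELECT name, request_max_memory_grant_percent
  FROM sys.dm_resource_governor_workload_groups;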

In summary, there is no a priori single correct setting for queue depth and IO size in any or all circumstances, and certainly not across different hardware configurations. To favor any specific purpose, SQL Server must know how many hard disk drives (or SSDs) each volume is comprised of, and whether these are dedicated or part of a shared pool. The operating system cannot see beyond what the RAID controller presents, so we must provide this information manually.

Next, SQL Server must know what the mission is. Is the mission to keep latency low for transactions? Or is it to flood the queues for maximum throughput in batch processing? Or something in between? It is also disappointing that SQL Server has only a single table scan IO size, so any RAID volume whose number of data disks times the RAID stripe size does not match that IO size is sub-optimal.

Below is the disk IO generated by the TPC-H LINEITEM table scan after DBCC DROPCLEANBUFFERS; SQL Server already has memory allocated from the OS (see the previous post on Lock Pages in Memory).

[Image: table scan disk IO - see the attachment link below]
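For anyone repeating the test, the scan itself is nothing special. A rough outline, assuming a standard TPC-H LINEITEM table with only its clustered index and that the statements are run in the TPC-H database:

  DBCC DROPCLEANBUFFERS;                    -- empty the buffer pool so the scan reads from disk
  SELECT COUNT_BIG(*) FROM dbo.LINEITEM;    -- forces a full scan when no smaller index exists

  -- Per-file read totals; take a before/after delta to see MB read per file
  SELECT file_id,
         num_of_reads,
         num_of_bytes_read / 1048576 AS mb_read
  FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL);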

Published Friday, October 04, 2013 12:11 AM by jchang

Attachment(s): http://www.qdpma.com/storage_files/tablescandiskio.png


Comments

 

Allen McGuire said:

So is this something we hard-code into the instance configuration? I too feel the pain of migrating from "queue depth" to "queue wait" - we as DBAs need a better metric for I/O analysis.

October 4, 2013 1:44 AM
 

tobi said:

My proposal would be for SQL Server to benchmark the underlying CPU and IO hardware to determine the necessary key numbers, such as sequential and random IO performance as well as optimal IO size and queue depth.

If Microsoft does not want us to configure these values (and shoot ourselves in the foot), they could at least provide an automatic benchmark tool that sets the values it determines.

https://connect.microsoft.com/SQLServer/feedback/details/652362/add-config-setting-to-adjust-the-cost-of-random-and-sequential-io

October 4, 2013 6:22 AM
 

jchang said:

Allen: people have gone overboard with wait stats, which are not hugely helpful in query tuning. It may be of interest after the SQL is tuned to see where the waits are, more so if you have source code. Oracle had good reason to go to wait stats: they used to support every flavor of UNIX and then some. You could not expect an Oracle expert to know how to collect info on so many different platforms. I think MS wanted to get away from perf counters because sometimes even the DBA does not have access.

tobi: there are 2 criteria: 1) the equipment, 2) the mission. The trick is how to allow a big query to generate heavy IO when the storage is otherwise idle, but back off when there are transaction queries. This should be worked into the infrastructure of Resource Governor.  

October 4, 2013 9:33 AM
 

Lonny Niederstadt said:

Could it be the total number of outstanding IOs, rather than queue length per LUN, that is limiting the workflow? In AIX it's just as easy to change the fibre channel adapter num_cmd_elems as it is to change LUN queue_depth (and just as necessary). Overruns at the LUN level (iostat) and adapter level (fcstat) are easy to monitor. It's a little more cryptic changing such attributes in Windows, and downright frustrating trying to monitor them for overruns.

If SQL Server had a LUN queue length target, it may also need an aggregate queue length target, especially if a QLogic HBA, or an Emulex HBA not using 'LUN' as the 'queue depth target', is in place. The two different targets may be extremely difficult for SQL Server to manage well, and I'd much rather administrators optimize their adapter configuration first before tuning a queue length target or even a maximum read coalesce target (e.g. the Oracle mbrc).

Have you increased the adapter maximum transfer size already? Hitachi recommends increasing it to 4MB if using the latest Emulex HBA drivers in their UCP data warehouse fast track guide. I haven't done testing at 4MB yet, but the testing I've done at 2MB has been quite good with the test workloads. (On AIX there's a recommendation to increase max transfer size to 2MB just to get the increased DMA memory for the driver; I'm not sure whether Windows allocates differing amounts of DMA memory based on max transfer size.)

Combining a larger maximum transfer size with the -E startup option (also recommended by Hitachi), which raises the potential switching point of the SQL Server proportional fill algorithm from 64KB to 4MB, can have a great impact on the sequential vs random nature of disk reads in data warehouse workloads.

http://www.hds.com/assets/pdf/hitachi-ucp-select-for-medium-level-data-warehousing.pdf

I'd be wary of implementing a queue length target at the database level, primarily because neither Oracle nor DB2 has such a target. Maybe my RDBMS world needs to expand :). Oracle indirectly affects queue length with the multi-block read count (mbrc). The mbrc controls the maximum number of database blocks coalesced into a single larger IO request during table and index scans. The administrator can set the mbrc, or Oracle can tune it in real time based on items such as the sessions value. It's really easy for an Oracle-tuned mbrc to result in LUN queue depth overruns under current operational conditions.

Oracle has exposed the mbrc config for years, but I haven't seen any blog posts on the relationship of mbrc to queue length, queue depth overruns, and aggregate read bandwidth, except maybe my own rambling :) If SQL Server put queue length targets into Resource Governor or elsewhere, I would expect more liability than payoff except in a few 'tuned by expert' cases.

October 4, 2013 6:34 PM
 

jchang said:

I will defer to the English as the official arbiters on the definition of queue. Conceivably, queue might mean only what is in line waiting to be serviced, but in Windows, the queue includes outstanding IO.

In this case, the storage systems were fully capable of 8 and 10GB/s. The problem is a combination of the fact that SQL Server issues (up to) 512K IO for table scans, meaning the RAID stripe size and number of disks need to match (for example, eight 64K stripes make up one 512K read), and the fact that at 16 volumes it would not increase the queue depth when appropriate.

This was direct-attach storage, so the FC HBA does not matter here. All of the above is good advice for SAN storage, but a SAN will have the problem discussed here as well.

In any case, this is a known issue for which the official Microsoft  resolution is the REQUEST_MAX_MEMORY_GRANT_PERCENT setting, as disconnected as it might seem.

To a degree, the SQL Server strategy of issuing IO at low queue depth, so long as the IO bandwidth is sufficient for the assigned DOP, is a good idea: it means the storage system will also be able to handle other requests without excessive latency. Of course, I would not jump the queue depth from a low value to a high value the moment incoming IO falls below what could be consumed.

And why does the queue depth not increase when there are 16 volumes, and threads are underfed without the Resource Governor setting? Why not more direct control?

I am saying that SQL Server needs to know about the characteristics of each volume - number of disks, RAID stripe size, whether it is part of a SAN shared pool, etc.

And in addition, within a database, there should be strategies depending on the mission.

I fully credit Oracle with all its great contributions to database technology, but just because something was not invented at Oracle does not mean it is not a good idea.

October 6, 2013 1:03 PM
 

nonymous joe said:

"And why does the queue depth not increase when there are 16 volumes, and threads are underfed without the Resource Governor setting? Why not more direct control."

Because they haven't spent much thought on this feature?

Or for the same reason the Oracle guys said "we can't control all the wait states on all the various platforms", i.e. MS can't optimize for constantly changing disk layouts?

Perhaps they could do it at startup of the engine, but what about afterwards, if you split the log file or add partitioning?

October 25, 2013 12:03 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
