
Joe Chang

EMC VNX2 and VNX Future

Update 2013-10: StorageReview on EMC Next Generation VNX
Update 2013-08: News reports that VNX2 will come out in Sep 2013

While going through the Flash Memory Summit 2012 slide decks, I came across the session Flash Implications in Enterprise Storage Designs by Denis Vilfort of EMC, which provided performance information on the CLARiiON, the VNX, a VNX2, and a future VNX.

A common problem with SAN vendors is that it is almost impossible to find meaningful performance information on their storage systems. The typical practice is to cite meaningless numbers, such as IOPS to cache or the combined IO bandwidth of the FC ports, conveying the impression of massive IO bandwidth while actually guaranteeing nothing.

VNX (Original)

The original VNX was introduced in early 2011. The use of the then-new Intel Xeon 5600 (Westmere-EP) processors was progressive. The decision to employ only a single socket was not.


EMC did provide the table below on their VNX mid-range systems in the document "VNX: Storage Technology High Bandwidth Application" (h8929) showing the maximum number of front-end FC and back-end SAS channels along with the IO bandwidths for several categories.

[Table: VNX maximum front-end FC and back-end SAS channels and IO bandwidth by category, from EMC document h8929]

It is actually unusual for a SAN storage vendor to provide such information, so good for EMC. Unfortunately, there is no detailed explanation of the IO patterns for each category.

Now obviously the maximum IO bandwidth can be reached in the maximum configuration, that is, with all IO channels and all drive bays populated. There is also no question that maximum IO bandwidth requires all back-end IO ports to be populated, along with a sufficient number of front-end ports. (The VNX systems may support more front-end ports than necessary, for configuration flexibility.)

However, it should not be necessary to employ the full set of hard disks to reach maximum IO bandwidth, because SAN systems are sized for capacity and IOPS, not just bandwidth. There are Microsoft Fast Track Data Warehouse version 3.0 and 4.0 reference architectures for the EMC VNX 5300 and 5500 systems. Unfortunately, Microsoft has backed away from bare table scan tests of disk rates in favor of a composite metric, but the documents do seem to indicate that 30-50MB/s per disk is achievable on the VNX.
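
As a rough sanity check, below is a back-of-envelope sketch in Python. The 30-50MB/s per-disk rates come from the FTDW observation above and the 10GB/s target from the VNX table; treat these as working assumptions, not EMC specifications.

# Disks needed to reach a target scan bandwidth at a given per-disk rate.
def disks_needed(target_gb_per_s, mb_per_s_per_disk):
    return -(-target_gb_per_s * 1000 // mb_per_s_per_disk)  # ceiling division

for rate in (30, 50):
    print(f"{rate} MB/s per disk: {disks_needed(10, rate)} disks for 10 GB/s")

# roughly 334 disks at 30 MB/s, 200 disks at 50 MB/s

Either way, the rated bandwidth should be reachable with far fewer disks than a fully populated system, which is the point: bandwidth should saturate long before the drive bays do.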

What is needed is a document specifying the configuration strategy for high bandwidth specific to SQL Server. This includes the number and type of front-end ports, the number of back-end SAS buses, the number of disk array enclosures (DAE) on each SAS bus, the number of disks in each RAID group, and other details for each significant VNX model. It is also necessary to configure the SQL Server database file layout to match the storage system structure, but that should be our responsibility as DBAs.

It is of interest to note that the VNX FTDW reference architectures do not employ Fast Cache (flash caching) or (auto) tiered storage. Both of these are an outright waste of money on DW systems and actually impede performance. It does make good sense to employ a mix of 10K/15K HDD and SSD in the DW storage system, but we should use the SQL Server storage engine features (filegroups and partitioning) to place data accordingly.

A properly configured OLTP system should also employ separate HDD and SSD volumes, again using filegroups and partitioning to place data correctly. The reason is that the database engine itself is a giant data cache, with perhaps as much as 1000GB of memory. What do we really expect to be in the 16-48GB SAN cache that is not already in the 1TB database buffer cache? The IO seen by the SAN from the database server is likely to be very misleading in terms of which data is important and whether it should be on SSD or HDD.

CLARiiON, VNX, VNX2, VNX Future Performance

Below are the performance characteristics of the EMC mid-range line for CLARiiON, VNX, VNX2, and VNX Future. Given how rarely such figures are disclosed, I found the following diagrams highly interesting and noteworthy. Here, the CLARiiON bandwidth is cited as 3GB/s and the current VNX at 12GB/s (versus 10GB/s in the table above).

[Diagram: CLARiiON, VNX, VNX2, and VNX Future IOPS and bandwidth]

I am puzzled that the VNX is only rated at 200K IOPS. That would correspond to 200 IOPS per disk with 1000 15K HDDs at low queue depth. I would expect some capability to support short-stroking and high queue depth to achieve greater than 200 IOPS per 15K disk. The CLARiiON CX4-960 supported 960 HDDs, yet the IOPS cited corresponds to the queue depth 1 performance of 200 IOPS x 200 HDD = 40K. Was there some internal issue in the CLARiiON? I do recall a CX3-40 generating 30K IOPS over 180 x 15K HDDs.

A modern SAS controller can support 80K IOPS, so the VNX 7500 with 8 back-end SAS buses should handle more than 200K IOPS (HDD or SSD), perhaps as high as 640K. So is there some limitation in the VNX storage processor (SP), perhaps the inter-SP communication? Or a limitation of the write cache, which requires writes to memory in both SPs?
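
Below is a minimal sketch of the arithmetic behind that question, using the figures above as working assumptions (80K IOPS per SAS controller, 200 IOPS per 15K HDD at queue depth 1, roughly 1000 drives).

# IOPS ceilings implied by the figures above (assumed values, not EMC specs).
sas_buses = 8                      # VNX 7500 back-end SAS buses
iops_per_sas_controller = 80000    # modern SAS controller
hdd_iops_qd1 = 200                 # 15K HDD at queue depth 1
max_hdd = 1000                     # approximate maximum drive count

print("back-end controller ceiling:", sas_buses * iops_per_sas_controller)  # 640000
print("HDDs at queue depth 1:      ", max_hdd * hdd_iops_qd1)               # 200000

The 200K IOPS rating lines up with the queue-depth-1 disk arithmetic rather than with the back-end controller capability, which is why a storage processor or write-cache bottleneck seems plausible.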

VNX2?

Below (I suppose) is the architecture of the new VNX2. (Perhaps VNX2 will come out in May at EMC World?) In addition to transitioning from the Intel Xeon 5600 (Westmere) to the E5-2600 series (Sandy Bridge-EP), the diagram indicates that the new VNX2 will be dual-processor (socket), instead of the single socket used across the entire line of the original VNX. Considering that the 5500 and up are not entry systems, the single-socket design of the original VNX was disappointing.

[Diagram: VNX2 architecture]

VNX2 provides a 5X increase in IOPS to 1M and a 2.3X increase in IO bandwidth to 28GB/s. LSI mentions a FastPath option that dramatically increases the IOPS capability of their RAID controllers from 80K to 140-150K IOPS. My understanding is that this is done by completely disabling the cache on the RAID controller. The resources required to implement caching for a large array of HDDs can actually impede IOPS performance, and caching is even more of a drag on an array of SSDs.

The bandwidth objective is also interesting. The 12GB/s IO bandwidth of the original VNX would require 15-16 FC ports at 8Gbps (700-800MBps per port) on the front-end. The VNX 7500 has a maximum of 32 FC ports, implying 8 quad-port FC HBAs, 4 per SP.

The 8 back-end SAS buses imply 4 dual-port SAS HBAs per SP, as each SAS bus requires 1 SAS port to each SP. Together with the FC HBAs, this implies 8 HBAs per SP. The Intel Xeon 5600 processor connects over QPI to a 5520 IOH with 36 PCI-E gen 2 lanes, supporting 4 x8 and 1 x4 slots, plus a x4 gen 1 link for other functions.

In addition, a link is needed for inter-SP communication. If one x8 PCI-E gen2 slot is used for this, then write bandwidth would be limited to 3.2GB/s (per SP?). A single socket should only be able to drive 1 IOH even though it is possible to connect 2. Perhaps the VNX 7500 is dual-socket?
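
For reference, here is a minimal sketch of the PCI-E gen 2 budget behind a single IOH, assuming roughly 80% of the nominal 500MB/s per lane is realizable (my assumption).

# PCI-E gen 2 bandwidth available behind one IOH (assumed ~80% efficiency).
gen2_net_per_lane = 0.4             # GB/s, ~80% of the 0.5 GB/s nominal rate
x8_slot_bw = 8 * gen2_net_per_lane  # ~3.2 GB/s per x8 slot
x8_slots = 4

print("per x8 gen2 slot:", x8_slot_bw)             # 3.2 GB/s
print("4 x8 slots total:", x8_slots * x8_slot_bw)  # 12.8 GB/s

That roughly 12.8GB/s has to cover the front-end FC HBAs, the back-end SAS HBAs, and the inter-SP link, which is presumably why the question of a dual-socket (dual-IOH) VNX 7500 comes up.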

An increase to 28GB/s could require 40 x 8Gbps FC ports (if 700MB/s is the practical limit of one port). A 2-socket Xeon E5-2600 should be able to handle this easily, with 4 memory channels and 5 x8 PCI-E gen3 slots per socket.
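
The front-end port arithmetic, as a sketch. The per-port rates are my working assumptions for realizable (not line-rate) bandwidth, and the 16Gbps case anticipates the VNX Future discussion below.

# Front-end FC ports needed for a target bandwidth (ceiling division).
def fc_ports(target_gb_per_s, mb_per_s_per_port):
    return -(-target_gb_per_s * 1000 // mb_per_s_per_port)

print("12 GB/s at 700-800 MB/s per 8Gb port :", fc_ports(12, 800), "to", fc_ports(12, 700))
print("28 GB/s at 700 MB/s per 8Gb port     :", fc_ports(28, 700))
print("112 GB/s at ~1500 MB/s per 16Gb port :", fc_ports(112, 1500))

Even at 16Gbps, 112GB/s works out to roughly 75 front-end ports, which is why FC starts to look impractical at those bandwidths.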

VNX Future?

The future VNX is cited as 5M IOPS and 112GB/s. I assume this might involve the new NVM Express driver architecture, which supports distributed queues and high parallelism. Perhaps the reason both VNX2 and VNX Future are described is that the basic platform is ready, but not all of the components needed to support the full bandwidth are?

[Diagram: VNX Future projected performance]

The 5M IOPS should be no problem with an array of SSDs, and the new NVM Express architecture of course. But the 112GB/s bandwidth is curious. The number of FC ports required, even at a future 16Gbit/s, is too large to be practical. When these expensive storage systems are finally able to do serious IO bandwidth, it will also be time to ditch FC and FCoE. Perhaps the VNX Future will support InfiniBand? The purpose of having extreme IO bandwidth capability is to be able to deliver all of it to a single database server on demand, not a little driblet here and there. If not, then the database server should have its own storage system.

The bandwidth is also too high for even a dual-socket E5-2600. Each Xeon E5-2600 has 40 PCI-E gen3 lanes, enough for 5 x8 slots. The nominal bandwidth per PCI-E gen3 lane is 1GB/s, but the realizable bandwidth might be only 800MB/s per lane, or 6.4GB/s per x8 slot. A 2-socket system in theory could drive 64GB/s. The storage system comprises 2 SPs, each SP being a 2-socket E5-2600 system.

To support 112GB/s, each SP must be able to simultaneously move 56GB/s on the back-end (storage) side and 56GB/s on the host-side ports, for a total of 112GB/s through each SP. In addition, suppose the 112GB/s is read bandwidth and that the write bandwidth is 56GB/s. Then it is also necessary to support 56GB/s over the inter-SP link to guarantee write-cache coherency (unless it has been decided that write caching flash at the SP level is stupid).
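
Putting the PCI-E arithmetic next to that target, as a sketch (same 800MB/s-per-gen3-lane assumption as above).

# PCI-E gen 3 budget of a 2-socket E5-2600 SP versus the per-SP traffic.
lanes_per_socket = 40      # Xeon E5-2600 PCI-E gen3 lanes
net_per_lane = 0.8         # GB/s, ~80% of the 1 GB/s nominal rate
sockets_per_sp = 2

pcie_budget = lanes_per_socket * net_per_lane * sockets_per_sp  # 64 GB/s
per_sp_traffic = 56 + 56   # 56 GB/s back-end + 56 GB/s front-end

print("PCI-E budget per SP:", pcie_budget)     # 64.0 GB/s
print("traffic per SP:     ", per_sp_traffic)  # 112 GB/s

The 112GB/s of traffic per SP, before counting any inter-SP write-cache mirroring, is well beyond the roughly 64GB/s of PCI-E bandwidth a 2-socket E5-2600 can present, which is what makes the VNX Future figure so curious.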

Is it possible the VNX Future has more than 2 SPs? Perhaps each SP is a 2-socket E5-4600 system, with the 2 SPs linked via QPI? Basically this would be a 4-socket system, but running as 2 separate nodes, each node having its own OS image. Or perhaps it is simply a 4-socket system? Later this year, Intel should be releasing Ivy Bridge-EX, which might have more bandwidth. Personally, I am inclined to prefer a multi-SP system over a 4-socket SP.

Never mind; I think Haswell-EP will have 64 PCIe gen4 lanes at 16GT/s. That is 2GB/s per lane raw, 1.6GB/s per lane net, 12.8GB/s per x8 slot, and roughly 100GB/s per socket. I still think it would be a good trick if one SP could communicate with the other over QPI instead of PCIe. Write caching SSD at the SP level is probably stupid if the flash controller is already doing this? Perhaps the SP memory should be used for SSD metadata? In any case, there should be coordination between what each component does.

Summary

It is good to know that EMC is finally getting serious about IO bandwidth. I was of the opinion that the reason Oracle got into the storage business was that they were tired of hearing complaints from customers resulting from bad IO performance on the multi-million dollar SAN.

My concern is that SAN vendor field engineers have been so thoroughly indoctrinated in the storage-as-a-service mindset, in which only capacity matters and bandwidth is ignored, that they will not be able to properly implement the IO bandwidth capability of the existing VNX, not to mention the even higher bandwidth of the VNX2 and the future VNX.

Updates will be kept on QDPMA Storage.

Published Monday, February 25, 2013 8:27 AM by jchang


Comments

 

Mark said:

Hi Joe - Great blog. One of the best I have read...

Can you comment further on your statement "It is of interest to note that the VNX FTDW reference architectures do not employ Fast Cache (flash caching) and (auto) tiered-storage. Both of these are an outright waste of money on DW systems and actually impedes performance." ?

I can see your point about the auto-tiering (scheduled migration of 1MB slices), but even with SQL intelligence utilized I would think Fast Cache would help (although it too works via promotion, and does consume RAM to manage additional metadata). It would be great to understand why you have that position on Fast Cache.

April 11, 2013 10:37 AM
 

jchang said:

I should clarify that I am assuming Kimball-style DW analysis, where we are routinely scanning the entire DW, or a table sufficiently large that it exceeds the size of both the (DB server) memory and the Fast Cache PCI-E SSDs. In this case, what data is hotter than the rest? The VNX storage processor will see equal access to the entire table. The worst case is that data gets swapped in and out of Fast Cache, actually impeding access to data. In the best case, if the storage system had been designed correctly, Fast Cache has no better bandwidth than the other storage elements, SAS SSD or HDD.

My understanding is that EMC Fast Cache uses PCI-E devices with super-low latency, perhaps 35-60 microseconds, versus SAS SSD at 100-200 us and HDD at 1-5 ms. But this is of lesser value in DW.

Now even a DW will get execution plans with loop joins, generating non-sequential 8KB IO. So long as SQL Server issues the IO asynchronously, the longer latency of SAS SSD versus PCI-E SSD is not an issue. But are you aware of LOB IO? That is issued synchronously, so super-low latency is of interest there. Of course, we would not want to configure PCI-E SSD as Fast Cache, but rather as a direct storage volume.

I will reiterate that Fast Cache and auto-tiering are suitable for situations where we cannot isolate hot data, examples being SharePoint and Exchange, and also for situations where doing so would be too much effort, such as a SQL Server with very many little databases or many instances of SQL Server. In general, I am addressing the organization's main line-of-business database, supported by one or more full-time DBAs whose job it is to make sure the system is solid.

April 11, 2013 1:25 PM
 

Mark said:

EMC Fast Cache is not (currently) a PCI-E device. It is just a pool of SSDs, RAID-1 mirrored for protection, that 64KB chunks of data can be promoted into if they are accessed frequently. It is (unlike NetApp's FlashCache) a read/write cache, hence the need for RAID-1 protection. Since it is just SSDs, reading a piece of data from Fast Cache is not much different from reading a piece of data that is in an SSD tier in a storage pool. NetApp's FlashCache is a true PCI-E device, and is populated using a RAM eviction algorithm.

I have been debating the performance improvements offered by FAST Cache. At least 20% of the improved speed can be due to blocks that are frequently overwritten, and RAID-1 doesn't require any read-modify-write or parity calculations (versus typical RAID-5). Overall performance improvements are highly dependent on the data and the application, so broad statements need to be avoided.

My biggest issue with FAST Cache is that it uses enormous 64KB pages, which are contiguous relative to how the data is laid out on disk and NOT logically contiguous relative to how the data resides in the LUN. For a thin-provisioned LUN on a storage pool, a cached 64KB page can map to 8 physically non-contiguous blocks. Some may be hot, but most may be cold. The end result is that the total FAST Cache capacity available to cache truly hot 8KB blocks for databases is reduced by up to 87%, compared to a 4KB flash cache block size with NetApp.

Thank you for clearing up how your application works - you are absolutely correct, if there isn't a real hot region then not only would Fast Cache be a waste of money, but the RAM that it would consume would actually hurt performance.

Again, awesome blog, keep it up!

April 12, 2013 6:47 AM
 

jchang said:

Thanks for clarifying the Fast Cache details. EMC has a way of describing features that is confusing, given that the specifics are not provided.

So for both the main line-of-business transaction and DW server, my strong preference on storage is to rely on sheer force of numbers - IO channels and HDD/SSDs, rather than on "intelligence" features in the SAN. I consider the 8-24GB memory cache on the SP to be mostly irrelevant because my database server has 1024GB.

Another reason I do not like Fast Cache and auto-tier in this situation is that they cost money, a lot of it, which takes away from money that I could use on IO channels and the number of SSD/HDDs, to support my brute force strategy.

Let me also point out that in SQL Server, the user waits for data reads and log writes. Data writes are handled by the lazy writer. Log reads occur during log backups and transaction rollbacks, the latter of which rarely happen in a well-designed database.

April 12, 2013 11:22 AM
 

Joe said:

jchang, I think you need to revisit your summary statement about "EMC is finally getting serious about IO bandwidth". If you wanted to manage IO and bandwidth, you would purchase a Symmetrix VMAX, which EMC has clearly been serious about for many years. Not trying to debate your blog, it was very informative. EMC clearly leads the performance scaling debate, by far, and even more now with all-flash arrays and flash caching cards as well...

May 2, 2013 1:56 PM
 

jchang said:

Symmetrix is an "enterprise" product, whatever that means other than extremely expensive. It may do management tricks. I am of the opinion that enterprise storage exists so the SAN admin gets a really expensive toy that will let him/her grant storage with minimum effort or concern for actual storage performance requirements. Because the main line-of-business database is actively managed by a team of DBAs, who should make the effort to carefully plan and configure storage, the whole point of an enterprise SAN is moot.

To claim the VMAX is suitable for high bandwidth is ridiculous. The older DMX could only do 3GB/s, even though EMC liked to give the impression that the aggregate bandwidth of all the ports was much more, while never claiming what the actual bandwidth capability was. I seriously doubt the original VMAX (2009) could do much in bandwidth, being built on the Intel Core2 processor architecture, which just did not have adequate memory and IO bandwidth.

The newer VMAX x0 should be better, being built on Westmere? But EMC is quiet on what the VMAX BW is. Seriously, at best you will get 30MB/s per HD. If we wanted 10GB/s, that would take 330 disks. A typical VMAX quote will work out to $6000 per disk, so you are looking at $2M !@#$%^&*!!!

300 15K disks in direct attach should cost $500 per HD amortizing enclosures for <$200K. So please, no jokes on VMAX for DW style IO bandwidth.

I also bring up that EMC does not detail what is involved in configuring a system running SQL Server connected to a VMAX to get 10GB/s in a table scan test. It will be horribly complicated, and you will be fighting the SAN admin and the EMC FE every step of the way. Yes, I know EMC has DW papers, but on 2 occasions the EMC FE had never seen them and strictly followed the stock EMC config: flash cache + tiered storage. Getting 10GB/s to SQL Server on direct-attach is easy.

I do not object to being on a SAN. A SAN is a good choice for the transaction processing system. Direct-attach is much cheaper for DW, but a dedicated SAN like the VNX, or better yet the VNX2, is OK. But it is absolutely essential to have full control of the DW SAN. The SAN admin and EMC do not seem to have even a basic idea of DW, and will send you down a totally stupid path. So this really precludes the VMAX on cost, as there is no way to argue that it is a departmental resource.

One more thing. Remember the DMX3 for the state of Virginia that had a double memory card failure? This meant the system did not shut down in a consistent state, and hence could not restart. But the DR site could not come up either. No explanation was given. I do not blame EMC for an extremely rare double component failure, and the DR failure was probably not EMC's fault. But one of the major reasons for getting an enterprise SAN is this capability. I think too many people believe they can buy an HA/DR "product" without bothering to realize that they need to develop the operational skill to achieve actual HA/DR capability.

May 3, 2013 8:55 AM
 

jchang said:

This article, "EMC’s Data Warehouse/Business Intelligence Competency Center Now Open", is dated Dec 2008:

http://sqlmag.com/database-administration/emc-s-data-warehousebusiness-intelligence-competency-center-now-open

Also, EMC World 2013 starts tomorrow

http://www.emc.com/microsites/emcworld/2013/index.htm

May 5, 2013 11:41 AM
 

Mike N said:

Ahhh! No VNX2 yet :( I was waiting for that announcement myself (although it was sitting on the stage powering the MCx code demo).

May 30, 2013 3:15 PM
 

Lonny Niederstadt said:

Regarding the DMX3 failure and extended outage for the state of Virginia - have you read the audit?

http://jlarc.virginia.gov/other/Nortrop%20Grumman%20Audit.pdf

In an interview conducted with the EMC engineer who made the decision to replace memory board zero (0) first, it was stated that the decision to replace memory board zero (0) first was based on prior experience. During the initial troubleshooting of reviewing the log files, there were some uncorrectable (hard) errors observed on memory board (1) and correctable (soft) errors on both memory boards zero (0) and one (1). Both memory boards (0) and (1) were showing correctable error counts as being at maximum. As part of standard troubleshooting procedures, the engineer reset the counter to observe the frequency of the errors being generated in real time. During this time no uncorrectable errors were experienced and memory board zero (0) was posting correctable errors faster than memory board one (1). Further status views of the global memory continued to show memory board zero (0) logging correctable errors faster than memory board (1). When asked directly why the engineer determined to replace memory board zero (0) first, which had no uncorrectable errors as opposed to memory board one (1), which did show that uncorrectable errors had been logged at some time in the past, the engineer responded, that due to the rate at which memory board zero (0) was logging errors, prior experience indicated that it was only a matter of time before memory board zero (0) would begin to log uncorrectable errors and that was the deciding factor in his decision to replace memory board zero (0) first.

July 4, 2013 1:31 AM
 

jchang said:

MikeN: I was very surprised that VNX2 was not launched at EMC World, given the preview at FMS. I suppose VNX2 depends on many new subcomponents, and perhaps some were not ready, or this was never the launch event in the first place.

Lonny: thanks for the link; that is real hard information, completely different from the very misleading press material.

So the root cause was not a true hardware double component failure, but rather an incorrect procedure in attempting to replace problematic components with the system still operational.

I wondered if the failure to bring up the remote site was due to a backlog in the replication traffic, which I had seen at other sites. But, not having read the report completely, did SRDF replicate corrupt information? I would think that once one of the redundant memory boards was removed, the system should hard crash (Windows OS: black screen, not blue) on an uncorrectable memory failure, instead of propagating corrupt data.

But the greater problem was the disconnect between storage system administration and the DBAs. The storage people want to consider the DMX to be absolutely reliable, and the DBAs did not challenge that assumption.

I do local backups because the network people just do not seem to understand parallel 10GbE transfers. But ideally the backups should be kept on a different system, with parallel 10GbE links.

Also, the maintenance on the primary DMX should have provided the occasion to simply fail over to the remote site.

July 4, 2013 10:46 AM
 

Silab said:

Hi JChang,

About FAST Cache auto-detect (SLC only):

How does the system determine the FAST Cache SSD RAID type if not configured as FAST VP?

SSD examples:

3 x 100 GB = R1?

5 x 100 GB = R1/0 or R5?

December 18, 2013 4:28 AM
 

Jun said:

All FAST Cache is configured as R1 by default in VNX2.

May 21, 2014 1:38 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
