
Joe Chang

The Path to In-Memory Database Technology

The term in-memory database is subject to misinterpretation. It was originally used to describe a storage engine designed around the memory access characteristics of modern microprocessors, not simply a database that happens to reside in memory. Today it is common for a database to fit almost entirely in the buffer cache (i.e., memory) of a traditional relational DBMS, but that is very different from an in-memory database as just defined. As Microsoft recently announced that the next version of SQL Server will incorporate in-memory database technology under the codename Hekaton, now is a good time to revisit in more detail the difference between the original disk-oriented storage engines and in-memory databases, along with the changes in computer system architecture between then and now that drove the change in database storage engine design.

The First Relational Databases

Relational databases originated from the papers of Edgar Codd published from 1970 on. Oracle may have had the first commercial product. A group at UC Berkeley (Stonebraker and Wong) built INGRES, from which Sybase and later SQL Server descended. Ingres was developed on a DEC PDP-11, a popular mini-computer system at the time (16-bit integers/registers). The Design and Implementation of INGRES paper by Stonebraker, Wong and Held (ACM, 1976) mentions support for UNIX on the PDP-11/40, 45 and 70 models. The 11/40 could support a process address space of 64K; the 11/45 and 11/70 supported 128K.

The main design objective was for the database engine to make the best use of limited memory to complete a query with minimal disk IO.

Computer System Architecture Evolution

DEC PDP-11

The DEC PDP-11 came out in 1970 at a relatively low price point, making it a very popular system in university environments. The Spring Joint Computer Conference 1970 paper A new architecture for mini-computers - The DEC PDP-11 cites a $5,000-10K price target. This may have been why one happened to be available for the original Ingres development project. The PDP-11 Handbook lists the PDP-11/10 as having 1,024 words of 16-bit read-only memory and 128 words of read-write memory. The PDP-11/20 model has 4,096 words of 16-bit read-write (magnetic) core memory. The maximum data transfer rate on the Unibus was one word every 750ns. Core memory had a 1.2µs cycle time and 500ns access time.

Wikipedia lists the history of the PDP-11 Unibus models as:

  • PDP-11/20 and 11/15: original with core memory, non-microprogrammed
  • PDP-11/35 and 11/40: with microprogramming (1972?)
  • PDP-11/45, 50 and 55: up to 256KB semiconductor memory (1971?)
  • PDP-11/70: up to 4MB memory and 2KB cache (1975)

Microsoft Research now seems to be the repository of DEC material, under the Gordon Bell section, including the DEC Museum and 16-bit timeline. The timeline information from Wikipedia and Microsoft Research does not appear to be entirely consistent. Either it is difficult to interpret the surviving documents, or people's recollections of this era are fading.

DEC VAX 11/780

There is more information on the next generation DEC VAX 11/780, the first 32-bit mini-computer, which came out in 1977. See the VAX-11/780 Hardware User's Guide and VAX Product Sales Guide for details. Also search for the VAX-11/780 Architecture Handbook from Carnegie Mellon ECE. The CPU was built with TTL, had a 200ns clock and an 8KB cache. No transistor count is cited? The VAX-11/780 pricing was between $210K and $350K?

The system was described as 1 MIPS, but that was because its performance was roughly comparable to an IBM system (370/158-3?) that was accepted as 1 MIPS. It turned out the VAX 11/780 executed 0.5M native instructions per second to deliver performance equivalent to the IBM 1 MIPS. John McCallum's jcmit cites the IBM 370/158-3 as 0.73 MIPS and the VAX-11/780 as 1 MIPS.

VAX System

The CPUs of this time were very limited in the number of transistors and could directly implement only basic operations. It would not have been feasible for compiled binaries to be built solely on basic instructions. The native VAX (or PDP-11) instruction set was comprised of complex instructions, which were translated by a set of microprogrammed instructions (microcode) into the basic operations? The presumption, based on 0.5 VAX MIPS and the 5MHz clock, is that the average VAX instruction decomposed into 10 basic operations, or rather clock cycles, accounting for memory access time?

The memory system contains one or two memory controllers. Each controller can handle 1 to 16 arrays. The memory array has a cycle time of 600ns. A memory controller buffers one command while processing another. The memory controllers can be interleaved.

Cache access time is 200ns, basically 1-cycle access. Memory cycle time is 600ns. Read access time at the processor is 1800ns. Effective average operand access time is 290ns.

The first systems used 4Kbit DRAM supporting a maximum system memory of 2MB, in increments of 128K.

VAX System

Later systems used 16Kbit DRAM, supporting up to 8MB memory, in 256K increments. Minimum memory was cited as 128K and 256K in the 1977 and 1979 handbooks, but later documents cited minimum memory of 1M?

VAX System

If we do the math, excluding the overhead for ECC, 4,096 chips are required for 8MB at 16Kbit per DRAM. The VAX 11/780 has a 72-bit memory path, comprised of a 64-bit word plus 8 bits for ECC.

By comparison, a modern server system supports 1TB of memory over 64 DIMM sockets with 16GB DIMMs. There are 36 chips on each 16GB DIMM (32 for data, 4 for ECC) at 4Gbit per chip. The DRAM package could be a single or double die package (DDP). So the system could have up to 2,048 chips plus 256 for ECC.
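
As a back-of-the-envelope check (a minimal sketch in C; the chip organizations are the ones assumed above), the chip counts work out as follows:

    #include <stdio.h>

    int main(void) {
        /* VAX-11/780 era: 8MB of system memory from 16Kbit DRAM chips, data bits only */
        long long vax_bits  = 8LL * 1024 * 1024 * 8;     /* 8MB expressed in bits */
        long long vax_chips = vax_bits / (16 * 1024);    /* 16Kbit per DRAM chip */
        printf("VAX 8MB @ 16Kbit/chip: %lld chips (ECC excluded)\n", vax_chips);

        /* Modern server: 64 DIMMs x 16GB, 4Gbit chips, 32 data + 4 ECC chips per DIMM */
        int dimms = 64;
        printf("Modern 1TB: %d data chips + %d ECC chips\n", dimms * 32, dimms * 4);
        return 0;
    }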

Microprocessors

Over the course of time, computer systems transitioned to single-chip microprocessors. The low-end systems transitioned first, to realize the cost benefit of a lower part count. Eventually high-end systems transitioned to microprocessors as well, because chip-to-chip signal delays did not scale with the improving transistor performance within a chip.

Pipelining

The next step in microprocessor architecture was pipelined execution. A single instruction is comprised of a sequence of many operations. By dividing the sequence into many steps, the clock rate for completing a single step can be higher than for the whole instruction. By allowing a sequence of instructions to overlap, one instruction can be completed each clock cycle with pipelining. Wikibooks Microprocessor Design/Pipelined Processors has excellent illustrations of pipelining. The time to execute a single full instruction is still several clock cycles.

Pipeline
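
A rough illustration of the throughput benefit, as a sketch with made-up numbers (an ideal 5-stage pipeline, no stalls, and the same per-stage time in both cases):

    #include <stdio.h>

    int main(void) {
        int stages = 5;            /* hypothetical 5-stage pipeline */
        double stage_ns = 10.0;    /* time for one pipeline step */
        long long n = 1000000;     /* instructions to execute */

        /* Non-pipelined: each instruction takes all stages before the next one starts */
        double serial_ns = (double)n * stages * stage_ns;

        /* Pipelined: after the pipe fills, one instruction completes every stage time */
        double piped_ns = (double)(stages + n - 1) * stage_ns;

        printf("non-pipelined: %.0f ns, pipelined: %.0f ns, speedup ~%.1fx\n",
               serial_ns, piped_ns, serial_ns / piped_ns);
        return 0;
    }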

The Intel 80486 (1989) has a 5-stage pipeline: fetch, decode1, decode2 (effective address), execute, and write-back.

The Pentium (1993) pipeline stages are: Prefetch, Fetch (MMX only?), D1 Instruction Decode, D2 Address Generate, Execute, Writeback. So that makes 5 stages for the original Pentium and 6 for the MMX?

Intel is curiously vague on the exact number of pipeline stages for the Pentium Pro to Pentium III, collectively known as the P6 line. The later Pentium M could be an improved P6, but it is also called a new design. It might be that the actual number of stages varies with the instruction? The Pentium III has been cited as 10 stages, and the Pentium Pro (P6) could be the same. The later Pentium III processors may have added a (prefetch) stage purely to account for the time to access L1 cache as core frequency increased with process generation and maturity.

The first Pentium 4 processors (Willamette and Northwood) have 20 stages; the second generation Prescott has 31 stages.

The diagram below is from "The Microarchitecture of the Pentium 4 Processor", Intel Technology Journal Q1 2001 showing 10 stages for a basic Pentium III, and 20 stages for the 180nm and 130nm Pentium 4s, Willamette and Northwood.

Pipeline

In other documents, I have P6 as:
IFU1, IFU2, IFU3, Dec1, Dec2, RAT, ROB, Dis, EX, Ret1, Ret2.

The Core 2 (Conroe 65nm, Penryn 45nm; there were other codenames for server and mobile) has 14 stages. The Core 2 brand name was later changed to Core, even though pre-Core 2 processors had already been sold under the Core (Duo and Solo) brand name. The difficult decisions that marketing pukes must make.

Superscalar

The next significant innovation was super-scalar execution, in which a microprocessor can complete several instructions in parallel each clock cycle. The Intel Pentium had a limited 2-wide super-scalar design. The Pentium Pro had a more broadly usable 3-wide design. The super-scalar execution units typically have specialized uses, so it is not always possible to complete an instruction on every unit in each cycle.

microarchitecture

The Intel Pentium 4 is shown with 4 ports, 2 for execution, 1 Load and 1 Store port. I recall the Pentium 4 as 3-wide, which might be the maximum throughput of the ports.

The Intel Core microarchitecture (Conroe/Penryn) is described as 4 instructions per clock cycle versus 3 in previous architectures. The diagram shows 5 units, 3 for different aspects of ALU, FP and vector, 1 Load and 1 Store. Also mentions 14-stage pipeline.

The Intel Nehalem is shown in IDF 2008 with 6 execution units, 3 for integer, FP and vector, 1 Load, 1 Store Address and 1 Store Data. The Intel Sandy-Bridge is shown with 6 execution units, 3 for integer, FP and vector, 2 for Load/Store Address and 1 Store Data.

The Intel IDF Fall 2012 presentation on Haswell shows 8 units: 4 integer, of which 3 can do vector and 2 can do FP, plus 2 Load/Store Address, 1 Store Data, and 1 Store Address.

Million Instructions Per Second MIPS

Technically, an instruction on one system architecture has no inherent correlation to an instruction on a different system architecture, so there should be no correlation between MIPS on one system and another. But people need or want to compare systems, and MIPS had already become popular, so MIPS was standardized via benchmarks, first Whetstone (containing floating-point), and then later Dhrystone (no floating-point). One DMIPS is the performance of the VAX-11/780, rather than 1 million actual instructions per second.
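
In other words, DMIPS is just a Dhrystone score normalized against the VAX-11/780 result, commonly taken as 1757 Dhrystone loops per second; a minimal sketch of the arithmetic (the measured value below is hypothetical):

    #include <stdio.h>

    #define VAX_11_780_DHRYSTONES_PER_SEC 1757.0   /* the conventional 1-DMIPS reference */

    int main(void) {
        double measured = 9000000.0;   /* hypothetical Dhrystone loops/sec for some CPU */
        double dmips = measured / VAX_11_780_DHRYSTONES_PER_SEC;
        printf("%.0f Dhrystone loops/sec = %.0f DMIPS\n", measured, dmips);
        return 0;
    }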

There are minor inconsistencies between the MIPS figures from various sources. The table below is mostly from the Wikipedia Instructions per second article. The last two items are multi-core processors; the MIPS rating is for all cores, but the Dhrystone IPS/clock is per core. Another broad compilation is jcmit.

Processor | MHz | MIPS | D IPS/clock per core | Year
Intel 4004 | 0.740 | 0.092 | 0.1 | 1971
IBM 370 158-3 | 8.69 | 1 | 0.1 | 1972
Intel 8080 | 2 | 0.33 (not Dhrystone) | 0.165 | 1974
VAX-11/780 | 5 | 1 Dhrystone | 0.2 | 1977
Intel 80286 | 12.5 | 2.66 | 0.2 | 1982
Intel 80386 | 33 | 9.9 | 0.3 | 1985
Intel 80486 DX2 | 66 | 54 | 0.8 | 1992
Intel Pentium | 100 | 188 | 1.88 | 1994
Intel Pentium Pro | 200 | 541 | 2.7 | 1996
Intel Pentium III | 600 | 2054 | 3.4 | 1999
Intel Pentium 4EE | 3200 | 9726 | 3.0 | 2003
Intel Core 2 (2c) | 2930 | 27079 | 4.6 | 2006
Intel Core i7 920 (4c) | 2660 | 82300 | 7.7 | 2008

Notice the sharp rise in IPS per clock between the Intel 80386 (non-pipelined) and the 80486DX2 (pipelined), to nearly 1 per clock. Presumably the main contributors are the 8K (unified) on-die cache of the 80486 and the 8K data + 8K instruction caches of the Pentium. The high-end 486 and Pentium systems of this period also had off-die L2 cache. I do not recall whether off-die cache was common for 386 systems.

Thereafter, IPS/clock is greater than 1 with the advent of super-scalar execution. Both the Pentium Pro and Pentium III are 3-wide, so the increase in IPC might be due to the SIMD capability of the Pentium III. The Pentium 4 gave up a degree of IPC with its very deep pipeline in order to achieve extraordinarily high clock rates. The Core 2 was 5-wide? The Core i7 is 5-wide but also has Hyper-Threading. The latest Sandy Bridge is 6-wide?

Intel provides MIPS ratings of their earlier processors, up to the Pentium, in List of Intel microprocessors.

Processor | MHz | MIPS | Transistors | Process | Year
Intel 4004 | 0.740 | 0.07 | 2,300 | 10µm | 1971
Intel 8080 | 2 | 0.29 | 6,000 | 6µm | 1974
Intel 8086 | 5 / 8 | 0.33 / 0.66 | 29,000 | 3µm | 1978
Intel 80286 | 6 / 12.5 | 0.9 / 2.66 | 134,000 | 1.5µm | 1982
Intel 80386DX | 16 / 33 | 5 / 9.9 | 275,000 | 1µm | 1985 / 1989
Intel 80486DX | 25 / 50 | 20 / 41 | 1.2M | 1µm / 0.8µm | 1989 / 1991
Intel 80486DX4 | 75 / 100 | 53 / 70.7 | 1.6M | 0.6µm | 1994
Intel Pentium (P5) | 60 / 66 | 100 / 112 | 3.1M | 0.8µm | 1993
Intel Pentium (P54) | 90 / 100 | 149.8 / 166.3 | 3.2M | 0.6µm | 1994
Intel Pentium (P54CS) | 133 / 166 | 218.9 / 247 | 3.3M | 0.35µm | 1995 / 1996

Memory

A complete DRAM history is more difficult to trace, in part because the primary manufacturers changed over time. Wikipedia is generally a good starting point; the Dynamic random access memory article cites the DRAM Design Overview from Stanford University by Junji Ogawa.

Year | DRAM | Density | tRAC
1971 | Intel 1103 | 1K | 300ns
1973 | TI TMS403 | 4K | ?
1977 | Mostek MK4116 | 16K | 250ns
1980 | ? | 64K | 200ns
1983 | ? | 256K | 150ns
1986 | ? | 1M | ?
1989 | ? | 4M | 80ns
1992 | ? | 16M | ?
1995 | ? | 64M | ?

DRAM timing is particularly difficult to understand, more so with the radical change from asynchronous (FPM and EDO) DRAM to synchronous SDRAM, and DDR timings. Wikipedia SDRAM latency provides the diagram below.

microarchitecture

Other references are AnandTech's everything-you-always-wanted-to-know-about-sdram-memory-but-were-afraid-to-ask and Ulrich Drepper's What Every Programmer Should Know About Memory.

The aspect of focus here is memory access latency. This figure was generally quoted for asynchronous DRAM products. After the change to synchronous DRAM, the industry emphasis shifted to bandwidth timings.

The last of the FPM and EDO DRAM products were available with 50ns access times, but 60ns products were more common. Perhaps the 50ns access time required cherry picking from a normal production run?

Today, the best DDR3 may have an access time of 25-30ns at the DRAM chip. Local memory access time at the processor (with integrated memory controller) is on the order of 50ns? The difference is due to signal transmission from processor to memory and back. On server systems using registered memory, there may be a buffer chip between processor and DRAM? On multi-processor (socket) systems, access to remote node memory may be over 95ns? to an adjacent node, and 150ns+ for 2-hop distances?

DDR transfers data on both edges of the clock, i.e., at double the clock rate. Internally, DRAM is now organized into multiple banks in order to sustain data transfers at a very high rate.

The entire discussion above pertains to mainstream DRAM, which emphasizes cost relative to capacity first, followed by bandwidth, with the expectation that a computer system will be comprised of many DRAM chips. For example, a recent generation personal computer has 2 memory channels, each 64 bits wide. The DRAM components are organized as x8, providing an 8-bit data path, so 8 chips form a 64-bit channel, and the minimum system memory has 16 chips.

There are specialty DRAM products designed around different requirements. Graphics DRAM is designed for high bandwidth on the assumption that the memory system will be comprised of few chips. Consider a graphics subsystem that needs only 1GB comprised of 1Gbit chips. The desired bandwidth might require a 256-bit path, so GDDR DRAM is often organized wider, x32 being popular.

Another specialty is reduced latency DRAM for network systems. These systems do not require monstrous system memory capacity, but do need super low latency to support fast turn-around time for high-speed networking, in the 10-40Gbit/s range. A Micron RLDRAM document mentions tRC of 8ns versus 46-52ns for DDR3?

It has long been my opinion that server system memory has gotten out of balance with the original concept of system main memory: the latency has become too long. Today most memory is used for caching of one form or another, including the database buffer cache. The time is right to split computer system memory. There should be a smaller memory subsystem emphasizing very low latency, not just with specialty DRAM but also with physical proximity, perhaps in the same package as the processor. A separate, larger subsystem could continue to implement bulk DRAM, tolerating longer latency.

It has long been known that memory access latency cannot keep up with the microprocessor. Intel server microprocessor clock rates have settled into the 3GHz range, with further progress emphasizing the number of superscalar execution ports and the number of cores on a single die (or socket). For a 3GHz processor and 50ns local node access, memory is now 150 CPU clock cycles away, and 300+ cycles for remote node memory access.

Micron and other memory companies have formed the Hybrid Memory Cube consortium, proposing a radical re-architecture of the memory system. See Hot Chips HC23 Hybrid Memory Cube (HMC) by J. Thomas Pawlowski, Micron, and High-Performance Memories for Packet Processing.

On the VAX-11/780, the CPU clock was 5MHz (200ns cycle time), but a complete instruction averaged 10 cycles. DRAM access time was 250ns, 600ns to the memory controller, and 1800ns to the processor. This was before the advent of SIMM and DIMM technology; the processor, memory controller and memory were all on separate boards, with long signal delays. So essentially, memory access time was comparable to the time to complete one VAX (complex?) instruction.

A single Intel Sandy Bridge core can complete 6 instructions per clock cycle if there are no memory stalls. The key to modern microprocessor performance is an effective cache strategy to hide memory latency. This can be successful if there is locality, or if memory can be prefetched, ideally 150+ cycles before it is needed. An alternative strategy is sequential memory access, to make use of the high memory bandwidth of modern systems.

Year | Instructions per clock | CPU clock | Effective ns/Inst | Memory Access
1978 | 0.1 | 200ns | 2000 | 1800ns
2012 | up to 6 | 0.33ns | 0.055 | 50ns (local), 100ns (1 hop)

Summarizing, the CPU clock was faster than memory access even back in 1978. However, the CPU was also a very simple device that required 10 full cycles to complete a (microprogrammed) instruction, so the net result was that instruction time was comparable to memory access. Today, a single core is capable of completing 6 instructions per clock, on top of the 150:1 ratio between local memory access latency and the CPU clock. The decisions made thirty years ago for good reasons no longer hold today. The current nature of computer system architecture points to a completely different strategy, given the long latency of memory access.

The modern microprocessor is designed to operate with pipelining and superscalar execution. There should be multiple independent instructions that can be executed on each clock. Furthermore, instructions executed in one clock should not have intractable dependencies on instructions in the immediately preceding clock cycles.

The most difficult code for modern microprocessors is pointer chasing. This is where a memory access retrieves the next memory location to be accessed. If the memory address is not in cache, then the access to DRAM takes over 150 CPU cycles, during which the processor core has nothing to do. Once the memory is accessed, it provides the address of the next memory fetch. Unfortunately, this code sequence happens to describe a b-tree search.
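
A minimal sketch of why this hurts, using a hypothetical in-memory node layout (not SQL Server's actual b-tree code): every load of the next node depends on the pointer fetched from the previous node, so a cache miss at any level stalls the core for the full memory latency.

    #include <stddef.h>

    /* hypothetical b-tree node; a real engine stores keys and pointers within a page */
    struct node {
        int          nkeys;
        int          keys[15];
        struct node *child[16];    /* NULL in leaf nodes */
    };

    /* Each iteration dereferences a pointer that was itself fetched from memory in the */
    /* previous iteration. If that node is not in cache, the load is a 150+ cycle stall */
    /* with no independent work available to fill the gap.                              */
    const struct node *btree_search(const struct node *n, int key) {
        while (n != NULL) {
            int i = 0;
            while (i < n->nkeys && key > n->keys[i])
                i++;
            if (i < n->nkeys && key == n->keys[i])
                return n;              /* found */
            n = n->child[i];           /* dependent load: the next address comes from memory */
        }
        return NULL;                   /* fell off a leaf: not found */
    }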

Modern Computer Architecture and Memory

Page and Row Storage Organization

Having covered the computer system architecture transitions from 1970 to 2012, including the processor core and the memory system, it is appropriate to return to the original relational database implementation. The following diagram is from The Design and Implementation of INGRES by Stonebraker, Wong and Held (ACM, 1976).

SQL Page

The page and row storage organization from Ingres in the 1970s is still in use today. The diagrams below are from Microsoft MSDN Understanding Pages and Extents

SQL Page

and SQLSkills Inside the Storage Engine: Anatomy of a page

SQL Page

Now examine the sequence of operations to access rows and columns with page-row storage, with consideration for whether memory access operations are in cache, or can be prefetched.

Assume that we have already found, from an index, the sequence of rows required by a query. The information we have for each row is the file_id, page_id, and row (slot) number.

  • 1) Check whether the page is in the SQL Server buffer cache.
    The OS must also check whether the page is in memory (unless lock pages in memory is in effect).
  • 2) Acquire a shared lock or latch on the page (table and row locks).
  • 3) Read the page header.
  • 4) Read the 2-byte row offset at the end of the page.
  • 5) Read the row/record header.
  • 6a) Fixed-length columns: address = row offset + column offset.
  • 6b) Nullable columns: load the NULL bitmap, calculate the offset, then load?
  • 6c) Variable-length columns: follow the chain of column offsets to the desired column?

Comments:
1) The cost of the page-in-cache check could be as high as 1000 CPU cycles? This is based on a series of table scan tests I did for varying numbers of rows per page, with the lock pages in memory permission on. The OS check could be equally expensive. One of Thomas Kejser's slides from SQLBits mentions that the lock pages in memory setting by itself can have a significant performance impact.
Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid.
2) It is necessary to place some kind of lock or latch on the page even if nolock or tablock is applied on the query, so that the SQL Server storage engine knows the page cannot be evicted from the buffer cache while it is being read.
4) The reason the row offsets and the actual row data are filled in from opposite ends of the page is to improve storage efficiency. With nullable or variable-length data, it is not known how many rows will fit in any given page. This requires non-sequential memory access patterns.
5) One might think that in a SELECT COUNT(*) query we could just read the m_slotCnt value in the page header, or count the 2-byte row offset values at the end of the page, but apparently SQL Server actually reads the row header for each row.
6) Fixed-length non-nullable columns are the least effort because the column offset is known ahead of time and is the same for each row. One of the recent SQL Server versions improved the handling of nullable columns by keeping a bitmask for all columns in each row, which simplifies the process of determining the offset?

Variable-length columns are then more difficult? I think we have to go to the first variable-length column, read its length to get the next offset, and so on until we reach the desired column. It would be nice to see the source code for this. Perhaps someone would be helpful in examining the code of one of the open source databases?

There are also cycles expended to handle conversion from raw bytes to the correct data type and special SQL Server rules.
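
A simplified sketch of the address arithmetic in steps 4 through 6 above, with illustrative field layouts and offsets (not SQL Server's actual record format): a fixed-length column is a single addition, while a variable-length column requires reading an offset array from the row itself, another dependent memory access.

    #include <stdint.h>

    #define PAGE_SIZE 8192

    /* Step 4: the 2-byte row (slot) offsets are stored at the end of the 8KB page, */
    /* growing backward from the last bytes; slot 0 occupies the final 2 bytes.     */
    uint16_t row_offset(const uint8_t *page, int slot) {
        const uint16_t *slot_array = (const uint16_t *)(page + PAGE_SIZE);
        return slot_array[-(slot + 1)];
    }

    /* Step 6a: fixed-length, non-nullable column - the column offset is known in advance */
    const uint8_t *fixed_column(const uint8_t *page, int slot, int col_offset) {
        return page + row_offset(page, slot) + col_offset;
    }

    /* Step 6c: variable-length column - hypothetical layout with a 2-byte end-offset  */
    /* array after the fixed-length portion of the row, followed by the variable data. */
    const uint8_t *var_column(const uint8_t *page, int slot,
                              int var_section_offset, int nvarcols, int idx) {
        const uint8_t  *row  = page + row_offset(page, slot);
        const uint16_t *ends = (const uint16_t *)(row + var_section_offset);
        const uint8_t  *data = (const uint8_t *)(ends + nvarcols); /* data follows the array */
        uint16_t begin = (idx == 0) ? 0 : ends[idx - 1];
        return data + begin;           /* each column begins where the previous one ended */
    }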

A couple of years ago, I proposed extensions to the Intel vector instructions in SIMD Extensions for the Database Storage Engine in order to facilitate database page-row-column address calculation. This would require working out the details of the new instructions and getting Intel to implement them in a processor still early in the design stage. I suspect it would also be necessary to change the way metadata is stored to facilitate loading into the vector registers. It would take 2-3 years for the new processor to enter production, and another 2-3 years before the new technology is broadly deployed. Of course, all of this should have been started ten years ago, when CPU frequency went over 1GHz.

I ran a test on a series of tables with a range of rows per page from 1 to 476. The queries consisted of table scans, first getting just a count of rows and then aggregating successive columns. The first three systems are 2-socket, running SQL Server 2008R2 on Windows Server 2008R2. The last system is single-socket, running SQL Server 2012 on Windows Server 2012.

The table below shows, in CPU-nanoseconds, the cost per page (including 1 row) of the count query, the cost for each additional row, and then the additional cost for the first and for each additional column aggregated.

System | Table | Page w/1 row (ns) | Additional row (ns) | First column (ns) | Additional column (ns)
X5670 2.93GHz | Heap | 720 | 58 | 85 | 38
X5670 2.93GHz | Clust | 810 | 58 | 84 | 39
X5660 2.80GHz | Heap | 630 | 60 | 109 | 43
X5660 2.80GHz | Clust | 760 | 64 | 97 | 45
E5-2690 2.9GHz | Heap | 640 | 52 | 72 | 34
E5-2690 2.9GHz | Clust | 700 | 48 | 74 | 34
E3-1270 3.4GHz | Heap | 560 | 47 | 80 | 35
E3-1270 3.4GHz | Clust | 656 | 45 | 68 | 30

Given that all processor cores were around 3GHz, the CPU cost for each page is in the range of 2000 cycles, with each row and column contributing another 150 or so cycles. When the first row is accessed, i.e., reading the last 2 bytes of an 8KB page, the entire 64-byte cache line comprising 32 row offset values would be read into cache. The approximately 150-cycle (about 50ns) cost per row corresponds to a single memory access for the row header, with the row offset most likely already in cache.
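
Converting the measured nanoseconds into cycles is simple arithmetic; a small sketch using the E5-2690 clustered-index row from the table above:

    #include <stdio.h>

    int main(void) {
        double ghz = 2.9;                                  /* E5-2690 core clock */
        double page_ns = 700, row_ns = 48, col_ns = 34;    /* page w/1 row, added row, added column */

        printf("per page:   %.0f cycles\n", page_ns * ghz);   /* ~2030 cycles */
        printf("per row:    %.0f cycles\n", row_ns * ghz);    /* ~139 cycles  */
        printf("per column: %.0f cycles\n", col_ns * ghz);    /* ~99 cycles   */
        return 0;
    }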

The tests compared column accesses in sequence. The single-column aggregate is on the second column. The two-column aggregate is on the second and third columns, which should be stored in adjacent bytes. There is some indication that pairs of columns cost marginally more than a single column, but the cost of 100+ cycles per successive column seems high. Is it the type conversion, or the interpreted code?

My standard SQL Server configuration has the lock pages in memory right assigned, as this is required for trace flag 834: use large-page allocations for the buffer pool. I was not aware of Thomas Kejser's report that lock pages in memory by itself could have a significant performance impact. If possible, I will re-run the above tests with and without lock pages in memory.

Scaling and Locks

Another major topic in database performance is scaling to a very high number of cores, both scaling over the cores in a single processor socket and scaling over all the cores of a multi-socket system. Apparently the locking mechanism is a serious obstacle to scaling.

A few years ago, I did a study of a non-transactional b-tree search engine, i.e., without locking. Not only did it scale perfectly over the physical cores, it also scaled perfectly over the Hyper-Threading logical cores. This was possible because the b-tree search is a series of pointer-chasing memory accesses, resulting in many no-op cycles within a single thread. With no lock contention, the scaling was perfect.

I also looked at compression and parallelism. At DOP 1, queries to compressed tables consumed about 40% more CPU than to uncompressed tables, depending on the operation. The uncompressed tables would scale with increasing degree of parallelism up to a point, beyond which scaling falls off and performance saturates. The compressed tables scaled perfectly until their performance was equal to the uncompressed tables. The interpretation was that contention for locks was limiting scaling with parallelism; the compression added enough CPU work on each thread to relieve the contention. At high degrees of parallelism, 16-32 in some examples, the compression essentially became free.

Transactional memory is currently a topic of discussion. See the Intel Developer Forum 2012 session ARCS0004, Intel Transactional Synchronization Extensions. The objective is a lower-overhead alternative to locking that can be used in most cases? The Microsoft paper also discusses lockless or lock-free memory strategies?

In-Memory Database

As soon as it was evident that CPU cycle time and memory access latency were on diverging paths, perhaps around 1990, it was realized that the page-row storage system, with its pointer-chasing code to retrieve scattered metadata, would not realize the full capability of modern processors. Hence the term in-memory database, describing a storage engine optimized for processor-memory access characteristics.

Another option is columnar data storage. Sequential data access can then take advantage of the memory bandwidth of modern systems, which was improving at an adequate pace. Furthermore, the data type within each column is known, except for the (hopefully) rarely used variant.
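
A sketch of why the column layout favors sequential access, with illustrative structures rather than any particular engine's format: the row layout strides through memory touching only a few useful bytes per cache line, while the column layout reads a dense, prefetch-friendly array of a known type.

    #include <stddef.h>
    #include <stdint.h>

    /* Row storage: one struct per row, all columns interleaved in memory */
    struct row {
        int32_t id;
        int32_t qty;          /* the column to aggregate */
        double  price;
        char    comment[48];  /* pads the row to 64 bytes */
    };

    long long sum_qty_rows(const struct row *rows, size_t n) {
        long long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += rows[i].qty;        /* one 64-byte cache line touched per 4 useful bytes */
        return sum;
    }

    /* Column storage: each column is its own contiguous array of a known type */
    long long sum_qty_column(const int32_t *qty, size_t n) {
        long long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += qty[i];             /* dense sequential reads, easily prefetched */
        return sum;
    }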

By the time of the later Intel 486 or early Pentium processors, the CPU cycle time to memory access latency ratio had exceeded ten, so there was talk of 10X or greater performance with in-memory and columnar database technology.

At that time, system memory had not yet become ridiculously huge as it is today, so in-memory databases were not practical enough to achieve broad adoption.

Today server memory capacity is both huge and cheap, with 16GB DIMM pricing below $400. Of course, the mainstream database systems have progressed far beyond their original base, with a deep infrastructure of tools and features, so migrating to a different DBMS would involve huge effort and risk.

The natural solution is to incorporate in-memory database technology into an existing DBMS. Microsoft SQL Server has already incorporated columnar storage in version 2012.

Breakthrough performance with in-memory technologies (Nov 8, 2012) on the Microsoft TechNet SQL Server Blog by Dave Campbell, and The coming in-memory database tipping point (Apr 9, 2012), describe the rationale behind Hekaton. The first (Nov 8) post cites Per-Ake Larson et al., High-Performance Concurrency Control Mechanisms for Main-Memory Databases, which describes methods to reduce locking and other concurrency overhead.

Additional references:
Oracle TimesTen In-Memory Database Architectural Overview,
IBM solidDB redbook,
Wikipedia In-memory database, and Column-oriented DBMS.

The diagram below is from Oracle TimesTen In-Memory Database Architectural Overview

TimesTen

ps
I am still working on this, to fill in missing data, correct mistakes, etc.
I will try to make updates here, but if not, the permanent copy is here:
http://www.qdpma.com/CBO/InMemory.html

Published Monday, February 11, 2013 5:08 PM by jchang


Comments

 

Chris Adkin said:

Hi Joe,

I doff my cap to you; a tremendous amount of effort must have gone into this posting.

Regarding CPU architectures, max clock speeds will probably top out somewhere around 4.5GHz-ish. This has led to a trend in design with increasingly numerous cores per socket. I wonder when the penny will drop and CPU designers will realise there will come a point when dedicating more die real estate to L1/2/3 caches, rather than extra cores, will make a real performance difference.

Chris

February 12, 2013 7:20 AM
 

Mr Jones said:

"Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid. "

Amen

Like, Duh,

So, Mr Storage Vendor (AKA MS (aren't ALL caching strategies basically vending the "storage" of something not close)), what happens after EVERYTHING IS CACHED? AKA no spinning storage.

I know, maybe that better design thingy....?

AKA, caching is to make up for....unoptimized code perhaps...wink

Thus restarts the old war where the hardware guys pick up the slack for poor software design. Not saying there is any....wink

Intel Core 12: SQL Server storage engine on the Die....

No wait, isn't it already...if it's....cached? wink

February 12, 2013 1:40 PM
 

jchang said:

Chris:

I do not believe that there is a fundamental limit to CPU clock frequency, but clearly Intel has stopped bothering to pursue this aspect. The IDF2012 discussion on Haswell cites going to 8 ports on the superscalar side.

Cache is a complex subject considering the objective is performance over a broad range of applications and considering that each market segment has their own priorities. Also consider that once an application is in cache, there will be no further improvement, and that some applications cycle memory with poor (spatial and temporal) locality.

In the old days of single core, per Moore's Law, the objective for doubling the number of transistors is 1.4X performance. Because logic and cache transistors have different density, I like to account for logic and cache separately on this aspect. To be more accurate, the objective for doubling the die size is 1.4X.

So as a rough approximation in determining cache size, we want each 10% we add to contribute 4% improvement. The first increment of cache contributes far more than that, but then it falls off rapidly.

Increasing L1 has implications on L1 latency, ie, the number of pipeline stages for pre-fetch. L3 latency is already so high it does not really matter. L2 might have some impact, but the cost might not be worth the value.

Now with multi-core, if our application scales with cores, ie, little synchronization effort and no resource contention, then scaling is linear with the number of cores, die area or transistors, ie, better than Moore's Law.

Hence there is no single answer. Ideally, Intel should have a range of products, with different numbers of cores and with different types of cores. A further complexity with cache is what market segment benefits from cache and how is it valued.

Several years ago, the main impact was multi-processor scalability. That is, it had negligible impact on a single core, but did improve scaling at 4+ cores. This was highly valued in the database server market, to the degree that people would happily (no joke!) pay thousands of $$ for big cache over the few-hundred-dollar desktop processor.

So my personal opinion is that the cache today is probably good enough. I prefer segmenting the memory system to get small super low latency.

February 12, 2013 2:23 PM
 

jchang said:

Mr J: my objection to storage vendors and their caching claims is primarily due to the fact that the database engine is itself a cache. On modern server systems, we might have 256-384GB in a 2-socket and 1024GB in a 4-socket.

If the SAN vendor sales rep talks about their 32GB SAN controller cache, I would have a to-scale representation of 1024GB vs 32GB ready to show them.

Of course they will also try to sell their SSD caching solution.

Clearly they have never heard of database server FileGroups and Partitioning?

So here goes: our database server has 256-1024GB memory, of which 85% is buffer cache. What data is it that is not buffer but will be on the SAN?

February 12, 2013 2:29 PM
 

Thomas Kejser said:

Hi Joe

I did some measurements of PAGE compression recently. I measured an overhead of up to 60% in CPU cycles to do updates in place.

Measuring the lock times (by using XEvents and getting the CPU cycle count for each event), there is a HUGE overhead for compression, as locks are held a lot longer (4x longer). This indicates that you should never compress tables that have a lot of lock contention.

March 10, 2013 10:49 PM
 

jchang said:

hey Thomas,

at SQLBits, I reported page compression as having 40% overhead on a single-thread read, but in parallel queries it could be essentially free. Given that compression burns (much) more CPU than decompression, 60% for writes seems cheap. Of course, MS probably picked a lower-overhead compression algorithm, and the 8KB page is probably too small to get really good compression. Too bad there is not a full-extent high-compression option. I have noticed that creating nonclustered indexes at very high row counts is painfully slow, perhaps 3-4X overhead in elapsed time; perhaps this is an inability to use parallelism in the final step?

By update in place, I assume you are updating a fixed-length value with another fixed-length value? Without compression, it really will be in place. With page compression, I assume we must decompress the entire page and recompress the same, hence effectively we must lock the entire page?

Now what if the new value does not compress into the same size? Would there be a page split? Could we build our indexes with an appropriate fill factor? I suppose a column with sequentially increasing or otherwise predictable values compresses well, while random values do not? So we should test updates on both compressible and non-compressible values?

March 12, 2013 11:39 AM
 

merrillaldrich said:

Great article. Love it! Thank you for putting this together.

July 12, 2013 2:36 PM
