The term in-memory database can be subject to misinterpretation.
An in-memory database was originally used to describe a storage engine designed for the memory access characteristics of modern microprocessors, not simply a database stored in memory.
Today it is common for a database to reside almost entirely in the buffer cache, i.e.,
memory of a traditional relational DBMS, but this is very different from an in-memory database just defined.
As Microsoft recently announced that the next version of SQL Server will incorporate in-memory database technology under the Hekaton codename, it is worthwhile now to revisit in more detail the difference between the original disk storage oriented and in-memory databases,
along with the differences in computer system architecture between then and now that drove the change in the database storage engine.
The First Relational Databases
Relational databases originated from the papers of Edgar Codd published from 1970 on.
Oracle may have had the first commercial product.
A group at UC Berkeley (Stonebraker and Wong) built the INGRES, from which Sybase and later SQL Server descended.
Ingres was developed on a DEC PDP-11, which was a popular mini-computer system at the time (16-bit integer/register).
The Design and Implementation of INGRES
paper by Stonebraker, Wong and Held, ACM 1976 mentions support for UNIX on the PDP-11/40 45 and 70 models.
The 11/40 could support a process address space of 64K and 128K on the 11/45 and 11/70 models.
The main element was for a database engine to make best use of limited memory to complete a query with minimal disk IO.
Computer System Architecture Evolution
The DEC PDP-11 came out in 1970 at a relatively low price-point such that it was a very
popular system in university environments.
The Spring Joint Computer Conference 1970 paper
A new architecture for mini-computers - The DEC PDP-11 cites a $5000-10K price target.
This is may have been why one happened to be available for the original Ingres development project.
PDP 11 Handbook
lists the PDP-11/10 as having 1,024 words of 16-bit read-only memory
and 128 word read-write memory.
The PDP-11/20 model has 4,096 words of 16-bit read-write
(Magnetic) core memory.
The max data transfer rate on the Unibus was one word every 750ns.
Core memory had a 1.2 µs cycle time and 500 ns access time.
Wikipedia lists the history of the PDP-11 Unibus models as:
- PDP-11/20 and 11/15: original with core memory, non-microprogrammed
- PDP-11/35 and 11/40: with microprogramming (1972?)
- PDP-11/45, 50 and 55: upto 256KB semiconductor memory (1971?)
- PDP-11/70: upto 4MB memory and 2KB cache (1975)
Microsoft Research now seems to be the repository of DEC material under the Gordon Bell section, including
The timeline information between Wikipedia and Microsoft Research do not appear to be entirely conistent.
Either it is difficult to interpret surviving documents or people's recollections
of this era are fading.
DEC VAX 11/780
There is more information on the next generation DEC VAX 11/780, the first 32-bit mini-computer.
This system came out in 1977.
VAX-11/780 Hardware Users's Guide
VAX Product Sales Guide
for details. Also search for the VAX-11/780 Architecture Handbook from
Carnegie Mellon ECE.
The CPU was built with TTL, had a 200ns clock and 8KB cache.
No transistor count is cited?
The VAX-11/780 pricing was between $210K and 350K?
The system was described as 1MIPS,
but that was because the performance was roughly comparable to an IBM system (370/158-3?) that was
accepted as 1MIPS.
It turned out the VAX 11/780 executed 0.5M native instructions per sec to deliver equivalent peformance
to the IBM 1MIPS.
John McMallum jcmit cites the IBM 370/158-3 as 0.73MIPS
and the VAX-11/780 as 1MIPS.
The CPUs of this time were very limited in the number transistors,
and should have only basic instructions.
It would have not been feasible for compiled binaries to be built on basic instructions.
The native VAX (or PDP-11) instruction set were comprised of complex instructions,
which are translated by a set of microprogrammed instructions (microcode)
into the basic instructions?
The presumption based on 0.5 VAX MIPS and the 5MHz clock cycle
is then that the average VAX instruction decomposes is 10 basic instructions
or rather clock cycles, accounting for memory access time?
The memory system contains one or two memory controllers.
Each controller can handle 1 to 16 arrays.
The memory array has a cycle time of 600ns.
A memory controller buffers one command while processing another.
The memory controllers can be interleaved.
Cache access time is 200ns, basically 1-cycle access.
Memory cycle time is 600ns. Read access time at the processor is 1800ns.
Effective average operand access time is 290ns.
The first systems used 4Kbit DRAM supporting a maximum system memory of 2MB, in increments of 128K.
Later systems used 16Kbit DRAM, supporting up to 8MB memory, in 256K increments.
Minimum memory was cited as 128K and 256K in the 1977 and 1979 handbooks,
but later documents cited minimum memory of 1M?
If we do that math, we can work out that excluding overhead for ECC, 4096 chips are required for 8MB at 16Kbit per DRAM.
The VAX 11/780 has a 72-bit memory path comprised of a 64-bit word with 8-bits for ECC.
By comparison, a modern server system supports 1TB memory over 64 DIMM sockets with 16GB DIMMs.
There are 36 chips on each 16GB DIMM (32 for data, 4 for ECC) at 4Gbit per chip.
The DRAM package could be single or double die package (DDP).
So the system could have upto 2048 chips plus 256 for ECC.
Over the course of time, computer systems transitioned to single chip microprocessors.
The low-end systems transitioned first to realize the cost benefits of lower part count.
Eventually high-end systems transitioned to microprocessors as well,
due to the chip to chip signal delays not scaling with improving transistor performance
within a chip.
The next step in microprocessor architecture was pipelined execution.
A complete single instruction is comprised of a sequence of many operations.
By dividing the sequence into many steps, the clock rate for completing a single
step can be higher than for the whole instruction.
By allowing the a sequence of instructions to overlap,
one instruction could be completed each clock cycle with pipelining.
Microprocessor Design/Pipelined Processors
has excellent illustrations of pipelining.
The time to execution a single full instruction is several clock cycles.
The Intel 80486 (1989) has a 5-stage pipeline: fetch, decode1, decode2 (effective address), execute, and write-back.
The Pentium (1993) pipeline stages are: Prefetch, Fetch (MMX only?), D1 Instruction Decode, D2 Address Generate, Execute, Writeback.
So that makes 5 stage for the original Pentium and 6 for the MMX?
Intel is curiously vague on the exact number of pipeline stages for the Pentium Pro to Pentium III,
collectively known as the P6 line.
The later Pentium M could be an improved P6, but is also called a new design.
It might be because the actual number of stages varies with the instruction?
The Pentium III has been cited as 10 stages, and the Pentium Pro (P6) could be the same.
The later Pentium III processors may have added a (prefetch) stage purely to account for the time to access L1 cache
as core frequency increased with process generation and maturity.
The first Pentium 4 processors (Willametter and Northwood) are 20 stage,
the second generation Prescot is 31 stages.
The diagram below is from "The Microarchitecture of the Pentium 4 Processor", Intel Technology Journal Q1 2001
showing 10 stages for a basic Pentium III, and 20 stages for the 180nm and 130nm Pentium 4s,
Willamette and Northwood.
In other documents, I have P6 as:
IFU1, IFU2, IFU3, Dec1, Dec2, RAT, ROB, Dis, EX, Ret1, Ret2.
The Core 2 (Conroe 65nm, Penryn 45nm, there were other codenames for server and mobile) 14 stages.
The Core 2 brand name was later changed to Core, even though pre-Core 2 processors had already been sold
with Core (Duo and Solo) as brand name.
The difficult decisions that marketing pukes must make.
The next significant innovation was super-scalar execution,
where a microprocessor could complete several instructions in parallel each clock cycle.
The Intel Pentium has limited 2-wide super-scalar.
The Pentium Pro had a more broadly usable 3-wide.
The super-scalar execution units typically have special uses,
so it is not always possible to complete an instruction on all units in each cycle.
The Intel Pentium 4 is shown with 4 ports, 2 for execution, 1 Load and 1 Store port.
I recall the Pentium 4 as 3-wide, which might be the maximum throughput of the ports.
The Intel Core microarchitecture (Conroe/Penryn) is described as 4 instructions per clock cycle
versus 3 in previous architectures.
The diagram shows 5 units, 3 for different aspects of ALU, FP and vector, 1 Load and 1 Store.
Also mentions 14-stage pipeline.
The Intel Nehalem is shown in IDF 2008 with 6 execution units, 3 for integer, FP and vector,
1 Load, 1 Store Address and 1 Store Data.
The Intel Sandy-Bridge is shown with 6 execution units, 3 for integer, FP and vector,
2 for Load/Store Address and 1 Store Data.
The Intel IDF Fall 2012 presentation on Haswell shows 8 units: 4 integer of which 3 can do vector, and 2 can do FP,
, 1 Load/Store Addres, 1 Store Data, 1 Store Address.
Million Instructions Per Second MIPS
Technically, a instruction on one system architecture has no inherent correlation to an instruction
on a different system architecture.
So there should be no correlation between MIPS on one system to another.
But people need or want to compare systems, and MIPS had already become popular,
so the MIPS was standardized, first as Whetstone (contains floating-point),
and then later Dhrystone (no floating-point).
One DMIPS is the performance of the VAX-11/780, rather than 1 million specific actual IPS.
There are minor inconsistencies between MIPS from various sources.
The table below is mostly from Wikipedia
Instruction per second .
Last two items are multi-core processoes and the MIPS rating is for all cores,
but the D IPS/clock is per core.
Another broad compilation is jcmit.
|IBM 370 158-3
||1 MIPS at 8.69MHz
||0.33 (not Dhrystone)
|Intel 80486 DX2
|Intel Pentium Pro
|Intel Pentium III
|Intel Pentium 4EE
|Intel Core 2 (2c)
|Intel Core i7 920 (4c)
Notice the sharp rise in IPS per clock between the Intel 80386 (non-pipelined) and the 80486DX2 (pipelined)
to nearly 1 per clock.
Presumably the main contributor is the 8K (unified) on die for the 80486
and 8K data + 8 K instruction cache for the Pentium.
The high-end 486 and Pentium systems of this period also had off-die L2 cache as well.
I do not recall if off-die cache was common for 386 systems.
Thereafter, IPS/clock is greater than 1 with the advent of super-scalar execution.
Both the Pentium Pro and Pentium III are 3-wide, so the increase in IPC might be due to the SIMD capability of the Pentium III.
The Pentium 4 gave up a degree of IPC on the very deep pipeline to achieve extraordinarily high clock rates.
The Core 2 was 5-wide?
The Core i7 is 5-wide but also has hyperthreading.
The latest Sandy-Bridge is 6 wide?
Intel provides MIPS rating of their earlier processors up to Pentium in
List of Intel microprocessors
|Intel Pentium (P5)
|Intel Pentium (P54)
|Intel Pentium (P54CS)
A complete DRAM history is more difficult to trace,
along with the primary manufacturers chaning over time.
Wikipedia is generally a good starting point.
Dynamic random access memory,
DRAM Design Overview
from Stanford University by Junji Ogawa.
DRAM timing is particularly difficult to understand,
more so with the radical change from asynchronous (FPM and EDO) DRAM
to synchronous SDRAM, and DDR timings.
provides the diagram below.
Other references are Anantech
and Ulrich Drepper's
What every programmer should know about memory.
The aspect of focus here is memory access latency.
This element was generally quoted for asynchronous DRAM products.
After the change to synchronous DRAM, the industry emphasis changed
to bandwidth timings.
The last of the FPM and EDO DRAM products were available with 50ns access times,
but 60ns products were more common.
Perhaps the 50ns access time required cherry picking from a normal production run?
Today, the best DDR3 may have an access time of 25-30ns at the DRAM chip.
Local memory access time at the processors (with integrated memory controller) is
on the order of 50ns?
The difference due to signal transmission from processor to memory and back.
On server systems using registered memory, there may be buffer chip between processor and DRAM?
On multi-processor (socket) systems, access to remote node memory may be over 95ns?
to an adjacent node and 150ns+ for 2-hop distances?
DDR transfers data on both edges of the clock, i.e., at double the clock rate.
Internally, DRAM is now organizied into multiple banks in order to sustain
data transfers at a very high-rate.
The entire discussion above pertains to mainstream DRAM,
which emphasis cost relative capacity first, followed by bandwidth,
with the expectation that computer system will be comprised of many DRAM chips.
For example, a recent generation personal computer will have 2 memory channels, each 64-bits wide.
The DRAM components are organized as x8, providing an 8-bit data path,
so there are 8 chips to form a 64-bit channel,
and the minimum system memory has 16-chips.
There specialty DRAM products designed around different requirements.
Graphics DRAM is designed for high bandwidth on the assumption that the memory system
will be comprised of few chips.
Consider a graphics subsystem that needs only 1GB comprised of 1Gbit chips.
The desired bandwidth might require a 256-bit path. So GDDR DRAM are often organized wider, x32 being popular.
Another specialty is reduced latency DRAM for network systems.
These systems do not require monstrous system memory capacity,
but do need super low latency to support fast turn-around time for high-speed networking,
in the 10-40Gbit/s range.
A Micron RLDRAM document mentions tRC of 8ns versus 46-52ns for DDR3?
It has been my opinion that server system memory has long since become out of balance with the original concept of system main memory.
The latency has become to long for memory.
Today most memory is used for caching of one form or another, including the database buffer cache.
The time is right to split computer system memory.
There should be a smaller memory subsystem emphasizing very low latency, not just with specialtly DRAM,
but also with physical proximity, perhaps in the same package as the processor.
A separate larger subsystem can continue to implement bulk DRAM, tolerating longer latency.
It has long been known that memory access latency cannot keep up with the microprocessor.
Of course, the Intel server microprocessor clocks rates have settled into the 3GHz range,
with further progress emphasizing the number superscalar execution ports,
and the number of cores on a single die (or socket).
For a 3GHz processor and 50ns local node access, memory latency is now 150 CPU clock cycles away,
and 300+ for remote node memory access.
Micron and other memory companies have formed the
Hybrid Memory Cube consortium, proposing a radical re-architecture of the memory system.
See Hot Chips HC23
Hybrid Memory Cube (HMC).
by J. Thomas Pawlowski, Micron
High-Performance Memories for Packet Processing.
On the VAX-11/780, the CPU clock was 5MHz or 200ns cycle time,
but a complete instruction averaged 10 cycles.
DRAM access time was 250ns, 600ns to the memory controller and 1800ns to the processor.
This was before the advent of SIMM and DIMM technology.
The processor, memory controller and memory were all on separate boards, with long signal delays.
So essentially, memory access time was comparable to the time complete one VAX (complex?) instruction.
The a single Intel Sandy-Bridge core can complete 6 instructions per clock cycle if there are no memory stalls.
The key to modern microprocessor performance is an effective cache strategy to hide memory latency.
This can be successful is there is locality or if memory can be prefeched,
ideally 150+ cycles before it is needed.
An alternative strategy is sequential memory access to make use of the high memory bandwidth of modern systems.
|CPU clock||Effective ns/Inst||Memory Access|
100ns (1 hop)
Summarizing, the CPU clock was faster than memory access even back in 1978.
However, the CPU was also a very simple device that required 10 full cycles to complete
a (microprogammed) instruction. So the net result was instruction time was comparable to memory access.
Today, a single core is capable of completing 6 instructions per clock.
This is on top of the 150-1 ratio between local memory access latency to CPU clock.
The decisions made thirty years ago for good reasons nolonger hold today.
The current nature of computer system architecture points to a completely different strategy
given the long latency for memory access.
The modern microprocessor is designed to operate with pipelining and superscalar execution.
There should be multiple independent instructions that can be executed on each clock.
Furthermore, instructions executed in one clock should not have intractable dependencies
on instructions in the immediately preceding clock cycles.
The most difficult code for modern microprocessors is pointer chasing.
This is where a memory access retrieves the next memory location to be accessed.
If the memory address is not in cache, then the access time to DRAM is over 150 cpu-cycles,
during which the processor core has nothing to do.
Once the memory is accessed, this provides the address of the next memory fetch.
Unfortunately, this code sequence just happens to describe a b-tree search.
Modern Computer Architecture and Memory
Page and Row Storage Organization
Having covered the computer system architecture transistions from 1970 to 2012,
including the processor core and the memory system,
it is appropriate to return to the orignal relational database implementation.
The following diagram is from
The Design and Implementation of INGRES
by Stonebraker, Wong and Held, ACM 1976.
The page and row storage organization from Ingres in the 1970s is still in use today.
The diagrams below are from Microsoft MSDN
Understanding Pages and Extents
Inside the Storage Engine: Anatomy of a page
Now examine the sequence of operations to access rows and columns with page-row storage,
with consideration for whether memory access operations are in cache,
or can be prefetched.
Assume that we have already found the sequence of rows required by a query from an index.
The information we have for each row is the file_id, page_id, and row
1) Check if the page is in the SQL Server buffer cache.
Also the OS must check if the page is in memory (unless lock pages in memory is in effect)
2) Acquire a shared lock or latch on the page (table and row locks)
3) Read the page header
4) Read the 2-byte row offset at the end of the page
5) Read the row/record header
6a) Fixed column loads: Address = row offset + column offset
6b) Nullable columns: Load NULL bitmap, calculate offset, load?
6c) Variable length: follow the chain of column offset to the desired column?
1) the cost of the page in cache check could be as high as 1000 cpu-cycles?
This is based on a series of table scan tests I did for varying number of rows per page.
with the lock pages in memory permission on.
The OS check could be equally expensive. One of Thomas Kejser's slides from SQLBits
mentions that lock pages in memory performance impact could be significant.
Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid.
2) It is necessary to place some kind of lock or latch on the page
even if nolock or tablock is applied on the query.
This is so the SQL Server storage engine knows that the page cannot be evicted from the buffer cache while being read.
4) The reason that the row offset and the actual row data is filled in from opposite
directions of the page is the improve storage efficiency.
With nullable or variable length data, it is not known how many rows will fit in any given page.
This requires non-sequential memory access patterns.
5) One might think in a SELECT COUNT(*) query that we could just read the m_slotCnt value in
the page header, or read the number of 2-byte row offset values at the end of page,
but apparently SQL Server actually reads the row header for each row.
6) Fixed length non-nullable columns are the least effort because the column offset is known ahead of time and the same for each row.
One of the recent SQL Server versions improved the handling of nullable columns by having a bitmask for all columns in each row, which simplifies the process of determining the offset?
Variable length columns are then difficult?
I think we have to go to the first variable length column, read the length to get the next length value and so on until we find the desired column. It would be nice to see the source code for this.
Perhaps someone would be helpful in examining the code of one of the open source databases?
There are also cycles expended to handle conversion from raw bytes to the correct data type
and special SQL Server rules.
A couple of years ago, I proposed extension to to the Intel vector instructions in
SIMD Extensions for the Database Storage Engine
in order to facilitate database page-row-column address calculation.
This would require working out the details on the new instructions,
and getting Intel to implement this in the next processor still early in the design stage.
I suspect that it would also be necessary to change the way metadata is stored to facilitate loading into the vector registers.
It would take 2-3 years for the new processor to enter production.
There would be another 2-3 years before the new technology is broadly deployed.
Of course all of this should be been started ten years ago when CPU frequency went over 1GHz.
I ran a test on a series of tables with a range of rows per page from 1 to 476.
The queries consisted of a table scans, first getting just a count of rows
and then aggregating successive columns.
The first three systems are 2-socket, running SQL Server 2008R2 on Windows Server 2008R2.
The last system is single socket running SQL Server 2012 on Windows Server 2012.
The table below shows in CPU-nanoseconds for the cost per page (including 1 row) of the count query,
the cost for each additional rows, and then the additionak cost for the first and each additional column aggregated.
Given that all processors cores were around 3GHz, the CPU-cycles for each page is the range of 2000 cycles, each row and column contributing another 150 or so cycles.
When the first row is accessed, ie, reading the last 2 bytes of an 8KB page,
the entire 64-byte cache line comprising 32 row offset values would be read into cache.
The approx 150ns cost for per row corresponds to a single memory access for the row header,
with the row offset most likely already in cache.
The tests compared column accesses in sequence. The single column aggregate is on the second column. The two column aggregate is on the second and third columns, which should be stored in adjacent bytes. There is some indication the pairs of columns are marginally more than a single column but the cost off 100+ cycles per successive column seems to be high.
Is the type conversion? or due the interpreted code?
My standard SQL Server configuration is with the lock pages in memory right assigned,
as this is required for Trace flag 834: use large-page allocations for the buffer pool.
I was not aware of Thomas Kejser's report that the lock pages in memory by itself
would have significant performance impact.
If possible, I will re-run the above tests with and without lock pages in memory.
Scaling and Locks
Another major top in database performance is scaling to a very high number of many cores.
This is both scaling over the cores in a single processor socket
and scaling over all the cores of a multi-socket system.
Apparently the locking mechanism is a serious obstacle to scaling.
A few years ago, I did a study of a non-transactional b-tree search engine, ie,
without locking. Not only did it scale perfectly over the physical cores,
it also scaled perfectly over the Hyper-Threading logical cores.
This was possible because the b-tree search is a series of pointer chasing memory accesses,
resulting in many no-ops cycles within a single thread.
With no lock contention, the scaling was perfect.
I also looked at compression and parallelism. At DOP 1, queries to compressed tables consumed
about 40% more CPU than to an uncompressed tables, depending on the operation.
The uncompressed tables would scale with increasing the degree of parallelism up to a point, before scaling falls off and the performance is saturated. The compressed tables scaled perfectly until the performance was equal to the uncompressed tables.
The interpretation was that contention for locks was limiting scaling with parallelism.
The compression added enough CPU on each thread to relieve the contention.
At high degree of parallelism, 16-32 in some examples, the compression essentially become free.
Transactional memory is currently a topic of discussion. See the Intel Developer Forum 2012
session ARCS0004, Intel Transaction Synchronization Extensions.
The objective is a lower overhead alternative to locking that can used on most cases?
The Microsoft paper also discusses lockless or lock free memory strategies?
As soon as it was evident that CPU-cycles and memory access latency were on diverging
paths, perhaps around 1990, it was realized that the page row storage system with pointer chasing code to retrieve scattered metadata would not realize the full capability of modern processors. Hence the term in-memory database for describing a storage engine optimized for
processor - memory access characteristics.
Another option is columnar data storage.
The sequential data access could then take advantage of the memory bandwidth of systems,
which was improving at an adequate pace.
Furthermore, the data type within each would be known, except for the (hopefully) rarely used variant.
By the time of the later Intel 486 or early Pentium processors, the cpu cycle time to memory access latency
ratio had exceed ten. So there was talk of 10X or greater performance
with in-memory and columnar database technology.
At that time, system memory had not become ridiculously huge as it is today,
so in-memory databases were not really practical to achieve broad adoption.
Today server memory capacity is both huge and cheap, with 16GB DIMM pricing below $400.
Of course the mainstream database systems have progressed far beyond their original base
with a deep infrastructure of tools and features that migrating to a different
DBMS would involve huge effort and risk.
The natural solution is incorporate in-memory database technology into an existing DBMS.
Microsoft SQL Server has already incorporated columnar storage in version 2012.
Breakthrough performance with in-memory technologies
(Nov 8, 2012) on the Microsoft Technet SQL Server Blog by Dave Campbell,
The coming in-memory database tipping point (Apr 9, 2012)
describes the rational behind in Hekaton.
The first Nov 8 post cites Per-Ake Larson et al
High-Performance Concurrency Control Mechanisms for Main-Memory Databases
which describes method to reduce locking and other concurrency overhead.
Oracle TimesTen In-Memory Database Architectural Overview,
IBM solidDB redbook,
Wikipedia In-memory database,
and Column-oriented DBMS.
The diagram below is from Oracle TimesTen In-Memory Database Architectural Overview
I am still working on this, to fill in missing data, correct mistakes, etc
I will try to make updates here, but if not, the permanent copy is here