Thinking too much about memory latency

I’ve been learning about query optimizers in databases, which estimate the runtimes of various algorithms based on table sizes, available memory, and data distribution. That got me thinking: how long do these algorithms actually take to run? If we assign real-world latency values to these computations, we can build an intuition for the numbers we should expect in practice.

The models

When one first encounters time complexity in algorithms, it’s usually with the RAM model. The RAM model, or Random Access Machine, assumes every basic mathematical operation (compare, swap, add, subtract) takes 1 unit of time. Any memory access is considered to take 0 units of time, and memory is also assumed to be infinite. This model is useful for comparing the computational (read: mathematical) complexity of different algorithms: in essence, it measures how many primitive mathematical operations it takes to get from a problem to a desired solution.

(An example of merge sort using the RAM model, which doesn’t take into account memory constraints)

If you’ve learned Big-O notation, you’ve used the RAM model. The issue? It has little-to-no use in predicting real-world computation time (especially on large datasets). Neither of the above assumptions about memory, that it is infinite and that it takes 0 time to access, is true in reality.

Of course, when we’re new to computing and don’t know about the existence of memory, such assumptions are fine. And if we assume 1) the data fits into memory and either 2a) there is no cache or 2b) different algorithms have similar cache efficiency, then the computational complexity does give an accurate relative comparison between algorithms. It won’t give the absolute runtime of any algorithm, but it does tell us which one should be faster.
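To make the RAM model’s accounting concrete, here’s a minimal sketch in Python: it counts comparisons in a merge sort, treating every compare as one unit of time and every memory access as free. The counter bookkeeping is mine, purely for illustration.

```python
# A minimal sketch of RAM-model accounting: every comparison costs one
# "unit of time" and memory access is assumed to be free.

def merge_sort(items, counter):
    """Sort items, tallying comparisons into counter (a one-element list)."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one compare = one unit of time in the RAM model
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

if __name__ == "__main__":
    import random
    data = [random.random() for _ in range(1024)]
    ops = [0]
    merge_sort(data, ops)
    # For n = 1024, expect on the order of n * log2(n) ≈ 10,000 comparisons,
    # regardless of where the data actually lives in the memory hierarchy.
    print(f"n={len(data)}, comparisons={ops[0]}")
```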

If we want to factor memory latency and memory size into our runtimes, we need a new model. Enter the external memory model. It assumes both finite memory and a latency to retrieve data from the hard drive, and it is often used in memory-intensive applications like database engines.

An overview of memory latency in von Neumann architecture

Almost all computers use the von Neumann architecture, so that’s the basis of our analysis.

Understanding the memory hierarchy

Below is a quick sketch of what we can assume our system looks like.

This diagram is an approximation. Certain systems might also have an L3 cache, a larger or smaller memory size, or a different type of hard drive.

Understanding latency in the memory hierarchy

We can steal latency numbers from the well-known list of latency numbers every programmer should know.

Let’s take a moment and think about these numbers.

Observation 1: References to different parts of the hierarchy have different unit sizes

A reference is the round trip needed to read data from the next level of the hierarchy. For example, if I wanted to request a byte from memory, it would take 100 ns, but if I wanted to read a byte from my SSD it would take 150,000 ns, a whopping ~1,500x slower.

However, the raw latency numbers omit an important detail: the minimum size of a read from each part of the hierarchy. The hardware design of an SSD does not allow us to transfer just one byte; the minimum amount of data we can transfer from an SSD is 1 page, or about 4 KB. Similarly, the smallest unit we can get from memory is 1 cache line, or 64 bytes. When we take these different units into account, a memory reference works out to be roughly ~23x faster than an SSD per byte (the raw ~1,500x gap divided by the 64x difference in transfer size).
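Here’s that back-of-the-envelope calculation spelled out. The constants are the approximate published latency numbers quoted above, not measurements from any particular machine:

```python
# Back-of-the-envelope per-byte latencies, using the approximate published
# numbers (real hardware will differ).

MEM_REF_NS = 100              # one memory reference (fetches ~1 cache line)
CACHE_LINE_BYTES = 64

SSD_RANDOM_READ_NS = 150_000  # one random SSD read (fetches ~1 page)
SSD_PAGE_BYTES = 4096

mem_ns_per_byte = MEM_REF_NS / CACHE_LINE_BYTES        # ~1.6 ns/byte
ssd_ns_per_byte = SSD_RANDOM_READ_NS / SSD_PAGE_BYTES  # ~37 ns/byte

print(f"raw latency ratio:      {SSD_RANDOM_READ_NS / MEM_REF_NS:.0f}x")    # ~1,500x
print(f"per-byte latency ratio: {ssd_ns_per_byte / mem_ns_per_byte:.0f}x")  # ~23x
```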

Observation 2: Streaming data is a lot faster per-byte than a reference

There is also a large per-byte speedup when reading a large amount of sequential data (1 MB) versus doing a single reference. What gives?

There are two main reasons:

  1. Seeking gets amortized over the data

When a read request comes in, the device first needs to locate where the data is stored before reading it and sending it back. We call this seeking. In the case of a sequential read, the seek only needs to happen once, to find the start of the data. So we can split (amortize) the constant cost of the seek across all of the data, resulting in less overhead per byte.

  2. Parallel reads and read-ahead

Hardware designers know that when computers read data, they usually end up reading large chunks at once, so they’ve explicitly designed hardware to allow data to be read in parallel. Reading 1 MB fully saturates the data channels between the two points, whereas a 4 KB read pays the minimum round-trip time. SSDs take advantage of parallel reads too, which is why memory ends up only ~4x faster than an SSD for large reads; the sketch below runs the numbers.
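A quick comparison of per-byte costs, again using the approximate published numbers (~250 µs to stream 1 MB from memory, ~1 ms to stream 1 MB from an SSD); real hardware will vary:

```python
# Per-byte cost of a single reference vs a 1 MB sequential read, using the
# approximate published numbers (real hardware will differ).

MiB = 1024 * 1024

# Single references (minimum transfer unit: 64 B cache line / 4 KB page)
mem_ref_ns_per_byte = 100 / 64          # one memory reference
ssd_ref_ns_per_byte = 150_000 / 4096    # one random 4 KB SSD read

# 1 MB sequential reads
mem_seq_ns_per_byte = 250_000 / MiB     # ~250 us to stream 1 MB from memory
ssd_seq_ns_per_byte = 1_000_000 / MiB   # ~1 ms to stream 1 MB from an SSD

print(f"memory: {mem_ref_ns_per_byte:.2f} ns/B per reference vs {mem_seq_ns_per_byte:.2f} ns/B streaming")
print(f"ssd:    {ssd_ref_ns_per_byte:.2f} ns/B per reference vs {ssd_seq_ns_per_byte:.2f} ns/B streaming")
print(f"for streaming, memory is only ~{ssd_seq_ns_per_byte / mem_seq_ns_per_byte:.0f}x faster than the SSD")
```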

Finding the point of throttling

Most time complexity calculations with the external memory model use the number of pages being read into memory from the hard drive. The implication is that data transfer time between the hard drive and memory dwarfs other time costs. But is this really true? We saw above that an SSD is only 4x slower for large reads.

Let’s try an example. Say the CPU has to read 1 MB from an SSD. What’s the approximate runtime cost of that?

If two operations A and B happen sequentially, we can say

T(A, B) = T(A) + T(B)

and if they happen in parallel, we say

T(A, B) = max(T(A), T(B)).

Streaming 1 MB from an SSD takes about 1,000,000 ns, or roughly 4,000 ns per 4 KB page, while moving a page out of memory takes roughly 1,000 ns. So the first page of the 1 MB takes 4,000 ns to read, and for each subsequent page the SSD read overlaps with memory feeding the previous page onward to the cache. So we can say

4,000 ns + max(4,000 ns, 1,000 ns) × (num_pages − 1) + 1,000 ns

which just ends up being the SSD throughput. In fact, because the only extra cost beyond the raw SSD transfer is a constant penalty at the end (to drain the last page from memory into the cache), it scales asymptotically to the hard drive throughput.
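Here’s that cost model as a small sketch. The function name and per-page constants are mine, derived by dividing the 1 MB sequential numbers by the 4 KB page size:

```python
# Pipelined cost model for reading n pages from an SSD while the previous
# page is simultaneously drained from memory toward the cache.

SSD_PAGE_NS = 4_000   # ~1,000,000 ns per MB / 256 pages
MEM_PAGE_NS = 1_000   # ~250,000 ns per MB / 256 pages

def pipelined_read_ns(num_pages, source_ns=SSD_PAGE_NS, drain_ns=MEM_PAGE_NS):
    """First page is pure source latency; each middle page overlaps source and
    drain, costing max(source, drain); the last page still has to drain."""
    if num_pages == 0:
        return 0
    return source_ns + max(source_ns, drain_ns) * (num_pages - 1) + drain_ns

# 1 MB = 256 pages of 4 KB: ~1,025,000 ns, essentially the SSD's 1 ms/MB throughput.
print(pipelined_read_ns(256))
```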

This relationship can be generalized. Crucially, for a given workload, not only is the CPU <- memory transfer much smaller than the memory <- hard drive transfer, but because the two can happen in parallel, we get T = max(A, B) rather than T = A + B where A >> B. Of course, we can only keep applying this property if the “drain” is faster than the “source.” So will our drain ever get backed up? Well, modern CPUs run instructions in ~1/3 ns, so it does not take many reads before a workload becomes memory-bound.
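One way to see why: express each read’s latency in units of CPU instructions, assuming roughly 3 instructions per ns as an illustrative figure for a modern core (again using the approximate published latency numbers):

```python
# Express each read's latency in units of CPU instructions, assuming a
# modern core retires roughly 3 instructions per ns (an illustrative figure).

INSTRUCTIONS_PER_NS = 3

MEM_REF_NS = 100       # one cache-missing memory reference
SSD_READ_NS = 150_000  # one random 4 KB SSD read

print(f"1 memory reference ~ {MEM_REF_NS * INSTRUCTIONS_PER_NS:,} instructions")   # ~300
print(f"1 random SSD read  ~ {SSD_READ_NS * INSTRUCTIONS_PER_NS:,} instructions")  # ~450,000

# Unless we do hundreds of instructions of real work per memory reference
# (or hundreds of thousands per SSD read), the reads dominate and the
# workload is memory-bound.
```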

Why this still isn’t enough to predict absolute, real-world performance

Of course, no system works like this in the real world. For example, this performance assumes we aren’t running an OS that context-switches us off the CPU, or other programs like a garbage collector or security software. Another issue is that time complexity inherently throws out coefficients, which are kinda important. A third issue is that the model does not take into account the distribution of the data, which can greatly impact real-world performance. In fact, database engines keep statistics about data distribution precisely to help predict algorithm runtimes.

However, similar to the RAM model, if we assume the things we ignore (the cache, for example) penalize our algorithms roughly equally, we can use it as a relative metric for comparison, even if we can’t predict absolute runtime. In a database engine’s planner, we don’t care how many nanoseconds a query will take to execute; we care about the approximate relative performance (i.e. execution plan A is 2x faster than execution plan B).
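As a toy example of that kind of relative comparison (the table size, row width, selectivity, and plan shapes below are invented for illustration, not taken from any real planner), we can rank two access paths purely by the pages they’d pull from disk:

```python
# Toy relative-cost comparison in the spirit of the external memory model:
# rank two hypothetical access paths purely by the pages they'd read from
# disk. Table size, row width, and selectivity are invented for illustration.

PAGE_BYTES = 4096

def pages(n_rows, row_bytes):
    """Number of 4 KB pages needed to hold n_rows rows of row_bytes each."""
    return (n_rows * row_bytes + PAGE_BYTES - 1) // PAGE_BYTES

table_rows, row_bytes = 10_000_000, 100
selectivity = 0.001  # the query matches 0.1% of rows

# Plan A: sequential scan reads every page of the table once.
seq_scan_pages = pages(table_rows, row_bytes)

# Plan B: index scan does ~1 random page read per matching row
# (ignoring the index traversal itself for this toy comparison).
index_scan_pages = int(table_rows * selectivity)

print(f"plan A (seq scan):   {seq_scan_pages:,} pages")
print(f"plan B (index scan): {index_scan_pages:,} pages")
print(f"plan B reads ~{seq_scan_pages / index_scan_pages:.0f}x fewer pages")
```

Real planners also weight sequential and random page reads differently (sequential I/O being cheaper per page, as we saw above), but the principle of ranking plans by I/O rather than by nanoseconds is the same.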

If we did want to use our model to calculate approximate runtimes, it would be much more accurate than the RAM model. As it stands, it’s a much better tool for comparing runtimes of algorithms in the real world. As the saying goes: “All models are wrong. Some are useful.”