Back-of-the-Envelope Math for Computers

Introduction

I rely constantly on back-of-the-envelope calculations, or “napkin math” as Simon Eskildsen calls it. It is something I do before I open a profiler, before I write a benchmark, and before I argue about micro-optimizations with colleagues. A rough mental model of computer speeds saves me from bad designs and wasted time. The practice is common in physics, where you reason about orders of magnitude to model the behaviour of a system.

This post is my personal cheat sheet for latency and bandwidth across CPU, RAM, SSDs, and HDDs, plus a few worked examples that come in handy when reviewing a design document or discussing system architecture.

The Numbers I Actually Remember

I do not remember exact specs. I remember orders of magnitude.

| Component       | Latency (approx) | Bandwidth (approx) |
| --------------- | ---------------- | ------------------ |
| CPU register    | ~0.3 ns          | Enormous           |
| L1 cache        | ~1 ns            | 1–2 TB/s           |
| L2 cache        | ~4 ns            | Hundreds of GB/s   |
| L3 cache        | ~10–15 ns        | 100+ GB/s          |
| RAM             | ~80–120 ns       | 25–80 GB/s         |
| NVMe SSD        | ~50–150 µs       | 3–7 GB/s           |
| SATA SSD        | ~100 µs          | ~500 MB/s          |
| HDD             | ~5–10 ms         | 100–200 MB/s       |

What matters: each level down the hierarchy costs roughly 10× more latency.

That single rule already explains most performance surprises.


Latency vs Bandwidth: Where I See People Get Tricked

Latency answers: how long until the first byte shows up?
Bandwidth answers: how fast can I stream once it does?

I have personally made the mistake of optimizing for bandwidth when latency was the real bottleneck.

This is why a cache miss hurts even when I am touching just a few bytes.
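
A quick worked example of the split, using assumed NVMe numbers from the table above (~100 µs latency, ~5 GB/s bandwidth): a 4 KB random read is almost all latency, while streaming 1 GB is almost all bandwidth.

#include <iostream>

int main() {
    const double latency_us = 100;       // assumed NVMe random-read latency
    const double bandwidth_gbps = 5.0;   // assumed NVMe sequential bandwidth (GB/s)

    // 4 KB random read: transfer time is negligible next to latency.
    double small_xfer_us = (4.0 / (1024 * 1024)) / bandwidth_gbps * 1e6;
    std::cout << "4 KB read: " << latency_us << " us latency + "
              << small_xfer_us << " us transfer\n";

    // 1 GB stream: latency is negligible next to transfer time.
    double big_xfer_ms = 1.0 / bandwidth_gbps * 1000;
    std::cout << "1 GB read: " << latency_us / 1000 << " ms latency + "
              << big_xfer_ms << " ms transfer\n";
}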


Math Examples

Example 1. The linked list mistake

Suppose a hot path uses a linked list because “the data structure was clean”.

10 million nodes, pointer chasing in RAM.

  • One RAM access ≈ 100 ns
  • 10M × 100 ns = 1 second

Try to replace it with a flat array.

  • Sequential scan, prefetch works
  • ~40 MB / 40 GB/s ≈ 1 millisecond

Same data, three orders of magnitude apart. Pointer-heavy structures often have no place in performance-critical code.
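
To keep myself honest, I sometimes time this directly. A minimal sketch of the comparison (the exact gap depends on the machine, and a freshly built std::list is often allocated nearly contiguously, so a long-lived, fragmented heap looks far worse than this):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const size_t N = 10'000'000;
    std::vector<int> vec(N, 1);
    std::list<int> lst(vec.begin(), vec.end());

    auto time_sum = [](const auto& c, const char* label) {
        auto start = std::chrono::high_resolution_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << label << ": sum=" << sum << " in "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms\n";
    };

    time_sum(vec, "vector");  // sequential scan: the prefetcher streams the data
    time_sum(lst, "list");    // pointer chase: every node is a potential cache miss
}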

Example 2. “It’s just a disk read”

I once had a task doing ~100 random reads per request on an SSD.

  • 100 × 100 µs = 10 ms

Taken by itself, that looks fine. Then traffic spiked.

Run the same workload on an HDD and you see the following:

  • 100 × 8 ms = 800 ms

That experiment permanently changed my intuition around random I/O.
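
I keep this arithmetic in a throwaway snippet so it is easy to re-run with different assumptions (the device latencies below are the assumed values from the table, not measurements):

#include <iostream>

int main() {
    const double reads_per_request = 100;
    const double ssd_read_us = 100;  // assumed SSD random-read latency
    const double hdd_seek_ms = 8;    // assumed HDD seek + rotational latency

    std::cout << "SSD: " << reads_per_request * ssd_read_us / 1000.0 << " ms/request\n";
    std::cout << "HDD: " << reads_per_request * hdd_seek_ms << " ms/request\n";
}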

Example 3. CPU cycles vs memory

On a 3 GHz CPU:

  • 1 cycle ≈ 0.33 ns
  • RAM access ≈ 300 cycles
  • Branch misprediction ≈ 15 cycles

The lesson: obsessing over instruction counts is useless if you forget about cache misses.

Tiny C++ Benchmarks I Use for Reality Checks

These are not scientific benchmarks. I use them to keep my intuition honest.

Compile with optimizations enabled, -O2 or -O3 (play with both).

Prototype code for a memory-latency reality check: pointer chasing

#include <vector>
#include <chrono>
#include <iostream>
#include <random>
#include <utility>

int main() {
    // 1M entries * 8 bytes = 8 MB, enough to spill out of L1/L2;
    // increase N if your L3 cache is larger than that.
    const size_t N = 1'000'000;
    std::vector<size_t> next(N);
    for (size_t i = 0; i < N; ++i) next[i] = i;

    // Sattolo's algorithm: build a single cycle of length N. A plain
    // std::shuffle can produce short cycles, so the chase would loop over
    // a small, cache-resident subset and report latencies that are too good.
    std::mt19937_64 rng(1234);
    for (size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> dist(0, i - 1);
        std::swap(next[i], next[dist(rng)]);
    }

    size_t idx = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < N; ++i) {
        idx = next[idx]; // each load depends on the previous one: pure latency
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // Printing idx keeps the compiler from optimizing the loop away.
    std::cout << "Average access latency: "
              << static_cast<double>(ns) / N << " ns (idx=" << idx << ")\n";
}
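
Prototype code for a second reality check: branch misprediction, tying back to Example 3. This is the classic sorted-vs-unsorted experiment, a sketch rather than a rigorous benchmark; some compilers turn the branch into a conditional move or vectorize the loop, which hides the effect entirely, so if the two timings match, check the generated assembly.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    std::mt19937 rng(1234);
    std::uniform_int_distribution<int> dist(0, 255);
    for (auto& x : data) x = dist(rng);

    auto time_sum = [&](const char* label) {
        auto start = std::chrono::high_resolution_clock::now();
        long long sum = 0;
        for (int x : data)
            if (x >= 128) sum += x;  // ~50/50 branch: unpredictable on random data
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << label << ": sum=" << sum << " in "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms\n";
    };

    time_sum("unsorted");  // the predictor misses roughly half the time
    std::sort(data.begin(), data.end());
    time_sum("sorted");    // the branch becomes perfectly predictable
}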

Conclusion

Here are some mental rules I actually use:

  • Cache misses dominate performance before arithmetic does
  • Sequential access beats random access
  • One disk seek equals millions of CPU instructions

  • Bandwidth affects throughput; latency affects responsiveness

Back-of-the-envelope math is not about being precise. It is about being directionally correct early. When my estimate says milliseconds but reality says seconds, I know I have some debugging to do :-)



