Step 3 — Benchmark and the Page Cache

Goal

See the page cache with your own eyes by measuring warm-cache and cold-cache pread latency. This is the experiment that should make you suspicious of every microbenchmark you ever read.

The Benchmark

pagealloc bench <file> <pages> <iters>:

Preallocate file to pages * 4 KiB using a sequential write loop.
fsync to make sure it's on disk.
Time iters random preads of one page each.
Drop the page cache.
Time iters random preads again.
Print p50/p99/p999 for each phase plus throughput in MB/s.

Implementation lives in:

../src/rust/src/bin/pagealloc.rs (bench subcommand)
../src/go/cmd/pagealloc/main.go (bench subcommand)
../src/cpp/src/main.cc (bench subcommand)

Dropping the Page Cache

On Linux:

sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Our benchmark binary calls this automatically if it can. If it can't (no sudo), it prints a warning and skips the cold phase.

On macOS:

sudo purge

Same logic — the binary attempts it, warns if it can't.

Expected Numbers

On a modern laptop with NVMe:

$ ./pagealloc bench /tmp/bench.bin 65536 50000   # 256 MB file, 50k iters
preallocated 65536 pages = 262144 KiB

WARM cache:
  iterations : 50000
  p50        : 1.2 µs
  p99        : 5.8 µs
  p99.9      : 18 µs
  throughput : 1840 MB/s

dropped page cache

COLD cache:
  iterations : 50000
  p50        : 64 µs
  p99        : 180 µs
  p99.9      : 340 µs
  throughput : 56 MB/s

Two observations:

Warm cache is ~50× faster than cold. The page cache makes microbenchmarks lie. If you benchmarked a database after running the benchmark for warmup, you measured memcpy, not disk.
p99 is 4–5× p50 even on cold cache. Latency tails come from queue depth, kernel scheduling, NVMe garbage collection. This motivates io_uring (Lab 21) and request hedging in distributed systems (Lab 20).

On a Spinning Disk (if you have one)

COLD cache:
  p50        : 6.4 ms     ← 100× worse than NVMe
  p99        : 18 ms
  throughput : 0.6 MB/s   ← versus 56 MB/s on NVMe

This 100× gap is why LSM-trees exist. Random reads on HDD are unworkable for OLTP; engines either:

Avoid them (sequential append-only logs).
Hide them behind cache (large block caches + bloom filters).
Punt to SSD (HDD as cold tier only).

Throughput vs Latency

Watch what happens with iters = 1000 vs iters = 100000:

$ ./pagealloc bench /tmp/bench.bin 65536 1000
WARM throughput : 4200 MB/s

$ ./pagealloc bench /tmp/bench.bin 65536 100000
WARM throughput : 1800 MB/s

Higher iteration counts include more cache eviction (as the random distribution gradually evicts pages we already cached), exposing memory bandwidth and TLB misses. Real workloads sit between these. A single benchmark number is almost always wrong.

Try This

Add a flag to control the access pattern: sequential vs random. Sequential preads benefit from the kernel's read-ahead heuristic. On the same NVMe device you should see:

RANDOM     cold : 56 MB/s
SEQUENTIAL cold : 2400 MB/s    ← 40× faster, all due to read-ahead

This is why scans are cheap and point lookups are expensive — even on SSD.

What Just Happened

You measured the page cache, the access pattern's effect on throughput, and the gap between p50 and p99. These three insights drive every storage design in this curriculum:

Page cache exists → your in-process block cache (Lab 8) must be smarter than LRU on raw bytes, otherwise you're duplicating the kernel's work.
Sequential >> random → LSM-tree compaction (Lab 7) sorts data on disk to convert all future reads to sequential ranges.
p99 >> p50 → consensus heartbeats (Lab 17) must tolerate occasional 100× slow fsyncs without triggering elections.

You've finished Lab 01. Run docs/verification.md to confirm all 8 checks pass. Then move on to db-02-data-structures-for-storage.

Distributed Systems Engineer — Build Databases & Consensus From Scratch