Step 3 — Benchmark and the Page Cache
Goal
See the page cache with your own eyes by measuring warm-cache and cold-cache pread latency. This is the experiment that should make you suspicious of every microbenchmark you ever read.
The Benchmark
pagealloc bench <file> <pages> <iters>:
- Preallocate
filetopages * 4 KiBusing a sequential write loop. fsyncto make sure it's on disk.- Time
itersrandompreads of one page each. - Drop the page cache.
- Time
itersrandompreads again. - Print p50/p99/p999 for each phase plus throughput in MB/s.
Implementation lives in:
../src/rust/src/bin/pagealloc.rs(benchsubcommand)../src/go/cmd/pagealloc/main.go(benchsubcommand)../src/cpp/src/main.cc(benchsubcommand)
Dropping the Page Cache
On Linux:
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Our benchmark binary calls this automatically if it can. If it can't (no sudo), it prints a warning and skips the cold phase.
On macOS:
sudo purge
Same logic — the binary attempts it, warns if it can't.
Expected Numbers
On a modern laptop with NVMe:
$ ./pagealloc bench /tmp/bench.bin 65536 50000 # 256 MB file, 50k iters
preallocated 65536 pages = 262144 KiB
WARM cache:
iterations : 50000
p50 : 1.2 µs
p99 : 5.8 µs
p99.9 : 18 µs
throughput : 1840 MB/s
dropped page cache
COLD cache:
iterations : 50000
p50 : 64 µs
p99 : 180 µs
p99.9 : 340 µs
throughput : 56 MB/s
Two observations:
- Warm cache is ~50× faster than cold. The page cache makes microbenchmarks lie. If you benchmarked a database after running the benchmark for warmup, you measured
memcpy, not disk. - p99 is 4–5× p50 even on cold cache. Latency tails come from queue depth, kernel scheduling, NVMe garbage collection. This motivates
io_uring(Lab 21) and request hedging in distributed systems (Lab 20).
On a Spinning Disk (if you have one)
COLD cache:
p50 : 6.4 ms ← 100× worse than NVMe
p99 : 18 ms
throughput : 0.6 MB/s ← versus 56 MB/s on NVMe
This 100× gap is why LSM-trees exist. Random reads on HDD are unworkable for OLTP; engines either:
- Avoid them (sequential append-only logs).
- Hide them behind cache (large block caches + bloom filters).
- Punt to SSD (HDD as cold tier only).
Throughput vs Latency
Watch what happens with iters = 1000 vs iters = 100000:
$ ./pagealloc bench /tmp/bench.bin 65536 1000
WARM throughput : 4200 MB/s
$ ./pagealloc bench /tmp/bench.bin 65536 100000
WARM throughput : 1800 MB/s
Higher iteration counts include more cache eviction (as the random distribution gradually evicts pages we already cached), exposing memory bandwidth and TLB misses. Real workloads sit between these. A single benchmark number is almost always wrong.
Try This
Add a flag to control the access pattern: sequential vs random. Sequential preads benefit from the kernel's read-ahead heuristic. On the same NVMe device you should see:
RANDOM cold : 56 MB/s
SEQUENTIAL cold : 2400 MB/s ← 40× faster, all due to read-ahead
This is why scans are cheap and point lookups are expensive — even on SSD.
What Just Happened
You measured the page cache, the access pattern's effect on throughput, and the gap between p50 and p99. These three insights drive every storage design in this curriculum:
- Page cache exists → your in-process block cache (Lab 8) must be smarter than LRU on raw bytes, otherwise you're duplicating the kernel's work.
- Sequential >> random → LSM-tree compaction (Lab 7) sorts data on disk to convert all future reads to sequential ranges.
- p99 >> p50 → consensus heartbeats (Lab 17) must tolerate occasional 100× slow
fsyncs without triggering elections.
Next
You've finished Lab 01. Run docs/verification.md to confirm all 8 checks pass. Then move on to db-02-data-structures-for-storage.