db-11 — Pager System
The first lab of the B-tree track where bytes leave RAM. db-10 built
a B-tree out of Box<Node>s and proved three languages agreed on
shape; this lab builds the substrate that turns those shapes into
durable files. Every disk-backed engine in the series from here on
— SQLite (db-15), MVCC (db-13), the distributed KV store (db-20),
and the capstone (db-23) — sits on top of a pager. This is the
component most production databases share.
What is it?
A pager is the layer that:
- Carves a file into fixed-size pages (we use 4 KiB by default; tests run with 256 B to keep dumps readable).
- Hands out pages by 1-based page id; page 0 is reserved for a file header that nails down the format.
- Maintains an in-memory page cache of bounded capacity, evicts
with LRU, and writes dirty pages back to disk on eviction
and on explicit
flush(). - Calls
fsyncexactly when the user asks for durability, never on every write.
The interface is intentionally minimal:
open(path, page_size, cache_capacity) -> Pager
Pager::allocate() -> PageId // grow file by one page
Pager::read(pid) -> Vec<u8> // page_size bytes
Pager::write(pid, bytes) // bytes.len() == page_size
Pager::flush() // write all dirty + fsync
Pager::close()
No B-tree nodes, no records, no keys. The B+-tree in db-15 will encode those structures into the page bytes; the pager neither knows nor cares what the bytes mean.
Why does it matter?
- The cache is the database. Every production engine spends most of its time hitting a buffer pool, not reading disk. The LRU policy, the dirty bit, and the eviction discipline are the difference between "fits in RAM, fast" and "thrashes, dead".
- The file layout is a binding contract. Once two implementations agree on byte 0 of every page, the database is portable across languages and platforms. db-15 will reuse this contract; the cross-language hash test in this lab proves it holds before the B+-tree code ever runs.
- fsync is the only thing that buys durability. Every other write is just a hint to the OS. Knowing exactly when fsync runs (and why) is what separates working systems from data-loss outages.
How does it work?
File layout
offset 0 offset = N * page_size
┌─────────────────────────┬─────────────┬─────────────┬─────┐
│ page 0 (header) │ page 1 │ page 2 │ ... │
│ magic | psz | npages │ user bytes │ user bytes │ │
│ + zero-pad to page_size│ │ │ │
└─────────────────────────┴─────────────┴─────────────┴─────┘
Page 0 is 24 bytes of header + zero-padding:
| offset | size | field | value |
|---|---|---|---|
| 0 | 16 | magic | "DSE-PAGER-v1\0\0\0\0" (ASCII + NULs) |
| 16 | 4 | page_size | u32 little-endian |
| 20 | 4 | num_pages | u32 little-endian (includes page 0) |
| 24 | rest | zeros | padding to page_size |
num_pages is the durable page count — what the file claims after
fsync. The in-memory pager may have allocated pages beyond that
which have not been flushed yet; close()/flush() reconcile them.
Cache, in pictures
cache_capacity = 3, recent = [pid=5] [pid=2] [pid=7]
MRU LRU
read(5) → hit, promote 5 to head [5] [2] [7]
write(9) → miss, evict 7 (writeback) [9*] [5] [2] ← 9 dirty
read(2) → hit, promote 2 [2] [9*] [5]
flush() → write 9, fsync [2] [9 ] [5]
Each frame in the cache carries:
pid: u32data: Vec<u8>of lengthpage_sizedirty: bool- linked-list pointers (prev / next) into the LRU chain
The lookup table is a hashmap pid → frame_index (Rust) or pid → *list.Element (Go) or pid → list iterator (C++). All three give
O(1) lookup; promotion to MRU is O(1) doubly-linked-list splice.
Read path
read(pid):
if pid == 0 or pid > num_pages_in_memory: panic
if pid in cache:
promote cache[pid] to MRU
cache_hits += 1
return clone of cache[pid].data
else:
cache_misses += 1
if cache is full:
evict tail; if dirty, pwrite then mark clean
buffer = pread(page_size bytes at offset pid * page_size)
insert (pid, buffer, dirty=false) at MRU
return clone
Write path
write(pid, bytes):
assert bytes.len() == page_size
if pid in cache:
cache[pid].data = bytes
cache[pid].dirty = true
promote to MRU
else:
if cache is full: evict tail with write-back as above
insert (pid, bytes, dirty=true) at MRU ← no read!
The "write-without-read" path is the optimization that makes bulk loads cheap. A B+-tree splitting a leaf allocates a fresh page and writes the whole 4 KiB at once; reading the old (uninitialized) contents first would double I/O for nothing.
Allocate
allocate():
pid = num_pages_in_memory
num_pages_in_memory += 1
return pid (1-based; pid 0 is the header)
The file is extended lazily — only when the page is actually written
back (either via eviction or flush). This means a sequence of
allocate(); allocate(); allocate() without writes never touches
disk, which matters for transactions that roll back.
Flush
flush():
rewrite page 0 with current num_pages
for each cached page in ascending pid order:
if dirty: pwrite at offset pid*page_size; mark clean
fsync
Sorting by pid before write turns N scattered seeks into one
sequential pass on a spinning disk. On SSDs the win is smaller but
still real (TLB-friendly access pattern; predictable readahead).
Determinism
The lab's verification depends on every operation being deterministic given the seed, the workload, and the cache capacity. Two things that look like they could leak nondeterminism but do not:
HashMapiteration order. We never iterate the cache map; the flush loop sorts dirty frames bypidfirst.fsynctiming.fsyncdoes not change the byte contents of the file, only their visibility after a crash. The sha256 we compare is taken from the post-flushfile, which is fully determined.
Where this fits
- Upstream: none directly; the pager is a from-scratch component.
- Downstream: db-12 (SQL frontend storage), db-13 (MVCC over snapshot page versions), db-14 (secondary index B+-trees over the pager), db-15 (SQLite-complete), db-21 (advanced storage variants), and every distributed lab from db-16 onwards stores state on a pager-backed file.