db-11 — Pager System

The first lab of the B-tree track where bytes leave RAM. db-10 built a B-tree out of Box<Node>s and proved three languages agreed on shape; this lab builds the substrate that turns those shapes into durable files. Every disk-backed engine in the series from here on — SQLite (db-15), MVCC (db-13), the distributed KV store (db-20), and the capstone (db-23) — sits on top of a pager. This is the component most production databases share.

What is it?

A pager is the layer that:

  1. Carves a file into fixed-size pages (we use 4 KiB by default; tests run with 256 B to keep dumps readable).
  2. Hands out pages by 1-based page id; page 0 is reserved for a file header that nails down the format.
  3. Maintains an in-memory page cache of bounded capacity, evicts with LRU, and writes dirty pages back to disk on eviction and on explicit flush().
  4. Calls fsync exactly when the user asks for durability, never on every write.

The interface is intentionally minimal:

open(path, page_size, cache_capacity) -> Pager
Pager::allocate() -> PageId            // grow file by one page
Pager::read(pid)  -> Vec<u8>           // page_size bytes
Pager::write(pid, bytes)               // bytes.len() == page_size
Pager::flush()                         // write all dirty + fsync
Pager::close()

No B-tree nodes, no records, no keys. The B+-tree in db-15 will encode those structures into the page bytes; the pager neither knows nor cares what the bytes mean.

Why does it matter?

  • The cache is the database. Every production engine spends most of its time hitting a buffer pool, not reading disk. The LRU policy, the dirty bit, and the eviction discipline are the difference between "fits in RAM, fast" and "thrashes, dead".
  • The file layout is a binding contract. Once two implementations agree on byte 0 of every page, the database is portable across languages and platforms. db-15 will reuse this contract; the cross-language hash test in this lab proves it holds before the B+-tree code ever runs.
  • fsync is the only thing that buys durability. Every other write is just a hint to the OS. Knowing exactly when fsync runs (and why) is what separates working systems from data-loss outages.

How does it work?

File layout

offset 0                            offset = N * page_size
┌─────────────────────────┬─────────────┬─────────────┬─────┐
│  page 0 (header)        │   page 1    │   page 2    │ ... │
│  magic | psz | npages   │ user bytes  │ user bytes  │     │
│  + zero-pad to page_size│             │             │     │
└─────────────────────────┴─────────────┴─────────────┴─────┘

Page 0 is 24 bytes of header + zero-padding:

offsetsizefieldvalue
016magic"DSE-PAGER-v1\0\0\0\0" (ASCII + NULs)
164page_sizeu32 little-endian
204num_pagesu32 little-endian (includes page 0)
24restzerospadding to page_size

num_pages is the durable page count — what the file claims after fsync. The in-memory pager may have allocated pages beyond that which have not been flushed yet; close()/flush() reconcile them.

Cache, in pictures

        cache_capacity = 3,   recent = [pid=5] [pid=2] [pid=7]
                                MRU             LRU

  read(5)   →   hit, promote 5 to head        [5] [2] [7]
  write(9)  →   miss, evict 7 (writeback)     [9*] [5] [2]      ← 9 dirty
  read(2)   →   hit, promote 2                [2] [9*] [5]
  flush()   →   write 9, fsync                [2] [9 ] [5]

Each frame in the cache carries:

  • pid: u32
  • data: Vec<u8> of length page_size
  • dirty: bool
  • linked-list pointers (prev / next) into the LRU chain

The lookup table is a hashmap pid → frame_index (Rust) or pid → *list.Element (Go) or pid → list iterator (C++). All three give O(1) lookup; promotion to MRU is O(1) doubly-linked-list splice.

Read path

read(pid):
    if pid == 0 or pid > num_pages_in_memory: panic
    if pid in cache:
        promote cache[pid] to MRU
        cache_hits += 1
        return clone of cache[pid].data
    else:
        cache_misses += 1
        if cache is full:
            evict tail; if dirty, pwrite then mark clean
        buffer = pread(page_size bytes at offset pid * page_size)
        insert (pid, buffer, dirty=false) at MRU
        return clone

Write path

write(pid, bytes):
    assert bytes.len() == page_size
    if pid in cache:
        cache[pid].data = bytes
        cache[pid].dirty = true
        promote to MRU
    else:
        if cache is full: evict tail with write-back as above
        insert (pid, bytes, dirty=true) at MRU       ← no read!

The "write-without-read" path is the optimization that makes bulk loads cheap. A B+-tree splitting a leaf allocates a fresh page and writes the whole 4 KiB at once; reading the old (uninitialized) contents first would double I/O for nothing.

Allocate

allocate():
    pid = num_pages_in_memory
    num_pages_in_memory += 1
    return pid                       (1-based; pid 0 is the header)

The file is extended lazily — only when the page is actually written back (either via eviction or flush). This means a sequence of allocate(); allocate(); allocate() without writes never touches disk, which matters for transactions that roll back.

Flush

flush():
    rewrite page 0 with current num_pages
    for each cached page in ascending pid order:
        if dirty: pwrite at offset pid*page_size; mark clean
    fsync

Sorting by pid before write turns N scattered seeks into one sequential pass on a spinning disk. On SSDs the win is smaller but still real (TLB-friendly access pattern; predictable readahead).

Determinism

The lab's verification depends on every operation being deterministic given the seed, the workload, and the cache capacity. Two things that look like they could leak nondeterminism but do not:

  • HashMap iteration order. We never iterate the cache map; the flush loop sorts dirty frames by pid first.
  • fsync timing. fsync does not change the byte contents of the file, only their visibility after a crash. The sha256 we compare is taken from the post-flush file, which is fully determined.

Where this fits

  • Upstream: none directly; the pager is a from-scratch component.
  • Downstream: db-12 (SQL frontend storage), db-13 (MVCC over snapshot page versions), db-14 (secondary index B+-trees over the pager), db-15 (SQLite-complete), db-21 (advanced storage variants), and every distributed lab from db-16 onwards stores state on a pager-backed file.