db-11 — Analysis

What had to be decided before any code was written, and why each choice locks in trade-offs the rest of the B-tree track will pay for or be paid by.

Required invariants

  1. File layout is canonical. Byte 0..15 of page 0 is the magic string DSE-PAGER-v1\0\0\0\0; bytes 16..19 are page_size LE; bytes 20..23 are num_pages LE; bytes 24..page_size-1 are zero. Any pager implementation that produces or consumes a file must agree on these bytes to the bit.
  2. Cache capacity is hard. After every operation, the number of resident frames is <= cache_capacity. The eviction path maintains this invariant before admitting a new frame, never after.
  3. Dirty pages survive eviction. If a page is evicted while dirty == true, its bytes are written to disk before the frame is reused. The cache may evict at any time; a dropped dirty page is a data-loss bug.
  4. Determinism. Given (path, page_size, cache_capacity, seed, ops, scenario), the post-flush file bytes are a pure function of those inputs. Two languages running the same workload must produce sha256-identical files.
  5. Page 0 is reserved. User code receives only pid >= 1 from allocate(). read(0) / write(0) is undefined behaviour (panic in Rust; documented but unenforced in Go/C++).

Design decisions

Why a 16-byte magic instead of 8

8 bytes (e.g. DSEPAGER) would have fit one register and saved 8 bytes per file. 16 bytes lets us include a version suffix and a human-readable prefix that shows up in strings(1). The cost is trivial; the debugging payoff (file db.bin | grep DSE) is immediate.

Why fixed page size at open() rather than per-page

A real engine fixes page size when the database file is created and refuses to mount it under a different page size. We bake this in by writing the page size into the header and re-reading it on open. The cost: changing page size means rewriting the file. The gain: no per-page metadata, no alignment surprises, page offsets are just pid * page_size.

Why 1-based page ids

Page 0 is the header. Letting allocate() return 0 would force every caller to remember the "0 is reserved" rule and to check it on every dereference. By starting allocation at 1, the contract is enforced by construction: any pid you legitimately hold is safe to read.

Why LRU (and not CLOCK, 2Q, ARC, LFU, …)

LRU is the textbook policy and the easiest one to verify deterministic across three languages. Its weakness — sequential scans flush a hot working set — is real but invisible at the cache sizes our tests use (capacity 8 over 100 pages). db-22 will revisit and measure; until then, simplicity dominates.

Why a doubly-linked list, not a BTreeMap<LastUsed, PageId>

A balanced map gives O(log n) operations and self-orders by recency. A doubly-linked list plus a hashmap gives O(1) operations and the same eviction order, at the cost of one extra pointer per frame. For a cache of 1000 frames the difference is ~10x in cache hit latency. Worth the boilerplate; LMDB, Postgres, SQLite, RocksDB, InnoDB all use the list-plus-map shape.

Why write-back, not write-through

Write-through (every write() synchronously persists) is simpler but makes random updates ~100x slower because every dirty page costs a seek and an fsync. Write-back lets us batch many writes to the same page (db-10's B-tree insert may rewrite the same node several times during a single workload) and amortize one disk write per page per flush. The tax is the dirty-page accounting, which is enforced by invariant 3 above.

Why fsync only on flush()

The pager's user owns the durability story. SQLite calls flush at every COMMIT; an LSM (db-05) calls it after every WAL append; an embedded counter store might call it once a minute. Pushing the decision up keeps the pager honest: it never claims durability it cannot deliver. The cross-test scenarios all call flush() exactly once at the end, which is why their hashes are stable.

Why write-without-read on a cache miss

If write(pid, bytes) evicts a clean page and admits (pid, bytes, dirty=true) without first reading the old contents, the disk's bytes are overwritten entirely on the next eviction or flush. This is safe because write requires bytes.len() == page_size — the whole page is supplied. Reading the old contents first would be a 4 KiB I/O for data we throw away immediately. A proper engine extends this with "page allocation hints" so that the OS can skip the readahead too; we don't bother.

Why the workload uses SplitMix64 (the same PRNG as db-10)

Three reasons:

  1. Identical implementation across languages. Three lines of wrapping-add and xor-mul; if any language gets it wrong the sha256 changes on the very first scenario.
  2. No external dependency. Crypto-quality PRNGs would need different libraries per language; SplitMix64 is purely arithmetic.
  3. Consistency across the track. Reusing the same PRNG as db-10 means a future cross-lab test can compare hashes from "B-tree built in RAM" against "B-tree built on the pager" using the same key sequence.

The PRNG draws exactly one u64 per iteration and uses specific bit-slices for op/byte/pid. A variable number of draws per iteration would make scenarios diverge in their key streams, which would defeat purpose 3.

Why the scenarios are sequential / random / mixed

  • sequential stresses the readahead-friendly path: page ids walk in monotonic order, cache hits dominate, evictions are predictable.
  • random stresses the eviction path: cache hit ratio is the cache_capacity / num_pages ratio, evictions happen on most writes, dirty pages move through the cache constantly.
  • mixed is what real workloads look like: a hot subset (selected by (r>>60)&1) plus a long tail of cold pages.

These three together exercise the entire cache state machine. If any of them diverges across languages, the bug is localized (sequential bugs are accounting; random bugs are eviction; mixed bugs are interaction).

Tradeoffs worth flagging

  • No free-list, ever. allocate() only grows the file. Once a B+-tree splits a page and later coalesces it, the now-unused page id is leaked. db-21 (storage engine advanced) will reclaim via a free-list page; here it would just be unverifiable code.
  • Vec<u8> per frame. Every cached page is its own allocation. A real engine packs frames into a single arena (the buffer pool) and indexes by offset. db-22 will measure the difference and likely arena-allocate.
  • No checksums. A corrupted page returns its corrupted bytes silently. db-15 will add a CRC32 to the page footer when SQLite semantics demand it.
  • No mmap. mmap-backed pagers (LMDB) are dramatically simpler but inherit the OS's page-replacement decisions, which we want to control here for testability. db-21 may explore the mmap variant.
  • Single-threaded. No latching, no per-page reader/writer locks. db-13 (MVCC) and db-17 (Raft) will introduce concurrency on top of this layer.