db-11 — Broader Ideas

Where this disk-backed pager fits in the rest of the track, and which real-world techniques live one or two steps beyond it.

Immediate next labs

  • db-12 — SQL frontend. The first consumer of the pager. A row in a table becomes some bytes inside some page; an INSERT is a Pager::write. The B+-tree layer that maps rows-to-pages is built in db-15 but its scaffolding starts here.
  • db-13 — Transactions and MVCC. Each transaction sees a consistent snapshot of the pager. The simplest implementation is copy-on-write at the page level: a write conceptually allocates a new page rather than mutating the old, and snapshots hold roots pointing at the version they read. Our pager's monotonic allocate() is the right primitive for this.
  • db-14 — Indexes. Secondary indexes are additional B+-trees living in the same pager file as the primary. Multiple trees, one pager, one buffer pool.
  • db-15 — SQLite-complete. Stitches db-10..db-14 together. Will add page checksums, the rollback journal or WAL, and the free-list page so that deleted pages don't leak.
  • db-21 — Storage engine advanced. Revisits this pager with CLOCK / 2Q eviction, a freelist, an mmap variant, and possibly a group-commit fsync scheduler.
  • db-22 — Performance and benchmarking. Measures hit ratio, eviction rate, and fsync cost under realistic workloads; compares LRU against alternative policies.

How this lab's pieces show up in real systems

  • The 4 KiB page is the de-facto default in every major engine (Postgres, SQLite, InnoDB, RocksDB SST blocks). It matches both the typical filesystem block size and the Linux page-cache granule, which means partial pages cost no extra readahead.
  • The header-on-page-0 trick is universal: SQLite, BoltDB, InnoDB, even Berkeley DB all reserve page 0 for metadata.
  • Write-back with LRU is the classic buffer-pool design; Postgres calls it shared_buffers, InnoDB calls it innodb_buffer_pool_size, SQLite calls it the page_cache. Our implementation is the textbook version they all started from.
  • fsync-only-on-flush is the contract every transactional engine demands of its pager: the WAL or rollback journal layer above decides when, the pager just provides the primitive. The DBMS literature calls this the "no-force" policy.
  • The doubly-linked-list + hashmap LRU is the pattern in every production buffer pool — Postgres's BufferLookup, InnoDB's buf_LRU, RocksDB's LRUCache, even your CPU's L2 replacement logic. The textbook is real.

Variants worth implementing later

  • CLOCK replacement — a single circular array with a reference bit per frame. Approximates LRU at lower overhead because there's no list splice per access. PostgreSQL uses this.
  • 2Q — two LRU lists, one for "seen once" and one for "seen twice or more". Resists scan-induced cache pollution. Cheap to implement on top of the existing LRU code.
  • ARC (Adaptive Replacement Cache) — IBM's adaptive variant of 2Q. Patented but reimplementable.
  • Copy-on-write pages (LMDB-style) — every write allocates a fresh page; old versions stay live for concurrent readers. Trades higher write amplification for free MVCC.
  • mmap-backed pagermmap the whole file, let the OS manage the page cache. Drastically simpler code; loses control over eviction and durability.

Performance experiments worth running later

  • Plot hit ratio vs cache_capacity / num_pages for each scenario. Expect a knee around 25..50% for the mixed scenario.
  • Measure the cost of flush() as a function of dirty-page count. Sorted writes should be sub-linear vs unsorted on spinning disk.
  • Compare write-back vs write-through latency for a steady stream of small updates. The write-back win should be order-of-magnitude on any device with non-trivial fsync cost.
  • Vary page_size from 256 B to 64 KiB. The hit ratio improves with smaller pages (finer caching granule) but per-operation bookkeeping cost grows.

What "production-quality" would require beyond this lab

  • Crash recovery. Right now, a crash in the middle of a flush leaves a half-written page on disk and no way to detect it. SQLite uses a rollback journal; Postgres uses WAL + a checkpointer. db-13 will introduce the simplest form of this.
  • Checksums. A CRC32 footer per page so torn writes are detectable, not silently returned to the caller.
  • A free-list page so deleted pages can be reused, otherwise files grow monotonically.
  • Concurrent access. Reader-writer latching at the page level so the pager scales to multiple threads.
  • Direct I/O / O_DIRECT to bypass the OS page cache and prevent double-buffering. Needed at high throughput; subtle to get right.
  • Async I/O. io_uring on Linux, IOCP on Windows. The synchronous pread/pwrite we use is fine for teaching and for any workload where the database is not the bottleneck.