db-11 — Broader Ideas
Where this disk-backed pager fits in the rest of the track, and which real-world techniques live one or two steps beyond it.
Immediate next labs
- db-12 — SQL frontend. The first consumer of the pager. A row
in a table becomes some bytes inside some page; an
INSERTis aPager::write. The B+-tree layer that maps rows-to-pages is built in db-15 but its scaffolding starts here. - db-13 — Transactions and MVCC. Each transaction sees a
consistent snapshot of the pager. The simplest implementation is
copy-on-write at the page level: a write conceptually allocates a
new page rather than mutating the old, and snapshots hold roots
pointing at the version they read. Our pager's monotonic
allocate()is the right primitive for this. - db-14 — Indexes. Secondary indexes are additional B+-trees living in the same pager file as the primary. Multiple trees, one pager, one buffer pool.
- db-15 — SQLite-complete. Stitches db-10..db-14 together. Will add page checksums, the rollback journal or WAL, and the free-list page so that deleted pages don't leak.
- db-21 — Storage engine advanced. Revisits this pager with CLOCK / 2Q eviction, a freelist, an mmap variant, and possibly a group-commit fsync scheduler.
- db-22 — Performance and benchmarking. Measures hit ratio, eviction rate, and fsync cost under realistic workloads; compares LRU against alternative policies.
How this lab's pieces show up in real systems
- The 4 KiB page is the de-facto default in every major engine (Postgres, SQLite, InnoDB, RocksDB SST blocks). It matches both the typical filesystem block size and the Linux page-cache granule, which means partial pages cost no extra readahead.
- The header-on-page-0 trick is universal: SQLite, BoltDB, InnoDB, even Berkeley DB all reserve page 0 for metadata.
- Write-back with LRU is the classic buffer-pool design;
Postgres calls it
shared_buffers, InnoDB calls itinnodb_buffer_pool_size, SQLite calls it thepage_cache. Our implementation is the textbook version they all started from. - fsync-only-on-flush is the contract every transactional engine demands of its pager: the WAL or rollback journal layer above decides when, the pager just provides the primitive. The DBMS literature calls this the "no-force" policy.
- The doubly-linked-list + hashmap LRU is the pattern in
every production buffer pool — Postgres's
BufferLookup, InnoDB'sbuf_LRU, RocksDB'sLRUCache, even your CPU's L2 replacement logic. The textbook is real.
Variants worth implementing later
- CLOCK replacement — a single circular array with a reference bit per frame. Approximates LRU at lower overhead because there's no list splice per access. PostgreSQL uses this.
- 2Q — two LRU lists, one for "seen once" and one for "seen twice or more". Resists scan-induced cache pollution. Cheap to implement on top of the existing LRU code.
- ARC (Adaptive Replacement Cache) — IBM's adaptive variant of 2Q. Patented but reimplementable.
- Copy-on-write pages (LMDB-style) — every write allocates a fresh page; old versions stay live for concurrent readers. Trades higher write amplification for free MVCC.
- mmap-backed pager —
mmapthe whole file, let the OS manage the page cache. Drastically simpler code; loses control over eviction and durability.
Performance experiments worth running later
- Plot hit ratio vs
cache_capacity / num_pagesfor each scenario. Expect a knee around 25..50% for the mixed scenario. - Measure the cost of
flush()as a function of dirty-page count. Sorted writes should be sub-linear vs unsorted on spinning disk. - Compare write-back vs write-through latency for a steady stream of small updates. The write-back win should be order-of-magnitude on any device with non-trivial fsync cost.
- Vary
page_sizefrom 256 B to 64 KiB. The hit ratio improves with smaller pages (finer caching granule) but per-operation bookkeeping cost grows.
What "production-quality" would require beyond this lab
- Crash recovery. Right now, a crash in the middle of a flush leaves a half-written page on disk and no way to detect it. SQLite uses a rollback journal; Postgres uses WAL + a checkpointer. db-13 will introduce the simplest form of this.
- Checksums. A CRC32 footer per page so torn writes are detectable, not silently returned to the caller.
- A free-list page so deleted pages can be reused, otherwise files grow monotonically.
- Concurrent access. Reader-writer latching at the page level so the pager scales to multiple threads.
- Direct I/O /
O_DIRECTto bypass the OS page cache and prevent double-buffering. Needed at high throughput; subtle to get right. - Async I/O.
io_uringon Linux, IOCP on Windows. The synchronouspread/pwritewe use is fine for teaching and for any workload where the database is not the bottleneck.