db-09 — Analysis

We are stitching together db-03/05/06/08 into the smallest engine that deserves the name database. The hard part is not any single component — we already have all of them — but choosing the smallest set of design decisions that yields crash safety and cross-language byte-identity.

Required invariants

  1. Durability of put — once put returns, a crash must not lose the write. Achieved by WAL append + fsync before applying to the memtable.
  2. Atomic publish of an SSTable — a recovering process must see either the complete SST or none of it. Achieved by write(.tmp) → fsync → rename. (POSIX rename is atomic with respect to crash.)
  3. Atomic publish of a flush — a recovering process must not see an SST that MANIFEST does not list, and must not see MANIFEST listing an SST that does not exist. Achieved by ordering: write SST → rename SST → rewrite MANIFEST → rename MANIFEST → truncate WAL. A crash between SST-publish and MANIFEST-rewrite leaks an unlisted SST file (harmless and reusable on the next flush via a higher id; we keep it simple and never reuse). A crash between MANIFEST-rewrite and WAL-truncate replays records that are already in the SST — MemTable::put is idempotent for the same key, so this is safe (the duplicate disappears on next flush).
  4. Read precedence — for any key k, the answer must come from the most recent writer. Order: memtable first, then SSTables in newest-to-oldest order. Tombstones count as a hit.
  5. Cross-language determinism — given the same input script, all three languages must produce byte-identical DUMP. Achieved by sharing exactly the formats defined in db-03/05/06/08 plus the WriteBatch wire format defined in this lab.

Design decisions

Why MANIFEST is plain text

LevelDB's MANIFEST is a binary record log of edits ("add file X to level Y", "delete file X", "set next file number to N", ...). That makes log replay fast but is not byte-identity-friendly across languages because each edit record carries varint-encoded fields and an internal version-edit format.

For this lab the live set is small (one process, no concurrent writers, no compaction) so we use the simpler representation: a text file rewritten on every flush, atomic by tmp+rename. The cost is one extra O(n) write per flush where n = number of live SSTables. For our small in-process loads, this is invisible. db-21 will replace it with the LevelDB-style edit log when compaction needs incremental atomic edits.

Why one SSTable per flush

LevelDB also writes one SST per flush; that's why they're "L0" files (level 0 is the only level where files may overlap). We keep the same property. "Newest L0 wins" then degenerates from a level-aware rule to a simple position-in-MANIFEST rule.

Why no compaction in db-09

Compaction is a separate concern: it's a background process whose only job is to reduce read amplification and reclaim space. Skipping it means:

  • Read cost grows linearly with flush count for keys that miss everything.
  • Disk usage grows monotonically — overwrites and deletes are never reclaimed.

Neither breaks correctness, and both are exactly what db-21 will fix. Splitting them keeps each lab small enough to fully verify.

Why the WriteBatch wire format is reused as the WAL record

Two formats are strictly worse than one: more surface area, more chances for a Rust/Go/C++ encoder to diverge. The batch encoder is the WAL serializer. The WAL framing (record-length + CRC32) is db-03's concern; the contents of each record is a single encoded batch.

Why three languages

The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three independent implementations, the probability that a bug produces matching sha256s is vanishingly small, so a match line is a near-proof of correctness for the encode + flush + recover + read pipeline.