db-09 — Analysis
We are stitching together db-03/05/06/08 into the smallest engine that deserves the name database. The hard part is not any single component — we already have all of them — but choosing the smallest set of design decisions that yields crash safety and cross-language byte-identity.
Required invariants
- Durability of
put— onceputreturns, a crash must not lose the write. Achieved by WAL append + fsync before applying to the memtable. - Atomic publish of an SSTable — a recovering process must see either
the complete SST or none of it. Achieved by
write(.tmp) → fsync → rename. (POSIXrenameis atomic with respect to crash.) - Atomic publish of a flush — a recovering process must not see an SST
that MANIFEST does not list, and must not see MANIFEST listing an SST
that does not exist. Achieved by ordering:
write SST → rename SST → rewrite MANIFEST → rename MANIFEST → truncate WAL. A crash between SST-publish and MANIFEST-rewrite leaks an unlisted SST file (harmless and reusable on the next flush via a higher id; we keep it simple and never reuse). A crash between MANIFEST-rewrite and WAL-truncate replays records that are already in the SST —MemTable::putis idempotent for the same key, so this is safe (the duplicate disappears on next flush). - Read precedence — for any key
k, the answer must come from the most recent writer. Order: memtable first, then SSTables in newest-to-oldest order. Tombstones count as a hit. - Cross-language determinism — given the same input script, all three
languages must produce byte-identical
DUMP. Achieved by sharing exactly the formats defined in db-03/05/06/08 plus the WriteBatch wire format defined in this lab.
Design decisions
Why MANIFEST is plain text
LevelDB's MANIFEST is a binary record log of edits ("add file X to level Y", "delete file X", "set next file number to N", ...). That makes log replay fast but is not byte-identity-friendly across languages because each edit record carries varint-encoded fields and an internal version-edit format.
For this lab the live set is small (one process, no concurrent writers, no
compaction) so we use the simpler representation: a text file rewritten on
every flush, atomic by tmp+rename. The cost is one extra O(n) write per
flush where n = number of live SSTables. For our small in-process loads,
this is invisible. db-21 will replace it with the LevelDB-style edit log
when compaction needs incremental atomic edits.
Why one SSTable per flush
LevelDB also writes one SST per flush; that's why they're "L0" files (level 0 is the only level where files may overlap). We keep the same property. "Newest L0 wins" then degenerates from a level-aware rule to a simple position-in-MANIFEST rule.
Why no compaction in db-09
Compaction is a separate concern: it's a background process whose only job is to reduce read amplification and reclaim space. Skipping it means:
- Read cost grows linearly with flush count for keys that miss everything.
- Disk usage grows monotonically — overwrites and deletes are never reclaimed.
Neither breaks correctness, and both are exactly what db-21 will fix. Splitting them keeps each lab small enough to fully verify.
Why the WriteBatch wire format is reused as the WAL record
Two formats are strictly worse than one: more surface area, more chances for a Rust/Go/C++ encoder to diverge. The batch encoder is the WAL serializer. The WAL framing (record-length + CRC32) is db-03's concern; the contents of each record is a single encoded batch.
Why three languages
The same reason as every lab from db-01 onward: the only honest way to
prove that two implementations of a binary protocol agree is to compute
sha256 of their output and compare. With three independent implementations,
the probability that a bug produces matching sha256s is vanishingly small,
so a match line is a near-proof of correctness for the encode + flush +
recover + read pipeline.