db-09 — LevelDB Complete

What is it?

A small but end-to-end LSM-tree key-value store assembled from the labs we have built so far. It is the smallest interesting "real database" we can ship: opens a directory, durably accepts put/delete/batch writes, answers get/scan queries, and survives crashes — using the WAL (db-03), MemTable (db-05), SSTable format (db-06), and merging iterator (db-08) as its parts.

The engine deliberately stops short of automatic background compaction. That arrives in db-21; here we keep the focus on correctness of the read path across multiple immutable L0 SSTables and a live memtable.

Why does it matter?

This is the first lab where the labels start to look like the things you actually run in production:

Db::open(dir) + MANIFEST — every LSM-shaped store (LevelDB, RocksDB, Pebble, Cassandra's SSTable subsystem, HBase HFile) has exactly this contract: a directory is the database, a manifest enumerates which files are live, and recovery rebuilds in-memory state by reading the manifest and replaying the WAL.
The write path's three steps — encode batch → append+fsync WAL → apply to memtable — is the universal LSM commit. Almost every storage engine on Earth does these three things in this order. Once you internalize why (durability before visibility), you can read any LSM source tree.
The read path — memtable then newest SSTable then older SSTables, with the first hit (Value or Tombstone) winning — is the core LSM invariant. Compaction in db-21 is just amortizing this work; it doesn't change the rule.

If you understand this lab, you understand the shape of LevelDB.

How does it work?

                 ┌────── Db (one directory) ─────────────────────┐
                 │                                                │
   write path    │  WriteBatch ─► encode ─► WAL.append + fsync   │
   ───────────►  │                            │                  │
                 │                            ▼                  │
                 │                      MemTable (in RAM)        │
                 │                            │                  │
                 │                  size/explicit Flush          │
                 │                            ▼                  │
                 │   sst-NNNNNN.sst.tmp ─► fsync ─► rename       │
                 │                            │                  │
                 │            prepend (id, SstReader) to L0      │
                 │            rewrite MANIFEST atomically        │
                 │            close+delete+reopen WAL            │
                 │                                                │
   read path     │  Get(k):  MemTable → L0[0] → L0[1] → …        │
   ───────────►  │           first hit wins; Tombstone ⇒ None    │
                 │                                                │
                 │  Scan:    MergingIterator over                 │
                 │             [MemTable, L0[0], L0[1], …]       │
                 │             drop_tombstones=true               │
                 └────────────────────────────────────────────────┘

On-disk layout

<dir>/
  MANIFEST            text; one "L0 <id>" line per live SSTable, newest first
  wal.log             db-03 WAL of WriteBatch records (binary)
  sst-000001.sst      db-06 SSTable, one per flush, zero-padded 6-digit id
  sst-000002.sst
  ...

Recovery (`Db::open`)

mkdir -p the directory.
Read MANIFEST line by line; each line is L0 <id> newest-first.
Open each SSTable in that order; track max_id.
Open the WAL with WalIter and replay every record (WriteBatch::decode then apply to a fresh memtable). Any torn tail is dropped by the WAL iterator (db-03 invariant).
Open the WAL for writes; set next_id = max_id + 1.

Write path

put(k, v) ≡ Write(WriteBatch{Put{k,v}})
del(k)    ≡ Write(WriteBatch{Delete{k}})

Write(batch):
    bytes  = batch.encode()
    wal.append(bytes); wal.sync()    # durability first
    apply(batch, memtable)           # then visibility

The batch wire format is identical in-memory and on-WAL:

u32 LE count
  for each op:
    u8 type          # 0 = Put, 1 = Delete
    u32 LE klen
    key bytes
    if Put:
      u32 LE vlen
      value bytes

This is the same encoder/decoder Rust, Go, and C++ all use, which is what makes the cross-language byte-identity test possible.

Flush

Flush():
    if memtable empty: return
    id = next_id++
    build SstWriter from memtable.sorted()
    write sst-id.sst.tmp; fsync; rename → sst-id.sst   # crash-safe publish
    prepend (id, SstReader) to ssts                   # newest first
    rewrite MANIFEST atomically (tmp + rename)
    wal.close(); remove(wal.log); wal = Wal::open(wal.log)
    memtable = MemTable::new()

The order matters: the SST must be durably renamed before we rewrite MANIFEST, and MANIFEST must be durably renamed before we truncate the WAL. If we crash between any two steps, recovery is safe — either the WAL still has the records, or the SST is on disk and listed in MANIFEST.

Read path

Get(k) walks the in-RAM memtable first, then SSTables newest-first. The first hit wins:

Source hit returns	Result
`Value(v)`	`Some(v)`
`Tombstone`	`None`
miss	continue

If nothing matches, return None.

ScanAll() and SerializeView() reuse db-08's MergingIterator. The memtable is materialized into a KeyEntry vector (already sorted by BTreeMap or std::map), then the iterator merges it with each SSTable's entries, preferring the newer source on ties (memtable beats L0[0] beats L0[1] ...).

What's intentionally out of scope

Compaction — db-21. Without it, repeated overwrites of the same key accumulate as more L0 SSTables. Reads stay correct (newest wins) but per-Get work grows linearly in flush count.
Snapshots / MVCC — db-13.
Sharding, replication, sequence numbers — db-16+.
Bloom filters per SSTable — built in db-04; wiring is a db-21 optimization (skip SSTables whose Bloom rejects the key).

Cross-language invariant

All three implementations expose a dbctl --dir DIR CLI that reads a script from stdin (PUT k v, DEL k, FLUSH, DUMP, DUMP_WITH_TOMBS). scripts/cross_test.sh drives the same script through each, performs a crash/recover cycle by closing and reopening the database, then compares sha256(DUMP) and sha256(DUMP_WITH_TOMBS) across Rust, Go, and C++.

A byte-identical DUMP after recovery proves that all three implementations agree on: the WAL record format, the SSTable format, the MANIFEST format, the merge ordering, the tombstone semantics, and the recovery procedure.

Distributed Systems Engineer — Build Databases & Consensus From Scratch