db-09 — LevelDB Complete
What is it?
A small but end-to-end LSM-tree key-value store assembled from the labs we
have built so far. It is the smallest interesting "real database" we can ship:
opens a directory, durably accepts put/delete/batch writes, answers
get/scan queries, and survives crashes — using the WAL (db-03), MemTable
(db-05), SSTable format (db-06), and merging iterator (db-08) as its parts.
The engine deliberately stops short of automatic background compaction. That arrives in db-21; here we keep the focus on correctness of the read path across multiple immutable L0 SSTables and a live memtable.
Why does it matter?
This is the first lab where the labels start to look like the things you actually run in production:
Db::open(dir)+MANIFEST— every LSM-shaped store (LevelDB, RocksDB, Pebble, Cassandra's SSTable subsystem, HBase HFile) has exactly this contract: a directory is the database, a manifest enumerates which files are live, and recovery rebuilds in-memory state by reading the manifest and replaying the WAL.- The write path's three steps — encode batch → append+fsync WAL → apply to memtable — is the universal LSM commit. Almost every storage engine on Earth does these three things in this order. Once you internalize why (durability before visibility), you can read any LSM source tree.
- The read path — memtable then newest SSTable then older SSTables, with the first hit (Value or Tombstone) winning — is the core LSM invariant. Compaction in db-21 is just amortizing this work; it doesn't change the rule.
If you understand this lab, you understand the shape of LevelDB.
How does it work?
┌────── Db (one directory) ─────────────────────┐
│ │
write path │ WriteBatch ─► encode ─► WAL.append + fsync │
───────────► │ │ │
│ ▼ │
│ MemTable (in RAM) │
│ │ │
│ size/explicit Flush │
│ ▼ │
│ sst-NNNNNN.sst.tmp ─► fsync ─► rename │
│ │ │
│ prepend (id, SstReader) to L0 │
│ rewrite MANIFEST atomically │
│ close+delete+reopen WAL │
│ │
read path │ Get(k): MemTable → L0[0] → L0[1] → … │
───────────► │ first hit wins; Tombstone ⇒ None │
│ │
│ Scan: MergingIterator over │
│ [MemTable, L0[0], L0[1], …] │
│ drop_tombstones=true │
└────────────────────────────────────────────────┘
On-disk layout
<dir>/
MANIFEST text; one "L0 <id>" line per live SSTable, newest first
wal.log db-03 WAL of WriteBatch records (binary)
sst-000001.sst db-06 SSTable, one per flush, zero-padded 6-digit id
sst-000002.sst
...
Recovery (Db::open)
mkdir -pthe directory.- Read
MANIFESTline by line; each line isL0 <id>newest-first. - Open each SSTable in that order; track
max_id. - Open the WAL with
WalIterand replay every record (WriteBatch::decodethen apply to a fresh memtable). Any torn tail is dropped by the WAL iterator (db-03 invariant). - Open the WAL for writes; set
next_id = max_id + 1.
Write path
put(k, v) ≡ Write(WriteBatch{Put{k,v}})
del(k) ≡ Write(WriteBatch{Delete{k}})
Write(batch):
bytes = batch.encode()
wal.append(bytes); wal.sync() # durability first
apply(batch, memtable) # then visibility
The batch wire format is identical in-memory and on-WAL:
u32 LE count
for each op:
u8 type # 0 = Put, 1 = Delete
u32 LE klen
key bytes
if Put:
u32 LE vlen
value bytes
This is the same encoder/decoder Rust, Go, and C++ all use, which is what makes the cross-language byte-identity test possible.
Flush
Flush():
if memtable empty: return
id = next_id++
build SstWriter from memtable.sorted()
write sst-id.sst.tmp; fsync; rename → sst-id.sst # crash-safe publish
prepend (id, SstReader) to ssts # newest first
rewrite MANIFEST atomically (tmp + rename)
wal.close(); remove(wal.log); wal = Wal::open(wal.log)
memtable = MemTable::new()
The order matters: the SST must be durably renamed before we rewrite MANIFEST, and MANIFEST must be durably renamed before we truncate the WAL. If we crash between any two steps, recovery is safe — either the WAL still has the records, or the SST is on disk and listed in MANIFEST.
Read path
Get(k) walks the in-RAM memtable first, then SSTables newest-first. The
first hit wins:
| Source hit returns | Result |
|---|---|
Value(v) | Some(v) |
Tombstone | None |
| miss | continue |
If nothing matches, return None.
ScanAll() and SerializeView() reuse db-08's MergingIterator. The
memtable is materialized into a KeyEntry vector (already sorted by BTreeMap
or std::map), then the iterator merges it with each SSTable's entries,
preferring the newer source on ties (memtable beats L0[0] beats L0[1] ...).
What's intentionally out of scope
- Compaction — db-21. Without it, repeated overwrites of the same key
accumulate as more L0 SSTables. Reads stay correct (newest wins) but
per-
Getwork grows linearly in flush count. - Snapshots / MVCC — db-13.
- Sharding, replication, sequence numbers — db-16+.
- Bloom filters per SSTable — built in db-04; wiring is a db-21 optimization (skip SSTables whose Bloom rejects the key).
Cross-language invariant
All three implementations expose a dbctl --dir DIR CLI that reads a
script from stdin (PUT k v, DEL k, FLUSH, DUMP, DUMP_WITH_TOMBS).
scripts/cross_test.sh drives the same script through each, performs a
crash/recover cycle by closing and reopening the database, then compares
sha256(DUMP) and sha256(DUMP_WITH_TOMBS) across Rust, Go, and C++.
A byte-identical DUMP after recovery proves that all three implementations
agree on: the WAL record format, the SSTable format, the MANIFEST format,
the merge ordering, the tombstone semantics, and the recovery procedure.