db-07: LSM Compaction
0. Why compaction at all?
The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.
Without compaction, three pathologies grow without bound:
| Pathology | Symptom | Bound |
|---|---|---|
| Read amplification | A single get() must check every live SSTable. | O(#SSTables) |
| Space amplification | Obsolete versions and tombstones keep occupying disk. | Total writes / live bytes |
| Index/metadata bloat | Reader has to load every SSTable's index. | O(#SSTables) |
Compaction merges N input SSTables into M output SSTables, applying newest wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.
1. The two strategies (one sentence each)
- Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
- Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.
This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).
2. The mechanism: K-way merge
Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest
(most recently written) and the rest follow in age order.
Output: a single SSTable whose entries are the sorted union of all input keys,
where for any key k the entry is taken from the first input that contains
it. Tombstones are entries — they win against older values just like a put.
The merge is a textbook K-way merge:
- Open all inputs and produce per-input cursors that iterate in key order.
- Push each cursor's current key into a min-heap keyed by
(key, source_index).source_indexis the recency tiebreaker — smaller index = newer. - Pop the smallest. This is the next unique key and its winning entry.
- Emit it (subject to the tombstone-drop rule below).
- Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates).
- Repeat until the heap is empty.
In a min-heap with K cursors and N total entries the merge is O(N log K).
3. Newest-wins semantics
The contract:
- Inputs are ordered by recency. Index 0 is newest.
- For each distinct key, the first input that contains it wins.
- The winning entry's type (Value or Tombstone) is preserved.
- All other versions of that key are discarded.
This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.
4. Tombstone purging
A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.
Two cases:
- Compacting the bottom level. There is nothing older. Every tombstone
whose only remaining copy is in the output is safe to drop. Callers signal
this with
drop_tombstones=true. - Compacting a non-bottom level. Even if no input has an older version of
the key, a deeper level still might. Tombstones must be kept. Callers leave
drop_tombstones=false.
This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.
5. What this lab does NOT do (and why)
- No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
- No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
- No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
- No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.
6. Cross-language contract
The output is a db-06 SSTable. Two implementations that compact the same
inputs in the same order with the same drop_tombstones flag must produce
byte-identical outputs. We verify this with sha256 across rust/go/cpp.
7. Failure modes worth recognizing
| Bug | Symptom |
|---|---|
| Wrong recency tiebreaker (older wins on ties) | After compaction, a recently-overwritten key reverts. |
| Forgetting to advance non-winning duplicates | Same key appears multiple times in output → SstWriter errors. |
| Comparing keys as strings (UTF-8) not bytes | Non-ASCII keys order wrong; cross-lang sha256 diverges. |
| Dropping tombstones when not at bottom | Deleted keys reappear from a deeper level. |
| Emitting an empty block instead of empty SSTable | File size ≠ 36 for empty merge; reader rejects. |
8. Hand-trace template (the smallest interesting example)
Inputs (newest first):
A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]
Step-by-step heap state and emit:
| step | heap (key,src) | pop | emit | notes |
|---|---|---|---|---|
| 0 | (a,A) (a,B) | (a,A) | a → V "1" | also advance B past "a" |
| 1 | (b,A) (c,B) | (b,A) | b → T | tombstone preserved |
| 2 | (c,B) | (c,B) | c → V "9" | A is exhausted |
| 3 | (empty) | - | - | done |
Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions
of a and b win, c comes from B.