db-07: LSM Compaction

0. Why compaction at all?

The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.

Without compaction, three pathologies grow without bound:

PathologySymptomBound
Read amplificationA single get() must check every live SSTable.O(#SSTables)
Space amplificationObsolete versions and tombstones keep occupying disk.Total writes / live bytes
Index/metadata bloatReader has to load every SSTable's index.O(#SSTables)

Compaction merges N input SSTables into M output SSTables, applying newest wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.

1. The two strategies (one sentence each)

  • Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
  • Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.

This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).

2. The mechanism: K-way merge

Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest (most recently written) and the rest follow in age order.

Output: a single SSTable whose entries are the sorted union of all input keys, where for any key k the entry is taken from the first input that contains it. Tombstones are entries — they win against older values just like a put.

The merge is a textbook K-way merge:

  1. Open all inputs and produce per-input cursors that iterate in key order.
  2. Push each cursor's current key into a min-heap keyed by (key, source_index). source_index is the recency tiebreaker — smaller index = newer.
  3. Pop the smallest. This is the next unique key and its winning entry.
  4. Emit it (subject to the tombstone-drop rule below).
  5. Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates).
  6. Repeat until the heap is empty.

In a min-heap with K cursors and N total entries the merge is O(N log K).

3. Newest-wins semantics

The contract:

  • Inputs are ordered by recency. Index 0 is newest.
  • For each distinct key, the first input that contains it wins.
  • The winning entry's type (Value or Tombstone) is preserved.
  • All other versions of that key are discarded.

This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.

4. Tombstone purging

A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.

Two cases:

  • Compacting the bottom level. There is nothing older. Every tombstone whose only remaining copy is in the output is safe to drop. Callers signal this with drop_tombstones=true.
  • Compacting a non-bottom level. Even if no input has an older version of the key, a deeper level still might. Tombstones must be kept. Callers leave drop_tombstones=false.

This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.

5. What this lab does NOT do (and why)

  • No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
  • No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
  • No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
  • No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.

6. Cross-language contract

The output is a db-06 SSTable. Two implementations that compact the same inputs in the same order with the same drop_tombstones flag must produce byte-identical outputs. We verify this with sha256 across rust/go/cpp.

7. Failure modes worth recognizing

BugSymptom
Wrong recency tiebreaker (older wins on ties)After compaction, a recently-overwritten key reverts.
Forgetting to advance non-winning duplicatesSame key appears multiple times in output → SstWriter errors.
Comparing keys as strings (UTF-8) not bytesNon-ASCII keys order wrong; cross-lang sha256 diverges.
Dropping tombstones when not at bottomDeleted keys reappear from a deeper level.
Emitting an empty block instead of empty SSTableFile size ≠ 36 for empty merge; reader rejects.

8. Hand-trace template (the smallest interesting example)

Inputs (newest first):

A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]

Step-by-step heap state and emit:

stepheap (key,src)popemitnotes
0(a,A) (a,B)(a,A)a → V "1"also advance B past "a"
1(b,A) (c,B)(b,A)b → Ttombstone preserved
2(c,B)(c,B)c → V "9"A is exhausted
3(empty)--done

Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions of a and b win, c comes from B.