db-07: LSM Compaction

0. Why compaction at all?

The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.

Without compaction, three pathologies grow without bound:

Pathology	Symptom	Bound
Read amplification	A single get() must check every live SSTable.	O(#SSTables)
Space amplification	Obsolete versions and tombstones keep occupying disk.	Total writes / live bytes
Index/metadata bloat	Reader has to load every SSTable's index.	O(#SSTables)

Compaction merges N input SSTables into M output SSTables, applying newest wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.

1. The two strategies (one sentence each)

Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.

This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).

2. The mechanism: K-way merge

Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest (most recently written) and the rest follow in age order.

Output: a single SSTable whose entries are the sorted union of all input keys, where for any key k the entry is taken from the first input that contains it. Tombstones are entries — they win against older values just like a put.

The merge is a textbook K-way merge:

Open all inputs and produce per-input cursors that iterate in key order.
Push each cursor's current key into a min-heap keyed by (key, source_index). source_index is the recency tiebreaker — smaller index = newer.
Pop the smallest. This is the next unique key and its winning entry.
Emit it (subject to the tombstone-drop rule below).
Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates).
Repeat until the heap is empty.

In a min-heap with K cursors and N total entries the merge is O(N log K).

3. Newest-wins semantics

The contract:

Inputs are ordered by recency. Index 0 is newest.
For each distinct key, the first input that contains it wins.
The winning entry's type (Value or Tombstone) is preserved.
All other versions of that key are discarded.

This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.

4. Tombstone purging

A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.

Two cases:

Compacting the bottom level. There is nothing older. Every tombstone whose only remaining copy is in the output is safe to drop. Callers signal this with drop_tombstones=true.
Compacting a non-bottom level. Even if no input has an older version of the key, a deeper level still might. Tombstones must be kept. Callers leave drop_tombstones=false.

This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.

5. What this lab does NOT do (and why)

No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.

6. Cross-language contract

The output is a db-06 SSTable. Two implementations that compact the same inputs in the same order with the same drop_tombstones flag must produce byte-identical outputs. We verify this with sha256 across rust/go/cpp.

7. Failure modes worth recognizing

Bug	Symptom
Wrong recency tiebreaker (older wins on ties)	After compaction, a recently-overwritten key reverts.
Forgetting to advance non-winning duplicates	Same key appears multiple times in output → SstWriter errors.
Comparing keys as strings (UTF-8) not bytes	Non-ASCII keys order wrong; cross-lang sha256 diverges.
Dropping tombstones when not at bottom	Deleted keys reappear from a deeper level.
Emitting an empty block instead of empty SSTable	File size ≠ 36 for empty merge; reader rejects.

8. Hand-trace template (the smallest interesting example)

Inputs (newest first):

A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]

Step-by-step heap state and emit:

step	heap (key,src)	pop	emit	notes
0	(a,A) (a,B)	(a,A)	a → V "1"	also advance B past "a"
1	(b,A) (c,B)	(b,A)	b → T	tombstone preserved
2	(c,B)	(c,B)	c → V "9"	A is exhausted
3	(empty)	-	-	done

Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions of a and b win, c comes from B.

Distributed Systems Engineer — Build Databases & Consensus From Scratch