Advanced Storage Engine
Lab status: complete. All unit tests pass;
scripts/cross_test.shproves three independent implementations (Rust, Go, C++) produce byte-identical canonical wire dumps for three fixed workloads.
1. What Is It
A standalone study of two pieces that turn a textbook LSM tree into something closer to RocksDB / LevelDB strength:
- Range tombstones — a single record that logically deletes every key in
a half-open interval
[start, end), instead of writing one Delete per key. - Compaction policies — size-tiered (Cassandra/Scylla heritage) and Universal (RocksDB's flagship), expressed as deterministic, side-effect- free functions over the sequence of SSTs.
Everything else (block cache, bloom layout, manifest, WAL, MVCC) is held at its simplest possible form so the two ideas above are studied in isolation.
2. Why Care
- Range tombstones make
DELETE FROM t WHERE id BETWEEN ?, ?and TTL-style "drop everything older than X" affordable. Without them, a one-million-key range delete writes one million Delete entries — and worse, blocks all subsequent reads until those tombstones reach the bottom level. - Compaction policies are the single biggest knob in any LSM. Size-tiered minimises write amplification at the cost of read amplification; Universal bounds the SST count while preserving recency. Choosing one is choosing the workload shape you'll be good at.
3. Core Data Structures
| Type | Purpose |
|---|---|
Entry { Put(k,v) | Delete(k) } | The point-write unit. |
RangeTomb { start, end } | Half-open interval delete; start ≤ k < end. |
SimpleBloom (u64) | 64-bit single-word bloom; two FNV-1a positions. |
Sst { smallest, largest, entries, range_tombstones, bloom } | An immutable run. |
LsmTreeAdvanced { ssts (newest first), ratio } | The whole tree. |
Sst::size() = entries.len() + 2 * range_tombstones.len(). The ×2 weight on
tombstones is deliberate: it makes compaction more eager when tombstones pile
up, which matches real-world tuning advice.
4. The Six Algorithms in One Page
-
Build SST. Walk pending entries right-to-left, mark each key's last occurrence as
keep. Then walk left-to-right emitting kept entries (preserves insertion order of survivors). Compute smallest/largest and the bloom in the same left-to-right pass. Range tombstones are copied verbatim. -
Get(key). Walk SSTs newest → oldest, accumulating active tombstones into a vector as we go. For each SST:
- append its tombs to
active, - if any tomb in
activecoverskey, returnNone(early exit), - if bloom misses,
continue(bounds and entries skipped, but older SSTs may still contain covering tombs — so the walk continues), - if
key < smallestorkey > largest,continuefor the same reason, - linear scan entries; first hit returns
Some(v)for Put,Nonefor Delete.
- append its tombs to
-
Size-tiered compaction. Pick the longest prefix
L ∈ [2, n-1]such thatΣ size(ssts[0..L]) ≤ ratio · size(ssts[L]). Merge that prefix into one SST and insert it at the newest position (index 0). If no suchLexists, returnfalse. -
Universal compaction. Pick the longest contiguous run
[i, i+L)withL ≥ 3such thatΣ size(run) ≤ ratio · size(ssts[i+L])(i.e. the run is followed by something at least twice as big). Ties broken by smalleri. Merge the run, replace it in place. -
Merge. Walk the run newest → oldest. For each SST:
- copy its range tombs into
out_tombs(preserved verbatim), - for each entry: skip if
seen[key](newer-wins), skip if covered by any previously active tomb, otherwise keep it; markseen, - append the SST's range tombs to
active.
Finally sort
out_entriesby key for determinism and recompute the bloom + bounds. - copy its range tombs into
-
Dump (canonical wire format). A length-prefixed binary blob, little- endian throughout. Magic
"DSEADV21"‖f64 ratio‖u32 sst_count‖ per SST: lenpref(smallest) ‖ lenpref(largest) ‖u32 nE‖ entries (u8 kind‖ lenpref(key) ‖ if Put: lenpref(val)) ‖u32 nT‖ tombs (lenpref(start) ‖ lenpref(end)) ‖u64 bloom.
5. What's Deliberately Not Here
- No WAL — recovery is out of scope; the tree is in-memory.
- No block cache, no separate index/filter blocks — the bloom is one
u64. - No level structure — SSTs are a flat list, newest first.
- No snapshots / MVCC —
Getis point-in-time only. - No concurrency — everything is single-threaded; SSTs are immutable so reads-with-merges would be trivially safe.
These omissions are why the lab fits in three files per language while still exercising the two ideas (range tombstones, compaction policy) at the depth where their subtleties bite.
6. Pointers to Cross-Language Equivalence
The whole point of the lab is that three independent implementations agree
byte-for-byte, not just at API level. The shared deterministic workload
(SplitMix64 PRNG, three draws per op, fixed flush/compact cadence) and the
canonical wire format (Section 4.6) are the two halves of that contract.
scripts/cross_test.sh enforces it with three hard-coded sha256 fixtures.
See docs/execution.md for the format spec, docs/verification.md for the
expected output of the verification scripts, and docs/analysis.md for the
design forces behind both range tombstones and the two compaction policies.