Advanced Storage Engine

Lab status: complete. All unit tests pass; scripts/cross_test.sh proves three independent implementations (Rust, Go, C++) produce byte-identical canonical wire dumps for three fixed workloads.

1. What Is It

A standalone study of two pieces that turn a textbook LSM tree into something closer to RocksDB / LevelDB strength:

  1. Range tombstones — a single record that logically deletes every key in a half-open interval [start, end), instead of writing one Delete per key.
  2. Compaction policiessize-tiered (Cassandra/Scylla heritage) and Universal (RocksDB's flagship), expressed as deterministic, side-effect- free functions over the sequence of SSTs.

Everything else (block cache, bloom layout, manifest, WAL, MVCC) is held at its simplest possible form so the two ideas above are studied in isolation.

2. Why Care

  • Range tombstones make DELETE FROM t WHERE id BETWEEN ?, ? and TTL-style "drop everything older than X" affordable. Without them, a one-million-key range delete writes one million Delete entries — and worse, blocks all subsequent reads until those tombstones reach the bottom level.
  • Compaction policies are the single biggest knob in any LSM. Size-tiered minimises write amplification at the cost of read amplification; Universal bounds the SST count while preserving recency. Choosing one is choosing the workload shape you'll be good at.

3. Core Data Structures

TypePurpose
Entry { Put(k,v) | Delete(k) }The point-write unit.
RangeTomb { start, end }Half-open interval delete; start ≤ k < end.
SimpleBloom (u64)64-bit single-word bloom; two FNV-1a positions.
Sst { smallest, largest, entries, range_tombstones, bloom }An immutable run.
LsmTreeAdvanced { ssts (newest first), ratio }The whole tree.

Sst::size() = entries.len() + 2 * range_tombstones.len(). The ×2 weight on tombstones is deliberate: it makes compaction more eager when tombstones pile up, which matches real-world tuning advice.

4. The Six Algorithms in One Page

  1. Build SST. Walk pending entries right-to-left, mark each key's last occurrence as keep. Then walk left-to-right emitting kept entries (preserves insertion order of survivors). Compute smallest/largest and the bloom in the same left-to-right pass. Range tombstones are copied verbatim.

  2. Get(key). Walk SSTs newest → oldest, accumulating active tombstones into a vector as we go. For each SST:

    • append its tombs to active,
    • if any tomb in active covers key, return None (early exit),
    • if bloom misses, continue (bounds and entries skipped, but older SSTs may still contain covering tombs — so the walk continues),
    • if key < smallest or key > largest, continue for the same reason,
    • linear scan entries; first hit returns Some(v) for Put, None for Delete.
  3. Size-tiered compaction. Pick the longest prefix L ∈ [2, n-1] such that Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]). Merge that prefix into one SST and insert it at the newest position (index 0). If no such L exists, return false.

  4. Universal compaction. Pick the longest contiguous run [i, i+L) with L ≥ 3 such that Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run is followed by something at least twice as big). Ties broken by smaller i. Merge the run, replace it in place.

  5. Merge. Walk the run newest → oldest. For each SST:

    • copy its range tombs into out_tombs (preserved verbatim),
    • for each entry: skip if seen[key] (newer-wins), skip if covered by any previously active tomb, otherwise keep it; mark seen,
    • append the SST's range tombs to active.

    Finally sort out_entries by key for determinism and recompute the bloom + bounds.

  6. Dump (canonical wire format). A length-prefixed binary blob, little- endian throughout. Magic "DSEADV21"f64 ratiou32 sst_count ‖ per SST: lenpref(smallest) ‖ lenpref(largest) ‖ u32 nE ‖ entries (u8 kind ‖ lenpref(key) ‖ if Put: lenpref(val)) ‖ u32 nT ‖ tombs (lenpref(start) ‖ lenpref(end)) ‖ u64 bloom.

5. What's Deliberately Not Here

  • No WAL — recovery is out of scope; the tree is in-memory.
  • No block cache, no separate index/filter blocks — the bloom is one u64.
  • No level structure — SSTs are a flat list, newest first.
  • No snapshots / MVCC — Get is point-in-time only.
  • No concurrency — everything is single-threaded; SSTs are immutable so reads-with-merges would be trivially safe.

These omissions are why the lab fits in three files per language while still exercising the two ideas (range tombstones, compaction policy) at the depth where their subtleties bite.

6. Pointers to Cross-Language Equivalence

The whole point of the lab is that three independent implementations agree byte-for-byte, not just at API level. The shared deterministic workload (SplitMix64 PRNG, three draws per op, fixed flush/compact cadence) and the canonical wire format (Section 4.6) are the two halves of that contract. scripts/cross_test.sh enforces it with three hard-coded sha256 fixtures.

See docs/execution.md for the format spec, docs/verification.md for the expected output of the verification scripts, and docs/analysis.md for the design forces behind both range tombstones and the two compaction policies.