Analysis — db-21 Advanced Storage Engine

1. Problem Statement

Two engineering questions, studied in isolation:

  1. Range deletes. How does an LSM delete a key range [a, b) cheaply, without writing one Delete per key, and without losing correctness if a newer Put falls inside the same range?
  2. Compaction policy. How do size-tiered and Universal compaction actually differ — not as marketing words, but as deterministic functions over the current SST sequence?

The lab refuses to answer these with prose alone. It demands an executable specification that three language ports must agree on byte-for-byte.

2. Why Three Languages

Cross-language byte agreement is the cheapest sanity check that survives refactoring. If Rust drifts from Go on fixture A, the failure tells you exactly which side broke: the diff between the two dump_state() blobs is a structured binary, decodable by eye.

It also forces the design through three different idiom sets:

  • Rust keeps Option<Vec<u8>> for Get, enum Entry { Put, Delete } for the entry kind, and uses Vec<u8> everywhere for keys/values. Slices for bounds; no copies in merge_run's hot path.
  • Go uses []byte plus bytes.Compare. A map[string]struct{} stands in for the dedupe set. math.Float64bits for the ratio encoding.
  • C++ uses std::string as a byte container (avoids the char_traits trap), std::optional<std::string> for Get, and an inline 64-line SHA-256 in lsmctl.cc to keep the dependency surface at zero.

If you can read the same algorithm in all three and they line up at the byte level, the algorithm is unambiguous. That's the deliverable.

3. Range Tombstones — The Subtlety

The single non-obvious rule is:

A range tombstone hides keys older than it, but is itself hidden by Puts newer than it.

Both halves matter. Test range_tomb_respects_newer_put exists because a naive implementation that consults all tombstones before walking entries will silently drop the fresh value.

The implementation enforces this by walking SSTs newest → oldest and accumulating active tombstones as the walk descends. A Put in a newer SST is checked against the (then still empty) active set, so it survives. A Put in an older SST is checked against the (by then populated) active set, so it is hidden.

This also explains why a bloom miss must continue instead of return None: the SST we just skipped might have zero matching keys, but it could still contribute a range tombstone that shadows something below it. The active set must be allowed to grow.

4. Size-Tiered vs Universal — The Real Difference

Both are "merge several SSTs into one". The difference is which several.

  • Size-tiered asks: "is there a prefix [0..L) of new, small SSTs that together fit within ratio · size(ssts[L])?" It picks the longest qualifying prefix, merges them, and inserts the result at the newest position. This is greedy from the top of the tree.

  • Universal asks the same shape of question, but over a contiguous run anywhere in the list, with a minimum run length of 3. It picks the longest run; ties go to the leftmost. The merged run replaces itself in place.

The minimum lengths (≥ 2 prefix for tiered, ≥ 3 run for universal) are deliberate, both to keep work amortised and to make the two policies distinguishable on small inputs. Without them, both would degenerate to "merge whenever you can" and the fixtures wouldn't separate them.

5. Why the Wire Format Looks Like That

Five choices, each with a reason:

  1. Magic "DSEADV21" — eight bytes, no length prefix. Mismatches surface as the first 16 hex chars of the sha256 changing, which is easy to spot.
  2. f64 ratio — encoded via the IEEE 754 bit pattern, not as a string. This is why all three languages route through f64::to_bits, math.Float64bits, and memcpy(&u64, &double, 8). Strings would force a formatter choice ("0.5" vs "0.5000000000000000").
  3. Length-prefixed keys/valuesu32 LE lengths, raw bytes. No terminator, no escaping. Decoding is a one-pass scan.
  4. Entries newest-SST-first — matches the in-memory layout; reversing it in the dump would obscure the actual data structure.
  5. Bloom as raw u64 LE — not a list of positions. The bitmap is the bloom; nothing else needs to be portable.

6. Trade-offs Not Taken

  • We did not implement snapshot reads. Every Get is "as of right now".
  • We did not deduplicate range tombstones across SSTs at merge time. A range that fully contains an older range still leaves both in the merged output. Real engines coalesce; we don't, because the canonical-bytes test would then depend on a chosen normalisation policy.
  • We did not gate compaction on actual work performed; size-tiered may pick a length-2 prefix even when the merge produces zero entries (after tombstones erase everything). That's a feature for the study lab — it exercises the merge code; in production you'd skip empty merges.