Analysis — db-21 Advanced Storage Engine
1. Problem Statement
Two engineering questions, studied in isolation:
- Range deletes. How does an LSM delete a key range
[a, b)cheaply, without writing one Delete per key, and without losing correctness if a newer Put falls inside the same range? - Compaction policy. How do size-tiered and Universal compaction actually differ — not as marketing words, but as deterministic functions over the current SST sequence?
The lab refuses to answer these with prose alone. It demands an executable specification that three language ports must agree on byte-for-byte.
2. Why Three Languages
Cross-language byte agreement is the cheapest sanity check that survives
refactoring. If Rust drifts from Go on fixture A, the failure tells you
exactly which side broke: the diff between the two dump_state() blobs is
a structured binary, decodable by eye.
It also forces the design through three different idiom sets:
- Rust keeps
Option<Vec<u8>>forGet,enum Entry { Put, Delete }for the entry kind, and usesVec<u8>everywhere for keys/values. Slices for bounds; no copies inmerge_run's hot path. - Go uses
[]byteplusbytes.Compare. Amap[string]struct{}stands in for the dedupe set.math.Float64bitsfor the ratio encoding. - C++ uses
std::stringas a byte container (avoids thechar_traitstrap),std::optional<std::string>forGet, and an inline 64-line SHA-256 inlsmctl.ccto keep the dependency surface at zero.
If you can read the same algorithm in all three and they line up at the byte level, the algorithm is unambiguous. That's the deliverable.
3. Range Tombstones — The Subtlety
The single non-obvious rule is:
A range tombstone hides keys older than it, but is itself hidden by Puts newer than it.
Both halves matter. Test range_tomb_respects_newer_put exists because a
naive implementation that consults all tombstones before walking entries
will silently drop the fresh value.
The implementation enforces this by walking SSTs newest → oldest and accumulating active tombstones as the walk descends. A Put in a newer SST is checked against the (then still empty) active set, so it survives. A Put in an older SST is checked against the (by then populated) active set, so it is hidden.
This also explains why a bloom miss must continue instead of return None:
the SST we just skipped might have zero matching keys, but it could still
contribute a range tombstone that shadows something below it. The active
set must be allowed to grow.
4. Size-Tiered vs Universal — The Real Difference
Both are "merge several SSTs into one". The difference is which several.
-
Size-tiered asks: "is there a prefix
[0..L)of new, small SSTs that together fit withinratio · size(ssts[L])?" It picks the longest qualifying prefix, merges them, and inserts the result at the newest position. This is greedy from the top of the tree. -
Universal asks the same shape of question, but over a contiguous run anywhere in the list, with a minimum run length of 3. It picks the longest run; ties go to the leftmost. The merged run replaces itself in place.
The minimum lengths (≥ 2 prefix for tiered, ≥ 3 run for universal) are deliberate, both to keep work amortised and to make the two policies distinguishable on small inputs. Without them, both would degenerate to "merge whenever you can" and the fixtures wouldn't separate them.
5. Why the Wire Format Looks Like That
Five choices, each with a reason:
- Magic
"DSEADV21"— eight bytes, no length prefix. Mismatches surface as the first 16 hex chars of the sha256 changing, which is easy to spot. f64ratio — encoded via the IEEE 754 bit pattern, not as a string. This is why all three languages route throughf64::to_bits,math.Float64bits, andmemcpy(&u64, &double, 8). Strings would force a formatter choice ("0.5"vs"0.5000000000000000").- Length-prefixed keys/values —
u32 LElengths, raw bytes. No terminator, no escaping. Decoding is a one-pass scan. - Entries newest-SST-first — matches the in-memory layout; reversing it in the dump would obscure the actual data structure.
- Bloom as raw
u64 LE— not a list of positions. The bitmap is the bloom; nothing else needs to be portable.
6. Trade-offs Not Taken
- We did not implement snapshot reads. Every Get is "as of right now".
- We did not deduplicate range tombstones across SSTs at merge time. A range that fully contains an older range still leaves both in the merged output. Real engines coalesce; we don't, because the canonical-bytes test would then depend on a chosen normalisation policy.
- We did not gate compaction on actual work performed; size-tiered may pick a length-2 prefix even when the merge produces zero entries (after tombstones erase everything). That's a feature for the study lab — it exercises the merge code; in production you'd skip empty merges.