db-08 — Analysis
Problem statement
We need two read-path primitives that the rest of the LSM stack assumes exist:
- A bounded in-memory cache that lets us amortize the cost of decoding SSTable blocks across many lookups, with predictable memory usage and O(1) operations.
- A streaming K-way merging iterator that exposes N pre-sorted sources as a single sorted, deduplicated stream — newest-wins on tie — without buffering all entries in memory.
Both must be small, dependency-light, and byte-deterministic when serialized (so the cross-language cross-test can detect any divergence).
Constraints
- Determinism. Given identical inputs, the merge stream's serialized bytes must be identical across Rust, Go, and C++. This is the cross-test's only gate.
- Bounded memory. The cache must cap at a user-supplied entry count; the
iterator must use
O(K)working set regardless of the number of entries. - No backtracking. The iterator is streaming: it must work on inputs that arrive lazily.
- Newest-wins is strict. Source index 0 always wins. There are no timestamps, generations, or sequence numbers — that complexity is deferred to db-13 (MVCC).
Decisions
- Cache eviction policy: LRU. Simple, predictable, well-understood. Not the best policy for all workloads (LRU-K, ARC, and CLOCK-Pro all beat it on scan-heavy workloads), but the correct teaching baseline.
- Cache capacity unit: entries. Production systems bound by bytes; we use entries to keep the data structure (rather than the accounting) the focus.
- Heap element shape:
(key, source_index). Small and cheap to compare. Pulling the full entry into the heap would inflate comparator cost and force copies. - No timestamps / sequence numbers. Newest-wins is by source index alone.
- Tombstone drop is opt-in. Callers pass
drop_tombstones=trueonly when they have proven (via compaction rules — see db-07) that no older source could resurrect the deleted key.
Trade-offs
| Choice | Pros | Cons |
|---|---|---|
| LRU (vs LRU-K, ARC, CLOCK) | O(1) ops, simple to reason | Scan-pollutes — one big scan can flush hot entries |
| Doubly-linked list (vs VecDeque) | O(1) arbitrary removal | Heavier per-node memory (two pointers) |
Heap of (key, src) (vs entry) | Cheap compares, no copies | Indirection back to source vector on every pop |
| Entry-bounded cap (vs byte) | Simple, no per-block sizing | Memory usage depends on block-size distribution |
| Drain-on-tie eagerly | Caller never sees duplicates | Slight extra work even when caller would dedupe |
Risks
- Heap ordering bug on tie. If the
(key, src)comparator forgets to break ties onsrc ascending, the merger silently emits the older value on key collisions. The "newest-wins" test catches this on a 2-entry input. - Cache eviction at boundary. Inserting into a full cache and then
immediately calling
Geton the just-evicted key must miss, not hit. - Iterator reentrancy. Calling
Nextafter end-of-stream must keep returning end-of-stream, not panic. - Cross-language drift on serialization. Endianness or length-prefix width mismatches would invalidate the sha256. We pin to "u32 LE length + bytes + u8 type [+ u32 LE val_len + val]".
Out of scope
- Compression (RocksDB caches decompressed blocks; some configs cache both).
- Pin/unpin handle protocol for zero-copy reads.
- Snapshot/sequence-number-aware iteration (deferred to db-13).
- Range deletes / range tombstones (deferred to db-21).
- Block-cache statistics beyond hit/miss/evict.