db-08 — Analysis

Problem statement

We need two read-path primitives that the rest of the LSM stack assumes exist:

A bounded in-memory cache that lets us amortize the cost of decoding SSTable blocks across many lookups, with predictable memory usage and O(1) operations.
A streaming K-way merging iterator that exposes N pre-sorted sources as a single sorted, deduplicated stream — newest-wins on tie — without buffering all entries in memory.

Both must be small, dependency-light, and byte-deterministic when serialized (so the cross-language cross-test can detect any divergence).

Determinism. Given identical inputs, the merge stream's serialized bytes must be identical across Rust, Go, and C++. This is the cross-test's only gate.
Bounded memory. The cache must cap at a user-supplied entry count; the iterator must use O(K) working set regardless of the number of entries.
No backtracking. The iterator is streaming: it must work on inputs that arrive lazily.
Newest-wins is strict. Source index 0 always wins. There are no timestamps, generations, or sequence numbers — that complexity is deferred to db-13 (MVCC).

Cache eviction policy: LRU. Simple, predictable, well-understood. Not the best policy for all workloads (LRU-K, ARC, and CLOCK-Pro all beat it on scan-heavy workloads), but the correct teaching baseline.
Cache capacity unit: entries. Production systems bound by bytes; we use entries to keep the data structure (rather than the accounting) the focus.
Heap element shape: (key, source_index). Small and cheap to compare. Pulling the full entry into the heap would inflate comparator cost and force copies.
No timestamps / sequence numbers. Newest-wins is by source index alone.
Tombstone drop is opt-in. Callers pass drop_tombstones=true only when they have proven (via compaction rules — see db-07) that no older source could resurrect the deleted key.

Choice	Pros	Cons
LRU (vs LRU-K, ARC, CLOCK)	O(1) ops, simple to reason	Scan-pollutes — one big scan can flush hot entries
Doubly-linked list (vs VecDeque)	O(1) arbitrary removal	Heavier per-node memory (two pointers)
Heap of `(key, src)` (vs entry)	Cheap compares, no copies	Indirection back to source vector on every pop
Entry-bounded cap (vs byte)	Simple, no per-block sizing	Memory usage depends on block-size distribution
Drain-on-tie eagerly	Caller never sees duplicates	Slight extra work even when caller would dedupe

Heap ordering bug on tie. If the (key, src) comparator forgets to break ties on src ascending, the merger silently emits the older value on key collisions. The "newest-wins" test catches this on a 2-entry input.
Cache eviction at boundary. Inserting into a full cache and then immediately calling Get on the just-evicted key must miss, not hit.
Iterator reentrancy. Calling Next after end-of-stream must keep returning end-of-stream, not panic.
Cross-language drift on serialization. Endianness or length-prefix width mismatches would invalidate the sha256. We pin to "u32 LE length + bytes + u8 type [+ u32 LE val_len + val]".