Step 02 — Merging Iterator

Goal

Implement a streaming K-way merging iterator over N pre-sorted (key, Entry) sources, where source index 0 is newest and ties are won by smaller index. Support an optional drop_tombstones flag.

Spec

API (Rust signature):

#![allow(unused)]
fn main() {
pub struct MergingIterator { /* … */ }

impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self;
}

impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);
    fn next(&mut self) -> Option<Self::Item>;
}
}

Behavior contracts:

  • Each source is sorted strictly ascending by key with no within-source duplicates (caller's responsibility).
  • The merged stream is sorted strictly ascending; each key appears at most once.
  • On tie, source with smaller index wins; older sources are drained — i.e. their copies of that key are advanced past, not yielded.
  • If drop_tombstones=true, winning entries whose type is Tombstone are not yielded; the iterator continues to the next key.
  • The working set is O(K) regardless of N.

Canonical serialization (cross-test contract)

For each yielded (key, entry):

u32_le(len(key)) || key                                          // 4+|key| bytes
u8(type)                                                          // 1 byte; 0=Value, 1=Tombstone
if type == Value:
    u32_le(len(value)) || value                                  // 4+|value| bytes

This is what SerializeStream emits and what the cross-test sha256s.

Acceptance

cd src/rust && cargo test
cd src/go   && go test
cd src/cpp  && ctest --test-dir build

Six (Rust/Go) or seven (C++) merger tests must pass:

  • empty inputs → empty output
  • single source passthrough
  • two-source interleave with no duplicates
  • newest-wins on tie
  • tombstone kept when drop=false
  • tombstone dropped when drop=true
  • (SerializeStream deterministic & expected size)

Discussion prompts

  • Why not nested two-way merges? (Total work would be O(N · K) instead of O(N log K); for K=10 that's 3.3× worse and gets worse with K.)
  • Why is "drain duplicates" eager rather than lazy? (Lazy would force the caller to dedupe, breaking the invariant that the merger's output is the single source of truth for "what's live at this key".)
  • Where in real systems do you find tie-break-by-source-index? (LSM read path, time-series chunk merging, Kafka log compaction, anywhere "newer wins" without explicit timestamps.)