db-07 Step 1 — K-way merge core

The whole lab is one algorithm. We build it in three languages, then expose it through a tiny CLI.

Cursor

A cursor is an iterator over (key, entry) pairs from one input SSTable, in key order. The simplest representation: materialize via SstReader::entries() and index into the resulting vector.

#![allow(unused)]
fn main() {
struct Cursor {
    items: Vec<(Vec<u8>, Entry)>,
    pos: usize,
}
impl Cursor {
    fn peek(&self) -> Option<&[u8]> { self.items.get(self.pos).map(|(k,_)| k.as_slice()) }
    fn take(&mut self) -> (Vec<u8>, Entry) { let i = self.pos; self.pos += 1; std::mem::take_or_clone(&self.items[i]) }
}
}
type cursor struct {
    items []entry // entry = {Key []byte; E sstable.Entry}
    pos   int
}
func (c *cursor) peek() []byte { if c.pos >= len(c.items) { return nil }; return c.items[c.pos].Key }
func (c *cursor) take() entry  { x := c.items[c.pos]; c.pos++; return x }
struct Cursor {
    std::vector<std::pair<std::vector<uint8_t>, sstable::Entry>> items;
    std::size_t pos = 0;
    const std::vector<uint8_t>* peek() const {
        return pos < items.size() ? &items[pos].first : nullptr;
    }
};

Heap entry

(key bytes, source_index)

Min-heap ordered by key ascending, ties broken by source_index ascending (smaller index = newer = wins). All three implementations must use this exact ordering for byte-identity.

Emit loop

Pseudocode is in docs/execution.md. The crucial bit is the inner drain loop:

# After emitting (k, entry):
while heap.peek().key == k:
    (_, j) = heap.pop()
    cursors[j].take()  # discard the older duplicate
    if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))

That loop is the only difference between "K-way merge of disjoint inputs" and "K-way merge with newest-wins dedupe".

Why the tiebreaker direction matters

Inputs are passed newest first (index 0 newest). On a tie, the smaller index must come out of the heap first. So when you build a (key, src) tuple, the smaller src is the smaller tuple, and a min-heap pops it first. No need to invert; the natural lexicographic order on the tuple does the right thing.

If you ever flip the input convention (oldest first), invert the tiebreaker. Do not do both — pick one and document it. We picked: index 0 = newest.

Try this before reading step 2

Without looking at the implementation, on paper, trace this:

A: [("a",V,"1"), ("c",V,"3")]
B: [("a",V,"old"), ("b",V,"2")]

Write out the heap after each pop. You should get four pops and three emits.

The expected output: [("a",V,"1"), ("b",V,"2"), ("c",V,"3")].