db-07 Step 1 — K-way merge core
The whole lab is one algorithm. We build it in three languages, then expose it through a tiny CLI.
Cursor
A cursor is an iterator over (key, entry) pairs from one input SSTable, in
key order. The simplest representation: materialize via SstReader::entries()
and index into the resulting vector.
#![allow(unused)] fn main() { struct Cursor { items: Vec<(Vec<u8>, Entry)>, pos: usize, } impl Cursor { fn peek(&self) -> Option<&[u8]> { self.items.get(self.pos).map(|(k,_)| k.as_slice()) } fn take(&mut self) -> (Vec<u8>, Entry) { let i = self.pos; self.pos += 1; std::mem::take_or_clone(&self.items[i]) } } }
type cursor struct {
items []entry // entry = {Key []byte; E sstable.Entry}
pos int
}
func (c *cursor) peek() []byte { if c.pos >= len(c.items) { return nil }; return c.items[c.pos].Key }
func (c *cursor) take() entry { x := c.items[c.pos]; c.pos++; return x }
struct Cursor {
std::vector<std::pair<std::vector<uint8_t>, sstable::Entry>> items;
std::size_t pos = 0;
const std::vector<uint8_t>* peek() const {
return pos < items.size() ? &items[pos].first : nullptr;
}
};
Heap entry
(key bytes, source_index)
Min-heap ordered by key ascending, ties broken by source_index ascending
(smaller index = newer = wins). All three implementations must use this exact
ordering for byte-identity.
Emit loop
Pseudocode is in docs/execution.md. The crucial bit is the inner drain loop:
# After emitting (k, entry):
while heap.peek().key == k:
(_, j) = heap.pop()
cursors[j].take() # discard the older duplicate
if cursors[j].peek() is not None:
heap.push((cursors[j].peek(), j))
That loop is the only difference between "K-way merge of disjoint inputs" and "K-way merge with newest-wins dedupe".
Why the tiebreaker direction matters
Inputs are passed newest first (index 0 newest). On a tie, the smaller index
must come out of the heap first. So when you build a (key, src) tuple, the
smaller src is the smaller tuple, and a min-heap pops it first. No need to
invert; the natural lexicographic order on the tuple does the right thing.
If you ever flip the input convention (oldest first), invert the tiebreaker. Do not do both — pick one and document it. We picked: index 0 = newest.
Try this before reading step 2
Without looking at the implementation, on paper, trace this:
A: [("a",V,"1"), ("c",V,"3")]
B: [("a",V,"old"), ("b",V,"2")]
Write out the heap after each pop. You should get four pops and three emits.
The expected output: [("a",V,"1"), ("b",V,"2"), ("c",V,"3")].