db-08 — Execution

Build order

  1. Rust first: drives the canonical data-shape decisions (the Entry enum from db-06, the byte format of SerializeStream).
  2. Go second: ports the same algorithm with native data structures (container/list, container/heap).
  3. C++ third: same algorithm with std::list + std::priority_queue.

After all three pass their own unit tests, we run scripts/cross_test.sh which builds canonical input SSTables via db-05 + db-06 and asserts that all three merge_iter binaries produce the same sha256.

Per-language layout

Rust (src/rust)

  • Cargo.toml pulls in db-06's sstable crate by path = "../../../db-06-sstable-format/src/rust".
  • src/lib.rs defines BlockCache, MergingIterator, SerializeStream, and re-exports sstable::Entry. The cache uses a HashMap of slot indices plus a Vec<Node> arena with embedded prev/next indices and a free-list — an arena-based intrusive list, which beats LinkedList<T> on allocator pressure.
  • src/bin/merge_iter.rs is the cross-test CLI.

Go (src/go)

  • go.mod is module github.com/10xdev/dse/db08 with replace directives pointing to db-05 and db-06 on disk.
  • lru.go uses container/list.List and map[BlockKey]*list.Element.
  • iter.go uses container/heap with a []heapItem backing slice.
  • cmd/merge_iter/main.go is the CLI.

C++ (src/cpp)

  • CMakeLists.txt directly compiles db-06's sstable.cc into our sstable_lib rather than add_subdirectorying db-06 — that would leak db-06's add_test registrations into our ctest.
  • lru.{h,cc} uses std::list<Node> + std::unordered_map; Get uses list_.splice(begin, list_, it->second) for O(1) MRU promotion.
  • iter.{h,cc} uses std::priority_queue<HeapEntry, std::vector, Greater>.
  • src/merge_iter_bin.cc is the CLI.

Verification

  • scripts/verify.sh builds + tests all three languages.
  • scripts/cross_test.sh builds db-05 + db-06 input pipelines, generates the same newer.sst / older.sst used by db-07, runs merge_iter in all three languages, and asserts sha256 byte-identity for both drop_tombstones=false and drop_tombstones=true. It also spot-checks that "NEW-10", "OLD-50", "val99" appear in the stream and that the key5 tombstone framing (040000006b65793501) appears exactly when expected.

Reproducible cross-test sha256 (this lab's truth)

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

The 9-byte difference is exactly the framing of one tombstone entry: u32_le(4) + "key5" + u8(1) = 4 + 4 + 1 = 9 bytes.

What you should be able to do after this lab

  • Sketch an LRU on a whiteboard in under three minutes and explain why both the hash map and the list are necessary.
  • Explain why a K-way merge uses a heap and not nested merges, and quote the O(N log K) comparison bound.
  • Identify, in any storage codebase, where the "newest-wins on tie" rule is enforced and where the "drain duplicates" step happens.
  • Argue when it is safe to drop a tombstone during iteration vs when it is not.