Step 03 — CLI and Cross-Language Test

Goal

Wrap the MergingIterator in a CLI binary (merge_iter) that reads N SSTables (built by db-06), merges them, and writes the canonical serialized stream to stdout. Then use scripts/cross_test.sh to prove that the three language implementations produce byte-identical output.

CLI spec

merge_iter [--drop-tombstones] IN1.sst IN2.sst ...
  • IN1 is the newest source; INk is the oldest.
  • Reads each input via db-06's SstReader, converts to Vec<(Vec<u8>, Entry)>, feeds all into a MergingIterator, calls SerializeStream, and writes the bytes verbatim to stdout.
  • Exit code: 0 success, 1 input error, 2 usage error.
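
A minimal sketch of the argument handling implied by the spec above, in Rust. The names Config, parse_args, and the EXIT_* constants are hypothetical and not taken from the step's reference code. The sketch also assumes that an unknown flag or an empty input list counts as a usage error, and it accepts --drop-tombstones in any position; the spec only shows the flag before the inputs.

```rust
/// Parsed command line for a hypothetical `merge_iter` sketch.
#[derive(Debug, PartialEq)]
pub struct Config {
    /// True when --drop-tombstones was passed.
    pub drop_tombstones: bool,
    /// Input paths in the order given: IN1 (newest) first, INk (oldest) last.
    pub inputs: Vec<String>,
}

// Exit codes from the CLI spec.
pub const EXIT_OK: i32 = 0;
pub const EXIT_INPUT_ERROR: i32 = 1;
pub const EXIT_USAGE_ERROR: i32 = 2;

/// Parse the arguments that follow the program name.
/// Returns the exit code to use when the command line is malformed.
pub fn parse_args(args: &[String]) -> Result<Config, i32> {
    let mut drop_tombstones = false;
    let mut inputs = Vec::new();

    for arg in args {
        if arg == "--drop-tombstones" {
            drop_tombstones = true;
        } else if arg.starts_with("--") {
            // Assumption: any other flag is a usage error.
            return Err(EXIT_USAGE_ERROR);
        } else {
            inputs.push(arg.clone());
        }
    }

    // Assumption: at least one input SSTable is required.
    if inputs.is_empty() {
        return Err(EXIT_USAGE_ERROR);
    }

    Ok(Config { drop_tombstones, inputs })
}
```

A main function would collect std::env::args().skip(1) into a Vec<String>, call parse_args, and pass any Err code to std::process::exit. Failures while opening or reading an input would map to EXIT_INPUT_ERROR.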

Implementations:

Acceptance

Run the cross-test:

bash scripts/cross_test.sh

It must:

  1. Print match: lines whose SHA-256 digests are identical across all three languages, in both drop=false and drop=true modes.
  2. Confirm via hex spot-check that NEW-10, OLD-50, and val99 are present in the stream.
  3. Confirm the key5 tombstone framing (040000006b65793501) appears with drop=false and is absent with drop=true.
  4. End with CROSS-TEST OK.
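
The tombstone framing in item 3 can be reproduced by hand. The Rust sketch below builds the frame as u32_le(key length), then the key bytes, then a type byte of 1, which is the layout read off the hex string in this step. The helper names tombstone_frame, to_hex, and contains_bytes are hypothetical, and cross_test.sh may perform its spot-checks differently.

```rust
/// Encode a tombstone frame: u32_le(key_len) || key || u8(1).
/// The type byte 1 is inferred from the hex string quoted in the acceptance list.
pub fn tombstone_frame(key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + key.len() + 1);
    out.extend_from_slice(&(key.len() as u32).to_le_bytes());
    out.extend_from_slice(key);
    out.push(1);
    out
}

/// Lowercase hex encoding, two digits per byte.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// True if `needle` occurs as a contiguous run of bytes in `haystack`.
/// An empty needle is treated as "not found" to keep the check meaningful.
pub fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}
```

For the key "key5", to_hex(&tombstone_frame(b"key5")) gives 040000006b65793501, a 9-byte frame. Running contains_bytes over the captured stdout with that frame as the needle is one way to express the present/absent check for the two modes.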

Captured truth (current run):

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

Discussion prompts

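  • Why pipe the binary stream into sha256sum rather than diff the entry list? (A hash over the raw bytes exposes any serialization difference, including framing, length encoding, and ordering, as a single value that is easy to compare across languages. An entry-level diff compares only decoded contents and could miss a framing mismatch.)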
  • The drop=true output is exactly 9 bytes shorter than drop=false. Where do those 9 bytes go? (u32_le(4) + "key5" + u8(1) = 4+4+1 = 9 — one tombstone frame.)
  • If you wanted to add a new entry kind (say, a "merge-add" delta), what would you change in the serialization? (Pick a new type byte (e.g. 2), decide its payload framing, document it, and update all three languages' SerializeStream and CLI in lockstep.)