db-07 Step 3 — CLI and cross-language byte-identity

CLI shape (identical in all three languages)

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

Arguments:

  • --drop-tombstones: optional flag; when present it must come first. Tombstones are dropped from the output (use only when this is the bottom-level compaction).
  • OUT.sst: output file path.
  • IN1.sst ...: one or more input SSTable paths, ordered newest first (IN1 is the newest).

Exit codes:

  • 0: success.
  • 1: any runtime error (open failure, malformed SSTable, write failure).
  • 2: usage error.
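
As an illustration only, the argument shape and exit codes above could be handled roughly as follows. This is a minimal sketch, not the project's actual code: CompactArgs, parse_args, run, and the EXIT_* constants are hypothetical names, and the compact callback stands in for the real merge. The source does not say how unknown flags are treated, so the sketch simply treats every other argument as a path.

```rust
/// Parsed command line for `compact` (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct CompactArgs {
    pub drop_tombstones: bool,
    pub out: String,
    /// Input paths, newest first.
    pub inputs: Vec<String>,
}

// Exit codes as listed above.
pub const EXIT_OK: i32 = 0;
pub const EXIT_ERROR: i32 = 1;
pub const EXIT_USAGE: i32 = 2;

/// Parses the arguments that follow the program name.
/// Returns None on a usage error.
pub fn parse_args(args: &[String]) -> Option<CompactArgs> {
    // The flag is only recognised in first position.
    let (drop_tombstones, rest) = match args.first().map(String::as_str) {
        Some("--drop-tombstones") => (true, &args[1..]),
        _ => (false, args),
    };
    // OUT.sst plus at least one input is required.
    if rest.len() < 2 {
        return None;
    }
    Some(CompactArgs {
        drop_tombstones,
        out: rest[0].clone(),
        inputs: rest[1..].to_vec(),
    })
}

/// Maps the outcome to an exit code. `compact` is a stand-in for the
/// real merge and is an assumption of this sketch.
pub fn run(args: &[String], compact: impl Fn(&CompactArgs) -> Result<(), String>) -> i32 {
    match parse_args(args) {
        None => {
            eprintln!("usage: compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...");
            EXIT_USAGE
        }
        Some(parsed) => match compact(&parsed) {
            Ok(()) => EXIT_OK,
            Err(e) => {
                eprintln!("error: {e}");
                EXIT_ERROR
            }
        },
    }
}
```

A main function would collect std::env::args().skip(1) into a Vec<String> and pass the result of run to std::process::exit.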

The CLI is intentionally minimal: no JSON, no stats, no progress output. Stats live in db-22 (performance + benchmarking).

The cross-test scenario

The script at scripts/cross_test.sh does the following:

  1. Builds feed_newer.mt (memtable scenario from observation.md, 50 keys with key10 replaced and key5 deleted).
  2. Builds feed_older.mt (100 keys with key50 = "OLD-50").
  3. Promotes both to SSTables using the db-06 Rust binary (sstable build feed_newer.mt newer.sst).
  4. For each language, runs compact OUT.sst newer.sst older.sst.
  5. Asserts sha256(rust.OUT) == sha256(go.OUT) == sha256(cpp.OUT).
  6. Runs the 3×3 read matrix using db-06's sstable iter over each OUT.
  7. Spot-checks sstable get OUT.sst <key> for key5, key10, key50, key99, nope.
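
The byte-identity assertion in step 5 can also be expressed without hashing. The sketch below is an assumption-laden stand-in, not what cross_test.sh does: it compares raw file bytes directly rather than computing sha256 (the Rust standard library has no SHA-256), which checks the same property for small test files. The function name all_identical is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) when every file has exactly the same bytes as the
/// first one. An empty list counts as identical.
pub fn all_identical<P: AsRef<Path>>(paths: &[P]) -> io::Result<bool> {
    let mut iter = paths.iter();
    let first = match iter.next() {
        Some(p) => fs::read(p)?,
        None => return Ok(true),
    };
    for p in iter {
        // Whole-file comparison; fine for test-sized SSTables.
        if fs::read(p)? != first {
            return Ok(false);
        }
    }
    Ok(true)
}
```

Called with the three per-language output paths, it returns true exactly when the sha256 digests in step 5 would match, barring hash collisions.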

The read matrix and the spot-checks use db-06's sstable CLI (not db-07's compact), which is why steps 6–7 don't need a separate db-07 reader: the output of compact is an ordinary db-06 SSTable.

Why this proves the merge

The inputs are two SSTables with overlapping keys: where a key appears in both, the newer version must win, and keys that appear only in the older table (key50) must pass through unchanged. If your merge logic gets the recency tiebreaker wrong, you read val10 instead of NEW-10. If you forget to drain duplicates, you write the same key twice and SstWriter::add fails. If you drop tombstones by mistake, key5 disappears from the output instead of surviving as a tombstone.
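
The three failure modes can be seen in a small in-memory sketch of a merge. This is one possible shape under stated assumptions, not the db-07 implementation: entries are plain (key, Option<value>) pairs with None standing for a tombstone, inputs are already sorted and passed newest first, and Entry and merge are hypothetical names. The real code reads and writes SSTables through the db-06 format.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One entry: a key plus Some(value), or None for a tombstone.
pub type Entry = (String, Option<String>);

/// Merges sorted inputs given newest first (inputs[0] wins ties).
pub fn merge(inputs: &[Vec<Entry>], drop_tombstones: bool) -> Vec<Entry> {
    // Heap items are (key, input index, position). Reverse turns the
    // max-heap into a min-heap, so equal keys pop in input-index order,
    // which is the recency tiebreaker.
    let mut heap = BinaryHeap::new();
    for (src, input) in inputs.iter().enumerate() {
        if let Some((key, _)) = input.first() {
            heap.push(Reverse((key.clone(), src, 0usize)));
        }
    }

    let mut out: Vec<Entry> = Vec::new();
    let mut last_key: Option<String> = None;

    while let Some(Reverse((key, src, pos))) = heap.pop() {
        // Advance the input we just consumed from.
        if let Some((next_key, _)) = inputs[src].get(pos + 1) {
            heap.push(Reverse((next_key.clone(), src, pos + 1)));
        }
        // Drain duplicates: a newer version of this key was already seen.
        if last_key.as_deref() == Some(key.as_str()) {
            continue;
        }
        let value = inputs[src][pos].1.clone();
        // Remember the key even if its tombstone is dropped below, so
        // older versions stay shadowed.
        last_key = Some(key.clone());
        if value.is_none() && drop_tombstones {
            continue;
        }
        out.push((key, value));
    }
    out
}
```

With a newer input holding key10 = NEW-10 and a key5 tombstone, and an older input holding older values for key10 and key5 plus key50 = OLD-50, the sketch emits key10 = NEW-10, the key5 tombstone, and key50 = OLD-50. With drop_tombstones set, only key10 and key50 remain.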

If all three languages produce the same sha256, both the algorithm and its translation to three runtimes are pinned down.