Step 03 — Cross-Language Byte Equivalence

Goal

Prove that the Rust, Go, and C++ implementations produce byte-identical canonical wire dumps for three fixed workloads.

Why This Is The Whole Point

API-level test parity is cheap and weak. "Same input → same hash of a canonical binary dump" is strong: any per-language drift (endian, integer width, map-iteration order, float formatting) surfaces as a hash mismatch on the next run.
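A toy illustration of how two of those drift sources are usually closed off, offered as a sketch rather than as the project's code: `encode_counts` is a hypothetical helper, and its byte layout is not the dump format. It iterates a `BTreeMap` (sorted key order, unlike a hash map) and writes each integer as a fixed-width little-endian `u32`.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper, not part of the project: serialize (key, count)
/// pairs so that every run and every platform yields the same bytes.
fn encode_counts(counts: &BTreeMap<String, u32>) -> Vec<u8> {
    let mut out = Vec::new();
    // BTreeMap iterates in ascending key order, so insertion order and
    // hash seeds cannot influence the output.
    for (key, count) in counts {
        out.extend_from_slice(key.as_bytes());
        // Fixed width and explicit little-endian: no usize, no native endian.
        out.extend_from_slice(&count.to_le_bytes());
    }
    out
}
```

Swapping the `BTreeMap` for a `HashMap`, or `to_le_bytes` for `to_ne_bytes`, is exactly the kind of change a byte-level hash catches and an API-level test does not.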

The Format (one canonical source)

See docs/execution.md Section 4. Two-line summary:

  • Magic "DSEADV21" ‖ f64 LE ratio ‖ u32 LE sst_count.
  • Per SST (newest first): bounds (lenpref) ‖ entries (u8 kind + lenpref key + maybe lenpref value) ‖ range tombs ‖ u64 LE bloom bitmap.
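The fixed-size header from the first bullet can be sketched as below. `encode_header` is a hypothetical name chosen for illustration; the per-SST body is left out because this summary does not state the width of the length prefix (docs/execution.md Section 4 remains the authority).

```rust
/// Magic bytes that open every canonical dump, as given in the summary.
const MAGIC: &[u8; 8] = b"DSEADV21";

/// Hypothetical helper: encode the 20-byte header as
/// magic (8 bytes) ‖ ratio (f64, little-endian) ‖ sst_count (u32, little-endian).
fn encode_header(ratio: f64, sst_count: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(20);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&ratio.to_le_bytes());
    out.extend_from_slice(&sst_count.to_le_bytes());
    out
}
```

Under these assumptions the header is always 20 bytes, so the first SST would start at offset 20.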

The Workload (one canonical source)

See docs/execution.md Sections 1-3. Summary:

  • SplitMix64 PRNG, 3 draws per op. The top two bits of the first draw, (r1 >> 62) & 3, map 0 / 1 / 2 / 3 to Put / Put / Delete / RangeTomb. Flush every 8 ops, compact every 16. Ops left over after the last flush are not flushed at the end.
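A minimal sketch of the generator and the op selector follows. The SplitMix64 constants are the standard published ones. Everything else is an assumption made for illustration: that the seed is used directly as the initial state, the names `SplitMix64`, `op_kind`, and `maintenance_counts`, and that flush and compaction fire after every 8th and 16th op counted from 1. How the second and third draws are consumed is not stated in this summary, so the sketch leaves them out.

```rust
/// SplitMix64 with the standard constants.
/// Assumption: the fixture seed is the initial state.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Advance the state and return the next 64-bit draw.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpKind {
    Put,
    Delete,
    RangeTomb,
}

/// Map the top two bits of the first draw to an operation:
/// 0 and 1 are Put, 2 is Delete, 3 is RangeTomb.
fn op_kind(r1: u64) -> OpKind {
    match (r1 >> 62) & 3 {
        0 | 1 => OpKind::Put,
        2 => OpKind::Delete,
        _ => OpKind::RangeTomb,
    }
}

/// Assumed schedule: one flush per full group of 8 ops, one compaction per
/// full group of 16; leftover ops trigger nothing.
/// Returns (flushes, compactions).
fn maintenance_counts(ops: u64) -> (u64, u64) {
    (ops / 8, ops / 16)
}
```

Because Put occupies two of the four selector values, roughly half of all ops are puts, a quarter are deletes, and a quarter are range tombstones.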

The Three Fixtures

Fixture   seed   ops   keys   scenario
A         42     200   32     tieredcompact
B         7      500   64     universalcompact
C         99     300   16     withrange

Hashes are pinned in scripts/cross_test.sh and reproduced in docs/execution.md Section 5 and docs/verification.md Section 3.

Done When

./scripts/cross_test.sh
# ... ends with ...
=== ALL OK ===

If it doesn't, the diff between two implementations' dumps is the debugging artefact. Decode the first 16 bytes (8-byte magic plus 8-byte f64 ratio) to confirm the header, then walk the SSTs one at a time; each SST is self-delimiting.
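Decoding the header by hand can be sketched as follows. `Header` and `parse_header` are hypothetical names, and the layout is the one from the format summary (magic, f64 LE ratio, u32 LE sst_count); this is a debugging aid under those assumptions, not the project's decoder.

```rust
/// Header fields recovered from the start of a dump.
#[derive(Debug, PartialEq)]
struct Header {
    ratio: f64,
    sst_count: u32,
}

/// Hypothetical helper: validate the magic and decode the 20-byte header.
fn parse_header(dump: &[u8]) -> Result<Header, String> {
    if dump.len() < 20 {
        return Err(format!("dump too short: {} bytes", dump.len()));
    }
    if !dump.starts_with(b"DSEADV21") {
        return Err(format!("bad magic: {:?}", &dump[..8]));
    }
    // Copy into fixed-size arrays so the little-endian decode is explicit.
    let mut ratio_bytes = [0u8; 8];
    ratio_bytes.copy_from_slice(&dump[8..16]);
    let mut count_bytes = [0u8; 4];
    count_bytes.copy_from_slice(&dump[16..20]);
    Ok(Header {
        ratio: f64::from_le_bytes(ratio_bytes),
        sst_count: u32::from_le_bytes(count_bytes),
    })
}
```

If two dumps yield the same `Header`, the divergence lies in the SST bodies.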

What To Do When A Hash Drifts

  1. Recapture from Rust. If you intentionally changed semantics, the Rust reference dictates the new canonical hashes; update every place they are pinned: scripts/cross_test.sh, docs/execution.md Section 5, and docs/verification.md Section 3.
  2. Hunt the drift. If you didn't intend to change anything, diff the raw dump_state bytes between the failing pair. The first differing byte tells you where in the format the bug lives. Common culprits: a missing little-endian conversion, a platform-width integer (usize) where the format says u32, or iteration over an unordered hash map.
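Locating the first differing byte can be sketched as below. `first_diff` and `region` are hypothetical helpers; `region` relies on the 20-byte header layout from the format summary and makes no attempt to parse SST bodies.

```rust
/// Return the offset of the first byte at which two dumps differ.
/// If one dump is a strict prefix of the other, the offset is the shorter
/// length. Identical dumps return None.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    if let Some(i) = (0..common).find(|&i| a[i] != b[i]) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}

/// Name the part of the format that a byte offset falls into, assuming the
/// header is magic (8) ‖ f64 ratio (8) ‖ u32 sst_count (4).
fn region(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "ratio (f64 LE)",
        16..=19 => "sst_count (u32 LE)",
        _ => "SST data",
    }
}
```

A difference inside the header usually points at endianness or integer width; a difference in SST data calls for walking the SSTs in order until the offset is reached.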