Step 03 — Cross-Language Byte Equivalence

Goal

Prove that the Rust, Go, and C++ implementations produce byte-identical canonical wire dumps for three fixed workloads.

API-level test parity is cheap and weak. "Same input → same hash of a canonical binary dump" is strong: any per-language drift (endianness, integer width, map-iteration order, float formatting) surfaces as a hash mismatch on the next run.

The Format (one canonical source)

See docs/execution.md Section 4. Two-line summary:

Magic "DSEADV21" ‖ f64 LE ratio ‖ u32 LE sst_count.
Per SST (newest first): length-prefixed bounds ‖ entries (u8 kind + length-prefixed key + optional length-prefixed value) ‖ range tombstones ‖ u64 LE bloom bitmap.
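
As a rough illustration of the first summary line, the sketch below writes the fixed-size header in little-endian order. The helper name `encode_header` is hypothetical and not taken from the repository; docs/execution.md Section 4 remains the authoritative layout.

```rust
/// Magic bytes that open every dump, as given in the format summary.
const MAGIC: &[u8; 8] = b"DSEADV21";

/// Hypothetical helper: serialise the dump header as
/// magic ‖ f64 LE ratio ‖ u32 LE sst_count (8 + 8 + 4 = 20 bytes).
fn encode_header(ratio: f64, sst_count: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(20);
    out.extend_from_slice(MAGIC);
    // to_le_bytes pins the byte order regardless of host endianness.
    out.extend_from_slice(&ratio.to_le_bytes());
    // A fixed-width u32 keeps the field at 4 bytes on every platform.
    out.extend_from_slice(&sst_count.to_le_bytes());
    out
}
```

Using explicit `to_le_bytes` on fixed-width types is one way to rule out two of the drift sources named above (endianness and integer width) at the call site.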

The Workload (one canonical source)

See docs/execution.md Sections 1-3. Two-line summary:

SplitMix64 PRNG, 3 draws per op. (r1 >> 62) & 3 chooses the op: 0 and 1 are Put, 2 is Delete, 3 is RangeTomb. Flush every 8 ops, compact every 16. Ops left over after the final 8-op boundary are not flushed at the end.
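
The op selection and flush/compact cadence can be sketched as follows. This is a minimal illustration, not the repository's code: the SplitMix64 constants are the commonly published ones, while seeding the state directly from the fixture seed, the names `choose_op` and `schedule`, and flushing immediately after every 8th op are assumptions. How the second and third draws are consumed is defined in docs/execution.md and is left out here.

```rust
/// SplitMix64 with the commonly published constants.
/// Assumption: the fixture seed is used directly as the initial state.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum OpKind {
    Put,
    Delete,
    RangeTomb,
}

/// The top two bits of the first draw choose the op:
/// 0 and 1 are Put, 2 is Delete, 3 is RangeTomb.
fn choose_op(r1: u64) -> OpKind {
    match (r1 >> 62) & 3 {
        0 | 1 => OpKind::Put,
        2 => OpKind::Delete,
        _ => OpKind::RangeTomb,
    }
}

/// Hypothetical driver: returns the op kinds plus how many flushes and
/// compactions a run of `ops` operations would trigger.
fn schedule(seed: u64, ops: u64) -> (Vec<OpKind>, u64, u64) {
    let mut rng = SplitMix64::new(seed);
    let mut kinds = Vec::new();
    let (mut flushes, mut compactions) = (0u64, 0u64);
    for i in 1..=ops {
        let r1 = rng.next_u64();
        // Draws 2 and 3 are consumed to keep the stream aligned;
        // their meaning is not modelled in this sketch.
        let _r2 = rng.next_u64();
        let _r3 = rng.next_u64();
        kinds.push(choose_op(r1));
        if i % 8 == 0 {
            flushes += 1;
        }
        if i % 16 == 0 {
            compactions += 1;
        }
    }
    (kinds, flushes, compactions)
}
```

Under these assumptions a 200-op run triggers 25 flushes and 12 compactions, and ops past the last multiple of 8 stay unflushed.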

The Three Fixtures

Fixture	seed	ops	keys	scenario
A	42	200	32	`tieredcompact`
B	7	500	64	`universalcompact`
C	99	300	16	`withrange`

Hashes are pinned in scripts/cross_test.sh and reproduced in docs/execution.md Section 5 and docs/verification.md Section 3.

Done When

./scripts/cross_test.sh
# ... ends with ...
=== ALL OK ===

If the run does not end that way, the diff between two implementations' dumps is the debugging artefact. Decode the first 16 bytes to confirm the 8-byte magic and the 8-byte f64 ratio, then walk the SSTs one at a time; each SST is self-delimiting.
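
A small sketch of that first decoding step, assuming the header layout from the format summary (8-byte magic followed by an f64 LE ratio). The helper name `peek_header` is hypothetical.

```rust
/// Hypothetical helper: split the first 16 bytes of a dump into the
/// magic and the ratio. Returns None if the dump is shorter than 16 bytes.
fn peek_header(dump: &[u8]) -> Option<([u8; 8], f64)> {
    let mut magic = [0u8; 8];
    magic.copy_from_slice(dump.get(0..8)?);

    let mut ratio_bytes = [0u8; 8];
    ratio_bytes.copy_from_slice(dump.get(8..16)?);

    // from_le_bytes mirrors the little-endian encoding of the ratio.
    Some((magic, f64::from_le_bytes(ratio_bytes)))
}
```

If the magic is not `DSEADV21` or the ratio looks wrong in one dump but not the other, the drift is in the header and there is no need to walk the SSTs yet.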

What To Do When A Hash Drifts

Recapture from Rust. If you intentionally changed semantics, the Rust reference dictates the new canonical hashes; update every place they are pinned: scripts/cross_test.sh, docs/execution.md Section 5, and docs/verification.md Section 3.
Hunt the drift. If you didn't intend to change anything, diff the raw dump_state bytes between the failing pair. The first differing byte tells you where in the format the bug lives. Common culprits: forgot LE, used usize instead of u32, iterated a hash map.
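
Locating the first differing byte can be sketched as below. Both function names are hypothetical, and the region boundaries assume the 20-byte header from the format summary (magic, f64 LE ratio, u32 LE sst_count); anything past that is reported only as "SST body".

```rust
/// Index of the first byte where two dumps disagree, or None if they are
/// identical. A pure length mismatch is reported at the shorter length.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Map a byte offset onto the header fields from the format summary.
fn region(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "ratio (f64 LE)",
        16..=19 => "sst_count (u32 LE)",
        _ => "SST body",
    }
}
```

For example, a mismatch at offset 16 would point at the SST count, which is where writing a `usize` instead of a `u32` would first show up on a 64-bit host.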

Distributed Systems Engineer — Build Databases & Consensus From Scratch