db-10 step 03 — Serialize + CLI + cross-language byte-identity
Goal
Define a canonical wire format for the tree, build a btreectl CLI
that runs a deterministic workload and writes the serialized tree to
stdout, then prove via sha256 that all three implementations produce
identical bytes for two distinct scenarios.
Wire format
Preorder traversal. Per node, in this exact order:
u8 is_leaf (1 = leaf, 0 = internal)
u32 LE nkeys
nkeys * { u32 LE klen, key bytes, u32 LE vlen, val bytes }
if !is_leaf:
(nkeys + 1) * recurse(child_j)
All length prefixes are little-endian (matches every other lab in
the workspace). The empty tree serializes as 01 00 00 00 00 (one
empty leaf).
Deterministic workload
KEY_SPACE = 200
run_workload(scenario, seed, ops):
rng = SplitMix64(seed)
tree = BTree::new()
for _ in 0..ops:
r1 = rng.next_u64()
r2 = rng.next_u64() # ALWAYS draw both
key = (r1 % KEY_SPACE).to_be_bytes() # u64 BE = 8 bytes
val = (r2 as u32).to_be_bytes() # u32 BE = 4 bytes
match scenario:
"inserts" : tree.insert(&key, &val)
"deletes" : if i < ops/2: tree.insert(&key, &val) else: tree.delete(&key)
"mixed" : op = (r1 >> 62) & 0x3
0 | 1 -> insert ; 2 -> delete ; 3 -> skip
return tree
Two PRNG draws per iteration is non-negotiable; if any implementation short-circuits the second draw on a skip branch, the seed → state mapping desyncs.
CLI contract
btreectl --seed N --ops M --scenario {inserts | deletes | mixed}
Writes the canonical wire-format bytes (no trailing newline) to stdout.
Tasks
- Add
serialize_tree(&self) -> Vec<u8>toBTree. Pure function; does not mutate the tree. - Implement the SplitMix64 PRNG with the standard constants
(
0x9E3779B97F4A7C15,0xBF58476D1CE4E7B5,0x94D049BB133111EB). - Implement
run_workloadper the spec above. - Implement
btreectlin Rust, Go, and C++. - Write
scripts/verify.shthat runs unit tests in all three langs. - Write
scripts/cross_test.shthat:- Builds all three binaries.
- Scenario A:
btreectl --seed 42 --ops 500 --scenario inserts→ sha256 all three → assert match. Hash:4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088. - Scenario B:
btreectl --seed 7 --ops 500 --scenario mixed→ sha256 all three → assert match. Hash:9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718. - Spot-check that the stream contains an expected byte sequence (defensive against silent-empty regressions).
- Print
=== ALL OK ===.
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
...
match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== ALL OK ===
A byte-identical hash across three independent implementations for both scenarios is a near-proof that the PRNG, key/value encoding, insert path, delete path, and serialization format are all spec- compliant.
Discussion prompts
- Why must we draw two PRNG outputs per iteration even when the scenario chooses to skip?
- Why is the wire format preorder rather than level-order or in-order? What property does preorder preserve that the others lose?
- If the Scenario-A hash matches but Scenario-B doesn't, what code paths are the prime suspects, and why?
- The sha256s are baked into
cross_test.shas constants. What is the benefit, and what is the maintenance cost when the wire format legitimately evolves?