db-10 step 03 — Serialize + CLI + cross-language byte-identity

Goal

Define a canonical wire format for the tree, build a btreectl CLI that runs a deterministic workload and writes the serialized tree to stdout, then prove via sha256 that all three implementations produce identical bytes for two distinct scenarios.

Wire format

Preorder traversal. Per node, in this exact order:

u8        is_leaf                  (1 = leaf, 0 = internal)
u32 LE    nkeys
nkeys *   { u32 LE klen, key bytes, u32 LE vlen, val bytes }
if !is_leaf:
    (nkeys + 1) * recurse(child_j)

All length prefixes are little-endian (matches every other lab in the workspace). The empty tree serializes as 01 00 00 00 00 (one empty leaf).

Deterministic workload

KEY_SPACE = 200

run_workload(scenario, seed, ops):
    rng = SplitMix64(seed)
    tree = BTree::new()
    for _ in 0..ops:
        r1 = rng.next_u64()
        r2 = rng.next_u64()                 # ALWAYS draw both
        key = (r1 % KEY_SPACE).to_be_bytes()  # u64 BE = 8 bytes
        val = (r2 as u32).to_be_bytes()       # u32 BE = 4 bytes
        match scenario:
            "inserts" : tree.insert(&key, &val)
            "deletes" : if i < ops/2: tree.insert(&key, &val) else: tree.delete(&key)
            "mixed"   : op = (r1 >> 62) & 0x3
                        0 | 1 -> insert ; 2 -> delete ; 3 -> skip
    return tree

Two PRNG draws per iteration is non-negotiable; if any implementation short-circuits the second draw on a skip branch, the seed → state mapping desyncs.

CLI contract

btreectl --seed N --ops M --scenario {inserts | deletes | mixed}

Writes the canonical wire-format bytes (no trailing newline) to stdout.

Tasks

  1. Add serialize_tree(&self) -> Vec<u8> to BTree. Pure function; does not mutate the tree.
  2. Implement the SplitMix64 PRNG with the standard constants (0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB).
  3. Implement run_workload per the spec above.
  4. Implement btreectl in Rust, Go, and C++.
  5. Write scripts/verify.sh that runs unit tests in all three langs.
  6. Write scripts/cross_test.sh that:
    1. Builds all three binaries.
    2. Scenario A: btreectl --seed 42 --ops 500 --scenario inserts → sha256 all three → assert match. Hash: 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088.
    3. Scenario B: btreectl --seed 7 --ops 500 --scenario mixed → sha256 all three → assert match. Hash: 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718.
    4. Spot-check that the stream contains an expected byte sequence (defensive against silent-empty regressions).
    5. Print === ALL OK ===.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
  match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== ALL OK ===

A byte-identical hash across three independent implementations for both scenarios is a near-proof that the PRNG, key/value encoding, insert path, delete path, and serialization format are all spec- compliant.

Discussion prompts

  • Why must we draw two PRNG outputs per iteration even when the scenario chooses to skip?
  • Why is the wire format preorder rather than level-order or in-order? What property does preorder preserve that the others lose?
  • If the Scenario-A hash matches but Scenario-B doesn't, what code paths are the prime suspects, and why?
  • The sha256s are baked into cross_test.sh as constants. What is the benefit, and what is the maintenance cost when the wire format legitimately evolves?