db-18 — Observation

Expected canonical hashes

Six configurations are pinned in scripts/cross_test.sh. The lab is green iff all three binaries (Rust release, Go release, C++ Release) emit exactly these strings on stdout (no trailing newline):

Name	Flags	SHA-256 of canonical dump
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	`0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	`3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	`f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	`e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	`674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	`7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5`

These hashes are the contract. If a change to production code alters any of these strings, you have changed semantics; reverify end-to-end before you ship.

Walking the wire: scenario D byte-by-byte

Scenario D is the shortest possible dump (one node, five proposals, all decided locally). Use it as a Rosetta Stone before debugging the multi-node hashes. The layout is magic || u32 node_count || node[], and the node payload starts at offset 12.

00..07  4453 4550 4158 3031     "DSEPAX01"            magic
08..0b  01 00 00 00             node_count = 1
0c..0f  00 00 00 00             node.id = 0
10..13  rr rr 00 00             node.promised_ballot.round       (round it won at)
14..17  00 00 00 00             node.promised_ballot.proposer_id (= self.id = 0)
18      02                      node.role = Leader (2)
19..1c  rr rr 00 00             node.my_ballot.round
1d..20  00 00 00 00             node.my_ballot.proposer_id
21..24  05 00 00 00             accept_count = 5
... 5 × {u64 slot, u32 ab.round, u32 ab.proposer_id, u32 value_len, value bytes}
... then u32 learned_count = 5 and 5 × {u64 slot, u32 value_len, value bytes}
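To read a raw dump without counting offsets by hand, the layout above can be turned into a small decoder. The following is a minimal sketch, not the lab's own code: the type and function names (Ballot, Accept, Node, Reader, decode) are hypothetical, and it assumes every integer is little-endian, as the `01 00 00 00` encoding of node_count = 1 suggests.

```rust
use std::convert::TryInto;

// Hypothetical types mirroring the documented layout; names are illustrative.
#[derive(Debug, PartialEq)]
struct Ballot { round: u32, proposer_id: u32 }

#[derive(Debug, PartialEq)]
struct Accept { slot: u64, ballot: Ballot, value: Vec<u8> }

#[derive(Debug, PartialEq)]
struct Node {
    id: u32,
    promised: Ballot,
    role: u8, // 0 = Follower, 1 = Candidate, 2 = Leader
    my_ballot: Ballot,
    accepts: Vec<Accept>,
    learned: Vec<(u64, Vec<u8>)>,
}

// Cursor over the dump; every read is bounds-checked.
struct Reader<'a> { buf: &'a [u8], pos: usize }

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let end = self.pos.checked_add(n).ok_or("offset overflow")?;
        let s = self.buf.get(self.pos..end)
            .ok_or_else(|| format!("truncated at offset {}", self.pos))?;
        self.pos = end;
        Ok(s)
    }
    fn u8(&mut self) -> Result<u8, String> { Ok(self.take(1)?[0]) }
    // Assumption: integers are little-endian.
    fn u32(&mut self) -> Result<u32, String> {
        Ok(u32::from_le_bytes(self.take(4)?.try_into().unwrap()))
    }
    fn u64(&mut self) -> Result<u64, String> {
        Ok(u64::from_le_bytes(self.take(8)?.try_into().unwrap()))
    }
    fn ballot(&mut self) -> Result<Ballot, String> {
        Ok(Ballot { round: self.u32()?, proposer_id: self.u32()? })
    }
    // u32 length prefix followed by that many value bytes.
    fn bytes(&mut self) -> Result<Vec<u8>, String> {
        let len = self.u32()? as usize;
        Ok(self.take(len)?.to_vec())
    }
}

// Decodes magic || u32 node_count || node[] into structured nodes.
fn decode(buf: &[u8]) -> Result<Vec<Node>, String> {
    let mut r = Reader { buf, pos: 0 };
    if r.take(8)? != &b"DSEPAX01"[..] {
        return Err("bad magic".into());
    }
    let node_count = r.u32()?;
    let mut nodes = Vec::new();
    for _ in 0..node_count {
        let id = r.u32()?;
        let promised = r.ballot()?;
        let role = r.u8()?;
        let my_ballot = r.ballot()?;
        let mut accepts = Vec::new();
        for _ in 0..r.u32()? {
            accepts.push(Accept { slot: r.u64()?, ballot: r.ballot()?, value: r.bytes()? });
        }
        let mut learned = Vec::new();
        for _ in 0..r.u32()? {
            learned.push((r.u64()?, r.bytes()?));
        }
        nodes.push(Node { id, promised, role, my_ballot, accepts, learned });
    }
    if r.pos != buf.len() {
        return Err(format!("{} trailing bytes", buf.len() - r.pos));
    }
    Ok(nodes)
}
```

Printing the result of decode with {:#?} for each implementation's dump gives a field-level view that is easier to compare than raw hex.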

Run:

src/rust/target/release/paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5
# e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139

To dump the raw bytes instead of the hash, patch the binary locally to print canonical_dump rather than sha256_hex(&canonical_dump). Do not commit that patch: the canonical CLI output is the SHA-256.

Walking the wire: scenario C (no proposals)

Scenario C runs three nodes for 500 ticks with --proposals 0. Exactly one node is elected leader and nothing is decided, so every node has accept_count == 0 and learned_count == 0. With an iteration-order bug, the bytes that differ between languages are the per-node promised_ballot.round values, because the elected leader's round depends on whether another proposer came close to winning first. If C is the failing scenario, you have an election-timer determinism bug, not a Phase-2 bug.

Divergence runbook

If cross_test.sh prints MISMATCH scenario X, follow this script:

# 1. Capture the raw bytes from each binary. Patch paxosctl locally
#    to print `canonical_dump` raw instead of sha256 hex, run once,
#    then revert the patch. Save to rust.bin, go.bin, cpp.bin.

cmp -l rust.bin go.bin | head
cmp -l rust.bin cpp.bin | head

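cmp -l prints one line per differing byte: the byte number in decimal, counted from 1, followed by the two byte values in octal. Subtract 1 from the first byte number to get the 0-based offset, then map it to the field it belongs to. Node records are variable-length (the accept and learned sections vary), so for k > 0 read 12 + k*node_size as 12 plus the encoded sizes of nodes 0..k-1: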

Offset	Field	Likely culprit
0..7	magic `"DSEPAX01"`	wrong magic literal
8..11	`node_count`	wrong `u32_le` writer, wrong endianness
12 + k*node_size + 0..3	`node.id`	iterating nodes in wrong order (not ascending id)
12 + k*node_size + 4..11	`promised_ballot`	election-timer drift or wrong PRNG seed mix
12 + k*node_size + 12	`role` (1 byte)	enum reordered (must be Follower=0, Candidate=1, Leader=2)
12 + k*node_size + 13..20	`my_ballot`	step-down logic differs (e.g., resetting `my_ballot` to zero or not)
12 + k*node_size + 21..24	`accept_count`	one acceptor accepted a slot the others did not — Phase-2 message ordering bug
inside an accept tuple	`slot`	accepts iterated in receive order, not sorted by slot
inside an accept tuple	`accepted_ballot`	Phase-1 recovery used a wrong rule (e.g., last-write-wins instead of highest-ballot)
inside an accept tuple	`value_len` / `value`	wrong proposal scheduled at this slot — proposal-injection rule or leader-pick rule differs
inside the learned section	`slot` / `value`	the difference is downstream of an accept-section difference; fix that first
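The offset lookup for the fixed-size prefix can also be done in code. This is a rough sketch with hypothetical helper names (first_diff, field_at): it reports 0-based offsets and only names fields in the file header and in the fixed header of the first node, since everything after that is variable-length and needs a full decode.

```rust
/// Returns the 0-based offset of the first differing byte. If one input is a
/// strict prefix of the other, returns the length of the shorter input.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Names the field at a 0-based offset, for the file header and the fixed
/// header of node 0 only (k = 0 in the table above).
fn field_at(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=11 => "node_count",
        12..=15 => "node[0].id",
        16..=23 => "node[0].promised_ballot",
        24 => "node[0].role",
        25..=32 => "node[0].my_ballot",
        33..=36 => "node[0].accept_count",
        _ => "variable-length section (decode the dump to locate it)",
    }
}
```

For example, calling field_at on the result of first_diff for rust.bin and go.bin names the first suspect field directly when the divergence falls in the first 37 bytes.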

Tick-level diff

If cmp -l flags a divergence inside the accepts of node 1, add eprintln!/fmt.Fprintln(os.Stderr, ...)/std::cerr lines in each implementation at the boundaries of the suspect ticks:

// Rust version; call after handle() and again after on_tick().
// t, id, and n are the current tick, node id, and node state at the call site.
eprintln!("t={} id={} promised={:?} role={:?} my={:?} accepts={:?} learned={:?}",
          t, id, n.promised_ballot, n.role, n.my_ballot, n.accepts, n.learned);

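Run all three with stderr redirected to one log file each, then diff -u rust.log go.log (and likewise against the C++ log). Keep the line format byte-identical across implementations: Rust's {:?} and Go's %v render structs differently, so print the fields individually if needed. The first differing tick is where the implementations diverge.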
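If the logs are long, a few lines of code can report the first divergent line directly. This is an illustrative sketch that does the same job as reading the top of the diff -u output; the function name first_divergent_line is hypothetical.

```rust
/// Returns the 1-based line number of the first line that differs between two
/// logs, together with that line from each side (None if a log ended early).
fn first_divergent_line<'a>(
    a: &'a str,
    b: &'a str,
) -> Option<(usize, Option<&'a str>, Option<&'a str>)> {
    let mut la = a.lines();
    let mut lb = b.lines();
    let mut n = 1;
    loop {
        match (la.next(), lb.next()) {
            (None, None) => return None,      // both logs ended, no divergence
            (x, y) if x == y => n += 1,       // same line, keep scanning
            (x, y) => return Some((n, x, y)), // first mismatch
        }
    }
}
```

Because each trace line starts with t= and id=, the returned line identifies the tick and node to inspect.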

Most common culprits in practice

Forgetting to sort the Promise payload by slot. Go's map iteration order is randomized; you must sort.Slice before appending to the wire.
Reading next_slot before recovering from prepare_accepted. If recovery doesn't update next_slot = max + 1, the leader will double-allocate a slot that already has a recovered accept, silently overwriting it.
Letting step_down clear promised_ballot. Promises are forever; only my_ballot is candidate-state.
Counting yourself twice in accept_count. Both become_leader and try_decide insert self; the second one is a no-op only if accept_count is a set, not a multiset.
Iterating peers as for p in nodes.iter() on a HashMap. Use BTreeMap in Rust, std::map in C++, and explicit for p := uint32(0); p < n; p++ in Go.
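The first two culprits can be illustrated with a short sketch. It assumes accepts are held in a map keyed by slot and reuses the accept tuple layout from the scenario D walkthrough with little-endian integers. The names Accepted, encode_accepts, and next_slot_after_recovery are hypothetical, and clamping next_slot to its current value is an added safety assumption on top of the max + 1 rule.

```rust
use std::collections::HashMap;

// Hypothetical shape of one accepted entry: (ballot round, proposer id, value).
type Accepted = (u32, u32, Vec<u8>);

/// Encodes accepts in ascending slot order, independent of map iteration order.
fn encode_accepts(accepts: &HashMap<u64, Accepted>) -> Vec<u8> {
    // Collect and sort the keys first; never write in HashMap iteration order.
    let mut slots: Vec<&u64> = accepts.keys().collect();
    slots.sort_unstable();

    let mut out = Vec::new();
    out.extend_from_slice(&(slots.len() as u32).to_le_bytes());
    for slot in slots {
        let (round, proposer_id, value) = &accepts[slot];
        out.extend_from_slice(&slot.to_le_bytes());
        out.extend_from_slice(&round.to_le_bytes());
        out.extend_from_slice(&proposer_id.to_le_bytes());
        out.extend_from_slice(&(value.len() as u32).to_le_bytes());
        out.extend_from_slice(value);
    }
    out
}

/// After recovery, the next free slot is one past the highest recovered slot.
/// Assumption: never move next_slot backwards if it is already higher.
fn next_slot_after_recovery(current: u64, recovered: &HashMap<u64, Accepted>) -> u64 {
    match recovered.keys().max() {
        Some(&max) => current.max(max + 1),
        None => current,
    }
}
```

Storing the accepts in a BTreeMap instead removes the explicit sort, since it iterates in key order.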
