Step 03 — Partitions and Catchup

Goal: implement heal() and truncate_and_replay so a partitioned follower can rejoin and converge. Then ship the cross-language exam.

What to read first

  • CONCEPTS.md § "Partition + heal".
  • docs/execution.md Stages 4–7.
  • src/rust/src/lib.rs — the truncate_and_replay and Cluster::heal bodies.

Concrete tasks

  1. Implement Replica::truncate_and_replay(leader_log, leader_commit): replace the replica's own log with leader_log, wipe the state machine, then replay only the committed prefix (the entries covered by leader_commit).
  2. Implement Cluster::heal():
    • clone leader log + commit index up front (avoid use-after-mutate),
    • clear partitions,
    • for each previously-partitioned follower in ascending id order, call truncate_and_replay.
  3. In propose, when try_append returns false because the follower's log has a gap, push a snapshot to that follower immediately and count it as an ack.
  4. Implement dump_state per the wire format in CONCEPTS.md. Pin every byte offset in a test (test_snapshot_byte_format).
  5. Port the workload driver (run_workload) to all three languages. The byte-decoding rules — kind = (r1 >> 62) & 0x3, k = "k" + …, v = u64_le(r3 % 10000) — must be identical across all three.
  6. Build the Rust binary, run scenarios A and B, capture the hashes, and bake them into the Go test, the C++ test, and scripts/cross_test.sh.
  7. Bring Go and C++ green: run scripts/cross_test.sh. It must end with === ALL OK ===.
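
Tasks 1 and 2 can be sketched in Rust roughly as follows. This is a minimal sketch, not the lib.rs implementation: the Entry type, the field names (log, commit, kv, partitioned), the convention that commit counts committed entries, and the use of the Vec index as the replica id are all assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical log entry (a key/value put); the real type lives in lib.rs.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    key: String,
    value: Vec<u8>,
}

#[derive(Default)]
struct Replica {
    log: Vec<Entry>,
    commit: usize,                 // assumed: number of committed entries
    kv: BTreeMap<String, Vec<u8>>, // the state machine
}

impl Replica {
    fn truncate_and_replay(&mut self, leader_log: &[Entry], leader_commit: usize) {
        self.log = leader_log.to_vec(); // replace own log
        self.kv.clear();                // wipe state machine
        self.commit = leader_commit.min(self.log.len());
        // Replay the committed prefix only; uncommitted entries stay unapplied.
        for e in &self.log[..self.commit] {
            self.kv.insert(e.key.clone(), e.value.clone());
        }
    }
}

struct Cluster {
    replicas: Vec<Replica>,       // assumed: index == replica id
    leader: usize,
    partitioned: BTreeSet<usize>, // ids currently cut off from the leader
}

impl Cluster {
    fn heal(&mut self) {
        // Clone up front so nothing below reads leader state after a mutation.
        let leader_log = self.replicas[self.leader].log.clone();
        let leader_commit = self.replicas[self.leader].commit;
        // Clear the partition set while keeping the old membership.
        let healed = std::mem::take(&mut self.partitioned);
        // BTreeSet iterates in ascending id order.
        for id in healed {
            if id == self.leader {
                continue;
            }
            self.replicas[id].truncate_and_replay(&leader_log, leader_commit);
        }
    }
}
```

Taking the set with std::mem::take clears the partitions and preserves the list of followers to repair in a single step, and the ordered set gives the ascending-id iteration for free.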

Definition of done

bash scripts/verify.sh        # → "=== OK ==="
bash scripts/cross_test.sh    # → "=== ALL OK ==="

The two scenarios must produce these hashes:

  • A: 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b
  • B: 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc

Common bugs at this stage

  • heal() reads leader.log after mutating a follower. Fix: clone the leader's log and commit index into local variables before touching any follower.
  • dump_state in Go iterates the map directly → randomised hash. Fix: sort the keys.
  • dump_state in C++ uses strcpy(magic_buf, "DSEDKV20") and copies 9 bytes including the NUL. Fix: std::memcpy(buf, MAGIC.data(), 8).
  • C++ test passes in Debug, fails in Release because assert(c.propose(...)) got stripped. Fix: assign to a bool ok = ... first, then assert(ok).
  • CLI prints a trailing newline. The exam compares full lines; a trailing \n breaks the hash comparison.
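
The map-ordering and magic-length bugs above have a Rust analogue, since std's HashMap does not guarantee a stable iteration order either. The sketch below shows the general remedy: sort the keys and write exactly 8 magic bytes. The record layout here (u32 little-endian lengths followed by raw bytes) and the function name dump_sorted are hypothetical and chosen only for illustration; the real layout is the wire format in CONCEPTS.md.

```rust
use std::collections::HashMap;

// Exactly 8 bytes; a byte-string literal carries no trailing NUL.
const MAGIC: &[u8; 8] = b"DSEDKV20";

/// Illustrative layout only (NOT the CONCEPTS.md wire format):
/// magic, then for each key in ascending byte order:
/// u32_le(key length), key bytes, u32_le(value length), value bytes.
fn dump_sorted(kv: &HashMap<String, Vec<u8>>) -> Vec<u8> {
    // HashMap iteration order is unspecified, so pin it by sorting the keys.
    let mut keys: Vec<&String> = kv.keys().collect();
    keys.sort();

    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    for k in keys {
        let v = &kv[k];
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

Two maps holding the same pairs then serialise to identical bytes regardless of insertion order, which is the property the cross-language hash comparison depends on.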