Step 03 — Fault injection and catch-up

Goal

Add the failure-injection schedule, the catch_up operation, and the top-level run_cluster workload driver — completing the lab.

Tasks

  1. Implement Cluster::set_follower_up(fid, up) (assert fid is 1 or 2, never 0).
  2. Implement Cluster::catch_up(fid):
    • Snapshot the leader's log and commit_index.
    • While the follower's log.len() is less than the leader's, append leader_log[fol.log.len()] to the follower.
    • If the follower's commit_index is below the leader's, set it to the leader's and call apply_committed.
  3. Implement step_op(rng, keys):
    • Draw three values in order, r1, r2, r3, each from its own rng.next() call. Always draw all three, even when the op turns out to be a Del.
    • kind = (r1 >> 62) & 0x3; 0,1,2 → Put, 3 → Del.
    • k = i64(r2 % keys), v = i64(r3 % 1000).
  4. Implement run_cluster(seed, ops, keys, scenario):
    • down_start = ops/2, down_end = (ops*3)/4, with_fault = (scenario == "fault").
    • For i in 0..ops:
      • If with_fault && i == down_start: set follower 1 down.
      • If with_fault && i == down_end: set follower 1 up, then catch_up(1).
      • submit(step_op(rng, keys)).
    • After the loop: if with_fault && !up[1], set follower 1 up and catch_up(1). (Handles ops % 4 != 0.)
  5. Write a CLI command, dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault>, that prints the SHA-256 of run_cluster(...).encode_snapshot() as hex, with no trailing newline.
  6. Freeze the two scenario hashes as named constants and assert them in two tests per language. Cross-check with scripts/cross_test.sh.
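
Tasks 3 and 4 can be sketched in Rust roughly as follows. This is a minimal illustration, not the lab's reference code: the Op enum, the decode_op, fault_window, and ops_while_down helpers, and the SplitMix64-based Rng are all assumptions made so the sketch compiles on its own. The frozen hashes depend on the lab's real RNG and Op type from the earlier steps, not on these stand-ins.

```rust
// Hypothetical operation type; the lab's real Op may be shaped differently.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Op {
    Put { key: i64, value: i64 },
    Del { key: i64 },
}

// Stand-in generator (SplitMix64), used only so this sketch runs.
// It is NOT claimed to be the RNG the frozen hashes were produced with.
pub struct Rng(pub u64);

impl Rng {
    pub fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// Pure decoding of three already-drawn words. Assumes keys > 0.
pub fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64) -> Op {
    let kind = (r1 >> 62) & 0x3; // top two bits: 0, 1, 2 => Put; 3 => Del
    let key = (r2 % keys) as i64;
    let value = (r3 % 1000) as i64; // computed even when Del ignores it
    if kind == 3 {
        Op::Del { key }
    } else {
        Op::Put { key, value }
    }
}

// Always consumes exactly three words, whichever branch is taken.
pub fn step_op(rng: &mut Rng, keys: u64) -> Op {
    let r1 = rng.next();
    let r2 = rng.next();
    let r3 = rng.next();
    decode_op(r1, r2, r3, keys)
}

// (down_start, down_end) as defined in task 4.
pub fn fault_window(ops: u64) -> (u64, u64) {
    (ops / 2, (ops * 3) / 4)
}

// Mirrors the run_cluster loop shape, but only counts how many
// operations are submitted while follower 1 is down.
pub fn ops_while_down(ops: u64, with_fault: bool) -> u64 {
    let (down_start, down_end) = fault_window(ops);
    let mut up = true;
    let mut missed = 0;
    for i in 0..ops {
        if with_fault && i == down_start {
            up = false;
        }
        if with_fault && i == down_end {
            up = true; // the real driver also calls catch_up(1) here
        }
        if !up {
            missed += 1;
        }
    }
    missed
}
```

With ops = 2000 the window covers iterations 1000 through 1499, so follower 1 misses 500 operations before catch_up(1) replays them. With ops = 10 it covers iterations 5 and 6.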

Acceptance

  • verify.sh ends with === OK ===.
  • cross_test.sh ends with === ALL OK ===.
  • The two frozen hashes match across Rust, Go, and C++:
    • 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296 (normal, seed=42 ops=200 keys=16)
    • d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792 (fault, seed=7 ops=2000 keys=128)
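
One possible way to hold the frozen hashes and emit a digest without a trailing newline is sketched below in Rust. The constant names NORMAL_HASH and FAULT_HASH and the helpers to_hex and emit_hash are hypothetical. The SHA-256 computation itself is left out because it needs a hashing library, so the sketch starts from digest bytes that have already been computed.

```rust
use std::io::Write;

// Frozen values copied from the acceptance list; the names are illustrative.
pub const NORMAL_HASH: &str =
    "5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296";
pub const FAULT_HASH: &str =
    "d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792";

// Lowercase hex, two characters per byte.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

// Writes the hex digest and nothing else: no newline is appended,
// so byte-for-byte comparisons across languages stay simple.
pub fn emit_hash<W: Write>(out: &mut W, digest: &[u8]) -> std::io::Result<()> {
    out.write_all(to_hex(digest).as_bytes())?;
    out.flush()
}
```

In a test, the string produced by to_hex over the real digest would be compared against NORMAL_HASH or FAULT_HASH with assert_eq!. In the CLI, emit_hash would be handed a locked std::io::stdout().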

Pitfalls

  • Drawing fewer RNG words on the Del branch shifts every later draw, so the hashes silently diverge with no error to point at the cause. Always draw three.
  • The post-loop catch-up matters: if the run ends inside the down window, follower 1 still needs to converge.
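  • catch_up should clone the leader's log before touching the follower. In Rust, reading the leader while holding a mutable borrow of the follower in the same collection requires careful borrow handling, and the clone avoids the problem.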
  • The "ack on up[fid] only" rule is essential: a down follower contributes zero acks regardless of its log length.
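
The borrow-handling pitfall for catch_up can be illustrated with a minimal Rust sketch. The Entry, Node, and Cluster layouts here are assumptions (in particular, that the leader sits at index 0 of a nodes vector and that apply_committed only advances an applied counter); the lab's real types from the earlier steps will carry more state.

```rust
// Hypothetical log entry and node; stand-ins for the lab's real types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub key: i64,
    pub value: i64,
}

#[derive(Debug, Clone, Default)]
pub struct Node {
    pub log: Vec<Entry>,
    pub commit_index: usize,
    pub applied: usize,
}

impl Node {
    // Stand-in: the real method would apply entries to the key-value store.
    fn apply_committed(&mut self) {
        self.applied = self.commit_index;
    }
}

pub struct Cluster {
    pub nodes: Vec<Node>, // assumed: index 0 is the leader, 1 and 2 are followers
    pub up: Vec<bool>,
}

impl Cluster {
    pub fn set_follower_up(&mut self, fid: usize, up: bool) {
        assert!(fid == 1 || fid == 2, "fid must be a follower (1 or 2)");
        self.up[fid] = up;
    }

    pub fn catch_up(&mut self, fid: usize) {
        assert!(fid == 1 || fid == 2, "fid must be a follower (1 or 2)");
        // Snapshot first. Both reads of nodes[0] finish here, so the
        // mutable borrow of nodes[fid] below does not overlap with them.
        let leader_log = self.nodes[0].log.clone();
        let leader_commit = self.nodes[0].commit_index;

        let fol = &mut self.nodes[fid];
        while fol.log.len() < leader_log.len() {
            let next = leader_log[fol.log.len()].clone();
            fol.log.push(next);
        }
        if fol.commit_index < leader_commit {
            fol.commit_index = leader_commit;
            fol.apply_committed();
        }
    }
}
```

Cloning is the simplest way to satisfy the borrow checker here. If the copy ever becomes a concern, splitting the vector with split_at_mut to borrow the leader and follower disjointly is one alternative worth considering.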