db-18 step 02 — Multi-Paxos and the replicated log

Goal

Generalise single-decree Paxos into a log. A stable leader runs Phase 1 once, then drives a sequence of slots through Phase 2 only — that is the entire point of Multi-Paxos. Newly elected leaders must recover any partially-accepted slots before issuing new proposals, so the log stays contiguous and every committed prefix is identical on every replica.

Tasks

  1. Leader election trigger. A Follower or Candidate whose election_deadline elapses bumps my_ballot.round = max(my_ballot.round, promised_ballot.round) + 1, sets my_ballot.proposer_id = self.id, transitions to Candidate, and broadcasts Prepare { ballot: my_ballot }. Election deadline is reset with the same splitmix64 jitter formula as Raft: t + 150 + splitmix64(seed ^ id ^ t) % 150.

  2. become_leader. On collecting quorum promises for my_ballot, transition to Leader, then:

    • Compute next_slot = max(slot in any promise.accepted) + 1, defaulting to max(learned.keys()) + 1 if no accepts were reported.
    • For every slot in [0, next_slot) that appears in any promise: adopt the value of the highest-ballot accept and broadcast Accept { my_ballot, slot, value } (this is the recovery sweep — it re-proposes potentially-chosen values under the new ballot).
    • Call drain_pending to attach pending client values to the next free slots, broadcasting Accept for each.
  3. Heartbeat. A Leader whose heartbeat_due elapses broadcasts Heartbeat { ballot: my_ballot }. Followers reset their election timers on any inbound RPC from the current leader. This is what makes Multi-Paxos amortise Phase 1: as long as heartbeats arrive, no one starts a new ballot.

  4. Decided broadcast. When a leader's try_decide(slot) sees a quorum of accept_ok=true for the slot's ballot, it marks learned[slot] = value and broadcasts Decided { slot, value } to every node. Learners record the value in learned; the leader does not need to re-decide on receipt.

  5. Lowest-id leader rule. When tests inspect "the" leader of a cluster, they pick the Leader with the lowest id. This is a deterministic tie-break for the (rare) case where two nodes briefly both believe themselves leader during a flap; the safety invariants do not depend on at-most-one Leader, only on at-most-one chosen value per slot per ballot.

Acceptance

Inline unit tests in each language:

  • single_node_decides_every_proposal — a 1-node cluster (quorum 1) with proposals = 3 ends with learned = [(0, "val-0"), (1, "val-1"), (2, "val-2")]. Degenerate case but verifies the leader path.
  • three_node_elects_single_leaderCluster::new(42, 3) after 500 ticks with zero proposals has exactly one node in role Leader.
  • three_node_replicates_proposalsCluster::new(42, 3) after 1000 ticks with proposals = 5 has every node's learned of length 5 and byte-identical to node 0's.
  • multi_slot_log_is_contiguous — 10 proposals on a 3-node cluster yield slot keys 0..10 on every node, no gaps.
  • partition_heals_progress_resumes — drop all traffic between node 0 and the other two; the surviving pair {1, 2} still elects a leader and decides 4 proposals. Demonstrates that Multi-Paxos liveness depends on some quorum being connected, not on the original leader being reachable.

All five green in Rust, Go, and C++.

Discussion prompts

  • Amortisation. Why is the Phase 1 cost paid only at leader change in Multi-Paxos but on every decision in single-decree Paxos? What is the steady-state message count per decision on a 5-node cluster?
  • Leader leases. Real systems (Spanner, Chubby) layer a lease on top of Multi-Paxos so the leader can serve linearizable reads without quorum. What changes in the safety argument if you serve reads off the leader without a lease?
  • Recovery cost. A new leader must walk every acceptor's full accepts map for the recovery sweep. What is the message size in bytes for a log with 1M slots and 256-byte values? What optimisations (truncation, snapshots, min_slot exchange) would you add for production?