db-18 step 02 — Multi-Paxos and the replicated log
Goal
Generalise single-decree Paxos into a log. A stable leader runs Phase 1 once, then drives a sequence of slots through Phase 2 only — that is the entire point of Multi-Paxos. Newly elected leaders must recover any partially-accepted slots before issuing new proposals, so the log stays contiguous and every committed prefix is identical on every replica.
Tasks
-
Leader election trigger. A Follower or Candidate whose
election_deadlineelapses bumpsmy_ballot.round = max(my_ballot.round, promised_ballot.round) + 1, setsmy_ballot.proposer_id = self.id, transitions to Candidate, and broadcastsPrepare { ballot: my_ballot }. Election deadline is reset with the same splitmix64 jitter formula as Raft:t + 150 + splitmix64(seed ^ id ^ t) % 150. -
become_leader. On collectingquorumpromises formy_ballot, transition to Leader, then:- Compute
next_slot = max(slot in any promise.accepted) + 1, defaulting tomax(learned.keys()) + 1if no accepts were reported. - For every slot in
[0, next_slot)that appears in any promise: adopt the value of the highest-ballot accept and broadcastAccept { my_ballot, slot, value }(this is the recovery sweep — it re-proposes potentially-chosen values under the new ballot). - Call
drain_pendingto attach pending client values to the next free slots, broadcastingAcceptfor each.
- Compute
-
Heartbeat. A Leader whose
heartbeat_dueelapses broadcastsHeartbeat { ballot: my_ballot }. Followers reset their election timers on any inbound RPC from the current leader. This is what makes Multi-Paxos amortise Phase 1: as long as heartbeats arrive, no one starts a new ballot. -
Decidedbroadcast. When a leader'stry_decide(slot)sees a quorum ofaccept_ok=truefor the slot's ballot, it markslearned[slot] = valueand broadcastsDecided { slot, value }to every node. Learners record the value inlearned; the leader does not need to re-decide on receipt. -
Lowest-id leader rule. When tests inspect "the" leader of a cluster, they pick the Leader with the lowest
id. This is a deterministic tie-break for the (rare) case where two nodes briefly both believe themselves leader during a flap; the safety invariants do not depend on at-most-one Leader, only on at-most-one chosen value per slot per ballot.
Acceptance
Inline unit tests in each language:
single_node_decides_every_proposal— a 1-node cluster (quorum 1) withproposals = 3ends withlearned = [(0, "val-0"), (1, "val-1"), (2, "val-2")]. Degenerate case but verifies the leader path.three_node_elects_single_leader—Cluster::new(42, 3)after 500 ticks with zero proposals has exactly one node in role Leader.three_node_replicates_proposals—Cluster::new(42, 3)after 1000 ticks withproposals = 5has every node'slearnedof length 5 and byte-identical to node 0's.multi_slot_log_is_contiguous— 10 proposals on a 3-node cluster yield slot keys0..10on every node, no gaps.partition_heals_progress_resumes— drop all traffic between node 0 and the other two; the surviving pair{1, 2}still elects a leader and decides 4 proposals. Demonstrates that Multi-Paxos liveness depends on some quorum being connected, not on the original leader being reachable.
All five green in Rust, Go, and C++.
Discussion prompts
- Amortisation. Why is the Phase 1 cost paid only at leader change in Multi-Paxos but on every decision in single-decree Paxos? What is the steady-state message count per decision on a 5-node cluster?
- Leader leases. Real systems (Spanner, Chubby) layer a lease on top of Multi-Paxos so the leader can serve linearizable reads without quorum. What changes in the safety argument if you serve reads off the leader without a lease?
- Recovery cost. A new leader must walk every acceptor's full
acceptsmap for the recovery sweep. What is the message size in bytes for a log with 1M slots and 256-byte values? What optimisations (truncation, snapshots,min_slotexchange) would you add for production?