db-18 — Paxos
This lab implements Multi-Paxos consensus in Rust, Go, and C++,
all three producing a byte-identical sha256 of a canonical cluster
dump for any (seed, nodes, rounds, proposals, partition)
configuration. It is the sibling of db-17 (Raft) and reuses db-16's
deterministic simulator discipline: same splitmix64 seeding, same
(delivery_time, sender, seq) heap tie-break, same "sorted iteration
on the wire" rule, same closed-form proposal schedule.
If db-17 taught you that one consensus algorithm can be expressed identically in three languages, db-18 teaches you that another consensus algorithm — built on different primitives, with no built-in leader concept, and capable of arbitrary concurrent proposers — can be held to the very same bit-level discipline. The two implementations share zero algorithmic code but share all of the determinism machinery, and that is the point.
What is it?
Paxos (Lamport, "The Part-Time Parliament" 1998 / "Paxos Made Simple"
2001) is the original asynchronous consensus algorithm: a family of
acceptors collectively decides on a single value per slot despite
crashes, message loss, and message reordering. Unlike Raft, Paxos has
no first-class leader and no current_term. Its only ordering
primitive is the ballot — a lexicographic pair (round, proposer_id) that acceptors monotonically promise to honor.
Single-decree (one-slot) Paxos has two phases:
-
Phase 1 — Prepare / Promise. A proposer picks a fresh ballot
band broadcastsPrepare(b). An acceptor whose previously promised ballot is≤ bupdatespromised := band replies with every prior accept it holds (each(slot, accepted_ballot, value)triple). On collecting promises from a majority, the proposer enters Phase 2. -
Phase 2 — Accept / Accepted. For each slot, the proposer picks the value to propose: if any promise returned a prior accept for that slot, it must re-propose the value with the highest accepted_ballot (Lamport's P2c invariant); otherwise it is free to propose its own client value. It broadcasts
Accept(b, slot, v). An acceptor whose promised ballot is≤ brecordsaccepted[slot] := (b, v)and repliesAccepted(b, slot). On collecting accepts from a majority, the proposer declares the slot decided and broadcastsDecided(slot, v)to anyone who didn't get the accept.
Multi-Paxos amortizes Phase 1 across many slots. The proposer who
"wins" Phase 1 acts as a distinguished proposer (lab-locally we call
this role Leader) and reuses its promised ballot to drive Phase 2
for every subsequent slot, paying the Phase-1 cost only once per
ballot. Liveness is preserved by election timeouts: an acceptor that
hasn't heard from a leader for ≥ ELECTION_TIMEOUT_MIN + jitter ticks
starts its own Phase 1 with a higher round.
This lab implements Multi-Paxos end-to-end. It is the algorithm behind Google Chubby, Google Spanner's paxos groups, Cassandra lightweight transactions, and (in spirit) Apache ZooKeeper's ZAB.
Why does it matter?
-
Paxos is the historical and theoretical root of asynchronous consensus. Raft, ZAB, Viewstamped Replication, and EPaxos are all reactions to or refinements of Paxos. Reading the paper is easier when you have made the algorithm bit-deterministic with your own hands.
-
No fixed leader means no "single term" to lean on. Raft's safety flows largely from "exactly one leader per term". Paxos has neither. Its safety flows from the much weaker quorum-intersection argument: any two majorities of an
n-node cluster share at least one acceptor, and that acceptor's promised-ballot ordering serializes every accept that could possibly decide a slot. Writing the algorithm in three languages, watching the same sha256 fall out, and then deliberately breaking the quorum (scenario E) is the most visceral way to internalise quorum intersection. -
Concurrent proposers are first-class. Paxos lets every node attempt Phase 1 at any time. Dueling proposers are not an error case; they are the normal case during leadership churn. The deterministic simulator lets you replay the exact tick at which two proposers tied, see which ballot won, and confirm the safety invariants held without any "leader lease" magic.
-
Foundation for the rest of the distributed track. db-19 (ZAB) layers epoch+counter on top of a paxos-ish core; db-20 (distributed KV) feeds Paxos accept-decisions into a key-value state machine; db-23 (capstone) introduces snapshots and reconfiguration on top of whichever consensus engine the student picks (Raft, Paxos, or both).
How does it work?
State (per node)
acceptor : promised_ballot : Ballot # global, not per-slot
accepts : Map<slot, (Ballot, Vec<u8>)>
learner : learned : Map<slot, Vec<u8>>
proposer : role : Follower | Candidate | Leader
my_ballot : Ballot # the ballot this node is driving
prepare_promises: Set<acceptor_id> # accumulated this election
prepare_accepted: Map<slot, (Ballot, Vec<u8>)> # recovered during Phase 1
accept_count : Map<slot, Set<acceptor_id>>
next_slot : u64 # next fresh slot to propose
pending : Deque<Vec<u8>> # queued client values
timers : election_deadline : u64 # sim-time tick
last_heartbeat_sent : u64
promised_ballot is global per node (covers every slot, present and
future) — this is the standard Multi-Paxos optimization. accepts is
per-slot, because each slot is its own single-decree instance.
learned is the per-slot decision; once set it never changes.
Ballot ordering
#![allow(unused)] fn main() { #[derive(Clone, Copy, Eq, PartialEq)] struct Ballot { round: u32, proposer_id: u32 } }
Lex order on (round, proposer_id). Ballot::ZERO = (0, 0) means
"no ballot" and compares less than every other ballot. Promotion of
promised_ballot is monotonic: once an acceptor has promised b, it
will never accept any RPC carrying a strictly lower ballot.
Election timer (liveness)
reset_election_deadline(t):
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
Identical to db-17's election timer. Heartbeats fire every 50 ticks from the current leader to keep follower timers refreshed.
Phase 1 — Prepare / Promise
start_election(t):
role = Candidate
new_round = max(promised_ballot.round, my_ballot.round) + 1
my_ballot = Ballot { round: new_round, proposer_id: self.id }
prepare_promises = { self.id } # self-promise
prepare_accepted = { slot: (ab, v) | (slot, (ab, v)) in self.accepts }
if my_ballot >= promised_ballot:
promised_ballot = my_ballot # we promise ourselves too
broadcast(Prepare { ballot: my_ballot })
if |prepare_promises| >= quorum(): # n = 1 cluster
become_leader(t)
on Prepare(b) at acceptor:
if b >= promised_ballot:
promised_ballot = b
if role in {Candidate, Leader} and b > my_ballot:
step_down(t) # higher proposer takes over
reset_election_deadline(t)
send Promise(b, accept_ok=true,
accepted = sorted_by_slot(accepts),
acceptor_id = self.id) → b.proposer_id
else:
send Promise(b, accept_ok=false, acceptor_id=self.id) → b.proposer_id
on Promise(b, ok, accepted, from) at candidate:
if role != Candidate or b != my_ballot: drop
if not ok: step_down(t); return # someone outranks us
prepare_promises.insert(from)
for (slot, ab, v) in accepted:
if slot not in prepare_accepted or ab > prepare_accepted[slot].ballot:
prepare_accepted[slot] = (ab, v) # recover highest-ballot value
if |prepare_promises| >= quorum():
become_leader(t)
The recovery rule take if ab > current.ballot is the operational
form of Lamport's P2c: across any majority of acceptors, the value
with the highest accepted ballot for a slot is the only value that
could already be decided in that slot, so the new leader must keep
proposing it (or anything if no acceptor reports a prior accept).
Phase 2 — Accept / Accepted
become_leader(t):
role = Leader
# Re-issue Accepts under our ballot for every recovered slot.
for slot in sorted(prepare_accepted.keys):
if slot in learned: continue
value = prepare_accepted[slot].value
accepts[slot] = (my_ballot, value)
accept_count[slot] = { self.id }
broadcast(Accept { ballot: my_ballot, slot, value })
next_slot = 1 + max(any seen slot in accepts ∪ learned, or -1)
last_heartbeat_sent = t
broadcast(Heartbeat { ballot: my_ballot })
drain_pending(out)
drain_pending():
while pending is non-empty:
value = pending.pop_front()
slot = next_slot; next_slot += 1
accepts[slot] = (my_ballot, value)
accept_count[slot] = { self.id }
broadcast(Accept { ballot: my_ballot, slot, value })
try_decide(slot) # n=1 cluster
on Accept(b, slot, v) at acceptor:
if b >= promised_ballot:
promised_ballot = b
accepts[slot] = (b, v)
if role in {Candidate, Leader} and b > my_ballot:
step_down(t)
reset_election_deadline(t)
send Accepted(b, slot, ok=true, self.id) → b.proposer_id
else:
send Accepted(b, slot, ok=false, self.id) → b.proposer_id
on Accepted(b, slot, ok, from) at leader:
if role != Leader or b != my_ballot: drop
if not ok: step_down(t); return
accept_count[slot].insert(from)
try_decide(slot)
try_decide(slot):
if role != Leader or slot in learned: return
if |accept_count[slot]| >= quorum():
v = accepts[slot].value
learned[slot] = v
broadcast(Decided { slot, value: v })
on Decided(slot, v) at any node:
learned[slot] = v
reset_election_deadline(t)
on Heartbeat(b) at node:
if b >= my_ballot and role in {Candidate, Leader} and b.proposer_id != self.id:
step_down(t)
if b >= promised_ballot or (promised_ballot != ZERO and b == promised_ballot):
reset_election_deadline(t)
Simulator loop (per tick t in 0..rounds)
1. enqueue scheduled proposals — schedule[i] = (i+1) * rounds / (K+1)
2. drain cluster-pending values into the current leader (if any)
3. pop every in-flight msg with delivery_time <= t and dispatch handle()
4. tick all nodes in ascending id; on_tick may fire election or heartbeat
The leader-pick rule for proposal injection is "lowest-id node with
role == Leader". During leadership churn there may be no leader (in
which case the value waits in cluster_pending) or even two stale
leaders (in which case the lowest id wins). The deterministic choice
is what keeps the byte hash stable.
Wire format (Rpc)
Six variants; tagged-union shape in Go, Rust enum and C++ std::variant-
backed types. All fields fixed-width, little-endian:
Prepare { ballot: (round: u32, proposer_id: u32) }
Promise { ballot, accept_ok: bool, acceptor_id: u32,
accepted: [(slot: u64, accepted_ballot, value: Vec<u8>)] }
Accept { ballot, slot: u64, value: Vec<u8> }
Accepted { ballot, slot: u64, accept_ok: bool, acceptor_id: u32 }
Decided { slot: u64, value: Vec<u8> }
Heartbeat { ballot }
The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. The only thing that is serialized is the canonical dump, and that is what gets hashed.
Canonical dump format
file := magic[8 = "DSEPAX01"] u32_le(node_count) node*
node := u32_le id
u32_le promised_ballot.round
u32_le promised_ballot.proposer_id
u8 role # Follower=0, Candidate=1, Leader=2
u32_le my_ballot.round
u32_le my_ballot.proposer_id
u32_le accept_count
accept * accept_count
u32_le learned_count
learned * learned_count
accept := u64_le slot
u32_le accepted_ballot.round
u32_le accepted_ballot.proposer_id
u32_le value_len
u8 value[value_len]
learned := u64_le slot
u32_le value_len
u8 value[value_len]
Nodes appear in ascending id order; inside each node, both accepts
and learned are emitted in ascending slot order. All multi-byte
integers are little-endian. The dump is hashed with SHA-256 and the
lowercase hex digest is what paxosctl prints to stdout (no trailing
newline).
Cross-language invariants
| Invariant | Why it matters |
|---|---|
splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output across languages |
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++) | identical broadcast order |
| acceptor's Promise lists prior accepts in ascending slot order | identical Promise payload bytes |
candidate's Phase-1 recovery rule: keep (ab, v) with strictly greater ab | identical recovered value per slot |
next_slot = 1 + max(seen accept slot ∪ seen learned slot) after winning Phase 1 | identical first fresh slot |
try_decide quorum check uses ≥ n/2 + 1 (strict majority, leader counted) | identical decide tick |
leader-pick for proposal injection: lowest-id Leader | identical client routing |
proposal schedule: schedule[i] = (i+1) * rounds / (K+1) integer division | identical pending queue contents |
Role enum order Follower=0, Candidate=1, Leader=2 | identical dump bytes |
| dump emits accepts and learned in ascending slot order; nodes in ascending id order | identical dump bytes |
Drift in any one of these and scripts/cross_test.sh fails. The
companion cmp -l workflow in docs/observation.md walks you from
"the hashes differ" to "this exact byte differs" in three commands.
Multi-Paxos vs. Raft (the comparison the labs exist to make)
| Dimension | Raft (db-17) | Multi-Paxos (db-18) |
|---|---|---|
| ordering primitive | current_term: u64 (single integer, persisted, monotonic) | Ballot { round, proposer_id } lex pair |
| leader concept | first-class; exactly one leader per term | emergent; "leader" = whoever last won Phase 1 |
| concurrent proposers | forbidden by election safety | allowed (and routine during churn) |
| consistency check | prev_log_index / prev_log_term per AppendEntries | per-slot accepted_ballot carried in Promise |
| Phase-1 cost amortization | none needed (single leader) | Multi-Paxos (one Prepare covers all future slots) |
| safety from | log matching + election restriction + commit-only-current-term | quorum intersection + Promise reports prior accepts |
| understandability | designed for clarity (Ongaro 2014) | famously subtle (P2c, dueling proposers) |
The lab implementations make these dimensions concrete: scenario A in db-17 takes ~166 ticks to commit a proposal (election + AE round trip); the equivalent scenario A here takes ~150 ticks for Phase 1 plus ~3 ticks per Accept, then the leader runs at Phase-2-only cost until somebody bumps it.
Files
src/rust/—paxos18crate +paxosctlbinary.src/go/— modulegithub.com/10xdev/dse/db18+cmd/paxosctl.src/cpp/—db18_libstatic library +paxosctlbinary +test_db18.scripts/verify.sh— builds + runs the unit tests for all three.scripts/cross_test.sh— proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.
See docs/ for the long-form write-up and steps/ for the staged
implementation path.