db-18 — Paxos

This lab implements Multi-Paxos consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It is the sibling of db-17 (Raft) and reuses db-16's deterministic simulator discipline: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule, same closed-form proposal schedule.

If db-17 taught you that one consensus algorithm can be expressed identically in three languages, db-18 teaches you that another consensus algorithm — built on different primitives, with no built-in leader concept, and capable of arbitrary concurrent proposers — can be held to the very same bit-level discipline. The two implementations share zero algorithmic code but share all of the determinism machinery, and that is the point.


What is it?

Paxos (Lamport, "The Part-Time Parliament" 1998 / "Paxos Made Simple" 2001) is the original asynchronous consensus algorithm: a family of acceptors collectively decides on a single value per slot despite crashes, message loss, and message reordering. Unlike Raft, Paxos has no first-class leader and no current_term. Its only ordering primitive is the ballot — a lexicographic pair (round, proposer_id) that acceptors monotonically promise to honor.

Single-decree (one-slot) Paxos has two phases:

  1. Phase 1 — Prepare / Promise. A proposer picks a fresh ballot b and broadcasts Prepare(b). An acceptor whose previously promised ballot is ≤ b updates promised := b and replies with every prior accept it holds (each (slot, accepted_ballot, value) triple). On collecting promises from a majority, the proposer enters Phase 2.

  2. Phase 2 — Accept / Accepted. For each slot, the proposer picks the value to propose: if any promise returned a prior accept for that slot, it must re-propose the value with the highest accepted_ballot (Lamport's P2c invariant); otherwise it is free to propose its own client value. It broadcasts Accept(b, slot, v). An acceptor whose promised ballot is ≤ b records accepted[slot] := (b, v) and replies Accepted(b, slot). On collecting accepts from a majority, the proposer declares the slot decided and broadcasts Decided(slot, v) to anyone who didn't get the accept.

Multi-Paxos amortizes Phase 1 across many slots. The proposer who "wins" Phase 1 acts as a distinguished proposer (lab-locally we call this role Leader) and reuses its promised ballot to drive Phase 2 for every subsequent slot, paying the Phase-1 cost only once per ballot. Liveness is preserved by election timeouts: an acceptor that hasn't heard from a leader for ≥ ELECTION_TIMEOUT_MIN + jitter ticks starts its own Phase 1 with a higher round.

This lab implements Multi-Paxos end-to-end. It is the algorithm behind Google Chubby, Google Spanner's paxos groups, Cassandra lightweight transactions, and (in spirit) Apache ZooKeeper's ZAB.


Why does it matter?

  • Paxos is the historical and theoretical root of asynchronous consensus. Raft, ZAB, Viewstamped Replication, and EPaxos are all reactions to or refinements of Paxos. Reading the paper is easier when you have made the algorithm bit-deterministic with your own hands.

  • No fixed leader means no "single term" to lean on. Raft's safety flows largely from "exactly one leader per term". Paxos has neither. Its safety flows from the much weaker quorum-intersection argument: any two majorities of an n-node cluster share at least one acceptor, and that acceptor's promised-ballot ordering serializes every accept that could possibly decide a slot. Writing the algorithm in three languages, watching the same sha256 fall out, and then deliberately breaking the quorum (scenario E) is the most visceral way to internalise quorum intersection.

  • Concurrent proposers are first-class. Paxos lets every node attempt Phase 1 at any time. Dueling proposers are not an error case; they are the normal case during leadership churn. The deterministic simulator lets you replay the exact tick at which two proposers tied, see which ballot won, and confirm the safety invariants held without any "leader lease" magic.

  • Foundation for the rest of the distributed track. db-19 (ZAB) layers epoch+counter on top of a paxos-ish core; db-20 (distributed KV) feeds Paxos accept-decisions into a key-value state machine; db-23 (capstone) introduces snapshots and reconfiguration on top of whichever consensus engine the student picks (Raft, Paxos, or both).


How does it work?

State (per node)

acceptor    : promised_ballot : Ballot                # global, not per-slot
              accepts         : Map<slot, (Ballot, Vec<u8>)>
learner     : learned         : Map<slot, Vec<u8>>
proposer    : role            : Follower | Candidate | Leader
              my_ballot       : Ballot                # the ballot this node is driving
              prepare_promises: Set<acceptor_id>      # accumulated this election
              prepare_accepted: Map<slot, (Ballot, Vec<u8>)>  # recovered during Phase 1
              accept_count    : Map<slot, Set<acceptor_id>>
              next_slot       : u64                   # next fresh slot to propose
              pending         : Deque<Vec<u8>>        # queued client values
timers      : election_deadline   : u64               # sim-time tick
              last_heartbeat_sent : u64

promised_ballot is global per node (covers every slot, present and future) — this is the standard Multi-Paxos optimization. accepts is per-slot, because each slot is its own single-decree instance. learned is the per-slot decision; once set it never changes.

Ballot ordering

#![allow(unused)]
fn main() {
#[derive(Clone, Copy, Eq, PartialEq)]
struct Ballot { round: u32, proposer_id: u32 }
}

Lex order on (round, proposer_id). Ballot::ZERO = (0, 0) means "no ballot" and compares less than every other ballot. Promotion of promised_ballot is monotonic: once an acceptor has promised b, it will never accept any RPC carrying a strictly lower ballot.

Election timer (liveness)

reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

Identical to db-17's election timer. Heartbeats fire every 50 ticks from the current leader to keep follower timers refreshed.

Phase 1 — Prepare / Promise

start_election(t):
    role = Candidate
    new_round = max(promised_ballot.round, my_ballot.round) + 1
    my_ballot = Ballot { round: new_round, proposer_id: self.id }
    prepare_promises = { self.id }                  # self-promise
    prepare_accepted = { slot: (ab, v) | (slot, (ab, v)) in self.accepts }
    if my_ballot >= promised_ballot:
        promised_ballot = my_ballot                 # we promise ourselves too
    broadcast(Prepare { ballot: my_ballot })
    if |prepare_promises| >= quorum():              # n = 1 cluster
        become_leader(t)

on Prepare(b) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)                            # higher proposer takes over
        reset_election_deadline(t)
        send Promise(b, accept_ok=true,
                     accepted = sorted_by_slot(accepts),
                     acceptor_id = self.id) → b.proposer_id
    else:
        send Promise(b, accept_ok=false, acceptor_id=self.id) → b.proposer_id

on Promise(b, ok, accepted, from) at candidate:
    if role != Candidate or b != my_ballot: drop
    if not ok: step_down(t); return                 # someone outranks us
    prepare_promises.insert(from)
    for (slot, ab, v) in accepted:
        if slot not in prepare_accepted or ab > prepare_accepted[slot].ballot:
            prepare_accepted[slot] = (ab, v)        # recover highest-ballot value
    if |prepare_promises| >= quorum():
        become_leader(t)

The recovery rule take if ab > current.ballot is the operational form of Lamport's P2c: across any majority of acceptors, the value with the highest accepted ballot for a slot is the only value that could already be decided in that slot, so the new leader must keep proposing it (or anything if no acceptor reports a prior accept).

Phase 2 — Accept / Accepted

become_leader(t):
    role = Leader
    # Re-issue Accepts under our ballot for every recovered slot.
    for slot in sorted(prepare_accepted.keys):
        if slot in learned: continue
        value = prepare_accepted[slot].value
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
    next_slot = 1 + max(any seen slot in accepts ∪ learned, or -1)
    last_heartbeat_sent = t
    broadcast(Heartbeat { ballot: my_ballot })
    drain_pending(out)

drain_pending():
    while pending is non-empty:
        value = pending.pop_front()
        slot = next_slot; next_slot += 1
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
        try_decide(slot)                            # n=1 cluster

on Accept(b, slot, v) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        accepts[slot] = (b, v)
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)
        reset_election_deadline(t)
        send Accepted(b, slot, ok=true, self.id) → b.proposer_id
    else:
        send Accepted(b, slot, ok=false, self.id) → b.proposer_id

on Accepted(b, slot, ok, from) at leader:
    if role != Leader or b != my_ballot: drop
    if not ok: step_down(t); return
    accept_count[slot].insert(from)
    try_decide(slot)

try_decide(slot):
    if role != Leader or slot in learned: return
    if |accept_count[slot]| >= quorum():
        v = accepts[slot].value
        learned[slot] = v
        broadcast(Decided { slot, value: v })

on Decided(slot, v) at any node:
    learned[slot] = v
    reset_election_deadline(t)

on Heartbeat(b) at node:
    if b >= my_ballot and role in {Candidate, Leader} and b.proposer_id != self.id:
        step_down(t)
    if b >= promised_ballot or (promised_ballot != ZERO and b == promised_ballot):
        reset_election_deadline(t)

Simulator loop (per tick t in 0..rounds)

1. enqueue scheduled proposals  — schedule[i] = (i+1) * rounds / (K+1)
2. drain cluster-pending values into the current leader (if any)
3. pop every in-flight msg with delivery_time <= t and dispatch handle()
4. tick all nodes in ascending id; on_tick may fire election or heartbeat

The leader-pick rule for proposal injection is "lowest-id node with role == Leader". During leadership churn there may be no leader (in which case the value waits in cluster_pending) or even two stale leaders (in which case the lowest id wins). The deterministic choice is what keeps the byte hash stable.

Wire format (Rpc)

Six variants; tagged-union shape in Go, Rust enum and C++ std::variant- backed types. All fields fixed-width, little-endian:

Prepare    { ballot: (round: u32, proposer_id: u32) }
Promise    { ballot, accept_ok: bool, acceptor_id: u32,
             accepted: [(slot: u64, accepted_ballot, value: Vec<u8>)] }
Accept     { ballot, slot: u64, value: Vec<u8> }
Accepted   { ballot, slot: u64, accept_ok: bool, acceptor_id: u32 }
Decided    { slot: u64, value: Vec<u8> }
Heartbeat  { ballot }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. The only thing that is serialized is the canonical dump, and that is what gets hashed.

Canonical dump format

file := magic[8 = "DSEPAX01"] u32_le(node_count) node*

node := u32_le id
        u32_le promised_ballot.round
        u32_le promised_ballot.proposer_id
        u8     role                       # Follower=0, Candidate=1, Leader=2
        u32_le my_ballot.round
        u32_le my_ballot.proposer_id
        u32_le accept_count
        accept * accept_count
        u32_le learned_count
        learned * learned_count

accept  := u64_le slot
           u32_le accepted_ballot.round
           u32_le accepted_ballot.proposer_id
           u32_le value_len
           u8 value[value_len]

learned := u64_le slot
           u32_le value_len
           u8 value[value_len]

Nodes appear in ascending id order; inside each node, both accepts and learned are emitted in ascending slot order. All multi-byte integers are little-endian. The dump is hashed with SHA-256 and the lowercase hex digest is what paxosctl prints to stdout (no trailing newline).

Cross-language invariants

InvariantWhy it matters
splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EBidentical PRNG output across languages
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150identical election firing times
delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3identical message scheduling
heap order (delivery_time, sender, seq); seq global monotonicidentical delivery sequence
peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++)identical broadcast order
acceptor's Promise lists prior accepts in ascending slot orderidentical Promise payload bytes
candidate's Phase-1 recovery rule: keep (ab, v) with strictly greater abidentical recovered value per slot
next_slot = 1 + max(seen accept slot ∪ seen learned slot) after winning Phase 1identical first fresh slot
try_decide quorum check uses ≥ n/2 + 1 (strict majority, leader counted)identical decide tick
leader-pick for proposal injection: lowest-id Leaderidentical client routing
proposal schedule: schedule[i] = (i+1) * rounds / (K+1) integer divisionidentical pending queue contents
Role enum order Follower=0, Candidate=1, Leader=2identical dump bytes
dump emits accepts and learned in ascending slot order; nodes in ascending id orderidentical dump bytes

Drift in any one of these and scripts/cross_test.sh fails. The companion cmp -l workflow in docs/observation.md walks you from "the hashes differ" to "this exact byte differs" in three commands.

Multi-Paxos vs. Raft (the comparison the labs exist to make)

DimensionRaft (db-17)Multi-Paxos (db-18)
ordering primitivecurrent_term: u64 (single integer, persisted, monotonic)Ballot { round, proposer_id } lex pair
leader conceptfirst-class; exactly one leader per termemergent; "leader" = whoever last won Phase 1
concurrent proposersforbidden by election safetyallowed (and routine during churn)
consistency checkprev_log_index / prev_log_term per AppendEntriesper-slot accepted_ballot carried in Promise
Phase-1 cost amortizationnone needed (single leader)Multi-Paxos (one Prepare covers all future slots)
safety fromlog matching + election restriction + commit-only-current-termquorum intersection + Promise reports prior accepts
understandabilitydesigned for clarity (Ongaro 2014)famously subtle (P2c, dueling proposers)

The lab implementations make these dimensions concrete: scenario A in db-17 takes ~166 ticks to commit a proposal (election + AE round trip); the equivalent scenario A here takes ~150 ticks for Phase 1 plus ~3 ticks per Accept, then the leader runs at Phase-2-only cost until somebody bumps it.


Files

  • src/rust/paxos18 crate + paxosctl binary.
  • src/go/ — module github.com/10xdev/dse/db18 + cmd/paxosctl.
  • src/cpp/db18_lib static library + paxosctl binary + test_db18.
  • scripts/verify.sh — builds + runs the unit tests for all three.
  • scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.