db-18 — Analysis
Required invariants
If any of these is violated, scripts/cross_test.sh will fail, and
in the worst case the algorithm itself is unsafe. They are stated in
the order it is easiest to reason about them.
-
Promise monotonicity (P1). For every acceptor and every sim-time tick
t,promised_ballot[t] >= promised_ballot[t-1]. The simulator enforces this with a single comparison on each ofPrepare,Accept: the message's ballot must be>= promised_ballotbefore any state mutation. The Promise reply'saccept_okbit is the operational form of P1b. -
Accept respects promise (P2a). No acceptor ever stores
accepts[slot] = (ab, v)withab < promised_ballot. The Accept handler short-circuits withaccept_ok=falsewhenb < promised_ballot; the leader interprets that bit and steps down instead of advancingaccept_count. -
Per-slot accept uniqueness under a ballot (P2b). For a fixed slot
sand a fixed ballotb, the valuevthat any acceptor stores under(s, b)is the same value. This holds trivially here because only the leader of ballotbever sendsAccept(b, s, v), and itsaccepts[s]is set once and never overwritten under its own ballot. -
P2c (the safety lemma that needs work). Suppose value
vis chosen at slotsunder ballotb. Then for any ballotb' > bissued by any proposer, the value field of anyAccept(b', s, v')will satisfyv' == v. The mechanism: to issue an Accept at all, the proposer must have collected promises from a quorum at ballotb'. That quorum intersects with the quorum that chosevatb. The intersecting acceptor sawvaccepted underb, so its Promise carries(s, b, v). The proposer's recovery rule (take the value whose accepted_ballot is highest) therefore takesv(or a later value chosen under someb'' > b, but inductively that value is alsov). Sov' == v. QED. The simulator implements this rule instart_election's init ofprepare_acceptedand in the Promise-handler'sif ab > prepare_accepted[s].ballotupdate. -
Decided-once / monotonic learn. Once
learned[s]is set on any node, it never changes value. Locally enforced by reading before writing; globally guaranteed by P2c. -
Byte-determinism of the dump. Two runs with the same
(seed, nodes, rounds, proposals, partition)produce identical canonical dump bytes on every language. This requires every iteration order (peers, slots, accepted-list inside Promise, heap pops on identical(time, sender, seq)) to be fixed. Drift here is whatcross_test.shcatches.
Design decisions worth highlighting
-
promised_ballotis global per node, not per-slot. This is the Multi-Paxos optimization. A per-slot promised-ballot map would be more general (closer to single-decree Paxos per slot) but would cost a Phase-1 per slot. The global ballot lets one Phase 1 cover every present and future slot. -
Phase-1 recovery walks every prior accept, not just the latest per slot. The Promise reply contains all of the acceptor's
accepts(sorted by slot). The candidate folds them intoprepare_acceptedwithtake if strictly greater accepted_ballot. Per the proof of P2c this is the only correct rule; a "latest by receive order" tie-break would lose safety the moment Promises arrived out of order. -
my_ballot.roundis bumped tomax(promised, my_ballot).round + 1when starting an election, not justpromised.round + 1. If this node previously won a higher ballot and stepped down due to a partition heal, it would otherwise re-issue its old ballot and immediately lose to its own historical promise. Themaxmakes forward progress under churn deterministic. -
Leader-pick rule: lowest-id
Leader. When the simulator must inject a client proposal, it picks the lowest-id node currently in roleLeader. There may be zero (queue the proposal incluster_pending) or, briefly, two stale Leaders (the lower id wins; the other'sAcceptwill fail at acceptors that have already promised the new ballot). Determinism > realism here. -
drain_pendingruns on every Accepted, not just every tick. In single-node mode (--nodes 1) the leader becomes its own quorum and decides slots inside the broadcast loop. Doing the drain inbecome_leaderand intry_decidemeans scenario D's hash is independent of how the simulator orders ticks. -
Heap key
(delivery_time, sender, seq). db-16's invariant. Without theseqtiebreak, Promise messages from two acceptors arriving on the same tick from the same sender (impossible by construction, but the type system doesn't know that) would be reorder-able across languages. -
Roleenum order.Follower=0, Candidate=1, Leader=2was chosen to match db-17; any change would propagate into the dump byte at offset12 + 16 = 28per node, which would silently invalidate scenario A's canonical hash.
Tradeoffs worth flagging
-
Concurrent proposers cost throughput, not safety. Two proposers in dueling Phase 1 can ping-pong each other forever in principle. The lab dodges this in two ways: (a) the deterministic simulator can't sustain a livelock because election timeouts are PRNG-jittered per node-id, and (b) once a leader is elected, the election-timer reset on Heartbeat keeps it elected. Production systems add leader leases (Chubby, Spanner) to push the worst case down further.
-
No
commit-only-current-termsubtlety. Raft has Figure 8: a newly-elected leader must commit something in its own term before it can ack older entries, otherwise they can be silently overwritten. Paxos sidesteps the problem because P2c forces a new leader to re-Accept any recoverable value under its own ballot; there is no "shadow commit" to retract. The price is the Phase-1-on-every-startup cost. -
No native log compaction. This lab's
acceptsandlearnedgrow unboundedly. A real Multi-Paxos system snapshots a state machine and discards accepts below the snapshot index (see Spanner, Chubby, db-23). Adding snapshots here would require exposing acommitted_throughwatermark in the dump. -
No membership change.
nis fixed atClusterconstruction time. Vertical Paxos (Lamport/Malkhi/Zhou 2009) is the textbook way to add this. db-23 covers it. -
Three languages is more work than two. Two languages prove the spec is unambiguous. Three rules out the case where you and your collaborator have committed the same misreading. C++'s
std::mapand Rust'sBTreeMapagreeing with Go's explicitsort.Slicewas the only thing that caught a misordered Promise payload in scenario B during development.
Why three languages
Same answer as db-17: the constraint forces the spec to be a spec and not a habit. Sorted-iteration discipline, fixed enum order, little-endian fixed-width integers, no map iteration on the wire — these are easy to get away with in any single language, and the only way to surface them is to ask "would another implementation make the same choice without being told?". For Paxos the question matters even more: the algorithm is sensitive to whether the highest-ballot prior accept is chosen during recovery, and a sort-order bug would make safety stochastic, which is the worst possible failure mode.