db-18 — Broader Ideas
The lab implements textbook Multi-Paxos with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.
Variants and refinements
Fast Paxos (Lamport 2006)
Skips Phase 2's "leader replays" step on the happy path by letting
any proposer broadcast Accept directly. The cost: the fast-path
quorum must be ⌈3n/4⌉ instead of ⌊n/2⌋ + 1, so 4-of-5 instead
of 3-of-5. When two proposers collide on the fast path the system
falls back to classic Paxos. Worth implementing as db-18b once
the classic version is fluent — it reuses the entire wire format
and only changes the proposer-side state machine.
EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013)
Drops the leader entirely. Each command picks its own dependency graph among recently-issued commands and decides in one RTT if no conflict, two RTTs otherwise. The "deterministic simulator + three implementations" discipline you build here is what makes EPaxos's notoriously subtle conflict-detection logic testable at all. Used in production at Facebook (Bunshin) and as the backbone of some geo-distributed configuration stores.
Generalized Paxos (Lamport, MSR-TR-2005-33)
Allows commutative commands to be partially ordered concurrently, not totally ordered serially. The state-machine layer must explicitly declare command commutativity. Precursor to EPaxos. Operationally similar to CRDTs at the storage layer (db-21) but with hard consensus underneath.
Vertical Paxos (Lamport, Malkhi, Zhou, PODC 2009)
Separates the "agree on the value at slot S" problem from the "agree on the membership of the acceptor set at slot S" problem, by delegating reconfiguration to an auxiliary master. Cleaner than joint-consensus (Raft's approach) and Lamport's preferred way to do membership changes. db-23 will revisit.
Flexible Paxos (Howard 2016, dissertation 2019)
Observation: the two quorums in Paxos don't have to be majorities.
They only have to intersect. So Phase-1 quorum + Phase-2 quorum
just have to sum to more than n. Production payoff: you can run
with a smaller Phase-2 quorum (lower latency on the common path)
in exchange for a larger Phase-1 quorum (higher cost during
leadership churn). A great teaching variant to layer on top of
this lab once the canonical hashes are stable.
Production systems to study
Google Chubby
Five-replica Paxos lock service powering Google's lookup infrastructure (DNS, leader election for other services). Chandra et al.'s Paxos Made Live (PODC 2007) is the canonical writeup of what it took to turn the algorithm into a system: leader leases, snapshots every few minutes, master-side group membership, three generations of disk-corruption handling. Read alongside this lab once green.
Google Spanner
Multi-Paxos per shard. Spanner's contribution above Multi-Paxos is TrueTime — a clock API with bounded uncertainty that lets the system serve external-consistency-preserving reads without a Paxos round. The Paxos layer itself is exactly the algorithm you've implemented, plus production hardening.
Apache Cassandra LWT
Lightweight Transactions use Multi-Paxos to give linearizable CAS-style updates on top of Cassandra's eventually-consistent replication. Cassandra picks a fresh ballot per request, so it pays the Phase-1 cost every time and never amortizes — a clean illustration of the Multi-Paxos tradeoff in reverse.
Microsoft Azure Service Fabric
Uses a Paxos variant (Smart Actors) under the hood for ring-leader election and replicated state services. Less publicly documented; the architectural papers are paywalled behind ASE/SOSP, but worth chasing for an industrial counterpoint.
Apache ZooKeeper (ZAB)
Not strict Paxos but in the same family. ZAB layers epoch+counter
on top of a primary-order protocol; the zxid pair is the direct
analogue of this lab's Ballot. db-19 builds it.
Performance experiments worth running
The deterministic simulator is too clean for real performance work, but the simulator's ticks are a fine unit of cost for comparative experiments:
- Phase-1 amortization sweep. For
nodes ∈ {3,5,7,9}, runproposals = 50and count the number of ticks to decide the last slot. The expected curve is linear innodesfor the first decision (Phase 1 costs abroadcastround-trip per acceptor) and constant per slot thereafter (Phase 2 RTT). - Election-timer jitter sensitivity. Vary
ELECTION_TIMEOUT_SPANand measure how often dueling proposers ping-pong before someone wins. The textbook answer is "wider jitter = fewer collisions = fewer ballot bumps", and the simulator lets you confirm it without networking. - Quorum recomputation latency. For Flexible Paxos configurations, plot Phase-2 latency against Phase-1 quorum size. Howard 2016 has the analytical curve; you can ground-truth it.
- Comparison to Raft (db-17). Same flags, same scenarios, same measurement. The lab structure is identical on purpose.
What "production-quality" would require beyond this lab
- Disk durability. A real acceptor fsyncs
promised_ballot,accepts, and (depending on design)learnedbefore replying to a Promise / Accepted. Without that, a crash-restart cycle can silently retract a promise and break safety. - Snapshotting.
acceptsandlearnedgrow forever in this lab. A real system periodically snapshots the state machine and garbage-collects acks below the snapshot index. The snapshot itself must be agreed on by Paxos (or by a separate snapshot coordinator), which is a whole-other lab. - Membership reconfiguration. Adding/removing acceptors safely is non-trivial: you must either run two configurations in parallel during the transition (Raft's joint consensus) or delegate the membership decision to a higher layer (Vertical Paxos). db-23 picks this up.
- Leader leases. Production Paxos systems give the current leader a time-bounded lease to serve reads without consulting acceptors. This requires a synchronized clock model (Spanner's TrueTime, or weaker lease-renewal protocols) — orthogonal to consensus per se but tightly coupled in real deployments.
- Witness / arbiter nodes. Some deployments allow a third node to hold no data but break tie-vote symmetry. Implementing this while keeping safety proofs sound requires care.
- Recovery from disk corruption. Real-world failure modes
include silent bit-rot of
promised_ballot. The defensive posture is to checksum every persisted record and treat a checksum failure as "I've never voted for anything" — a strict safety superset of treating it as "I voted for a high ballot", but at the cost of liveness during recovery. - Observability. Live systems need per-slot decision latency histograms, per-acceptor promise rejection counters, leader flap detection. The canonical dump is the right shape of observability but the real one runs continuously rather than on-demand.