db-18 — Broader Ideas

The lab implements textbook Multi-Paxos with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

Fast Paxos (Lamport 2006)

Skips Phase 2's "leader replays" step on the happy path by letting any proposer broadcast Accept directly. The cost: the fast-path quorum must be ⌈3n/4⌉ instead of ⌊n/2⌋ + 1, so 4-of-5 instead of 3-of-5. When two proposers collide on the fast path the system falls back to classic Paxos. Worth implementing as db-18b once the classic version is fluent — it reuses the entire wire format and only changes the proposer-side state machine.

EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013)

Drops the leader entirely. Each command picks its own dependency graph among recently-issued commands and decides in one RTT if no conflict, two RTTs otherwise. The "deterministic simulator + three implementations" discipline you build here is what makes EPaxos's notoriously subtle conflict-detection logic testable at all. Used in production at Facebook (Bunshin) and as the backbone of some geo-distributed configuration stores.

Generalized Paxos (Lamport, MSR-TR-2005-33)

Allows commutative commands to be partially ordered concurrently, not totally ordered serially. The state-machine layer must explicitly declare command commutativity. Precursor to EPaxos. Operationally similar to CRDTs at the storage layer (db-21) but with hard consensus underneath.

Vertical Paxos (Lamport, Malkhi, Zhou, PODC 2009)

Separates the "agree on the value at slot S" problem from the "agree on the membership of the acceptor set at slot S" problem, by delegating reconfiguration to an auxiliary master. Cleaner than joint-consensus (Raft's approach) and Lamport's preferred way to do membership changes. db-23 will revisit.

Flexible Paxos (Howard 2016, dissertation 2019)

Observation: the two quorums in Paxos don't have to be majorities. They only have to intersect. So Phase-1 quorum + Phase-2 quorum just have to sum to more than n. Production payoff: you can run with a smaller Phase-2 quorum (lower latency on the common path) in exchange for a larger Phase-1 quorum (higher cost during leadership churn). A great teaching variant to layer on top of this lab once the canonical hashes are stable.

Production systems to study

Google Chubby

Five-replica Paxos lock service powering Google's lookup infrastructure (DNS, leader election for other services). Chandra et al.'s Paxos Made Live (PODC 2007) is the canonical writeup of what it took to turn the algorithm into a system: leader leases, snapshots every few minutes, master-side group membership, three generations of disk-corruption handling. Read alongside this lab once green.

Google Spanner

Multi-Paxos per shard. Spanner's contribution above Multi-Paxos is TrueTime — a clock API with bounded uncertainty that lets the system serve external-consistency-preserving reads without a Paxos round. The Paxos layer itself is exactly the algorithm you've implemented, plus production hardening.

Apache Cassandra LWT

Lightweight Transactions use Multi-Paxos to give linearizable CAS-style updates on top of Cassandra's eventually-consistent replication. Cassandra picks a fresh ballot per request, so it pays the Phase-1 cost every time and never amortizes — a clean illustration of the Multi-Paxos tradeoff in reverse.

Phase-1 amortization sweep. For nodes ∈ {3,5,7,9}, run proposals = 50 and count the number of ticks to decide the last slot. The expected curve is linear in nodes for the first decision (Phase 1 costs a broadcast round-trip per acceptor) and constant per slot thereafter (Phase 2 RTT).
Election-timer jitter sensitivity. Vary ELECTION_TIMEOUT_SPAN and measure how often dueling proposers ping-pong before someone wins. The textbook answer is "wider jitter = fewer collisions = fewer ballot bumps", and the simulator lets you confirm it without networking.
Quorum recomputation latency. For Flexible Paxos configurations, plot Phase-2 latency against Phase-1 quorum size. Howard 2016 has the analytical curve; you can ground-truth it.
Comparison to Raft (db-17). Same flags, same scenarios, same measurement. The lab structure is identical on purpose.

What "production-quality" would require beyond this lab

Disk durability. A real acceptor fsyncs promised_ballot, accepts, and (depending on design) learned before replying to a Promise / Accepted. Without that, a crash-restart cycle can silently retract a promise and break safety.
Snapshotting. accepts and learned grow forever in this lab. A real system periodically snapshots the state machine and garbage-collects acks below the snapshot index. The snapshot itself must be agreed on by Paxos (or by a separate snapshot coordinator), which is a whole-other lab.
Membership reconfiguration. Adding/removing acceptors safely is non-trivial: you must either run two configurations in parallel during the transition (Raft's joint consensus) or delegate the membership decision to a higher layer (Vertical Paxos). db-23 picks this up.
Leader leases. Production Paxos systems give the current leader a time-bounded lease to serve reads without consulting acceptors. This requires a synchronized clock model (Spanner's TrueTime, or weaker lease-renewal protocols) — orthogonal to consensus per se but tightly coupled in real deployments.
Witness / arbiter nodes. Some deployments allow a third node to hold no data but break tie-vote symmetry. Implementing this while keeping safety proofs sound requires care.
Recovery from disk corruption. Real-world failure modes include silent bit-rot of promised_ballot. The defensive posture is to checksum every persisted record and treat a checksum failure as "I've never voted for anything" — a strict safety superset of treating it as "I voted for a high ballot", but at the cost of liveness during recovery.
Observability. Live systems need per-slot decision latency histograms, per-acceptor promise rejection counters, leader flap detection. The canonical dump is the right shape of observability but the real one runs continuously rather than on-demand.

Distributed Systems Engineer — Build Databases & Consensus From Scratch