db-17 — Broader Ideas
Where Raft and the choices in this lab show up in real systems, and what to build on top of the same simulator harness in the rest of the distributed track.
Immediate next labs
-
db-18 — Paxos. Same heap-and-tick harness, different RPC structure (Prepare / Promise / Accept / Accepted), no fixed leader. Paxos's invariants are notoriously hard to reason about by hand; byte-deterministic replay of a counterexample seed is the difference between "I think it's correct" and "I have evidence". Raft was literally designed as the more understandable alternative — implementing both in this order is the recommended path.
-
db-19 — ZAB. ZooKeeper's atomic broadcast protocol. Similar leader-based skeleton to Raft, but the recovery phase is more involved (NEWLEADER / NEWEPOCH / SYNC / BROADCAST). The Lamport scalar of db-16 generalizes to the
(epoch, counter)pair that ZAB calls a "zxid". -
db-20 — Distributed KV. Wrap a quorum-replicated key-value store around a chosen consensus engine. The state machine is the only thing that changes — the consensus log feeds
Put(k, v)/Delete(k)commands that get applied incommit_indexorder. -
db-23 — Capstone. Adds snapshots, joint-consensus membership changes, and a multi-Raft "shards across regions" deployment on top of this code.
How this lab's pieces map to real systems
-
The Raft skeleton implemented here is exactly what etcd, Consul, TiKV, CockroachDB, MongoDB metadata, OpenStack Nova cells, and the control plane of Vault all run. They each add the extensions deferred from this lab (pre-vote, snapshots, learners, joint consensus), but the core RequestVote / AppendEntries loop is unchanged.
-
The
(delivery_time, sender, seq)heap tie-break is the same trick FoundationDB's simulator uses to drive every commit-proxy /storage-server interaction; TigerBeetle, Antithesis, and Hermitian all reach for it independently. -
The "leader picks max-term, min-id" convention surfaces as the split-brain resolution rule in production systems: when a partition heals and you see two leaders, the one with the higher term wins unconditionally (id break is academic — different terms imply different elections).
-
The
voted_for = Noneencoded as-1is the convention every Raft implementation in production uses on disk. Some encode as optional / nullable types in a richer wire format (protobuf hasoptional), but in any fixed-width binary log the sentinel value is the right answer.
Performance experiments worth running later
-
Crank
--roundsto 100k and watch the binary size grow. The dump is linear in committed entries; if you ever see super-linear growth something is appending entries that don't get committed (a sign of partition oscillation). -
Replace splitmix64 with a per-node
rand::ChaCha20. The simulator will still be deterministic (RNGs are seeded), but cross-language byte equivalence will break unless you also port the ChaCha core identically. Useful exercise in what exactly portability requires. -
Try injecting one heavy proposal vs. many small proposals into a 3-node cluster and measure the cluster dump size vs. the bytes actually committed. The difference is the steady-state replication overhead.
-
Vary the election timeout. The 150 + jitter(0..150) ticks chosen here keeps churn low; halve it and you'll see term numbers climb rapidly under any partition, especially scenario F.
What "production-quality" would require beyond this lab
-
A real disk-backed persistence layer with fsync semantics and crash recovery. The canonical dump pretends
current_term,voted_for, andlogare durable on every change; a real Raft mustfsyncbefore sending any reply that depends on the new state, or risk violating election safety on a power cut. -
Network I/O. The simulator hands typed structs across an in-process heap; production uses gRPC or a custom framing protocol with at- least-once delivery and connection-level back-pressure.
-
Pre-vote and leader leases. Without them, a partitioned candidate bumps its term repeatedly; on heal the legitimate leader steps down unnecessarily. Easy to add as a wrapper on RequestVote; deferred here because it would obscure the core algorithm.
-
Snapshots and log compaction. Without them, the log grows forever and a slow follower can't catch up over the wire. The canonical dump tolerates this only because the lab's
roundsis bounded. -
Membership changes. The fixed
nodescount atCluster::newtime is fine for a lab but useless in production. Joint consensus or the safer learner-then-promote protocol are major additions; covered in db-23. -
Observability. A real Raft cluster exposes per-node
term,commit_index,match_index[*],leader_id, election counts, and message rates as metrics. The canonical dump is a post-mortem view; runtime observability is a separate problem.