db-17 — Raft
This lab implements Raft consensus in Rust, Go, and C++, all three
producing a byte-identical sha256 of a canonical cluster dump for any
(seed, nodes, rounds, proposals, partition) configuration. It builds
directly on the deterministic-simulator discipline from db-16: same
splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break,
same "sorted iteration on the wire" rule.
If db-16 taught you to keep an event log bit-stable across three languages, db-17 teaches you to keep an entire replicated state machine's persistent state bit-stable across three languages and across network partitions. Every later distributed lab (db-18 Paxos, db-19 ZAB, db-20 distributed-kv) is a variation on this skeleton.
What is it?
Raft (Ongaro & Ousterhout, USENIX ATC 2014) is a consensus algorithm that keeps an ordered, append-only replicated log consistent across a cluster of nodes despite crashes, message reorderings, and arbitrary partitions of the network. It is the consensus core inside etcd, Consul, TiKV, CockroachDB, MongoDB's metadata, and many more.
Raft decomposes consensus into three sub-problems:
-
Leader election. Each node is one of
{Follower, Candidate, Leader}. Followers run an election timeout; on timeout a follower becomes a candidate, bumps itscurrent_term, votes for itself, and broadcastsRequestVote. A candidate that receives a majority ofvote_granted=truereplies in the same term becomes leader. -
Log replication. The leader accepts client proposals and appends them to its log. It broadcasts
AppendEntriesRPCs carrying the new entries plus aprev_log_index / prev_log_termconsistency check. On a mismatch the follower rejects; the leader decrementsnext_index[follower]and retries. Once a majority'smatch_indexcovers entryNandlog[N].term == current_term, the leader advancescommit_indextoN. -
Safety. Election restriction (a candidate only earns a vote if its log is at least as up-to-date as the voter's), the commit-only-current-term rule, and the log-matching property (identical
(index, term)⇒ identical entries) together imply state machine safety: once an entry at indexiis applied at one node, no other node will ever apply a different entry ati.
This lab implements the algorithm as it appears in Figure 2 of the
paper, minus snapshots and minus membership changes. The simulator
drives sim time forward in integer ticks; messages are scheduled into a
heap with a deterministic (delivery_time, sender, seq) order; an
optional partition set drops messages in one direction between named
pairs.
Why does it matter?
-
Raft is the production consensus algorithm of the 2010s. Knowing exactly how
prev_log_indexworks, why commit advance is gated onlog[N].term == current_term, and why the election restriction exists is the difference between operating etcd and understanding etcd. -
Three byte-identical implementations forces the spec to be unambiguous. Anywhere Raft "depends on the implementation" — RPC scheduling, election timer jitter, tiebreak for "which leader gets a proposal", iteration order of peer ids — has to be pinned down. The cross-language sha256 makes drift loud.
-
Reproducible partitions. With a deterministic
--partition s,d,...flag and a seeded simulator, you can replay the exact sequence of message drops, leadership changes, and committed entries that triggered a bug, on any machine, in any of the three languages. -
Foundation for the rest of the track. db-18 Paxos and db-19 ZAB will reuse the simulator harness; db-20 distributed-kv will plug a consensus engine into a real key-value store.
How does it work?
State (per node)
persistent : current_term : u64
voted_for : Option<u32> # None == -1 on the wire
log : Vec<LogEntry> # 1-indexed in Figure 2; 0-indexed here
volatile : role : Follower | Candidate | Leader
commit_index : u64 # highest log index known committed
last_applied : u64 # we apply lazily; rarely diverges from commit_index
leader-only : next_index : Map<peer_id, u64> # index of next entry to send to each peer
match_index : Map<peer_id, u64> # highest entry known replicated on each peer
timers : election_deadline : u64 # sim-time tick
heartbeat_due : u64 # next time leader must send AE
Election timer
reset_election_timer(t):
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
A 150-tick base plus 150 ticks of seeded jitter avoids the classic split-vote loop. Heartbeats fire every 50 ticks.
RequestVote handling
on RequestVote(term, candidate, last_log_index, last_log_term):
if term > current_term: # newer term seen
become_follower(term)
grant = (term == current_term)
&& (voted_for is None or voted_for == candidate)
&& candidate_log_is_at_least_as_up_to_date()
if grant:
voted_for = candidate
reset_election_timer()
reply RequestVoteReply(current_term, grant)
Up-to-date is defined as: last_log_term > my_last_term, or
(last_log_term == my_last_term && last_log_index >= my_last_index).
AppendEntries handling
on AppendEntries(term, leader, prev_idx, prev_term, entries, leader_commit):
if term > current_term: become_follower(term)
if term < current_term: reply (current_term, false); return
reset_election_timer()
if prev_idx > 0 && (log too short OR log[prev_idx-1].term != prev_term):
reply (current_term, false); return # consistency mismatch
# truncate any conflicting suffix, then append
for (i, e) in enumerate(entries):
idx = prev_idx + i
if idx < log.len() && log[idx].term != e.term:
log.truncate(idx)
if idx >= log.len():
log.push(e)
if leader_commit > commit_index:
commit_index = min(leader_commit, log.len())
reply (current_term, true, match_index = prev_idx + len(entries))
Commit advance (leader only)
advance_commit():
for N in (log.len() ..= commit_index + 1).rev():
if log[N-1].term != current_term: continue # Figure 8 safety
replicated = 1 + count(p : match_index[p] >= N)
if 2 * replicated > nodes:
commit_index = N; break
Propose (leader only)
propose(cmd):
log.push(LogEntry{ term: current_term, command: cmd })
match_index[self] = log.len()
broadcast_append_entries()
advance_commit() # NB: required for n == 1, harmless for n > 1
The advance_commit() call inside propose is the one non-obvious
detail. In a single-node cluster the leader has no peers, so no
AppendEntriesReply will ever arrive to trigger a commit — but a
majority is already satisfied (the leader alone is the majority). All
three implementations call advance_commit() at the end of propose
for byte-identical behaviour.
Simulator loop (per tick t in 0..rounds)
1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader : pick (max term, min id) among Leaders; call propose
3. deliver in-flight : pop heap entries with delivery_time == t
4. tick all nodes : iterate in ascending id; on_tick may fire election or heartbeat
Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Deterministic, evenly spread, and independent
of cluster behaviour.
Wire format (Rpc)
Four variants; all field widths fixed; little-endian:
RequestVote { term: u64, candidate: u32, last_log_index: u64, last_log_term: u64 }
RequestVoteReply { term: u64, granted: bool (u8) }
AppendEntries { term: u64, leader: u32, prev_idx: u64, prev_term: u64,
entries: [LogEntry], leader_commit: u64 }
AppendEntriesReply{ term: u64, success: bool (u8), match_index: u64 }
The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. Only the canonical dump is serialized, and that is what gets hashed.
Canonical dump format
file := magic[8 = "DSERAFT1"] u32_le(node_count) node*
node := u32_le id
u64_le current_term
i64_le voted_for # -1 if None (two's complement little-endian)
u8 role # Follower=0, Candidate=1, Leader=2
u64_le commit_index
u32_le log_len
entry * log_len
entry := u64_le term
u32_le cmd_len
u8 cmd[cmd_len]
Nodes appear in ascending id order. All multi-byte numbers are
little-endian. The dump is hashed with SHA-256; the lowercase hex
digest is what raftctl prints (no trailing newline).
Cross-language invariants
| Invariant | Why it matters |
|---|---|
splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output |
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++) | identical broadcast order |
leader-pick for proposal injection: (max term, min id) | identical client routing |
proposal schedule: (i+1) * rounds / (K+1) integer division | identical pending queue contents |
propose() calls advance_commit() | identical commit_index for n=1 |
voted_for = None encoded as i64 LE -1 | identical dump bytes |
Role enum order Follower=0, Candidate=1, Leader=2 | identical dump bytes |
If any one of these drifts, scripts/cross_test.sh will fail and cmp -l on the two raw dumps will print the byte offset of the first
divergence.
Files
src/rust/—raft17crate +raftctlbinary.src/go/— modulegithub.com/10xdev/dse/db17+cmd/raftctl.src/cpp/—db17_libstatic library +raftctlbinary +test_db17.scripts/verify.sh— runs the unit tests for all three.scripts/cross_test.sh— proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.
See docs/ for the long-form write-up and steps/ for the staged
implementation path.