db-17 — Raft

This lab implements Raft consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It builds directly on the deterministic-simulator discipline from db-16: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule.

If db-16 taught you to keep an event log bit-stable across three languages, db-17 teaches you to keep an entire replicated state machine's persistent state bit-stable across three languages and across network partitions. Every later distributed lab (db-18 Paxos, db-19 ZAB, db-20 distributed-kv) is a variation on this skeleton.


What is it?

Raft (Ongaro & Ousterhout, USENIX ATC 2014) is a consensus algorithm that keeps an ordered, append-only replicated log consistent across a cluster of nodes despite crashes, message reorderings, and arbitrary partitions of the network. It is the consensus core inside etcd, Consul, TiKV, CockroachDB, MongoDB's metadata, and many more.

Raft decomposes consensus into three sub-problems:

  1. Leader election. Each node is one of {Follower, Candidate, Leader}. Followers run an election timeout; on timeout a follower becomes a candidate, bumps its current_term, votes for itself, and broadcasts RequestVote. A candidate that receives a majority of vote_granted=true replies in the same term becomes leader.

  2. Log replication. The leader accepts client proposals and appends them to its log. It broadcasts AppendEntries RPCs carrying the new entries plus a prev_log_index / prev_log_term consistency check. On a mismatch the follower rejects; the leader decrements next_index[follower] and retries. Once a majority's match_index covers entry N and log[N].term == current_term, the leader advances commit_index to N.

  3. Safety. Election restriction (a candidate only earns a vote if its log is at least as up-to-date as the voter's), the commit-only-current-term rule, and the log-matching property (identical (index, term) ⇒ identical entries) together imply state machine safety: once an entry at index i is applied at one node, no other node will ever apply a different entry at i.

This lab implements the algorithm as it appears in Figure 2 of the paper, minus snapshots and minus membership changes. The simulator drives sim time forward in integer ticks; messages are scheduled into a heap with a deterministic (delivery_time, sender, seq) order; an optional partition set drops messages in one direction between named pairs.


Why does it matter?

  • Raft is the production consensus algorithm of the 2010s. Knowing exactly how prev_log_index works, why commit advance is gated on log[N].term == current_term, and why the election restriction exists is the difference between operating etcd and understanding etcd.

  • Three byte-identical implementations forces the spec to be unambiguous. Anywhere Raft "depends on the implementation" — RPC scheduling, election timer jitter, tiebreak for "which leader gets a proposal", iteration order of peer ids — has to be pinned down. The cross-language sha256 makes drift loud.

  • Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leadership changes, and committed entries that triggered a bug, on any machine, in any of the three languages.

  • Foundation for the rest of the track. db-18 Paxos and db-19 ZAB will reuse the simulator harness; db-20 distributed-kv will plug a consensus engine into a real key-value store.


How does it work?

State (per node)

persistent  : current_term : u64
              voted_for    : Option<u32>          # None == -1 on the wire
              log          : Vec<LogEntry>        # 1-indexed in Figure 2; 0-indexed here
volatile    : role         : Follower | Candidate | Leader
              commit_index : u64                  # highest log index known committed
              last_applied : u64                  # we apply lazily; rarely diverges from commit_index
leader-only : next_index   : Map<peer_id, u64>    # index of next entry to send to each peer
              match_index  : Map<peer_id, u64>    # highest entry known replicated on each peer
timers      : election_deadline : u64             # sim-time tick
              heartbeat_due     : u64             # next time leader must send AE

Election timer

reset_election_timer(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

A 150-tick base plus 150 ticks of seeded jitter avoids the classic split-vote loop. Heartbeats fire every 50 ticks.

RequestVote handling

on RequestVote(term, candidate, last_log_index, last_log_term):
    if term > current_term:                # newer term seen
        become_follower(term)
    grant = (term == current_term)
         && (voted_for is None or voted_for == candidate)
         && candidate_log_is_at_least_as_up_to_date()
    if grant:
        voted_for = candidate
        reset_election_timer()
    reply RequestVoteReply(current_term, grant)

Up-to-date is defined as: last_log_term > my_last_term, or (last_log_term == my_last_term && last_log_index >= my_last_index).

AppendEntries handling

on AppendEntries(term, leader, prev_idx, prev_term, entries, leader_commit):
    if term > current_term: become_follower(term)
    if term < current_term: reply (current_term, false); return
    reset_election_timer()
    if prev_idx > 0 && (log too short OR log[prev_idx-1].term != prev_term):
        reply (current_term, false); return        # consistency mismatch
    # truncate any conflicting suffix, then append
    for (i, e) in enumerate(entries):
        idx = prev_idx + i
        if idx < log.len() && log[idx].term != e.term:
            log.truncate(idx)
        if idx >= log.len():
            log.push(e)
    if leader_commit > commit_index:
        commit_index = min(leader_commit, log.len())
    reply (current_term, true, match_index = prev_idx + len(entries))

Commit advance (leader only)

advance_commit():
    for N in (log.len() ..= commit_index + 1).rev():
        if log[N-1].term != current_term: continue   # Figure 8 safety
        replicated = 1 + count(p : match_index[p] >= N)
        if 2 * replicated > nodes:
            commit_index = N; break

Propose (leader only)

propose(cmd):
    log.push(LogEntry{ term: current_term, command: cmd })
    match_index[self] = log.len()
    broadcast_append_entries()
    advance_commit()           # NB: required for n == 1, harmless for n > 1

The advance_commit() call inside propose is the one non-obvious detail. In a single-node cluster the leader has no peers, so no AppendEntriesReply will ever arrive to trigger a commit — but a majority is already satisfied (the leader alone is the majority). All three implementations call advance_commit() at the end of propose for byte-identical behaviour.

Simulator loop (per tick t in 0..rounds)

1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader  : pick (max term, min id) among Leaders; call propose
3. deliver in-flight           : pop heap entries with delivery_time == t
4. tick all nodes              : iterate in ascending id; on_tick may fire election or heartbeat

Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Deterministic, evenly spread, and independent of cluster behaviour.

Wire format (Rpc)

Four variants; all field widths fixed; little-endian:

RequestVote       { term: u64, candidate: u32, last_log_index: u64, last_log_term: u64 }
RequestVoteReply  { term: u64, granted: bool (u8) }
AppendEntries     { term: u64, leader: u32, prev_idx: u64, prev_term: u64,
                    entries: [LogEntry], leader_commit: u64 }
AppendEntriesReply{ term: u64, success: bool (u8), match_index: u64 }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. Only the canonical dump is serialized, and that is what gets hashed.

Canonical dump format

file := magic[8 = "DSERAFT1"] u32_le(node_count) node*

node := u32_le id
        u64_le current_term
        i64_le voted_for                # -1 if None (two's complement little-endian)
        u8     role                     # Follower=0, Candidate=1, Leader=2
        u64_le commit_index
        u32_le log_len
        entry * log_len

entry := u64_le term
         u32_le cmd_len
         u8 cmd[cmd_len]

Nodes appear in ascending id order. All multi-byte numbers are little-endian. The dump is hashed with SHA-256; the lowercase hex digest is what raftctl prints (no trailing newline).

Cross-language invariants

InvariantWhy it matters
splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EBidentical PRNG output
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150identical election firing times
delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3identical message scheduling
heap order (delivery_time, sender, seq); seq global monotonicidentical delivery sequence
peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++)identical broadcast order
leader-pick for proposal injection: (max term, min id)identical client routing
proposal schedule: (i+1) * rounds / (K+1) integer divisionidentical pending queue contents
propose() calls advance_commit()identical commit_index for n=1
voted_for = None encoded as i64 LE -1identical dump bytes
Role enum order Follower=0, Candidate=1, Leader=2identical dump bytes

If any one of these drifts, scripts/cross_test.sh will fail and cmp -l on the two raw dumps will print the byte offset of the first divergence.


Files

  • src/rust/raft17 crate + raftctl binary.
  • src/go/ — module github.com/10xdev/dse/db17 + cmd/raftctl.
  • src/cpp/db17_lib static library + raftctl binary + test_db17.
  • scripts/verify.sh — runs the unit tests for all three.
  • scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.