db-17 — Analysis
Required invariants
-
Election safety. At most one leader per term. Enforced by majority voting: a candidate only becomes leader after collecting votes from a strict majority, and each voter only grants one vote per term (the
voted_forfield, persisted in the canonical dump). -
Leader append-only. A leader never overwrites or deletes entries from its own log; it only appends. Followers may truncate on an AppendEntries consistency mismatch, but the leader's local log only grows.
-
Log matching property. If two logs contain an entry with the same
(index, term), then the logs are identical in all entries up through that index. Enforced by theprev_log_index / prev_log_termcheck in AppendEntries and the truncate-on-conflict rule. -
Leader completeness. If an entry is committed in term
T, that entry is present in the log of every leader for all later terms. Enforced by the election restriction (a vote is only granted if the candidate's log is at least as up-to-date as the voter's). -
State machine safety. If a node has applied an entry at index
i, no other node will ever apply a different entry ati. This follows from log matching + leader completeness + the commit-only-current-term rule. -
Byte determinism. For every
(seed, nodes, rounds, proposals, partition)tuple, the three binaries produce identicalcanonical_dumpbytes — hence identical sha256 hex on stdout.scripts/cross_test.shchecks six scenarios.
Design decisions
-
propose()callsadvance_commit()at the end. The non-obvious one. In a single-node cluster the "leader" has no peers, so noAppendEntriesReplywill ever arrive to driveadvance_commit(). But a single-node cluster is its own majority, so the entry should commit the moment it is appended. Without this call, scenario D (--nodes 1) ends withcommit_index = 0despite five proposals in the log. Callingadvance_commit()is harmless forn > 1(the loop's majority check rejects until replies actually arrive). -
Sorted iteration on every wire-affecting loop. Rust uses
BTreeMap<u32, u64>fornext_index/match_index; C++ usesstd::map; Go uses explicitfor p := uint32(0); p < n; p++loops. HashMap would compile and pass single-language tests but failcross_test.shimmediately. db-16's analysis.md called this out; db-17 enforces it across more code surface. -
In-flight heap ordered by
(delivery_time, sender, seq).seqis a global monotonic counter incremented every time a message is enqueued. It exists only to break ties when two messages with the same(delivery_time, sender)would otherwise be ambiguously ordered. Withoutseqyou would see byte diffs on dense traffic at the same delivery tick. -
Leader-pick for proposal injection is
(max term, min id)amongrole == Leadernodes. During leadership churn there may be no leader, or there may be multiple stale leaders that have not yet stepped down. The(max term, min id)rule produces a deterministic routing decision no matter which language's iteration order you start from. -
Proposal schedule is closed-form.
schedule[i] = (i+1) * rounds / (K+1)(integer division). This places K proposals evenly through theroundswindow, independent of cluster behaviour. A schedule derived from cluster state ("propose whenever there's a leader") would couple proposal timing to incidental scheduling choices and produce noisy byte diffs. -
Splitmix64 constants are explicit.
0x9E3779B97F4A7C15(γ / golden-ratio fractional, the seeder constant),0xBF58476D1CE4E7B5and0x94D049BB133111EB(Vigna's two mixer constants). All three implementations copy them as literals; nobody computes them. -
Library + thin CLI. The lab exposes
Cluster::new,run,canonical_dump, andsha256as a library. The CLI is a few dozen lines of arg parsing plus four function calls. Downstream labs (db-18 Paxos, db-20 distributed-kv) will link the library, not shell out.
Tradeoffs worth flagging
-
No snapshots, no log compaction. Logs grow without bound across the run. For
--rounds 2000 --proposals 20you end up with ~20 entries per node; the canonical dump stays small. For production Raft you would add aSnapshotStateRPC and alast_included_index / last_included_term; deferred to db-21 (storage-engine-advanced). -
No pre-vote, no leader lease. A network-partitioned candidate will repeatedly increment its term, then on heal will force the legitimate leader to step down. Mitigated by tight election timeouts in this simulator but a real cluster needs the pre-vote optimization (Ongaro thesis §9.6).
-
No membership changes. The node count is fixed at
Cluster::newtime. Joint consensus (and the safer learner-then-promote alternative) is a major chapter on its own; deferred to db-23 capstone. -
Crash semantics are stylized. Crashes are simulated only via the
partitionflag (drop all messages in one direction). A real Raft must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction. -
No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's
pendingqueue is drained unconditionally; we are testing the consensus core, not the client RPC layer.
Why three languages
Same reasoning as db-16, plus one new lesson: Raft has many places
where "iterate over peers" appears. Each one is a chance for a byte
diff. C++'s std::map and Rust's BTreeMap are ordered by default;
Go's map is explicitly randomized at iteration time. The Go
implementation has explicit for p := uint32(0); p < n; p++ loops
everywhere a peer iteration appears. Discovering this discipline by
forcing the cross-language test to pass is more durable than reading
"don't use HashMap" in a style guide.