db-17 — Analysis

Required invariants

  1. Election safety. At most one leader per term. Enforced by majority voting: a candidate only becomes leader after collecting votes from a strict majority, and each voter only grants one vote per term (the voted_for field, persisted in the canonical dump).

  2. Leader append-only. A leader never overwrites or deletes entries from its own log; it only appends. Followers may truncate on an AppendEntries consistency mismatch, but the leader's local log only grows.

  3. Log matching property. If two logs contain an entry with the same (index, term), then the logs are identical in all entries up through that index. Enforced by the prev_log_index / prev_log_term check in AppendEntries and the truncate-on-conflict rule.

  4. Leader completeness. If an entry is committed in term T, that entry is present in the log of every leader for all later terms. Enforced by the election restriction (a vote is only granted if the candidate's log is at least as up-to-date as the voter's).

  5. State machine safety. If a node has applied an entry at index i, no other node will ever apply a different entry at i. This follows from log matching + leader completeness + the commit-only-current-term rule.

  6. Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

  • propose() calls advance_commit() at the end. The non-obvious one. In a single-node cluster the "leader" has no peers, so no AppendEntriesReply will ever arrive to drive advance_commit(). But a single-node cluster is its own majority, so the entry should commit the moment it is appended. Without this call, scenario D (--nodes 1) ends with commit_index = 0 despite five proposals in the log. Calling advance_commit() is harmless for n > 1 (the loop's majority check rejects until replies actually arrive).

  • Sorted iteration on every wire-affecting loop. Rust uses BTreeMap<u32, u64> for next_index / match_index; C++ uses std::map; Go uses explicit for p := uint32(0); p < n; p++ loops. HashMap would compile and pass single-language tests but fail cross_test.sh immediately. db-16's analysis.md called this out; db-17 enforces it across more code surface.

  • In-flight heap ordered by (delivery_time, sender, seq). seq is a global monotonic counter incremented every time a message is enqueued. It exists only to break ties when two messages with the same (delivery_time, sender) would otherwise be ambiguously ordered. Without seq you would see byte diffs on dense traffic at the same delivery tick.

  • Leader-pick for proposal injection is (max term, min id) among role == Leader nodes. During leadership churn there may be no leader, or there may be multiple stale leaders that have not yet stepped down. The (max term, min id) rule produces a deterministic routing decision no matter which language's iteration order you start from.

  • Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). This places K proposals evenly through the rounds window, independent of cluster behaviour. A schedule derived from cluster state ("propose whenever there's a leader") would couple proposal timing to incidental scheduling choices and produce noisy byte diffs.

  • Splitmix64 constants are explicit. 0x9E3779B97F4A7C15 (γ / golden-ratio fractional, the seeder constant), 0xBF58476D1CE4E7B5 and 0x94D049BB133111EB (Vigna's two mixer constants). All three implementations copy them as literals; nobody computes them.

  • Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls. Downstream labs (db-18 Paxos, db-20 distributed-kv) will link the library, not shell out.

Tradeoffs worth flagging

  • No snapshots, no log compaction. Logs grow without bound across the run. For --rounds 2000 --proposals 20 you end up with ~20 entries per node; the canonical dump stays small. For production Raft you would add a SnapshotState RPC and a last_included_index / last_included_term; deferred to db-21 (storage-engine-advanced).

  • No pre-vote, no leader lease. A network-partitioned candidate will repeatedly increment its term, then on heal will force the legitimate leader to step down. Mitigated by tight election timeouts in this simulator but a real cluster needs the pre-vote optimization (Ongaro thesis §9.6).

  • No membership changes. The node count is fixed at Cluster::new time. Joint consensus (and the safer learner-then-promote alternative) is a major chapter on its own; deferred to db-23 capstone.

  • Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real Raft must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.

  • No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.

Why three languages

Same reasoning as db-16, plus one new lesson: Raft has many places where "iterate over peers" appears. Each one is a chance for a byte diff. C++'s std::map and Rust's BTreeMap are ordered by default; Go's map is explicitly randomized at iteration time. The Go implementation has explicit for p := uint32(0); p < n; p++ loops everywhere a peer iteration appears. Discovering this discipline by forcing the cross-language test to pass is more durable than reading "don't use HashMap" in a style guide.