db-19 — Analysis
Required invariants
-
Election agreement. At most one node finishes a successful election cycle with
role == Leading && syncedper epoch. Enforced by majority voting invote_viewplus the strictly increasingpending_new_epoch = max(accepted_epoch, current_epoch) + 1rule: any competing prospective leader sees a higheraccepted_epochand steps down (viaNewEpochrejection) before it can sync. -
Primary order. If a single primary broadcasts proposals
pthenq, every follower that delivers both deliverspbeforeq. Enforced by the leader's monotonically increasingnext_counter(no gaps, no reuse within an epoch) plus the follower'stxn.zxid > last_zxid()gate onPropose(out-of-order proposals are silently dropped rather than re-ordered into the log). -
Integrity. The leader only proposes once it is
synced, and followers only append oncecurrent_epochhas been adopted viaNewLeader. Followers will not append aProposewhosezxid <= last_zxid(), so a stale leader's lateProposefor an already- superseded epoch cannot corrupt a follower's history. -
Agreement on committed prefix. If a follower has `last_committed
= z
, every other follower's history contains every txn withzxid <= z(becauseCommit(z)is only broadcast after a quorum has appended every txn up throughz, and a future leader must include any quorum's committed prefix via the DiscoveryAckEpoch(last_zxid)` reports → it adopts the surviving history). -
Total order. All followers deliver committed transactions in the same order (the leader-assigned zxid order). This follows directly from primary order + agreement on committed prefix.
-
Byte determinism. For every
(seed, nodes, rounds, proposals, partition)tuple, the three binaries produce identicalcanonical_dumpbytes — hence identical sha256 hex on stdout.scripts/cross_test.shchecks six scenarios.
Design decisions
-
propose()callstry_commit()at the end. Same single-node argument as db-17 Raft: a one-node cluster is its own majority, no Ack will ever arrive to drive the commit, so the leader must run the quorum check inline. Harmless forn > 1because the majority check rejects until acks actually arrive. -
Sorted iteration on every wire-affecting loop. Rust uses
BTreeMap/BTreeSet; C++ usesstd::map/std::set; Go uses explicitfor p := uint32(0); p < n; p++loops. The Go code also sorts before iterating wherever amap[uint32]...is read for output (the canonical dump and broadcast loops). HashMap would compile and pass single-language tests but failcross_test.shimmediately. -
LookForLeaderis structurally aVotefor the sender. The Rusthandlearm foldsLookForLeaderdirectly into theVotearm. This avoids a separateLookForLeaderReplyand gives a late-arriving Looking node the ability to tally an immediate self-vote from the source. The Go and C++ implementations do the same fold. -
Non-Looking nodes reply to a
Votewith their own current vote. An isolated Looking node sending aVoteto peers who are already Following gets back aVotepointing at the live leader; combined with the lex-update rule, the isolated node converges on the existing leader in O(1) round-trips after partition heal, rather than starting a new election. -
Votelex comparison is(last_zxid, voter_id), not(last_zxid, leader_id). The voter's id is the tie-breaker when histories are equal — this is what makes the highest-id node win a cold-start election. Usingleader_idinstead would create a fixed-point where any node can vote for any leader and the tally never makes progress. -
pending_new_epoch = max(accepted_epoch, current_epoch) + 1. Themaxcovers the case where this node has previously acknowledged aNewEpochfor a leader that then failed before reaching sync. Without themax, the new leader could pick an epoch that some follower has already rejected, leaving sync stalled forever. -
AckEpochis idempotent. A follower that has already adoptedaccepted_epoch = ereplies again on a re-sentNewEpoch(e). This keeps the discovery handshake robust against the heartbeat-driven re-send loop inon_tickwhile the leader is still gathering acks. -
NewLeaderships the whole history, not a diff. Following the ZAB paper. For a study lab this is fine; production ZooKeeper usesSNAP/DIFF/TRUNCvariants to avoid bulk transfer. Replacing this with a diff would be a substantial change to the RPC layer and is out of scope. -
Heartbeats re-broadcast the last
Commit. Once synced, the leader re-broadcastsCommit(last_committed)every 50 ticks. This doubles as the "leader is alive" signal that resets the follower's election timer. Sending a dedicatedHeartbeatRPC would be one more wire variant for no behavioural gain. -
Proposal schedule is closed-form.
schedule[i] = (i+1) * rounds / (K+1)(integer division). Same rationale as db-17: decoupling proposal timing from cluster scheduling decisions keeps the dump bytes from depending on incidental tick alignment. -
Library + thin CLI. The lab exposes
Cluster::new,run,canonical_dump, andsha256as a library. The CLI is a few dozen lines of arg parsing plus four function calls.
Tradeoffs worth flagging
-
No snapshots, no SNAP/DIFF/TRUNC. The leader sends the full history on every
NewLeader. For the boundedroundsof this lab the cost is trivial; for production ZooKeeper it would be prohibitive on large datasets. Snapshots are deferred to db-21. -
No client sessions, no znodes, no watches. ZAB exists to serve ZooKeeper, but this lab implements ZAB in isolation. The "state machine" is the history vector itself. Anything ZooKeeper-API- shaped (sessions, ephemerals, watches, ACLs) is downstream of the consensus core and lives in a different lab.
-
Crash semantics are stylized. Crashes are simulated only via the
partitionflag (drop all messages in one direction). A real ZooKeeper must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction. -
No
Observerrole. Production ZooKeeper has non-voting Observer servers that learn from the leader but do not participate in quorum. They are pure read-fanout and add no algorithmic complexity, so they were left out of the simulator. -
No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's
cluster_pendingqueue is drained unconditionally; we are testing the consensus core, not the client RPC layer. -
Follower truncation is by replacement, not by prefix-match. When a follower receives
NewLeader(e, hist), it adoptshistwholesale, even if its own history shares a prefix. This is correct (the leader's history is authoritative for the new epoch) but heavier than necessary; a real implementation would diff.
Why three languages
Same reasoning as db-16 and db-17, plus one new lesson specific to
ZAB: the algorithm has two quorum-tracking sets that are easy to
get subtly wrong (epoch_acks for discovery, leader_acks for sync,
plus the per-zxid ack_set for broadcast). Each set must be iterated
in a stable order for the dump, and each must include the leader's
own id on initialization. The cross-language test catches both
mistakes immediately — forgetting to add self.id to epoch_acks
costs a tick of discovery time that perturbs every downstream
delivery and changes the dump bytes.