db-19 — Analysis

Required invariants

  1. Election agreement. At most one node finishes a successful election cycle with role == Leading && synced per epoch. Enforced by majority voting in vote_view plus the strictly increasing pending_new_epoch = max(accepted_epoch, current_epoch) + 1 rule: any competing prospective leader sees a higher accepted_epoch and steps down (via NewEpoch rejection) before it can sync.

  2. Primary order. If a single primary broadcasts proposals p then q, every follower that delivers both delivers p before q. Enforced by the leader's monotonically increasing next_counter (no gaps, no reuse within an epoch) plus the follower's txn.zxid > last_zxid() gate on Propose (out-of-order proposals are silently dropped rather than re-ordered into the log).

  3. Integrity. The leader only proposes once it is synced, and followers only append once current_epoch has been adopted via NewLeader. Followers will not append a Propose whose zxid <= last_zxid(), so a stale leader's late Propose for an already- superseded epoch cannot corrupt a follower's history.

  4. Agreement on committed prefix. If a follower has `last_committed

    = z, every other follower's history contains every txn with zxid <= z(becauseCommit(z)is only broadcast after a quorum has appended every txn up throughz, and a future leader must include any quorum's committed prefix via the Discovery AckEpoch(last_zxid)` reports → it adopts the surviving history).

  5. Total order. All followers deliver committed transactions in the same order (the leader-assigned zxid order). This follows directly from primary order + agreement on committed prefix.

  6. Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

  • propose() calls try_commit() at the end. Same single-node argument as db-17 Raft: a one-node cluster is its own majority, no Ack will ever arrive to drive the commit, so the leader must run the quorum check inline. Harmless for n > 1 because the majority check rejects until acks actually arrive.

  • Sorted iteration on every wire-affecting loop. Rust uses BTreeMap / BTreeSet; C++ uses std::map / std::set; Go uses explicit for p := uint32(0); p < n; p++ loops. The Go code also sorts before iterating wherever a map[uint32]... is read for output (the canonical dump and broadcast loops). HashMap would compile and pass single-language tests but fail cross_test.sh immediately.

  • LookForLeader is structurally a Vote for the sender. The Rust handle arm folds LookForLeader directly into the Vote arm. This avoids a separate LookForLeaderReply and gives a late-arriving Looking node the ability to tally an immediate self-vote from the source. The Go and C++ implementations do the same fold.

  • Non-Looking nodes reply to a Vote with their own current vote. An isolated Looking node sending a Vote to peers who are already Following gets back a Vote pointing at the live leader; combined with the lex-update rule, the isolated node converges on the existing leader in O(1) round-trips after partition heal, rather than starting a new election.

  • Vote lex comparison is (last_zxid, voter_id), not (last_zxid, leader_id). The voter's id is the tie-breaker when histories are equal — this is what makes the highest-id node win a cold-start election. Using leader_id instead would create a fixed-point where any node can vote for any leader and the tally never makes progress.

  • pending_new_epoch = max(accepted_epoch, current_epoch) + 1. The max covers the case where this node has previously acknowledged a NewEpoch for a leader that then failed before reaching sync. Without the max, the new leader could pick an epoch that some follower has already rejected, leaving sync stalled forever.

  • AckEpoch is idempotent. A follower that has already adopted accepted_epoch = e replies again on a re-sent NewEpoch(e). This keeps the discovery handshake robust against the heartbeat-driven re-send loop in on_tick while the leader is still gathering acks.

  • NewLeader ships the whole history, not a diff. Following the ZAB paper. For a study lab this is fine; production ZooKeeper uses SNAP / DIFF / TRUNC variants to avoid bulk transfer. Replacing this with a diff would be a substantial change to the RPC layer and is out of scope.

  • Heartbeats re-broadcast the last Commit. Once synced, the leader re-broadcasts Commit(last_committed) every 50 ticks. This doubles as the "leader is alive" signal that resets the follower's election timer. Sending a dedicated Heartbeat RPC would be one more wire variant for no behavioural gain.

  • Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). Same rationale as db-17: decoupling proposal timing from cluster scheduling decisions keeps the dump bytes from depending on incidental tick alignment.

  • Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls.

Tradeoffs worth flagging

  • No snapshots, no SNAP/DIFF/TRUNC. The leader sends the full history on every NewLeader. For the bounded rounds of this lab the cost is trivial; for production ZooKeeper it would be prohibitive on large datasets. Snapshots are deferred to db-21.

  • No client sessions, no znodes, no watches. ZAB exists to serve ZooKeeper, but this lab implements ZAB in isolation. The "state machine" is the history vector itself. Anything ZooKeeper-API- shaped (sessions, ephemerals, watches, ACLs) is downstream of the consensus core and lives in a different lab.

  • Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real ZooKeeper must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.

  • No Observer role. Production ZooKeeper has non-voting Observer servers that learn from the leader but do not participate in quorum. They are pure read-fanout and add no algorithmic complexity, so they were left out of the simulator.

  • No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's cluster_pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.

  • Follower truncation is by replacement, not by prefix-match. When a follower receives NewLeader(e, hist), it adopts hist wholesale, even if its own history shares a prefix. This is correct (the leader's history is authoritative for the new epoch) but heavier than necessary; a real implementation would diff.

Why three languages

Same reasoning as db-16 and db-17, plus one new lesson specific to ZAB: the algorithm has two quorum-tracking sets that are easy to get subtly wrong (epoch_acks for discovery, leader_acks for sync, plus the per-zxid ack_set for broadcast). Each set must be iterated in a stable order for the dump, and each must include the leader's own id on initialization. The cross-language test catches both mistakes immediately — forgetting to add self.id to epoch_acks costs a tick of discovery time that perturbs every downstream delivery and changes the dump bytes.