db-19 — Broader Ideas
The lab implements textbook ZAB (epoch + counter, leader-driven broadcast, discovery + sync on leadership change) with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.
Variants and refinements
ZAB-with-snapshots
Production ZooKeeper periodically truncates history by snapshotting
the in-memory state machine and dropping txns whose zxid is below
the snapshot's. Followers that fall behind the leader's snapshot are
fast-forwarded with SNAP (whole-state copy) rather than DIFF
(replay tail). Worth implementing as db-19b — it reuses the wire
format and adds a Snap { zxid, state_bytes } RPC alongside the
existing NewLeader payload.
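A minimal sketch of the SNAP-versus-DIFF decision described above, assuming zxids are plain u64 values. The names Txn, SyncPlan, truncate_history, and plan_sync are hypothetical and not taken from the lab's code; only the Snap { zxid, state_bytes } shape comes from the text.

```rust
/// Hypothetical log entry; the lab's real history type may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Txn {
    pub zxid: u64,
    pub payload: Vec<u8>,
}

/// How the leader brings one follower up to date.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SyncPlan {
    /// Replay the tail of the log the follower is missing.
    Diff { txns: Vec<Txn> },
    /// Ship the whole state machine as of `zxid`.
    Snap { zxid: u64, state_bytes: Vec<u8> },
}

/// Drop every txn whose zxid is below the snapshot's zxid.
pub fn truncate_history(history: &mut Vec<Txn>, snapshot_zxid: u64) {
    history.retain(|t| t.zxid >= snapshot_zxid);
}

/// Choose SNAP when the follower is behind the snapshot (the txns it needs
/// are gone), otherwise DIFF with the txns after its last zxid.
/// Assumption: a fuller version would follow SNAP with the retained tail.
pub fn plan_sync(
    follower_last_zxid: u64,
    snapshot_zxid: u64,
    state_bytes: &[u8],
    history: &[Txn],
) -> SyncPlan {
    if follower_last_zxid < snapshot_zxid {
        SyncPlan::Snap {
            zxid: snapshot_zxid,
            state_bytes: state_bytes.to_vec(),
        }
    } else {
        let txns: Vec<Txn> = history
            .iter()
            .filter(|t| t.zxid > follower_last_zxid)
            .cloned()
            .collect();
        SyncPlan::Diff { txns }
    }
}
```

With a snapshot at zxid 4, a follower whose last zxid is 2 would get Snap, while one at zxid 5 would get a Diff containing only the txns after 5.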
Fast Leader Election (production form)
Real ZooKeeper's FLE has tie-breaking by peer epoch (the highest
epoch this voter has ever seen) before falling back to (last_zxid, id). The lab uses just (last_zxid, id) which is enough for safety
but loses an optimization: a node that just lost leadership often
still has the highest peer-epoch and should regain leadership
quickly. Worth a db-19c.
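The two orderings can be put side by side in a short sketch. Vote, lab_order, fle_order, and pick_winner are hypothetical names chosen for illustration, and last_zxid is treated as an opaque u64; the lab's actual vote type may look different.

```rust
use std::cmp::Ordering;

/// Hypothetical vote record for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Vote {
    pub peer_epoch: u32,
    pub last_zxid: u64,
    pub id: u32,
}

/// Ordering the lab is described as using: (last_zxid, id).
pub fn lab_order(a: &Vote, b: &Vote) -> Ordering {
    (a.last_zxid, a.id).cmp(&(b.last_zxid, b.id))
}

/// Ordering with the peer-epoch tie-break first: (peer_epoch, last_zxid, id).
pub fn fle_order(a: &Vote, b: &Vote) -> Ordering {
    (a.peer_epoch, a.last_zxid, a.id).cmp(&(b.peer_epoch, b.last_zxid, b.id))
}

/// Return the vote that ranks highest under the given ordering, if any.
pub fn pick_winner<'a>(
    votes: &'a [Vote],
    order: fn(&Vote, &Vote) -> Ordering,
) -> Option<&'a Vote> {
    votes.iter().max_by(|a, b| order(a, b))
}
```

The orderings disagree only when one voter has a higher peer epoch but a shorter log than another, which is the case the paragraph above is about.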
Observer mode
Observers receive Commit but never vote in elections or quorums.
ZooKeeper added them at scale to push read traffic past the
voter-set throughput ceiling without inflating quorum sizes. The
simulator extension is small: add a Role::Observer, exclude it
from quorum counts, still deliver every Commit.
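One way the extension might look, as a sketch rather than the lab's actual code. Role::Observer comes from the text above; the Voter variant, the Peer struct, and the three helper functions are assumptions made for the example.

```rust
/// `Observer` is the variant named in the text; `Voter` is a hypothetical
/// stand-in for whatever voting roles the simulator already has.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Voter,
    Observer,
}

/// Hypothetical peer descriptor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Peer {
    pub id: u32,
    pub role: Role,
}

/// Majority of the voting members only; observers never count.
pub fn quorum_size(peers: &[Peer]) -> usize {
    let voters = peers.iter().filter(|p| p.role != Role::Observer).count();
    voters / 2 + 1
}

/// True when acks from voters alone reach a majority of voters.
pub fn has_quorum(peers: &[Peer], acked_ids: &[u32]) -> bool {
    let voter_acks = peers
        .iter()
        .filter(|p| p.role != Role::Observer && acked_ids.contains(&p.id))
        .count();
    voter_acks >= quorum_size(peers)
}

/// Commit is still delivered to every peer, observers included.
pub fn commit_targets(peers: &[Peer]) -> Vec<u32> {
    peers.iter().map(|p| p.id).collect()
}
```

With three voters and two observers the quorum stays at two, so adding observers does not change how many acks a proposal needs.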
Read-only mode (RO clients during partition)
When a quorum dies but some nodes remain, ZooKeeper exposes those survivors in a read-only mode that serves last-known committed state. This is a useful failure-mode case for the simulator: drop into RO when no quorum responds within an election cycle.
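The transition rule could be sketched as a small pure function. Mode, next_mode, accepts_write, and the tick-based parameters are all hypothetical; the sketch assumes the simulator can report how many voters (counting the node itself) are currently reachable and how many ticks have passed since a quorum last responded.

```rust
/// Hypothetical serving mode for a node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    ReadWrite,
    ReadOnly,
}

/// Decide the next mode for a node.
/// - A reachable majority always restores read-write service.
/// - Without a majority, the node drops to read-only once a full
///   election cycle has elapsed with no quorum.
/// - Otherwise the current mode is kept while the election is in flight.
pub fn next_mode(
    current: Mode,
    reachable_voters: usize,
    cluster_size: usize,
    ticks_without_quorum: u64,
    election_cycle_ticks: u64,
) -> Mode {
    let quorum = cluster_size / 2 + 1;
    if reachable_voters >= quorum {
        Mode::ReadWrite
    } else if ticks_without_quorum >= election_cycle_ticks {
        Mode::ReadOnly
    } else {
        current
    }
}

/// Writes need a quorum; reads of last-known committed state do not.
pub fn accepts_write(mode: Mode) -> bool {
    mode == Mode::ReadWrite
}
```

Keeping the rule free of I/O makes it easy to drive from the deterministic simulator's tick loop.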
Cross-epoch zxid ordering
Production ZAB stuffs (epoch, counter) into one u64 (32 bits
each). The lab uses a struct for clarity; switching to the packed
form is a one-line change and would let zxid live in atomic
operations on real hardware. Worth a benchmark in db-22.
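A sketch of the packed form, assuming the epoch occupies the high 32 bits and the counter the low 32 so that integer comparison matches (epoch, counter) ordering. Zxid, pack, unpack, and advance_last_seen are hypothetical names, not the lab's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical struct form, standing in for the lab's zxid type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Zxid {
    pub epoch: u32,
    pub counter: u32,
}

/// Pack into one u64: epoch in the high 32 bits, counter in the low 32.
/// Plain u64 comparison then orders by epoch first, counter second.
pub fn pack(z: Zxid) -> u64 {
    ((z.epoch as u64) << 32) | (z.counter as u64)
}

/// Inverse of `pack`.
pub fn unpack(raw: u64) -> Zxid {
    Zxid {
        epoch: (raw >> 32) as u32,
        counter: raw as u32, // truncating cast keeps the low 32 bits
    }
}

/// Atomically raise a shared "last seen zxid" to at least `z`.
/// Returns the value that was stored before the call.
pub fn advance_last_seen(last: &AtomicU64, z: Zxid) -> u64 {
    last.fetch_max(pack(z), Ordering::AcqRel)
}
```

Because the packed value is a single machine word, the last-seen zxid can be advanced with one fetch_max instead of a lock around a two-field struct.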
Production systems to study
Apache ZooKeeper
The canonical implementation. Read the original ZAB paper (Junqueira,
Reed, Serafini — Zab: High-performance broadcast for primary-backup
systems, DSN 2011) alongside the source in org.apache.zookeeper.server.quorum.
The simulator in this lab maps directly onto Leader.java,
Follower.java, and FastLeaderElection.java.
Kafka KRaft (Raft replacement for ZooKeeper)
Confluent's argument against keeping ZooKeeper as a dependency was operational: running two consensus systems (ZAB for metadata, Kafka's own ISR for log replication) doubled the failure modes and the runbooks. KIP-500 replaces ZAB with a Raft-style log inside Kafka itself. A good real-world counterpoint to read alongside db-17 (Raft).
Curator / Recipes
Apache Curator's "recipes" (locks, leader latches, distributed queues) are layered on top of ZooKeeper. They are a clinic in how not to misuse a primary-order primitive: every recipe pins its watch semantics and retry policy explicitly, because ZK ephemeral nodes are not ACID transactions.
Etcd v2 vs v3
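Etcd is a useful contrast because it has been Raft-based throughout and
never used a ZAB-like broadcast. The v2 to v3 transition changed the API
and storage model (a flat, MVCC key space served over gRPC in place of
v2's HTTP/JSON tree), not the consensus protocol. Comparing the early
Raft code (gone, but in git history) with the later raft/ package is
still instructive: same problem, reworked state machine.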
Chubby (Google)
Chubby is Multi-Paxos-based, not ZAB, but the lease + session model in ZooKeeper traces directly back to Chubby. Burrows's OSDI 2006 paper is the canonical writeup; read it after this lab and before db-23.
Performance experiments worth running
The simulator's ticks are a unit of cost for comparative experiments:
- Quorum-size sweep. For nodes ∈ {3, 5, 7, 9}, run proposals = 50 and count ticks to commit the last proposal. Expected: commit cost rises slowly with quorum size (one extra round-trip per added node), election cost rises sharply (vote table doubles).
- Discovery+sync cost on leadership churn. Vary the partition schedule's --partition density. The lab's E scenario has 4 churn events in 1000 ticks; the more churn, the higher the ratio of NewEpoch/NewLeader bytes to Propose/Commit bytes in the dump. Plot that ratio.
- Comparison to Raft (db-17) and Paxos (db-18). Same flag surface (--seed --nodes --rounds --proposals --partition) and same scenarios; the lab structure is identical on purpose. Compare scenario-A commit latency across the three protocols.
What "production-quality" would require beyond this lab
- Durable storage. history, current_epoch, accepted_epoch must survive kill -9 and power loss. Real ZooKeeper writes a WAL (see db-03) and snapshots every N transactions.
- Real network. Sockets, TCP retransmits, framing, TLS, auth. The simulator's OutMsg collapses all of that.
- Client sessions. ZooKeeper's session-id ↔ ephemeral-node binding is a major protocol surface in its own right; not modeled here.
- Watches. The pub/sub layer on top of read-paths. Adds a fan-out table and a per-session notification queue.
- Cluster reconfiguration. Adding/removing voters safely is its own protocol (joint quorum on the membership txn). Out of scope.
- Recovery from torn writes. Per-page checksums on the WAL.
- Adversarial inputs. ZAB assumes crash-recovery failures only: nodes may stop and restart, but they never lie. Tolerating Byzantine faults takes a different protocol family (BFT-SMaRt is one example) and a separate code base entirely.