db-19 — Broader Ideas

The lab implements textbook ZAB (epoch + counter, leader-driven broadcast, discovery + sync on leadership change) with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

Variants and refinements

ZAB-with-snapshots

Production ZooKeeper periodically truncates history by snapshotting the in-memory state machine and dropping txns whose zxid is below the snapshot's. Followers that fall behind the leader's snapshot are fast-forwarded with SNAP (whole-state copy) rather than DIFF (replay tail). Worth implementing as db-19b — it reuses the wire format and adds a Snap { zxid, state_bytes } RPC alongside the existing NewLeader payload.
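
The SNAP-versus-DIFF choice above can be sketched as a single comparison on the leader. This is a minimal sketch, not the lab's code: Zxid, SyncPlan, LeaderLog, and plan_sync are hypothetical names, and only the Snap { zxid, state_bytes } shape comes from the text above.

```rust
/// (epoch, counter) pair. Field order matters: the derived Ord compares
/// epoch first, then counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Zxid {
    pub epoch: u32,
    pub counter: u32,
}

/// Hypothetical sync decision the leader makes for one follower.
#[derive(Debug, PartialEq, Eq)]
pub enum SyncPlan {
    /// Follower is behind the snapshot: ship the whole state.
    Snap { zxid: Zxid, state_bytes: Vec<u8> },
    /// Follower is inside the retained tail: replay txns after this zxid.
    Diff { from_exclusive: Zxid },
}

/// Hypothetical leader-side view: the latest snapshot and its zxid.
pub struct LeaderLog {
    pub snapshot_zxid: Zxid,
    pub snapshot_state: Vec<u8>,
}

impl LeaderLog {
    /// Pick SNAP when the follower's last zxid predates the snapshot,
    /// otherwise DIFF from the follower's last zxid.
    pub fn plan_sync(&self, follower_last: Zxid) -> SyncPlan {
        if follower_last < self.snapshot_zxid {
            SyncPlan::Snap {
                zxid: self.snapshot_zxid,
                state_bytes: self.snapshot_state.clone(),
            }
        } else {
            SyncPlan::Diff { from_exclusive: follower_last }
        }
    }
}
```

The sketch ignores the case where the follower is ahead of the leader and must truncate; a db-19b would need to handle that separately.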

Fast Leader Election (production form)

Real ZooKeeper's FLE orders votes by peer epoch (the highest epoch this voter has ever seen) first, and only then by (last_zxid, id). The lab uses just (last_zxid, id), which is enough for safety but loses an optimization: a node that just lost leadership often still has the highest peer epoch and should regain leadership quickly. Worth a db-19c.
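
The difference between the two vote orderings is easiest to see side by side. A minimal sketch, assuming a hypothetical Vote struct with a packed u64 zxid; cmp_lab, cmp_fle, and pick_leader are illustrative names, not the lab's or ZooKeeper's API.

```rust
use std::cmp::Ordering;

/// Hypothetical vote record; field names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Vote {
    pub peer_epoch: u32,
    pub last_zxid: u64,
    pub id: u32,
}

/// Lab-style ordering: (last_zxid, id).
pub fn cmp_lab(a: &Vote, b: &Vote) -> Ordering {
    (a.last_zxid, a.id).cmp(&(b.last_zxid, b.id))
}

/// Production-style ordering: peer epoch first, then (last_zxid, id).
pub fn cmp_fle(a: &Vote, b: &Vote) -> Ordering {
    (a.peer_epoch, a.last_zxid, a.id).cmp(&(b.peer_epoch, b.last_zxid, b.id))
}

/// Return the winning vote under the given ordering, or None if empty.
pub fn pick_leader(votes: &[Vote], cmp: fn(&Vote, &Vote) -> Ordering) -> Option<Vote> {
    votes.iter().copied().max_by(|a, b| cmp(a, b))
}
```

With one vote at peer epoch 3 and a slightly older zxid, and another at peer epoch 2 with a newer zxid, cmp_lab picks the second and cmp_fle picks the first.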

Observer mode

Observers receive Commit but never vote in elections or quorums. ZooKeeper added them at scale to push read traffic past the voter-set throughput ceiling without inflating quorum sizes. The simulator extension is small: add a Role::Observer, exclude it from quorum counts, still deliver every Commit.
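
A minimal sketch of that extension, assuming a simplified two-variant Role (the lab's real enum presumably has more states) and hypothetical helper names:

```rust
/// Simplified role enum for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Voter,
    Observer,
}

pub struct Peer {
    pub id: u32,
    pub role: Role,
}

/// Majority of voters only; observers do not inflate the quorum.
pub fn quorum_size(peers: &[Peer]) -> usize {
    let voters = peers.iter().filter(|p| p.role == Role::Voter).count();
    voters / 2 + 1
}

/// True when enough voters have acked. Acks from observers are ignored.
pub fn has_quorum(peers: &[Peer], acked: &[u32]) -> bool {
    let voter_acks = peers
        .iter()
        .filter(|p| p.role == Role::Voter && acked.contains(&p.id))
        .count();
    voter_acks >= quorum_size(peers)
}

/// Commit is still delivered to every peer, observers included.
pub fn commit_targets(peers: &[Peer]) -> Vec<u32> {
    peers.iter().map(|p| p.id).collect()
}
```

With three voters and two observers the quorum stays at 2, and an ack set made of one voter plus both observers does not commit.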

Read-only mode (RO clients during partition)

When a quorum dies but some nodes remain, ZooKeeper exposes those survivors in a read-only mode that serves last-known committed state. This is a useful failure-mode case for the simulator: drop into RO when no quorum responds within an election cycle.
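
One way the drop into RO could look in the simulator. This is a sketch under assumptions: ServeMode, serve_mode, admit, and the tick-based timeout are all hypothetical and not part of the lab.

```rust
/// Hypothetical serving mode for a node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServeMode {
    ReadWrite,
    ReadOnly,
}

/// Drop into read-only once no quorum has responded for a full election
/// cycle. Both arguments are measured in simulator ticks.
pub fn serve_mode(ticks_since_quorum: u64, election_cycle_ticks: u64) -> ServeMode {
    if ticks_since_quorum >= election_cycle_ticks {
        ServeMode::ReadOnly
    } else {
        ServeMode::ReadWrite
    }
}

/// Reads are always admitted (served from last-known committed state);
/// writes are rejected while read-only.
pub fn admit(mode: ServeMode, is_write: bool) -> Result<(), &'static str> {
    match (mode, is_write) {
        (ServeMode::ReadOnly, true) => Err("read-only: no quorum"),
        _ => Ok(()),
    }
}
```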

Cross-epoch zxid ordering

Production ZAB packs (epoch, counter) into one u64, with the epoch in the high 32 bits and the counter in the low 32, so plain integer comparison orders zxids correctly across epochs. The lab uses a struct for clarity; switching to the packed form is a one-line change and would let a zxid fit in a single atomic word on real hardware. Worth a benchmark in db-22.
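
A minimal sketch of the packed form, with the epoch in the high 32 bits. The function names are hypothetical; AtomicU64::fetch_max is the standard-library call.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Pack (epoch, counter) with the epoch in the high 32 bits, so that
/// integer comparison orders by epoch first and counter second.
pub fn pack(epoch: u32, counter: u32) -> u64 {
    ((epoch as u64) << 32) | (counter as u64)
}

pub fn epoch_of(zxid: u64) -> u32 {
    (zxid >> 32) as u32
}

pub fn counter_of(zxid: u64) -> u32 {
    // Truncating cast keeps the low 32 bits.
    zxid as u32
}

/// Advance a shared "last seen zxid" without a lock and return the
/// resulting maximum. Older zxids leave the stored value unchanged.
pub fn observe(last_seen: &AtomicU64, zxid: u64) -> u64 {
    last_seen.fetch_max(zxid, Ordering::SeqCst).max(zxid)
}
```

Any zxid from epoch 2 compares greater than every zxid from epoch 1, which is the cross-epoch ordering the struct form gets from comparing fields in order.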

Production systems to study

Apache ZooKeeper

The canonical implementation. Read the original ZAB paper (Junqueira, Reed, Serafini — Zab: High-performance broadcast for primary-backup systems, DSN 2011) alongside the source in org.apache.zookeeper.server.quorum. The simulator in this lab maps directly onto Leader.java, Follower.java, and FastLeaderElection.java.

Kafka KRaft (Raft replacement for ZooKeeper)

Confluent's argument against keeping ZooKeeper as a dependency was operational: two consensus systems (ZAB for metadata, Kafka's own ISR for log replication) doubled the failure modes and runbooks. KIP-500 replaces ZooKeeper with a Raft-style metadata log inside Kafka itself (KRaft). A good real-world counterpoint to read alongside db-17 (Raft).

Curator / Recipes

Apache Curator's "recipes" (locks, leader latches, distributed queues) are layered on top of ZooKeeper. They are a clinic in using a primary-order primitive without misusing it: every recipe pins its watch semantics and retry policy explicitly because ZK ephemeral nodes are not ACID transactions.

Etcd v2 vs v3

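Etcd has used Raft, not a ZAB-like broadcast, since its earliest releases; the v2-to-v3 change was in the data model and API (a hierarchical keyspace over HTTP/JSON in v2; a flat, MVCC keyspace over gRPC in v3) rather than in the consensus protocol. It is still a useful contrast with ZooKeeper: same coordination problem, different state machine. Comparing v2's raft.go (gone, but in git history) and v3's raft/ is instructive.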

Chubby (Google)

Chubby is Multi-Paxos-based, not ZAB, but the lease + session model in ZooKeeper traces directly back to Chubby. Burrows's OSDI 2006 paper is the canonical writeup; read it after this lab and before db-23.

Performance experiments worth running

The simulator's ticks are a unit of cost for comparative experiments:

  • Quorum-size sweep. For nodes ∈ {3, 5, 7, 9}, run proposals = 50 and count ticks to commit the last proposal. Expected: commit cost rises slowly with quorum size (each added node adds one Propose/Ack exchange, sent in parallel with the others), election cost rises faster (if votes are exchanged all-to-all, message count grows roughly quadratically with node count).
  • Discovery+sync cost on leadership churn. Vary the partition schedule's --partition density. The lab's E scenario has 4 churn events in 1000 ticks; the more churn, the higher the ratio of NewEpoch/NewLeader bytes to Propose/Commit bytes in the dump. Plot that ratio.
  • Comparison to Raft (db-17) and Paxos (db-18). Same flag surface (--seed --nodes --rounds --proposals --partition) and same scenarios — lab structure is identical on purpose. Compare scenario-A commit latency across the three protocols.
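
For the quorum-size sweep, a back-of-envelope message model gives a baseline to compare measured ticks against. It is a sketch under stated assumptions, not a description of the simulator: the leader sends Propose and Commit to every follower and every follower acks, and one election round is a full all-to-all vote exchange. All function names are hypothetical.

```rust
/// Smallest majority of n voters.
pub fn majority(n: usize) -> usize {
    n / 2 + 1
}

/// Messages to commit one proposal, assuming Propose + Ack + Commit
/// between the leader and each of the n - 1 followers.
pub fn commit_messages(n: usize) -> usize {
    3 * n.saturating_sub(1)
}

/// Messages in one election round, assuming every node sends its vote
/// to every other node.
pub fn election_round_messages(n: usize) -> usize {
    n * n.saturating_sub(1)
}

/// One row per cluster size: (n, majority, commit msgs, election msgs).
pub fn sweep(sizes: &[usize]) -> Vec<(usize, usize, usize, usize)> {
    sizes
        .iter()
        .map(|&n| (n, majority(n), commit_messages(n), election_round_messages(n)))
        .collect()
}
```

Under these assumptions, going from 3 to 9 nodes takes commit traffic from 6 to 24 messages per proposal and one election round from 6 to 72, which is the linear-versus-quadratic gap the sweep should make visible.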

What "production-quality" would require beyond this lab

  • Durable storage. history, current_epoch, accepted_epoch must survive kill -9 and power loss. Real ZooKeeper writes a WAL (see db-03) and snapshots every N transactions.
  • Real network. Sockets, TCP retransmits, framing, TLS, auth. The simulator's OutMsg collapses all of that.
  • Client sessions. ZooKeeper's session-id ↔ ephemeral-node binding is a major protocol surface in its own right; not modeled here.
  • Watches. The pub/sub layer on top of read-paths. Adds a fan-out table and a per-session notification queue.
  • Cluster reconfiguration. Adding/removing voters safely is its own protocol (joint quorum on the membership txn). Out of scope.
  • Recovery from torn writes. Per-page checksums on the WAL.
  • Adversarial inputs. ZAB assumes crash failures only: nodes stop or restart, but never lie. Tolerating Byzantine faults needs a different protocol (BFT-SMaRt is one example of a BFT replication library) and a separate code base entirely.
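
The torn-write item above can be illustrated with a checksummed record frame. This sketch checksums per record rather than per page, and it uses Adler-32 only because it fits in a few lines without dependencies; a real WAL would more likely use CRC32C. The frame layout and the names encode_record and decode_record are hypothetical.

```rust
/// Adler-32 over a byte slice (modulus 65521, the largest prime < 2^16).
pub fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65_521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + byte as u32) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

/// Frame layout: [len: u32 LE][checksum: u32 LE][payload bytes].
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&adler32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Return the payload if the frame is complete and the checksum matches;
/// None signals a torn or corrupted record that recovery must discard.
pub fn decode_record(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < 8 {
        return None;
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let sum = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    let payload = buf.get(8..8 + len)?;
    if adler32(payload) == sum {
        Some(payload)
    } else {
        None
    }
}
```

On replay, recovery would stop at the first record for which decode_record returns None and treat everything before it as the durable prefix.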