References — db-23 capstone

Replication and consensus

  • Diego Ongaro & John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version). ATC 2014. The Raft paper — the leader/log/commit-index model used by this lab is a direct simplification of it.
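  • Leslie Lamport. Paxos Made Simple. ACM SIGACT News, 2001. An accessible restatement of Paxos, the majority-quorum consensus algorithm Lamport first described in The Part-Time Parliament (TOCS 1998).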
  • Flavio Junqueira, Benjamin Reed, Marco Serafini. Zab: High-performance broadcast for primary-backup systems. DSN 2011. The atomic broadcast protocol used by ZooKeeper; closest in spirit to the leader-only single-quorum model here.

Theory

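  • Michael Fischer, Nancy Lynch, Michael Paterson. Impossibility of Distributed Consensus with One Faulty Process. JACM 1985. Shows that no deterministic protocol can guarantee consensus in a fully asynchronous system if even one process may crash, which is why practical protocols rely on timeouts, failure detectors, or partial-synchrony assumptions.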
  • Eric Brewer. Towards Robust Distributed Systems. PODC 2000 keynote (CAP conjecture). Gilbert & Lynch later proved it.
  • Seth Gilbert & Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 2002. The formal statement and proof of the CAP conjecture.

Practitioner material

  • MIT 6.824 Distributed Systems lectures (since renumbered 6.5840; esp. lectures 5–8 on Raft).
  • Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly 2017. Chs. 5, 8, 9 on replication, consistency, and consensus.
  • Kyle Kingsbury. Jepsen reports (https://jepsen.io). Practical examples of how real systems violate the guarantees their READMEs claim.

Isolation testing

  • Martin Kleppmann. Hermitage: Testing the "I" in ACID (https://github.com/ept/hermitage). Concrete tests that expose what the isolation levels of real databases actually guarantee.

What this lab does not model

  • Leader election (we hardcode node 0 as leader forever).
  • Log truncation / divergent suffixes (we use synchronous in-process replication, so followers never have entries the leader lacks).
  • Membership changes, log compaction, snapshots, network partitions beyond a single follower being marked down.

Those are the natural follow-on projects after this capstone — see docs/broader-ideas.md.
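
For orientation, the simplified model implied by the list above (fixed leader, synchronous in-process replication, majority commit, a follower that can be marked down) can be sketched in a few lines. This is an illustrative sketch only, not the lab's actual code: the names Node, Cluster, append, and commit_index are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    log: list = field(default_factory=list)
    up: bool = True  # a follower can be "marked down"


class Cluster:
    """Hypothetical stand-in for the lab's model; node 0 is always the leader."""

    def __init__(self, size: int = 3):
        self.nodes = [Node(i) for i in range(size)]
        self.commit_index = 0  # number of log entries known to be committed

    @property
    def leader(self) -> Node:
        return self.nodes[0]  # hardcoded: no leader election

    def majority(self) -> int:
        return len(self.nodes) // 2 + 1

    def append(self, entry: str) -> bool:
        """Replicate synchronously; commit only if a majority holds the entry."""
        self.leader.log.append(entry)
        acks = 1  # the leader counts as one replica
        for follower in self.nodes[1:]:
            if follower.up:
                # Copying the leader's log wholesale means a follower can lag
                # behind the leader but never hold entries the leader lacks.
                follower.log = list(self.leader.log)
                acks += 1
        if acks >= self.majority():
            self.commit_index = len(self.leader.log)
            return True
        return False  # entry stays in the leader's log, uncommitted
```

With three nodes, an append still commits while one follower is down (two of three is a majority) and fails to commit while both followers are down. Because each live follower receives a full copy of the leader's log, the sketch never needs log truncation, which is exactly the case the lab leaves out.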