Broader ideas — what to build next

This capstone is a deliberately minimal replicated KV. Here are the natural follow-on projects, in roughly increasing scope:

1. Leader election

Replace "node 0 is leader forever" with a Raft-style election: randomized timeouts, terms, RequestVote, log-completeness check. Determinism becomes hard the moment timers exist, so frozen-hash testing must be replaced with invariant-style testing (e.g. "every successful read returns a value from the leader's committed log").
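
The vote-granting rule, including the log-completeness check, can be sketched as follows. This is a minimal illustration, not the lab's code: the Node class, its fields, and the handle_request_vote signature are all hypothetical, and timers, persistence, and the candidate side are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    # Hypothetical node state; the lab's own node type may look different.
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, op)

    def last_log(self) -> Tuple[int, int]:
        """Return (last_term, last_index); (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: int,
                            last_log_term: int,
                            last_log_index: int) -> Tuple[int, bool]:
        """Return (current_term, vote_granted)."""
        if term < self.current_term:
            return (self.current_term, False)      # stale candidate
        if term > self.current_term:
            self.current_term = term               # newer term: forget old vote
            self.voted_for = None
        # Log-completeness check: compare last term first, then last index.
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        can_vote = self.voted_for in (None, candidate_id)
        if up_to_date and can_vote:
            self.voted_for = candidate_id
            return (self.current_term, True)
        return (self.current_term, False)
```

Because the handler is a pure function of the node's state and the request, it can still be unit-tested deterministically even once the timers around it are randomized.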

2. Real network

Move from synchronous function calls to in-memory channels first, then to TCP RPC, then to UDP with retransmission. At each layer add the corresponding failure injection (drop, reorder, duplicate, delay) and re-verify safety invariants.
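
One way to approach the in-memory channel stage is a channel that injects faults from a seeded random generator, so that a failing run can be replayed. The sketch below is an assumption about how such a helper might look; FaultyChannel and its parameters are hypothetical names, and time is a logical tick counter rather than a wall clock.

```python
import random
from typing import Any, List, Tuple


class FaultyChannel:
    """In-memory one-way channel with seeded fault injection (hypothetical)."""

    def __init__(self, seed: int, drop: float = 0.0, dup: float = 0.0,
                 max_delay: int = 0) -> None:
        self.rng = random.Random(seed)    # seeded so every run is reproducible
        self.drop = drop                  # probability a message is lost
        self.dup = dup                    # probability a message is duplicated
        self.max_delay = max_delay        # maximum extra delay, in ticks
        self.now = 0
        self.seq = 0
        self.in_flight: List[Tuple[int, int, Any]] = []  # (deliver_at, seq, msg)

    def send(self, msg: Any) -> None:
        if self.rng.random() < self.drop:
            return                        # drop: never enqueued
        copies = 2 if self.rng.random() < self.dup else 1
        for _ in range(copies):
            # Independent delays per copy are what produce reordering.
            delay = self.rng.randint(0, self.max_delay)
            self.in_flight.append((self.now + delay, self.seq, msg))
            self.seq += 1

    def tick(self) -> List[Any]:
        """Advance logical time one step and return the messages now due."""
        self.now += 1
        due = sorted(m for m in self.in_flight if m[0] <= self.now)
        self.in_flight = [m for m in self.in_flight if m[0] > self.now]
        return [msg for _, _, msg in due]
```

With all fault probabilities at zero the channel behaves like a FIFO queue, which gives a baseline to compare against before the faults are turned on.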

3. Log compaction & snapshots

Today catch_up replays the entire leader log. For a long-running cluster this is infeasible, so add Raft-style snapshots: the leader sends the full kv state plus the log index it represents (lastIncludedIndex), the follower installs it, and replication resumes from lastIncludedIndex + 1.
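
The follower side of snapshot installation might look roughly like the sketch below. It assumes 1-based log indices and a plain dict as kv; Snapshot, Follower, and catch_up_from_snapshot are hypothetical names, not the lab's actual catch_up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Snapshot:
    last_included_index: int    # index of the last log entry folded into kv
    kv: Dict[str, str]


@dataclass
class Follower:
    kv: Dict[str, str] = field(default_factory=dict)
    last_applied: int = 0       # 1-based index of the last applied entry

    def install_snapshot(self, snap: Snapshot) -> None:
        if snap.last_included_index <= self.last_applied:
            return              # stale snapshot: follower is already past it
        self.kv = dict(snap.kv)  # replace local state wholesale
        self.last_applied = snap.last_included_index

    def apply(self, index: int, key: str, value: str) -> None:
        assert index == self.last_applied + 1, "entries must apply in order"
        self.kv[key] = value
        self.last_applied = index


def catch_up_from_snapshot(follower: Follower, snap: Snapshot,
                           tail: List[Tuple[int, str, str]]) -> None:
    """Install snap, then replay only entries past what is already applied."""
    follower.install_snapshot(snap)
    for index, key, value in tail:
        if index > follower.last_applied:
            follower.apply(index, key, value)
```

Skipping entries at or below last_applied keeps the replay idempotent when the leader resends an overlapping tail.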

4. Membership changes

Add a Reconfigure op that mutates the cluster set. Use the joint-consensus or single-server membership change algorithms.
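
The core rule of joint consensus is that, while the old and new configurations coexist, a decision needs a majority of each. A minimal sketch of that quorum check, with hypothetical function names and node ids as plain integers:

```python
from typing import FrozenSet, Optional, Set


def has_majority(acks: Set[int], members: FrozenSet[int]) -> bool:
    """True if strictly more than half of members have acknowledged."""
    return len(acks & members) > len(members) // 2


def joint_quorum(acks: Set[int], old: FrozenSet[int],
                 new: Optional[FrozenSet[int]] = None) -> bool:
    """Quorum check for a stable config (new is None) or a joint config."""
    if new is None:
        return has_majority(acks, old)
    # During the transition, both configurations must agree independently.
    return has_majority(acks, old) and has_majority(acks, new)
```

The single-server approach avoids the joint phase by changing membership one node at a time, so that any majority of the old set overlaps any majority of the new one.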

5. Read consistency levels

Stale read: any follower answers from its local kv.
Read-your-writes: client reads from leader.
Linearizable read: leader confirms it is still leader via a heartbeat to a quorum before answering, or uses Raft's ReadIndex / lease read.
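
The difference between a stale read and a heartbeat-confirmed linearizable read can be sketched as below. This is a simplification under stated assumptions: Replica and Leader are hypothetical classes, each peer is modeled as a callable that returns True if it acknowledged a heartbeat, and terms, commit indices, and the ReadIndex wait-for-apply step are omitted.

```python
from typing import Callable, Dict, List, Optional


class Replica:
    """Any node: answers from whatever its local kv currently holds."""

    def __init__(self, kv: Optional[Dict[str, str]] = None) -> None:
        self.kv: Dict[str, str] = dict(kv or {})

    def stale_read(self, key: str) -> Optional[str]:
        return self.kv.get(key)


class Leader(Replica):
    def __init__(self, kv: Dict[str, str],
                 peers: List[Callable[[], bool]]) -> None:
        super().__init__(kv)
        self.peers = peers      # hypothetical: call to send one heartbeat

    def linearizable_read(self, key: str) -> Optional[str]:
        # Count ourselves plus every peer that acknowledged the heartbeat.
        acks = 1 + sum(1 for ping in self.peers if ping())
        cluster_size = 1 + len(self.peers)
        if acks <= cluster_size // 2:
            raise RuntimeError("cannot confirm leadership with a quorum")
        return self.kv.get(key)
```

A full ReadIndex implementation would also record the commit index before the heartbeat round and wait until the state machine has applied up to it before answering.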

6. Multi-shard / sharded KV

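Use a hash of the key to pick a shard; each shard is its own 3-node Raft group. Add a meta-shard that owns the shard map. This is broadly the architecture of TiKV, CockroachDB, and Spanner, although those systems shard by key range rather than by hash, and Spanner replicates each shard with Paxos rather than Raft.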
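
Shard selection by key hash can be sketched in a few lines. The function name shard_for is hypothetical; the one real constraint shown is that the hash must be stable across processes, which rules out Python's built-in hash() for strings because it is salted per process by default.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard id in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shards.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing remaps most keys when num_shards changes, which is one reason to keep an explicit shard map in the meta-shard instead of deriving placement purely from the hash.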

7. Transactions across shards

Layer 2PC (with a transaction coordinator log) over the shard groups. Or do Percolator-style snapshot isolation. Or go full Spanner with TrueTime.
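
The 2PC option can be sketched with an in-process coordinator. Everything here is illustrative: Participant stands in for a whole shard group, the coordinator log is a plain list rather than a replicated or durable log, and timeouts and crash recovery are not modeled.

```python
from typing import Dict, List, Optional, Tuple


class Participant:
    """Stands in for one shard group (hypothetical)."""

    def __init__(self, name: str, will_vote_yes: bool = True) -> None:
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.kv: Dict[str, str] = {}
        self.staged: Optional[Tuple[str, Dict[str, str]]] = None

    def prepare(self, txid: str, writes: Dict[str, str]) -> bool:
        if not self.will_vote_yes:
            return False
        self.staged = (txid, dict(writes))   # hold writes until the decision
        return True

    def commit(self, txid: str) -> None:
        if self.staged and self.staged[0] == txid:
            self.kv.update(self.staged[1])
            self.staged = None

    def abort(self, txid: str) -> None:
        if self.staged and self.staged[0] == txid:
            self.staged = None


def two_phase_commit(txid: str,
                     work: List[Tuple[Participant, Dict[str, str]]],
                     log: List[Tuple[str, str]]) -> bool:
    log.append((txid, "BEGIN"))
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare(txid, writes) for p, writes in work]
    decision = "COMMIT" if all(votes) else "ABORT"
    log.append((txid, decision))   # record the decision before phase 2
    # Phase 2: tell every participant the outcome.
    for p, _ in work:
        if decision == "COMMIT":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision == "COMMIT"
```

Recording the decision before phase 2 is what lets a restarted coordinator finish an in-flight transaction; in a real system that log entry would itself need to be durable or replicated.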

8. Jepsen-style testing

Run property-based tests with random clients and random faults (partitions, clock skew, node kills), then feed the recorded history to a linearizability checker (Knossos or Porcupine).
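
To get a feel for what a linearizability checker does before adopting a real one, a brute-force version for a single register fits in a few lines. This is a toy sketch, not how Knossos or Porcupine are implemented: Op and is_linearizable are hypothetical names, timestamps are assumed distinct, every operation is assumed to have completed, and the search is exponential in the history length.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    call: int               # time the client issued the operation
    ret: int                # time the client saw the response
    kind: str               # "write" or "read"
    value: Optional[str]    # value written, or value the read returned


def is_linearizable(history: List[Op], initial: Optional[str] = None) -> bool:
    """Search for a sequential order consistent with real time and values."""

    def search(remaining: Tuple[Op, ...], state: Optional[str]) -> bool:
        if not remaining:
            return True
        earliest_ret = min(o.ret for o in remaining)
        for i, op in enumerate(remaining):
            # op cannot go next if some other op finished before it started.
            if op.call > earliest_ret:
                continue
            rest = remaining[:i] + remaining[i + 1:]
            if op.kind == "write":
                if search(rest, op.value):
                    return True
            elif op.value == state:     # a read must observe current state
                if search(rest, state):
                    return True
        return False

    return search(tuple(history), initial)
```

A history in which a read returns an overwritten value after the overwriting write has already completed is rejected, while a read that overlaps a write may legally return either the old or the new value.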

9. Replace the in-process state machine

Plug in the LSM from db-09 or the B-tree from db-15 as the underlying KV store. The replication layer (this lab) shouldn't have to change.
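
One way to keep the replication layer unchanged is to have it depend only on a narrow interface. The sketch below assumes a two-method interface; StateMachine, DictStore, and Replicator are hypothetical names, and the real db-09 and db-15 stores may expose different methods that would need a thin adapter.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class StateMachine(Protocol):
    """The only surface the replication layer is allowed to touch."""

    def apply(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class DictStore:
    """In-process store; an LSM or B-tree adapter would match the same shape."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class Replicator:
    """Stand-in for the replication layer: knows the log, not the storage."""

    def __init__(self, store: StateMachine) -> None:
        self.store = store
        self.log: List[Tuple[str, str]] = []

    def commit(self, key: str, value: str) -> None:
        self.log.append((key, value))
        self.store.apply(key, value)    # storage engine is swappable here
```

Swapping the engine then means passing a different object to Replicator, with no change to the replication code itself.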

10. Geo-replication

Add a second tier of replication across regions, with each per-region cluster acting as a single logical replica. Conflict resolution becomes the central question.

Distributed Systems Engineer — Build Databases & Consensus From Scratch