db-20 — Distributed KV Store (Concepts)

This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.

Where it sits

Track	Lab	Provides
db-16	distributed fundamentals	failure models, CAP, FLP
db-17	Raft	a real consensus implementation
db-18	Paxos	a contrast
db-19	ZAB	another contrast
db-20	distributed KV	integration: log + state machine

The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.

Simplifications (vs. real Raft / db-17)

Concept	db-17	db-20
Networking	message-driven	direct in-process broadcast
Elections	randomised timeouts	fixed leader, `current_term == 1`
Followers' acks	RPC reply	function return
Log replication	`next_index` walk on conflict	one-shot snapshot push (`truncate_and_replay`)
Partition	network simulation	`Cluster::partition({ids})` drops messages
Heal / catchup	next-index probes	full log copy + replay
Persistence	log file + fsync	none (in-memory `Vec<LogEntry>`)

The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.

Data model

Op            = NoOp | Put(key, value) | Del(key)
LogEntry      = { term: u64, idx: u64, op: Op }
Replica       = { id, log: Vec<LogEntry>, commit_index, current_term,
                  state_machine: BTreeMap<String, Vec<u8>> }
Cluster       = { replicas[5], leader_idx=0, partitions, next_log_idx }

state_machine is the applied projection of the committed log prefix. We do not store tombstones — Del actually removes the entry.

Propose / commit cycle

The leader allocates the next log index and appends LogEntry{term, idx, op} to its own log.
For each follower, in id order:
- If the follower is partitioned, drop the message.
- Else call try_append(prev_idx, entry). If the follower's last_idx matches prev_idx, the entry is appended (1 ack). If not, snapshot push: truncate_and_replay(leader.log, leader.commit_index) replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
If acks ≥ quorum (3/5), commit on the leader by advancing commit_index and applying to the state machine. Then in id order, advance every reachable follower's commit_index too.
Otherwise the entry stays in the leader's log uncommitted — a future successful proposal or a heal() will retro-commit it.

The total order of commits is the log order: idx 1, 2, 3, ....

Partition + heal

Cluster::partition({3,4}) adds replica ids 3 and 4 to the partitions set. Subsequent proposals do not message them and do not count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the cluster keeps committing. If not, every proposal returns false and the leader's tail grows uncommitted.

Cluster::heal() clears the set and, for each healed follower in ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's next_index-walk conflict resolution: simpler, deterministic, and good enough for the cross-language exam because the final state is identical.

Cross-language byte identity — the exam

Wire format for one replica's snapshot

magic           "DSEDKV20"           (8 bytes)
u64 LE          commit_index
u64 LE          current_term         (== 1 in this lab)
u32 LE          entry_count
  for each (k, v) in ascending k order:
    u32 LE k_len | k_bytes
    u32 LE v_len | v_bytes

Iteration order is ascending key. Rust uses BTreeMap, C++ uses std::map — both naturally ascending. Go's map iteration is randomised, so the Go implementation does an explicit sort.Strings before serialising.
Tombstoned keys are not in the dump.
All integers little-endian.

Workload spec

splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB

setup:
  Cluster::new(5)
  if scenario == "partition":
    at op = ops/4   → partition([3, 4])
    at op = ops*3/4 → heal()

for op in 0..ops:
  r1, r2, r3 = rng.next() ×3
  kind = (r1 >> 62) & 0x3                # 0,1,2 → Put; 3 → Del
  k    = "k" + (r2 % keys).to_string()
  v    = u64_le(r3 % 10000)              # 8 bytes

Frozen golden hashes

Scenario	Args	sha256
A	`--seed 42 --ops 500 --keys 32 --scenario default`	`1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b`
B	`--seed 7 --ops 2000 --keys 128 --scenario partition`	`272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc`

These are baked into scripts/cross_test.sh, src/go/dkv20_test.go, and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format / op decoding / partition timing / snapshot push will break them — which is exactly the point.

Where to look next

src/rust/src/lib.rs — the reference implementation. Read it first.
src/go/dkv20.go — port. Note the explicit sort.Strings before writing wire bytes.
src/cpp/src/dkv20.cc — port. Note the manual little-endian writers and pure-stdlib sha256.
docs/ — the long-form study notes (analysis, execution, observation, verification, broader ideas).
steps/ — the three-step study plan if you are walking the lab fresh.

Distributed Systems Engineer — Build Databases & Consensus From Scratch