db-20 — Distributed KV Store (Concepts)
This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.
Where it sits
| Track | Lab | Provides |
|---|---|---|
| db-16 | distributed fundamentals | failure models, CAP, FLP |
| db-17 | Raft | a real consensus implementation |
| db-18 | Paxos | a contrast |
| db-19 | ZAB | another contrast |
| db-20 | distributed KV | integration: log + state machine |
The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.
Simplifications (vs. real Raft / db-17)
| Concept | db-17 | db-20 |
|---|---|---|
| Networking | message-driven | direct in-process broadcast |
| Elections | randomised timeouts | fixed leader, current_term == 1 |
| Followers' acks | RPC reply | function return |
| Log replication | next_index walk on conflict | one-shot snapshot push (truncate_and_replay) |
| Partition | network simulation | Cluster::partition({ids}) drops messages |
| Heal / catchup | next-index probes | full log copy + replay |
| Persistence | log file + fsync | none (in-memory Vec<LogEntry>) |
The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.
Data model
Op = NoOp | Put(key, value) | Del(key)
LogEntry = { term: u64, idx: u64, op: Op }
Replica = { id, log: Vec<LogEntry>, commit_index, current_term,
state_machine: BTreeMap<String, Vec<u8>> }
Cluster = { replicas[5], leader_idx=0, partitions, next_log_idx }
state_machine is the applied projection of the committed log
prefix. We do not store tombstones — Del actually removes the entry.
Propose / commit cycle
- The leader allocates the next log index and appends
LogEntry{term, idx, op}to its own log. - For each follower, in id order:
- If the follower is partitioned, drop the message.
- Else call
try_append(prev_idx, entry). If the follower's last_idx matchesprev_idx, the entry is appended (1 ack). If not, snapshot push:truncate_and_replay(leader.log, leader.commit_index)replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
- If acks ≥ quorum (3/5), commit on the leader by advancing
commit_indexand applying to the state machine. Then in id order, advance every reachable follower's commit_index too. - Otherwise the entry stays in the leader's log uncommitted — a future
successful proposal or a
heal()will retro-commit it.
The total order of commits is the log order: idx 1, 2, 3, ....
Partition + heal
Cluster::partition({3,4}) adds replica ids 3 and 4 to the
partitions set. Subsequent proposals do not message them and do not
count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the
cluster keeps committing. If not, every proposal returns false and
the leader's tail grows uncommitted.
Cluster::heal() clears the set and, for each healed follower in
ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's
next_index-walk conflict resolution: simpler, deterministic, and good
enough for the cross-language exam because the final state is
identical.
Cross-language byte identity — the exam
Wire format for one replica's snapshot
magic "DSEDKV20" (8 bytes)
u64 LE commit_index
u64 LE current_term (== 1 in this lab)
u32 LE entry_count
for each (k, v) in ascending k order:
u32 LE k_len | k_bytes
u32 LE v_len | v_bytes
- Iteration order is ascending key. Rust uses
BTreeMap, C++ usesstd::map— both naturally ascending. Go'smapiteration is randomised, so the Go implementation does an explicitsort.Stringsbefore serialising. - Tombstoned keys are not in the dump.
- All integers little-endian.
Workload spec
splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB
setup:
Cluster::new(5)
if scenario == "partition":
at op = ops/4 → partition([3, 4])
at op = ops*3/4 → heal()
for op in 0..ops:
r1, r2, r3 = rng.next() ×3
kind = (r1 >> 62) & 0x3 # 0,1,2 → Put; 3 → Del
k = "k" + (r2 % keys).to_string()
v = u64_le(r3 % 10000) # 8 bytes
Frozen golden hashes
| Scenario | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b |
| B | --seed 7 --ops 2000 --keys 128 --scenario partition | 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc |
These are baked into scripts/cross_test.sh, src/go/dkv20_test.go,
and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format /
op decoding / partition timing / snapshot push will break them — which
is exactly the point.
Where to look next
src/rust/src/lib.rs— the reference implementation. Read it first.src/go/dkv20.go— port. Note the explicitsort.Stringsbefore writing wire bytes.src/cpp/src/dkv20.cc— port. Note the manual little-endian writers and pure-stdlib sha256.docs/— the long-form study notes (analysis, execution, observation, verification, broader ideas).steps/— the three-step study plan if you are walking the lab fresh.