db-20 — Distributed KV Store (Concepts)

This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.

Where it sits

TrackLabProvides
db-16distributed fundamentalsfailure models, CAP, FLP
db-17Rafta real consensus implementation
db-18Paxosa contrast
db-19ZABanother contrast
db-20distributed KVintegration: log + state machine

The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.

Simplifications (vs. real Raft / db-17)

Conceptdb-17db-20
Networkingmessage-drivendirect in-process broadcast
Electionsrandomised timeoutsfixed leader, current_term == 1
Followers' acksRPC replyfunction return
Log replicationnext_index walk on conflictone-shot snapshot push (truncate_and_replay)
Partitionnetwork simulationCluster::partition({ids}) drops messages
Heal / catchupnext-index probesfull log copy + replay
Persistencelog file + fsyncnone (in-memory Vec<LogEntry>)

The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.

Data model

Op            = NoOp | Put(key, value) | Del(key)
LogEntry      = { term: u64, idx: u64, op: Op }
Replica       = { id, log: Vec<LogEntry>, commit_index, current_term,
                  state_machine: BTreeMap<String, Vec<u8>> }
Cluster       = { replicas[5], leader_idx=0, partitions, next_log_idx }

state_machine is the applied projection of the committed log prefix. We do not store tombstones — Del actually removes the entry.

Propose / commit cycle

  1. The leader allocates the next log index and appends LogEntry{term, idx, op} to its own log.
  2. For each follower, in id order:
    • If the follower is partitioned, drop the message.
    • Else call try_append(prev_idx, entry). If the follower's last_idx matches prev_idx, the entry is appended (1 ack). If not, snapshot push: truncate_and_replay(leader.log, leader.commit_index) replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
  3. If acks ≥ quorum (3/5), commit on the leader by advancing commit_index and applying to the state machine. Then in id order, advance every reachable follower's commit_index too.
  4. Otherwise the entry stays in the leader's log uncommitted — a future successful proposal or a heal() will retro-commit it.

The total order of commits is the log order: idx 1, 2, 3, ....

Partition + heal

Cluster::partition({3,4}) adds replica ids 3 and 4 to the partitions set. Subsequent proposals do not message them and do not count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the cluster keeps committing. If not, every proposal returns false and the leader's tail grows uncommitted.

Cluster::heal() clears the set and, for each healed follower in ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's next_index-walk conflict resolution: simpler, deterministic, and good enough for the cross-language exam because the final state is identical.

Cross-language byte identity — the exam

Wire format for one replica's snapshot

magic           "DSEDKV20"           (8 bytes)
u64 LE          commit_index
u64 LE          current_term         (== 1 in this lab)
u32 LE          entry_count
  for each (k, v) in ascending k order:
    u32 LE k_len | k_bytes
    u32 LE v_len | v_bytes
  • Iteration order is ascending key. Rust uses BTreeMap, C++ uses std::map — both naturally ascending. Go's map iteration is randomised, so the Go implementation does an explicit sort.Strings before serialising.
  • Tombstoned keys are not in the dump.
  • All integers little-endian.

Workload spec

splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB

setup:
  Cluster::new(5)
  if scenario == "partition":
    at op = ops/4   → partition([3, 4])
    at op = ops*3/4 → heal()

for op in 0..ops:
  r1, r2, r3 = rng.next() ×3
  kind = (r1 >> 62) & 0x3                # 0,1,2 → Put; 3 → Del
  k    = "k" + (r2 % keys).to_string()
  v    = u64_le(r3 % 10000)              # 8 bytes

Frozen golden hashes

ScenarioArgssha256
A--seed 42 --ops 500 --keys 32 --scenario default1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b
B--seed 7 --ops 2000 --keys 128 --scenario partition272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc

These are baked into scripts/cross_test.sh, src/go/dkv20_test.go, and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format / op decoding / partition timing / snapshot push will break them — which is exactly the point.

Where to look next

  • src/rust/src/lib.rs — the reference implementation. Read it first.
  • src/go/dkv20.go — port. Note the explicit sort.Strings before writing wire bytes.
  • src/cpp/src/dkv20.cc — port. Note the manual little-endian writers and pure-stdlib sha256.
  • docs/ — the long-form study notes (analysis, execution, observation, verification, broader ideas).
  • steps/ — the three-step study plan if you are walking the lab fresh.