db-13 — Transactions and MVCC

What is it?

A multi-version concurrency control key-value store with snapshot isolation semantics, in pure memory, ported byte-identically across Rust, Go, and C++. There is no disk, no log, no recovery — only the core MVCC machinery: per-key version chains, a single timestamp oracle, optimistic write-set conflict detection at commit time, and a garbage collector that respects active snapshots.

Every key holds a Vec<Version> sorted ascending by commit_ts, where a Version is { commit_ts: u64, payload: Option<Bytes> } and an empty payload means committed tombstone. A transaction at start_ts reads the newest version with commit_ts <= start_ts, ignoring everything written after it began. On commit, the transaction's write-set is checked against the chain — if any key has a committed version with commit_ts > start_ts, the commit aborts with a write-write conflict; otherwise the transaction's writes are appended under a freshly issued commit_ts.

The lab's load-bearing artifact is a canonical byte serializer for the entire store and a deterministic multi-worker workload. The serialized bytes hash to the same SHA-256 in all three languages.

Why does it matter?

This is the lab where transactions become real. Every storage engine so far in the project has been single-writer or last-write-wins. The moment two transactions can race to update the same key, you need to decide what the database does about it, and that decision shapes everything from the API up to the failure model.

Snapshot isolation is the dominant choice in modern engines:

Postgres runs SI by default for READ COMMITTED and a stricter serializable variant (SSI) for SERIALIZABLE.
TiKV / CockroachDB / FoundationDB are all built on Percolator-style MVCC with snapshot reads and optimistic commit.
Microsoft Hekaton is a pure in-memory MVCC engine almost identical in shape to this lab.
RocksDB's Transaction layer implements optimistic and pessimistic MVCC on top of LSM versions.

What MVCC buys you is the property that readers never block writers and writers never block readers. The cost is space (multiple versions per key) and a garbage-collection problem (when can the old versions be dropped without breaking some live snapshot?). This lab confronts both.

It also forces the engineer to internalize a precise statement of what SI does not give you — the write-skew anomaly — which is the single most-asked question in database interviews because nine out of ten engineers conflate snapshot isolation with serializability.

How does it work?

                ┌──────────── timestamp oracle (atomic u64) ────────────┐
                │   begin() → start_ts;   commit() → commit_ts          │
                └───────────────────────────────────────────────────────┘
                          │                              │
              ┌───────────▼──────────┐         ┌─────────▼─────────┐
              │ Txn { start_ts,      │         │ Store {           │
              │       writes: BTree, │  put    │   chains: BTree<  │
              │       closed: bool } │ ────────►│     key → Vec<   │
              │                      │  del    │       Version>>, │
              │ get(k):              │         │   active_starts, │
              │   1. local writes    │  get    │   oracle          │
              │   2. chain[k] newest │ ◄────── │ }                 │
              │      with commit_ts  │         │                   │
              │      ≤ start_ts      │         │                   │
              │                      │ commit  │   conflict-check, │
              │ commit():            │ ───────►│   then append at  │
              │   conflict-check     │         │   commit_ts       │
              │   then publish       │         │                   │
              └──────────────────────┘         └───────────────────┘
                                                        │
                                                        │   gc(below_ts)
                                                        ▼
                                          drop v[i] iff exists v[i+1]
                                          with commit_ts ≤
                                          min(below_ts, oldest_active)

The five operations

begin() — atomically increments the oracle, calls the resulting number start_ts, registers it in the active starts multiset.
get(k) — first checks the txn's local write-set (read-your-own-writes), then walks the chain for k from newest to oldest looking for the first version with commit_ts <= start_ts. Returns None if that version is a tombstone or no such version exists.
put(k, v) / delete(k) — buffer into a per-txn BTreeMap. No store I/O.
commit() — under the store mutex:
1. for each key in the write-set, fail with Conflict { key, conflicting_ts } if the chain's newest version has commit_ts > start_ts;
2. otherwise allocate commit_ts from the oracle;
3. append each local write to the chain under commit_ts;
4. remove start_ts from the active set. A read-only commit (writes.is_empty()) skips steps 1–3 and just retires from the active set.
abort() — discards the write-set and retires from the active set. Idempotent. Drop/destructor calls abort() if neither commit() nor abort() ran.

The active set and GC

The store keeps a refcount-multiset of currently-active start_ts values. gc(below_ts) computes

cutoff = min(below_ts, oldest_active_start_ts)

and for every chain, drops every prefix version v[i] such that v[i+1] exists with v[i+1].commit_ts <= cutoff. The newest version is always retained — future readers may still need it (or its tombstone).

The reasoning: any reader at start_ts >= cutoff will pick the newest version with commit_ts <= start_ts, never v[i] from the dropped prefix because v[i+1] is also visible to them and is newer. Readers with start_ts < cutoff cannot exist — the active multiset is non-empty only at timestamps >= oldest_active = cutoff.

This is the same reasoning Postgres VACUUM uses with xmin/xmax and OldestXmin, the same reasoning TiKV uses with its "safe point", and the same reasoning Hekaton's GC uses with its "oldest active transaction".

Snapshot isolation, not serializable

The commit-time check looks at the write-set only. It does not look at the read-set. This means:

Two txns reading the same key and updating the same key → exactly one wins. (Lost-update is prevented.)
Two txns reading the same key and updating different keys based on their reads → both can succeed. (Write skew is allowed.)

The classic write-skew anomaly:

T1: r(x); r(y); if x+y >= 0: w(x, -100)
T2: r(x); r(y); if x+y >= 0: w(y, -100)

Started with x=0, y=0, both txns observe x+y=0, both write, both commit (different keys → no write-set overlap). The post-state is x=-100, y=-100, which no serial schedule of T1 then T2 (or T2 then T1) can produce. Snapshot isolation will allow this. Serializable SI (Postgres SSI) catches it via dangerous-structure detection on read dependencies. We deliberately do not implement that — db-13 is the smallest faithful SI engine.

Cross-language invariant

mvccctl workload --seed S --ops N --keys K --writers W --readers R --scenario {writeheavy|mixed|conflicting} is the cross-language contract:

Identical SplitMix64 PRNG seeded with S.
Each op draws three samples: worker_idx = r1 % (W+R), key_idx = r2 % K, payload = (u32)r3 big-endian.
Workers 0..W are writers (they put then commit every 4 ops); workers W..W+R are readers (they get then commit every 4 ops).
Open transactions are drained at the end.

The store is then serialized via the canonical dump and SHA-256-hex'd to stdout (no trailing newline).

Wire format

"DSEMVCC1"          (8 ASCII bytes)
u64 LE next_ts                  ← oracle + 1
u32 LE key_count
per key (sorted ascending by raw key bytes):
  u32 LE klen
  key bytes
  u32 LE version_count
  per version (ascending by commit_ts):
    u64 LE commit_ts
    u8  has_value                 0 = tombstone, 1 = value
    if has_value:
      u32 LE vlen
      vbytes

All integers are unsigned little-endian. Keys and values are length- prefixed; no null terminators, no escapes. next_ts is oracle + 1 to match the next value begin() would issue — this makes the dump round-trippable: a future MvccStore::load can resume the oracle exactly.

Why these particular determinism guarantees

Key iteration order — std::map<Bytes,...> (C++), sorted slice (Go), BTreeMap (Rust). Never raw map iteration in any port.
Within-key version order — natural append order (we always append at the newest commit_ts), reinforced by the chain being a Vec, not a set.
Per-txn write-set order at commit — BTreeMap / sorted keys. This is not visible in the dump itself (writes from a single commit share commit_ts), but it determines which key a multi-key conflict reports, which matters for the error tests.
Workload PRNG — single-threaded SplitMix64 stream with the exact constants Sebastiano Vigna published. No rand crate, no math/rand, no <random> — those are NOT cross-implementation stable.

Frozen reference hashes

Scenario	`--seed --ops --keys --writers --readers --scenario`	sha256
A	`--seed 42 --ops 500 --keys 16 --writers 4 --readers 4 --scenario mixed`	`67d65acae63d8612114131a679c02912b7f8f63df10bce30a2b0def810b7c547`
B	`--seed 7 --ops 2000 --keys 4 --writers 8 --readers 2 --scenario conflicting`	`11433ba130a81a092743c08791f9790c4f148607eef1e23c163a20e354c03824`

Any change to the MVCC semantics, the workload generator, the wire format, or any defaulting in the CLI must update those numbers in scripts/cross_test.sh, the Go test (mvcc_test.go), the C++ test (tests/test_mvcc13.cc), and this table — all in the same commit.

What's intentionally out of scope

Durability. No WAL, no fsync, no crash recovery. The whole store vanishes on process exit. Adding a WAL on top is db-21 work.
Serializability. Snapshot isolation only; we deliberately allow write skew. SSI is a follow-on lab.
Read-set tracking. A txn does not remember which keys it read. Without that, SSI cannot detect anti-dependency cycles.
Locks. The store uses a single coarse mutex for clarity. A real in-memory MVCC engine (Hekaton, MemSQL) uses lock-free version installation with CAS on the chain head; we leave that to db-21.
Distributed timestamps. The oracle is a single atomic counter, not an HLC or TrueTime. Spanner / CRDB / TiKV-style distribution is db-16+ territory.
Range scans, secondary indexes, predicates. Single-key get / put / delete only. db-14 layers indexes on top.

Distributed Systems Engineer — Build Databases & Consensus From Scratch