db-13 — Transactions and MVCC
What is it?
A multi-version concurrency control key-value store with snapshot isolation semantics, in pure memory, ported byte-identically across Rust, Go, and C++. There is no disk, no log, no recovery — only the core MVCC machinery: per-key version chains, a single timestamp oracle, optimistic write-set conflict detection at commit time, and a garbage collector that respects active snapshots.
Every key holds a Vec<Version> sorted ascending by commit_ts, where
a Version is { commit_ts: u64, payload: Option<Bytes> } and an empty
payload means committed tombstone. A transaction at start_ts reads
the newest version with commit_ts <= start_ts, ignoring everything
written after it began. On commit, the transaction's write-set is
checked against the chain — if any key has a committed version with
commit_ts > start_ts, the commit aborts with a write-write conflict;
otherwise the transaction's writes are appended under a freshly issued
commit_ts.
The lab's load-bearing artifact is a canonical byte serializer for the entire store and a deterministic multi-worker workload. The serialized bytes hash to the same SHA-256 in all three languages.
Why does it matter?
This is the lab where transactions become real. Every storage engine so far in the project has been single-writer or last-write-wins. The moment two transactions can race to update the same key, you need to decide what the database does about it, and that decision shapes everything from the API up to the failure model.
Snapshot isolation is the dominant choice in modern engines:
- Postgres runs SI by default for
READ COMMITTEDand a stricter serializable variant (SSI) forSERIALIZABLE. - TiKV / CockroachDB / FoundationDB are all built on Percolator-style MVCC with snapshot reads and optimistic commit.
- Microsoft Hekaton is a pure in-memory MVCC engine almost identical in shape to this lab.
- RocksDB's
Transactionlayer implements optimistic and pessimistic MVCC on top of LSM versions.
What MVCC buys you is the property that readers never block writers and writers never block readers. The cost is space (multiple versions per key) and a garbage-collection problem (when can the old versions be dropped without breaking some live snapshot?). This lab confronts both.
It also forces the engineer to internalize a precise statement of what SI does not give you — the write-skew anomaly — which is the single most-asked question in database interviews because nine out of ten engineers conflate snapshot isolation with serializability.
How does it work?
┌──────────── timestamp oracle (atomic u64) ────────────┐
│ begin() → start_ts; commit() → commit_ts │
└───────────────────────────────────────────────────────┘
│ │
┌───────────▼──────────┐ ┌─────────▼─────────┐
│ Txn { start_ts, │ │ Store { │
│ writes: BTree, │ put │ chains: BTree< │
│ closed: bool } │ ────────►│ key → Vec< │
│ │ del │ Version>>, │
│ get(k): │ │ active_starts, │
│ 1. local writes │ get │ oracle │
│ 2. chain[k] newest │ ◄────── │ } │
│ with commit_ts │ │ │
│ ≤ start_ts │ │ │
│ │ commit │ conflict-check, │
│ commit(): │ ───────►│ then append at │
│ conflict-check │ │ commit_ts │
│ then publish │ │ │
└──────────────────────┘ └───────────────────┘
│
│ gc(below_ts)
▼
drop v[i] iff exists v[i+1]
with commit_ts ≤
min(below_ts, oldest_active)
The five operations
begin()— atomically increments the oracle, calls the resulting numberstart_ts, registers it in the active starts multiset.get(k)— first checks the txn's local write-set (read-your-own-writes), then walks the chain forkfrom newest to oldest looking for the first version withcommit_ts <= start_ts. ReturnsNoneif that version is a tombstone or no such version exists.put(k, v)/delete(k)— buffer into a per-txnBTreeMap. No store I/O.commit()— under the store mutex:- for each key in the write-set, fail with
Conflict { key, conflicting_ts }if the chain's newest version hascommit_ts > start_ts; - otherwise allocate
commit_tsfrom the oracle; - append each local write to the chain under
commit_ts; - remove
start_tsfrom the active set. A read-only commit (writes.is_empty()) skips steps 1–3 and just retires from the active set.
- for each key in the write-set, fail with
abort()— discards the write-set and retires from the active set. Idempotent.Drop/destructor callsabort()if neithercommit()norabort()ran.
The active set and GC
The store keeps a refcount-multiset of currently-active start_ts
values. gc(below_ts) computes
cutoff = min(below_ts, oldest_active_start_ts)
and for every chain, drops every prefix version v[i] such that
v[i+1] exists with v[i+1].commit_ts <= cutoff. The newest version
is always retained — future readers may still need it (or its
tombstone).
The reasoning: any reader at start_ts >= cutoff will pick the newest
version with commit_ts <= start_ts, never v[i] from the dropped
prefix because v[i+1] is also visible to them and is newer. Readers
with start_ts < cutoff cannot exist — the active multiset is
non-empty only at timestamps >= oldest_active = cutoff.
This is the same reasoning Postgres VACUUM uses with xmin/xmax
and OldestXmin, the same reasoning TiKV uses with its "safe point",
and the same reasoning Hekaton's GC uses with its "oldest active
transaction".
Snapshot isolation, not serializable
The commit-time check looks at the write-set only. It does not look at the read-set. This means:
- Two txns reading the same key and updating the same key → exactly one wins. (Lost-update is prevented.)
- Two txns reading the same key and updating different keys based on their reads → both can succeed. (Write skew is allowed.)
The classic write-skew anomaly:
T1: r(x); r(y); if x+y >= 0: w(x, -100)
T2: r(x); r(y); if x+y >= 0: w(y, -100)
Started with x=0, y=0, both txns observe x+y=0, both write, both
commit (different keys → no write-set overlap). The post-state is
x=-100, y=-100, which no serial schedule of T1 then T2 (or T2 then
T1) can produce. Snapshot isolation will allow this. Serializable SI
(Postgres SSI) catches it via dangerous-structure detection on read
dependencies. We deliberately do not implement that — db-13 is the
smallest faithful SI engine.
Cross-language invariant
mvccctl workload --seed S --ops N --keys K --writers W --readers R --scenario {writeheavy|mixed|conflicting} is the cross-language
contract:
- Identical SplitMix64 PRNG seeded with
S. - Each op draws three samples:
worker_idx = r1 % (W+R),key_idx = r2 % K,payload = (u32)r3big-endian. - Workers
0..Ware writers (theyputthen commit every 4 ops); workersW..W+Rare readers (theygetthen commit every 4 ops). - Open transactions are drained at the end.
The store is then serialized via the canonical dump and SHA-256-hex'd to stdout (no trailing newline).
Wire format
"DSEMVCC1" (8 ASCII bytes)
u64 LE next_ts ← oracle + 1
u32 LE key_count
per key (sorted ascending by raw key bytes):
u32 LE klen
key bytes
u32 LE version_count
per version (ascending by commit_ts):
u64 LE commit_ts
u8 has_value 0 = tombstone, 1 = value
if has_value:
u32 LE vlen
vbytes
All integers are unsigned little-endian. Keys and values are length-
prefixed; no null terminators, no escapes. next_ts is oracle + 1
to match the next value begin() would issue — this makes the dump
round-trippable: a future MvccStore::load can resume the oracle
exactly.
Why these particular determinism guarantees
- Key iteration order —
std::map<Bytes,...>(C++), sorted slice (Go),BTreeMap(Rust). Never raw map iteration in any port. - Within-key version order — natural append order (we always append
at the newest
commit_ts), reinforced by the chain being aVec, not a set. - Per-txn write-set order at commit —
BTreeMap/ sorted keys. This is not visible in the dump itself (writes from a single commit sharecommit_ts), but it determines which key a multi-key conflict reports, which matters for the error tests. - Workload PRNG — single-threaded SplitMix64 stream with the exact
constants Sebastiano Vigna published. No
randcrate, nomath/rand, no<random>— those are NOT cross-implementation stable.
Frozen reference hashes
| Scenario | --seed --ops --keys --writers --readers --scenario | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 16 --writers 4 --readers 4 --scenario mixed | 67d65acae63d8612114131a679c02912b7f8f63df10bce30a2b0def810b7c547 |
| B | --seed 7 --ops 2000 --keys 4 --writers 8 --readers 2 --scenario conflicting | 11433ba130a81a092743c08791f9790c4f148607eef1e23c163a20e354c03824 |
Any change to the MVCC semantics, the workload generator, the wire
format, or any defaulting in the CLI must update those numbers in
scripts/cross_test.sh, the Go test (mvcc_test.go), the C++ test
(tests/test_mvcc13.cc), and this table — all in the same commit.
What's intentionally out of scope
- Durability. No WAL, no fsync, no crash recovery. The whole store vanishes on process exit. Adding a WAL on top is db-21 work.
- Serializability. Snapshot isolation only; we deliberately allow write skew. SSI is a follow-on lab.
- Read-set tracking. A txn does not remember which keys it read. Without that, SSI cannot detect anti-dependency cycles.
- Locks. The store uses a single coarse mutex for clarity. A real in-memory MVCC engine (Hekaton, MemSQL) uses lock-free version installation with CAS on the chain head; we leave that to db-21.
- Distributed timestamps. The oracle is a single atomic counter, not an HLC or TrueTime. Spanner / CRDB / TiKV-style distribution is db-16+ territory.
- Range scans, secondary indexes, predicates. Single-key get / put / delete only. db-14 layers indexes on top.