Distributed Systems Engineer — Build Databases & Consensus From Scratch
"What I cannot create, I do not understand." — Richard Feynman
A lab-based curriculum for becoming a senior distributed systems engineer by building the systems you'll one day operate, debug, and replace: LevelDB (LSM-tree storage), SQLite (B-tree storage + SQL), and the three canonical consensus algorithms — Raft, Paxos, and ZAB — all implemented from scratch in Rust, Go, and C++.
Why This Repo Exists
Most engineers treat databases and consensus as black boxes. This curriculum makes them transparent. You will:
- Write storage engines that flush, compact, recover, and serve concurrent reads.
- Implement consensus protocols that survive node crashes, network partitions, and message reordering.
- Reason about hardware trade-offs: SSD vs HDD seek latency, write amplification, fsync cost, io_uring vs blocking I/O, cache-line locality, NUMA effects.
- Compare algorithm families: LSM vs B-tree, level-based vs size-tiered compaction, Raft vs Multi-Paxos vs ZAB.
- Build the same thing three times — once in each language — to internalize the design (not the syntax).
Curriculum at a Glance
| Phase | Theme | Labs |
|---|---|---|
| 1 | Storage Primitives & Foundations | db-01 … db-04 |
| 2 | LevelDB / LSM-Tree | db-05 … db-09 |
| 3 | SQLite / B-Tree | db-10 … db-15 |
| 4 | Consensus Algorithms | db-16 … db-20 |
| 5 | Advanced Storage & Capstone | db-21 … db-23 |
See PHASES.md for the full breakdown with learning objectives per lab.
How To Use This Repo
- Read TOOLS.md and install the required toolchains (Rust, Go, C++/CMake).
- Start with db-01-storage-primitives/. Each lab is self-contained and has the same shape:

db-NN-<name>/
├── CONCEPTS.md            # The "why": read this first
├── references.md          # Papers and source-code links to study
├── docs/
│   ├── analysis.md        # Design trade-offs (hardware, algorithmic)
│   ├── broader-ideas.md   # Extensions, alternatives, future work
│   ├── execution.md       # Toolchain versions, quick-start commands
│   ├── observation.md     # Debugging, profiling, monitoring
│   └── verification.md    # Pass/fail checks for your implementation
├── steps/                 # Numbered, sequential implementation guides
│   ├── 01-*.md
│   └── 02-*.md
└── src/
    ├── rust/              # Cargo workspace
    ├── go/                # Go module
    └── cpp/               # CMake project

- Work through steps/ in order. The reference code in src/ is a target: try to write your own first, then compare.
- Run the checks in docs/verification.md before moving on.
What You Will Build
By the end of the curriculum you will have implemented (×3 languages):
- A crash-safe write-ahead log with CRC32 checksums and group commit.
- A skip-list MemTable, an SSTable file format with block compression, and level-based compaction.
- A page-oriented B+-tree with a pager, rollback journal, and WAL mode.
- A hand-written SQL tokenizer, parser, AST, and bytecode virtual machine.
- A transaction manager with MVCC snapshot reads and serializable writes.
- A complete Raft implementation with snapshotting and membership changes.
- Single-decree Paxos and Multi-Paxos with a stable leader.
- A simplified ZAB broadcast layer with epoch transitions.
- A 3-node distributed KV store combining Raft with your LevelDB clone.
- A capstone mini distributed SQL database (the storage engine, the SQL frontend, and Raft replication — all your own code).
Prerequisites
- Comfortable with C-family syntax in at least one systems language (you'll pick up the other two as you go).
- Familiarity with binary trees, hash tables, and Big-O analysis.
- Basic Linux command-line and git.
- Not required: prior distributed systems knowledge, SQL internals knowledge, or database engine experience. We build it all from the ground up.
Pedagogical Style
Modeled after cstack/db_tutorial (concept-first, incremental, runnable code at every step) and the ai-engineering/ lab repo (consistent 8-part CONCEPTS.md, docs/, steps/, src/ structure).
Every CONCEPTS.md follows the same 8-part framework:
- What Is It — one-paragraph executive summary
- Why It Matters — concrete benefits
- How It Works — ASCII architecture diagram
- Core Terminology — table of precise definitions
- Mental Models — analogies for intuition
- Common Misconceptions — myths corrected
- Interview Talking Points — what to say in a senior systems interview
- Connections to Other Labs — how this fits the bigger picture
Status
| Phase | Status |
|---|---|
| Phase 1 — Storage Primitives | Lab 01 complete, 02–04 scaffolded |
| Phase 2 — LevelDB | Scaffolded |
| Phase 3 — SQLite | Scaffolded |
| Phase 4 — Consensus | Scaffolded |
| Phase 5 — Advanced & Capstone | Scaffolded |
See PHASES.md for per-lab status.
License
MIT — see source headers in each implementation.
Phases & Labs
This curriculum has 5 phases and 23 labs. Phases build on each other, but within Phase 4 (consensus) you can do Raft → Paxos → ZAB in any order after the foundations in db-16.
Legend: ✅ complete · 🟡 scaffolded · ⬜ planned
Phase 1 — Storage Primitives & Foundations
Before you can build a database, you need to understand the medium it lives on.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-01 | Storage Primitives | ✅ | Pages, byte order, mmap vs pread, alignment, HDD/SSD/NVMe latency |
| db-02 | Data Structures for Storage | 🟡 | Skip lists, hash tables, when in-memory vs on-disk structures differ |
| db-03 | Write-Ahead Log | 🟡 | WAL framing, CRC32, fsync semantics, group commit |
| db-04 | Bloom Filters & Hashing | 🟡 | FPR math, xxHash vs Murmur, cuckoo & xor filter alternatives |
Phase 2 — LevelDB / LSM-Tree
Build a production-shape LSM-tree key-value store, the way Google built LevelDB and Meta forked it into RocksDB.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-05 | LSM MemTable | 🟡 | Skip-list MemTable, immutable MemTable, flush trigger |
| db-06 | SSTable Format | 🟡 | Data/index/filter blocks, restart points, footer |
| db-07 | LSM Compaction | 🟡 | Level vs size-tiered vs universal, write amplification |
| db-08 | Block Cache & Iterators | 🟡 | LRU, MergingIterator, snapshot via sequence numbers |
| db-09 | LevelDB Complete | 🟡 | Open/close, WriteBatch, recovery, YCSB benchmark |
Phase 3 — SQLite / B-Tree
Build a B+-tree storage engine, a pager, a SQL parser, a bytecode VM, and a transaction manager.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-10 | B-Tree Fundamentals | 🟡 | B-Tree vs B+-Tree, page layout, splits & merges |
| db-11 | Pager System | 🟡 | Page cache, rollback journal vs WAL mode, checkpointing |
| db-12 | SQL Frontend | 🟡 | Tokenizer, parser, AST, VDBE bytecode VM |
| db-13 | Transactions & MVCC | 🟡 | ACID, isolation levels, SQLite locks, MVCC vs 2PL |
| db-14 | Indexes & Query Planning | 🟡 | Secondary indexes, cost-based planner, ART, BRIN |
| db-15 | SQLite Complete | 🟡 | JOINs, aggregation, TPC-H subset benchmark |
Phase 4 — Consensus Algorithms
The three canonical consensus families — implemented, tested, and compared.
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-16 | Distributed Fundamentals | 🟡 | CAP, FLP, linearizability, vector clocks, HLC |
| db-17 | Raft | 🟡 | Election, AppendEntries, snapshotting, ReadIndex |
| db-18 | Paxos | 🟡 | Single-decree, Multi-Paxos, Flexible Paxos |
| db-19 | ZAB | 🟡 | Epochs, zxids, primary-backup vs leader-based |
| db-20 | Distributed KV Store | 🟡 | Raft + LevelDB backend, linearizable reads, sharding |
Phase 5 — Advanced Storage & Capstone
| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-21 | Advanced Storage | 🟡 | io_uring, O_DIRECT, columnar layout, WiscKey |
| db-22 | Performance & Benchmarking | 🟡 | YCSB A–F, flamegraphs, NUMA, perf counters |
| db-23 | Capstone Distributed DB | 🟡 | SQL → planner → LevelDB → Raft; 2PC over Raft groups |
Suggested Pace
- Full-time learner: ~2 labs per week ⇒ ~12 weeks end-to-end.
- Side-project learner: ~1 lab every 1–2 weeks ⇒ ~6–11 months.
- Reading-only path: skim CONCEPTS.md + docs/analysis.md per lab ⇒ ~1 week for the entire curriculum.
Recommended Progression
Phase 1 (must do all 4 in order)
│
├─→ Phase 2 (LevelDB) ──┐
│ │
└─→ Phase 3 (SQLite) ────┤
↓
Phase 4 (Consensus)
↓
Phase 5 (Capstone)
Phase 2 and Phase 3 are independent: pick the storage style that excites you first. The protocol labs in Phase 4 (db-16 … db-19) only reference Phase 1 fundamentals, so you can detour into consensus early if you want; db-20 additionally needs the LevelDB clone from Phase 2. Phase 5's capstone assumes all four prior phases.
Glossary
A unified glossary of terms used across all labs. Terms are grouped by domain.
Storage & I/O
| Term | Definition |
|---|---|
| Page | The unit of I/O between disk and memory. Usually 4 KiB (matches OS page size) but databases often use 4–32 KiB. |
| Block | An SSTable's I/O unit (LevelDB default 4 KiB). Distinct from a B-tree "page" — both are I/O units but for different engines. |
| mmap | Map a file into process address space. Reads happen via page faults; writes via dirty pages flushed by the kernel. |
| pread/pwrite | Positional read/write syscalls. Explicit offset, no shared file pointer. Predictable cost, no page-fault stalls. |
| O_DIRECT | Open flag (Linux) that bypasses the page cache. Requires aligned buffers, aligned offsets, aligned sizes. |
| fsync | Force file data + metadata to stable storage. Blocks until disk acknowledges. Often the slowest syscall in a database. |
| fdatasync | Like fsync but skips non-essential metadata. Faster on most filesystems. |
| Write amplification (WA) | Bytes physically written / bytes logically written. SSDs have hardware WA; LSM-trees have algorithmic WA from compaction. |
| Read amplification (RA) | Bytes physically read / bytes logically read. LSM-trees suffer from RA due to checking multiple levels. |
| Space amplification | Bytes on disk / bytes of live data. LSMs have space amp from stale data awaiting compaction. |
| Endianness | Byte order. Little-endian (x86, ARM default): least-significant byte first. Big-endian: network byte order. |
| Alignment | Memory address being a multiple of N. Required for O_DIRECT (usually 512 B or 4 KiB) and SIMD ops. |
| io_uring | Linux async I/O API (≥ 5.1). Two ring buffers (SQ/CQ) shared between kernel and user space. |
| DMA | Direct Memory Access — disk controller writes directly to RAM without CPU involvement. |
Hardware
| Term | Definition |
|---|---|
| HDD seek time | ~5–10 ms for random reads (head movement + rotational latency). ~150 MB/s sequential. |
| SATA SSD | ~100 μs random read latency, ~500 MB/s sequential, ~80K IOPS. |
| NVMe SSD | ~50–100 μs random read latency, ~3–7 GB/s sequential, ~500K–1M IOPS. Multiple hardware queues. |
| Cache line | CPU cache unit, almost always 64 bytes. Data-structure layout for cache locality matters. |
| NUMA | Non-Uniform Memory Access — CPU sockets have local RAM; cross-socket access is slower. |
| Wear leveling | SSD firmware spreads writes across blocks to even out flash wear. Causes hardware write amplification. |
Data Structures
| Term | Definition |
|---|---|
| Skip list | Probabilistic balanced structure with O(log n) ops and lock-free-friendly properties. Used in LevelDB MemTable. |
| B-Tree | Self-balancing m-ary tree. Internal nodes store keys + values + child pointers. Used for indexes. |
| B+-Tree | B-Tree variant where all values live in leaf nodes; internal nodes are pure routing. Used for tables in SQLite. |
| LSM-Tree | Log-Structured Merge-Tree. In-memory MemTable + on-disk sorted runs (SSTables), merged via compaction. |
| Bloom filter | Probabilistic set membership; no false negatives, tunable false positive rate. Used to skip SSTable lookups. |
| ART | Adaptive Radix Tree — modern in-memory index alternative to B-Trees, used by HyPer, DuckDB. |
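The Bloom filter entry above can be made concrete with a minimal sketch. It assumes double hashing built on the standard library's DefaultHasher; the type name BloomFilter, its sizing parameters, and the salt constant are illustrative choices, not the layout db-04 or LevelDB uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter: a bit array probed by `num_hashes` derived hashes.
struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        BloomFilter { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; probe i uses a + i * b (double hashing).
    fn hash_pair(key: &[u8]) -> (u64, u64) {
        let mut first = DefaultHasher::new();
        key.hash(&mut first);
        let mut second = DefaultHasher::new();
        (key, 0x9e37_79b9_7f4a_7c15u64).hash(&mut second); // arbitrary salt
        (first.finish(), second.finish() | 1) // odd step so probes move
    }

    fn probe(&self, a: u64, b: u64, i: u64) -> (usize, u64) {
        let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
        ((bit / 64) as usize, 1u64 << (bit % 64))
    }

    fn insert(&mut self, key: &[u8]) {
        let (a, b) = Self::hash_pair(key);
        for i in 0..self.num_hashes {
            let (word, mask) = self.probe(a, b, i);
            self.bits[word] |= mask;
        }
    }

    /// false means "definitely absent"; true means "possibly present".
    fn may_contain(&self, key: &[u8]) -> bool {
        let (a, b) = Self::hash_pair(key);
        (0..self.num_hashes).all(|i| {
            let (word, mask) = self.probe(a, b, i);
            self.bits[word] & mask != 0
        })
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 4);
    filter.insert(b"apple");
    println!("apple possibly present: {}", filter.may_contain(b"apple"));
    println!("pear possibly present:  {}", filter.may_contain(b"pear"));
}
```

An inserted key always answers true, which is the "no false negatives" half of the definition. A key that was never inserted usually answers false, but can answer true when its probes collide with bits set by other keys.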
Consensus
| Term | Definition |
|---|---|
| Quorum | Subset of nodes whose agreement is required. Typically ⌊N/2⌋ + 1 for majority quorum. |
| Term / Epoch | Monotonically increasing identifier for a leadership period (Raft term, ZAB epoch, Paxos ballot). |
| Log index | Position of an entry in the replicated log. Indices are monotonic and dense. |
| Commit index | The largest log index known to be safely replicated to a quorum. |
| Linearizability | Strongest consistency: operations appear to take effect atomically at some point between their invocation and response. |
| Sequential consistency | All processes agree on a single global order, but the order need not match real-time. |
| Eventual consistency | If updates stop, all replicas eventually agree. No real-time guarantees. |
| CAP theorem | Under a network partition, you must choose Consistency or Availability. Partition tolerance is non-negotiable. |
| FLP impossibility | No deterministic asynchronous consensus protocol can guarantee progress with even one crash failure. |
| Lamport timestamp | Scalar logical clock: L(a) < L(b) if a happened-before b. Cannot detect concurrency. |
| Vector clock | Per-node vector. VC(a) < VC(b) iff every component of VC(a) is ≤ the matching component of VC(b) and at least one is strictly smaller. Detects concurrent events. |
| HLC | Hybrid Logical Clock: combines physical time with a logical counter; bounded skew from real time. |
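To make the quorum and vector-clock entries concrete, here is a small sketch. The names Causality, compare, and majority_quorum are illustrative rather than taken from the labs, and clocks are assumed to be equal-length slices with one counter per node.

```rust
use std::cmp::Ordering;

/// Possible relationships between two events, judged by their vector clocks.
#[derive(Debug, PartialEq)]
enum Causality {
    Equal,
    Before,
    After,
    Concurrent,
}

/// Compare two vector clocks component by component.
fn compare(a: &[u64], b: &[u64]) -> Causality {
    assert_eq!(a.len(), b.len(), "clocks must cover the same set of nodes");
    let mut a_smaller_somewhere = false;
    let mut b_smaller_somewhere = false;
    for (x, y) in a.iter().zip(b.iter()) {
        match x.cmp(y) {
            Ordering::Less => a_smaller_somewhere = true,
            Ordering::Greater => b_smaller_somewhere = true,
            Ordering::Equal => {}
        }
    }
    match (a_smaller_somewhere, b_smaller_somewhere) {
        (false, false) => Causality::Equal,
        (true, false) => Causality::Before,
        (false, true) => Causality::After,
        // Each clock is ahead on some node: neither event saw the other.
        (true, true) => Causality::Concurrent,
    }
}

/// Majority quorum size for a cluster of `n` nodes: floor(n / 2) + 1.
fn majority_quorum(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    println!("{:?}", compare(&[1, 2, 0], &[2, 2, 1]));
    println!("{:?}", compare(&[2, 0, 0], &[0, 1, 0]));
    println!("quorum of 5 nodes = {}", majority_quorum(5));
}
```

A Lamport timestamp would collapse the second pair into a single ordering; the per-node vector is what lets the comparison report that the two events are concurrent.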
Transactions
| Term | Definition |
|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability — properties a transaction must satisfy. |
| Isolation level | READ UNCOMMITTED → READ COMMITTED → REPEATABLE READ → SERIALIZABLE. Each rules out more anomalies. |
| Dirty read | Reading data written by an uncommitted transaction. |
| Non-repeatable read | Reading the same row twice in one tx and getting different values. |
| Phantom read | A range query returns different rows when re-run within one tx. |
| MVCC | Multi-Version Concurrency Control — writes create new versions; readers see a snapshot. |
| 2PL | Two-Phase Locking — acquire locks in a growing phase, release in a shrinking phase. Guarantees serializability. |
| 2PC | Two-Phase Commit — distributed transaction protocol: prepare phase, then commit/abort. Blocking on coordinator failure. |
SQL Engine
| Term | Definition |
|---|---|
| VDBE | Virtual Database Engine — SQLite's bytecode VM that executes compiled SQL. |
| Prepared statement | A parsed and compiled SQL statement, reusable with different parameters. |
| Cardinality estimation | Predicting how many rows a query operator will produce. Core to the query planner. |
| Selectivity | Fraction of rows that satisfy a predicate. A small fraction (few matching rows, often called a "highly selective" predicate) ⇒ index scan preferred. |
| Covering index | An index that contains all columns needed by a query, so the table doesn't need to be touched. |
Operational
| Term | Definition |
|---|---|
| Snapshot | A consistent point-in-time view of data. Used for backups, MVCC reads, Raft log compaction. |
| Checkpoint | Operation that flushes in-memory state to disk so recovery has less log to replay. |
| Compaction | Background process that merges sorted files (LSM) or reclaims fragmented space (B-tree). |
| YCSB | Yahoo Cloud Serving Benchmark — standard KV workload suite (A–F). Used in db-22. |
| Jepsen | Test framework for distributed systems correctness; injects partitions/clock skew. Inspires our consensus tests. |
Toolchain Setup
All labs target Linux-first with macOS as a supported secondary platform. Windows is not supported (no io_uring, no O_DIRECT semantics we rely on; use WSL2 instead).
Required Versions
| Tool | Minimum | Recommended | Why |
|---|---|---|---|
| Rust | 1.78 | 1.82+ | std::io::IoSlice, stabilized OnceLock, edition 2021 features used throughout |
| Go | 1.22 | 1.23+ | range-over-func iterators (stable from 1.23), improved slices/maps stdlib, generics maturity |
| C++ | C++20 | C++20 (Clang 16+ / GCC 13+) | Concepts, <bit> for endian ops, std::span, designated initializers |
| CMake | 3.28 | 3.29+ | C++20 modules support, modern target_link_libraries semantics |
| clang-format | 17 | 18+ | Consistent C++ formatting across labs |
| Python | 3.11 | 3.12+ | Benchmark plotting & verification scripts (matplotlib, pandas) |
Per-Language Setup
Rust
# rustup is the canonical installer.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
rustup component add clippy rustfmt
cargo install cargo-nextest # faster, parallel test runner
cargo install cargo-flamegraph # used in db-22
Verify:
rustc --version # rustc 1.78.0 or newer
cargo --version
Go
# macOS
brew install go
# Linux: download from https://go.dev/dl/ — distro packages are usually old.
# Useful tools
go install honnef.co/go/tools/cmd/staticcheck@latest
go install golang.org/x/perf/cmd/benchstat@latest
Verify:
go version # go1.22 or newer
C++
# macOS
xcode-select --install
brew install cmake ninja llvm
# Linux (Debian/Ubuntu)
sudo apt-get install -y build-essential cmake ninja-build clang-17 clang-format-17 \
libsnappy-dev liburing-dev
Verify:
clang++ --version # Clang 16 or newer
cmake --version # 3.28 or newer
Optional but recommended:
- liburing-dev: required only for db-21 (io_uring lab) on Linux.
- libsnappy-dev: used in db-06 (SSTable block compression).
- valgrind / lldb: for memory and crash debugging.
Per-Lab Build Commands
Every lab src/<lang>/ is self-contained and has these commands:
| Language | Build | Test | Run |
|---|---|---|---|
| Rust | cargo build --release | cargo nextest run (or cargo test) | cargo run --release --bin <name> |
| Go | go build ./... | go test ./... | go run ./cmd/<name> |
| C++ | cmake -B build -G Ninja && cmake --build build | ctest --test-dir build | ./build/<name> |
docs/execution.md in each lab repeats the exact commands with the lab-specific binary names.
OS-Specific Notes
Linux
- io_uring requires kernel ≥ 5.1 (≥ 5.6 for most useful features). Check with uname -r.
- O_DIRECT works on most filesystems but is rejected by tmpfs; use a real disk path in tests.
- For accurate latency benchmarks, disable CPU frequency scaling: sudo cpupower frequency-set -g performance.
macOS
- No io_uring: db-21 falls back to kqueue + worker pool. The lab explains the difference.
- O_DIRECT does not exist; use F_NOCACHE via fcntl (the lab provides the wrapper).
- fsync(2) does not guarantee data hits stable storage on macOS; use fcntl(F_FULLFSYNC). Labs handle this.
Editor / IDE
Any editor works. VS Code with these extensions is what the reference implementations were written in:
- rust-lang.rust-analyzer
- golang.go
- llvm-vs-code-extensions.vscode-clangd
- ms-vscode.cmake-tools
Sanity Check Script
Run this once after setup to verify everything works:
cd db-01-storage-primitives
( cd src/rust && cargo build --release ) && \
( cd src/go && go build ./... ) && \
( cd src/cpp && cmake -B build -G Ninja && cmake --build build ) && \
echo "All three toolchains OK."
Storage Primitives
The lab that earns you the right to talk about databases.
1. What Is It
This lab teaches the physical layer that every storage engine sits on top of: how data moves between a process's memory and a block device. You will learn the OS page model, the byte-order question (endianness), the four main file I/O styles (read/write, pread/pwrite, mmap, O_DIRECT), buffer alignment, and the durability primitive fsync. You'll also internalize the latency numbers for HDD, SATA SSD, and NVMe SSD that drive every storage design decision in the rest of the curriculum. The deliverable is a tiny page allocator plus a hexdump utility, written three times (once in Rust, once in Go, and once in C++), exercising pread/pwrite against a real disk file.
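As a rough idea of what the hexdump half of the deliverable might look like, here is a minimal sketch. The function name hexdump and the 16-bytes-per-row layout are assumptions made for illustration; the lab's reference code may format its output differently.

```rust
/// Render `data` as rows of 16 bytes: offset, hex bytes, printable ASCII.
/// `base_offset` is the file offset of data[0], so a page read from the
/// middle of a file can be labeled with its real position.
fn hexdump(data: &[u8], base_offset: u64) -> String {
    let mut out = String::new();
    for (row, chunk) in data.chunks(16).enumerate() {
        let offset = base_offset + (row as u64) * 16;
        let hex: Vec<String> = chunk.iter().map(|b| format!("{b:02x}")).collect();
        let ascii: String = chunk
            .iter()
            .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
            .collect();
        // 47 = 16 two-digit bytes plus 15 separating spaces, so short rows stay aligned.
        out.push_str(&format!("{offset:08x}  {:<47}  |{ascii}|\n", hex.join(" ")));
    }
    out
}

fn main() {
    let mut page = vec![0u8; 48];
    page[..13].copy_from_slice(b"hello, pages!");
    print!("{}", hexdump(&page, 4096));
}
```

Labeling rows with the real file offset makes it easy to line a dump up against the page numbers the allocator hands out.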
2. Why It Matters
- Every later lab depends on these primitives. LSM-trees, B-trees, WALs: they're all built on pread/pwrite/fsync and an understanding of the page cache.
- Choosing the right I/O style changes throughput by 10–100×. A naïve read loop is not the same as pread from many threads, which is not the same as io_uring, which is not the same as mmap.
- Hardware shapes the algorithm. LSM-trees exist because random writes on HDDs were catastrophic. NVMe IOPS now make some classic assumptions wrong. Knowing the numbers prevents cargo-culting designs from the wrong decade.
- fsync is the single most expensive syscall in any database. Understanding when it must be called, and when you can amortize it, is the difference between 100 commits/sec and 100,000 commits/sec.
3. How It Works
User process
┌───────────────────────────────────────────────────┐
│ Your code: page_allocator, db.put("key", val) │
│ buffer = [u8; PAGE_SIZE] │
└────────────┬───────────────────┬──────────────────┘
│ │
│ pread/pwrite │ mmap
│ (explicit copy) │ (page-fault driven)
▼ ▼
┌─────────────────────────────────────┐
│ Kernel page cache (RAM) │ ← cached pages,
│ 4 KiB pages, indexed by inode+off │ LRU-evicted
└────────────────┬────────────────────┘
│ block layer
│ (scheduler, mq-deadline / none for NVMe)
▼
┌─────────────────────┐
│ Device driver │ fsync() blocks here
│ (NVMe / SATA AHCI) │ until disk acks
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Storage hardware │ HDD: ~5 ms random
│ HDD / SSD / NVMe │ SSD: ~100 µs random
│ │ NVMe: ~50 µs random
└─────────────────────┘
Three things to internalize from this picture:
- Without O_DIRECT, you always go through the kernel page cache. Your pread may hit a warm cache (memcpy speed) or cold cache (full disk I/O). Latency variance is enormous.
- fsync (or its siblings fdatasync and O_DSYNC) is how you tell the kernel to write out dirty pages and the device to flush its own write cache. Without it, "the write returned" means "the kernel accepted it", not "it survives a power loss".
- mmap and pread are fundamentally different mental models. mmap makes I/O implicit (page faults), pread makes it explicit (syscalls). LMDB chose mmap. SQLite, LevelDB, and PostgreSQL chose pread/pwrite. We will use pread/pwrite for predictability, and discuss mmap in the analysis.
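The explicit path in the picture (pwrite, then a durability barrier, then pread) can be sketched in a few lines of Rust. The helper name write_then_read_page and the temp-file name are hypothetical; write_all_at and read_exact_at are the standard library's Unix wrappers around pwrite and pread, and sync_data maps to fdatasync on Linux.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt; // positional I/O: read_at / write_at
use std::path::Path;

const PAGE_SIZE: usize = 4096;

/// Write one page at its page-aligned offset, make it durable, read it back.
fn write_then_read_page(path: &Path, page_no: u64, fill: u8) -> io::Result<Vec<u8>> {
    let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    let offset = page_no * PAGE_SIZE as u64;

    let page = vec![fill; PAGE_SIZE];
    file.write_all_at(&page, offset)?; // pwrite: explicit offset, no shared cursor
    file.sync_data()?; // only after this returns is the page expected to survive power loss

    let mut back = vec![0u8; PAGE_SIZE];
    file.read_exact_at(&mut back, offset)?; // pread: most likely served from the page cache
    Ok(back)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("db01_sketch_page.bin");
    let page = write_then_read_page(&path, 3, 0xAB)?;
    println!("read back {} bytes, first byte = {:#04x}", page.len(), page[0]);
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Neither call touches a file cursor, which is why the same file handle can later be shared across threads without extra locking.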
4. Core Terminology
| Term | Definition |
|---|---|
| Page | Fixed-size unit of I/O between user and storage. The kernel uses 4 KiB; databases pick 4–32 KiB. We use 4 KiB. |
| Page cache | Kernel-managed RAM that mirrors recently accessed file pages. Transparent to read/write and pread/pwrite. |
| pread(fd, buf, n, off) | Read n bytes from fd starting at byte offset off. Does not affect the file pointer. Thread-safe. |
| pwrite(fd, buf, n, off) | Write n bytes to fd at byte offset off. Thread-safe. |
| mmap | Map a file region into the process's address space. Accesses become loads/stores; faults trigger page-ins. |
| fsync(fd) | Block until all dirty data and metadata for fd are on stable storage. The durability primitive. |
| fdatasync(fd) | Like fsync but may skip metadata updates that aren't required to retrieve the data. |
| O_DIRECT | Open flag (Linux) that bypasses the page cache. Requires 512-byte or 4-KiB alignment on buffers, offsets, sizes. |
| F_FULLFSYNC | macOS-only fcntl that actually flushes the drive's cache. fsync on macOS is not enough for true durability. |
| Endianness | Byte order of multi-byte integers in memory. Little-endian = LSB first (x86, ARM default); big-endian = MSB first (network byte order). |
| Alignment | An address being a multiple of N. Matters for SIMD, DMA, O_DIRECT, and many hardware operations. |
| Sector | The atomic write unit of the device. HDDs: 512 B (legacy) or 4 KiB (Advanced Format). NVMe: usually 4 KiB. |
| IOPS | I/O operations per second. The right unit for random workloads (HDD ~150, SATA SSD ~80K, NVMe ~500K–1M). |
| Latency | Time for one operation to complete. Often what users actually feel; throughput hides tail behavior. |
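The alignment entry is easier to remember with the arithmetic written out. The sketch below assumes power-of-two alignments; is_aligned, align_up, and AlignedPage are illustrative names, and the repr(align) struct is just one safe way to get a page-aligned buffer of the kind O_DIRECT expects.

```rust
const PAGE_SIZE: usize = 4096;

/// True if `value` is a multiple of `align` (align must be a power of two).
fn is_aligned(value: usize, align: usize) -> bool {
    debug_assert!(align.is_power_of_two());
    value & (align - 1) == 0
}

/// Round `value` up to the next multiple of `align` (a power of two).
fn align_up(value: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (value + align - 1) & !(align - 1)
}

/// A page whose address the compiler guarantees is 4 KiB-aligned.
#[repr(C, align(4096))]
struct AlignedPage([u8; PAGE_SIZE]);

fn main() {
    // A 5000-byte record needs two whole pages once rounded up.
    println!("align_up(5000, 4096) = {}", align_up(5000, PAGE_SIZE));

    let page = Box::new(AlignedPage([0u8; PAGE_SIZE]));
    let addr = page.0.as_ptr() as usize;
    println!("buffer at {addr:#x}, aligned = {}", is_aligned(addr, PAGE_SIZE));
}
```

A plain Vec<u8> carries no such guarantee: its allocation is only required to be aligned for u8, so whether it happens to sit on a page boundary depends on the allocator.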
5. Mental Models
The page cache is a transparent cache, not a database. Think of pread like Map::get: if the key is in the cache, it's a memcpy; if it's not, the kernel goes to disk for you. You can't observe a cache miss with timing alone in production — that's the whole point of caches and the whole reason benchmarks lie.
fsync is a phone call to the disk. All other writes are "I told the postman" — fast, no guarantee. fsync is "I waited on the line while the disk confirmed the package arrived." Phone calls are slow. Group commits = bundling 100 packages into one call.
mmap is "make the file look like an array". pread is "I will ask for bytes one request at a time". The first is convenient. The second is predictable. Convenience and predictability are usually at war.
Sequential vs random I/O on an HDD is 100× different. On NVMe it's 2–3×. This is why LSM-trees won the 2000s and why "just append" got rediscovered in the 2010s and why NVMe makes some of those assumptions less critical in the 2020s. Hardware shapes design.
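A back-of-the-envelope check of the sequential-versus-random claim, using the IOPS and bandwidth figures from the glossary (HDD: ~150 IOPS, ~150 MB/s; NVMe: ~500K IOPS, ~3 GB/s). The function names are illustrative, and the model is deliberately crude: 4 KiB random reads limited only by IOPS, sequential reads limited only by bandwidth.

```rust
const PAGE_BYTES: f64 = 4096.0;

/// Seconds to read `total_bytes` as independent 4 KiB random reads.
fn random_read_secs(total_bytes: f64, iops: f64) -> f64 {
    (total_bytes / PAGE_BYTES) / iops
}

/// Seconds to read `total_bytes` as one sequential stream.
fn sequential_read_secs(total_bytes: f64, bytes_per_sec: f64) -> f64 {
    total_bytes / bytes_per_sec
}

/// How many times slower the random pattern is than the sequential one.
fn random_penalty(total_bytes: f64, iops: f64, bytes_per_sec: f64) -> f64 {
    random_read_secs(total_bytes, iops) / sequential_read_secs(total_bytes, bytes_per_sec)
}

fn main() {
    let one_gib = 1024.0 * 1024.0 * 1024.0;
    println!("HDD  random penalty: {:.0}x", random_penalty(one_gib, 150.0, 150e6));
    println!("NVMe random penalty: {:.1}x", random_penalty(one_gib, 500_000.0, 3e9));
}
```

With these inputs the HDD penalty comes out at roughly 240× and the NVMe penalty at about 1.5×, the same orders of magnitude as the rule of thumb above. The exact ratios shift with the device figures you plug in.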
6. Common Misconceptions
- "write returning means my data is safe." False. The kernel buffered it. Only fsync (or fdatasync for data-only, or F_FULLFSYNC on macOS) guarantees durability.
- "mmap is faster than pread because there's no syscall." Often false. mmap access generates page faults, which also trap into the kernel, plus they're synchronous (you can't overlap them with computation as easily). LMDB-style designs win when the working set fits in RAM; they suffer on write-heavy workloads, where syncing the mapping gets expensive.
- "SSDs make random vs sequential irrelevant." Partially true. Random reads are fast, but random writes still incur garbage collection and write amplification at the firmware level. Sequential writes still reduce hardware WA significantly.
- "4 KiB is always the right page size." No. It matches OS page size, which is friendly for mmap and for the page cache. But LevelDB pairs 4 KiB blocks (to limit read amplification) with multi-megabyte SSTables (2 MiB by default, written sequentially). PostgreSQL uses 8 KiB pages. The "right" page size depends on workload.
- "fsync flushes only my file." On many filesystems and many older kernels, fsync could flush more (or less) than expected. Modern ext4/xfs are sane, but historical PostgreSQL fsync bugs (2018) showed that the contract is more subtle than it looks.
7. Interview Talking Points
- "For a write-heavy OLTP workload on local NVMe, I'd start with direct pwrite + fdatasync rather than mmap. mmap makes durable writes ambiguous: the kernel may write back dirty pages at any time, so you give up control over write ordering, and msync(MS_SYNC) over a large dirty range is a heavier hammer than fdatasync on an append-only log."
- "My rule of thumb: HDD random read ≈ 5 ms, SATA SSD ≈ 100 µs, NVMe ≈ 50 µs, RAM ≈ 100 ns, L1 ≈ 1 ns. Each jump of a couple of orders of magnitude is where a different design becomes interesting. LSM-trees sidestep the gap between random and sequential I/O by converting random writes into sequential ones."
- "Group commit is what closes the gap between 100 commits/sec and 100,000 commits/sec. It batches N concurrent transactions into one fsync, trading latency (one transaction may wait ~5 ms for a batch) for throughput (N commits per fsync instead of one). Postgres and MySQL InnoDB both do this."
- "O_DIRECT isn't a free win. You skip the page cache, so you have to implement your own cache and your buffers must be aligned. PostgreSQL deliberately uses the page cache and lets the OS do that work for it. Oracle and Sybase use O_DIRECT. The choice depends on whether you trust your buffer manager more than the kernel's."
8. Connections to Other Labs
- db-02: uses the page-aligned allocator from here for skip-list and hash-table node storage.
- db-03: the WAL is literally pwrite + fdatasync in a loop; this lab gives you the muscle memory.
- db-06: SSTable blocks are read via pread at known offsets; this lab is the read side.
- db-11: the SQLite pager is a pread/pwrite-based page cache; you'll reimplement what the kernel does for you here.
- db-21: revisits I/O with io_uring (Linux) and O_DIRECT for the advanced cases; this lab establishes the baseline.
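As a preview of the db-03 connection, here is a sketch of what a checksummed WAL record might look like before it is handed to pwrite. The frame layout (4-byte little-endian length, 4-byte little-endian CRC, then payload) and the names encode_record and decode_record are assumptions for illustration, not the format db-03 specifies. The checksum is plain CRC-32 (the IEEE polynomial, computed bit by bit).

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical frame: [len: u32 LE][crc32(payload): u32 LE][payload].
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(&crc32(payload).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

fn read_u32_le(buf: &[u8], at: usize) -> Option<u32> {
    let b = buf.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns the payload only if the frame is complete and the checksum matches.
fn decode_record(frame: &[u8]) -> Option<&[u8]> {
    let len = read_u32_le(frame, 0)? as usize;
    let stored_crc = read_u32_le(frame, 4)?;
    let payload = frame.get(8..8 + len)?;
    if crc32(payload) == stored_crc { Some(payload) } else { None }
}

fn main() {
    let mut frame = encode_record(b"put key=value");
    println!("intact frame decodes: {}", decode_record(&frame).is_some());
    frame[10] ^= 0x01; // simulate a torn or corrupted write
    println!("corrupt frame decodes: {}", decode_record(&frame).is_some());
}
```

During recovery, the first frame that fails to decode marks the end of the usable log, which is how a checksum turns a half-written tail into a clean stopping point.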
References — Storage Primitives
Canonical Papers & Specifications
- POSIX pread/pwrite/fsync: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html
- Linux open(2) (for O_DIRECT, O_DSYNC): https://man7.org/linux/man-pages/man2/open.2.html
- Linux fsync(2): https://man7.org/linux/man-pages/man2/fsync.2.html
- Linux io_uring design: https://kernel.dk/io_uring.pdf (Jens Axboe, 2019). Read for db-21.
- macOS F_FULLFSYNC: man fcntl on macOS; see also Apple Tech Note TN1150.
Hardware Numbers
- "Latency Numbers Every Programmer Should Know" — Jeff Dean, 2012. https://gist.github.com/jboner/2841832
- "What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf (long but seminal)
- NVMe specification — https://nvmexpress.org/specifications/ (skim §3 on queues, §4 on commands)
Battle Stories
- "PostgreSQL's fsync surprise": https://lwn.net/Articles/752063/. Why fsync semantics on Linux were subtler than database authors assumed. Read this.
- "Files are Hard", Dan Luu. https://danluu.com/file-consistency/. Survey of how filesystems can lose your data.
- mmap-based databases vs. read/write-based databases: Andrew Crotty, Viktor Leis, and Andrew Pavlo, "Are You Sure You Want to Use MMAP in Your Database Management System?", CIDR 2022. https://db.cs.cmu.edu/mmap-cidr2022/. Required reading if you ever consider mmap.
Implementation References
- SQLite OS interface: https://www.sqlite.org/src/file/src/os_unix.c (search for unixSync to see real-world fsync handling, including the macOS F_FULLFSYNC workaround)
- LevelDB env_posix.cc: https://github.com/google/leveldb/blob/main/util/env_posix.cc (look at PosixWritableFile::Sync)
- LMDB: http://www.lmdb.tech/doc/ (the canonical mmap database; read for contrast)
Books
- "Operating Systems: Three Easy Pieces" — Arpaci-Dusseau. Free at https://pages.cs.wisc.edu/~remzi/OSTEP/. Chapters 39–44 (persistence) are exactly this lab.
- "Designing Data-Intensive Applications" — Martin Kleppmann, O'Reilly. Chapter 3 ("Storage and Retrieval") frames the LSM vs B-tree debate that drives Phases 2 and 3.
Analysis — Storage Primitives
This document is for the design decisions and the trade-offs. The CONCEPTS.md told you what exists; this tells you why we picked one over the other and what you'd reach for in different conditions.
Decision 1: pread/pwrite over read/write
We use pread/pwrite (explicit offsets) instead of read/write + lseek.
| Aspect | read/write + lseek | pread/pwrite |
|---|---|---|
| Thread safety on shared fd | Unsafe: file pointer is shared, lseek+read races | Safe: offset is per-call |
| Syscalls per op | 2 (lseek + read) | 1 |
| Mental model | Stateful cursor | Stateless random access |
| Used by | Single-threaded streaming code | All real databases (SQLite, LevelDB, Postgres) |
Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).
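The thread-safety row is the one that matters most in practice. The sketch below shares a single file handle across threads, with every reader supplying its own offset, so there is no cursor to race on. The helper names write_numbered_pages and read_pages_concurrently, and the one-thread-per-page layout, are illustrative only.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;
use std::sync::Arc;
use std::thread;

const PAGE_SIZE: usize = 4096;

/// Create a file whose page N is filled with the byte value N.
fn write_numbered_pages(path: &Path, page_count: u64) -> io::Result<File> {
    let file = OpenOptions::new()
        .read(true).write(true).create(true).truncate(true).open(path)?;
    for page_no in 0..page_count {
        file.write_all_at(&vec![page_no as u8; PAGE_SIZE], page_no * PAGE_SIZE as u64)?;
    }
    Ok(file)
}

/// One thread per page, all sharing the same fd; returns each page's first byte.
fn read_pages_concurrently(file: Arc<File>, page_count: u64) -> io::Result<Vec<u8>> {
    let handles: Vec<_> = (0..page_count)
        .map(|page_no| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<u8> {
                let mut buf = vec![0u8; PAGE_SIZE];
                // pread: the offset travels with the call, not with the fd.
                file.read_exact_at(&mut buf, page_no * PAGE_SIZE as u64)?;
                Ok(buf[0])
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("reader thread panicked"))
        .collect()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("db01_sketch_concurrent.bin");
    let file = Arc::new(write_numbered_pages(&path, 4)?);
    println!("first bytes: {:?}", read_pages_concurrently(file, 4)?);
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The same program written with lseek + read would need a mutex around every seek-and-read pair, which serializes the readers and gives away the parallelism.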
Decision 2: pread/pwrite over mmap
This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.
| Aspect | mmap | pread/pwrite |
|---|---|---|
| Code complexity | Lower (pointer access) | Higher (explicit calls) |
| Latency predictability | Bad — page faults are synchronous, can stall on cold pages | Good — every cost is visible in the syscall |
| Write durability | Tricky: the kernel can write back dirty pages at any time, and msync(MS_SYNC) over a large dirty range is expensive | Surgical: fdatasync(fd) |
| Memory accounting | Mapped pages live in the page cache and show up in your RSS; hard to reason about WSS | Buffers are yours, you bound them |
| Large files (> RAM) | Catastrophic — random page-in storms | Fine — you read what you need |
| Multi-threaded scaling | Page-fault locks scale poorly | Linear scaling with cores |
| TLB pressure | Hugepages help but transparent hugepage transitions can pause processes | None |
| Used by | LMDB, BoltDB | SQLite, LevelDB, RocksDB, Postgres |
The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.
Decision 3: Page Size = 4 KiB
We pick 4 KiB as the default page size in this lab and reconsider in later labs.
| Page size | Pros | Cons |
|---|---|---|
| 512 B | Old HDD sector; small writes are cheap | Poor metadata-to-data ratio (per-page overhead dominates), lots of indirection |
| 4 KiB | Matches OS page, NVMe LBA, page cache. Sweet spot for OLTP. | Small for analytics |
| 8 KiB | Postgres default. Better for slightly larger rows. | Wastes I/O for tiny tuples |
| 16 KiB | MySQL InnoDB default. Good index fanout. | One row update = 16 KiB write |
| 64 KiB / 1 MiB | Analytics, sequential scans, Parquet-style data pages | Terrible for random updates |
Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.
Decision 4: Endianness on Disk
Our on-disk format is little-endian. Justified by:
- x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
- Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
- LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
- SQLite uses big-endian for historical reasons (SQLite dates to 2000 and its current file format to 2004, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.
Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++ uses std::endian (C++20) to detect the host byte order and std::byteswap (C++23) or an explicit shift-based encoder to convert.
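As a minimal sketch of what "explicit conversion at the I/O boundary" can look like in Rust, the helpers below encode and decode a u64 through to_le_bytes and from_le_bytes. The helper names put_u64_le and get_u64_le are hypothetical and not taken from the lab's source.

```rust
/// Encode a u64 as 8 little-endian bytes, independent of host byte order.
/// (Hypothetical helper name, for illustration only.)
fn put_u64_le(buf: &mut [u8], offset: usize, value: u64) {
    buf[offset..offset + 8].copy_from_slice(&value.to_le_bytes());
}

/// Decode 8 little-endian bytes back into a u64.
fn get_u64_le(buf: &[u8], offset: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(raw)
}

fn main() {
    let mut page = [0u8; 16];
    // "DSE1PAGE" read as one big number.
    put_u64_le(&mut page, 0, 0x4453_4531_5041_4745);
    // On disk the least significant byte comes first, so the bytes spell "EGAP1ESD".
    assert_eq!(&page[..8], b"EGAP1ESD");
    assert_eq!(get_u64_le(&page, 0), 0x4453_4531_5041_4745);
    println!("{:02x?}", &page[..8]);
}
```

Because the conversion never depends on the host's native order, the same code produces identical bytes on a big-endian machine.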
Decision 5: When to call fsync
The cost of fsync on consumer NVMe is roughly:
- Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
- Throughput-limited: roughly 5,000–20,000 fsync/sec before contention dominates
The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.
The right policy depends on durability requirements:
| Policy | What survives a crash | Throughput cost |
|---|---|---|
| No fsync | Nothing reliably (kernel may flush eventually) | None |
| fsync per write | Every acknowledged write | Massive: one device flush per write |
| fsync per N writes | Up to the last N-1 writes possibly lost | 1/N the cost |
| Group commit | Every acknowledged write; latency = time-to-batch + fsync | Excellent: best of both |
| fsync periodically (e.g., 100 ms) | Last 100 ms of writes possibly lost (compare MySQL innodb_flush_log_at_trx_commit=2, which flushes about once per second) | Excellent |
The right design for the WAL that Phase 2 builds on is group commit: when a writer finishes, it waits on a condition variable; the WAL writer thread pwrites pending records, fdatasyncs once, then wakes every waiter. We'll build this in db-03.
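A back-of-the-envelope model shows why batching matters. The sketch below assumes the fsync dominates and ignores the cost of the writes themselves; the function name and the 200 µs latency are illustrative assumptions, not measurements from this lab.

```rust
/// Acknowledged writes per second when every batch of `batch` writes
/// pays one fsync of `fsync_us` microseconds (write cost itself ignored).
fn writes_per_sec(batch: u64, fsync_us: u64) -> u64 {
    batch * 1_000_000 / fsync_us
}

fn main() {
    let fsync_us = 200; // assumed NVMe fsync latency, for illustration
    for batch in [1, 8, 64] {
        println!("batch={batch:>2}: {} writes/s", writes_per_sec(batch, fsync_us));
    }
}
```

Under this model throughput scales linearly with batch size until something other than the fsync becomes the bottleneck.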
macOS Caveat
On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.
Decision 6: O_DIRECT — Not in This Lab
We don't use O_DIRECT in Lab 01 because:
- It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
- It bypasses the page cache, so you must implement your own, which is a buffer manager (Phase 3, db-11).
- It's not available on macOS. You'd use fcntl(fd, F_NOCACHE, 1) as the closest analogue, but it has weaker semantics.
We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.
Hardware Numbers Cheat-Sheet
Memorize these. They drive every storage design decision:
L1 cache hit 1 ns
Branch mispredict 3 ns
L2 cache hit ~4 ns
L3 cache hit ~15 ns
DRAM access ~100 ns — 100× L1
Context switch ~1–5 µs
NVMe random read ~50 µs — 500× DRAM
NVMe sequential read ~5 µs/4KB
SATA SSD random read ~100 µs
SATA SSD seq read ~10 µs/4KB
HDD random read ~5 ms (50,000× DRAM)
HDD sequential read ~30 µs/4KB (~150 MB/s)
fsync on NVMe ~50 µs–1 ms
fsync on HDD ~10 ms
F_FULLFSYNC (macOS) ~10–50 ms — actually flushes drive cache
Network RTT same DC ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms — drives Raft heartbeat tuning in db-17
Gaps of two to three orders of magnitude are where the design changes. Between L1 and DRAM (100×), you can usually ignore it. Between DRAM and disk (roughly 500× for NVMe, 50,000× for HDD), you can't. Between an NVMe read and a cross-region round trip (1000× again), distributed systems get hard.
What Breaks at Scale
- Filesystem journal contention: fsync on ext4 serializes through the FS journal. Many concurrent fsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL.
- Page cache thrashing: when working set > RAM, every pread is a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better.
- fsync failure handling: on Linux, a failed fsync can mark dirty pages as clean, silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic on fsync error and crash-recover from the WAL (modern Postgres does this).
- NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded pread loop runs at QD=1 and leaves most of the drive idle. io_uring (Phase 5) fixes this.
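The "panic on fsync error" mitigation can be sketched in a few lines of Rust. The function name sync_or_die is hypothetical, and the sketch assumes a WAL exists to recover from after the abort.

```rust
use std::fs::File;

/// Sync file data and treat any failure as fatal. After a failed fsync the
/// kernel may already have marked the dirty pages clean, so a retry can
/// report success without the data being on disk.
fn sync_or_die(file: &File) {
    if let Err(e) = file.sync_data() {
        eprintln!("fatal: sync failed: {e}; aborting so recovery replays the WAL");
        std::process::abort();
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("sync_or_die_demo.bin");
    let file = File::create(&path)?;
    sync_or_die(&file); // returns only if the sync succeeded
    std::fs::remove_file(&path)
}
```

The design choice is to never retry: the only trustworthy state after a sync failure is whatever crash recovery reconstructs.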
Execution — Storage Primitives
Prerequisites
You've completed the toolchain setup in ../../TOOLS.md. To confirm:
rustc --version # ≥ 1.78
go version # ≥ 1.22
clang++ --version # ≥ 16 (or g++ ≥ 13)
cmake --version # ≥ 3.28
Quick Start — All Three Languages
From the lab root:
# Rust
( cd src/rust && cargo build --release )
./src/rust/target/release/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/rust/target/release/pagealloc read /tmp/lab01.bin 0
./src/rust/target/release/pagealloc hexdump /tmp/lab01.bin
# Go
( cd src/go && go build -o /tmp/pagealloc-go ./cmd/pagealloc )
/tmp/pagealloc-go write /tmp/lab01.bin 0 "hello, disk"
/tmp/pagealloc-go read /tmp/lab01.bin 0
/tmp/pagealloc-go hexdump /tmp/lab01.bin
# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build )
./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/cpp/build/pagealloc read /tmp/lab01.bin 0
./src/cpp/build/pagealloc hexdump /tmp/lab01.bin
All three binaries produce byte-compatible files: write with one, read with another, get the same bytes.
CLI Reference (all three implementations)
| Command | Effect |
|---|---|
| pagealloc write <file> <page_no> <ascii_string> | Write the ASCII bytes (zero-padded to 4 KiB) into page page_no. Calls fdatasync (or fsync where fdatasync is unavailable) before returning. |
| pagealloc read <file> <page_no> | pread page page_no (4 KiB), print bytes up to first null. |
| pagealloc hexdump <file> | Walk the whole file 4 KiB at a time and print a canonical hex dump (16 bytes/line). |
| pagealloc bench <file> <pages> <iters> | Random pread benchmark: file is preallocated to pages pages, then iters random reads are timed. |
Tests
# Rust
( cd src/rust && cargo test )
# Go
( cd src/go && go test ./... )
# C++
( cd src/cpp && cmake --build build && ctest --test-dir build --output-on-failure )
Each test suite covers:
- Round-trip: write then read returns the same bytes.
- Cross-implementation: a file written by Rust must read correctly with the Go and C++ binaries (run by scripts/cross_test.sh).
- fsync is called on write (verified by strace -e fsync in the cross_test script on Linux).
- Endianness sanity: the page header uses little-endian and is identical across implementations.
Environment Variables
| Variable | Default | Effect |
|---|---|---|
| DSE_PAGE_SIZE | 4096 | Override page size (must be a power of two). Only consumed by the pagealloc bench subcommand. |
| DSE_FSYNC | 1 | If 0, skip fsync on write. Only for benchmarking, never in production. |
Observation — Storage Primitives
How to look inside the page cache, watch syscalls, measure latency, and prove to yourself that your code is doing what you think.
Looking at the Page Cache
Linux
# What's in the page cache for our file? (Requires `pcstat` or vmtouch.)
go install github.com/tobert/pcstat/pcstat@latest
pcstat /tmp/lab01.bin
# Drop the page cache (requires root) — to test "cold" reads.
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
macOS
# `purge` drops the unified buffer cache (requires admin password).
sudo purge
# `fs_usage` is the macOS strace-for-files. It filters by pid or process name,
# so start it first, then run pagealloc from another terminal.
sudo fs_usage -w -f filesys pagealloc
Watching Syscalls
# Linux
strace -e trace=openat,pread64,pwrite64,fsync,fdatasync \
/tmp/pagealloc-go write /tmp/lab01.bin 0 "hello"
# macOS (sudo required for dtrace)
sudo dtruss -f -t pread,pwrite,fsync ./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello" 2>&1
You should see — in order:
openat(AT_FDCWD, "/tmp/lab01.bin", O_RDWR|O_CREAT, 0644) = 3
pwrite64(3, "EGAP1ESD\1\0\0\0\5\0\0\0hello\0\0\0"..., 4096, 0) = 4096
fdatasync(3) = 0
close(3) = 0
If you see read(3, ...) without an offset, you're going through the file cursor (often via a buffered reader); that's wrong for this lab.
If you see no fsync/fdatasync, your durability is fake.
Measuring Latency
The bench subcommand measures cold-cache and warm-cache pread latency:
# Preallocate a 100 MB file, then do 10000 random 4 KiB reads.
./src/rust/target/release/pagealloc bench /tmp/lab01.bin 25600 10000
Expected output:
preallocated: 25600 pages = 102400 KiB
warm-cache reads: p50=3.1 µs p99=8.4 µs throughput=315 MB/s
dropped page cache
cold-cache reads: p50=78 µs p99=210 µs throughput=51 MB/s
The exact numbers depend on your hardware. The shape matters:
- Warm p50 ≈ 1–5 µs: that's a memcpy from the page cache. No actual disk I/O.
- Cold p50 ≈ 50–200 µs on NVMe, 5–15 ms on a spinning disk.
- p99 is several times p50 (about 3× in the sample above): latency tails are real; this motivates io_uring and dedicated I/O threads.
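For reference, p50 and p99 can be computed from raw samples with a nearest-rank percentile. This is a sketch under the assumption that the bench subcommand collects one Duration per read; the real implementation may compute percentiles differently.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile over an already sorted slice; `p` in (0, 100].
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    assert!(!sorted.is_empty());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn main() {
    // Time a stand-in operation; a real run would time one pread per sample.
    let mut samples: Vec<Duration> = (0..1000)
        .map(|i| {
            let start = Instant::now();
            std::hint::black_box(i * 2);
            start.elapsed()
        })
        .collect();
    samples.sort();
    println!(
        "p50={:?} p99={:?}",
        percentile(&samples, 50.0),
        percentile(&samples, 99.0)
    );
}
```

Sorting every sample is fine at benchmark scale; a long-running service would more likely use a histogram.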
Profiling Tools
Rust
cargo install cargo-flamegraph
cd src/rust
cargo flamegraph --release --bin pagealloc -- bench /tmp/lab01.bin 25600 100000
# open flamegraph.svg in your browser
Go
cd src/go
# -cpuprofile only works with a single package, so target one package rather than ./...
go test -bench=BenchmarkPread -cpuprofile=cpu.prof .
go tool pprof -http=:8080 cpu.prof
C++
# Linux
perf record -F 999 -g ./src/cpp/build/pagealloc bench /tmp/lab01.bin 25600 100000
perf report
# macOS (use Instruments.app or sample)
sample pagealloc 5 -file /tmp/sample.txt
Watching Disk Throughput
# Linux (iostat from sysstat package)
iostat -dx 1 nvme0n1
# macOS
sudo fs_usage -w -f diskio
While running pagealloc bench, watch r/s (reads per second), rkB/s, and r_await (average read latency in ms; older sysstat versions report a combined await). For NVMe at QD=1, expect r/s to plateau around 10,000–20,000 (one read per 50–100 µs); you'd need io_uring (Lab 21) to push it into the hundreds of thousands.
Verifying Endianness
# Create a fresh file.
./src/rust/target/release/pagealloc write /tmp/endian.bin 0 ""
# In a separate REPL session, use whichever language you prefer to write the u32 0x01020304 at offset 0 of the file.
# Then xxd the file:
xxd /tmp/endian.bin | head -1
A little-endian system writes 04 03 02 01 for the value 0x01020304. If you see 01 02 03 04, either your machine is big-endian (unlikely on x86/ARM) or your code is using to_be_bytes somewhere.
Verification — Storage Primitives
The pass/fail checks for this lab. If all eight pass for all three implementations, you are done.
Per-Implementation Checks
For each of src/rust, src/go, src/cpp:
V1 — Builds
# Rust
( cd src/rust && cargo build --release ) && echo "RUST OK"
# Go
( cd src/go && go build ./... ) && echo "GO OK"
# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build ) && echo "CPP OK"
V2 — Unit Tests Pass
( cd src/rust && cargo test --release ) && echo "RUST TESTS OK"
( cd src/go && go test ./... ) && echo "GO TESTS OK"
( cd src/cpp/build && ctest --output-on-failure ) && echo "CPP TESTS OK"
V3 — Round-Trip
# Per binary:
$BIN write /tmp/v3.bin 5 "hello, lab"
$BIN read /tmp/v3.bin 5 | grep -q "^hello, lab$" && echo "V3 OK"
V4 — fsync Is Called (Linux only)
strace -e fsync,fdatasync -o /tmp/syscalls.log $BIN write /tmp/v4.bin 0 "x"
grep -E 'fsync|fdatasync' /tmp/syscalls.log && echo "V4 OK"
(Expected: at least one of fsync(...) or fdatasync(...) in the trace. On macOS substitute sudo dtruss -t fsync.)
Cross-Implementation Checks
V5 — Byte-Compatibility
Files written by one implementation must read identically with the others.
RUST=./src/rust/target/release/pagealloc
GO=/tmp/pagealloc-go   # built by the Quick Start go build command
CPP=./src/cpp/build/pagealloc
$RUST write /tmp/v5.bin 3 "cross-lang ok"
$GO read /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "GO read RUST OK"
$CPP read /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "CPP read RUST OK"
$GO write /tmp/v5.bin 7 "go writes"
$CPP read /tmp/v5.bin 7 | grep -q "^go writes$" && echo "CPP read GO OK"
$RUST read /tmp/v5.bin 7 | grep -q "^go writes$" && echo "RUST read GO OK"
V6 — Hexdump Identical
$RUST hexdump /tmp/v5.bin > /tmp/v6.rust.hex
$GO hexdump /tmp/v5.bin > /tmp/v6.go.hex
$CPP hexdump /tmp/v5.bin > /tmp/v6.cpp.hex
diff /tmp/v6.rust.hex /tmp/v6.go.hex && diff /tmp/v6.rust.hex /tmp/v6.cpp.hex && echo "V6 OK"
V7 — Endianness Sanity
The first 8 bytes of each non-empty page should be a little-endian magic constant 0x44534531_50414745 (DSE1PAGE reversed):
xxd -l 8 /tmp/v5.bin | head -1
# Expected: 00000000: 4547 4150 3145 5344
If you see 4453 4531 5041 4745, your implementation is writing big-endian — fix that.
V8 — Benchmark Smoke
$RUST bench /tmp/v8.bin 1024 1000
# Expected: prints both warm-cache and cold-cache p50/p99 lines without crashing.
Master Script
A single command to run everything (provided as scripts/verify.sh):
bash scripts/verify.sh
Expected output ends with:
====================================================
ALL 8 CHECKS PASSED for RUST, GO, CPP
====================================================
If any check fails, the script exits non-zero and prints which check + which implementation failed.
Broader Ideas — Storage Primitives
Where to go after this lab if you want to push deeper. Each idea is a self-contained extension or alternative.
1. Replace pread with io_uring (Linux)
The single biggest jump from this lab's design to a modern engine is moving from synchronous syscalls to async submission queues. With pread at QD=1, NVMe runs at ~5% of its IOPS. With io_uring at QD=32+, it hits the spec sheet.
- Lab pointer: db-21-storage-engine-advanced does this end-to-end.
- Self-study: implement a pread_async API now that internally still uses pread but queues requests through a crossbeam channel (Rust) / goroutine pool (Go) / std::jthread worker pool (C++). When you then swap the backend for io_uring, no API consumer changes.
- Reference: Jens Axboe's "Efficient IO with io_uring" (https://kernel.dk/io_uring.pdf), §3.
2. Page Layout — Slotted Pages vs Fixed-Size Records
Our pages are zero-padded ASCII. Real engines use slotted pages:
┌────────┬────────────────────────────┬──────┐
│ header │ slot[0] slot[1] ... │ free │
│ │ → offsets into page │ │
├────────┴────────────────────────────┴──────┤
│ ← record N ← record 1 ← record 0 │ (grows from end)
└────────────────────────────────────────────┘
This lets variable-length records share a page without external fragmentation. PostgreSQL, MySQL InnoDB, and SQLite all use slotted pages. Try this: extend pagealloc so each page holds a slot directory and stores up to 16 variable-length keys per page. This is the warm-up for db-10.
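A minimal slotted page can be sketched as follows. The layout here (a 4-byte header holding slot_count and free_end, followed by 4-byte slots of offset and length) is an assumption made for illustration, not the format db-10 will use.

```rust
const PAGE_SIZE: usize = 4096;
const HEADER: usize = 4; // slot_count: u16, free_end: u16 (assumed layout)
const SLOT: usize = 4; // per-slot entry: offset: u16, len: u16

struct SlottedPage {
    buf: [u8; PAGE_SIZE],
}

impl SlottedPage {
    fn new() -> Self {
        let mut p = SlottedPage { buf: [0; PAGE_SIZE] };
        p.set_u16(2, PAGE_SIZE as u16); // records grow down from the page end
        p
    }

    fn get_u16(&self, at: usize) -> usize {
        u16::from_le_bytes([self.buf[at], self.buf[at + 1]]) as usize
    }

    fn set_u16(&mut self, at: usize, v: u16) {
        self.buf[at..at + 2].copy_from_slice(&v.to_le_bytes());
    }

    /// Append a record; returns its slot number, or None if it does not fit.
    fn insert(&mut self, rec: &[u8]) -> Option<usize> {
        let count = self.get_u16(0);
        let free_end = self.get_u16(2);
        let dir_end = HEADER + (count + 1) * SLOT; // directory size after adding a slot
        if dir_end + rec.len() > free_end {
            return None;
        }
        let start = free_end - rec.len();
        self.buf[start..free_end].copy_from_slice(rec);
        let slot_at = HEADER + count * SLOT;
        self.set_u16(slot_at, start as u16);
        self.set_u16(slot_at + 2, rec.len() as u16);
        self.set_u16(0, (count + 1) as u16);
        self.set_u16(2, start as u16);
        Some(count)
    }

    /// Look a record up by slot number.
    fn get(&self, slot: usize) -> Option<&[u8]> {
        if slot >= self.get_u16(0) {
            return None;
        }
        let at = HEADER + slot * SLOT;
        let (off, len) = (self.get_u16(at), self.get_u16(at + 2));
        Some(&self.buf[off..off + len])
    }
}

fn main() {
    let mut page = SlottedPage::new();
    let a = page.insert(b"k1=short").unwrap();
    let b = page.insert(b"k2=a somewhat longer value").unwrap();
    println!("{:?} {:?}", page.get(a), page.get(b));
}
```

The slot directory grows forward while records grow backward, so the page is full exactly when the two regions meet.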
3. Copy-on-Write Pages (LMDB-style)
Instead of overwriting a page in place, allocate a fresh page and update the parent to point to it. This is how LMDB achieves single-writer MVCC without a WAL. Pros: simpler crash recovery (just point at the last committed root). Cons: requires a GC for unreferenced pages, doubles write amplification.
- Reference: Howard Chu's LMDB tech docs, http://www.lmdb.tech/doc/
- Self-study: extend the allocator to track free pages and never overwrite; introduce a "commit" op that just writes a new root pointer atomically.
4. Write Coalescing & Group Commit
Right now every write calls fsync immediately. As soon as more than one writer is active, group commit pays off:
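// Pseudocode
let mut pending = vec![];
let mut last_fsync = Instant::now();
loop {
    // Note: a real loop needs a receive with a timeout, otherwise an idle
    // channel delays the flush of an already waiting batch.
    pending.push(receive_write_request());
    // Flush when the batch is old enough or large enough.
    if last_fsync.elapsed() > Duration::from_micros(100) || pending.len() > 64 {
        pwrite_all(&pending);
        fdatasync();
        last_fsync = Instant::now();
        for req in pending.drain(..) {
            req.notify_done();
        }
    }
}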
- Lab pointer: db-03-write-ahead-log builds this for the WAL. Try it here as warm-up.
- Trade-off: latency increases by up to 100 µs, throughput rises by ~50× under contention.
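A runnable version of the batching idea, using only the standard library, might look like the sketch below. The names WriteReq and wal_writer are hypothetical, and the channel-plus-deadline design is one possible approach rather than the design db-03 uses.

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// One write request: the bytes plus a channel to acknowledge durability.
struct WriteReq {
    data: Vec<u8>,
    done: Sender<()>,
}

/// Drain requests into batches; each batch costs one fdatasync.
/// Returns the number of syncs issued once all senders are gone.
fn wal_writer(rx: Receiver<WriteReq>, file: File, max_batch: usize, max_wait: Duration) -> usize {
    let (mut syncs, mut offset) = (0usize, 0u64);
    // Block for the first request of a batch; exit when the channel closes.
    while let Ok(first) = rx.recv() {
        let deadline = Instant::now() + max_wait;
        let mut batch = vec![first];
        // Collect followers until the batch is full or the deadline passes.
        while batch.len() < max_batch {
            let left = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(left) {
                Ok(req) => batch.push(req),
                Err(_) => break, // timed out, or all senders are gone
            }
        }
        for req in &batch {
            file.write_all_at(&req.data, offset).expect("WAL write failed");
            offset += req.data.len() as u64;
        }
        file.sync_data().expect("sync failed"); // one sync for the whole batch
        syncs += 1;
        for req in batch {
            let _ = req.done.send(()); // wake the waiting writer
        }
    }
    syncs
}

fn main() {
    let path = std::env::temp_dir().join("group_commit_demo.wal");
    let file = File::create(&path).unwrap();
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || wal_writer(rx, file, 64, Duration::from_micros(100)));

    let clients: Vec<_> = (0..8)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || {
                let (done, ack) = mpsc::channel();
                let data = format!("record {i}\n").into_bytes();
                tx.send(WriteReq { data, done }).unwrap();
                ack.recv().unwrap(); // returns only after the batch is synced
            })
        })
        .collect();
    for c in clients {
        c.join().unwrap();
    }
    drop(tx); // close the channel so the writer exits
    println!("8 writes, {} sync calls", writer.join().unwrap());
    std::fs::remove_file(&path).unwrap();
}
```

The number of syncs printed depends on thread timing; the point is that it is usually well below the number of writes.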
5. Direct I/O + Aligned Buffers
O_DIRECT (Linux) or F_NOCACHE (macOS) bypasses the page cache. To use it you need 4-KiB-aligned buffers (in Rust: Layout::from_size_align(4096, 4096)?; in C++: posix_memalign(&buf, 4096, 4096); in Go: trickier — use golang.org/x/sys/unix.Mmap with MAP_ANON).
- When this matters: when your app has a better cache than the kernel (e.g., Phase 2's block cache). Oracle, MySQL with O_DIRECT, and most flash-tuned engines pick this.
- Self-study: add a pagealloc write-direct subcommand that opens with O_DIRECT and demonstrates the alignment requirement (the program must fail predictably if the buffer is unaligned).
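The Rust side of the alignment requirement can be sketched with std::alloc. The AlignedBuf type is hypothetical, and the sketch only allocates the buffer; actually opening a file with O_DIRECT needs the libc flag and is left out.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A heap buffer whose address is a multiple of `align`, as O_DIRECT requires.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(size: usize, align: usize) -> Self {
        assert!(size > 0, "zero-sized allocations are not supported here");
        let layout = Layout::from_size_align(size, align).expect("bad size/align");
        // SAFETY: layout has non-zero size; allocation failure is checked below.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: ptr is valid for layout.size() bytes and uniquely borrowed.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr came from alloc_zeroed with this exact layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(4096, 4096);
    buf.as_mut_slice()[0] = 0xAB;
    println!("address {:p}, aligned: {}", buf.ptr, buf.ptr as usize % 4096 == 0);
}
```

A plain Vec<u8> gives no such guarantee, which is why the explicit Layout is needed.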
6. Sparse Files & Hole Punching
Files don't have to be contiguous. fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE releases blocks back to the filesystem without changing the file size (the kernel requires the two flags together). Useful for LSM-tree SSTable compaction (free space after removing dead keys) and for journal log truncation.
- Reference: man 2 fallocate
- Self-study: add pagealloc punch <file> <page_no> and verify that the file's apparent size (ls -l <file>) is unchanged while its on-disk size (du -h <file>) shrinks.
7. Crash Testing with dm-flakey (Linux)
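The hard part of storage code is testing the failure cases. dm-flakey is a Linux device-mapper target that simulates a device which periodically turns unreliable.
# 5-second window of normal operation, then 1 second of dropping writes, repeat.
# $size is the device length in 512-byte sectors, e.g. size=$(sudo blockdev --getsz /dev/loop0).
sudo dmsetup create flakey-dev --table "0 $size flakey /dev/loop0 0 5 1 1 drop_writes"
Mount your test filesystem on /dev/mapper/flakey-dev and run your pagealloc write loop across the drop window. Writes that were fsynced during the up interval must survive. Writes issued during the drop window are silently discarded, which simulates a crash at that instant; unmount and remount to see what actually reached the device. Without fsync, even writes from the up interval can be lost. This is how the real engines test durability.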
8. Comparing mmap Yourself
We argued for pread/pwrite. Don't take our word for it — implement pagealloc-mmap as a fourth implementation. Compare:
| Workload | pread | mmap |
|---|---|---|
| Sequential read of 1 GB | ? | ? |
| Random read of 4 KiB × 10⁶ from a 1 GB file (warm) | ? | ? |
| Random read from a 100 GB file (cold) | ? | ? |
| 10⁵ random writes with durability | ? | ? |
Plot the results and write down what surprised you. Bring those numbers to the CIDR 2022 mmap paper (in references.md) and check whether they match.
9. Persistent Memory (PMEM, Optane)
Intel Optane is dead, but Persistent Memory programming patterns survive in CXL.mem and in research kernels. PMEM is byte-addressable like RAM, persistent like SSD, with clwb + sfence as the durability primitive (no fsync). The persistent memory programming library (PMDK) is what to read.
- Reference: https://pmem.io/pmdk/
- Why it matters: if/when CXL persistent memory becomes commodity, every storage engine in this curriculum will need a rewrite. Already, NOVA, SplitFS, and uTree are research designs assuming PMEM.
10. Beyond Disk: Object Storage as a Backing Store
Modern cloud-native databases (Snowflake, Databricks, BigQuery) don't pwrite to local disks; they PUT multi-megabyte objects to object storage such as S3. The trade-offs are wildly different (high latency, near-unlimited aggregate throughput and, for S3, eventual consistency until December 2020). The closest "primitives lab" for that world would replace pread/pwrite with HTTP range requests. Worth thinking about, especially before db-23's capstone.
- Reference: "Lakehouse: A New Generation of Open Platforms" (Armbrust et al., CIDR 2021)
Step 1 — Open a File and Write Bytes
Goal
Build the smallest possible thing that touches the disk: open a file, write some bytes at a known offset, close the file. You'll do this three times — once in Rust, Go, and C++ — so you can feel how each language exposes the same pread/pwrite/fsync primitives.
Prerequisites
- Toolchain installed per ../../TOOLS.md.
- An empty editor and a terminal in this lab's directory.
What You're Building
A function with this signature (conceptually):
write_page(path: string, page_no: u64, bytes: [u8]) -> Result
- Opens (or creates) path for read+write.
- Computes offset = page_no * PAGE_SIZE (with PAGE_SIZE = 4096).
- Zero-pads bytes to exactly PAGE_SIZE.
- pwrites the padded buffer at offset.
- Calls fdatasync (or fsync if fdatasync is unavailable).
- Closes the file.
Why pwrite, not write
The classic POSIX write syscall uses the file's seek pointer (the one lseek moves). That makes it stateful: two threads calling write on the same fd will race. pwrite takes an explicit offset and is thread-safe. Every database in this curriculum uses pwrite. No lseek in our code, ever.
Why PAGE_SIZE = 4096
It matches the OS page size on x86_64 and on most ARM64 Linux systems (Apple Silicon uses 16 KiB pages, which 4 KiB still divides evenly), which means the kernel page cache, the device LBA, and your write line up. Mismatched sizes cause read-modify-write at the kernel layer: writing 100 bytes requires the kernel to first read the 4 KiB page containing those bytes, modify, and write back. By always writing a full page, you avoid that hidden cost.
Why fdatasync Over fsync
fsync flushes data and all metadata (file size, modification time). fdatasync flushes the data plus only the metadata needed to read it back, such as a changed file size. For a write that doesn't change the file size (the common case in a steady-state database), fdatasync can skip the metadata update, which often saves a journal commit per call. Use fdatasync when you can.
Rust Implementation
In ../src/rust/src/lib.rs we use the std::os::unix::fs::FileExt extension trait. Its write_all_at method loops over write_at, which maps to pwrite64 on Linux and pwrite on macOS. Look at the function write_page.
Key idiom:
use std::os::unix::fs::FileExt;

file.write_all_at(&buf, offset)?;
file.sync_data()?; // == fdatasync
sync_data is Rust's portable name for fdatasync on Linux. On macOS the standard library has historically issued fcntl(F_FULLFSYNC) for it; check the std source for your toolchain if the exact call matters.
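Putting the pieces together, a complete write_page could look like the sketch below. It assumes the 16-byte page header described in Step 2 (magic, version, flags, payload_len); it is not a copy of the lab's lib.rs.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

const PAGE_SIZE: usize = 4096;
const HEADER_LEN: usize = 16;
const MAGIC: u64 = 0x4453_4531_5041_4745; // "DSE1PAGE"

/// Write one page: header, payload, zero padding, then a data sync.
fn write_page(path: &Path, page_no: u64, payload: &[u8]) -> io::Result<()> {
    if payload.len() > PAGE_SIZE - HEADER_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let mut buf = [0u8; PAGE_SIZE]; // zero padding comes for free
    buf[0..8].copy_from_slice(&MAGIC.to_le_bytes());
    buf[8..10].copy_from_slice(&1u16.to_le_bytes()); // version
    // bytes 10..12 are flags = 0
    buf[12..16].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    buf[HEADER_LEN..HEADER_LEN + payload.len()].copy_from_slice(payload);

    let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    file.write_all_at(&buf, page_no * PAGE_SIZE as u64)?; // positional write, no seek pointer
    file.sync_data()?; // fdatasync on Linux
    Ok(()) // the fd closes when `file` drops
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("write_page_demo.bin");
    write_page(&path, 0, b"first page")?;
    let bytes = std::fs::read(&path)?;
    println!("{} bytes, magic on disk: {:?}", bytes.len(), String::from_utf8_lossy(&bytes[..8]));
    std::fs::remove_file(&path)
}
```

Writing page N of a fresh file leaves pages 0 to N-1 as a hole that reads back as zeros, so the file size is always (N + 1) * PAGE_SIZE.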
Go Implementation
In ../src/go/pagealloc.go, the WriteAt method is pwrite, and f.Sync() is fsync. There is no first-class fdatasync in os, so we call unix.Fdatasync(fd) from golang.org/x/sys/unix.
if _, err := f.WriteAt(buf, offset); err != nil { return err }
return unix.Fdatasync(int(f.Fd()))
On macOS, unix.Fdatasync is not exported (the kernel doesn't have it). We fall back to unix.FcntlInt(fd, unix.F_FULLFSYNC, 0). The wrapper in fsync_full.go handles the platform branch.
C++ Implementation
In ../src/cpp/src/pagealloc.cc:
ssize_t n = ::pwrite(fd, buf.data(), buf.size(), offset);
if (n != static_cast<ssize_t>(buf.size())) return std::errc::io_error;
// Never ignore a sync failure: the data may not be durable.
if (::fdatasync(fd) != 0) return std::errc::io_error;
On macOS we use ::fcntl(fd, F_FULLFSYNC). The dispatch is in fsync_full.cc.
Try It
cd src/rust && cargo build --release
./target/release/pagealloc write /tmp/step1.bin 0 "first page"
xxd -l 32 /tmp/step1.bin
Expected output:
00000000: 4547 4150 3145 5344 0100 0000 0a00 0000 EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000 first page......
The first 8 bytes are our little-endian page magic 0x44534531_50414745 (read as bytes left-to-right: 45 47 41 50 31 45 53 44).
Bytes 8–15 hold the rest of the header: version = 1 (01 00), flags = 0 (00 00), and payload_len = 10 (0a 00 00 00).
Bytes 16+ contain your ASCII payload "first page" followed by zero-padding to 4 KiB.
What Just Happened
- You opened a file (open(2) with O_RDWR | O_CREAT).
- You wrote exactly one page at exactly one offset (pwrite(2)).
- You forced the data to stable storage (fdatasync(2) or F_FULLFSYNC on macOS).
- You closed the fd, which does not flush: close(2) returns immediately.
On a power loss between step 3 and step 4, your write survives. Without step 3, it might not.
Next
In Step 2 you'll add the read path and a hexdump utility, and verify that all three implementations produce byte-identical files.
Step 2 — pread and Hexdump
Goal
Implement the read side (pread) and a hexdump utility, then prove cross-implementation byte-compatibility: a file written by Rust must read identically from Go and C++.
The Read Side
Symmetric to Step 1:
read_page(path: string, page_no: u64) -> [u8; PAGE_SIZE]
- Open the file read-only.
- pread(fd, buf, PAGE_SIZE, page_no * PAGE_SIZE).
- Return the buffer (caller will strip trailing zeros or use the magic header to validate).
Rust
file.read_exact_at(&mut buf, offset)?;
Go
n, err := f.ReadAt(buf, offset)
if err != nil && err != io.EOF { return nil, err }
ReadAt returns io.EOF if n < len(buf) — this is normal for the last page of a file that hasn't been preallocated. Tests handle this case.
C++
ssize_t n = ::pread(fd, buf.data(), buf.size(), offset);
if (n < 0) return std::errc::io_error;
buf.resize(n); // shrink if short read
Page Header Format
Every non-empty page in our format begins with a 16-byte header:
offset size field
------ ---- -----
0 8 magic = 0x44534531_50414745 (LE: 45 47 41 50 31 45 53 44 ; ASCII reversed: "EGAP1ESD")
8 2 version = 1
10 2 flags = 0
12 4 payload_len (LE u32, number of bytes used after the header)
16 n payload bytes
n+16 — zero-pad to PAGE_SIZE
This is a deliberately simple format — we'll grow it in later labs. For now it gives us:
- A magic number to detect "is this a valid page?"
- A version field to evolve the format later.
- An explicit payload length so we don't have to scan for zeros (zeros are valid bytes in real data).
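The header checks can be sketched as a small validating parser. The PageError variants and the parse_page name are hypothetical; only the field offsets come from the format table in this section.

```rust
use std::convert::TryInto;

const PAGE_SIZE: usize = 4096;
const HEADER_LEN: usize = 16;
const MAGIC: u64 = 0x4453_4531_5041_4745;

#[derive(Debug, PartialEq)]
enum PageError {
    TooShort,
    BadMagic,
    BadVersion(u16),
    BadLength(u32),
}

/// Validate the 16-byte header and return the payload slice.
fn parse_page(page: &[u8]) -> Result<&[u8], PageError> {
    if page.len() < HEADER_LEN {
        return Err(PageError::TooShort);
    }
    let magic = u64::from_le_bytes(page[0..8].try_into().unwrap());
    if magic != MAGIC {
        return Err(PageError::BadMagic);
    }
    let version = u16::from_le_bytes(page[8..10].try_into().unwrap());
    if version != 1 {
        return Err(PageError::BadVersion(version));
    }
    let len = u32::from_le_bytes(page[12..16].try_into().unwrap());
    let end = HEADER_LEN + len as usize;
    if end > page.len() {
        return Err(PageError::BadLength(len));
    }
    Ok(&page[HEADER_LEN..end])
}

fn main() {
    let mut page = vec![0u8; PAGE_SIZE];
    page[0..8].copy_from_slice(&MAGIC.to_le_bytes());
    page[8..10].copy_from_slice(&1u16.to_le_bytes());
    page[12..16].copy_from_slice(&3u32.to_le_bytes());
    page[16..19].copy_from_slice(b"a\0b"); // a zero byte inside the payload survives
    println!("{:?}", parse_page(&page));
}
```

Because the length is explicit, a payload containing zero bytes round-trips intact, which scanning for the first null could not guarantee.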
The Hexdump Utility
A canonical 16-byte-per-line hex dump:
00000000: 4547 4150 3145 5344 0100 0000 0a00 0000 EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000 first page......
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
...
Format spec:
- 8-digit hex offset, then a colon (:).
- 16 bytes per line, grouped 2 bytes per word, separated by single space.
- 2-space gap before the ASCII rendering.
- ASCII rendering: printable ASCII as itself, otherwise a dot (.).
This format matches xxd output for easy diff-based cross-language verification.
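The format spec translates into a short formatter. This is a sketch with hypothetical function names, written to follow the layout rules listed above rather than copied from any of the three implementations.

```rust
/// Format one line of up to 16 bytes in the layout described above.
fn hexdump_line(offset: usize, chunk: &[u8]) -> String {
    let mut line = format!("{:08x}:", offset);
    for i in 0..16 {
        if i % 2 == 0 {
            line.push(' '); // one space before each 2-byte word
        }
        match chunk.get(i) {
            Some(b) => line.push_str(&format!("{:02x}", b)),
            None => line.push_str("  "), // pad a short final line
        }
    }
    line.push_str("  "); // 2-space gap before the ASCII column
    for &b in chunk {
        line.push(if (0x20..0x7f).contains(&b) { b as char } else { '.' });
    }
    line
}

/// Dump a whole buffer, 16 bytes per line.
fn hexdump(data: &[u8]) -> String {
    data.chunks(16)
        .enumerate()
        .map(|(i, c)| hexdump_line(i * 16, c))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let mut page = b"EGAP1ESD".to_vec();
    page.extend_from_slice(&[1, 0, 0, 0, 10, 0, 0, 0]);
    page.extend_from_slice(b"first page");
    println!("{}", hexdump(&page));
}
```

Padding the hex column on a short final line keeps the ASCII column aligned, which matters when diffing against another implementation's output.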
Cross-Implementation Test
This is the most important check in this lab. Run scripts/cross_test.sh:
bash scripts/cross_test.sh
What it does (excerpt):
$RUST write /tmp/xt.bin 0 "from rust"
$GO write /tmp/xt.bin 1 "from go"
$CPP write /tmp/xt.bin 2 "from cpp"
$RUST hexdump /tmp/xt.bin > /tmp/h.rust
$GO hexdump /tmp/xt.bin > /tmp/h.go
$CPP hexdump /tmp/xt.bin > /tmp/h.cpp
diff /tmp/h.rust /tmp/h.go || { echo "RUST/GO mismatch"; exit 1; }
diff /tmp/h.rust /tmp/h.cpp || { echo "RUST/CPP mismatch"; exit 1; }
echo "cross-language byte-compat OK"
If this fails, the most common bugs are:
- Wrong endianness on the magic or payload_len.
- Forgetting to zero-pad the page (one impl leaves junk past the payload).
- Off-by-one on the offset calculation (page_no * PAGE_SIZE vs (page_no + 1) * PAGE_SIZE).
What Just Happened
You now have a portable, file-format-compatible storage primitive across three languages. This is the foundation for every later lab — the WAL in db-03 is exactly this with append-only semantics, and the SSTable in db-06 is this with a richer block format.
Next
Step 3: measure latency, demonstrate the page cache, and understand why your second read of a page is roughly 50× faster than the first on NVMe (and thousands of times faster on a spinning disk).
Step 3 — Benchmark and the Page Cache
Goal
See the page cache with your own eyes by measuring warm-cache and cold-cache pread latency. This is the experiment that should make you suspicious of every microbenchmark you ever read.
The Benchmark
pagealloc bench <file> <pages> <iters>:
- Preallocate file to pages * 4 KiB using a sequential write loop.
- fsync to make sure it's on disk.
- Time iters random preads of one page each.
- Drop the page cache.
- Time iters random preads again.
- Print p50/p99/p999 for each phase plus throughput in MB/s.
Implementation lives in:
- ../src/rust/src/bin/pagealloc.rs (bench subcommand)
- ../src/go/cmd/pagealloc/main.go (bench subcommand)
- ../src/cpp/src/main.cc (bench subcommand)
Dropping the Page Cache
On Linux:
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Our benchmark binary calls this automatically if it can. If it can't (no sudo), it prints a warning and skips the cold phase.
On macOS:
sudo purge
Same logic — the binary attempts it, warns if it can't.
Expected Numbers
On a modern laptop with NVMe:
$ ./pagealloc bench /tmp/bench.bin 65536 50000 # 256 MB file, 50k iters
preallocated 65536 pages = 262144 KiB
WARM cache:
iterations : 50000
p50 : 1.2 µs
p99 : 5.8 µs
p99.9 : 18 µs
throughput : 1840 MB/s
dropped page cache
COLD cache:
iterations : 50000
p50 : 64 µs
p99 : 180 µs
p99.9 : 340 µs
throughput : 56 MB/s
Two observations:
- Warm cache is ~50× faster than cold. The page cache makes microbenchmarks lie. If you benchmarked a database after a warm-up run, you measured memcpy, not disk.
- p99 is 3–5× p50, even on cold cache. Latency tails come from queue depth, kernel scheduling, NVMe garbage collection. This motivates io_uring (Lab 21) and request hedging in distributed systems (Lab 20).
On a Spinning Disk (if you have one)
COLD cache:
p50 : 6.4 ms ← 100× worse than NVMe
p99 : 18 ms
throughput : 0.6 MB/s ← versus 56 MB/s on NVMe
This 100× gap is why LSM-trees exist. Random reads on HDD are unworkable for OLTP; engines either:
- Avoid them (sequential append-only logs).
- Hide them behind cache (large block caches + bloom filters).
- Punt to SSD (HDD as cold tier only).
Throughput vs Latency
Watch what happens with iters = 1000 vs iters = 100000:
$ ./pagealloc bench /tmp/bench.bin 65536 1000
WARM throughput : 4200 MB/s
$ ./pagealloc bench /tmp/bench.bin 65536 100000
WARM throughput : 1800 MB/s
Higher iteration counts touch more distinct pages, so fewer reads hit the CPU caches and the TLB, exposing memory bandwidth and TLB misses. Real workloads sit between these. A single benchmark number is almost always wrong.
Try This
Add a flag to control the access pattern: sequential vs random. Sequential preads benefit from the kernel's read-ahead heuristic. On the same NVMe device you should see:
RANDOM cold : 56 MB/s
SEQUENTIAL cold : 2400 MB/s ← 40× faster, all due to read-ahead
This is why scans are cheap and point lookups are expensive — even on SSD.
What Just Happened
You measured the page cache, the access pattern's effect on throughput, and the gap between p50 and p99. These three insights drive every storage design in this curriculum:
- Page cache exists → your in-process block cache (Lab 8) must be smarter than LRU on raw bytes, otherwise you're duplicating the kernel's work.
- Sequential >> random → LSM-tree compaction (Lab 7) sorts data on disk to convert all future reads to sequential ranges.
- p99 >> p50 → consensus heartbeats (Lab 17) must tolerate occasional 100× slow fsyncs without triggering elections.
Next
You've finished Lab 01. Run docs/verification.md to confirm all 8 checks pass. Then move on to db-02-data-structures-for-storage.
Data Structures for Storage
Status: Complete. Companion to db-01 and prerequisite for db-05 (MemTable) and db-10 (B-Tree).
1. What Is It
This lab is about the in-memory data structures that databases use, and why those choices change completely when the data is on disk. We build two structures from scratch:
- A skip list — an ordered, probabilistic, pointer-based map. This is what LevelDB and RocksDB use for their MemTable, and what Redis uses for sorted sets.
- A hash table with open-addressing + Robin Hood probing — an unordered, array-backed map. This is what you use when you need O(1) point lookups and don't need ordering.
We then benchmark them against each other on three workloads (insert, point lookup, range scan) and explain why the numbers come out the way they do.
This lab does not implement a B-Tree or a B+-Tree — those are on-disk structures and arrive in db-10. The lesson here is why a B-Tree dominates on disk even though a skip list or hash table is faster in RAM.
2. Why It Matters
Every database has a critical-path data structure for "find this key":
| System | Structure | Why |
|---|---|---|
| LevelDB / RocksDB MemTable | Skip list | Lock-free reads, ordered iteration for flush to SSTable |
| Redis sorted-set (ZSET) | Skip list + hash table | Skip list for ranked access, hash for O(1) by-key |
| Memcached, Java HashMap | Hash table (separate chaining) | Unordered, point lookup only |
| SQLite, PostgreSQL, InnoDB | B+-Tree | On-disk: minimize page reads |
| Cassandra, ScyllaDB MemTable | Skip list | Same reasoning as LevelDB |
| Lucene postings | Skip list + delta-encoded arrays | Range scans over sorted doc IDs |
If you pick the wrong structure you don't get "a little slower" — you get "100× slower" or "we run out of memory at 100M keys." The cost model for an in-memory structure (cycles, cache misses) is not the cost model for an on-disk one (page reads, sync latency), and a structure tuned for one will lose badly in the other domain.
3. How It Works
Skip list
A skip list is a stack of linked lists. The bottom list contains every key in sorted order. Each higher list contains a random subset of the keys below it, sampled with probability p (we use p = 0.5). To find a key, you walk right on the highest level until you'd overshoot, then drop down a level, repeat.
level 3: HEAD ────────────────────────────────────────────────► NIL
│ │
level 2: HEAD ────────► [13] ───────────────► [42] ────────────► NIL
│ │ │ │
level 1: HEAD ──► [7] ─► [13] ─► [21] ───────► [42] ─► [55] ──► NIL
│ │ │ │ │ │ │
level 0: HEAD ► [3]►[7]►[13]►[21]►[34]►[39]►[42]►[51]►[55] ──► NIL
Searching for 39 from the top:
- L3: HEAD → NIL (overshoot from HEAD), drop to L2
- L2: HEAD → 13 → 42? 42 > 39, drop to L1
- L1: 13 → 21 → 42? 42 > 39, drop to L0
- L0: 21 → 34 → 39. Found.
Expected search cost: about (1/p) · log_{1/p}(n) comparisons, which is O(log n). With p = 0.5 that's 2 · log₂ n comparisons.
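The walk over the figure can be sketched in a few lines of Rust. This is a minimal illustration, not the lab's implementation: it assumes an index-based arena (`Vec<Node>` with `usize` links) instead of pointers, and the names `build`, `search`, and `figure_list` are made up for this example. Node heights are read off the figure rather than drawn from an RNG.

```rust
// Index-based skip list holding the keys from the figure above.
// Node 0 is the HEAD sentinel; `forward[i]` is the next node on level i.
struct Node {
    key: u32,                    // ignored for HEAD
    forward: Vec<Option<usize>>, // length = height of this node
}

/// Build a list from (key, height) pairs given in ascending key order.
fn build(keys: &[(u32, usize)], max_level: usize) -> Vec<Node> {
    let mut nodes = vec![Node { key: 0, forward: vec![None; max_level] }];
    // last[i] = index of the most recent node that reaches level i
    let mut last = vec![0usize; max_level];
    for &(key, height) in keys {
        let idx = nodes.len();
        nodes.push(Node { key, forward: vec![None; height] });
        for lvl in 0..height {
            nodes[last[lvl]].forward[lvl] = Some(idx);
            last[lvl] = idx;
        }
    }
    nodes
}

/// Walk right while the next key is smaller than `target`, then drop a level.
/// Returns (found, number of key comparisons).
fn search(nodes: &[Node], target: u32) -> (bool, usize) {
    let mut x = 0; // start at HEAD
    let mut comparisons = 0;
    for lvl in (0..nodes[0].forward.len()).rev() {
        while let Some(next) = nodes[x].forward[lvl] {
            comparisons += 1;
            if nodes[next].key < target {
                x = next; // safe to advance
            } else {
                break; // would overshoot: drop down
            }
        }
    }
    let found = matches!(nodes[x].forward[0], Some(n) if nodes[n].key == target);
    (found, comparisons)
}

/// Heights from the figure: 13 and 42 reach level 2; 7, 21, 55 reach level 1.
fn figure_list() -> Vec<Node> {
    build(
        &[(3, 1), (7, 2), (13, 3), (21, 2), (34, 1), (39, 1), (42, 3), (51, 1), (55, 2)],
        4,
    )
}

fn main() {
    let list = figure_list();
    println!("search 39 -> {:?}", search(&list, 39));
    println!("search 40 -> {:?}", search(&list, 40));
}
```

On this list the search for 39 visits the same nodes as the trace above (13 on level 2, 21 on level 1, 34 on level 0) and stops with 39 as the level-0 successor.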
Hash table (open addressing, Robin Hood)
An array of 2^k slots. Hash the key and mask it to the table size (hash & (2^k − 1)); that's the home slot. If it is occupied by a different key, probe linearly (slot+1, slot+2, ...). Robin Hood twist: if the key being inserted is already d slots from its home and the resident of the current slot is only d' < d slots from its own home, swap them and carry on inserting the displaced resident. The "rich" entry (close to home) gets displaced by the "poor" one. This evens out probe distances across keys, so the longest probe stays close to the mean.
home(K1)=2 home(K2)=2 home(K3)=4
hash before insert:
┌──┬──┬────┬────┬────┬──┬──┬──┐
│ │ │ K1 │ K2 │ K3 │ │ │ │
└──┴──┴────┴────┴────┴──┴──┴──┘
0 1 2 3 4 5 6 7
(K1 home=2 dist=0, K2 home=2 dist=1, K3 home=4 dist=0)
insert K4 with home=2:
probe slot 2 (K1, dist 0); K4 dist=0; equal — keep going
probe slot 3 (K2, dist 1); K4 dist=1; equal — keep going
probe slot 4 (K3, dist 0); K4 dist=2 > 0 — STEAL, K3 displaced
probe slot 5 (empty); place K3 with dist=1
result:
┌──┬──┬────┬────┬────┬────┬──┬──┐
│ │ │ K1 │ K2 │ K4 │ K3 │ │ │
└──┴──┴────┴────┴────┴────┴──┴──┘
Loads up to ~0.9 work well with Robin Hood; we resize at 0.85 (× 2 capacity, rehash all).
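The K1–K4 example above can be replayed with a small Rust sketch. It is illustrative only: each entry carries an explicit `home` slot in place of a real hash so the trace is reproducible, and `Entry`, `insert`, and `example_table` are names invented for this example. Duplicate keys and resizing are deliberately left out.

```rust
// One entry of the 8-slot example table. `home` stands in for
// hash(key) & (capacity - 1); a real table would store the hash instead.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    key: &'static str,
    home: usize,
}

/// Robin Hood insert with linear probing. Assumes the table length is a
/// power of two, at least one slot is empty, and `e.key` is not present.
fn insert(slots: &mut [Option<Entry>], mut e: Entry) {
    let mask = slots.len() - 1;
    let mut i = e.home;
    let mut dist = 0; // how far the carried entry is from its home
    loop {
        let current = slots[i];
        match current {
            None => {
                slots[i] = Some(e);
                return;
            }
            Some(resident) => {
                // Distance of the resident from its own home slot.
                let resident_dist = (i + slots.len() - resident.home) & mask;
                if resident_dist < dist {
                    // Steal: the carried entry takes the slot and we keep
                    // probing on behalf of the displaced resident.
                    slots[i] = Some(e);
                    e = resident;
                    dist = resident_dist;
                }
            }
        }
        i = (i + 1) & mask;
        dist += 1;
    }
}

/// Insert K1..K4 with the home slots used in the example above.
fn example_table() -> Vec<Option<Entry>> {
    let mut slots = vec![None; 8];
    for (key, home) in [("K1", 2), ("K2", 2), ("K3", 4), ("K4", 2)] {
        insert(&mut slots, Entry { key, home });
    }
    slots
}

fn main() {
    for (i, slot) in example_table().iter().enumerate() {
        println!("{i}: {:?}", slot.map(|e| e.key));
    }
}
```

The final layout matches the "result" row: K4 takes slot 4 and K3 moves one slot right.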
When in-memory and on-disk diverge
A skip-list node holding a 16-byte key + 16-byte value + 4 forward pointers is ~64 bytes in memory. The same record packed into a B+-Tree leaf page (4 KiB page, no per-record pointers) is ~36 bytes — no level header, no forward-pointer array. And the B+-Tree co-locates ~100 records in one page, so a range scan of 100 keys is 1 page read instead of 100 random reads.
On disk:
- A pointer is an 8-byte offset that triggers a page read (~100 µs cold).
- A cache miss is ~100× a cache hit.
- A page is 4 KiB whether you read 1 byte or 4096.
Therefore on-disk structures want high fan-out, low height, contiguous siblings, no random pointer chasing. Skip lists violate all four; B+-Trees satisfy all four.
4. Core Terminology
| Term | Definition |
|---|---|
| Skip list | Probabilistic ordered map: stack of linked lists with geometric level distribution |
| Level | Index of a forward-pointer array in a skip-list node (0 = bottom, dense; higher = sparser) |
| p | Per-level promotion probability (we use 0.5) |
| Sentinel head | Dummy node with the maximum possible level; all searches start here |
| Open addressing | Collision resolution: probe other slots in the same array (vs chaining into a list) |
| Linear probing | Probe sequence is h, h+1, h+2, … (best cache behavior) |
| Robin Hood | On insert, displace any resident whose probe distance is smaller than the newcomer's |
| Probe distance / PSL | Slots between a key's home slot and its actual slot (probe sequence length) |
| Load factor | len / capacity |
| Tombstone | Sentinel marking a deleted slot so probes don't short-circuit (we use backward-shift deletion instead) |
| Backward-shift deletion | After deleting a slot, shift the following non-home entries left by one; avoids tombstones |
| Cache line | 64 bytes on x86_64, 128 bytes on Apple Silicon; the unit the CPU fetches from RAM |
| Pointer chasing | A traversal whose next address depends on the byte just loaded; CPU cannot prefetch |
| Fan-out | Number of children per internal node in a tree; B+-Trees aim for hundreds |
5. Mental Models
"A skip list is a binary search you can mutate cheaply"
A balanced BST and a skip list have the same asymptotic complexity. The skip list wins because it has no rebalancing: no rotations, no recoloring. Each insert is one geometric coin flip + N forward-pointer writes (N = node height: 2 on average for p = 0.5, at most about log₂ n). This makes it much easier to make concurrent: LevelDB's MemTable allows lock-free reads while a writer inserts, because a partially-published node is invisible until its bottom-level pointer is atomically published.
"A hash table is a sparse array you pretend is dense"
If keys were integers in [0, N) you'd use an array. A hash function fakes that: it maps any key into [0, capacity). Collisions are the cost of the lie. Robin Hood specifically equalizes the cost of the lie across keys, so the worst-case lookup is close to the average.
"Cost models differ by 5 orders of magnitude"
L1 cache hit ~1 ns
Main memory ~100 ns
NVMe SSD random read (cold) ~100 µs ← 1000× RAM
HDD seek ~10 ms ← 100× SSD
Cross-DC round trip ~50 ms
A "fast" in-memory structure becomes irrelevant if it issues 10 page reads where a B+-Tree issues 1. The B+-Tree's "slow" O(log_B n) with B = 256 beats the skip list's "fast" O(log₂ n) on disk by a factor of log₂(256) = 8 — every level you avoid saves 100 µs.
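The factor of 8 can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes one cold page read per level and uses the ~100 µs figure from the table above. `height` is a helper invented for this example, and a real skip list follows more than one pointer per level, so its count is a lower bound.

```rust
/// Levels needed to index `n` keys when each node has `fanout` children:
/// the smallest h with fanout^h >= n.
fn height(n: u64, fanout: u64) -> u32 {
    let mut levels = 0;
    let mut reach: u128 = 1;
    while reach < n as u128 {
        reach *= fanout as u128;
        levels += 1;
    }
    levels
}

fn main() {
    const PAGE_READ_US: u32 = 100; // cold NVMe read, from the table above
    let n = 16 * 1024 * 1024; // 16 Mi keys
    for (name, fanout) in [("skip list (p = 0.5)", 2), ("B+-Tree (B = 256)", 256)] {
        let h = height(n, fanout);
        println!(
            "{name}: {h} levels, ~{} µs if every level is a cold page read",
            h * PAGE_READ_US
        );
    }
}
```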
6. Common Misconceptions
- "Skip lists are slow because they're probabilistic." No — the expected and with-high-probability bounds are tight. The variance is small for any list above ~1000 keys. Failure modes are bad RNG seed (we use a deterministic xorshift here) and adversarial key insertion patterns (irrelevant for hashed keys; mitigated by per-instance seed).
- "Hash tables have O(1) worst case." Average, not worst. A pathological hash function or adversarial keys produce O(n) chains. Robin Hood mitigates variance but does not change the worst case.
- "You should always use a hash table when you don't need ordering." Two cases where skip lists or trees win even unordered: (a) you need iteration in any deterministic order across runs; (b) memory is tight and you can't afford the 15–40% slack of a hash table at safe load factors.
- "Open addressing wastes memory because of empty slots." Chaining wastes more in practice: every chain node is a heap allocation with a header (`malloc` arenas + pointer + next ptr ≈ 32 bytes overhead per entry). Linear-probing hash tables with 70% load factor still use less memory than `std::unordered_map`.
- "A B-Tree is just a balanced BST with more children." No: B+-Trees keep all data in leaves and chain leaves with sibling pointers, making range scans `O(1)` per page after the initial descent.
- "`std::unordered_map` / Go `map` / Python `dict` are the gold standard." They're general-purpose. Specialized hash tables (Abseil's `flat_hash_map`, Rust's `hashbrown`, F14) beat them by 2–5× on most workloads. Database authors often roll their own.
7. Interview Talking Points
- "Why does LevelDB use a skip list for the MemTable instead of a red-black tree?" → Lockless reads via single-pointer CAS publish; no rotations; easier to implement correctly.
- "Why isn't a hash table good enough for a MemTable?" → MemTable flushes to an SSTable, which is a sorted file. A hash table would require sorting at flush time (O(n log n)); a skip list is already sorted, so flush is O(n) sequential.
- "When would you use chaining vs open addressing?" → Open addressing for small fixed-size values (better cache); chaining when values are large and you want pointer stability across resizes.
- "What's the cost model on disk that breaks skip lists?" → Each level traversal is a potential page read. With `log₂ n` levels you pay `log₂ n × 100 µs`. A B+-Tree with fan-out 256 has `log_256 n` levels, so 3 page reads for 16 M keys vs 24.
- "Why is Robin Hood probing useful?" → Bounds variance: the maximum probe distance grows as `O(log n)` w.h.p., and lookups for missing keys become almost as fast as hits because you can stop as soon as you see a slot with smaller PSL than yours.
- "What's the alternative to tombstones?" → Backward-shift deletion: walk forward from the deleted slot, shift each non-home entry left until you hit an empty slot or a home-slotted entry. O(probe-length) per delete, no tombstone bookkeeping.
8. Connections to Other Labs
- db-01 Storage Primitives — the page abstraction the disk structures use.
- db-03 WAL — the WAL is appended before the MemTable insert; failure recovery rebuilds the MemTable by replaying the WAL.
- db-04 Bloom Filters — sits in front of the SSTable; same family of probabilistic in-memory structures.
- db-05 LSM MemTable — uses the skip list from this lab, adds a snapshot / immutable flip.
- db-10 B-Tree Fundamentals — contrast with this lab; same problem, different cost model.
- db-21 Storage Engine Advanced — concurrent skip list (CAS publish), concurrent hash table (extendible hashing, lock striping).
References — db-02 Data Structures for Storage
Skip lists
- Pugh, W. (1990). "Skip Lists: A Probabilistic Alternative to Balanced Trees." CACM 33(6). The original paper; six pages, very readable.
  https://www.cs.umd.edu/~pugh/galileo/papers/CACM_Skiplist_1990.pdf
- LevelDB MemTable source: skip list with a single allocator arena.
  https://github.com/google/leveldb/blob/main/db/skiplist.h
- RocksDB InlineSkipList: production skip list with per-node tail allocation.
  https://github.com/facebook/rocksdb/blob/main/memtable/inlineskiplist.h
- Redis `t_zset.c`: skip list with per-node `span` field for O(log n) rank queries.
  https://github.com/redis/redis/blob/unstable/src/t_zset.c
Hash tables
- Celis, P., Larson, P.-Å., Munro, J. I. (1985). "Robin Hood Hashing." FOCS '85. Original paper.
- Pedro Celis's thesis on Robin Hood hashing (1986). The probe-distance analysis is here.
  https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf
- Emmanuel Goossaert's deep dive: accessible and runnable.
  https://codecapsule.com/2013/11/11/robin-hood-hashing/
- Google Abseil `flat_hash_map`: SIMD probing on top of open addressing.
  https://abseil.io/about/design/swisstables
- Rust `hashbrown`: port of Abseil's SwissTable; the implementation behind `std::collections::HashMap`.
  https://github.com/rust-lang/hashbrown
Cost models & cache
- Drepper, U. (2007). "What Every Programmer Should Know About Memory."
  https://akkadia.org/drepper/cpumemory.pdf
- Jeff Dean's "Numbers Every Programmer Should Know": the canonical latency hierarchy.
  https://gist.github.com/jboner/2841832
- Pavlo, A. "Database Storage I" (CMU 15-445). The disk-vs-RAM cost-model lecture.
  https://15445.courses.cs.cmu.edu/fall2023/slides/03-storage1.pdf
Tree structures (preview for db-10)
- Bayer, R., McCreight, E. (1972). "Organization and Maintenance of Large Ordered Indices." Acta Informatica. The original B-Tree paper.
- Comer, D. (1979). "The Ubiquitous B-Tree." ACM Computing Surveys. The canonical survey.
Background
- Sedgewick & Wayne, Algorithms 4th ed.
- Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS) 3rd ed. — Ch. 11 (hash tables), Ch. 17 (amortized analysis).
Analysis — Design Decisions
Every choice here is reversible in code but irreversible in performance: change one, all the others bend with it.
D1. Skip list over balanced BST (red-black, AVL)
| | Skip list | Red-black tree |
|---|---|---|
| Insert / lookup | O(log n) expected | O(log n) worst case |
| Implementation LOC | ~120 (Rust) | ~400+ |
| Concurrent reads | Lock-free with seqlock or single-CAS publish | Requires hand-over-hand locking |
| Cache locality | Poor (pointer chasing) | Poor (pointer chasing) |
| Worst-case bound | w.h.p., not absolute | Absolute |
Choice: skip list. The simplicity (no rotations, no rebalancing, no parent pointers) is the value proposition. The absolute worst-case bound matters little here: node heights come from our own RNG, not from the keys, so no insertion order can force a bad shape.
When you'd flip: real-time systems where a probabilistic O(n) blowup is unacceptable. Database MemTables don't qualify.
D2. Hash table: open addressing + Robin Hood, not chaining
| | Open addressing (linear) | Chaining |
|---|---|---|
| Memory per entry | (1/loadfactor) × sizeof(slot) | sizeof(entry) + sizeof(ptr) + malloc overhead |
| Cache misses per lookup | 0–1 typically | 1–3 typically |
| Pointer stability across resize | No | Yes |
| Deletion | Backward-shift or tombstone | Free a list node |
| Tail latency at high load | Spikes near 1.0 | Degrades gracefully |
Choice: open addressing, linear probing, Robin Hood. The cache-miss saving is decisive at small/medium values, which is the database use case.
When you'd flip: large values (≥1 KiB) where you want pointer stability so external references survive resize.
D3. p = 0.5 for skip list, not p = 0.25
The expected number of comparisons per search is (1/p) · log_{1/p}(n). Minimizing this gives p = 1/e ≈ 0.37. In practice:
| p | Expected comparisons (n=1M) | Levels | Memory per node |
|---|---|---|---|
| 0.25 | ~40 | ~10 | 1.33 forward ptrs avg |
| 0.5 | ~40 | ~20 | 2 forward ptrs avg |
| 1/e | ~38 | ~14 | 1.58 forward ptrs avg |
We pick p = 0.5 because (a) bit-shift sampling (count trailing zeros of a random u64) is one instruction, and (b) the code stays trivial. The ~6% theoretical improvement from p=1/e is not worth the table-lookup or log math.
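To see how the choice of p moves the numbers, the cost model stated at the top of this section can be evaluated directly. The sketch below only plugs values into (1/p) · log_{1/p}(n) and into the expected pointer count 1/(1 − p); `model` is a name made up for this example and nothing here measures a real skip list.

```rust
/// Evaluate the search-cost model (1/p) * log_{1/p}(n), the level count
/// log_{1/p}(n), and the expected forward pointers per node, 1 / (1 - p).
fn model(p: f64, n: f64) -> (f64, f64, f64) {
    let levels = n.ln() / (1.0 / p).ln(); // log_{1/p}(n)
    let comparisons = levels / p;
    let ptrs_per_node = 1.0 / (1.0 - p);
    (comparisons, levels, ptrs_per_node)
}

fn main() {
    let n = 1_000_000.0;
    for (label, p) in [("0.25", 0.25), ("0.5", 0.5), ("1/e", (-1.0f64).exp())] {
        let (cmp, levels, ptrs) = model(p, n);
        println!(
            "p = {label:>4}: ~{cmp:.1} comparisons, ~{levels:.1} levels, {ptrs:.2} ptrs/node"
        );
    }
}
```

Under this model p = 0.25 and p = 0.5 need the same number of comparisons (4 · log₄ n = 2 · log₂ n), so the difference between them is memory per node and sampling convenience.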
D4. Max level = 32
A skip list with p=0.5 and n entries has expected max level log₂ n. At max level 32 we support n = 2^32 ≈ 4 G entries before quality degrades. Going higher costs 8 bytes per extra level in the head sentinel and in every insert's update array (and in every node, if nodes use fixed-size forward arrays). Going lower caps the size at which searches stay logarithmic.
D5. Hash function: FNV-1a 64-bit
We need a hash that is:
- High quality enough for non-adversarial keys (passes basic distribution tests)
- Fast for small keys (~10 cycles per 8 bytes)
- Identical in all three languages so cross-language tests can compare counts
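FNV-1a is 6 lines of code and deterministic. On its own it avalanches poorly on short, sequential keys (step 02 measured max_probe of 206 at N=100k against an expected bound of about 66), so the table hash is FNV-1a followed by a SplitMix64 finalizer. We use it because we control the input keys in this lab; in production you'd switch to xxHash3 or SipHash.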
When you'd flip: keys controlled by adversaries → SipHash with a per-instance random key.
D6. Load factor 0.85, grow ×2
Robin Hood handles load factor up to ~0.9 well; beyond that the variance of probe distance blows up. We resize at 0.85 to leave headroom and double the capacity (and rehash). Cost of resize is amortized O(1) per insert.
D7. Backward-shift deletion, no tombstones
Tombstones simplify code but bloat the table over time — a "delete-then-insert" workload fills the table with markers and forces a resize. Backward-shift deletion costs O(PSL) per delete (typically <5 slots) and keeps the table compact. The implementation walks forward from the deleted slot, moves each entry one slot left until reaching an empty slot or an entry whose PSL is 0.
D8. Seed: deterministic per-instance, random across instances
The skip list RNG must not be deterministically the same across runs. But within one process run we want reproducible behavior for debugging. The CLI accepts an optional seed; tests pass a fixed value.
Cost-model cheat sheet
| Operation | Latency |
|---|---|
| L1 cache hit | ~1 ns |
| L2 cache hit | ~4 ns |
| L3 cache hit | ~15 ns |
| Main memory | ~100 ns |
| 4 KiB page from SSD (cold) | ~100 µs |
| 4 KiB page from HDD | ~10 ms |
Every random pointer dereference is a potential L3 miss → 100 ns. A skip list with 20 levels follows at least 20 such pointers per lookup, so up to ~2 µs when every one misses. A linear-probing hash table touches 1–2 cache lines: 100–200 ns under the same all-miss assumption, ~10 ns when the lines are already cached. That makes the hash table roughly 10–20× faster for point lookups in RAM, and the gap grows as the working set leaves L3.
What breaks at scale
| Symptom | Cause | Mitigation |
|---|---|---|
| Skip-list lookup tail latency 10× worse than median | Bad RNG sequence; node height variance | Use a higher-quality PRNG; bound max height |
| Hash table tail latency spikes near 90% load | Robin Hood variance explodes | Resize earlier (load factor 0.75) |
| Skip-list memory 2× of equivalent BST | Forward-pointer array overhead | Use per-level arena allocators; pack pointer arrays |
| Hash table grows but never shrinks | Resize is unidirectional in our impl | Shrink when load < 0.25 (we don't — extension point) |
| Iterator skips entries | Mutation during iteration | Snapshot at iterator-creation time (db-05 covers this) |
Execution — How to Build and Run
Quick start (per language)
# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/dsbench --help
# Go
cd src/go
go test ./...
go build -o bin/dsbench ./cmd/dsbench
./bin/dsbench --help
# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/dsbench --help
CLI: dsbench
A single binary per language that exercises both data structures.
| Subcommand | Args | What it does |
|---|---|---|
| `skiplist insert N [seed]` | N (count) | Inserts N keys, prints final size + max level + histogram |
| `skiplist roundtrip N` | N | Inserts N keys, verifies every key reads back, then removes them |
| `skiplist iter N` | N | Inserts N random keys, prints all keys in iterator order |
| `hashtable insert N` | N | Inserts N keys, reports load factor + max probe distance + histogram |
| `hashtable roundtrip N` | N | Insert + verify + delete + verify gone |
| `bench point N` | N | Inserts N keys into both, benchmarks point lookups for each |
| `bench mem N` | N | Reports bytes-per-entry for both structures |
Library API
Same shape in all three languages.
SkipList::new(seed) -> SkipList
SkipList::insert(key, value) -> bool (true if newly inserted, false if replaced)
SkipList::get(key) -> Option<value>
SkipList::remove(key) -> bool
SkipList::len() -> usize
SkipList::iter() -> sorted iterator over (key, value)
HashTable::new(capacity) -> HashTable
HashTable::insert(key, value) -> bool
HashTable::get(key) -> Option<value>
HashTable::remove(key) -> bool
HashTable::len() -> usize
HashTable::load_factor() -> f64
HashTable::max_probe() -> usize
Keys and values are byte strings.
Verifying
./scripts/verify.sh # invariants per structure
./scripts/cross_test.sh # cross-language behavioral checks
Observation — Looking Inside the Structures
What you should see
This lab is at its best when you stop trusting the numbers and start looking at the memory layout.
1. Histogram of skip-list node heights
The skip list with p=0.5 should produce a geometric distribution: ~half the nodes at level 0 only, ~quarter reaching level 1, etc.
./target/release/dsbench skiplist insert 100000
Sample output:
level count %
0 50032 50.0 ████████████████████████████████████████████████
1 25021 25.0 █████████████████████████
2 12508 12.5 █████████████
3 6234 6.2 ██████
4 3098 3.1 ███
5 1581 1.6 ██
6 778 0.8 █
...
max level used = 16
If your distribution is skewed (e.g., level 0 is 25% instead of 50%) your RNG or sampling code is wrong. The most common bug is rng() & 1 evaluated once and reused.
2. Hash-table probe-distance histogram
./target/release/dsbench hashtable insert 1000000
Sample output (Robin Hood, load 0.477):
probe distance count %
0 633412 63.3
1 235108 23.5
2 87412 8.7
3 32104 3.2
4 10001 1.0
5 1652 0.2
6 298 0.0
7 12 0.0
8 1 0.0
mean = 0.55 max = 8 capacity = 2097152 load = 0.477
With pure linear probing (no Robin Hood) the tail extends much further.
3. Memory accounting
./target/release/dsbench bench mem 1000000
Sample:
skip list : 1,000,000 entries, ~80 B/entry
hash table : 1,000,000 entries, ~25 B/entry (cap=2097152, load=0.477)
What "working" looks like
- Skip list: max level grows like `log₂ n + 5` (slight overshoot from variance).
- Hash table: mean probe < 1, max probe < 10 at load ≤ 0.85.
- Bench: hash table 5–20× faster than skip list on point ops.
- Memory: hash table is 2–4× smaller per entry.
What "broken" looks like
- Mean probe distance climbs above 2 → poor hash function or table not actually power-of-two-sized.
- Max skip-list level stuck at 1–2 with 1M entries → RNG broken; bit-test always falls through.
- Same level distribution from one run to the next → seed not random.
- Hash table size doesn't grow after 85% load → resize trigger not firing.
- `max_probe` 3–4× above the theoretical bound → almost always the hash function. We hit this with raw FNV-1a 64-bit (max_probe ≈ 200 at N=100k vs expected ≤ 66). Adding a SplitMix64 finalizer fixed it. The pure-FNV variant typoed its prime constant too; see step 02 for the canonical value `0x00000100000001b3` and the three pinned vectors (`""`, `"a"`, `"foobar"`).
What scripts/cross_test.sh proves
Runs dsbench skiplist iter N seed in Go, Rust, and C++ and diffs all
three outputs. If they aren't byte-identical, one of the three has drifted
on hash function, RNG, or ordering — usually the easiest single signal
for catching a port regression.
Verification — Invariants
scripts/verify.sh runs the language-default binary (Rust by default) through these checks. scripts/cross_test.sh then re-runs the same scenarios in Go and C++ and asserts the behaviorally observable outputs match. The internal layouts are not required to match — only the API behavior.
| # | Invariant | How verified |
|---|---|---|
| V1 | Skip list round-trip: insert(k, v) then get(k) == v for all k | dsbench skiplist roundtrip 10000 |
| V2 | Skip list iteration is in sorted order | dsbench skiplist iter 1000 piped to sort -c |
| V3 | Skip list level distribution is geometric (p=0.5 ± tolerance) | histogram chi-square check in unit test |
| V4 | Skip list max level stays ≤ MAX_LEVEL even with 100k inserts | bench reports max_level_used |
| V5 | Hash table round-trip: insert(k, v) then get(k) == v for all k | dsbench hashtable roundtrip 10000 |
| V6 | Hash table max probe distance ≤ 4 × log₂(cap × load) at load ≤ 0.85 | unit test asserts |
| V7 | Hash table resizes at load 0.85, capacity doubles, max-probe drops | unit test |
| V8 | Backward-shift deletion never leaves a lookup hole | unit test: insert 100, delete random 50, assert remaining 50 still found |
| V9 | Insert with same key twice replaces value, len() unchanged | unit test |
| V10 | Cross-language: insert sequence [5, 1, 3, 8, 2] into skip list, iter output is sorted in all 3 langs | cross_test.sh |
| V11 | Cross-language: hash table after inserting same 1000 keys reports same len() | cross_test.sh |
| V12 | Cross-language: roundtrip of the same N keys works in all 3 langs | cross_test.sh |
Running
./scripts/verify.sh # ~5s, runs Rust binary
DSE_BIN=./src/go/bin/dsbench ./scripts/verify.sh
DSE_BIN=./src/cpp/build/dsbench ./scripts/verify.sh
./scripts/cross_test.sh # builds all 3, runs cross-language checks
What to do when a check fails
| Failure | Most likely cause |
|---|---|
| V1, V5 | Off-by-one in insert path; key not normalized |
| V2 | Skip list level 0 chain has out-of-order pointer write |
| V3 | RNG broken: same bit pattern reused per call |
| V4 | Level cap not enforced |
| V6 | Hash table not actually power-of-two; home_slot = hash % cap not masking |
| V7 | Load-factor check uses > instead of >=, or len not decremented on delete |
| V8 | Tombstones left in array; backward-shift loop terminates too early |
| V11 | Hash function differs across langs: must be FNV-1a + SplitMix64 finalizer in all three |
Broader Ideas — Where to Go Next
Extensions you can build on top of this lab. Each is a 0.5–2 day exercise.
1. Concurrent skip list with lock-free reads
LevelDB's MemTable allows concurrent readers while a single writer inserts. The trick: nodes are made visible by a single atomic store of the bottom-level forward pointer — once that store lands, the node exists; before, it doesn't. Higher-level pointers can race because they're only used to speed up a search; if they point to a not-yet-visible node, the next compare won't match and the search retries.
Implement: an AtomicPtr per forward pointer, a single writer (enforced by external mutex or compare_exchange), no per-node lock. Test: spawn 8 readers + 1 writer, run for 10s, assert no reader observes a partial node.
2. Concurrent hash table: lock striping + extendible hashing
Lock striping: 64 stripe mutexes; key's stripe = hash & 63. A write locks its stripe; reads either lock-read or use seqlock counters.
Extendible hashing: instead of full-table resize, split one bucket at a time when it grows past a threshold.
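A minimal sketch of the lock-striping half, under stated assumptions: each stripe wraps `std::collections::HashMap` as a stand-in for the lab's own table, keys are hashed with the standard library's `DefaultHasher` rather than the lab hash, and both reads and writes take the stripe mutex (no seqlock). `StripedMap` and `fill_concurrently` are names invented for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

const STRIPES: usize = 64; // power of two, so `hash & 63` picks a stripe

/// A map split into 64 independently locked shards.
struct StripedMap {
    stripes: Vec<Mutex<HashMap<u64, u64>>>,
}

impl StripedMap {
    fn new() -> Self {
        StripedMap {
            stripes: (0..STRIPES).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Pick the stripe that owns `key`.
    fn stripe(&self, key: u64) -> &Mutex<HashMap<u64, u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.stripes[(h.finish() as usize) & (STRIPES - 1)]
    }

    /// Writers only contend when their keys land on the same stripe.
    fn insert(&self, key: u64, value: u64) -> Option<u64> {
        self.stripe(key).lock().unwrap().insert(key, value)
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.stripe(key).lock().unwrap().get(&key).copied()
    }

    /// Locks stripes one at a time, so the total is not a consistent snapshot
    /// while writers are active.
    fn len(&self) -> usize {
        self.stripes.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}

/// Spawn `writers` threads that each insert a disjoint range of keys.
fn fill_concurrently(writers: u64, per_writer: u64) -> Arc<StripedMap> {
    let map = Arc::new(StripedMap::new());
    let handles: Vec<_> = (0..writers)
        .map(|w| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..per_writer {
                    let key = w * per_writer + i;
                    map.insert(key, key * 2);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    map
}

fn main() {
    let map = fill_concurrently(8, 10_000);
    println!("entries: {}, get(12345) = {:?}", map.len(), map.get(12_345));
}
```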
3. ART (Adaptive Radix Tree)
A radix tree variant that uses 4 different internal node layouts (4, 16, 48, 256 children) and adapts based on density. Wins for variable-length keys with shared prefixes (URLs, paths). [Leis et al., ICDE 2013].
4. Cuckoo hashing
Two hash functions, two candidate slots per key. Lookups are guaranteed 2 reads. Used in Memcached extensions.
5. Hopscotch hashing
Each entry must live within H slots of its home (typically H=32). Bounded probe distance like Robin Hood with stronger guarantee.
6. B+-Tree (preview db-10)
Write an in-memory B+-Tree with fan-out 64, leaves chained, and compare to skip list on range scans. This sets up the "why on-disk B-Trees beat skip lists" intuition before you've touched disk.
7. Skip list with rank queries (Redis ZRANK)
Add a span field per forward pointer = "number of bottom-level nodes this pointer skips over." Now rank(key) is O(log n) instead of O(n). ~50 LOC of additions.
8. Bloom filter (preview db-04)
In front of the hash table: a 1-bit-per-position array sized for ~1% false-positive rate at expected N. Measure: at 50%-miss workloads the Bloom filter saves you cache misses; at 95%-hit workloads it's pure overhead.
9. xor / cuckoo / ribbon filters
Modern variants (xor [Graf & Lemire 2020], ribbon [Dillinger 2021]) get the same false-positive rate as Bloom with 25–35% less memory.
10. Cache-conscious skip list
Replace the per-node forward-pointer array with a contiguous tail allocation (RocksDB's InlineSkipList). Compare cache miss rates: same algorithm, half the misses.
11. Persistent / immutable variants
Build an immutable skip list where insert returns a new root, sharing 99% of nodes with the old one. Useful for snapshots, MVCC.
When you've explored a couple, you're ready for db-03 Write-Ahead Log, where the durability story begins.
Step 1 — Implement the Skip List
Goal
Build a sorted map with O(log n) expected insert/lookup/remove and O(n) ordered iteration.
API
SkipList::new(seed: u64) -> SkipList
SkipList::insert(key, value) -> bool // false if replaced existing
SkipList::get(key) -> Option<value>
SkipList::remove(key) -> bool
SkipList::len() -> usize
SkipList::iter() -> iterator<(key, value)>
SkipList::max_level_used() -> usize
SkipList::level_histogram() -> [usize; MAX_LEVEL]
Constants
- `MAX_LEVEL = 32`
- `P = 0.5` (sample via `min(count_trailing_zeros(rng()) + 1, MAX_LEVEL)`)
Data layout
SkipList {
head: Node* // dummy sentinel at MAX_LEVEL
level: usize // current max level used (1..=MAX_LEVEL)
len: usize
rng: u64 // xorshift64 state
}
Node {
key: Vec<u8>
value: Vec<u8>
forward: Vec<Node*> // length = height of this node; forward[0] is the owning link
}
In Rust we use Box<Node> for ownership and raw pointers for siblings (or Option<NonNull<Node>> for safer raw pointers). In Go, *Node is the natural choice. In C++, std::unique_ptr<Node> for the sole owner of level-0 next, raw pointers for higher levels.
For simplicity we use a single ownership style: level-0 owns nodes; higher levels hold raw pointers. Drop walks the level-0 chain once.
Insert pseudocode
update[0..MAX_LEVEL] = HEAD
x = head
for i in (level-1) down to 0:
while x.forward[i] != null && x.forward[i].key < key:
x = x.forward[i]
update[i] = x
if x.forward[0] != null && x.forward[0].key == key:
x.forward[0].value = value
return false // replaced
new_level = random_level()
if new_level > level:
for i in level..new_level: update[i] = HEAD
level = new_level
node = new Node(key, value, new_level)
for i in 0..new_level:
node.forward[i] = update[i].forward[i]
update[i].forward[i] = node
len += 1
return true
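One way to turn the pseudocode into safe Rust is sketched below. It is an assumption-laden variant, not the lab's layout: nodes live in a `Vec` arena and `forward` holds indices instead of owning or raw pointers, which sidesteps the ownership question discussed under "Data layout". `remove` is omitted, and the `keys` helper is invented for this example.

```rust
const MAX_LEVEL: usize = 32;
const HEAD: usize = 0;

struct Node {
    key: Vec<u8>,
    value: Vec<u8>,
    forward: Vec<Option<usize>>, // arena indices; length = node height
}

struct SkipList {
    nodes: Vec<Node>, // arena; nodes[HEAD] is the sentinel
    level: usize,     // current max level in use (1..=MAX_LEVEL)
    len: usize,
    rng: u64,         // xorshift64 state, must be non-zero
}

impl SkipList {
    fn new(seed: u64) -> Self {
        let head = Node { key: vec![], value: vec![], forward: vec![None; MAX_LEVEL] };
        SkipList { nodes: vec![head], level: 1, len: 0, rng: seed.max(1) }
    }

    fn random_level(&mut self) -> usize {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        (self.rng.trailing_zeros() as usize + 1).min(MAX_LEVEL)
    }

    /// Returns true if the key is new, false if an existing value was replaced.
    fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        // update[i] = last node on level i whose key is < `key`.
        let mut update = [HEAD; MAX_LEVEL];
        let mut x = HEAD;
        for i in (0..self.level).rev() {
            while let Some(next) = self.nodes[x].forward[i] {
                if self.nodes[next].key.as_slice() < key { x = next } else { break }
            }
            update[i] = x;
        }
        if let Some(next) = self.nodes[x].forward[0] {
            if self.nodes[next].key == key {
                self.nodes[next].value = value.to_vec();
                return false; // replaced
            }
        }
        let new_level = self.random_level();
        // Levels above the old maximum keep update[i] == HEAD from the initializer.
        self.level = self.level.max(new_level);
        let idx = self.nodes.len();
        let mut forward = vec![None; new_level];
        for i in 0..new_level {
            forward[i] = self.nodes[update[i]].forward[i];
            self.nodes[update[i]].forward[i] = Some(idx);
        }
        self.nodes.push(Node { key: key.to_vec(), value: value.to_vec(), forward });
        self.len += 1;
        true
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let mut x = HEAD;
        for i in (0..self.level).rev() {
            while let Some(next) = self.nodes[x].forward[i] {
                if self.nodes[next].key.as_slice() < key { x = next } else { break }
            }
        }
        let next = self.nodes[x].forward[0]?;
        if self.nodes[next].key == key { Some(self.nodes[next].value.as_slice()) } else { None }
    }

    /// Keys in sorted order, by walking the level-0 chain.
    fn keys(&self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        let mut x = self.nodes[HEAD].forward[0];
        while let Some(i) = x {
            out.push(self.nodes[i].key.clone());
            x = self.nodes[i].forward[0];
        }
        out
    }
}

fn main() {
    let mut list = SkipList::new(42);
    for k in [5u8, 1, 3, 8, 2] {
        list.insert(&[k], b"v");
    }
    println!("{} keys: {:?}", list.len, list.keys());
}
```

The arena trades pointer stability for simplicity: indices stay valid across `Vec` growth, and dropping the list frees every node without walking the chain.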
Random level
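fn random_level(rng: &mut u64) -> usize {
// xorshift64 step; the state must be non-zero or it stays zero forever.
*rng ^= *rng << 13;
*rng ^= *rng >> 7;
*rng ^= *rng << 17;
// Each trailing zero bit is one "heads": P(level >= k) = 2^-(k-1).
let lvl = (rng.trailing_zeros() as usize) + 1;
lvl.min(MAX_LEVEL) // clamp; `min` as a method needs no import
}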
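A quick way to sanity-check the sampler before wiring it into the list is to histogram its output. The sketch below repeats the function so it compiles on its own; `histogram` and the seed constant are choices made for this example, not part of the lab's API.

```rust
const MAX_LEVEL: usize = 32;

/// xorshift64 step followed by "count trailing zeros + 1", as in the
/// snippet above. The state must be non-zero: zero is a fixed point.
fn random_level(rng: &mut u64) -> usize {
    *rng ^= *rng << 13;
    *rng ^= *rng >> 7;
    *rng ^= *rng << 17;
    (rng.trailing_zeros() as usize + 1).min(MAX_LEVEL)
}

/// Count how many of `n` samples land on each height (index 0 = height 1).
fn histogram(seed: u64, n: usize) -> [usize; MAX_LEVEL] {
    let mut rng = seed;
    let mut counts = [0usize; MAX_LEVEL];
    for _ in 0..n {
        counts[random_level(&mut rng) - 1] += 1;
    }
    counts
}

fn main() {
    let n = 100_000;
    let counts = histogram(0x9E37_79B9_7F4A_7C15, n);
    for (i, &c) in counts.iter().enumerate() {
        if c == 0 {
            break;
        }
        println!("height {:>2}: {:>6} ({:.1}%)", i + 1, c, 100.0 * c as f64 / n as f64);
    }
}
```

Roughly half the samples should have height 1 and a quarter height 2. A zero seed is the degenerate case: the state never changes and every node comes out at `MAX_LEVEL`.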
Tests
| # | Test | Pass if |
|---|---|---|
| T1 | insert 1000 random keys, all get succeed | every value matches |
| T2 | insert sorted keys 0..999, iter yields 0..999 | strictly increasing |
| T3 | insert + remove all keys, len = 0 | empty |
| T4 | insert with same key twice, len unchanged | replacement worked |
| T5 | level distribution at N=100k is geometric | sum of L≥k slots ≈ N · 2^-k |
Step 2 — Implement the Hash Table
Goal
Open-addressing hash table with linear probing + Robin Hood + backward-shift deletion.
API
HashTable::new(initial_capacity_pow2: usize) -> HashTable
HashTable::insert(key, value) -> bool // false if replaced
HashTable::get(key) -> Option<value>
HashTable::remove(key) -> bool
HashTable::len() -> usize
HashTable::capacity() -> usize
HashTable::load_factor() -> f64
HashTable::max_probe() -> usize
HashTable::probe_histogram() -> Vec<usize>
Hash function
FNV-1a 64-bit followed by a SplitMix64 finalizer (identical in all three languages):
offset = 0xcbf29ce484222325
prime = 0x00000100000001b3 # = 1_099_511_628_211
h = offset
for byte in key:
h ^= byte
h = h * prime (wrapping)
return splitmix64_finalize(h)
fn splitmix64_finalize(mut h: u64) -> u64 {
h ^= h >> 30; h = h.wrapping_mul(0xbf58476d1ce4e5b9);
h ^= h >> 27; h = h.wrapping_mul(0x94d049bb133111eb);
h ^= h >> 31;
h
}
Plain FNV-1a has notoriously poor avalanche on short, sequential keys —
running it raw against Robin Hood probing blows up max_probe (we measured
206 vs the expected ≲66 bound at N=100k). The SplitMix64 finalizer is
bijective (adds no collisions) and re-mixes the high bits down, which
restores the geometric PSL distribution.
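The two stages can be written out in Rust as below. This is a sketch for pinning behaviour in tests, with the function names `fnv1a64` and `hash` chosen for this example. Keeping the raw FNV-1a stage as its own function is a deliberate choice: its outputs can be compared against the published FNV-1a test vectors, which helps tell a wrong prime apart from a wrong finalizer when a cross-language check fails. The finalized values are meant to be compared against this step's known-answer table.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Plain FNV-1a 64-bit: xor the byte in, then multiply by the prime.
fn fnv1a64(key: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &byte in key {
        h ^= u64::from(byte);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// SplitMix64 finalizer, with the constants given above.
fn splitmix64_finalize(mut h: u64) -> u64 {
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d0_49bb_1331_11eb);
    h ^= h >> 31;
    h
}

/// The table hash: FNV-1a followed by the finalizer.
fn hash(key: &[u8]) -> u64 {
    splitmix64_finalize(fnv1a64(key))
}

fn main() {
    for key in ["", "a", "foobar"] {
        println!(
            "{key:?}: fnv1a = {:#018x}, finalized = {:#018x}",
            fnv1a64(key.as_bytes()),
            hash(key.as_bytes())
        );
    }
}
```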
Known-answer vectors (pin these in tests across all three languages):
| key | hash |
|---|---|
| `""` | 0xf52a15e9a9b5e89b |
| `"a"` | 0x02c0bdbf481420f8 |
| `"foobar"` | 0x404da9e3b74078c2 |
Slot layout
Slot {
occupied: bool // or use psl = MAX as sentinel
psl: u32 // probe sequence length
hash: u64
key: Vec<u8>
value: Vec<u8>
}
We store the full 64-bit hash inside each slot so resizes don't re-hash keys, and so that get can compare hashes (cheap) before keys (expensive).
Insert (Robin Hood)
if (len + 1) / capacity > 0.85: resize(capacity * 2)
h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
if !slots[i].occupied:
slots[i] = (key, value, h, psl); occupied; len += 1
return true
if slots[i].hash == h && slots[i].key == key:
slots[i].value = value
return false // replaced
if slots[i].psl < psl:
swap(slots[i], (key, value, h, psl)) // steal
i = (i + 1) & (capacity - 1)
psl += 1
Get
h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
if !slots[i].occupied: return None
if slots[i].psl < psl: return None // would have stolen
if slots[i].hash == h && slots[i].key == key:
return Some(slots[i].value)
i = (i + 1) & (capacity - 1)
psl += 1
Remove (backward-shift)
i = find_slot(key)
if not found: return false
loop:
j = (i + 1) & (capacity - 1)
if !slots[j].occupied || slots[j].psl == 0:
slots[i].occupied = false
len -= 1
return true
slots[i] = slots[j]
slots[i].psl -= 1
i = j
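The shift loop is easiest to check on a tiny table. The sketch below starts from the 8-slot layout produced by the Robin Hood example in the lab overview (K1, K2, K4, K3 in slots 2 to 5) and removes K2. It is illustrative only: slots store just a key and a PSL, the victim is located with a plain scan instead of the PSL-bounded probe, and `Slot`, `remove`, and `example_slots` are names invented for this example.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Slot {
    key: &'static str,
    psl: u32, // distance from the key's home slot
}

/// Remove `key` by backward shift. Returns false if the key is absent.
fn remove(slots: &mut [Option<Slot>], key: &str) -> bool {
    let mask = slots.len() - 1; // capacity is a power of two
    let found = slots.iter().position(|s| s.map_or(false, |e| e.key == key));
    let mut i = match found {
        Some(i) => i,
        None => return false,
    };
    loop {
        let j = (i + 1) & mask;
        let next = slots[j];
        match next {
            // Next entry exists and is away from home: pull it one slot closer.
            Some(n) if n.psl > 0 => {
                slots[i] = Some(Slot { psl: n.psl - 1, ..n });
                i = j;
            }
            // Empty slot or an entry already at home: the hole ends here.
            _ => {
                slots[i] = None;
                return true;
            }
        }
    }
}

/// Table state after inserting K1, K2, K3, K4 with homes 2, 2, 4, 2.
fn example_slots() -> Vec<Option<Slot>> {
    let s = |key, psl| Some(Slot { key, psl });
    vec![None, None, s("K1", 0), s("K2", 1), s("K4", 2), s("K3", 1), None, None]
}

fn main() {
    let mut slots = example_slots();
    remove(&mut slots, "K2");
    for (i, slot) in slots.iter().enumerate() {
        println!("{i}: {:?}", slot);
    }
}
```

After the removal K4 sits one slot from home and K3 is back in its home slot, so no lookup has to cross a hole.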
Resize
new_slots = [empty; capacity * 2]
old = swap(slots, new_slots); capacity *= 2; len = 0
for slot in old where occupied:
insert(slot.key, slot.value) // re-uses hash if you cache it
Tests
| # | Test | Pass if |
|---|---|---|
| T1 | insert + get of 10k random keys | all hits |
| T2 | insert 10k, remove 5k random, get remaining 5k | all still found, removed not found |
| T3 | insert past 85% load triggers resize | capacity doubled |
| T4 | duplicate insert replaces value, len unchanged | replacement worked |
| T5 | max PSL ≤ 4·log₂(cap·load) at 1M inserts | bounded variance |
Step 3 — Benchmark and Compare
Goal
Quantify the cost difference between the two structures on three workloads.
Workloads
| Name | Op | Measured |
|---|---|---|
| `point` | get(key) where key was previously inserted | ns/op + cache misses (if perf) |
| `point-miss` | get(key) where key is absent | ns/op |
| `range` | iterate from lo until hi | ns/key (skip list only) |
Procedure
- Seed both structures with N (10k, 100k, 1M, 10M) random 8-byte keys + 8-byte values.
- For each workload, take `iters` random accesses, time the loop, divide by `iters`.
- Report p50, p99, mean, total throughput.
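Percentiles need individual samples rather than one timed loop, so one way to produce the report is to record a latency per operation (or per small batch) and summarize afterwards. The sketch below assumes that approach and uses the nearest-rank percentile definition; `Summary` and `summarize` are hypothetical names, not part of `dsbench`.

```rust
use std::time::Duration;

/// Latency summary for one workload.
#[derive(Debug)]
pub struct Summary {
    pub p50: Duration,
    pub p99: Duration,
    pub mean: Duration,
    pub ops_per_sec: f64,
}

/// Summarizes per-operation latencies. Sorts `samples` in place.
pub fn summarize(samples: &mut [Duration]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let n = samples.len();
    let total: Duration = samples.iter().sum();
    // Nearest-rank percentile: the ceil(p/100 * n)-th smallest sample (1-based).
    let rank = |p: usize| samples[((p * n + 99) / 100).max(1) - 1];
    Some(Summary {
        p50: rank(50),
        p99: rank(99),
        mean: total / n as u32,
        ops_per_sec: n as f64 / total.as_secs_f64(),
    })
}
```

Timing every operation individually adds clock-read overhead to each sample, so for very fast operations (tens of nanoseconds) it may be worth timing small batches and treating each batch average as one sample.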
Expected outcomes (M2 Pro, release, N=1M)
| Op | Skip list | Hash table | Ratio |
|---|---|---|---|
| Insert | ~700 ns | ~95 ns | 7× |
| Point hit | ~450 ns | ~25 ns | 18× |
| Point miss | ~500 ns | ~30 ns | 17× |
| Range scan 1000 from key | ~25 µs | N/A | — |
| Bytes/entry | ~80 | ~25 | 3.2× |
What the numbers prove
- For unordered point access, never use a skip list. The roughly 17–18× gap comes from cache-miss count: the hash table touches ~1 line, the skip list touches ~20 (one per level).
- For ordered access, the skip list is the only option of the two. Range scans on a hash table require collecting all entries and sorting: `O(n log n)` setup vs `O(log n + k)` for the skip list.
- The memory gap is real and gets worse for tiny values. Skip-list forward-pointer arrays dominate when value size is < ~64 B.
How to run
./target/release/dsbench bench point 1000000
./target/release/dsbench bench mem 1000000
Optional: cache-miss measurement
# Linux
perf stat -e cache-misses,cache-references ./target/release/dsbench bench point 1000000
# macOS (Instruments → Counters template, capture by PID)
xcrun xctrace record --template 'Counters' --launch ./target/release/dsbench bench point 1000000
Write-Ahead Log (WAL)
Status: complete — runnable in Rust, Go, C++.
1. What Is It
A WAL is an append-only file that records intent-to-modify before the actual data pages are updated. On crash, the recovery routine replays the WAL from the last checkpoint, restoring the database to a consistent state.
client write ──► append record ──► fsync(WAL) ──► ack client
│
▼
later: apply to data file
later: checkpoint & truncate WAL
The critical invariant is the write order: WAL hits stable storage before the client is told the write committed. If the process dies between the fsync and the data-page update, recovery re-applies the logged operation. If it dies before the fsync, the client never got an ack, so losing the record is acceptable.
2. Why It Matters
| Without WAL | With WAL |
|---|---|
| Random in-place writes to the data file | Sequential append (10–100× faster on HDD, still better on SSD) |
| Each commit = random-page fsync | Each commit = single sequential append + fsync |
| Crash mid-update ⇒ torn page, corrupt file | Crash ⇒ replay log, idempotent recovery |
| Group commit impossible | Multiple commits batched into one fsync ("group commit") |
Every serious database has one: PostgreSQL's WAL, MySQL's redo log, SQLite's WAL mode, RocksDB's WAL, LevelDB's `*.log` files (not the human-readable `LOG` info log), Kafka's segments.
3. How It Works
Record framing used in this lab (a simplified take on LevelDB's record format, without the block grouping or the record-type byte):
┌─────────┬─────────┬──────────────────────┐
│ len(u32)│ crc(u32)│ payload (len bytes) │
└─────────┴─────────┴──────────────────────┘
4 4 N
- `len` and `crc` are little-endian.
- `crc` is CRC-32 (IEEE 802.3 polynomial, reflected) of the payload only.
- Records are written back-to-back with no padding.
- Recovery iterates from offset 0 and stops at the first record whose header is short, whose payload is short, or whose CRC fails. The valid prefix is replayed; the bad tail is silently truncated on next open.
file:
┌───┬───┬─────┬───┬───┬─────┬───┬───┬──┐ ← crashed mid-write
│L=8│CRC│ A… │L=4│CRC│ B… │L=9│CRC│??│
└───┴───┴─────┴───┴───┴─────┴───┴───┴──┘
▲ ▲
│ └─ short payload → stop, truncate from here
└─ last fully-flushed record
4. Core Terminology
| Term | Definition |
|---|---|
| Record | Self-describing unit: header (len + crc) followed by payload bytes. |
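| `fsync` | Syscall asking the kernel to flush the file's dirty pages and inode metadata to stable storage. |
| `fdatasync` | Like `fsync` but skips metadata that is not needed to read the data back (e.g. mtime). A changed file size is still flushed, so appends gain less than in-place overwrites. |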
| Group commit | Coalescing multiple in-flight appends into one shared fsync. |
| Torn write | A write the device split into two physical sectors, only one of which made it. CRC catches this. |
| Tail truncation | Scanning forward at open and discarding any partial trailing record. |
| Checkpoint | Flush dirty pages to data file, record a WAL position beyond which replay is unnecessary. |
5. Mental Models
- The log is the source of truth, the data file is the cache. Recovery reconstructs from the log.
- CRC is for detection, not correction. It tells you where the good prefix ends; it does not heal damage.
- `fsync` is a barrier, not a universal durability guarantee. Consumer SSDs and FUSE filesystems sometimes lie. Use `fio --fdatasync=1` to spot-check hardware.
- Sequential I/O wins. Even on SSDs, sequential writes have better SLC-cache and GC behavior than random ones.
6. Common Misconceptions
- "`write()` already put it on disk." No: it only reached the kernel page cache. `fsync` is required for durability.
- "CRC + length is enough." Necessary, not sufficient. A record with both `len` and `crc` zeroed is indistinguishable from `len=0`, `crc=crc32([])`. We disallow `len=0` (treat as EOF).
- "Group commit hurts latency." Tiny median bump for a 10–100× throughput win and lower tail latency under load.
- "fsync == O_DIRECT." Different layers. `O_DIRECT` bypasses the page cache; `fsync` flushes it.
7. Interview Talking Points
- Distinguish redo log (PostgreSQL/InnoDB), WAL mode (SQLite WAL), journal mode (SQLite default).
- Why CRC the payload, not the header? (Need length first; header CRC catches the wrong failures.)
- `fsync` vs `fdatasync` vs `sync_file_range`.
- Group commit mechanics and tradeoff.
- Why WAL alone doesn't beat torn writes on the data file → full-page writes in WAL after each checkpoint.
8. Connections to Other Labs
- db-01: every append here is a `pwrite` + `fsync` from db-01.
- db-05, db-09: LSM writes always hit the WAL first.
- db-11: SQLite WAL mode reuses this exact pattern around a B-tree pager.
- db-13: commit records & 2PC live in the WAL.
- db-17: Raft's replicated log is, mechanically, a WAL.
References — Write-Ahead Log
Papers
- ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. C. Mohan et al., 1992. The canonical WAL paper; introduces the redo–undo discipline still used everywhere.
- The Log-Structured Merge-Tree (LSM-Tree). P. O'Neil et al., 1996. Section 2 motivates why a separate sequential log is necessary even for in-memory writes.
Code
- LevelDB `db/log_format.h`: record types & block structure that inspired this lab's framing.
- RocksDB `db/log_writer.cc`: production-grade group-commit implementation.
- PostgreSQL `src/backend/access/transam/xlog.c`: full-page writes, redo machinery.
CRC
- A Painless Guide to CRC Error Detection Algorithms. Ross Williams, 1993. Plain-English walk-through of polynomial division and reflected algorithms.
- Linux kernel `lib/crc32.c`: reference table-driven implementation.
Filesystems & fsync
- Can Applications Recover from fsync Failures? Anthony Rebello et al., USENIX ATC 2020. Surveys the depressing landscape of partial fsync failures.
- Files are hard. Dan Luu, 2015. Survey blog post on every way fsync, rename, and friends betray you.
- `man 2 fsync`, `man 2 fdatasync`, `man 2 open` (`O_DSYNC`, `O_SYNC`).
Analysis — Write-Ahead Log
Problem statement
Make a stream of writes durable and crash-recoverable without paying for a random in-place fsync per write.
Constraints
| Constraint | Why it matters |
|---|---|
| Append-only file | Sequential I/O, ~10–100× faster than random on HDD; better SLC/GC behavior on SSD. |
| Self-describing records | Recovery must work without a side index. Header = length + checksum. |
| Truncation-tolerant tail | Crash mid-write leaves a partial record. We must detect & ignore it on next open. |
| Single writer | We do not address multi-writer log multiplexing here (Kafka does). |
| No structural guarantees from the FS | Don't assume ordering of metadata vs data, or that 4KB writes are atomic. |
Design decisions
- Header = `len(u32 LE) || crc32(u32 LE)`. Small (8 B), aligned, endian-fixed. `len` counts payload bytes only (the CRC field is not included), so the body can be read and checksummed as a single stream.
- CRC is over the payload only. The header itself is implicitly validated by use: a corrupt `len` either points beyond EOF (short read) or to data whose CRC won't match.
- `len == 0` is disallowed, used as the EOF sentinel. Empty payloads are rare in practice and avoiding the ambiguity simplifies the reader (`len=0`, `crc=0` happens naturally in a hole of zeros from a sparse file or pre-allocated extent).
- Little-endian on disk. Everyone runs LE now (x86, ARM, RISC-V; even POWER prefers it). Skipping the `htole32` dance saves ~5 LOC per language.
- CRC table generated at startup, not hardcoded. 1 KB, computed in microseconds. Easier to audit, and lets us swap polynomials in tests.
- One file, one writer, one fd. No segment rotation in this lab; that lives in db-07 (compaction) and db-09 (LevelDB). Single-file WAL is enough to teach framing.
- `sync()` is a separate method. The caller decides commit boundaries. Production systems may add `append_sync(payload)` that batches a group commit; we leave that for `bench` mode.
Why this design over alternatives
- vs LevelDB's block-grouped framing: LevelDB packs records into 32 KB blocks, fragmenting any record that crosses a block boundary, for alignment and easier corruption isolation. Beautiful, but doubles the code volume. We follow this lab's bias of "minimum to teach the concept, plus one cross-language cross-check".
- vs JSON / protobuf framing: would require schema management. CRC + raw bytes is the smallest possible recoverable framing.
- vs a per-record fsync: we expose a separate `sync()` so the user can choose between durability per-record (call after every append) and group commit (call periodically).
Failure modes addressed
| Failure | Detection |
|---|---|
| Partial header at EOF | Header read < 8 bytes ⇒ stop iteration. |
| Header OK, partial payload | Payload read < len ⇒ stop iteration. |
| Full record, CRC mismatch (bit-flip) | CRC32 over payload ≠ stored CRC ⇒ stop iteration. |
| Hole of zeros (sparse FS, preallocated) | len == 0 is the EOF sentinel ⇒ stop iteration cleanly. |
| Disk fully lying about `fsync` | Out of scope; `fio --fdatasync=1` can help detect it. |
Failure modes NOT addressed in this lab
- Bit-flip in the header itself that produces a plausible `(len, crc)` pair (probability ≈ 2⁻³²). Production systems mitigate with a record-type byte (LevelDB) or magic bytes (Kafka).
- Multi-process writers. Use `O_APPEND` with one `write()` per record for that (the `PIPE_BUF` atomicity guarantee formally covers pipes and FIFOs, not regular files, so treat small-append atomicity as filesystem-dependent); see db-09 / db-21.
- Disk full mid-write. Treat as torn write at EOF (the trailing record fails CRC and is dropped on recovery); the caller's `append()` returns an I/O error that they must handle.
Execution — How to Build and Run
Quick start (per language)
# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/walbench --help
# Go
cd src/go
go test ./...
go build -o bin/walbench ./cmd/walbench
./bin/walbench --help
# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/walbench --help
CLI: walbench
A single binary per language exercising the WAL.
| Subcommand | Args | What it does |
|---|---|---|
| `append PATH N [SIZE]` | path, count, payload bytes (default 64) | Appends N records of SIZE bytes; reports bytes/sec |
| `append-sync PATH N [SIZE]` | same | Same as `append` but `sync()` after each record |
| `read PATH` | path | Replays the log, prints `len(crc=…) ok` per record, then `OK n=… bytes=…` |
| `corrupt PATH OFFSET BYTE` | path, offset, byte value | Overwrites one byte in place, for testing tail tolerance |
| `bench-group PATH N BATCH` | path, total records, batch size | Group-commit benchmark: sync once per BATCH records |
Library API
Same shape in all three languages.
Wal::open(path) -> Wal // creates or opens for append, scans to EOF
Wal::append(payload) -> u64 offset // record start offset in file
Wal::sync() -> () // fdatasync (or fsync) the file
Wal::len() -> u64 // bytes on disk (post-append, post-sync)
Wal::close() -> () // implicit on Drop in Rust / RAII in C++
Wal::iter(path) -> iterator<Vec<u8>> // streams records; stops at first bad/short
open scans forward at startup to (a) find true EOF after a partial-write recovery and (b) optionally truncate the file to that position so the next append doesn't append after a known-bad tail. We do truncate in this implementation — the alternative (leave the bad tail in place) makes file size useless and complicates len().
Verifying
./scripts/verify.sh # invariants per implementation
./scripts/cross_test.sh # write in lang A, read in lang B, all nine writer×reader pairs
Observation — Looking at the Bytes
1. Hexdump a freshly written WAL
rm -f /tmp/wal # start from an empty log; append adds to an existing file
./build/walbench append /tmp/wal 3 4
xxd /tmp/wal
00000000: 0400 0000 b3ca 9eb5 4141 4141 0400 0000 ........AAAA....
00000010: b3ca 9eb5 4242 4242 0400 0000 b3ca 9eb5 ....BBBB........
00000020: 4343 4343 CCCC
What you should see:
- Three 12-byte records (8-byte header + 4-byte payload each), 36 bytes total.
- Identical length fields because every payload ("AAAA" / "BBBB" / "CCCC") is 4 bytes. The CRC fields must differ, since the payloads differ; the sample dump above repeats a single CRC value, so expect three distinct values in real output.
- `04 00 00 00` is `len = 4` in little-endian.
- The next 4 bytes are the payload's CRC, also little-endian.
If your file is suspiciously large (e.g., starts with garbage 0x00 or 0xFF runs), open() is opening the file with the wrong flags or your buffer is uninitialized.
2. Group commit vs per-record sync
./build/walbench append-sync /tmp/wal 10000 64 # fsync per record
./build/walbench bench-group /tmp/wal 10000 1 # group=1, same thing
./build/walbench bench-group /tmp/wal 10000 64 # 64 records per fsync
./build/walbench bench-group /tmp/wal 10000 512 # 512 per fsync
Sample (M2 Pro, APFS):
mode throughput
per-record sync 1,800 records/s (~556 µs/sync)
group=64 110,000 records/s
group=512 260,000 records/s
Two takeaways: per-record sync is roughly two orders of magnitude slower (about 60× at group=64, about 145× at group=512); group size has diminishing returns past ~256 because the bottleneck shifts to write() itself.
3. Tail truncation in action
rm -f /tmp/wal # start from an empty log
./build/walbench append /tmp/wal 5 16
wc -c /tmp/wal # 120 bytes (5 × 24)
printf '\xff\xff\xff\xff' >> /tmp/wal
wc -c /tmp/wal # 124 bytes
./build/walbench read /tmp/wal # reads 5, then "stop: short header" (124-120 = 4 < 8)
./build/walbench append /tmp/wal 1 16
wc -c /tmp/wal # 144 bytes — open() truncated the garbage, then appended
The reopen-truncate behavior is the most easily-missed correctness detail. If it's broken, your next append lands after the garbage, where a reader that stops at the first bad record never reaches it, so the new record is silently lost on recovery.
4. CRC sensitivity
Overwriting one byte of a 64-byte payload should break the CRC of that record but leave everything before it valid:
rm -f /tmp/wal # start from an empty log
./build/walbench append /tmp/wal 10 64
./build/walbench corrupt /tmp/wal 100 0x00 # offset 100 is inside the payload of the second record (bytes 80–143)
./build/walbench read /tmp/wal | tail
# expected: prints 1 OK record then "stop: bad crc" (assuming the byte at offset 100 was not already 0x00)
What "working" looks like
- Hexdump shows tightly packed 8-byte-header + payload pairs, no padding.
- Group commit is at least 50× faster than per-record sync.
- Tail truncation works on first reopen, regardless of how much garbage you appended.
What "broken" looks like
- A reader that hangs or panics on garbage: fix the bounds checks in the iter loop.
- File size grows but throughput is flat: you're probably calling `fsync` inside `append` accidentally.
- CRC doesn't trip on single-bit flips: the reader is probably not comparing the checksum, or is hashing the wrong byte range. Any CRC-32 variant detects every single-bit error, so a wrong polynomial (e.g. the un-reflected version) shows up as known-answer failures instead; see `scripts/verify.sh`.
- Cross-language test fails: endianness or CRC table bug. Print the first 16 bytes of the file from each language and compare.
Verification — What to Test and How
Property tests (per language)
| # | Test | Pass if |
|---|---|---|
| V1 | crc32_known_vectors | "" → 0x00000000; "a" → 0xE8B7BE43; "123456789" → 0xCBF43926 |
| V2 | roundtrip_small | append "hello" "world", iter yields exactly those two payloads |
| V3 | roundtrip_1000_variable | append 1000 records of pseudo-random sizes 1..1024, iter yields identical sequence |
| V4 | truncated_tail | open, append A and B, fsync, write 5 bytes of garbage past EOF, reopen ⇒ iter yields {A,B} only |
| V5 | corrupt_payload | flip one bit in the payload of record 2 of 5, iter yields {1} (stops at first bad) |
| V6 | corrupt_header | overwrite len of record 2 with 0xFFFFFFFF, iter yields {1} |
| V7 | reopen_truncates_garbage | scenario V4 followed by a new append, total iter yields {A,B,C} and file size equals exactly the three records' total bytes |
| V8 | append_returns_offset | offset returned by appendₙ equals sum of (header+payload) for records 0..n-1 |
Cross-language test
scripts/cross_test.sh performs a 3×3 matrix (nine runs): for each writer ∈ {go, rust, cpp} and reader ∈ {go, rust, cpp}, write 500 records of varying sizes with a fixed seed in the writer language, read them in the reader language, and assert the payload list matches exactly.
This catches:
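- Endianness mistakes in `len` / `crc`.
- Different CRC polynomials or initial value / final XOR.
- Off-by-one in header parse.
- Writes still buffered in userspace when the reader runs (we close the writer between phases). A missing `fsync` itself is not visible here, because the page cache serves the reader either way.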
Manual smoke
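rm -f /tmp/wal # start from an empty log
./build/walbench append /tmp/wal 100 64
./build/walbench read /tmp/wal | tail
# expected: OK n=100 bytes=7200 (100 × 72 on-disk bytes; 6400 if the tool counts payload bytes only)
./build/walbench corrupt /tmp/wal 50 0xFF
./build/walbench read /tmp/wal | tail
# expected: offset 50 is inside the first record's payload, so the reader reports a bad record before yielding any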
What "passing" means
- All 8 property tests green in all three languages.
- `cross_test.sh` exits 0 (9 successful writer×reader runs).
- Manual smoke: corruption stops the reader cleanly, no panic / no segfault, no infinite loop.
Broader Ideas — Beyond the Minimum
Things worth knowing that aren't in the lab code.
Block-grouped framing (LevelDB / RocksDB)
LevelDB packs records into 32 KB blocks and uses a 1-byte type field (FULL, FIRST, MIDDLE, LAST) to handle records that straddle blocks. Benefit: corruption in one block can't propagate; recovery can resync to the next block boundary. Cost: more code, slightly more space.
Group commit, properly
Real systems run a "log writer" goroutine/thread:
clients ──► append to buffer ──► wake writer ──► fsync once ──► broadcast cond var
The writer batches all records that arrived during the previous fsync into the next fsync. Latency stays bounded by (max fsync time) + (one batch fill); throughput scales until you saturate the SSD's IOPS.
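A minimal Rust sketch of such a log-writer thread follows. It swaps the buffer-plus-condition-variable wiring in the diagram for standard-library `mpsc` channels, which give the same batching behavior with less code: the writer blocks for one request, drains whatever queued up during the previous fsync, issues one `sync_data`, then acks every waiter. `Request`, `spawn_log_writer`, and `append_durable` are hypothetical names, and records are assumed to be already framed.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// One pending append: framed bytes plus a channel on which to ack durability.
pub struct Request {
    pub record: Vec<u8>,
    pub done: Sender<io::Result<()>>,
}

/// Spawns the writer thread. The join handle yields the number of fsync
/// batches once every `Sender<Request>` has been dropped.
pub fn spawn_log_writer(mut file: File) -> (Sender<Request>, JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        let mut batches = 0usize;
        // Block for the first request, then drain everything already queued.
        while let Ok(first) = rx.recv() {
            let mut batch = vec![first];
            while let Ok(next) = rx.try_recv() {
                batch.push(next);
            }
            let result = write_batch(&mut file, &batch);
            batches += 1;
            for req in batch {
                // io::Error is not Clone, so rebuild an equivalent error per waiter.
                let ack = match &result {
                    Ok(()) => Ok(()),
                    Err(e) => Err(io::Error::new(e.kind(), e.to_string())),
                };
                let _ = req.done.send(ack); // the waiter may have gone away
            }
        }
        batches
    });
    (tx, handle)
}

fn write_batch(file: &mut File, batch: &[Request]) -> io::Result<()> {
    for req in batch {
        file.write_all(&req.record)?;
    }
    file.sync_data() // one sync covers the whole batch
}

/// Client side: submit a record and block until it is durable.
pub fn append_durable(tx: &Sender<Request>, record: Vec<u8>) -> io::Result<()> {
    let (done, wait) = mpsc::channel();
    tx.send(Request { record, done })
        .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "log writer stopped"))?;
    wait.recv()
        .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "log writer stopped"))?
}
```

Batch size adapts to load on its own: an idle system syncs one record at a time, while a busy one accumulates a larger batch during each sync, with no tuning knob required.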
O_DSYNC vs application-level fsync
O_DSYNC makes every write() durable before returning. Removes the need for explicit fsync, but you lose the chance to batch. Real DBs prefer explicit fsync for that reason.
sync_file_range and friends
Linux-only. sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) starts writeback for only a byte range. PostgreSQL uses this for "lazy" checkpoints to avoid stalling on huge fsyncs. It doesn't sync metadata, so you still need a final fsync.
Pre-allocation & fallocate
For predictable I/O, pre-allocate the next WAL segment with fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE). This avoids metadata updates on each grow and gives the FS a contiguous extent. PostgreSQL pre-zeroes 16 MB segments.
Direct I/O & alignment
O_DIRECT bypasses the page cache; useful when the DB has its own buffer pool. Requires 512 B or 4 KB aligned buffers and offsets. Modern recommendation: prefer io_uring + O_DIRECT over POSIX AIO. We return to this in db-21.
Mixing data files and WAL on the same disk
Bad idea for HDDs (head contention), neutral for SSDs (no head), bad for low-end SSDs (write amplification competes). Production systems put WAL on a separate device when latency-sensitive.
When the WAL is the database
LSM-trees, Kafka, NATS JetStream, Pulsar, Apache BookKeeper — these treat the log as the primary structure and let secondary indexes / merge trees / consumers catch up. The data file in our toy example was hypothetical; LSMs make it explicit. See db-05 onward.
Encryption / compression
- Compression per record: trivial, but blocks the `Vec<u8>` reuse pattern. Better to compress whole segments at checkpoint time.
- Encryption per record: AEAD (AES-GCM or ChaCha20-Poly1305) replaces CRC32, since the auth tag doubles as the checksum. PostgreSQL's TDE proposals use this.
Replication
Once you have a sequential log of operations, replicating it is "just" send-and-replay. This is the entire conceptual basis of Raft and ZAB — see db-17 / db-19. The framing tricks here transfer directly.
What goes wrong at scale
- fsync amplification: every fsync touches the FS journal, which serializes against other fsyncs. Solution: large group commit batches.
- Long fsync tails: 99th-percentile fsync on a busy NVMe can be 100ms+. Solution: pipeline; never block a hot-path thread on fsync.
- Filesystems that lie: ext4 with `data=writeback` may complete `fsync` before journaling. APFS, ZFS, btrfs each have their own quirks. Empirical test with `fio` is the only safe answer.
Step 1 — Record framing & CRC
Goal
Define the on-disk format and a streaming CRC32 implementation that matches between Rust, Go, and C++.
Format recap
┌─────────┬─────────┬──────────────────────┐
│ len(u32)│ crc(u32)│ payload (len bytes) │
└─────────┴─────────┴──────────────────────┘
4 4 N
- Both `u32` fields are little-endian.
- CRC is over the payload only.
- `len == 0` is the EOF sentinel (an empty payload cannot be appended).
CRC32 — table-driven, reflected
poly = 0xEDB88320 // reflected IEEE 802.3 polynomial
table[256]: built once at startup
for each input byte b:
crc = (crc >> 8) ^ table[(crc & 0xff) ^ b]
return crc ^ 0xFFFFFFFF // final XOR
initial value before processing: 0xFFFFFFFF
Known-answer vectors
| input | CRC32 hex |
|---|---|
| `""` | 0x00000000 |
| `"a"` | 0xE8B7BE43 |
| `"123456789"` | 0xCBF43926 |
Pin these in every language's unit tests. They are the canonical CRC-32 IEEE vectors used by zlib, gzip, and Ethernet. (LevelDB's own log uses CRC-32C, so its checksums will not match these.)
Rust outline
pub fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c = (c >> 8) ^ TABLE[((c & 0xff) ^ b as u32) as usize];
    }
    c ^ 0xFFFF_FFFF
}
Go outline
func Crc32IEEE(b []byte) uint32 {
c := uint32(0xFFFFFFFF)
for _, x := range b {
c = (c >> 8) ^ table[byte(c)^x]
}
return c ^ 0xFFFFFFFF
}
C++ outline
inline std::uint32_t Crc32Ieee(std::span<const std::uint8_t> b) noexcept {
std::uint32_t c = 0xFFFFFFFFu;
for (auto x : b) c = (c >> 8) ^ kTable[(c & 0xff) ^ x];
return c ^ 0xFFFFFFFFu;
}
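The three outlines above index a prebuilt 256-entry table (`TABLE` / `table` / `kTable`) that is never shown. A sketch of how it could be generated, together with a streaming interface split into init / update / finish, is below. The function names are illustrative, and for brevity `crc32_ieee` rebuilds the table on every call; a real implementation would build it once at startup as described earlier.

```rust
/// Builds the 256-entry lookup table for the reflected IEEE polynomial.
pub fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for i in 0..256u32 {
        let mut c = i;
        for _ in 0..8 {
            // Shift out one bit; XOR in the polynomial when that bit was set.
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
        table[i as usize] = c;
    }
    table
}

/// Initial register value before any bytes are processed.
pub fn crc32_init() -> u32 {
    0xFFFF_FFFF
}

/// Feeds one chunk into a running CRC; call repeatedly for streamed input.
pub fn crc32_update(table: &[u32; 256], mut crc: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        crc = (crc >> 8) ^ table[((crc & 0xFF) ^ b as u32) as usize];
    }
    crc
}

/// Applies the final XOR to produce the published checksum.
pub fn crc32_finish(crc: u32) -> u32 {
    crc ^ 0xFFFF_FFFF
}

/// One-shot convenience wrapper (rebuilds the table each call, for brevity).
pub fn crc32_ieee(bytes: &[u8]) -> u32 {
    let table = make_table();
    crc32_finish(crc32_update(&table, crc32_init(), bytes))
}
```

Splitting the computation this way means a large payload can be checksummed chunk by chunk and must give the same result as the one-shot call, which is a cheap extra unit test alongside the known-answer vectors.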
Trap: which CRC?
There are at least eight in common use. IEEE (reflected, init 0xFFFFFFFF, final XOR 0xFFFFFFFF) is what we want. An un-reflected implementation of 0x04C11DB7 does not produce the same values despite using the same polynomial.
If your test gives 0xE3069283 for "123456789", you're using CRC-32C (Castagnoli). Different polynomial, different table.
Step 2 — Append, sync, iterate
Goal
Implement Wal::open / append / sync / iter consistently in all three languages.
API recap
open(path) -> Wal // O_RDWR | O_CREAT, scan-and-truncate the tail
append(payload) -> u64 // returns the record's start offset
sync() -> () // fdatasync (or fsync on platforms without it)
len() -> u64 // bytes in the live valid region
iter(path) -> Iterator // yields each payload until first short/bad record
open — scan & truncate
The crucial subroutine. After a crash, the file may end in a partial header
or partial payload. open finds the last valid record's end and truncates
the file to that length, so subsequent appends append cleanly.
pos = 0
loop:
if file_size - pos < 8: break // not enough for header
read 8 bytes at pos: (len, crc)
if len == 0: break // EOF sentinel / sparse hole
if pos + 8 + len > file_size: break // payload short
read len bytes at pos+8
if crc32(payload) != crc: break
pos += 8 + len
if pos != file_size:
ftruncate(file, pos)
return Wal { fd, write_offset = pos }
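The scan loop above translates fairly directly into Rust. The sketch below separates the pure scan (`valid_prefix_len`, easy to unit-test on byte buffers) from the file handling (`open_and_truncate`). Both names are hypothetical, the whole file is read into memory for simplicity, and the CRC is a slow bitwise version included only to keep the example self-contained.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read};
use std::path::Path;

const HEADER_LEN: usize = 8;

// Bitwise CRC-32 (reflected IEEE); slow but table-free, to keep the sketch short.
fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c = 0xFFFF_FFFFu32;
    for &b in bytes {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}

fn read_u32_le(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Returns the byte length of the longest prefix made of valid records.
pub fn valid_prefix_len(buf: &[u8]) -> usize {
    let mut pos = 0usize;
    loop {
        if buf.len() - pos < HEADER_LEN {
            break; // not enough bytes for a header
        }
        let len = read_u32_le(buf, pos) as usize;
        let crc = read_u32_le(buf, pos + 4);
        if len == 0 {
            break; // EOF sentinel / hole of zeros
        }
        if len > buf.len() - pos - HEADER_LEN {
            break; // payload short
        }
        let payload = &buf[pos + HEADER_LEN..pos + HEADER_LEN + len];
        if crc32_ieee(payload) != crc {
            break; // corrupt record
        }
        pos += HEADER_LEN + len;
    }
    pos
}

/// Opens (or creates) the log, drops any bad tail, and returns the file
/// together with the offset at which the next record should be written.
pub fn open_and_truncate(path: &Path) -> io::Result<(File, u64)> {
    let mut file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?; // fine for a lab-sized log; stream for large ones
    let valid = valid_prefix_len(&buf) as u64;
    if valid != buf.len() as u64 {
        file.set_len(valid)?; // ftruncate to the end of the last good record
    }
    Ok((file, valid))
}
```

Comparing `len` against the remaining byte count, rather than computing `pos + 8 + len`, avoids any overflow concern when a corrupt header carries a huge length.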
append
len = payload.length
if len == 0: return error // len == 0 is reserved as the EOF sentinel
hdr[0..4] = len.to_le_bytes()
hdr[4..8] = crc32(payload).to_le_bytes()
pwrite(fd, hdr, write_offset)
pwrite(fd, payload, write_offset + 8)
offset_returned = write_offset
write_offset += 8 + len
return offset_returned
We do not fsync inside append. Callers do that explicitly via sync() to enable group commit.
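As a rough Rust counterpart to the pseudocode, the sketch below frames a record and writes it with a positioned write. It departs from the pseudocode in one respect: header and payload are assembled into one buffer and written with a single `write_all_at`, which narrows (but does not eliminate) the window for a header-only tail. The `Wal` struct shown here is a stripped-down assumption with a hypothetical `create` constructor that starts from an empty file rather than scanning; it is Unix-only because of `std::os::unix::fs::FileExt`.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

// Bitwise CRC-32 (reflected IEEE); slow but table-free, to keep the sketch short.
fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c = 0xFFFF_FFFFu32;
    for &b in bytes {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}

pub struct Wal {
    file: File,
    write_offset: u64,
}

impl Wal {
    /// Hypothetical constructor: creates an empty log (no tail scan).
    pub fn create(path: &Path) -> io::Result<Wal> {
        let file = OpenOptions::new()
            .read(true).write(true).create(true).truncate(true)
            .open(path)?;
        Ok(Wal { file, write_offset: 0 })
    }

    /// Appends one record and returns its start offset. Does not sync.
    pub fn append(&mut self, payload: &[u8]) -> io::Result<u64> {
        if payload.is_empty() || payload.len() > u32::MAX as usize {
            // len == 0 is reserved as the EOF sentinel.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "payload must be 1..=u32::MAX bytes",
            ));
        }
        let mut rec = Vec::with_capacity(8 + payload.len());
        rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        rec.extend_from_slice(&crc32_ieee(payload).to_le_bytes());
        rec.extend_from_slice(payload);
        self.file.write_all_at(&rec, self.write_offset)?; // positioned write
        let start = self.write_offset;
        self.write_offset += rec.len() as u64;
        Ok(start)
    }

    /// Commit boundary chosen by the caller.
    pub fn sync(&self) -> io::Result<()> {
        self.file.sync_data()
    }
}
```

A caller wanting per-record durability calls `sync()` after every `append`; a caller wanting group commit calls it once per batch.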
sync
- Linux: `fdatasync(fd)`
- macOS: `fcntl(fd, F_FULLFSYNC, 0)` for true device-level sync; falls back to `fsync(fd)` if `F_FULLFSYNC` fails (e.g., not on APFS).
- Windows: `FlushFileBuffers(handle)` (out of scope here).
In this lab we use fdatasync (Linux) and fsync (macOS) for simplicity; production should consider F_FULLFSYNC on macOS because plain fsync does not guarantee device-level durability on Apple's filesystems.
iter — read-only replay
Mirrors open's scan loop but yields each payload instead of advancing a write cursor. Stops on the same conditions (len == 0, short header, short payload, bad CRC). Never panics on garbage.
Tests to pin behavior
| # | Test | Expected |
|---|---|---|
| T1 | Append "A", "B", reopen, iter → ["A", "B"] | Both records returned in order |
| T2 | Append, truncate WAL by 1 byte (cut payload), reopen, iter | Last record dropped |
| T3 | Append, flip a payload byte, iter | Reader stops at bad CRC |
| T4 | Append, write \0\0\0\0\0\0\0\0 past EOF, reopen | File length restored to pre-garbage size |
| T5 | `append()` returned offsets are strictly increasing and equal to the file size before that append | Yes |
Gotchas
- macOS `fsync` does not flush the disk write cache. Use `F_FULLFSYNC` for tests that must outlive a power loss.
- Rust `File::write_all` issues `write(2)` directly; there is no userspace buffer unless you wrap the file in a `BufWriter`, and `flush()` never implies `fsync` (use `File::sync_data` / `sync_all` for durability). We use positioned writes via `nix` / `std::os::unix::fs::FileExt::write_all_at`, which keeps a userspace buffer out of the path entirely.
- Go `os.File.Write` is unbuffered by default, but `bufio.Writer` is not. Make sure your `Wal` does not wrap the file in a `bufio.Writer`; that defers writes invisibly and confuses `sync`.
Step 3 — Group commit benchmark
Goal
Quantify the cost of fsync and the throughput win from group commit.
Workload
bench-group PATH N BATCH:
for i in 0..N:
append(payload)
if (i+1) % BATCH == 0: sync()
sync() // final
PATH is a brand-new file each run. N = 50_000 is a good starting point on a modern SSD.
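The workload loop might look like this in Rust. It is a simplified sketch: instead of going through the lab's `Wal` type it writes fixed 72-byte placeholder records (standing in for an 8-byte header plus 64-byte payload) straight to a file, and the names `bench_group` and `throughput` are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;
use std::time::{Duration, Instant};

/// Appends `n` records, syncing once per `batch`; returns (elapsed, syncs issued).
pub fn bench_group(path: &Path, n: usize, batch: usize) -> io::Result<(Duration, usize)> {
    assert!(batch > 0, "batch must be at least 1");
    // truncate(true) gives the brand-new file the benchmark expects.
    let mut file = OpenOptions::new().write(true).create(true).truncate(true).open(path)?;
    let record = [0xABu8; 72]; // placeholder for an 8-byte header + 64-byte payload
    let mut syncs = 0usize;
    let start = Instant::now();
    for i in 0..n {
        file.write_all(&record)?;
        if (i + 1) % batch == 0 {
            file.sync_data()?;
            syncs += 1;
        }
    }
    file.sync_data()?; // final sync covers a trailing partial batch
    syncs += 1;
    Ok((start.elapsed(), syncs))
}

/// Records per second for a finished run.
pub fn throughput(n: usize, elapsed: Duration) -> f64 {
    n as f64 / elapsed.as_secs_f64()
}
```

Dividing the elapsed time by the returned sync count gives the per-sync latency to compare against the table, and `throughput` gives the rec/s column.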
Numbers to look for (M2 Pro, APFS, 64-byte payload)
| Batch | Throughput | Avg latency / sync | Bytes flushed / sync |
|---|---|---|---|
| 1 | ~1,800 rec/s | ~560 µs | ~72 B |
| 8 | ~12,000 rec/s | ~670 µs | ~576 B |
| 64 | ~110,000 rec/s | ~580 µs | ~4.6 KB |
| 512 | ~260,000 rec/s | ~2.0 ms | ~37 KB |
| 4096 | ~310,000 rec/s | ~13 ms | ~295 KB |
Two effects worth noting:
- Sync time is roughly constant up to ~4KB: the bottleneck is the per-fsync overhead (syscall + journal commit), not the byte count.
- Returns diminish past batch ~256: bandwidth becomes the next limit. Past ~4096 you start hitting tail-latency cliffs.
What "broken" looks like
- Per-record throughput is the same as group=64: your `sync()` isn't doing anything (no-op, wrong fd, or `bufio.Writer` swallowing the write).
- Throughput keeps climbing past group=4096: you may not be calling `sync()` at all between batches.
- macOS numbers look impossibly fast: plain `fsync` does not flush the device cache. Re-run with `F_FULLFSYNC` to compare.
Comparing to a Linux box
On NVMe + ext4:
| Batch | Throughput |
|---|---|
| 1 | ~3,000 rec/s |
| 64 | ~180,000 rec/s |
| 4096 | ~600,000 rec/s |
The shape is identical; absolute numbers depend on the device's flush latency.
Bloom Filters and Hashing
Status: complete — runnable in Rust, Go, C++.
1. What Is It
A Bloom filter is a probabilistic set: add(x) always succeeds; contains(x) returns either definitely not present or probably present. It uses a fixed-size bit array of m bits and k independent hash functions; add(x) sets bits at positions h_1(x) mod m, …, h_k(x) mod m; contains(x) returns true iff all those bits are set.
add("foo"):
h1=37 h2=812 h3=4 → bits[37]=bits[812]=bits[4]=1
contains("bar"):
h1=99 h2=812 h3=120 → bits[99]=0 ⇒ definitely absent
contains("foo"):
h1=37 h2=812 h3=4 → all 1 ⇒ probably present
False positives are inherent (any other key that hits the same k bits looks present); false negatives are impossible (a stored key set its bits, and we never unset).
2. Why It Matters
| Without a bloom filter | With one |
|---|---|
| LevelDB / RocksDB `Get(k)` on a miss probes every SSTable's index → many disk reads | One in-memory bit-test per SSTable rejects 99% of misses |
| Distributed cache: "do I have this key?" requires a network RTT | Local bit-test on a 1 MB filter answers in nanoseconds |
| Spell-checker holds full dictionary | Few bits per word |
| Webcrawler revisits the same URL | A few bits per URL prevent recrawl |
Filter sizes are tiny: at the textbook optimum (~9.6 bits/key for 1% FPR) a million keys fit in 1.2 MB. Cache-resident.
3. How It Works
For n inserts into m bits with k hashes (assuming independent uniform hashes), the probability a given bit is still zero is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false-positive rate is
$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$$
Differentiating with respect to k yields the optimal hash count
$$k^* = \frac{m}{n}\ln 2 \approx 0.693 \cdot \frac{m}{n}$$
and the achievable FPR at $k^*$:
$$\text{FPR}^* \approx 0.6185^{\,m/n}$$
So 10 bits/key ⇒ ~1% FPR with 7 hashes; 20 bits/key ⇒ ~0.01% with 14 hashes.
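These formulas are easy to sanity-check numerically. The helper functions below are a small sketch (hypothetical names, plain `f64` arithmetic) that evaluates the FPR expression, the optimal `k`, and the inverse relation for bits per key at optimal `k`, which follows from setting 0.6185^(m/n) = p and solving for m/n = -ln(p) / (ln 2)².

```rust
use std::f64::consts::LN_2;

/// Theoretical FPR for m bits, n keys, k hashes: (1 - e^{-kn/m})^k.
pub fn bloom_fpr(m: f64, n: f64, k: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

/// Optimal (real-valued) hash count for a given bits-per-key ratio: (m/n) ln 2.
pub fn optimal_k(bits_per_key: f64) -> f64 {
    bits_per_key * LN_2
}

/// Bits per key needed to reach `target_fpr` at optimal k: -ln(p) / (ln 2)^2.
pub fn bits_per_key_for(target_fpr: f64) -> f64 {
    -target_fpr.ln() / (LN_2 * LN_2)
}
```

For example, `bits_per_key_for(0.01)` evaluates to about 9.59, `optimal_k(10.0)` to about 6.93 (rounded to 7 in practice), and `bloom_fpr(10.0, 1.0, 7.0)` to about 0.0082, matching the "10 bits/key ⇒ ~1%" rule of thumb.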
Kirsch–Mitzenmacher double hashing
We do not compute k independent hashes. Per Kirsch & Mitzenmacher (2006), it is sufficient — with no measurable FPR penalty — to compute one 64-bit hash, split it into halves h1 and h2, and synthesize:
g_i(x) = (h1(x) + i * h2(x)) mod m    for i = 0..k-1
This is what LevelDB, RocksDB, and most production filters do.
In this lab the underlying 64-bit hash is FNV-1a64 of the key, then mixed once through SplitMix64 to spread the bits. (FNV-1a64 alone is biased in its high bits, and the Kirsch–Mitzenmacher splitting cares about both halves being well-distributed.)
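Putting the pieces together, a filter along these lines could look like the sketch below. Several details are assumptions rather than the lab's actual choices: the 64-bit hash is split into its upper and lower 32 bits for `h1` and `h2`, the SplitMix64 step includes the usual additive constant, and bit `i` is stored in byte `i / 8` at bit position `i % 8`. The FNV-1a and SplitMix64 constants are the standard published ones.

```rust
pub struct Bloom {
    bits: Vec<u8>, // bit i lives in byte i / 8 at bit position i % 8 (assumed)
    m: u64,
    k: u32,
}

fn fnv1a64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xCBF2_9CE4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01B3); // FNV 64-bit prime
    }
    h
}

fn splitmix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// The k probe positions g_i = (h1 + i * h2) mod m (Kirsch-Mitzenmacher).
fn positions(key: &[u8], m: u64, k: u32) -> impl Iterator<Item = u64> {
    let h = splitmix64(fnv1a64(key));
    let (h1, h2) = (h >> 32, h & 0xFFFF_FFFF);
    (0..u64::from(k)).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
}

impl Bloom {
    pub fn new(m: u64, k: u32) -> Self {
        assert!(m > 0 && k > 0, "m and k must be positive");
        Bloom { bits: vec![0u8; ((m + 7) / 8) as usize], m, k }
    }

    pub fn add(&mut self, key: &[u8]) {
        for p in positions(key, self.m, self.k) {
            self.bits[(p / 8) as usize] |= 1u8 << (p % 8);
        }
    }

    /// false means definitely absent; true means probably present.
    pub fn contains(&self, key: &[u8]) -> bool {
        positions(key, self.m, self.k)
            .all(|p| (self.bits[(p / 8) as usize] & (1u8 << (p % 8))) != 0)
    }
}
```

A typical use is `Bloom::new(10 * n, 7)` for `n` expected keys, which targets roughly a 1% false-positive rate; every added key must then report `contains == true`.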
On-disk / on-wire layout
┌─────────┬─────────┬───────────────────────────┐
│ k (u32) │ m (u64) │ bits (⌈m/8⌉ bytes, LE) │
└─────────┴─────────┴───────────────────────────┘
Identical layout across Rust/Go/C++ so all three can read each other's filters byte-for-byte.
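A possible encoder and decoder for this layout is sketched below. The function names are hypothetical, and the decoder's validation rules (rejecting `k == 0`, `m == 0`, or a bit-array length that does not equal ⌈m/8⌉) are assumptions about what a careful reader would check rather than documented behavior.

```rust
/// Number of bytes needed to hold m bits, i.e. ceil(m / 8), without overflow.
fn bytes_for(m: u64) -> u64 {
    m / 8 + u64::from(m % 8 != 0)
}

/// Serializes a filter as k (u32 LE) || m (u64 LE) || bits.
pub fn encode_filter(k: u32, m: u64, bits: &[u8]) -> Vec<u8> {
    assert_eq!(bits.len() as u64, bytes_for(m), "bits must be ceil(m/8) bytes");
    let mut out = Vec::with_capacity(12 + bits.len());
    out.extend_from_slice(&k.to_le_bytes());
    out.extend_from_slice(&m.to_le_bytes());
    out.extend_from_slice(bits);
    out
}

/// Parses the layout back into (k, m, bits); None if the buffer is malformed.
pub fn decode_filter(buf: &[u8]) -> Option<(u32, u64, &[u8])> {
    if buf.len() < 12 {
        return None; // header is 4 + 8 bytes
    }
    let mut k4 = [0u8; 4];
    k4.copy_from_slice(&buf[0..4]);
    let mut m8 = [0u8; 8];
    m8.copy_from_slice(&buf[4..12]);
    let (k, m) = (u32::from_le_bytes(k4), u64::from_le_bytes(m8));
    let bits = &buf[12..];
    if k == 0 || m == 0 || bits.len() as u64 != bytes_for(m) {
        return None;
    }
    Some((k, m, bits))
}
```

Because every field has a fixed width and byte order, the same 12-byte header parse works unchanged in Go and C++, which is what makes the byte-for-byte cross-language check possible.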
4. Core Terminology
| Term | Definition |
|---|---|
| `m` | Bit-array size in bits. |
| `n` | Number of distinct keys inserted. |
| `k` | Number of hash functions per key. |
| FPR | False-positive rate at query time. False negatives are impossible. |
| Bits/key | m / n. The single knob that determines achievable FPR. |
| Saturation | Once a large fraction of bits are 1, FPR climbs sharply; filters should be sized for the maximum expected n. |
| Counting Bloom | Variant that supports remove by storing 4-bit counters per slot. Costs 4× memory. |
| Cuckoo filter | Modern alternative: supports delete, often lower space at FPR ≤ 1%, harder to size. |
| Xor filter | Static (build once, query many) — best space efficiency, but no incremental inserts. |
5. Mental Models
- Bloom is a hash-collision amplifier. One hash collision is rare; needing `k` of them simultaneously is rarer. The filter trades memory for that compounding.
- A Bloom filter is a negative index. Use it to avoid work; never use it to prove presence.
- Hash quality matters less than independence. Once individual bits are well-distributed, the Kirsch–Mitzenmacher trick gives you arbitrarily many "independent" hashes for free.
- You can compose them. Union ⇒ bitwise OR (with same `m`, `k`). Approximate intersection ⇒ bitwise AND (overestimates).
6. Common Misconceptions
- "FPR depends on the number of bits set." It depends on `m`, `n`, and `k`. Two filters with the same fill factor but different `k` have different FPR.
- "Bigger `k` is always better." Past $k^*$, FPR climbs again because each insert sets more bits, accelerating saturation.
- "I can resize a Bloom filter." No: bit positions depend on `m`. Resize by building a fresh filter from the underlying data (or by maintaining a scalable filter, which is a series of growing Bloom filters).
- "Cryptographic hashes are required." Wasted CPU. Anything well-distributed and fast (FNV, xxhash, MurmurHash3, CityHash) works.
- "`remove` would be cheap if I just cleared the bits." It would also clear bits set by every other key that shares positions. Counting Bloom exists for this reason.
7. Interview Talking Points
- Derive $k^* = (m/n) \ln 2$ and the resulting FPR formula from first principles.
- Explain Kirsch–Mitzenmacher and why it doesn't increase FPR (citation: Less Hashing, Same Performance, ESA 2006).
- Walk through how RocksDB pairs a Bloom filter with each SSTable — and how the new ribbon filter improves on that.
- Quantify: "for 1% FPR you need ~10 bits/key; for one-in-a-million, ~30."
- Contrast Bloom vs. Cuckoo vs. xor filters and their tradeoffs.
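The first and fourth talking points can be worked through as follows. This is the standard textbook derivation, assuming hash outputs are independent and uniform over the `m` bits.

After `n` inserts with `k` probes each, a given bit is still zero with probability

$$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

so a query for an absent key sees `k` set bits with probability

$$p(k) = \left(1 - e^{-kn/m}\right)^{k}$$

Substituting $q = e^{-kn/m}$, so that $k = -\frac{m}{n}\ln q$, gives

$$\ln p = -\frac{m}{n}\,\ln q\,\ln(1 - q)$$

The product $\ln q \,\ln(1-q)$ is symmetric in $q$ and $1-q$ and is largest at $q = 1/2$, which minimizes $p$:

$$k^* = \frac{m}{n}\ln 2, \qquad p^* = 2^{-k^*} \approx 0.6185^{\,m/n}$$

Inverting for the bits per key, $\frac{m}{n} = \frac{-\ln p}{(\ln 2)^2}$, yields about 9.59 bits/key for $p = 0.01$ and about 28.8 bits/key for $p = 10^{-6}$, matching the "~10" and "~30" figures quoted above.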
8. Connections to Other Labs
- db-06: every SSTable carries an embedded Bloom filter.
- db-07: compaction rebuilds filters because input filters can't be merged exactly.
- db-08: filter block is cached separately from data blocks.
- db-09: `LookupKey` flow consults the per-table filter before reading the index block.
- db-21: prefix Bloom filters, partitioned filters, ribbon filters.
References — Bloom Filters and Hashing
Foundational papers
- Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM 1970. The original 2-page paper. https://dl.acm.org/doi/10.1145/362686.362692
- Adam Kirsch & Michael Mitzenmacher, Less Hashing, Same Performance: Building a Better Bloom Filter, ESA 2006. https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
- Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher, Cuckoo Filter: Practically Better Than Bloom, CoNEXT 2014. https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
- Thomas M. Graf & Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, JEA 2020. https://arxiv.org/abs/1912.08258
- Peter C. Dillinger & Stefan Walzer, Ribbon Filter: Practically Smaller Than Bloom and Xor, 2021. https://arxiv.org/abs/2103.02515
Production code to read
- LevelDB filter policy: https://github.com/google/leveldb/blob/main/util/bloom.cc
- RocksDB filter blocks: https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter
- RocksDB ribbon filter implementation: https://github.com/facebook/rocksdb/blob/main/util/ribbon_impl.h
Survey & blog posts
- Daniel Lemire, "All about Bloom filters" series: https://lemire.me/blog/tag/bloom-filter/
- Jeff Dean's classic numbers-every-programmer-should-know — useful when sizing filters against disk-seek and RAM costs.
- Hadron, "How RocksDB sizes filters": https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
Hash functions
- Fowler–Noll–Vo (FNV) reference: http://www.isthe.com/chongo/tech/comp/fnv/
- SplitMix64 (mixer from Steele, Lea & Flood's SplitMix generator; reference C implementation by Sebastiano Vigna): https://prng.di.unimi.it/splitmix64.c
- Austin Appleby, MurmurHash3: https://github.com/aappleby/smhasher/wiki/MurmurHash3
- xxHash by Yann Collet: https://github.com/Cyan4973/xxHash
Related labs
- db-02: data structures for storage; its FNV-1a64 hash is reused and wrapped here.
- db-06: SSTable format embeds the filter generated here.
Analysis — Bloom Filters and Hashing
Problem statement
Build a fixed-memory probabilistic set that:
- Never reports false negatives (lookups for present keys always return `true`).
- Reports false positives at a tunable, well-understood rate.
- Is fast enough to consult in the hot path of a key-value store lookup (≈ nanoseconds).
- Has an on-disk representation identical across Rust, Go, and C++ so the same filter built in any language can be read by any other.
Constraints
| Constraint | Why it matters |
|---|---|
| Compact (≤ 2 bytes/key for 5% FPR) | The filter is loaded into RAM beside the table it indexes. |
| Constant-time `add` and `contains` | Hot path of `Get(k)`. |
| Deterministic across languages | Cross-language tests must pass. |
| Single 64-bit hash | We synthesize k indices via Kirsch–Mitzenmacher — keeps CPU low. |
| No remove | Pure Bloom. Counting variants left to db-21. |
Design decisions
- Base hash = FNV-1a64 then SplitMix64 mixing. FNV-1a64 is trivial to implement identically across languages; SplitMix64 finalizing fixes its weak avalanche so the upper and lower 32 bits are both well-distributed. The two 32-bit halves become `h1` and `h2` for double hashing.
- Kirsch–Mitzenmacher: `g_i = h1 + i*h2`, all u64 arithmetic, with the final `mod m` done by plain `%`. Lemire's u128 multiply-shift reduction (Daniel Lemire, Fast Random Integer Generation in an Interval, 2018) avoids the `div` but produces different indices, so it is discussed in Step 01 and not used here.
- Bit array is little-endian byte-packed: bit `i` lives in `(bytes[i/8] >> (i%8)) & 1`. Same convention LevelDB and RocksDB use.
- Header = `k` (u32 LE) || `m` (u64 LE). 12 bytes total. We deliberately put `k` first so a partial read of just the header reveals the hash count without needing `m`.
- No checksum on the filter itself. A 0→1 bit flip only adds a few keys' worth of false positives, but a 1→0 flip can introduce false negatives, so corruption detection is delegated to the page-level checksumming in db-06 (SSTable).
- `with_fpr(n, fpr)` constructor. Picks `m = ceil(-n * ln(fpr) / (ln 2)^2)` and `k = round((m/n) * ln 2)`. Caps `k` at 30 to avoid degenerate sizing for absurdly small FPRs.
Why this design over alternatives
- vs MurmurHash3 / xxhash: faster and arguably higher quality, but each is hundreds of lines to re-implement identically in three languages. FNV+SplitMix is 12 lines per language and indistinguishable in our use case.
- vs `k` independent hashes: roughly `k`× the hashing CPU for no measurable FPR change (Kirsch & Mitzenmacher 2006).
- vs Cuckoo / xor filters: more space-efficient at low FPR but much more code. Worth a separate lab: db-21.
- vs in-language hashers (Rust `std::hash`, Go `hash/maphash`, C++ `std::hash<string>`): per-language differences. Go's `maphash` is randomized per process; C++ `std::hash<string>` is implementation-defined. None of them survive cross-language interop.
Failure modes addressed
| Failure | How |
|---|---|
| FPR much higher than claimed | Test V4 (`fpr_within_2x`): empirically measure FPR with 100k random absent queries and assert it is at most 2× the theoretical value. |
| Bit packing mismatched across languages | Cross-language test (`scripts/cross_test.sh`): each writer dumps its filter bytes; each reader queries it for known-present & known-absent keys. |
| Endian mismatch in header | All header fields encoded little-endian explicitly. |
| Hash function mismatch | Test V1 (`fnv1a64_known_vectors`): known FNV-1a64 vectors (""→0xcbf29ce484222325, "foobar"→0x85944171f73967e8) checked in every language. |
| Saturation at n ≫ planned | `contains` still works; FPR climbs gracefully. Filter constructors document the assumed n. |
Failure modes NOT addressed
- Concurrent inserts. Single-writer model. Concurrent `add` calls can lose updates when two writers read-modify-write the same byte. Lock externally or use an atomic OR per byte; covered in db-08 / db-21.
- Adversarial keys. FNV-1a64 is not cryptographic, so an attacker can craft collisions to inflate FPR. Use SipHash / xxh3 (with a secret seed) if filter inputs are attacker-controlled.
- Deletion. Use a counting Bloom or a cuckoo filter. See db-21.
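The "atomic OR per byte" option mentioned for concurrent inserts could look roughly like the sketch below. It is an illustration only, not this lab's implementation: the function names are hypothetical, and `Ordering::Relaxed` is an assumption that is adequate for setting independent bits but says nothing about when readers on other threads observe them.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Set bit `i` (LSB-first within each byte) with an atomic OR, so two
/// writers touching the same byte cannot lose each other's update.
pub fn set_bit_atomic(bits: &[AtomicU8], i: u64) {
    let idx = (i / 8) as usize;
    let mask = 1u8 << (i % 8);
    bits[idx].fetch_or(mask, Ordering::Relaxed);
}

/// Test bit `i` using the same LSB-first convention.
pub fn get_bit_atomic(bits: &[AtomicU8], i: u64) -> bool {
    let idx = (i / 8) as usize;
    (bits[idx].load(Ordering::Relaxed) >> (i % 8)) & 1 == 1
}
```

Because bits are only ever set, never cleared, a reader that races with a writer can at worst see a slightly older filter; it cannot see a corrupted byte.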
Execution — How to Build and Run
Quick start (per language)
# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/bloombench --help
# Go
cd src/go
go test ./...
go build -o bin/bloombench ./cmd/bloombench
./bin/bloombench --help
# C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build
./build/bloombench --help
CLI: bloombench
A single binary per language. Subcommands have the same shape across all three so cross-language tests can shell out polyglot.
| Subcommand | Args | Behaviour |
|---|---|---|
| `hash STR` | one string | Print `fnv1a64=… splitmix=… h1=… h2=…` for the given input |
| `build PATH N FPR` | output path, key count, target FPR | Insert keys `key0..key{N-1}` and write the filter to PATH |
| `query PATH KEY` | filter path, key | Print `present` or `absent` for one key |
| `query-file PATH KEYS_FILE` | filter path, file with one key per line | Print results for each |
| `fpr-test PATH N M` | filter path (built with N keys), M random absent keys | Measure FPR and print observed vs theoretical |
Library API
fnv1a64(bytes) -> u64
splitmix64(u64) -> u64
mix64(bytes) -> u64 // = splitmix64(fnv1a64(bytes))
BloomFilter::new(m_bits, k) -> Bloom
BloomFilter::with_fpr(n, fpr) -> Bloom
- picks m, k optimally; caps k at 30
Bloom::add(bytes)
Bloom::contains(bytes) -> bool
Bloom::k(), Bloom::m_bits(), Bloom::m_bytes()
Bloom::encode() -> Vec<u8> // header || bits
Bloom::decode(bytes) -> Bloom // inverse, validates header
On-disk / on-wire layout
┌─────────┬─────────┬─────────────────────────┐
│ k (u32) │ m (u64) │ bits ⌈m/8⌉ bytes │
│ │ │ bit i = bytes[i/8] >> (i mod 8) & 1
└─────────┴─────────┴─────────────────────────┘
4 8 ⌈m/8⌉
All integers little-endian. No padding. No internal checksum.
Verifying
./scripts/verify.sh # per-language unit + property tests
./scripts/cross_test.sh # writer/reader cross-product over {go,rust,cpp}
Observation — Looking at the Bits
1. Hexdump a freshly built filter
./build/bloombench build /tmp/bf 4 0.05
xxd /tmp/bf
00000000: 0500 0000 1f00 0000 0000 0000 1206 92 ...............
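Reading the header: the first 4 bytes are k (here 5) and the next 8 bytes are m (here 31 bits), so the body is ⌈31/8⌉ = 4 bytes. The trailing 12 06 92 … is the bit vector with 4 keys mixed in. Treat this dump as illustrative and truncated: the exact header values depend on how m and k are rounded (the Step 03 formulas give m = 25, k = 4 for n = 4, p = 0.05), and the body bytes depend on the hash chain.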
For 1000 keys at 1% FPR you should see roughly 9.6 bits/key ⇒ 1200 bytes of bits, and k ≈ 7.
2. Sanity-check the hash chain
./build/bloombench hash foobar
# fnv1a64=85944171f73967e8 splitmix=... h1=... h2=...
Known FNV-1a64 vectors (used in tests):
| Input | fnv1a64 |
|---|---|
| `""` | 0xcbf29ce484222325 |
| `"a"` | 0xaf63dc4c8601ec8c |
| `"foobar"` | 0x85944171f73967e8 |
All three languages must print the same fnv1a64 and same splitmix64 for any given input. If they don't, cross-language interop is dead on arrival.
3. Empirical FPR matches the formula
./build/bloombench build /tmp/bf 100000 0.01
./build/bloombench fpr-test /tmp/bf 100000 1000000
# expected: observed=0.0098 theoretical≈0.0100 (within ±20% with 1M samples)
If observed FPR is much higher than theoretical:
- `k` is wrong (probably `0` or `1` due to integer truncation; check the `with_fpr` math).
- Hash is biased (FNV without SplitMix mixing: the high bits are clumped).
- `mod m` step has a bias (using `h % m` with non-prime `m` is OK; using `h & (m-1)` only works when `m` is a power of two).
If observed FPR is much lower: probably double-counting or your "random absent" key generator overlaps with the present set — verify input.
4. Bit density vs FPR
for fpr in 0.10 0.05 0.01 0.001 0.0001; do
./build/bloombench build /tmp/bf 10000 $fpr
ls -l /tmp/bf
done
Sample file sizes (12-byte header + body) for n = 10 000:
| FPR | Bytes | Bits/key |
|---|---|---|
| 0.10 | ~6 000 | ~4.8 |
| 0.05 | ~7 810 | ~6.2 |
| 0.01 | ~11 990 | ~9.6 |
| 0.001 | ~17 980 | ~14.4 |
| 0.0001 | ~23 970 | ~19.2 |
The 9.6 bits/key heuristic for 1% FPR is the one most often quoted in interviews.
5. Cross-language byte-identical filters
for lang in go rust cpp; do
./src/$lang/.../bloombench build /tmp/bf.$lang 1000 0.01
done
md5sum /tmp/bf.* # all three identical
If any digest differs, suspect (in order): endian, bit ordering inside the byte, integer types of the header, or hash mismatch.
What "working" looks like
- Bytes 0..3 = `k`, bytes 4..11 = `m`, bytes 12..end = bits. No padding.
- Empirical FPR is within 2× of theoretical for any (n, fpr) you try.
- All three languages produce identical filters and read each other's filters.
What "broken" looks like
- `contains(k)` returns `false` for a key you just inserted ⇒ false negative ⇒ critical bug. Likely indexing math: `set` and `get` disagree about bit-within-byte.
- FPR is 100% ⇒ all bits are 1 ⇒ either `m` was rounded down to 0 or you're indexing past the bit array.
- FPR is 0% with realistic load ⇒ `add` is a no-op or `contains` always returns false on the "absent" path.
- Cross-language readers disagree ⇒ print the first 16 bytes of each filter and the first three `h1`, `h2` values for a known key; one of them is wrong.
Verification — What to Test and How
Per-language property tests
| # | Test | Pass if |
|---|---|---|
| V1 | fnv1a64_known_vectors | "" → 0xcbf29ce484222325; "a" → 0xaf63dc4c8601ec8c; "foobar" → 0x85944171f73967e8 |
| V2 | splitmix64_known_vectors | splitmix64(0) = 0xe220a8397b1dcdaf; splitmix64(0xdeadbeef) = 0x4adfb90f68c9eb9b |
| V3 | no_false_negatives | Insert N=10 000 random keys (seeded); contains returns true for every one |
| V4 | fpr_within_2x | Build for n=10 000 at fpr=0.01; query 100 000 random absent keys; observed FPR ≤ 2× theoretical |
| V5 | optimal_k_formula | with_fpr(1000, 0.01) returns k=7 and 9 580 ≤ m ≤ 9 620 (allow ±0.5%) |
| V6 | encode_decode_roundtrip | encode → decode → query the same keys: identical results |
| V7 | header_layout | First 4 bytes = k LE; next 8 = m LE; payload length = ⌈m/8⌉ |
| V8 | empty_filter_rejects_all | New filter with m=64, k=3; contains returns false for 1000 random keys |
Cross-language test
scripts/cross_test.sh performs the writer × reader matrix for {go, rust, cpp}²:
- Each writer builds a filter for the same fixed-seed key set (1 000 keys).
- Filters must be byte-identical (`md5sum` over filter file).
- Each reader opens each writer's filter and runs:
  - 1 000 known-present queries → must all return `present`
  - 1 000 known-absent queries (different seed) → results must match across readers
This catches:
- Endian or bit-order bugs in the header / bit array.
- Hash mismatch (`fnv1a64` or `splitmix64` differs).
- `mod m` reduction differs (Lemire's u128 trick and `%` yield different indices, so every implementation must use the same one).
What "passing" means
- All 8 property tests green in all three languages.
- `cross_test.sh` exits 0 with 3 byte-identical filters (one per writer) and 9 passing writer × reader runs.
- Manual smoke: hexdump of a 4-key filter matches the structure described in docs/observation.md.
Broader Ideas — Beyond the Minimum
Block / partitioned filters
RocksDB partitions the filter so that one filter probe touches only a single cache line. Trade: marginally higher FPR for ~3× faster contains on cache-cold filters. See Optimizing Bloom Filter: Challenges, Solutions, and Comparisons (Luo et al., IEEE 2019).
Cuckoo filters
Replace bit array with a hash table of fingerprints. Supports remove, and needs fewer bits/key than Bloom once the target FPR drops below roughly 3% (at 1% FPR the two are close, around 9 to 10 bits/key, which is still above the ~6.64-bit lower bound). Slower to build, occasionally fails to insert when over-full. Excellent for membership tests with a known max size and a need for deletion.
Xor filters
Build-once, query-many. ~9% smaller than Cuckoo at the same FPR, faster lookup (always exactly 3 memory accesses). Bad fit if you insert incrementally; great for static datasets like compiled SSTable filters.
Ribbon filters (RocksDB 6.15+)
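A linear-algebra reformulation of xor filters. ~30% smaller than Bloom at the same FPR, with similar or slightly slower lookups and noticeably more CPU to build (the RocksDB wiki quotes roughly 3 to 4×). RocksDB offers them as an opt-in filter policy (`NewRibbonFilterPolicy`) for SSTable filters; they are not enabled by default.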
Prefix Bloom filters
When most queries are by prefix (e.g. userid:12345:*), build the filter from prefixes instead of full keys. Saves space and lets prefix-range queries use the filter.
Scaling without resizing
A scalable Bloom filter (Almeida et al., 2007) chains a sequence of progressively larger filters with progressively tighter FPRs. `add` writes to the youngest; `contains` ORs across all. The number of chained filters grows logarithmically with n; total memory still grows at least linearly with n.
Compressed Bloom filters
If you transmit a Bloom filter over a network, sparsity makes it gzip well. Mitzenmacher (2002) showed that optimizing for compressed size leads to a different k than optimizing for in-memory FPR.
Cardinality estimation: HyperLogLog vs Bloom
Bloom tells you "in set" with FPR; HLL gives you |set| ± ~2% with constant memory. Often used in the same systems for different questions.
Filters in distributed systems
- Bigtable / HBase: block-level filters per SSTable.
- Cassandra: row-level filter per SSTable, plus a key cache.
- Akamai / CDNs: "did this URL get cached?" Bloom-based pre-checks.
- Gmail: per-user spam fingerprint filters.
- Bitcoin SPV clients (BIP 37): filters published to full nodes to indicate which addresses the SPV wallet cares about. Famously broken from a privacy standpoint — the filter leaks the address set.
Adversarial considerations
Bloom-filter parameters and hashes are usually public. If users can choose keys, they can craft collisions that fill the filter and push FPR to 100%. Defenses:
- Use a keyed hash (SipHash) seeded at filter creation.
- Cap inserts or fall back to an exact structure beyond a threshold.
- Periodically rebuild.
Information-theoretic lower bound
Carter et al. (1978) prove that any approximate set with FPR ε requires at least n * log2(1/ε) bits; that's ~6.64 bits/key for 1% FPR. Bloom uses ~9.6 (44% overhead). Xor filters need about 1.23 × log2(1/ε) bits (~23% overhead). Ribbon filters can be configured to get within a few percent of the bound.
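The 44% figure for Bloom follows directly from the sizing formula used elsewhere in this lab; as a short worked check:

$$\frac{m}{n}\bigg|_{\text{Bloom}} = \frac{\ln(1/\varepsilon)}{(\ln 2)^2} = \frac{\log_2(1/\varepsilon)}{\ln 2} \approx 1.44\,\log_2(1/\varepsilon)$$

For $\varepsilon = 0.01$, $\log_2(100) \approx 6.64$, so an optimally tuned Bloom filter needs about $1.44 \times 6.64 \approx 9.59$ bits per key, independent of $n$.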
Step 01 — Hash chain (FNV-1a64 → SplitMix64 → double hashing)
Before any filter logic, get the hash chain identical across all three languages. If fnv1a64("foobar") doesn't return 0x85944171f73967e8 everywhere, nothing else will work.
1. FNV-1a64
Algorithm:
hash = 0xcbf29ce484222325
for byte in input:
hash ^= byte
hash = hash * 0x100000001b3 // wrapping 64-bit multiplication
return hash
Known test vectors:
| Input | Result |
|---|---|
| `""` (empty) | 0xcbf29ce484222325 (the initial value) |
| `"a"` | 0xaf63dc4c8601ec8c |
| `"foobar"` | 0x85944171f73967e8 |
Side-by-side:
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}
func FNV1a64(b []byte) uint64 {
var h uint64 = 0xcbf29ce484222325
for _, x := range b {
h ^= uint64(x)
h *= 0x100000001b3
}
return h
}
std::uint64_t Fnv1a64(const std::uint8_t* p, std::size_t n) {
std::uint64_t h = 0xcbf29ce484222325ULL;
for (std::size_t i = 0; i < n; ++i) {
h ^= p[i];
h *= 0x100000001b3ULL;
}
return h;
}
⚠️ Two traps: don't use FNV-1 (it multiplies before XOR-ing, whereas FNV-1a XORs first); don't use the 32-bit prime or basis (different constants).
2. SplitMix64 finalizer
FNV-1a64 has decent low bits but biased high bits. SplitMix64 (the mixer from Steele, Lea & Flood's SplitMix generator, widely known through Sebastiano Vigna's C reference) is a single-step bit mixer that produces near-perfect avalanche on a 64-bit input. We apply it to FNV's output so that both the upper and lower 32-bit halves are usable as independent hashes.
splitmix64(x):
x = x + 0x9e3779b97f4a7c15
x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
x = (x ^ (x >> 27)) * 0x94d049bb133111eb
return x ^ (x >> 31)
Known vectors:
| Input | Output |
|---|---|
| `0` | 0xe220a8397b1dcdaf |
| `0xdeadbeef` | 0x4adfb90f68c9eb9b |
| `splitmix64(fnv1a64("foobar"))` | use the test to lock this in |
The combined hash we actually use:
mix64(bytes) := splitmix64(fnv1a64(bytes))
h1 = mix64(bytes) & 0xffffffff // low 32 bits
h2 = mix64(bytes) >> 32 // high 32 bits
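As a Rust rendering of the pseudocode above, the following sketch implements the finalizer and the split into halves. `split_halves` is a hypothetical helper name introduced only for illustration; `fnv1a64` is repeated so the block stands alone.

```rust
/// SplitMix64 finalizer, following the pseudocode above with wrapping arithmetic.
pub fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// FNV-1a64, repeated here so the sketch is self-contained.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// The combined hash: mix64(bytes) = splitmix64(fnv1a64(bytes)).
pub fn mix64(bytes: &[u8]) -> u64 {
    splitmix64(fnv1a64(bytes))
}

/// Split a mixed hash into (h1, h2) = (low 32 bits, high 32 bits),
/// both widened to u64 for the double-hashing arithmetic.
pub fn split_halves(h: u64) -> (u64, u64) {
    (h & 0xffff_ffff, h >> 32)
}
```

Calling `mix64` once and splitting the result avoids hashing the key twice, which the pseudocode's two `mix64(bytes)` expressions might otherwise suggest.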
3. Synthesizing k indices (Kirsch–Mitzenmacher)
For a bit array of size m and k hashes:
for i in 0..k:
g = h1 + i * h2 // wrapping 64-bit add
idx = g mod m // reduce to [0, m)
use bits[idx]
The reduction g mod m is hot. Naive % works but a modulo on a 64-bit integer is ~20 cycles. RocksDB and others use Lemire's fast reduction:
fastmod(g, m) = ((g as u128) * (m as u128)) >> 64
(Equivalent to floor(g * m / 2^64), a near-uniform map of [0, 2^64) to [0, m).) Either approach is fine for correctness — but pick one and use it in all three languages, otherwise the bit positions diverge and cross-language tests break.
This lab uses plain % because it's identical across languages with no language-specific u128 syntax to worry about. Performance difference is irrelevant at filter-construction scale.
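To make the difference between the two reductions concrete, here is a small sketch of both. The function names are hypothetical. One caveat worth stating as an assumption of this lab's design: because `h1` and `h2` are 32-bit halves, `g = h1 + i*h2` stays below 2^37 for `k` ≤ 30, and the multiply-shift reduction would then map almost every probe to index 0 unless `g` were remixed to span the full 64-bit range. Plain `%` has no such requirement.

```rust
/// Lemire's multiply-shift reduction: floor(g * m / 2^64).
/// Only near-uniform when `g` is spread over the whole u64 range.
pub fn fast_range(g: u64, m: u64) -> u64 {
    (((g as u128) * (m as u128)) >> 64) as u64
}

/// Plain modulo reduction, the one this lab uses.
pub fn mod_range(g: u64, m: u64) -> u64 {
    g % m
}
```

The two functions map the same `g` to different indices in general, which is why mixing them across implementations breaks byte-identical filters.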
Test gate
Before moving on, all three bloombench hash foobar invocations must print:
fnv1a64=85944171f73967e8
splitmix=... (same value across languages)
h1=... h2=... (same values across languages)
If those don't match, the rest of the lab cannot succeed.
Step 02 — Bit array, add, contains
The bit array
Backed by a Vec<u8> / []byte / std::vector<uint8_t> of length ⌈m / 8⌉. Indexing is little-endian within each byte:
bit i → byte index = i / 8
bit within = i % 8 // bit 0 is the LSB
test: (bytes[i/8] >> (i%8)) & 1
set: bytes[i/8] |= 1 << (i%8)
Side-by-side:
fn set_bit(bits: &mut [u8], i: u64) {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    bits[idx] |= 1u8 << off;
}

fn get_bit(bits: &[u8], i: u64) -> bool {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    (bits[idx] >> off) & 1 == 1
}
func setBit(bits []byte, i uint64) {
bits[i/8] |= 1 << (i % 8)
}
func getBit(bits []byte, i uint64) bool {
return bits[i/8]&(1<<(i%8)) != 0
}
inline void SetBit(std::uint8_t* b, std::uint64_t i) {
b[i / 8] |= std::uint8_t{1} << (i % 8);
}
inline bool GetBit(const std::uint8_t* b, std::uint64_t i) {
return (b[i / 8] >> (i % 8)) & 1u;
}
⚠️ Pick one bit-order now. LSB-first (above) matches LevelDB and is the natural choice in C. MSB-first matches some networking specs (TCP option encoding). Whichever you pick, all three implementations must agree.
add(key)
add(key):
h = mix64(key)
h1 = h & 0xffffffff
h2 = h >> 32
for i in 0..k:
idx = (h1 + i * h2) mod m
set_bit(idx)
Notes:
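- All arithmetic is wrapping 64-bit (`u64` / `uint64` / `std::uint64_t`).
- `h1` and `h2` are 32-bit values, so with `k` ≤ 30 the sum `h1 + i * h2` stays below 2^37 and cannot overflow a u64. Wrapping only matters if you later widen `h1`/`h2` to full 64-bit values; in that case `mod m` still produces a valid index, and languages with overflow checks (Rust debug mode) need `wrapping_mul` / `wrapping_add`.
- We compute `h` once per key, then derive `k` indices. That's the entire Kirsch–Mitzenmacher win.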
contains(key)
contains(key):
h = mix64(key)
h1 = h & 0xffffffff
h2 = h >> 32
for i in 0..k:
idx = (h1 + i * h2) mod m
if not get_bit(idx):
return false
return true
Early-exit on the first zero bit. With FPR 1% and a random absent key, you typically exit after 1 or 2 probes.
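Putting `add` and `contains` together, a self-contained Rust sketch might look like this. It assumes the hash chain from Step 01 and plain `%` reduction; `probe_indices` is a hypothetical helper, and the struct layout is illustrative, not the lab's reference implementation.

```rust
/// Minimal Bloom filter sketch: one 64-bit hash, k probes via double hashing.
pub struct Bloom {
    k: u32,
    m: u64,
    bits: Vec<u8>,
}

fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// Yields the k bit indices for `key` (hypothetical helper).
fn probe_indices(key: &[u8], k: u32, m: u64) -> impl Iterator<Item = u64> {
    let h = splitmix64(fnv1a64(key));
    let (h1, h2) = (h & 0xffff_ffff, h >> 32);
    (0..k as u64).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
}

impl Bloom {
    pub fn new(m: u64, k: u32) -> Self {
        assert!(m > 0 && k > 0, "m and k must be non-zero");
        Bloom { k, m, bits: vec![0u8; ((m + 7) / 8) as usize] }
    }

    pub fn add(&mut self, key: &[u8]) {
        for i in probe_indices(key, self.k, self.m) {
            // LSB-first: bit i lives in byte i/8 at position i%8.
            self.bits[(i / 8) as usize] |= 1u8 << (i % 8);
        }
    }

    pub fn contains(&self, key: &[u8]) -> bool {
        // `all` short-circuits, giving the early exit on the first zero bit.
        probe_indices(key, self.k, self.m)
            .all(|i| (self.bits[(i / 8) as usize] >> (i % 8)) & 1 == 1)
    }
}
```

Sharing one index generator between `add` and `contains` removes the most common source of false negatives: the two paths disagreeing about which bits belong to a key.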
Three-language add skeleton
pub fn add(&mut self, key: &[u8]) {
    let h = mix64(key);
    let h1 = h as u32 as u64;
    let h2 = h >> 32;
    for i in 0..self.k as u64 {
        let idx = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
        set_bit(&mut self.bits, idx);
    }
}
func (b *Bloom) Add(key []byte) {
h := Mix64(key)
h1 := h & 0xffffffff
h2 := h >> 32
for i := uint64(0); i < uint64(b.k); i++ {
idx := (h1 + i*h2) % b.m
setBit(b.bits, idx)
}
}
void Bloom::Add(const std::uint8_t* k, std::size_t n) {
std::uint64_t h = Mix64(k, n);
std::uint64_t h1 = h & 0xffffffffULL;
std::uint64_t h2 = h >> 32;
for (std::uint64_t i = 0; i < k_; ++i) {
std::uint64_t idx = (h1 + i * h2) % m_;
SetBit(bits_.data(), idx);
}
}
Test gate
- Insert keys `"k0".."k999"` (UTF-8). `contains("kN")` must be `true` for every N. Any false negative is a critical bug.
- Filter must be byte-identical across the three languages (`md5sum`).
What broken looks like
- `contains` is sometimes false for present keys → `set` and `get` disagree on bit-within-byte. Suspect LSB vs MSB.
- Cross-language filters differ → `mod m` reduction differs (one uses `&` instead of `%` and `m` isn't a power of two), or `h1`/`h2` halves are swapped.
- `contains` is always true → the bit array is saturated or was initialized to all ones, or `k` was constructed as 0 so the probe loop never runs. If `m` was constructed as 0 instead, `h % 0` panics in Rust and Go and is undefined behaviour in C++.
Step 03 — Sizing, encode/decode, FPR measurement
Picking m and k from (n, fpr)
Given target false-positive rate p and expected key count n:
$$m = \left\lceil \frac{-n \cdot \ln p}{(\ln 2)^2} \right\rceil$$
$$k = \min\left(30,\; \max\left(1,\; \text{round}\left(\frac{m}{n} \cdot \ln 2\right)\right)\right)$$
Reference numbers (compute once, hard-code in tests):
| n | p | m (bits) | bits/key | k |
|---|---|---|---|---|
| 1 000 | 0.10 | ~4 793 | 4.79 | 3 |
| 1 000 | 0.01 | ~9 586 | 9.59 | 7 |
| 1 000 | 0.001 | ~14 378 | 14.38 | 10 |
| 10 000 | 0.01 | ~95 851 | 9.59 | 7 |
Implementation:
with_fpr(n, p):
ln2 = ln(2)
m_real = -(n as f64) * ln(p) / (ln2 * ln2)
m = ceil(m_real)
k_real = (m as f64 / n as f64) * ln2
k = round(k_real) clamped to [1, 30]
return BloomFilter::new(m, k)
The clamp on k prevents pathological cases. `with_fpr(1, 1e-100)` would otherwise request k ≈ 332, that is, 332 probes per `add` and `contains` for an FPR no workload needs.
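The sizing pseudocode translates almost line for line into Rust. This is a sketch under the assumption that returning `(m, k)` as a tuple is acceptable; `size_for` is a hypothetical name, and the lab's `with_fpr` would pass these values on to the constructor.

```rust
/// Compute (m, k) for an expected key count `n` and target FPR `p`.
/// m = ceil(-n * ln(p) / (ln 2)^2), k = round((m/n) * ln 2) clamped to [1, 30].
pub fn size_for(n: u64, p: f64) -> (u64, u32) {
    assert!(n > 0, "n must be positive");
    assert!(p > 0.0 && p < 1.0, "p must be in (0, 1)");
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64;
    // round(), not truncation: truncating is the classic source of k = 0.
    let k = ((m as f64 / n as f64) * ln2).round().clamp(1.0, 30.0) as u32;
    (m, k)
}
```

Since `m` feeds into every bit index, all three languages must arrive at the same integer here; the reference table above is a convenient set of values to pin in tests.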
Encode
[ k: u32 LE ][ m: u64 LE ][ bits: ⌈m/8⌉ bytes ]
pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + self.bits.len());
    out.extend_from_slice(&self.k.to_le_bytes());
    out.extend_from_slice(&self.m.to_le_bytes());
    out.extend_from_slice(&self.bits);
    out
}
func (b *Bloom) Encode() []byte {
out := make([]byte, 12+len(b.bits))
binary.LittleEndian.PutUint32(out[0:4], b.k)
binary.LittleEndian.PutUint64(out[4:12], b.m)
copy(out[12:], b.bits)
return out
}
std::vector<std::uint8_t> Bloom::Encode() const {
std::vector<std::uint8_t> out(12 + bits_.size());
EncodeU32LE(out.data() + 0, k_);
EncodeU64LE(out.data() + 4, m_);
std::memcpy(out.data() + 12, bits_.data(), bits_.size());
return out;
}
Decode
decode(bytes):
if len(bytes) < 12: error
k = read u32 LE @ 0
m = read u64 LE @ 4
body = bytes[12:]
if len(body) != ⌈m/8⌉: error
return BloomFilter { k, m, bits: body }
Validate sizes. If k == 0 or m == 0, reject — those are nonsense.
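A Rust sketch of the decode path, including the validation just described, could look like the following. The struct fields and the string error type are assumptions made for brevity; a real implementation would likely use a dedicated error enum.

```rust
use std::convert::TryInto;

/// Decoded filter; field names mirror the header layout (illustrative only).
#[derive(Debug, PartialEq)]
pub struct Bloom {
    pub k: u32,
    pub m: u64,
    pub bits: Vec<u8>,
}

/// Parse `[k: u32 LE][m: u64 LE][bits: ceil(m/8) bytes]`.
pub fn decode(bytes: &[u8]) -> Result<Bloom, &'static str> {
    if bytes.len() < 12 {
        return Err("truncated header");
    }
    let k = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    let m = u64::from_le_bytes(bytes[4..12].try_into().unwrap());
    if k == 0 || m == 0 {
        return Err("k and m must be non-zero");
    }
    let body = &bytes[12..];
    // ceil(m / 8) computed without risking overflow in m + 7.
    let expected = m / 8 + u64::from(m % 8 != 0);
    if body.len() as u64 != expected {
        return Err("body length does not match m");
    }
    Ok(Bloom { k, m, bits: body.to_vec() })
}
```

Checking the body length against `m` before copying is what turns a truncated or padded file into a clean error instead of an out-of-bounds probe later.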
Measuring FPR
fpr-test(filter, n, m_queries):
seed reader rng to disjoint stream
hits = 0
for q in 0..m_queries:
key = generate distinctly-not-inserted key
if filter.contains(key):
hits += 1
observed = hits / m_queries
theoretical = (1 - exp(-k * n / m))^k
print observed, theoretical
Generating known-absent keys: either stay in the same family as the inserted keys but use indices ≥ n ("key{n}", "key{n+1}", ...), or switch to a different prefix. If insert used "key0", "key1", ..., "key{n-1}", querying with "q0", "q1", ... guarantees no accidental overlap.
A hundred thousand absent queries gives roughly ±10% noise (three standard deviations) on a 1% FPR estimate; that's the sample size used in the test `fpr_within_2x`. A million queries tightens this to about ±3%.
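The theoretical side of the comparison is a one-line formula. As a sketch (function names are hypothetical), it can be computed and checked against an observed hit count like this:

```rust
/// Theoretical FPR for k probes, n inserted keys, and m bits:
/// (1 - exp(-k * n / m))^k.
pub fn theoretical_fpr(k: u32, n: u64, m: u64) -> f64 {
    let fill = 1.0 - (-(k as f64) * (n as f64) / (m as f64)).exp();
    fill.powi(k as i32)
}

/// Pass criterion used by `fpr_within_2x`: observed FPR at most 2x theoretical.
pub fn within_2x(hits: u64, queries: u64, theoretical: f64) -> bool {
    (hits as f64 / queries as f64) <= 2.0 * theoretical
}
```

For `k = 7`, `n = 1000`, `m = 9586` the formula evaluates to roughly 0.0100, slightly above the 1% target because `k` is rounded to an integer.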
Test gate
- `with_fpr(1000, 0.01)` returns `k=7` and `m` ∈ [9 581, 9 591].
- `encode` then `decode` gives an identical filter.
- `fpr-test` with `n=10 000`, `m_queries=100 000` reports observed FPR at most 2× theoretical (in practice it lands much closer).
- The encoded filter is byte-identical across Rust / Go / C++.
What broken looks like
- `k=0` from `with_fpr` → integer truncation; you used `int(k_real)` instead of `round`.
- Decode fails on a perfectly valid file → endian mismatch or header offset wrong.
- Observed FPR is exactly 1.0 → bit array got written but indices land outside its range (modulo bug).
- Observed FPR is exactly 0.0 → `contains` always returns false; bit array isn't being touched on `add` (you forgot to mutate `self`).
LSM MemTable
Lab: db-05 — the in-memory write buffer of an LSM-tree.
1. What Is It
A MemTable is the in-memory, sorted write buffer at the top of every Log-Structured
Merge tree (LSM). All writes — put, delete, range updates — land in the MemTable
first, indexed by key, and only later get flushed to immutable on-disk SSTables
(see db-06). It is paired with a Write-Ahead Log (db-03) for durability: WAL gives
crash recovery; the MemTable gives fast point and range lookups.
This lab implements a deterministic, byte-identical MemTable across Rust, Go, and C++ that can be serialized to disk and read back in any of the three languages.
2. Why It Matters
- Write throughput. Writes touch only RAM (plus a single sequential WAL append). Random `put`s become sequential disk traffic.
- Read recency. The MemTable is the freshest copy of any key; a get must consult it first before falling through to L0..Ln SSTables.
- Flush boundary. Once the MemTable hits its size cap (`write_buffer_size` in LevelDB/RocksDB), it freezes, a new MemTable rotates in, and the frozen one is written sequentially to an SSTable on background threads.
- Tombstones. Deletes are inserts of tombstone records; the MemTable must preserve them through flush so older SSTables can be shadowed.
3. How It Works
writes reads
│ │
▼ ▼
┌──────── MemTable (active) ─────────┐ point/range
│ sorted map: key → (type, value) │◄──────────┐
│ tombstones live alongside values │ │
└──────────────────┬─────────────────┘ │
size > cap? │ │
▼ │
┌── Immutable MemTable (frozen) ─┐ │
│ flushing in the background │◄───────────┤
└──────────────────┬──────────────┘ │
▼ │
SSTable on disk ─────────────────►┘
(db-06 format)
Internally the MemTable is a sorted associative container with byte-lexicographic key ordering:
- Rust: `BTreeMap<Vec<u8>, Entry>` (`Vec<u8>`'s `Ord` is lex over bytes).
- Go: `map[string]Entry` + key slice sorted on dump/iteration.
- C++: `std::map<std::vector<uint8_t>, Entry>` (`operator<` on vectors is lex).
Production LSMs (LevelDB, RocksDB) use a skip list because it offers concurrent lock-free reads and allocations from an arena. For this lab the simpler tree is fine — what matters is the order-determinism and the on-disk byte layout.
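For the Rust variant, the sorted-container approach can be sketched as below. The `Entry` enum and method names are assumptions for illustration; the lab's actual types and serialization are defined in its steps.

```rust
use std::collections::BTreeMap;

/// A stored record: either a live value or a tombstone.
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Put(Vec<u8>),
    Delete,
}

/// Single-threaded MemTable backed by a BTreeMap (byte-lexicographic order).
#[derive(Default)]
pub struct MemTable {
    map: BTreeMap<Vec<u8>, Entry>,
}

impl MemTable {
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Put(value.to_vec()));
    }

    /// A delete is an insert of a tombstone, never a removal from the map.
    pub fn delete(&mut self, key: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Delete);
    }

    /// `None` means "not known here, keep searching older data";
    /// `Some(Entry::Delete)` means "deleted, stop searching".
    pub fn get(&self, key: &[u8]) -> Option<&Entry> {
        self.map.get(key)
    }

    /// Ordered iteration, which is what makes the flush a sequential write.
    pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)> {
        self.map.iter().map(|(k, v)| (k.as_slice(), v))
    }
}
```

Keeping the tombstone in the map, instead of removing the key, is what lets the flushed SSTable shadow older values on disk.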
4. Core Terminology
| Term | Definition |
|---|---|
| MemTable | Sorted in-memory map of keys to values/tombstones; the LSM write buffer. |
| Immutable MemTable | A frozen MemTable, no longer accepting writes, awaiting flush. |
| Tombstone | A delete marker stored as an entry of type Delete; needed because older SSTables may still hold the key. |
| Skip list | Randomized layered linked-list giving expected O(log n) insert/lookup; LevelDB/RocksDB's choice. |
| Flush | Writing a frozen MemTable out as an SSTable. |
| Sequence number | Monotonically increasing version tag attached to each write so readers can pick the right snapshot. |
| Arena | Bump allocator that backs MemTable nodes; freed in one go when the table is dropped. |
5. Mental Models
- Three-layer journal. WAL = durability log. MemTable = sorted index over the WAL's recent tail. SSTable = compacted, immutable snapshot. The MemTable is the short-term, queryable face of the WAL.
- Latest write wins. For a single point lookup the MemTable always shadows any on-disk data; a tombstone in the MemTable hides every prior value of that key.
- Flush is amplification's first knob. Larger MemTables → fewer, bigger L0 SSTables → less compaction work but more recovery time and RAM. Production tunes this between 16 MiB and 256 MiB.
- Why sorted? Because the flush writes the SSTable in a single sequential pass — no on-disk sort needed if the MemTable is already ordered.
6. Common Misconceptions
- "The MemTable is the WAL." No. The WAL is unsorted, append-only, and may contain redundant updates for the same key. The MemTable is sorted and deduplicated.
- "Tombstones can be GC'd in the MemTable." No — they must be flushed; only after compaction confirms no older SSTable holds the key can a tombstone be dropped.
- "You can skip the WAL if writes are batched." The MemTable lives in RAM. Without the WAL a crash loses every unflushed write.
- "Skip list is the only valid structure." A B-tree, ART, or sorted vector with occasional rebuild are all viable; skip list wins for the specific concurrency pattern of one writer + many readers.
7. Interview Talking Points
- Explain why an LSM uses a MemTable + WAL instead of writing directly to a sorted on-disk file (random I/O kills throughput).
- Walk through the lifecycle: `put` → WAL append → MemTable insert → eventually frozen → flushed → compacted.
- Describe how a get traverses MemTable → immutable MemTable → L0 SSTables → Ln, stopping at the first match (value or tombstone).
- Cost of tombstones: read amplification grows because we cannot skip a level just because we found nothing; we might still find a tombstone later.
- Why a MemTable's flush is a sorted sequential write, and why this is the primary trick that makes LSMs faster than B-trees for write-heavy workloads.
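The read path from the talking points (newest layer first, stop at the first value or tombstone) can be sketched as follows. The `Layer` trait and the use of `BTreeMap` for every layer are simplifying assumptions; in the real system the older layers would be SSTable readers.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Put(Vec<u8>),
    Delete,
}

/// Hypothetical interface for anything that can answer a point lookup:
/// active MemTable, immutable MemTable, or an SSTable reader.
pub trait Layer {
    fn lookup(&self, key: &[u8]) -> Option<Entry>;
}

impl Layer for BTreeMap<Vec<u8>, Entry> {
    fn lookup(&self, key: &[u8]) -> Option<Entry> {
        self.get(key).cloned()
    }
}

/// Search layers from newest to oldest and stop at the first match.
pub fn get(layers: &[&dyn Layer], key: &[u8]) -> Option<Vec<u8>> {
    for layer in layers {
        match layer.lookup(key) {
            Some(Entry::Put(v)) => return Some(v),
            // A tombstone shadows every older layer: report "not found".
            Some(Entry::Delete) => return None,
            None => continue,
        }
    }
    None
}
```

The order of `layers` carries the correctness argument: as long as newer data always comes first, the first hit is the latest write.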
8. Connections to Other Labs
- db-03 (WAL): every MemTable write is preceded by a WAL append; the WAL is replayed into a fresh MemTable on startup.
- db-04 (Bloom filters): SSTables produced by MemTable flush carry Bloom filters for negative lookups.
- db-06 (SSTable format): the flush target; this lab's `flush_to` is the producer side of db-06's `open`.
- db-07 (compaction): consumes SSTables that came from MemTable flushes.
- db-09 (LevelDB complete): stitches all of the above into a working KV store.
References — db-05 LSM MemTable
Primary sources
- O'Neil, Cheng, Gawlick, O'Neil (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33(4). The original paper. https://www.cs.umb.edu/~poneil/lsmtree.pdf
- Sanjay Ghemawat & Jeff Dean. LevelDB. The reference open-source LSM. See `db/memtable.{h,cc}` and `db/skiplist.h`. https://github.com/google/leveldb
- Facebook RocksDB Wiki, MemTable. https://github.com/facebook/rocksdb/wiki/MemTable Covers SkipList, HashSkipList, HashLinkList, and Vector MemTable factories.
Skip lists
- William Pugh (1990). Skip Lists: A Probabilistic Alternative to Balanced Trees. CACM 33(6): 668–676. https://homepage.cs.uiowa.edu/~ghosh/skip.pdf
- LevelDB's skiplist implementation: concurrent single-writer/many-reader, lock-free reads via memory-order acquire/release. https://github.com/google/leveldb/blob/main/db/skiplist.h
Alternative data structures
- Bw-tree (Microsoft, 2013): lock-free B+ tree variant used in Hekaton.
- Adaptive Radix Tree (ART, 2013): compact, cache-friendly trie used by HyPer and DuckDB. https://db.in.tum.de/~leis/papers/ART.pdf
- Masstree (Mao, Kohler, Morris, 2012): trie-of-B+trees, very fast for variable-length keys.
Tombstones and snapshot reads
- RocksDB DeleteRange. Tombstones over key ranges, important for prefix deletes. https://github.com/facebook/rocksdb/wiki/DeleteRange
- LevelDB sequence numbers. Each MemTable entry is internally tagged with a 56-bit sequence and 8-bit type byte; this lab simplifies to just the type byte. See db/dbformat.h (kValueTypeForSeek).
Real-world tunings
- Cassandra: uses Memtable with off-heap allocators; flushed to SSTables on size, time, or commit-log pressure.
- HBase: MemStore per column family; configurable via hbase.hregion.memstore.flush.size.
- InfluxDB IOx & TimescaleDB: apply LSM ideas to time-series, with time-bucketed MemTables.
Further reading
- Mark Callaghan's blog (RocksDB engineering): https://smalldatum.blogspot.com/
- Chen Luo & Michael Carey (2020). LSM-based Storage Techniques: A Survey. VLDB J. https://arxiv.org/abs/1812.07527
Analysis — db-05 LSM MemTable
Problem
Implement the in-memory write buffer of an LSM-tree such that
- it supports put, delete (tombstone insertion), get, and ordered iteration;
- it can be serialized to disk in a deterministic byte layout shared by Rust, Go, and C++;
- the same dump can be reloaded in any of the three languages;
- iteration order is byte-lexicographic on keys.
Constraints
- Keys are arbitrary byte sequences up to 2^32 − 1 bytes (u32 length prefix).
- Values are arbitrary byte sequences up to 2^32 − 1 bytes; for tombstones the value length is 0.
- Cross-language interop: the dump format must be identical byte-for-byte and the cross-test script asserts SHA-256 equality of the three dumps.
- No allocator tricks: simplicity over LevelDB-style arena/skiplist; we use the standard sorted map in each language.
- No concurrency in this lab: single-threaded API. Concurrency arrives in db-09.
Design decisions
Why a sorted associative container, not a skip list?
Production LSMs (LevelDB, RocksDB) use skip lists because they support concurrent lock-free reads and arena allocation. For this teaching lab those benefits are irrelevant — what matters is determinism, byte-identical serialization, and the fact that iteration is in key order so the flush is a sequential write. Any sorted container satisfies that:
| Language | Container | Why |
|---|---|---|
| Rust | BTreeMap<Vec<u8>, Entry> | Vec<u8>: Ord is byte-lex; balanced. |
| Go | map[string]Entry + sort.Strings(keys) | Avoid third-party sorted maps. |
| C++ | std::map<std::vector<uint8_t>, Entry> | RB-tree; vector<uint8_t>::operator< is lex. |
All three give the same iteration order for identical input, which is what cross-test checks.
Tombstones as entries
A delete(k) replaces whatever entry k had with Entry::Tombstone. Crucially the
key is not erased — the tombstone must propagate to the SSTable to shadow older
on-disk versions of the key.
On-disk dump layout
offset size field
------ ---- --------------------------------
0 4 magic ASCII "MMT1"
4 4 count: u32 LE (entry count)
8 ? entries, sorted by key ascending:
[ klen: u32 LE ]
[ vlen: u32 LE ] (0 if tombstone)
[ type: u8 ] (0 = Value, 1 = Tombstone)
[ key bytes ]
[ value bytes ]
All multi-byte integers little-endian; the file is self-delimiting via count and
each entry's two length prefixes. No checksum at this layer — the WAL (db-03) and the
SSTable (db-06) carry their own.
Size accounting
size_bytes() returns the on-disk dump size assuming the current contents flush
immediately: 8 bytes header + per entry (9 + klen + vlen). This is what an LSM
would compare against write_buffer_size.
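As a quick check of the accounting rule, here is a minimal sketch that computes the dump size straight from a sorted map. The free function `size_bytes` and the two constants are illustrative names; the lab's real method lives on `MemTable`.

```rust
use std::collections::BTreeMap;

enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

const HEADER_BYTES: usize = 8; // magic (4) + count (4)
const ENTRY_OVERHEAD: usize = 9; // klen (4) + vlen (4) + type (1)

/// Dump size if the current contents were flushed right now.
fn size_bytes(map: &BTreeMap<Vec<u8>, Entry>) -> usize {
    let body: usize = map
        .iter()
        .map(|(k, e)| {
            let vlen = match e {
                Entry::Value(v) => v.len(),
                Entry::Tombstone => 0, // tombstones carry no value bytes
            };
            ENTRY_OVERHEAD + k.len() + vlen
        })
        .sum();
    HEADER_BYTES + body
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert(b"alpha".to_vec(), Entry::Value(b"first".to_vec()));
    m.insert(b"beta".to_vec(), Entry::Tombstone);
    // 8 + (9 + 5 + 5) + (9 + 4 + 0) = 40
    println!("size_bytes = {}", size_bytes(&m));
}
```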
Error model
The decoder validates:
- magic == MMT1,
- enough bytes remain for each header field and the declared key/value spans,
- type is 0 or 1,
- tombstones have vlen == 0,
- no trailing garbage,
- keys appear in strictly ascending order.
A failure returns an explicit error (Error::* in Rust, error in Go,
std::invalid_argument / std::runtime_error in C++) rather than panicking. The
encoder cannot fail (no I/O at that layer).
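One possible shape for the Rust error type is sketched below. The variant names follow the ones used by the decoder later in this lab; the message strings (other than "bad magic", which appears in the CLI examples) and the `check_header` helper are assumptions made for illustration.

```rust
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    Short,
    BadMagic,
    BadType,
    BadTombstone,
    Unsorted,
    Trailing,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message wording is illustrative, except "bad magic".
        let msg = match self {
            Error::Short => "truncated input",
            Error::BadMagic => "bad magic",
            Error::BadType => "unknown entry type",
            Error::BadTombstone => "tombstone with non-zero vlen",
            Error::Unsorted => "keys not strictly ascending",
            Error::Trailing => "trailing bytes after last entry",
        };
        f.write_str(msg)
    }
}

impl std::error::Error for Error {}

/// Hypothetical helper: validate only the 8-byte header, return the entry count.
fn check_header(bytes: &[u8]) -> Result<u32, Error> {
    if bytes.len() < 8 {
        return Err(Error::Short);
    }
    if &bytes[..4] != b"MMT1" {
        return Err(Error::BadMagic);
    }
    Ok(u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}

fn main() {
    let all = [
        Error::Short, Error::BadMagic, Error::BadType,
        Error::BadTombstone, Error::Unsorted, Error::Trailing,
    ];
    for e in &all {
        println!("error: {e}");
    }
    println!("{:?}", check_header(b"MMT1\x02\x00\x00\x00"));
}
```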
Trade-offs
- No sequence numbers. Real LSMs append a 64-bit (seqno << 8) | type tag to every internal key so MVCC snapshots can pick the right version. We collapse to "latest write wins" because db-13 reintroduces MVCC.
- No range tombstones. Each delete shadows exactly one key. RocksDB-style range deletes are db-09 work.
- No prefix bloom or compressed entries. The MemTable is in RAM; flushing to a proper SSTable (db-06) is where compression and block boundaries appear.
- Allocation policy: Vec / vector / []byte per entry, not an arena. Allocator pressure becomes interesting only at multi-million-key scales, which we exercise in db-22 benchmarking.
Alternatives considered
- Skip list with arena (LevelDB style). Better concurrency, cache locality, and drop-the-whole-arena freeing — but the data structure complexity (random levels, acquire/release pointer ops) would dwarf the lab's pedagogical point.
- Sorted vector with binary search. Lowest memory overhead, but every put is O(n) due to mid-vector insertion. Fine for tiny tables (<1 K entries), terrible beyond that.
- HashMap with periodic sort. Fast inserts, but iteration is no longer cheap; every flush triggers a sort. Acceptable if flush is rare, painful otherwise.
- B-epsilon tree. Batches writes inside internal nodes, blurring the line with LSM. Out of scope.
Execution — db-05 LSM MemTable
Library API (Rust shape; mirrored in Go and C++)
pub enum Entry { Value(Vec<u8>), Tombstone }

pub struct MemTable { /* sorted map */ }

impl MemTable {
    pub fn new() -> Self;
    pub fn len(&self) -> usize;
    pub fn size_bytes(&self) -> usize;
    pub fn put(&mut self, key: &[u8], value: &[u8]);
    pub fn delete(&mut self, key: &[u8]);
    pub fn get(&self, key: &[u8]) -> Option<&Entry>;
    pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)>;
    pub fn encode(&self) -> Vec<u8>;
    pub fn decode(bytes: &[u8]) -> Result<Self, Error>;
}
Go: func New() *MemTable, func (*MemTable) Put / Delete / Get / Iter / Encode,
func Decode([]byte) (*MemTable, error). Iter yields a slice of (key, entry) pairs
in sorted order.
C++: class MemTable with the same names; Iter() returns a const reference to the
underlying std::map.
CLI: memtable
memtable <subcommand> [args]
Subcommands:
new PATH create an empty MemTable at PATH
put PATH KEY VALUE open PATH, put, save
del PATH KEY open PATH, delete (writes tombstone), save
get PATH KEY print 'value: <hex>' | 'tombstone' | 'absent'
iter PATH print one line per entry: TYPE KEY VALUE (hex)
bulk PATH N open or create PATH, insert key0..key{N-1}
with values val0..val{N-1}, save
size PATH print 'entries=N size_bytes=B'
Iter output format (deterministic, used by cross-test):
V <hex-key> <hex-value>
T <hex-key>
Hex is lowercase, no 0x prefix, no separators.
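A minimal sketch of a formatter that produces these lines, assuming the `Entry` shape from the library API; `to_hex` and `iter_line` are hypothetical helper names, not part of the documented API.

```rust
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Lowercase hex, no prefix, no separators.
fn to_hex(bytes: &[u8]) -> String {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    let mut s = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        s.push(DIGITS[(b >> 4) as usize] as char);
        s.push(DIGITS[(b & 0x0f) as usize] as char);
    }
    s
}

/// One line of `iter` output for a single entry.
fn iter_line(key: &[u8], entry: &Entry) -> String {
    match entry {
        Entry::Value(v) => format!("V {} {}", to_hex(key), to_hex(v)),
        Entry::Tombstone => format!("T {}", to_hex(key)),
    }
}

fn main() {
    println!("{}", iter_line(b"alpha", &Entry::Value(b"first-updated".to_vec())));
    println!("{}", iter_line(b"beta", &Entry::Tombstone));
}
```

With this formatter an empty key or empty value renders as an empty hex field, so all three implementations need to agree on that spacing for the cross-test to pass.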
Build & test
Per language:
# Rust
cd src/rust && cargo test --release && cargo build --release
# Go
cd src/go && go test ./... && go build -o bin/memtable ./cmd/memtable
# C++
cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build && ( cd build && ctest --output-on-failure )
Or run all at once:
bash scripts/verify.sh
Cross-language interop test
scripts/cross_test.sh:
- Build all three binaries.
- Drive each one through the same sequence of bulk 100, a handful of puts with overwrites, and a handful of dels.
- SHA-256 each dump; assert all three match.
- For each writer/reader pair, run iter and check the output is byte-identical.
If any pair differs, the script prints the failing combination and exits non-zero.
Manual playground
$ memtable new /tmp/m.bin
$ memtable put /tmp/m.bin alpha "first"
$ memtable put /tmp/m.bin beta "second"
$ memtable put /tmp/m.bin alpha "first-updated" # overwrite
$ memtable del /tmp/m.bin beta # tombstone
$ memtable iter /tmp/m.bin
V 616c706861 66697273742d75706461746564
T 62657461
$ memtable get /tmp/m.bin alpha
value: 66697273742d75706461746564
$ memtable get /tmp/m.bin beta
tombstone
$ memtable get /tmp/m.bin gamma
absent
$ memtable size /tmp/m.bin
entries=2 size_bytes=48
48 = 8 (header) + (9+5+13) + (9+4+0) = 8 + 27 + 13: two entries, key "alpha" with value "first-updated" and a tombstone for key "beta".
What broken looks like
| Symptom | Likely cause |
|---|---|
| Cross-test SHA mismatch on first byte set | Magic disagreement (must be ASCII MMT1). |
| Cross-test SHA mismatch mid-file | Endianness or type byte placement differs. |
| iter order differs across langs | Go's map iteration order; missed the sort.Strings. |
| get returns absent after del | Tombstone not stored, only erased. |
| Decoder accepts trailing garbage | Forgot the "consumed all bytes" check. |
Observation — db-05 LSM MemTable
Hex layout of a tiny dump
Three operations: put alpha first, put beta second, del beta.
hexdump -C m.bin
00000000 4d 4d 54 31 02 00 00 00 05 00 00 00 05 00 00 00 |MMT1............|
00000010 00 61 6c 70 68 61 66 69 72 73 74 04 00 00 00 00 |.alphafirst.....|
00000020 00 00 00 01 62 65 74 61 |....beta|
Annotated:
| Offset | Bytes | Field |
|---|---|---|
| 0 | 4d 4d 54 31 | magic ASCII MMT1 |
| 4 | 02 00 00 00 | count = 2 |
| 8 | 05 00 00 00 | klen = 5 (alpha) |
| 12 | 05 00 00 00 | vlen = 5 (first) |
| 16 | 00 | type = Value |
| 17 | 61 6c 70 68 61 | key bytes alpha |
| 22 | 66 69 72 73 74 | value bytes first |
| 27 | 04 00 00 00 | klen = 4 (beta) |
| 31 | 00 00 00 00 | vlen = 0 (tombstone) |
| 35 | 01 | type = Tombstone |
| 36 | 62 65 74 61 | key bytes beta |
Total: 40 bytes; matches size_bytes() = 8 + (9+5+5) + (9+4+0) = 40.
Cross-language byte equality
scripts/cross_test.sh produces three files rust.bin, go.bin, cpp.bin. With
the verify script in this lab:
$ shasum -a 256 *.bin
b67… rust.bin
b67… go.bin
b67… cpp.bin
If any one of the three differs we either have endianness disagreement, key ordering disagreement, or someone wrote the type byte in a different position.
Memory layout intuition
key entry
"abc" ──► Entry::Value("..." 10 bytes)
"abd" ──► Entry::Tombstone
"abz" ──► Entry::Value("" 0 bytes) # empty value is legal and ≠ tombstone
"zz" ──► Entry::Value("..." 4096 bytes)
Notes:
- The key length is not stored alongside the in-memory entry, only at encode time.
- An empty value ("") is a valid value, distinct from Tombstone. The type byte is what discriminates them.
size_bytes() table
For a MemTable with n entries of average key length k̄, of which a fraction f are
tombstones, and with average value length v̄ over the non-tombstone entries:
$$ \text{size\_bytes}(n, \bar{k}, \bar{v}, f) = 8 + n \cdot (9 + \bar{k}) + n(1-f) \cdot \bar{v} $$
For a few example workloads:
| n | k̄ | v̄ | f | size_bytes |
|---|---|---|---|---|
| 10 000 | 16 | 100 | 0 | 1,250,008 |
| 100 000 | 32 | 256 | 0.01 | 29,444,008 |
| 1 000 000 | 64 | 1024 | 0 | 1,097,000,008 |
(Compare to LevelDB's default write_buffer_size of 4 MiB or RocksDB's 64 MiB default:
the 10K-entry row is about 1.2 MiB and fits under either cap, while the 100K-entry row, at about 28 MiB, would have triggered several LevelDB flushes along the way.)
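The formula is easy to evaluate in integer arithmetic if the tombstone fraction is passed as a count. The function below is an illustrative sketch; `projected_size` and its parameter names are not part of the lab's API.

```rust
/// 8-byte header + n * (9 + avg key) + live entries * avg live value.
/// `tombstones` is f * n, passed as a count to stay in integer arithmetic.
fn projected_size(n: u64, avg_key: u64, avg_live_value: u64, tombstones: u64) -> u64 {
    assert!(tombstones <= n, "cannot have more tombstones than entries");
    8 + n * (9 + avg_key) + (n - tombstones) * avg_live_value
}

fn main() {
    // (n, avg key, avg value, tombstone count)
    let rows = [
        (10_000u64, 16u64, 100u64, 0u64),
        (100_000, 32, 256, 1_000), // f = 0.01
        (1_000_000, 64, 1024, 0),
    ];
    for (n, k, v, t) in rows {
        let bytes = projected_size(n, k, v, t);
        println!("n={n} k={k} v={v} tombstones={t} -> {bytes} bytes");
    }
}
```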
What an empty MemTable looks like
hexdump -C empty.bin
00000000 4d 4d 54 31 00 00 00 00 |MMT1....|
8 bytes. size_bytes() returns 8. len() returns 0.
Iteration order corner cases
keys = ["", "\x00", "\x00\x00", "a", "ab", "b"]
Sorted byte-lex order:
"" (empty key — sorts first)
"\x00"
"\x00\x00"
"a"
"ab"
"b"
Empty keys are legal in this design (klen = 0). They are useful when the key is something like a single byte tag followed by an optional suffix.
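The ordering above is exactly what Rust's default `Ord` on `Vec<u8>` gives, which a short sketch can confirm. `sorted_keys` is a hypothetical helper used only for this illustration.

```rust
/// Sort raw byte keys with the default (byte-lexicographic) ordering.
fn sorted_keys(input: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut keys: Vec<Vec<u8>> = input.iter().map(|k| k.to_vec()).collect();
    keys.sort(); // Vec<u8>: Ord compares byte by byte, shorter prefix first
    keys
}

fn main() {
    // Insertion order is deliberately scrambled.
    let input: [&[u8]; 6] = [b"b", b"a", b"", b"\x00\x00", b"ab", b"\x00"];
    for k in sorted_keys(&input) {
        println!("{:?}", k);
    }
}
```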
Verification — db-05 LSM MemTable
Unit tests (per language)
| ID | Test name | What it asserts |
|---|---|---|
| V1 | empty_encode_decode | MemTable::new().encode() → 8 bytes MMT1\x00\x00\x00\x00; decode round-trips to an empty table. |
| V2 | put_then_get | After put("k","v"), get("k") returns Value("v"). |
| V3 | overwrite_replaces | Two puts on the same key keep only the latest value; len() stays at 1. |
| V4 | delete_writes_tombstone | After put("k","v") then del("k"), get("k") returns Tombstone (not None). |
| V5 | iter_byte_lex_order | Insert keys in random order; iteration yields them sorted byte-lex ("" first, \x00 next, etc.). |
| V6 | encode_decode_round_trip | Build a 50-entry table with a mix of values and tombstones; encode → decode → every entry matches and len() is preserved. |
| V7 | size_bytes_matches_encode | For any table, size_bytes() == encode().len(). |
| V8 | decoder_rejects_bad_magic | decode(b"XXX1...") returns Err. |
| V9 | decoder_rejects_truncation | Truncate a valid dump at every byte boundary; decode must fail cleanly (no panic). |
| V10 | decoder_rejects_unsorted_keys | Hand-craft a dump where keys go ["b","a"]; decoder rejects. |
Cross-language interop (scripts/cross_test.sh)
The same scripted scenario runs in each language:
new → bulk 100 → put "key50" "REPLACED"
→ del "key10"
→ put "" "empty-key-value"
→ del "key99"
→ save
This produces dumps rust.bin, go.bin, cpp.bin. The script then:
- SHA-256s all three dumps. All must match; this is the byte-identical gate.
- 3×3 reader matrix. Every reader (rust/go/cpp) runs iter on every writer's dump. The lines must be identical across all 9 combinations.
- get spot-check. Each reader queries key50, key10, key99, "", and an absent key nonexistent; results must be value: 5245504c41434544 (REPLACED), tombstone, tombstone, value: 656d7074792d6b65792d76616c7565, absent respectively across all readers.
End-to-end verification (scripts/verify.sh)
bash scripts/verify.sh
Builds and tests all three languages, then runs the cross-test. Final line must be
ALL GREEN.
Manual sanity checks
- memtable new /tmp/m && wc -c /tmp/m → exactly 8 bytes.
- memtable bulk /tmp/m 1000 && memtable size /tmp/m → matches 8 + Σ (9 + len("keyN") + len("valN")) taken over N = 0..999, which is 20,788 bytes.
- Hexdump the first 16 bytes of any dump and confirm magic + count.
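The expected size after a bulk load can be computed without running the CLI. This sketch assumes `bulk` writes keys `key0..key{N-1}` with values `val0..val{N-1}` in plain decimal, as described in the CLI section; `bulk_size` is a hypothetical helper.

```rust
/// Expected dump size after `bulk PATH n` on an empty table.
fn bulk_size(n: usize) -> usize {
    let body: usize = (0..n)
        .map(|i| 9 + format!("key{i}").len() + format!("val{i}").len())
        .sum();
    8 + body
}

fn main() {
    // Key lengths for n = 1000: 10 keys of 4 bytes, 90 of 5, 900 of 6 = 5890 bytes,
    // and the same again for values: 8 + 9 * 1000 + 2 * 5890 = 20788.
    println!("bulk 1000 -> {} bytes", bulk_size(1000));
    println!("bulk 100  -> {} bytes", bulk_size(100));
}
```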
What broken looks like
| Symptom | Diagnostic |
|---|---|
| decode accepts b"\x00\x00\x00\x00" (no magic check) | Add magic test V8. |
| Two readers print different iter output for the same dump | Either type-byte misplaced, or one language is comparing by string instead of bytes (UTF-8 vs raw). |
| len() differs across langs after the same script | Go's map+sort path lost a duplicate; check overwrite path. |
| Dump grows monotonically after del | Tombstone path is creating a new entry under a different key; check key equality. |
| Random crash in C++ on decode of truncated input | Missing length check before memcpy; bounds-check every read. |
Broader Ideas — db-05 LSM MemTable
The MemTable in this lab is intentionally minimal. Real systems extend it in many directions; this doc maps the design space.
Concurrency-friendly structures
- Skip list (LevelDB, RocksDB). Single writer + many readers, lock-free reads via memory-order acquire/release. The dominant choice for LSM MemTables.
- Bw-tree (Hekaton). Lock-free B+ tree using delta records and a mapping table; shines on multi-writer workloads.
- ART (Adaptive Radix Tree). Cache-friendly trie; very fast point lookups, used by HyPer and DuckDB.
- Masstree. Trie-of-B+trees; outperforms skip list on long variable keys.
Arena allocation
LevelDB's MemTable allocates all skip-list nodes from a bump-arena. Freeing is
O(1) (drop the arena). RocksDB has a configurable Arena and a ConcurrentArena
for parallel writes. Real benefit: less fragmentation and one cache-line probe per
allocation. Our lab uses standard allocators because the lesson is the data layout,
not the allocator.
Sequence numbers & MVCC
Production LSMs such as LevelDB append an 8-byte tag to every internal key: a 56-bit sequence number packed with an 8-bit type byte. A snapshot read picks the entry with the largest sequence number ≤ the snapshot's sequence number. db-13 revisits this when we add MVCC; here we collapse to last-write-wins.
Range tombstones
A single tombstone shadows one key. RocksDB's DeleteRange tombstones cover a key
range [start, end) and live in a separate auxiliary structure inside the MemTable
(RangeDelAggregator). This avoids exploding the MemTable size when bulk-deleting.
Adding it would require:
- A RangeTombstone struct: (start, end, seqno).
- A second sorted container inside MemTable.
- get consults both: a key shadowed by an overlapping range tombstone returns Tombstone even if it has a Value entry.
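The three changes listed above might look roughly like this. It is a sketch under stated assumptions, not RocksDB's design: every name is hypothetical, point entries carry a sequence number so a later put can win over an earlier range delete, and the range tombstones sit in a plain `Vec` scanned linearly where a real implementation would keep a sorted structure.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
enum Lookup {
    Value(Vec<u8>),
    Tombstone,
    Absent,
}

/// Covers the half-open key range [start, end).
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

struct MemTable {
    /// key -> (seqno, Some(value) or None for a point tombstone)
    points: BTreeMap<Vec<u8>, (u64, Option<Vec<u8>>)>,
    ranges: Vec<RangeTombstone>, // simplification: unsorted, scanned linearly
    next_seqno: u64,
}

impl MemTable {
    fn new() -> Self {
        MemTable { points: BTreeMap::new(), ranges: Vec::new(), next_seqno: 1 }
    }

    fn bump(&mut self) -> u64 {
        let s = self.next_seqno;
        self.next_seqno += 1;
        s
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        let s = self.bump();
        self.points.insert(key.to_vec(), (s, Some(value.to_vec())));
    }

    fn delete_range(&mut self, start: &[u8], end: &[u8]) {
        let s = self.bump();
        self.ranges.push(RangeTombstone { start: start.to_vec(), end: end.to_vec(), seqno: s });
    }

    fn get(&self, key: &[u8]) -> Lookup {
        // Newest range tombstone covering `key`, if any.
        let covering = self
            .ranges
            .iter()
            .filter(|r| r.start.as_slice() <= key && key < r.end.as_slice())
            .map(|r| r.seqno)
            .max();
        match (self.points.get(key), covering) {
            // A range delete newer than the point entry shadows it.
            (Some((s, _)), Some(r)) if r > *s => Lookup::Tombstone,
            (Some((_, Some(v))), _) => Lookup::Value(v.clone()),
            (Some((_, None)), _) => Lookup::Tombstone,
            // No point entry, but older on-disk versions must still be shadowed.
            (None, Some(_)) => Lookup::Tombstone,
            (None, None) => Lookup::Absent,
        }
    }
}

fn main() {
    let mut t = MemTable::new();
    t.put(b"a1", b"x");
    t.put(b"b1", b"y");
    t.delete_range(b"a", b"b"); // shadows every key starting with 'a'
    t.put(b"a2", b"z"); // written after the range delete, so it survives
    for k in [&b"a1"[..], b"a2", b"b1", b"zz"] {
        println!("{:?} -> {:?}", String::from_utf8_lossy(k), t.get(k));
    }
}
```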
Multiple MemTables (active + immutable list)
Production engines keep one active MemTable plus a list of immutable
MemTables awaiting flush. Reads consult [active, ...immutables, L0, L1, ...] in
order. Writers swap atomically (active → immutable + new empty active) when the
size cap is hit. This decouples flush latency from write latency.
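A single-threaded sketch of the freeze-and-swap step follows. `Engine`, `cap_bytes`, and the other names are hypothetical; the byte accounting is deliberately approximate (overwrites are counted twice), tombstones are omitted, and a production engine would perform the swap under a lock or with an atomic pointer exchange.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Values only, to keep the sketch short (no tombstones here).
type Table = BTreeMap<Vec<u8>, Vec<u8>>;

struct Engine {
    active: Table,
    active_bytes: usize,
    immutables: VecDeque<Table>, // newest at the front, awaiting flush
    cap_bytes: usize,
}

impl Engine {
    fn new(cap_bytes: usize) -> Self {
        Engine { active: Table::new(), active_bytes: 0, immutables: VecDeque::new(), cap_bytes }
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        // Approximate accounting: an overwrite is counted as a new entry.
        self.active_bytes += 9 + key.len() + value.len();
        self.active.insert(key.to_vec(), value.to_vec());
        if self.active_bytes >= self.cap_bytes {
            // Freeze: the full table becomes immutable, a fresh one takes writes.
            let frozen = std::mem::take(&mut self.active);
            self.immutables.push_front(frozen);
            self.active_bytes = 0;
        }
    }

    /// Reads consult the active table first, then immutables newest-first.
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.active
            .get(key)
            .or_else(|| self.immutables.iter().find_map(|t| t.get(key)))
    }
}

fn main() {
    let mut e = Engine::new(32);
    e.put(b"k1", b"0123456789"); // 21 bytes, under the cap
    e.put(b"k2", b"0123456789"); // 42 bytes, triggers a freeze
    e.put(b"k1", b"new"); // lands in the fresh active table
    println!("immutables = {}", e.immutables.len());
    println!("k1 = {:?}", e.get(b"k1"));
    println!("k2 = {:?}", e.get(b"k2"));
}
```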
Write amplification interplay
The MemTable size cap (write_buffer_size) is the first knob in the LSM write
amplification dial:
- Larger MemTable → fewer, bigger L0 SSTables → less L0 compaction → lower write amp but slower recovery and more RAM.
- Smaller MemTable → more L0 SSTables → more compaction work → higher write amp but fast recovery.
RocksDB and Cassandra default in the range 64–256 MiB; LevelDB defaults to 4 MiB.
Persistent MemTables (PMEM)
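Intel Optane / CXL persistent memory blurs the WAL+MemTable boundary: if the MemTable itself lives in persistent memory, the WAL for that MemTable becomes unnecessary. NoveLSM (USENIX ATC 2018) and SLM-DB (FAST 2019) explore this; FloDB (EuroSys 2017) is related work that restructures the in-memory component on ordinary DRAM.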
Encryption
Cassandra and RocksDB optionally encrypt at-rest data including the MemTable's flushed SSTables. The MemTable itself is in RAM and inherits process-memory protection. Encrypting in-memory pages requires hardware support (SGX, AMD SEV).
Compression of in-memory entries
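Stock LevelDB and RocksDB keep MemTable entries uncompressed; LZ4 or ZSTD compression
is applied per block when the MemTable is flushed to an SSTable. An engine with very
long values could instead compress them before insertion, trading CPU for memory; this is useful when RAM is the limit.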
Skip-list level distribution
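Pugh's original skip list uses a geometric level distribution with promotion
probability p (the paper discusses 1/2 and 1/4) and caps the height at about
log_{1/p} n. LevelDB sets max height = 12 and branching = 4 (p = 1/4); RocksDB's skip list uses the same defaults. A lower branching factor gives taller towers and more pointers per node (2 on average at p = 1/2 versus about 1.33 at p = 1/4) in exchange for fewer comparisons per level.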
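The height rule can be sketched in a few lines: start at 1 and promote with probability 1/branching until the cap is reached. The xorshift generator below is only a stand-in so the example has no dependencies; it is not what LevelDB uses, and the names are illustrative.

```rust
const MAX_HEIGHT: usize = 12; // LevelDB's cap
const BRANCHING: u64 = 4; // promote with probability 1/4

/// Tiny xorshift64 generator (stand-in PRNG; the seed must be non-zero).
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Geometric height: 1 with probability 3/4, 2 with 3/16, and so on, capped.
fn random_height(rng: &mut XorShift64) -> usize {
    let mut h = 1;
    while h < MAX_HEIGHT && rng.next() % BRANCHING == 0 {
        h += 1;
    }
    h
}

fn main() {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let samples = 100_000;
    let total: usize = (0..samples).map(|_| random_height(&mut rng)).sum();
    // Theory: mean height = 1 / (1 - 1/4), about 1.33 pointers per node.
    println!("mean height = {:.3}", total as f64 / samples as f64);
}
```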
Adversarial concerns
- Memory amplification via tombstones. A flood of deletes can make the MemTable hold many entries with no live data; eventually all tombstones must propagate to SSTables and may take generations of compaction to GC.
- Skew-induced flush storms. A hot key prefix can keep one MemTable bucket pinned while others empty; with hash-partitioned MemTables (HashSkipList) this is pronounced.
Beyond LSM
- B-epsilon trees (TokuDB / Percona) batch writes inside internal B+ tree nodes; no separate MemTable.
- Anti-caching (H-Store) keeps the working set in memory and evicts cold rows to disk; inverts the LSM model.
- WiscKey decouples keys (LSM) from values (separate log) to slash write amplification for large values.
Step 01 — Sorted map + Entry type
Build the in-memory MemTable: a sorted associative container from byte-key to an
Entry that is either Value(bytes) or Tombstone. Implement put, delete,
get, iter, and len / size_bytes in all three languages with the same
iteration order (byte-lex).
Why this first
The MemTable's iteration order is the contract that the on-disk format and the
SSTable flush both depend on. If three languages disagree on order, every later step
falls apart. So this step is a one-language-after-the-other implementation of the
same BTreeMap-equivalent, with a shared unit test that inserts a permutation and
checks the order.
Entry type
// Rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}
// Go
type EntryType uint8
const (
EntryValue EntryType = 0
EntryTombstone EntryType = 1
)
type Entry struct {
Type EntryType
Value []byte // empty if Tombstone
}
// C++
namespace dse::memtable {
enum class EntryType : std::uint8_t { Value = 0, Tombstone = 1 };
struct Entry {
EntryType type;
std::vector<std::uint8_t> value; // empty if Tombstone
};
}
The type-byte numbering (0 = Value, 1 = Tombstone) is part of the on-disk
contract — don't reorder it.
Container choice
// Rust: Vec<u8>'s Ord is byte-lex; BTreeMap iterates in key order
use std::collections::BTreeMap;

pub struct MemTable {
    map: BTreeMap<Vec<u8>, Entry>,
    bytes: usize,
}
// Go — unordered map; sort keys on iteration / encode
type MemTable struct {
m map[string]Entry
bytes int
}
func (t *MemTable) sortedKeys() []string {
keys := make([]string, 0, len(t.m))
for k := range t.m {
keys = append(keys, k)
}
sort.Strings(keys) // byte-lex on string is the same as on []byte
return keys
}
// C++ — std::map's comparator is operator< on vector<uint8_t>, which is lex
class MemTable {
std::map<std::vector<std::uint8_t>, Entry> map_;
std::size_t bytes_ = 0;
};
put / delete
pub fn put(&mut self, key: &[u8], value: &[u8]) {
    // Subtract the old entry's contribution (0 if the key is new) ...
    self.bytes -= self.entry_bytes(key);
    self.map.insert(key.to_vec(), Entry::Value(value.to_vec()));
    // ... then add the contribution of whatever is stored now.
    self.bytes += self.entry_bytes(key);
}

pub fn delete(&mut self, key: &[u8]) {
    self.bytes -= self.entry_bytes(key);
    // The key stays in the map; only its entry becomes a tombstone.
    self.map.insert(key.to_vec(), Entry::Tombstone);
    self.bytes += self.entry_bytes(key);
}

// Encoded size of the entry currently stored under `key`, or 0 if absent.
fn entry_bytes(&self, key: &[u8]) -> usize {
    match self.map.get(key) {
        None => 0,
        Some(Entry::Value(v)) => 9 + key.len() + v.len(),
        Some(Entry::Tombstone) => 9 + key.len(),
    }
}
Go and C++ use the same accounting trick: subtract the old entry's contribution, update, add the new contribution.
iter
pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)> {
    self.map.iter().map(|(k, e)| (k.as_slice(), e))
}
func (t *MemTable) Iter() []KeyEntry {
out := make([]KeyEntry, 0, len(t.m))
for _, k := range t.sortedKeys() {
out = append(out, KeyEntry{Key: []byte(k), Entry: t.m[k]})
}
return out
}
const std::map<std::vector<std::uint8_t>, Entry>& Iter() const noexcept {
return map_;
}
Test — order determinism
Insert this permutation in each language and assert iteration yields the keys in the canonical sorted order:
inputs (insert order): ["b", "a", "", "\x00\x00", "ab", "\x00"]
expected iter order: ["", "\x00", "\x00\x00", "a", "ab", "b"]
This catches:
- Go forgetting to sort.
- C++ building keys from C strings (const char*) or comparing them with strcmp, where '\0' ends the key and the three keys "", "\x00", and "\x00\x00" collapse into one.
- Anyone using a hash map.
What to verify before moving on
- put then get returns the value just written.
- delete then get returns Tombstone (not absent).
- Overwriting a key keeps len() at 1.
- The permutation test above passes.
- size_bytes() increases by exactly 9 + klen + vlen for each new key and stays flat when overwriting with a value of the same length.
Step 02 — Encode / Decode the dump
Serialize the MemTable to a byte-identical on-disk layout and parse it back.
Layout (recap from analysis.md)
magic "MMT1" (4 bytes)
count u32 LE (4 bytes)
repeat count times, in ascending key order:
klen u32 LE
vlen u32 LE (0 for tombstone)
type u8 (0 = Value, 1 = Tombstone)
key klen bytes
value vlen bytes
Rust
pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(self.size_bytes());
    out.extend_from_slice(b"MMT1");
    out.extend_from_slice(&(self.map.len() as u32).to_le_bytes());
    for (k, e) in &self.map {
        // Tombstones carry vlen = 0 and no value bytes.
        let (vlen, t, v) = match e {
            Entry::Value(v) => (v.len() as u32, 0u8, v.as_slice()),
            Entry::Tombstone => (0, 1, &[][..]),
        };
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(&vlen.to_le_bytes());
        out.push(t);
        out.extend_from_slice(k);
        out.extend_from_slice(v);
    }
    out
}

pub fn decode(bytes: &[u8]) -> Result<Self, Error> {
    if bytes.len() < 8 { return Err(Error::Short); }
    if &bytes[..4] != b"MMT1" { return Err(Error::BadMagic); }
    let count = u32::from_le_bytes(bytes[4..8].try_into().unwrap()) as usize;
    let mut p = 8usize;
    let mut t = MemTable::new();
    let mut prev: Option<Vec<u8>> = None;
    for _ in 0..count {
        // p never exceeds bytes.len(), so the subtraction cannot underflow.
        if bytes.len() - p < 9 { return Err(Error::Short); }
        let klen = u32::from_le_bytes(bytes[p..p+4].try_into().unwrap()) as usize;
        let vlen = u32::from_le_bytes(bytes[p+4..p+8].try_into().unwrap()) as usize;
        let ty = bytes[p+8];
        p += 9;
        // checked_add keeps klen + vlen from wrapping on 32-bit targets.
        let body = klen.checked_add(vlen).ok_or(Error::Short)?;
        if bytes.len() - p < body { return Err(Error::Short); }
        let key = bytes[p..p+klen].to_vec(); p += klen;
        let val = bytes[p..p+vlen].to_vec(); p += vlen;
        if let Some(ref pk) = prev {
            if key.as_slice() <= pk.as_slice() { return Err(Error::Unsorted); }
        }
        let entry = match ty {
            0 => Entry::Value(val),
            1 => {
                if vlen != 0 { return Err(Error::BadTombstone); }
                Entry::Tombstone
            }
            _ => return Err(Error::BadType),
        };
        prev = Some(key.clone());
        // insert_raw: internal helper (not shown) that inserts and updates the byte counter.
        t.insert_raw(key, entry);
    }
    if p != bytes.len() { return Err(Error::Trailing); }
    Ok(t)
}
Go
func (t *MemTable) Encode() []byte {
out := make([]byte, 0, t.SizeBytes())
out = append(out, 'M', 'M', 'T', '1')
out = binary.LittleEndian.AppendUint32(out, uint32(len(t.m)))
for _, k := range t.sortedKeys() {
e := t.m[k]
out = binary.LittleEndian.AppendUint32(out, uint32(len(k)))
out = binary.LittleEndian.AppendUint32(out, uint32(len(e.Value)))
out = append(out, byte(e.Type))
out = append(out, k...)
out = append(out, e.Value...)
}
return out
}
Decode mirrors the Rust shape: read header, then loop reading klen, vlen,
type, key, value, validating ascending key order and rejecting trailing bytes.
C++
std::vector<std::uint8_t> MemTable::Encode() const {
std::vector<std::uint8_t> out;
out.reserve(SizeBytes());
static constexpr std::uint8_t magic[4] = {'M','M','T','1'};
out.insert(out.end(), magic, magic + 4);
PutU32LE(out, static_cast<std::uint32_t>(map_.size()));
for (auto const& [k, e] : map_) {
PutU32LE(out, static_cast<std::uint32_t>(k.size()));
PutU32LE(out, static_cast<std::uint32_t>(e.value.size()));
out.push_back(static_cast<std::uint8_t>(e.type));
out.insert(out.end(), k.begin(), k.end());
out.insert(out.end(), e.value.begin(), e.value.end());
}
return out;
}
What the decoder must reject
| Input | Why it must fail |
|---|---|
| < 8 bytes | header truncated |
| magic XXXX | bad format |
| count says 5 but only 3 entries fit | truncated body |
| type byte 2 | unknown type |
| tombstone with vlen != 0 | malformed |
| keys not strictly ascending | violates order invariant |
| trailing bytes after last entry | corruption |
How to spot-check
After encoding a 2-entry MemTable (alpha=first, beta tombstoned):
xxd /tmp/m.bin | head
00000000: 4d4d 5431 0200 0000 0500 0000 0500 0000 MMT1............
00000010: 0061 6c70 6861 6669 7273 7404 0000 0000 .alphafirst.....
00000020: 0000 0001 6265 7461 ....beta
Three things to verify by eye:
- 4d4d5431 = MMT1.
- 0200 0000 = count of 2 (LE).
- The tombstone's vlen is 0000 0000 and the byte before its key is 01.
Test V6 — round-trip 50 entries
Mix puts and deletes:
let mut t = MemTable::new();
for i in 0..50 {
    let key = format!("key{i}");
    if i % 5 == 0 {
        t.delete(key.as_bytes());
    } else {
        t.put(key.as_bytes(), format!("val{i}").as_bytes());
    }
}
let encoded = t.encode();
let roundtrip = MemTable::decode(&encoded).unwrap();
assert_eq!(roundtrip.len(), t.len());
for (k, e) in t.iter() {
    assert_eq!(roundtrip.get(k), Some(e));
}
assert_eq!(roundtrip.encode(), encoded);
The last line is the idempotence check — decoding and re-encoding produces the same bytes. If it doesn't, we have non-determinism in iteration order, which will also break cross-language interop.
Step 03 — CLI + cross-language interop
Wrap the library in a uniform CLI and prove that all three implementations produce byte-identical dumps for the same scripted scenario.
The memtable binary
Every language exposes the same subcommands so the cross-test can drive them uniformly:
memtable new PATH
memtable put PATH KEY VALUE
memtable del PATH KEY
memtable get PATH KEY
memtable iter PATH
memtable bulk PATH N
memtable size PATH
- KEY and VALUE are passed as raw command-line strings. They may contain printable bytes; for testing we stick to ASCII to avoid shell quoting issues.
- iter and get print hex (lowercase, no separator) so output is shell-safe.
Output formats
# iter
V <hex-key> <hex-value>
T <hex-key>
# get
value: <hex-value>
tombstone
absent
# size
entries=<N> size_bytes=<B>
The scripted scenario
scripts/cross_test.sh drives every language through this sequence:
new # 8 bytes, empty
bulk 100 # 100 entries key0..key99 / val0..val99
put key50 REPLACED # overwrite
del key10 # tombstone
put "" empty-key-value # empty key as a valid key
del key99 # tombstone at the tail
Then it dumps rust.bin, go.bin, cpp.bin and asserts:
shasum -a 256 rust.bin go.bin cpp.bin
# all three hashes must be identical
3×3 reader matrix
For every writer × reader combination, the script runs
$READER iter $WRITER.bin > out.${reader}.${writer}.txt
and diffs pairs of outputs. All nine outputs must agree byte-for-byte.
Why a bulk subcommand
Running 100 separate memtable put PATH key0 val0 … invocations would (a) thrash
the disk and (b) test the CLI's argument parsing more than the data structure.
bulk exists so the cross-test can build a non-trivial table in one process per
language.
Spot-check get results
After the scenario the script also runs
get key50 # expect 'value: 5245504c41434544' (REPLACED in hex)
get key10 # expect 'tombstone'
get key99 # expect 'tombstone'
get "" # expect 'value: 656d7074792d6b65792d76616c7565' (empty-key-value)
get nonexistent # expect 'absent'
across all three readers.
Failure messages worth designing for
$ memtable get /tmp/m bogus
absent
$ memtable get /tmp/no-such-file foo
error: read /tmp/no-such-file: No such file or directory
$ memtable get /tmp/garbage.bin foo
error: bad magic
A consistent error vocabulary across languages makes the cross-test's grep patterns simpler.
Tying it together
scripts/verify.sh runs:
- Rust tests (cargo test --release).
- Go tests (go test ./...).
- C++ tests (cmake -S . -B build && cmake --build build && ctest).
- The cross-language script.
Final stdout must end with ALL GREEN.
SSTable Format
1. What Is It
A Sorted String Table (SSTable) is an immutable on-disk file holding key/value entries in byte-lex key order, organised into blocks of a bounded target size (4096 bytes by default), with an index block that maps each block's first key to its byte range inside the file, and a fixed-size footer that locates the index block.
The format in this lab:
+--------------------+ 0
| data block 0 |
+--------------------+
| data block 1 |
+--------------------+
| ... |
+--------------------+
| data block N-1 |
+--------------------+ index_offset
| index block |
+--------------------+ file_size - 32
| footer (32 bytes) |
+--------------------+ file_size
The footer always lives in the last 32 bytes and ends with the magic
SST1\0\0\0\0, so any reader can validate the file with one pread of the
tail and then pread the index block, and only then the relevant data
block.
2. Why It Matters
- Write-once, read-many. Each SSTable is written sequentially and then treated as read-only. That eliminates most concurrency hazards: lookups, range scans, and compactions can all share a single immutable file.
- Bounded read amplification. A point lookup is footer → index → one data block. With a 4 KB block and a 16-byte average entry, ≤256 keys are scanned per lookup, regardless of file size.
- Predictable I/O. Blocks are aligned write units; the OS page cache can pin hot blocks. Once the 32-byte footer is cached, tail latency is dominated by at most two I/Os per uncached lookup (index + data).
- Foundation for LSM. Compaction merges multiple immutable SSTables into new immutable SSTables. The format is what makes "immutable + sorted + indexed" a usable storage primitive.
3. How It Works
3.1 Data block
A data block is a self-describing run of entries.
[count: u32 LE]
repeat count times (keys ascending byte-lex within the block):
[klen: u32 LE][vlen: u32 LE][type: u8][key bytes][value bytes]
The writer flushes a block when its accumulated size would exceed a target (default 4096 bytes). The very first key of each block is the index key for that block.
3.2 Index block
[count: u32 LE]
repeat count times:
[klen: u32 LE][offset: u64 LE][size: u64 LE][first-key bytes]
offset and size locate the data block inside the file. Index entries
are listed in ascending block order, which is the same as ascending
first-key order.
3.3 Footer (exactly 32 bytes)
[index_offset: u64 LE]
[index_size: u64 LE]
[num_blocks: u64 LE]
[magic: "SST1\0\0\0\0" (8 bytes)]
The fixed size makes the tail trivially locatable: pread(fd, buf, 32, file_size - 32).
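A minimal sketch of writing and validating the footer, operating on an in-memory byte slice rather than a file descriptor. `Footer`, `encode_footer`, and `decode_footer` are illustrative names, and the check that the index block ends exactly where the footer begins is an assumption read off the layout diagram, not a documented rule.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Footer {
    index_offset: u64,
    index_size: u64,
    num_blocks: u64,
}

const FOOTER_LEN: usize = 32;
const MAGIC: &[u8; 8] = b"SST1\0\0\0\0";

fn encode_footer(f: &Footer) -> [u8; FOOTER_LEN] {
    let mut out = [0u8; FOOTER_LEN];
    out[0..8].copy_from_slice(&f.index_offset.to_le_bytes());
    out[8..16].copy_from_slice(&f.index_size.to_le_bytes());
    out[16..24].copy_from_slice(&f.num_blocks.to_le_bytes());
    out[24..32].copy_from_slice(MAGIC);
    out
}

/// Read a little-endian u64 from an 8-byte slice.
fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

/// Parse the last 32 bytes of `file`; None if the file is not a valid SSTable tail.
fn decode_footer(file: &[u8]) -> Option<Footer> {
    if file.len() < FOOTER_LEN {
        return None;
    }
    let footer_start = file.len() - FOOTER_LEN;
    let tail = &file[footer_start..];
    if &tail[24..] != MAGIC {
        return None;
    }
    let f = Footer {
        index_offset: le_u64(&tail[0..8]),
        index_size: le_u64(&tail[8..16]),
        num_blocks: le_u64(&tail[16..24]),
    };
    // Assumed sanity check: the index block sits directly before the footer.
    if f.index_offset.checked_add(f.index_size)? != footer_start as u64 {
        return None;
    }
    Some(f)
}

fn main() {
    let f = Footer { index_offset: 60, index_size: 40, num_blocks: 3 };
    let mut file = vec![0u8; 100]; // placeholder data blocks + index block
    file.extend_from_slice(&encode_footer(&f));
    println!("{:?}", decode_footer(&file));
}
```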
3.4 Point lookup
1. Read footer; verify magic.
2. Read index block (index_offset..index_offset+index_size).
3. Binary-search the index for the rightmost index entry whose first key ≤ target. That entry's block is the only one that can contain the target.
4. Read that block; linear-scan within it.
A miss in step 3 (target < first entry's key) means the key is absent without reading any data block.
4. Core Terminology
| Term | Definition |
|---|---|
| SSTable | Immutable sorted on-disk K/V file. |
| Block | Contiguous run of entries inside the SSTable; the I/O and indexing unit. |
| Data block | Block containing user K/V entries. |
| Index block | Block mapping each data block's first key to (offset, size). |
| Footer | Fixed-size tail (32 B) locating the index block; ends with a magic. |
| Magic | Sentinel byte pattern (SST1 here) used to validate file identity. |
| Index key | The first key of a data block, copied into the index entry. |
| Block boundary | The byte position where one data block ends and the next begins. |
| Restart point | (Not used here; LevelDB-style intra-block delta-encoding marker.) |
| Tombstone | Entry whose type=1 records that a key has been logically deleted. |
5. Mental Models
- Phone book. Data blocks are pages; the index is the alphabetical tabs on the side. The footer is the spine label saying "Volume 3 of 3".
- Skiplist with one level. The index is a single coarse "level" above the sorted data; binary search on the index replaces a multi-level skiplist traversal.
- Two-tier B+tree. Conceptually an SSTable is a 2-level B+tree whose leaves are data blocks and whose root is the index block — but built sequentially and frozen.
6. Common Misconceptions
- "You need to scan the whole file to find a key." No: one index lookup pins a single block.
- "The index must be at the start so you can read it first." No: the footer pointer makes the index location flexible, and writing the index last avoids buffering all keys before any data is flushed.
- "Block size = block contents size." The on-disk block includes its own count header; the writer tracks an estimate of accumulated bytes so blocks land roughly on the target.
- "Tombstones can be dropped at write time." Not safely: a tombstone must survive until no older SSTable can still hold a version of the key that it has to shadow (handled by compaction in db-07).
- "Binary search needs fixed-size index entries." The index entries here are variable-length, but the index block itself is small and fully loaded into RAM, where any search structure is cheap.
7. Interview Talking Points
- Why is the footer at the end? ("So the writer can stream data blocks then index without two passes; one pread of the tail finds everything.")
- What changes if you target 64 KB blocks instead of 4 KB? (Fewer index entries → smaller index → faster index lookups; larger read amplification per miss; better compression ratios.)
- How does this format become durable? (fsync after the footer is written, and a parent directory fsync so the dirent is recoverable. Without it a crash can leave the magic visible but data missing.)
- What is bsearch looking for inside the index? (The floor, i.e. the largest first-key ≤ target, not equality.)
- What stops a corrupt footer from poisoning the reader? (Magic check + length plausibility checks. Real systems add CRCs per block.)
8. Connections to Other Labs
- db-05 MemTable supplies the sorted, in-memory K/V stream that this writer drains into blocks.
- db-04 Bloom Filters can be attached per SSTable to skip the index lookup on negative queries (added in db-08 / db-09).
- db-07 LSM Compaction consumes many SSTables and produces new ones using exactly this format.
- db-08 Block Cache and Iterators caches the parsed data block rather than re-decoding on each lookup, and turns intra-block scans into iterators.
- db-09 LevelDB Complete stitches MemTable + WAL + SSTable + compaction + bloom into a working engine.
References — SSTable Format
Papers
- O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E. "The Log-Structured Merge-Tree (LSM-Tree)." Acta Informatica 33(4), 1996. — Original description of the immutable run / multi-level merge architecture.
- Chang, F. et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006. — Introduces the SSTable term and the blocks-plus-index layout this lab mirrors.
Open-source implementations
- LevelDB table/format.h, table/table_builder.cc, table/block_builder.cc, table/block.cc: canonical reference for this lab. The data block format here is the LevelDB block format with restart-point compression removed for clarity.
- RocksDB table/block_based/block_based_table_builder.cc: adds bloom-filter blocks, compression, and partitioned indices on top of the same skeleton.
- CockroachDB Pebble sstable/writer.go and sstable/reader.go: Go implementation in idiomatic style.
Articles
- LevelDB design doc: https://github.com/google/leveldb/blob/main/doc/table_format.md
- RocksDB BlockBasedTable: https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format
- "Building an LSM Storage Engine: SSTables" — Mini-LSM tutorial, Chen et al.: https://skyzh.github.io/mini-lsm/
Analysis — SSTable Format
Problem
Take a sorted stream of K/V (and tombstone) entries — exactly what db-05 produces — and persist it as an immutable, randomly-readable file:
- writing is one sequential pass (no re-reads, no buffering all keys);
- a point lookup costs O(log N) on the index plus one block read;
- the file is self-describing: a reader can validate and navigate it using only the file itself.
Constraints
- 4 KB target data-block size (close to a page; tunable).
- Little-endian integers throughout (matches db-03 / db-05).
- No per-block CRCs in this lab — added in db-21 ("Storage Engine Advanced"). The footer magic is the only integrity gate.
- No compression, no delta-encoded keys: the goal is a format simple enough to compare byte-for-byte across three languages.
- Cross-language interop: Rust, Go, and C++ MUST emit byte-identical SSTables for the same input MemTable.
Design
Stream-once writer
sst_writer.add(key, type, value) -> writes into current data block buffer
sst_writer.finish() -> flushes the current block, writes index, writes footer
The writer accumulates one block in memory at a time. When adding an
entry would push the encoded block size past 4096 bytes, the current
block is flushed and a new one started. The first key of every block
is captured into an IndexEntry { key, offset, size }.
Index sizing
Index entries are ~ (4 + 8 + 8 + k̄) = 20 + k̄ bytes; for k̄ = 16 and ~250 entries per 4 KB block, a 1 GB SSTable carries ~262 144 blocks × 36 B ≈ 9 MB of index — small enough to keep in RAM per open SSTable.
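The sizing arithmetic above can be checked with a few lines of Rust. This is a rough sketch, not part of the lab's API: `estimate_index_bytes` is a hypothetical helper, and it assumes every data block is exactly `block_target` bytes and every first key is `avg_key_len` bytes long.

```rust
/// Per-entry index overhead: klen u32 + offset u64 + size u64.
const INDEX_ENTRY_OVERHEAD: u64 = 4 + 8 + 8;

/// Rough index size in bytes for a file of `file_bytes`.
/// Assumes full blocks of exactly `block_target` bytes and a uniform
/// first-key length; both are simplifications for estimation only.
fn estimate_index_bytes(file_bytes: u64, block_target: u64, avg_key_len: u64) -> u64 {
    let blocks = file_bytes / block_target;
    // 4-byte index count header + one entry per block
    4 + blocks * (INDEX_ENTRY_OVERHEAD + avg_key_len)
}
```

For a 1 GiB file with 4096-byte blocks and 16-byte keys this gives 262 144 entries of 36 bytes plus the 4-byte count header, which is the ~9 MB figure quoted above.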
Lookup
fn get(key) -> Option<Entry>:
footer = read_tail(32)
assert footer.magic == "SST1\0\0\0\0"
index = read(footer.index_offset, footer.index_size)
blk = bsearch_floor(index, key)? # None => below smallest
block = read(blk.offset, blk.size)
return linear_scan(block, key)
bsearch_floor is the rightmost index entry whose first key ≤ target.
Returning None (target precedes the smallest first-key) is a fast
miss without reading any data block.
Per-language container choice
| Language | Writer buffer | Index repr |
|---|---|---|
| Rust | Vec<u8> for the current block | Vec<IndexEntry> |
| Go | []byte | []IndexEntry |
| C++ | std::vector<uint8_t> | std::vector<IndexEntry> |
IndexEntry is (Vec<u8> key, u64 offset, u64 size) in all three.
Build-from-memtable bridge
For cross-test friendliness, the writer's input source is a decoded
MemTable dump (the output of db-05 encode). The CLI command
build reads IN.mt, iterates in sorted order, and emits OUT.sst.
What could break
- Block boundary drift. If two implementations disagree on when to flush a block (e.g. > 4096 vs >= 4096), the data blocks land at different offsets, the index differs, and the file hashes differ. We pin the rule: flush when current_block_size + next_entry_size > 4096 AND the current block already holds at least one entry.
- Index encoding for the very first block. Its first key may be the empty string ""; the index entry then has klen=0. The reader must still treat it as the floor for any non-empty target.
- Footer alignment. Anything other than exactly 32 bytes after the index block invalidates the magic offset.
Execution — SSTable Format
Library API (uniform across Rust / Go / C++)
struct Entry { type: 0|1; value: bytes } // tombstone => value == empty
struct IndexEntry { key: bytes; offset: u64; size: u64 }
struct Footer { index_offset: u64; index_size: u64; num_blocks: u64; magic: "SST1\0\0\0\0" }
const BLOCK_TARGET: usize = 4096
const FOOTER_LEN: usize = 32
const MAGIC: &[u8; 8] = b"SST1\0\0\0\0"
// ---- writer ----
SstWriter::new(target_block_size = BLOCK_TARGET)
SstWriter::add(&mut self, key: &[u8], entry: Entry) -> Result<(), Error>   // keys MUST be strictly ascending
SstWriter::finish(self) -> Vec<u8>                           // consumes the writer, returns full SSTable bytes
// ---- reader ----
SstReader::open(bytes: &[u8]) -> Result<Self, Error>
SstReader::len(&self) -> usize // num entries
SstReader::num_blocks(&self) -> usize
SstReader::get(&self, key: &[u8]) -> Option<Entry>           // None only if absent; a tombstone is returned, not skipped
SstReader::iter(&self) -> impl Iterator<Item=(&[u8], Entry)> // full file scan
Error variants: Short, BadMagic, BadBlock, Unsorted,
BadTombstone, BadType, IndexOutOfRange.
CLI
The binary is named sstable in every language and dispatches on the
first arg:
sstable build IN.mt OUT.sst # read MemTable dump, write SSTable
sstable footer FILE.sst # print: index_offset=... index_size=... num_blocks=... magic_ok=...
sstable get FILE.sst KEY # prints: value: <hex> | tombstone | absent
sstable iter FILE.sst # prints lines: V <hex-key> <hex-value> | T <hex-key>
sstable size FILE.sst # prints: file_bytes=B entries=N num_blocks=K
Output formats match db-05 deliberately so the same cross-test helpers
(hex iter, value:/tombstone/absent get) apply.
Worked example
Given memtable bulk M.mt 100 && memtable put M.mt key50 REPLACED && memtable del M.mt key10, calling sstable build M.mt OUT.sst does:
1. Decode M.mt (MemTable format from db-05).
2. Iterate in sorted order; for each entry, call writer.add.
3. The writer accumulates entries into a 4096-byte data-block buffer. When the next entry would overflow, it flushes the buffer:
   - records IndexEntry { key = first_key_of_block, offset, size },
   - appends the encoded block to the output stream,
   - starts a fresh buffer, which then receives the entry that triggered the flush.
4. After the last add, finish flushes the final block, then writes the index block, then a 32-byte footer ending in SST1\0\0\0\0.
The output file is then self-validating: sstable footer OUT.sst
prints the footer values, sstable iter OUT.sst reproduces every
entry in sorted order, and sstable get OUT.sst key50 returns
value: 5245504c41434544.
Observation — SSTable Format
Smallest possible SSTable
Build from an empty MemTable: zero entries, zero data blocks, an empty index, and just the footer.
file size: 0 + 4 (index count=0) + 32 (footer) = 36 bytes
Hex (annotated):
offset
0000: 00 00 00 00                # index block: count=0
0004: 00 00 00 00 00 00 00 00    # footer.index_offset = 0
000c: 04 00 00 00 00 00 00 00    # footer.index_size = 4
0014: 00 00 00 00 00 00 00 00    # footer.num_blocks = 0
001c: 53 53 54 31 00 00 00 00    # magic = "SST1\0\0\0\0"
File-size formula
For a build with N entries spread across K data blocks where the
sum of key sizes is Σk and the sum of value sizes (only for
non-tombstone entries) is Σv:
data_bytes = Σ_blocks ( 4 + Σ_entries_in_block (9 + k + v) )
= 4·K + N·9 + Σk + Σv
index_bytes = 4 + Σ_blocks ( 4 + 8 + 8 + first_key_len )
= 4 + K·20 + Σ_block_first_key_lens
file_bytes = data_bytes + index_bytes + 32
(The 4-byte per-block header is the entry count. The 20-byte
per-index-entry overhead is klen u32 + offset u64 + size u64.)
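As a sanity check, the formula can be written out directly. A minimal sketch, assuming the caller already knows how entries were split into blocks; `predicted_file_bytes` is a hypothetical helper, not part of the lab's API.

```rust
/// File size predicted by the formula above.
/// Each inner Vec is one data block; each tuple is (key_len, value_len),
/// with value_len = 0 for tombstones.
fn predicted_file_bytes(blocks: &[Vec<(u64, u64)>]) -> u64 {
    let mut data_bytes = 0u64;
    let mut index_bytes = 4u64; // index count header
    for block in blocks {
        data_bytes += 4; // per-block count header
        for &(k, v) in block {
            data_bytes += 9 + k + v; // klen + vlen + type + key + value
        }
        let first_key_len = block.first().map_or(0, |&(k, _)| k);
        index_bytes += 20 + first_key_len; // klen + offset + size + first key
    }
    data_bytes + index_bytes + 32 // fixed-size footer
}
```

With no blocks the result is 36 bytes. For one block holding keys of length 1, 2, 3 with values of length 1, 2, 0, the formula gives 40 + 25 + 32 = 97 bytes.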
Hex walkthrough of a 3-entry SSTable
Three small entries that fit in a single block under the default
4096-byte target, e.g. put a 1, put bb 22, del ccc:
00000000  03 00 00 00                                  # count=3
00000004  01 00 00 00 01 00 00 00 00 'a' '1'           # entry 1: klen=1 vlen=1 type=0 "a" "1"
0000000f  02 00 00 00 02 00 00 00 00 'b' 'b' '2' '2'   # entry 2: klen=2 vlen=2 type=0 "bb" "22"
0000001c  03 00 00 00 00 00 00 00 01 'c' 'c' 'c'       # entry 3: klen=3 vlen=0 type=1 "ccc"
00000028  01 00 00 00                                  # index count=1
0000002c  01 00 00 00 00 00 00 00 00 00 00 00 28 00 00 00 00 00 00 00 'a'  # klen=1 offset=0 size=0x28 "a"
00000041  28 00 00 00 00 00 00 00                      # footer.index_offset = 0x28
00000049  19 00 00 00 00 00 00 00                      # footer.index_size   = 0x19
00000051  01 00 00 00 00 00 00 00                      # footer.num_blocks   = 1
00000059  53 53 54 31 00 00 00 00                      # magic "SST1\0\0\0\0"
                                                       # (file size = 0x61 = 97 bytes)
Note that the first key of the single block is "a", so the index
entry copies that key.
What broken looks like
| Symptom | Likely cause |
|---|---|
| BadMagic at open | file truncated, or footer overwritten by an interrupted writer. |
| BadBlock reading a block | block size in the index disagrees with the in-file count header, e.g. wrong endianness. |
| Two languages produce different file sizes for identical input | block-flush rule mismatch (> vs >=). |
| Unsorted from the writer | caller didn't iterate the MemTable in sorted order before add. |
| IndexOutOfRange at read | corrupted offset/size in the index; checked against file_len - 32 to fail loudly. |
Verification — SSTable Format
V1: empty build
build from an empty MemTable produces a 36-byte file ending in
SST1\0\0\0\0 with index_offset=0, index_size=4, num_blocks=0.
V2: single-entry build
add("k", Value("v")) → finish yields a file that:
- contains exactly one data block,
- has one index entry with key "k",
- round-trips: iter returns [("k", Value("v"))], get("k") returns the value, get("missing") returns None.
V3: tombstones survive
A tombstone added during build is reported as
Some(Entry::Tombstone) by get and as T <hex-key> by iter.
V4: ascending-key precondition
Calling add with a key that is not strictly greater than the
previous added key MUST return Unsorted and leave no partial output.
V5: block-boundary rule
With target_block_size = 64 and inputs whose encoded sizes are
known, the writer flushes the running block as soon as adding the
next entry would push its size strictly greater than 64 bytes. A
test inserts entries crafted so that the second insert is the
boundary-crossing one and asserts the resulting file has exactly two
data blocks and two index entries.
V6: footer location and magic
For any file produced by finish:
- the last 32 bytes parse as a Footer,
- magic == "SST1\0\0\0\0",
- index_offset + index_size + 32 == file_size,
- num_blocks matches the count of index entries.
V7: bulk round-trip vs MemTable
Take a MemTable populated by bulk 100 + put key50 REPLACED + del key10 + put "" empty-key-value + del key99, build an SSTable from it, then
verify that iter-over-SSTable returns the exact same (key, entry)
sequence as iter-over-MemTable. Per-key get agrees too.
V8: floor lookup correctness
For a multi-block SSTable, get(target) returns the matching entry
when present and None when the target falls between blocks, even
though the index entry it lands on belongs to the preceding block.
V9: reader rejects bad magic
A file with the last 8 bytes mutated away from SST1\0\0\0\0 MUST
return BadMagic on open.
V10: reader rejects out-of-range index pointer
A file whose footer claims index_offset >= file_size - 32 MUST
return IndexOutOfRange on open (caught before any block read).
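The footer checks from V6, V9, and V10 can be combined into one open-time validation. The sketch below is an illustration only: `FooterError` and `check_footer` are hypothetical names, and the lab's real reader may order or name these checks differently.

```rust
#[derive(Debug, PartialEq)]
enum FooterError { Short, BadMagic, IndexOutOfRange }

const FOOTER_LEN: usize = 32;
const MAGIC: &[u8; 8] = b"SST1\0\0\0\0";

/// Returns (index_offset, index_size, num_blocks) if the tail is plausible.
fn check_footer(file: &[u8]) -> Result<(u64, u64, u64), FooterError> {
    if file.len() < FOOTER_LEN {
        return Err(FooterError::Short);
    }
    let tail = &file[file.len() - FOOTER_LEN..];
    if &tail[24..32] != MAGIC {
        return Err(FooterError::BadMagic);
    }
    // Decode one little-endian u64 at byte offset `i` of the footer.
    let u64_at = |i: usize| {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&tail[i..i + 8]);
        u64::from_le_bytes(raw)
    };
    let (off, size, blocks) = (u64_at(0), u64_at(8), u64_at(16));
    // The index must end exactly where the footer begins; checked_add
    // rejects corrupt values whose sum would wrap around u64.
    let data_end = (file.len() - FOOTER_LEN) as u64;
    match off.checked_add(size) {
        Some(end) if end == data_end => Ok((off, size, blocks)),
        _ => Err(FooterError::IndexOutOfRange),
    }
}
```

All three checks run before any block is read, so a corrupt footer fails at open instead of surfacing later as a bad slice.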
Broader Ideas — SSTable Format
- Restart points and prefix compression. LevelDB stores keys as (shared_prefix_len, unshared_suffix, value) and resets the prefix every N entries (a "restart point"). The block trailer lists the restart offsets so binary search inside a block is still O(log N_restarts). This can shrink on-disk size substantially for key sets with long shared prefixes, but couples decode to encode order.
- Two-level / partitioned index. A 1 TB SSTable would have ~10 GB of index entries. Partitioning the index into "index blocks of index blocks" keeps the resident index small at the cost of one extra pread per miss. RocksDB supports this as an opt-in partitioned (two-level) index for large SSTables.
- Per-block bloom filters. Attaching a small Bloom (db-04) to each data block lets the reader skip the entire block on a miss without decoding it. Trades index/Bloom RAM for fewer block reads.
- Block CRCs / per-block compression. Real engines write [data][type byte: compression][crc32c] per block; the writer computes the CRC over compressed bytes. Detects bit-rot at read time but adds CPU cost per block.
- Streaming writer to disk. This lab returns Vec<u8>; production writers stream blocks into an os.File and only buffer the index in RAM. With a 4 KB block target and 250 entries/block, peak RAM is ~4 KB + the growing index.
- Min/max keys per block in the index. Index entries can carry the last key too, so a query strictly between two blocks short-circuits without reading either. Costs ~2× index size.
- Splitting hot blocks. Some engines (e.g. CockroachDB) measure read frequency per block and adaptively shrink hot blocks to reduce read amplification on small lookups.
- Versioned magic. A future format change (e.g. adding bloom blocks) bumps the magic to SST2; readers can keep both code paths and choose at open time. Cheap, common practice.
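To make the first idea concrete, here is a small sketch of prefix compression with restart points. It only illustrates the technique: the function names are hypothetical, the output is a list of (shared_len, suffix) pairs rather than LevelDB's actual varint wire format, and values are left out.

```rust
/// Length of the common prefix of two byte strings.
fn shared_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Encode sorted keys as (shared_len, unshared_suffix) pairs.
/// Every `restart_interval`-th key is a restart point and is stored whole
/// (shared_len = 0), so a reader can start decoding from there.
fn prefix_compress(keys: &[&[u8]], restart_interval: usize) -> Vec<(usize, Vec<u8>)> {
    assert!(restart_interval > 0, "restart_interval must be at least 1");
    let mut out = Vec::with_capacity(keys.len());
    for (i, key) in keys.iter().enumerate() {
        let shared = if i % restart_interval == 0 {
            0
        } else {
            shared_prefix_len(keys[i - 1], key)
        };
        out.push((shared, key[shared..].to_vec()));
    }
    out
}
```

Decoding has to replay entries from the nearest restart point, which is the coupling between decode and encode order mentioned in the first bullet.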
Step 01 — Data Block Writer
Goal
Implement the smallest unit of an SSTable: a data block builder that accumulates entries and emits the on-disk block bytes.
Block format (recap)
[count: u32 LE]
repeat count times (keys ascending within the block):
[klen: u32 LE][vlen: u32 LE][type: u8][key][value]
The block does not carry its own size — the index entry that points to it does.
Encoded entry size
entry_size(klen, vlen) = 9 + klen + vlen # 4 + 4 + 1 + key + value
A block that holds N entries occupies 4 + Σ entry_size.
Flush rule
Track current_size = 4 (the block header) and the buffer separately.
For each candidate entry (k, v):
sz = entry_size(len(k), len(v))
if buffer_non_empty AND current_size + sz > BLOCK_TARGET:
flush() # emit block, capture index entry, reset
push entry
current_size += sz
This rule lets a block grow up to and including BLOCK_TARGET bytes.
The only exception is a single oversized entry: it is emitted alone in
its own block, so a block exceeds the target only when one entry
(plus the 4-byte header) already does.
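The difference between a strict (`>`) and non-strict (`>=`) comparison is easy to underestimate, so here is a small simulation of the flush rule. It is a sketch for experimentation: `count_blocks` is a hypothetical helper that works on encoded entry sizes only and never builds real blocks.

```rust
/// Count the data blocks produced for entries with the given encoded sizes.
/// `strict = true` models the pinned rule (flush when size + sz > target);
/// `strict = false` models the off-by-one variant (>=).
fn count_blocks(entry_sizes: &[usize], target: usize, strict: bool) -> usize {
    const HEADER: usize = 4; // per-block count header
    let mut blocks = 0;
    let mut size = HEADER;
    let mut entries_in_block = 0;
    for &sz in entry_sizes {
        let overflow = if strict { size + sz > target } else { size + sz >= target };
        // Never flush an empty block: an oversized entry goes in alone.
        if entries_in_block > 0 && overflow {
            blocks += 1;
            size = HEADER;
            entries_in_block = 0;
        }
        size += sz;
        entries_in_block += 1;
    }
    if entries_in_block > 0 {
        blocks += 1; // final partial block
    }
    blocks
}
```

For three 30-byte entries and a 64-byte target, the strict rule yields two blocks while the `>=` variant yields three, so the two files differ in both size and index contents.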
Side-by-side: Rust / Go / C++
Rust
const HEADER: usize = 4;

fn entry_size(k: usize, v: usize) -> usize { 9 + k + v }

struct BlockBuilder {
    buf: Vec<u8>,
    count: u32,
    first_key: Option<Vec<u8>>,
}

impl BlockBuilder {
    fn new() -> Self {
        let mut buf = Vec::with_capacity(BLOCK_TARGET);
        buf.extend_from_slice(&0u32.to_le_bytes()); // placeholder for count
        Self { buf, count: 0, first_key: None }
    }

    fn current_size(&self) -> usize { self.buf.len() }

    fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) {
        if self.count == 0 { self.first_key = Some(key.to_vec()); }
        self.buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        self.buf.push(ty);
        self.buf.extend_from_slice(key);
        self.buf.extend_from_slice(value);
        self.count += 1;
    }

    fn finish(mut self) -> (Vec<u8>, Vec<u8>) {
        self.buf[0..4].copy_from_slice(&self.count.to_le_bytes());
        (self.buf, self.first_key.unwrap_or_default())
    }
}
Go
type blockBuilder struct {
buf []byte
count uint32
firstKey []byte
}
func newBlock() *blockBuilder {
b := &blockBuilder{buf: make([]byte, 0, blockTarget)}
b.buf = binary.LittleEndian.AppendUint32(b.buf, 0) // placeholder
return b
}
func (b *blockBuilder) currentSize() int { return len(b.buf) }
func (b *blockBuilder) add(key []byte, ty byte, value []byte) {
if b.count == 0 { b.firstKey = append([]byte(nil), key...) }
b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(key)))
b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(value)))
b.buf = append(b.buf, ty)
b.buf = append(b.buf, key...)
b.buf = append(b.buf, value...)
b.count++
}
func (b *blockBuilder) finish() (block, firstKey []byte) {
binary.LittleEndian.PutUint32(b.buf[0:4], b.count)
return b.buf, b.firstKey
}
C++
struct BlockBuilder {
std::vector<uint8_t> buf;
uint32_t count = 0;
std::vector<uint8_t> first_key;
BlockBuilder() {
buf.reserve(kBlockTarget);
put_u32_le(buf, 0); // placeholder
}
size_t current_size() const { return buf.size(); }
void add(const uint8_t* k, size_t klen,
uint8_t ty,
const uint8_t* v, size_t vlen) {
if (count == 0) first_key.assign(k, k + klen);
put_u32_le(buf, uint32_t(klen));
put_u32_le(buf, uint32_t(vlen));
buf.push_back(ty);
buf.insert(buf.end(), k, k + klen);
buf.insert(buf.end(), v, v + vlen);
++count;
}
std::pair<std::vector<uint8_t>, std::vector<uint8_t>> finish() {
std::memcpy(buf.data(), &count, 4); // LE on the platforms we target
return {std::move(buf), std::move(first_key)};
}
};
(The memcpy shortcut is only correct on little-endian hosts. For
portability, the lab's real C++ implementation patches the count
header through its little-endian helper, put_u32_le, instead.)
Self-check
- Empty finish returns (b"\x00\x00\x00\x00", b"").
- After three adds the buffer length equals 4 + Σ entry_size.
- first_key is captured on the first add and never overwritten.
Step 02 — Writer, Index, Footer
Goal
Wire the data-block builder into a full SstWriter that emits
[blocks...][index][footer].
Writer state
SstWriter {
out: Vec<u8>, // file bytes accumulated so far
block: BlockBuilder, // current data block
index: Vec<IndexEntry>, // one entry per flushed block
target: usize, // BLOCK_TARGET (default 4096)
last_key: Option<Vec<u8>>,
}
add
fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) -> Result<(), Error> {
if let Some(prev) = &self.last_key {
if key <= prev.as_slice() { return Err(Error::Unsorted); }
}
if ty == 1 && !value.is_empty() { return Err(Error::BadTombstone); }
let sz = entry_size(key.len(), value.len());
if self.block.count > 0 && self.block.current_size() + sz > self.target {
self.flush_block();
}
self.block.add(key, ty, value);
self.last_key = Some(key.to_vec());
Ok(())
}
flush_block
fn flush_block(&mut self) {
let blk = std::mem::replace(&mut self.block, BlockBuilder::new()); // finish() consumes blk, so no `mut` needed
let (bytes, first_key) = blk.finish();
let offset = self.out.len() as u64;
let size = bytes.len() as u64;
self.out.extend_from_slice(&bytes);
self.index.push(IndexEntry { key: first_key, offset, size });
}
finish
fn finish(mut self) -> Vec<u8> {
if self.block.count > 0 { self.flush_block(); }
let index_offset = self.out.len() as u64;
self.out.extend_from_slice(&(self.index.len() as u32).to_le_bytes());
for e in &self.index {
self.out.extend_from_slice(&(e.key.len() as u32).to_le_bytes());
self.out.extend_from_slice(&e.offset.to_le_bytes());
self.out.extend_from_slice(&e.size.to_le_bytes());
self.out.extend_from_slice(&e.key);
}
let index_size = self.out.len() as u64 - index_offset;
let num_blocks = self.index.len() as u64;
self.out.extend_from_slice(&index_offset.to_le_bytes());
self.out.extend_from_slice(&index_size.to_le_bytes());
self.out.extend_from_slice(&num_blocks.to_le_bytes());
self.out.extend_from_slice(b"SST1\0\0\0\0");
debug_assert_eq!(self.out.len() as u64,
index_offset + index_size + FOOTER_LEN as u64);
self.out
}
Footer parse
fn parse_footer(file: &[u8]) -> Result<Footer, Error> {
if file.len() < FOOTER_LEN { return Err(Error::Short); }
let tail = &file[file.len() - FOOTER_LEN..];
if &tail[24..32] != b"SST1\0\0\0\0" { return Err(Error::BadMagic); }
Ok(Footer {
index_offset: u64::from_le_bytes(tail[ 0.. 8].try_into().unwrap()),
index_size: u64::from_le_bytes(tail[ 8..16].try_into().unwrap()),
num_blocks: u64::from_le_bytes(tail[16..24].try_into().unwrap()),
})
}
The reader then verifies
footer.index_offset + footer.index_size + 32 == file.len() (returns
IndexOutOfRange otherwise) and parses the index block.
Index block parse
Identical structure to write:
let mut p = footer.index_offset as usize;
let count = read_u32_le(&file[p..]); p += 4;
let mut idx = Vec::with_capacity(count as usize);
for _ in 0..count {
let klen = read_u32_le(&file[p..]) as usize; p += 4;
let off = read_u64_le(&file[p..]); p += 8;
let sz = read_u64_le(&file[p..]); p += 8;
let key = file[p..p+klen].to_vec(); p += klen;
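// checked_add: a corrupt offset/size pair must not wrap around u64 and slip past the check
if off.checked_add(sz).map_or(true, |end| end > footer.index_offset) { // beyond data region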
return Err(Error::IndexOutOfRange);
}
idx.push(IndexEntry { key, offset: off, size: sz });
}
get with floor binary search
fn get(&self, key: &[u8]) -> Option<Entry> {
// Floor = rightmost index entry whose first_key <= key.
let pos = match self.index.binary_search_by(|e| e.key.as_slice().cmp(key)) {
Ok(i) => i, // exact first-key match
Err(0) => return None, // key precedes the smallest block
Err(i) => i - 1,
};
let blk = &self.index[pos];
let block_bytes = &self.file[blk.offset as usize
.. (blk.offset + blk.size) as usize];
scan_block(block_bytes, key)
}
scan_block decodes entries in order and returns the matching one
when found, None otherwise (a hit on a tombstone returns
Some(Entry::Tombstone) — the engine layer decides what tombstones
mean).
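scan_block is described above but not shown. One possible shape is sketched below. It is an assumption-laden illustration: the `Entry` enum and `read_u32_le` helper are defined locally for the example, and a malformed block simply yields None, whereas the lab's reader reports BadBlock.

```rust
#[derive(Debug, PartialEq)]
enum Entry { Value(Vec<u8>), Tombstone }

/// Read a little-endian u32 at byte offset `p`, or None if out of bounds.
fn read_u32_le(b: &[u8], p: usize) -> Option<usize> {
    let s = b.get(p..p.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]) as usize)
}

/// Linear scan of one data block: [count][klen][vlen][type][key][value]...
fn scan_block(block: &[u8], target: &[u8]) -> Option<Entry> {
    let count = read_u32_le(block, 0)?;
    let mut p = 4;
    for _ in 0..count {
        let klen = read_u32_le(block, p)?;
        let vlen = read_u32_le(block, p + 4)?;
        let ty = *block.get(p + 8)?;
        let key = block.get(p + 9..p + 9 + klen)?;
        let value = block.get(p + 9 + klen..p + 9 + klen + vlen)?;
        if key == target {
            return Some(if ty == 1 { Entry::Tombstone } else { Entry::Value(value.to_vec()) });
        }
        if key > target {
            return None; // keys ascend, so the target cannot appear later
        }
        p += 9 + klen + vlen;
    }
    None
}
```

The early exit on `key > target` is an optimization that relies on the ascending-key invariant; dropping it changes only the cost of a miss, not the result.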
Self-check
- An empty writer: finish() length is exactly 36 (4-byte empty index + 32-byte footer).
- After one add, file length = 4 + 9 + |k| + |v| (block) + 4 + 4 + 8 + 8 + |k| (index) + 32 (footer).
- For a target of 64 and entries crafted with sizes 30, 30, 30: the first add fits (4+30=34 ≤ 64); the second also fits, because 34+30=64 equals the target and the rule flushes only on strictly greater; the third would overflow (64+30=94 > 64), so the block is flushed first. Result: two data blocks.
Step 03 — CLI and Cross-Language Test
CLI surface
sstable build IN.mt OUT.sst # MemTable dump in → SSTable out
sstable footer FILE.sst # prints footer values + magic_ok
sstable get FILE.sst KEY # value: <hex> | tombstone | absent
sstable iter FILE.sst # V <hex-key> <hex-value> | T <hex-key>
sstable size FILE.sst # file_bytes=B entries=N num_blocks=K
The hex-encoding and value: / tombstone / absent strings match
db-05 so the cross-test reuses the same comparison logic.
Cross-test scenario
Identical input across all three languages:
memtable new M.mt
memtable bulk M.mt 100
memtable put M.mt key50 REPLACED
memtable del M.mt key10
memtable put M.mt "" empty-key-value
memtable del M.mt key99
sstable build M.mt OUT.sst
Cross-test checks:
- Byte identity. sha256 of OUT.sst matches across rust / go / c++. (Same input MemTable dump + same writer rules ⇒ same bytes.)
- 3×3 iter matrix. Every reader can iterate every writer's output, producing identical line-by-line dumps.
- 3×3 footer parse. sstable footer OUT.sst from every reader on every writer's output reports the same index_offset / index_size / num_blocks and magic_ok=true.
- Spot-check get. For each language: get key50 → value: 5245504c41434544, get key10 → tombstone, get "" → value: 656d7074792d6b65792d76616c7565, get nope → absent.
- Iter equivalence vs MemTable. sstable iter OUT.sst matches memtable iter M.mt byte-for-byte (the SSTable preserves the sorted entry stream, including tombstones).
Block-boundary check
With 100 small entries (key0..key99 → val0..val99), each encoded entry
is the 9-byte header plus roughly 8 to 10 bytes of key and value, so
the whole input comes to about 2 KB and should fit in a single data
block under the 4096-byte target (2 only if the bulk payloads are
larger than assumed here). The cross-test therefore asserts only that
num_blocks ≥ 1 and that every reader agrees on the count.
A separate sub-test forces a small block target (64 bytes) on
identical input across the three languages and asserts the resulting
num_blocks value matches; this is the precise boundary-rule check.
Output formats (exact strings)
| Command | Format |
|---|---|
| footer | index_offset=<N> index_size=<N> num_blocks=<N> magic_ok=<true\|false> |
| get | value: <hex> \| tombstone \| absent |
| iter value | V <hex-key> <hex-value> |
| iter tombstone | T <hex-key> |
| size | file_bytes=<N> entries=<N> num_blocks=<N> |
The cross-test scripts diff these as plain text.
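Because the scripts compare plain text, the formatting helpers need to agree character for character. The sketch below shows one way to produce the get and iter lines. The names `Lookup`, `hex`, `format_get`, and `format_iter_line` are hypothetical, and lowercase hex is inferred from the sample values in this lab.

```rust
/// Result of a point lookup, as seen by the CLI layer (hypothetical type).
enum Lookup<'a> { Value(&'a [u8]), Tombstone, Absent }

/// Lowercase hex with no separators, e.g. b"key" -> "6b6579".
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// One line of `sstable get` output.
fn format_get(result: &Lookup<'_>) -> String {
    match result {
        Lookup::Value(v) => format!("value: {}", hex(v)),
        Lookup::Tombstone => "tombstone".to_string(),
        Lookup::Absent => "absent".to_string(),
    }
}

/// One line of `sstable iter` output; `None` marks a tombstone.
fn format_iter_line(key: &[u8], value: Option<&[u8]>) -> String {
    match value {
        Some(v) => format!("V {} {}", hex(key), hex(v)),
        None => format!("T {}", hex(key)),
    }
}
```

Note that an empty key hex-encodes to an empty string here; how db-05 prints that case is not specified in this section, so check it against the real helper before relying on the sketch.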
db-07: LSM Compaction
0. Why compaction at all?
The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.
Without compaction, three pathologies grow without bound:
| Pathology | Symptom | Bound |
|---|---|---|
| Read amplification | A single get() must check every live SSTable. | O(#SSTables) |
| Space amplification | Obsolete versions and tombstones keep occupying disk. | Total writes / live bytes |
| Index/metadata bloat | Reader has to load every SSTable's index. | O(#SSTables) |
Compaction merges N input SSTables into M output SSTables, applying newest wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.
1. The two strategies (one sentence each)
- Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
- Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.
This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).
2. The mechanism: K-way merge
Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest
(most recently written) and the rest follow in age order.
Output: a single SSTable whose entries are the sorted union of all input keys,
where for any key k the entry is taken from the first input that contains
it. Tombstones are entries — they win against older values just like a put.
The merge is a textbook K-way merge:
1. Open all inputs and produce per-input cursors that iterate in key order.
2. Push each cursor's current key into a min-heap keyed by (key, source_index). source_index is the recency tiebreaker: smaller index = newer.
3. Pop the smallest. This is the next unique key and its winning entry.
4. Emit it (subject to the tombstone-drop rule below).
5. Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates). Push each advanced cursor's new current key back into the heap.
6. Repeat until the heap is empty.
In a min-heap with K cursors and N total entries the merge is O(N log K).
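The merge loop can be sketched with std's BinaryHeap. This is an in-memory illustration under stated assumptions: each input is already a sorted Vec (the `Run` alias is hypothetical), a `None` value stands for a tombstone, and stale duplicates are skipped as they are popped instead of being advanced eagerly, which produces the same output because the newest copy of a key always pops first.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// One sorted input: (key, value) pairs, where `None` marks a tombstone.
type Run = Vec<(Vec<u8>, Option<Vec<u8>>)>;

/// K-way merge with newest-wins semantics. `inputs[0]` is the newest run.
fn merge_newest_wins(inputs: &[Run]) -> Run {
    let mut pos = vec![0usize; inputs.len()];
    // Reverse turns the max-heap into a min-heap on (key, source_index).
    let mut heap = BinaryHeap::new();
    for (src, run) in inputs.iter().enumerate() {
        if let Some((k, _)) = run.first() {
            heap.push(Reverse((k.as_slice(), src)));
        }
    }
    let mut out: Run = Vec::new();
    while let Some(Reverse((key, src))) = heap.pop() {
        // Advance the popped cursor and re-insert its next key, if any.
        pos[src] += 1;
        if let Some((k, _)) = inputs[src].get(pos[src]) {
            heap.push(Reverse((k.as_slice(), src)));
        }
        // Same key as the last emitted entry: a stale copy from an older run.
        if out.last().map_or(false, |(last, _)| last.as_slice() == key) {
            continue;
        }
        out.push(inputs[src][pos[src] - 1].clone());
    }
    out
}
```

Each entry is pushed and popped once, so the cost stays at O(N log K). A real implementation would stream from SSTable cursors and write through SstWriter instead of cloning into a Vec.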
3. Newest-wins semantics
The contract:
- Inputs are ordered by recency. Index 0 is newest.
- For each distinct key, the first input that contains it wins.
- The winning entry's type (Value or Tombstone) is preserved.
- All other versions of that key are discarded.
This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.
4. Tombstone purging
A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.
Two cases:
- Compacting the bottom level. There is nothing older. Every tombstone whose only remaining copy is in the output is safe to drop. Callers signal this with drop_tombstones=true.
- Compacting a non-bottom level. Even if no input has an older version of the key, a deeper level still might. Tombstones must be kept. Callers leave drop_tombstones=false.
This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.
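Applied to an already merged, newest-wins stream, the flag reduces to a filter. A minimal sketch, assuming a `None` value represents a tombstone; the `Merged` alias and `apply_tombstone_rule` are hypothetical names for illustration.

```rust
/// Merged, deduplicated entries: (key, value), where `None` is a tombstone.
type Merged = Vec<(Vec<u8>, Option<Vec<u8>>)>;

/// Drop tombstones only when the caller asserts that nothing older exists.
fn apply_tombstone_rule(merged: Merged, drop_tombstones: bool) -> Merged {
    if !drop_tombstones {
        // Non-bottom level: a deeper level may still hold an older version.
        return merged;
    }
    merged.into_iter().filter(|(_, value)| value.is_some()).collect()
}
```

The filter has to run after deduplication. Dropping a tombstone before its older duplicates are discarded would let a shadowed value from another input reappear in the output.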
5. What this lab does NOT do (and why)
- No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
- No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
- No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
- No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.
6. Cross-language contract
The output is a db-06 SSTable. Two implementations that compact the same
inputs in the same order with the same drop_tombstones flag must produce
byte-identical outputs. We verify this with sha256 across rust/go/cpp.
7. Failure modes worth recognizing
| Bug | Symptom |
|---|---|
| Wrong recency tiebreaker (older wins on ties) | After compaction, a recently-overwritten key reverts. |
| Forgetting to advance non-winning duplicates | Same key appears multiple times in output → SstWriter errors. |
| Comparing keys as strings (UTF-8) not bytes | Non-ASCII keys order wrong; cross-lang sha256 diverges. |
| Dropping tombstones when not at bottom | Deleted keys reappear from a deeper level. |
| Emitting an empty block instead of empty SSTable | File size ≠ 36 for empty merge; reader rejects. |
8. Hand-trace template (the smallest interesting example)
Inputs (newest first):
A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]
Step-by-step heap state and emit:
| step | heap (key,src) | pop | emit | notes |
|---|---|---|---|---|
| 0 | (a,A) (a,B) | (a,A) | a → V "1" | also advance B past "a" |
| 1 | (b,A) (c,B) | (b,A) | b → T | tombstone preserved |
| 2 | (c,B) | (c,B) | c → V "9" | A is exhausted |
| 3 | (empty) | - | - | done |
Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions
of a and b win, c comes from B.
db-07 references
Foundational
- O'Neil, P. et al. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 1996. The original. Read sections 3–4 for the merge/rolling-merge mechanism.
- Chang, F. et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Section 5.3 ("compactions") frames minor vs. major compactions on top of SSTables.
Engineering, read these
- LevelDB: https://github.com/google/leveldb/blob/main/db/version_set.cc See Compaction::IsBaseLevelForKey for the tombstone-purge rule and PickCompaction for the leveled-policy choice.
- LevelDB design notes: https://github.com/google/leveldb/blob/main/doc/impl.md
- RocksDB compaction overview: https://github.com/facebook/rocksdb/wiki/Compaction
- Pebble (CockroachDB's Go LSM) compaction notes: https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md Pebble's range-tombstone treatment is what you graduate to after this lab.
- Universal (tiered) vs leveled, with numbers: https://github.com/facebook/rocksdb/wiki/Universal-Compaction
Curriculum companions
- mini-LSM Chapter 2.6 "Compaction Strategies": https://skyzh.github.io/mini-lsm/week2-06-task-types.html
- Designing Data-Intensive Applications, Ch. 3 — strikes the right level of abstraction for explaining write amp / read amp trade-offs.
Algorithm
- K-way merge with a min-heap: any algorithms textbook. The pattern here is identical to "merge K sorted lists" with an extra rule for duplicate keys.
db-07 Analysis
Surface area
The lab exposes one library function and one CLI:
compact(inputs: ordered list of SSTable bytes, drop_tombstones: bool) -> SSTable bytes
inputs[0] is the newest. Empty input list returns an empty SSTable (36 bytes,
identical to SstWriter::new().finish() from db-06).
CLI:
compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...
State machine of the merge
The merger holds K cursors, one per input. Each cursor is a sequence of
(key, entry) pairs in sorted key order, produced by iterating the input
SSTable's blocks in order.
A min-heap holds at most K entries, each (current_key, source_index).
source_index is the position in inputs (smaller = newer).
State transitions:
init:  push each non-empty cursor's first (key, src) into heap
step:  pop top (key=k, src=i)
       take entry from cursor i, advance cursor i
       for every other cursor j whose current key == k: advance cursor j
       re-push any cursor that still has items (only those that advanced past k)
       emit (k, entry) unless (entry is Tombstone AND drop_tombstones)
done:  when heap is empty
The "advance every cursor whose current key == k" rule is what makes the merge
deduplicating. It is the only subtle bit. Forget it and SstWriter rejects the
output with Unsorted because the same key reappears.
Containers per language
- Rust: BinaryHeap<Reverse<(Vec<u8>, usize)>>, which pops the smallest key first, ties broken by source index (smaller = newer = wins). Cursors are IntoIter over pre-materialized Vec<(Vec<u8>, Entry)> from SstReader::entries().
- Go: container/heap with a struct slice. Same ordering. Cursors are index counters into []Entry.
- C++: std::priority_queue with custom comparator that flips to min-heap. Cursors are std::vector<...>::const_iterator pairs.
Materializing all entries up front is wasteful for huge SSTables but is fine for this lab and keeps the three implementations symmetric. A streaming reader is the next step (db-08 block-cache and iterators).
What's intentionally not optimized
- We materialize entries instead of streaming blocks. This avoids needing a block-by-block iterator API on db-06's SstReader, which would couple the two labs more tightly than the curriculum wants at this stage.
- We use a single output SSTable. Output splitting is one if-statement in the emit step (flush + start new SstWriter when size exceeds N). Doing it here would force a "list of outputs" API that doesn't matter for byte-identity.
- We do not parallelize. K-way merge is trivially serial; partitioning is a policy concern that belongs above this layer.
What could break the cross-language byte-identity
- Tiebreaker inconsistency between heap implementations. Pin it: for two equal keys, the cursor with the smaller source index wins. All three implementations must agree on this exactly.
- Comparing keys as language-native strings (UTF-8 ordering). All three must compare as byte slices (Vec<u8> / []byte / std::vector<uint8_t>).
- Forgetting to advance non-winning duplicates. Output will contain repeats; SstWriter from db-06 will reject with Unsorted. Good: fail loud.
- Different block-target sizes. We always use the db-06 default (4096) so the output is a single block for the canonical scenario.
Verification plan in one line
Build two distinct memtables (newer + older), promote each to an SSTable using
db-06, run compact [newer, older] in all three languages, then assert
sha256 equality and spot-check that newest-wins applied correctly.
db-07 Execution
Library API (per language, same shape)
fn compact(inputs: &[SstReader], drop_tombstones: bool) -> Vec<u8>
- inputs[0] is newest.
- Returns the bytes of a db-06 SSTable.
- Empty inputs → 36-byte empty SSTable.
CLI
compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...
- IN1 is newest.
- Output OUT.sst is byte-identical across rust/go/cpp for the same arguments.
Algorithm (pseudocode)
function compact(inputs, drop_tombstones):
    cursors = [iter(input) for input in inputs]   # each iter yields (key, entry) in key order
    heap = empty min-heap                         # entries: (key, src)
    for i, c in enumerate(cursors):
        k = c.peek()
        if k is not None: heap.push((k, i))
    out = SstWriter()
    while heap not empty:
        (k, i_win) = heap.pop()
        entry = cursors[i_win].next()             # consume winner
        if cursors[i_win].peek() is not None:
            heap.push((cursors[i_win].peek(), i_win))
        # Drain all older duplicates of the same key
        while heap not empty and heap.peek().key == k:
            (_, j) = heap.pop()
            cursors[j].next()
            if cursors[j].peek() is not None:
                heap.push((cursors[j].peek(), j))
        if entry.is_tombstone and drop_tombstones:
            continue
        out.add(k, entry)
    return out.finish()
Heap ordering: lexicographic on key; tiebreak by source index ascending (smaller index = newer = wins on equal keys).
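The pseudocode above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the lab's reference implementation: inputs are already materialized as vectors, `Entry` and `Run` are stand-in types, and the function returns the merged entries instead of feeding them to db-06's SstWriter.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for db-06's entry type (an assumption of this sketch).
#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// One materialized input: (key, entry) pairs, strictly ascending by key.
pub type Run = Vec<(Vec<u8>, Entry)>;

/// Min-heap of (key, source_index); `Reverse` flips Rust's max-heap.
type Heap = BinaryHeap<Reverse<(Vec<u8>, usize)>>;

/// Step cursor `i` forward and, if it still has items, register its new key.
fn advance(inputs: &[Run], pos: &mut [usize], heap: &mut Heap, i: usize) {
    pos[i] += 1;
    if let Some((k, _)) = inputs[i].get(pos[i]) {
        heap.push(Reverse((k.clone(), i)));
    }
}

/// K-way merge with newest-wins dedupe. `inputs[0]` is the newest run.
pub fn compact(inputs: &[Run], drop_tombstones: bool) -> Run {
    let mut pos = vec![0usize; inputs.len()]; // one cursor position per input
    let mut heap = Heap::new();
    for (i, run) in inputs.iter().enumerate() {
        if let Some((k, _)) = run.first() {
            heap.push(Reverse((k.clone(), i)));
        }
    }
    let mut out = Run::new();
    while let Some(Reverse((key, win))) = heap.pop() {
        // The popped source holds the newest version of `key`.
        let entry = inputs[win][pos[win]].1.clone();
        advance(inputs, &mut pos, &mut heap, win);
        // Drain every older duplicate of the same key without emitting it.
        while heap.peek().map_or(false, |Reverse((k, _))| *k == key) {
            let Reverse((_, j)) = heap.pop().unwrap();
            advance(inputs, &mut pos, &mut heap, j);
        }
        if drop_tombstones && entry == Entry::Tombstone {
            continue;
        }
        out.push((key, entry));
    }
    out
}
```

Because the tuple `(key, source_index)` is compared lexicographically, the smaller source index pops first on equal keys, so no custom comparator is needed for the tiebreak.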
How to wire it (per language)
| Lang | Cursor | Heap |
|---|---|---|
| Rust | std::vec::IntoIter<(Vec<u8>, Entry)> | BinaryHeap<Reverse<(Vec<u8>, usize)>> |
| Go | index into []struct{Key,Entry} | container/heap with Less honoring (key,src) |
| C++ | pair of vector<...>::const_iterator | priority_queue with greater-than comparator |
In all three languages, "peek a cursor's current key" is a plain in-memory index such as cursors[i].keys[idx_i] (or equivalent). There is no I/O during peek because entries are materialized once.
db-07 Observation
The canonical scenario
We build two SSTables (call them newer.sst and older.sst) and compact them
in the order [newer, older].
newer.sst — produced from this MemTable scenario
memtable new
memtable bulk 50 # key0..key49 -> val0..val49
memtable put "key10" "NEW-10"
memtable del "key5"
So newer.sst contains 50 distinct keys, of which key10 has value "NEW-10",
key5 is a tombstone, and the other 48 are val<i>.
older.sst — produced from this MemTable scenario
memtable new
memtable bulk 100 # key0..key99 -> val0..val99
memtable put "key50" "OLD-50"
So older.sst contains 100 distinct keys, of which key50 is "OLD-50" and
the others are val<i>.
Expected merged output
For every key the merge picks the first input that contains it:
| Key range / specific key | Winner | Value |
|---|---|---|
| key0..key4 | newer | val0..val4 |
| key5 | newer | Tombstone |
| key6..key9 | newer | val6..val9 |
| key10 | newer | "NEW-10" |
| key11..key49 | newer | val11..val49 |
| key50 | older | "OLD-50" |
| key51..key99 | older | val51..val99 |
Total distinct keys: 100. Tombstones: 1 (key5). Values: 99.
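The totals above can be cross-checked with a small oracle that is independent of the heap-based merge. The sketch below is an assumption-laden model, not lab code: it rebuilds the two memtable scenarios as BTreeMaps and overlays them oldest-first so that newer entries overwrite older ones. `Entry`, `Map`, `bulk`, `newer`, `older`, and `expected` are illustrative names.

```rust
use std::collections::BTreeMap;

/// Stand-in for db-06's entry type (an assumption of this sketch).
#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

pub type Map = BTreeMap<Vec<u8>, Entry>;

/// key0..key{n-1} -> val0..val{n-1}, mirroring "memtable bulk n".
pub fn bulk(n: usize) -> Map {
    (0..n)
        .map(|i| {
            let key = format!("key{}", i).into_bytes();
            let val = format!("val{}", i).into_bytes();
            (key, Entry::Value(val))
        })
        .collect()
}

/// The newer input: 50 keys, key10 overwritten, key5 deleted.
pub fn newer() -> Map {
    let mut m = bulk(50);
    m.insert(b"key10".to_vec(), Entry::Value(b"NEW-10".to_vec()));
    m.insert(b"key5".to_vec(), Entry::Tombstone);
    m
}

/// The older input: 100 keys, key50 overwritten.
pub fn older() -> Map {
    let mut m = bulk(100);
    m.insert(b"key50".to_vec(), Entry::Value(b"OLD-50".to_vec()));
    m
}

/// Overlay oracle: apply the oldest input first so newer entries overwrite.
pub fn expected(inputs_newest_first: &[Map]) -> Map {
    let mut out = Map::new();
    for layer in inputs_newest_first.iter().rev() {
        for (k, e) in layer {
            out.insert(k.clone(), e.clone());
        }
    }
    out
}
```

Comparing a compaction's decoded output against `expected(&[newer(), older()])` catches a swapped tiebreaker or a lost tombstone without depending on the byte layout.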
"What broken looks like"
| Bug | Symptom |
|---|---|
| Tiebreaker swapped (older wins) | key10 → "val10" instead of "NEW-10"; key5 → "val5" instead of tombstone. |
| Forget to drain duplicates | SstWriter::add returns Unsorted error (or "keys not strictly ascending"). |
| Byte-vs-string comparison | Invisible on this ASCII-only input; surfaces as a cross-language sha256 mismatch only when keys contain bytes that the two comparisons order differently. |
| Tombstone dropped when drop_tombstones=false | Output has 99 keys instead of 100; key5 missing. |
| Tombstone kept when drop_tombstones=true at bottom | Output has 100 keys instead of 99; key5 still present as tombstone. |
With drop_tombstones=true
Same inputs, run as bottom-level compaction:
- key5 disappears entirely (newer's only entry for key5 was a tombstone).
- 99 keys total, all values.
Hex of the absolute simplest compaction
Compacting [A, B] where A = [("k", T)] and B = [("k", V, "v")]:
- drop_tombstones=false: output is an SSTable with one entry ("k", T). File size = 4 (block hdr) + 4+4+1 (entry hdr) + 1 (key) + 0 (value) + 4 (index count) + 4+8+8+1 (one index entry) + 32 (footer) = 71 bytes. This is the same as sstable build of a MemTable containing only ("k", T).
- drop_tombstones=true: output is the empty SSTable, exactly 36 bytes.
Cross-language sha256 must match for both cases.
db-07 Verification
Ten properties, three implementations each.
V1 — Empty inputs
compact([], drop=false) → exactly 36 bytes, identical to
SstWriter::new().finish() from db-06. Same for drop=true.
V2 — Single input passthrough
compact([A], drop=false) reproduces A's logical contents (same entries in
same order). The bytes are not necessarily identical to A (block boundaries
may differ if A had unusual block-target settings), but the output's
entries() matches A's entries() exactly.
V3 — Newest wins on overlap
Inputs A = [("k", V, "new")], B = [("k", V, "old")]. Output contains
("k", V, "new") only. Output entry count = 1.
V4 — Tombstones win over older values
Inputs A = [("k", T)], B = [("k", V, "v")]. With drop=false, output
contains ("k", T). With drop=true, output is empty.
V5 — Disjoint keys interleave correctly
Inputs A = [("b", V, "x"), ("d", V, "x")], B = [("a", V, "y"), ("c", V, "y")].
Output: ("a", V, "y"), ("b", V, "x"), ("c", V, "y"), ("d", V, "x") — sorted,
no duplicates, every entry from its sole source.
V6 — Three-way merge handles transitive dedupe
Inputs (newest → oldest):
A: [("k", V, "v1")]
B: [("k", V, "v2"), ("z", V, "Z")]
C: [("k", V, "v3"), ("a", V, "A")]
Output: [("a", V, "A"), ("k", V, "v1"), ("z", V, "Z")]. K resolves to A's.
Both B and C must advance past their "k" entries even though neither wins.
V7 — Canonical scenario byte-identity
Build newer.sst and older.sst as described in observation.md. Compact in
each language with drop=false. Assert sha256 equality across all three
languages.
V8 — SstWriter rejects an internally broken merge
If the merger forgets to drain duplicate cursors and tries to call
SstWriter::add with the same key twice, the writer returns
Error::Unsorted. The test for this constructs two inputs with overlapping
keys and verifies that a correct compaction succeeds (i.e., we never see
that error during a valid compaction).
V9 — Output is a valid db-06 SSTable
The bytes returned by compact open cleanly via SstReader::open and
get(key) returns the merged version. This is the round-trip test.
V10 — Idempotent re-compaction
compact([compact([A, B])]) is byte-identical to compact([A, B]).
Compaction of an already-compacted file is a no-op modulo metadata.
Cross-test (scripts/cross_test.sh)
Goes beyond V7 to also run a 3×3 reader/writer matrix on the merged file (byte-identity already implies this, but it confirms the output is portable):
- Build newer.sst and older.sst via db-05 → db-06 (Rust binaries; db-06 already proved byte-identity).
- Each language runs compact OUT.sst newer.sst older.sst.
- Assert sha256 match across the three OUT files.
- Each language reads each OUT file with sstable iter (from db-06) and asserts the iter output equals a reference (the Rust read of its own OUT).
- Spot-check sstable get OUT.sst key10 → value: 4e45572d3130 ("NEW-10") in all three.
- Spot-check sstable get OUT.sst key5 → tombstone.
- Spot-check sstable get OUT.sst key50 → value: 4f4c442d3530 ("OLD-50").
db-07 Broader Ideas
What you'd add next, in order of payoff
- Output splitting. Add compact_to_files(inputs, drop, target_bytes) -> Vec<Vec<u8>>. Implementation: switch SstWriter when the in-flight writer exceeds target_bytes. You must finalize at a key boundary (between two emitted entries), never inside a logical key, otherwise readers that depend on per-file key ranges will see overlaps.
- Streaming block iterator on SstReader. db-06's entries() materializes everything; the compaction loop should pull one entry at a time per cursor. This is db-08 territory (block cache + iterators).
- Range tombstones. A "delete all keys in [lo, hi)" record. Compaction has to track a set of active range tombstones during the merge and apply them to subsequent entries. Pebble's range-deletions doc is the reference.
- Snapshot-aware tombstone purging. "Drop tombstones at bottom" becomes "drop tombstones older than the oldest live snapshot". Compaction takes a sequence-number floor and drops anything below it that has been superseded.
- Leveled policy. A scheduler that picks N input files to compact based on per-level byte budgets and overlap. This is where VersionSet::PickCompaction and Compaction::IsBaseLevelForKey live in LevelDB.
- Subcompactions. Splitting one logical compaction into K parallel ones by key range. Requires that the index of each input lets you cheaply find the byte range covering a key span; a partitioned index helps.
- Compaction throttling. When compaction can't keep up, foreground writes must stall. RocksDB exposes level0_slowdown_writes_trigger and level0_stop_writes_trigger. Without this, write bursts cause unbounded read amplification.
- Universal/tiered compaction. A different scheduler; same merge mechanism. Worth implementing once leveled is in to feel the trade-off.
- Per-key sequence numbers. Every key gets a monotonically-increasing seqnum; compaction picks the highest-seqnum entry for each key. This makes the merge correct under concurrent writes and snapshots. Required for MVCC (db-13).
- Compaction filter callbacks. RocksDB lets the user inspect/transform every key during compaction (garbage collection of TTL'd values, schema migration). It's just a hook in the emit step.
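The first item, output splitting, is small enough to sketch. The version below assumes the merge has already produced a key-ordered stream of raw (key, value) byte pairs and approximates file size as the sum of key and value lengths; `split_outputs` and `Pair` are hypothetical names, and a real implementation would switch SstWriter instances instead of collecting vectors.

```rust
/// A merged (key, value) pair; tombstone handling is omitted for brevity.
pub type Pair = (Vec<u8>, Vec<u8>);

/// Split a key-ordered entry stream into chunks of roughly `target_bytes`.
/// Cuts happen only between entries, so the key ranges of the resulting
/// files never overlap.
pub fn split_outputs(entries: Vec<Pair>, target_bytes: usize) -> Vec<Vec<Pair>> {
    let mut files = Vec::new();
    let mut current = Vec::new();
    let mut size = 0usize;
    for (k, v) in entries {
        // Size accounting here is key + value bytes only (an assumption);
        // a real writer would also count headers and index entries.
        size += k.len() + v.len();
        current.push((k, v));
        if size >= target_bytes {
            // Finalize at an entry boundary and start a fresh output.
            files.push(std::mem::take(&mut current));
            size = 0;
        }
    }
    if !current.is_empty() {
        files.push(current);
    }
    files
}
```

Since every key appears at most once after deduplication, cutting between entries is enough to satisfy the "never inside a logical key" rule in this lab's model.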
What this lab deliberately leaves un-clean for later
- No async I/O. The merge is CPU-bound on materialized vectors.
- No CRCs on blocks. Bad bytes in an input produce corrupt output silently.
- No fsync / atomic rename. The CLI writes the output and the script renames.
- No metrics. Production engines export bytes-read, bytes-written, files-in, files-out, duration per compaction.
These are intentional. The point of this lab is the merge, not the operational surface.
db-07 Step 1 — K-way merge core
The whole lab is one algorithm. We build it in three languages, then expose it through a tiny CLI.
Cursor
A cursor is an iterator over (key, entry) pairs from one input SSTable, in
key order. The simplest representation: materialize via SstReader::entries()
and index into the resulting vector.
struct Cursor {
    items: Vec<(Vec<u8>, Entry)>,
    pos: usize,
}

impl Cursor {
    // Current key, or None when the cursor is exhausted.
    fn peek(&self) -> Option<&[u8]> {
        self.items.get(self.pos).map(|(k, _)| k.as_slice())
    }

    // Return the current pair and advance. Cloning assumes Entry: Clone.
    fn take(&mut self) -> (Vec<u8>, Entry) {
        let i = self.pos;
        self.pos += 1;
        self.items[i].clone()
    }
}

type cursor struct {
    items []entry // entry = {Key []byte; E sstable.Entry}
    pos   int
}

// peek returns the current key, or nil when the cursor is exhausted.
func (c *cursor) peek() []byte {
    if c.pos >= len(c.items) {
        return nil
    }
    return c.items[c.pos].Key
}

// take returns the current item and advances the cursor.
func (c *cursor) take() entry {
    x := c.items[c.pos]
    c.pos++
    return x
}

struct Cursor {
    std::vector<std::pair<std::vector<uint8_t>, sstable::Entry>> items;
    std::size_t pos = 0;

    // Current key, or nullptr when the cursor is exhausted.
    const std::vector<uint8_t>* peek() const {
        return pos < items.size() ? &items[pos].first : nullptr;
    }
};
Heap entry
(key bytes, source_index)
Min-heap ordered by key ascending, ties broken by source_index ascending
(smaller index = newer = wins). All three implementations must use this exact
ordering for byte-identity.
Emit loop
Pseudocode is in docs/execution.md. The crucial bit is the inner drain loop:
# After emitting (k, entry):
while heap not empty and heap.peek().key == k:
    (_, j) = heap.pop()
    cursors[j].take()                    # discard the older duplicate
    if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))
That loop is the only difference between "K-way merge of disjoint inputs" and "K-way merge with newest-wins dedupe".
Why the tiebreaker direction matters
Inputs are passed newest first (index 0 newest). On a tie, the smaller index
must come out of the heap first. So when you build a (key, src) tuple, the
smaller src is the smaller tuple, and a min-heap pops it first. No need to
invert; the natural lexicographic order on the tuple does the right thing.
If you ever flip the input convention (oldest first), invert the tiebreaker. Do not do both — pick one and document it. We picked: index 0 = newest.
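A quick way to convince yourself of the tiebreak direction is to push a few (key, src) pairs and look at the pop order. The helper below is an illustrative sketch; `pop_order` is a hypothetical name, and keys are taken as string literals only for convenience and are compared as bytes.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Push (key, source_index) pairs into a min-heap and return the pop order.
/// `Reverse` turns Rust's max-heap into a min-heap over the whole tuple, so
/// equal keys come out in ascending source index (newest first).
pub fn pop_order(items: &[(&str, usize)]) -> Vec<(Vec<u8>, usize)> {
    let mut heap: BinaryHeap<Reverse<(Vec<u8>, usize)>> = items
        .iter()
        .map(|(k, src)| Reverse((k.as_bytes().to_vec(), *src)))
        .collect();
    let mut out = Vec::new();
    while let Some(Reverse(item)) = heap.pop() {
        out.push(item);
    }
    out
}
```

For the input `[("b", 0), ("a", 1), ("a", 0)]` the pop order is `("a", 0)`, `("a", 1)`, `("b", 0)`: key first, then source index.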
Try this before reading step 2
Without looking at the implementation, on paper, trace this:
A: [("a",V,"1"), ("c",V,"3")]
B: [("a",V,"old"), ("b",V,"2")]
Write out the heap after each pop. You should get four pops and three emits.
The expected output: [("a",V,"1"), ("b",V,"2"), ("c",V,"3")].
db-07 Step 2 — Tombstone purging and the bottom-level rule
A tombstone in an SSTable says: "this key was deleted; do not return any older version of it". Tombstones cost space and slow down reads (you still have to walk past them). Eventually you want to drop them.
The rule
A tombstone for key k is safe to drop during a compaction if and only if
there is no older version of k anywhere in the database that the tombstone
could be hiding.
Equivalently: if this compaction is over the bottom-most level and the tombstone's input is part of it, you can drop the tombstone.
For non-bottom compactions, keep all tombstones. A deeper level still has data the tombstone is suppressing.
API
compact(inputs, drop_tombstones: bool) -> bytes
The flag is the caller's promise. We do not inspect it; we trust it. In a
real engine, the scheduler sets drop_tombstones = (target_level == bottom).
Implementation: one if-statement
In the emit loop, after picking the winner (k, entry):
if entry.type == Tombstone and drop_tombstones:
    continue   # skip; do not write to output
out.add(k, entry)
That's the entire change versus the basic merge.
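As a minimal sketch, the same filter can be isolated from the merge and applied to a list of already-deduplicated winners. `Entry`, `emit_filter`, and `is_bottom_compaction` are illustrative names; the last one models the scheduler-side promise, which this lab leaves to the caller.

```rust
/// Stand-in for db-06's entry type (an assumption of this sketch).
#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Hypothetical scheduler-side rule: tombstones may be dropped only when the
/// compaction writes into the bottom-most level.
pub fn is_bottom_compaction(target_level: usize, bottom_level: usize) -> bool {
    target_level == bottom_level
}

/// Emit-step filter over winners that the merge has already deduplicated.
pub fn emit_filter(
    winners: Vec<(Vec<u8>, Entry)>,
    drop_tombstones: bool,
) -> Vec<(Vec<u8>, Entry)> {
    winners
        .into_iter()
        .filter(|(_, e)| !(drop_tombstones && *e == Entry::Tombstone))
        .collect()
}
```

Keeping the decision (`is_bottom_compaction`) separate from the mechanism (`emit_filter`) mirrors the split between the scheduler and the merge.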
What's still wrong (and why it's OK for this lab)
The "drop tombstones at bottom" rule is a snapshot-unaware simplification. A correct engine keeps a tombstone alive until every read snapshot older than the tombstone's sequence number has been released. Implementing that requires per-entry sequence numbers, which we add in db-13.
For this lab, "bottom" means "the caller swears nothing older exists". That is enough to demonstrate the mechanism and to write a meaningful cross-test.
Test scenarios this enables
| Test | Inputs (newest first) | drop | Expected output |
|---|---|---|---|
| Tombstone wins over older value | A=[(k,T)], B=[(k,V,"x")] | false | [(k,T)] |
| Tombstone dropped at bottom | same | true | [] (empty SSTable, 36 bytes) |
| Tombstone for non-existent key kept | A=[(k,T)] | false | [(k,T)] |
| Tombstone for non-existent dropped | same | true | [] |
| Mixed values + tombstones, mid-level | A=[(a,V),(b,T)], B=[(a,V_old),(c,V)] | false | [(a,A.V),(b,T),(c,V)] |
| Same inputs at bottom | same | true | [(a,A.V),(c,V)] |
These are V4 in the verification table and the drop_tombstones=true arm of
the cross-test.
db-07 Step 3 — CLI and cross-language byte-identity
CLI shape (all three languages emit and accept the same)
compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...
Arguments:
- --drop-tombstones: optional first flag. If present, tombstones are dropped (use when this is the bottom-level compaction).
- OUT.sst: output file path.
- IN1.sst ...: one or more input SSTable paths. IN1 is the newest.
Exit codes:
- 0: success.
- 1: any error (open failure, malformed SSTable, write failure).
- 2: usage error.
The CLI is intentionally minimal. There is no JSON, no stats, no progress. Stats live in db-22 (performance + benchmarking).
The cross-test scenario
The script in scripts/cross_test.sh:
1. Builds feed_newer.mt (memtable scenario from observation.md, 50 keys with key10 replaced and key5 deleted).
2. Builds feed_older.mt (100 keys with key50 = "OLD-50").
3. Promotes both to SSTables using the db-06 Rust binary (sstable build feed_newer.mt newer.sst).
4. For each language, runs compact OUT.sst newer.sst older.sst.
5. Asserts sha256(rust.OUT) == sha256(go.OUT) == sha256(cpp.OUT).
6. Runs the 3×3 read matrix using db-06's sstable iter over each OUT.
7. Spot-checks sstable get OUT.sst <key> for key5, key10, key50, key99, nope.
The spot-checks use db-06's sstable CLI (not db-07's compact), which is
why steps 5–7 don't need a separate db-07 reader: the output is a db-06
SSTable.
Why this proves the merge
The scenario compacts two SSTables with overlapping keys: key0..key49 appear in both inputs, so the newer version must win, while key50..key99 exist only in the older input. If your merge logic gets the recency tiebreaker wrong, you read val10 instead of NEW-10. If you forget to drain duplicates, you write the same key twice and SstWriter::add returns an error. If you drop tombstones by mistake, key5 disappears.
If all three languages get the same sha256, the algorithm and its translation to three runtimes are pinned down.
db-08 — Block Cache and Iterators
What is it?
Two small, foundational read-path components that every LSM (and most B-tree engines) need:
- Block cache: a bounded, in-memory map from (file_id, block_offset) to the decoded block bytes, evicting the least-recently-used entry when full. Sits between the SSTable reader and the OS page cache so that a hot index block or hot data block does not have to be decoded on every query.
- Merging iterator: a streaming K-way merge over N pre-sorted sources (memtable, level-0 SSTables, level-1 SSTables, …) that yields each key exactly once, preferring the newest source on ties, and optionally drops tombstones. This is the engine of every LSM read: point lookups, range scans, compaction, and snapshot iteration.
Why does it matter?
In an LSM, a single user get("k") may have to consult the memtable plus 1–10
SSTables. Without a cache, every miss re-reads (and re-checksums, and
re-decodes) blocks from disk; without a merging iterator, range scans cannot
present a single ordered view of the live keyspace. Together these two
components turn the LSM's "many small sorted runs" representation into the
illusion of "one big sorted map" — and they do it without unbounded memory.
These primitives also appear far outside databases:
- OS page cache is a block cache for files.
- CPU L1/L2/L3 are hardware block caches keyed on physical address.
- sort -m and most stream-join operators are merging iterators.
- Kafka log compaction, Bigtable scans, and DynamoDB streams all do tournament-style merges across sorted inputs.
How does it work?
           ┌─────────────── BlockCache (cap = N entries) ───────────────┐
get(k) ──► │ HashMap<(file_id, off), Node*> + DoublyLinkedList<Node>    │
           │   hit:  splice node to front, return value                 │
           │   miss: insert at front; if full, drop the back node       │
           └────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
           ┌─────────────── MergingIterator(sources) ───────────────────┐
           │ min-heap of (current_key, src_idx)                         │
           │ Next():                                                    │
           │   pop heap → winner                                        │
           │   advance winner src, push its next key (if any)           │
           │   while heap.top().key == winner.key:                      │
           │     pop & advance older (they are shadowed by winner)      │
           │   if drop_tombstones and winner is tombstone: continue     │
           │   yield (winner.key, winner.entry)                         │
           └────────────────────────────────────────────────────────────┘
Two invariants make this correct:
- Per-source sort. Within one source, keys are strictly ascending. The heap therefore needs only the front of each source, never the full set.
- Tie-break by source index. Source 0 is newest; on a tie, the newest entry wins and the older copies are drained without being yielded. This is how a put in the memtable shadows an old value in L1, and how a tombstone shadows a value of the same key in any older source.
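The two invariants translate directly into a streaming iterator. The sketch below is one possible shape, assuming every source is a Rust iterator of (key, entry) pairs in strictly ascending key order; `MergingIter`, `Pair`, and `Entry` are stand-in names rather than the lab's actual API. Unlike a materialized merge, it holds only the front of each source.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::iter::Peekable;

/// Stand-in for db-06's entry type (an assumption of this sketch).
#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

pub type Pair = (Vec<u8>, Entry);

/// Streaming newest-wins merge; `sources[0]` is the newest source.
pub struct MergingIter<I: Iterator<Item = Pair>> {
    sources: Vec<Peekable<I>>,
    heap: BinaryHeap<Reverse<(Vec<u8>, usize)>>,
    drop_tombstones: bool,
}

impl<I: Iterator<Item = Pair>> MergingIter<I> {
    pub fn new(sources: Vec<I>, drop_tombstones: bool) -> Self {
        let mut sources: Vec<Peekable<I>> =
            sources.into_iter().map(|s| s.peekable()).collect();
        let mut heap = BinaryHeap::new();
        for (i, s) in sources.iter_mut().enumerate() {
            if let Some((k, _)) = s.peek() {
                heap.push(Reverse((k.clone(), i)));
            }
        }
        MergingIter { sources, heap, drop_tombstones }
    }

    /// Consume the front of source `i` and register its next key, if any.
    fn advance(&mut self, i: usize) -> Option<Pair> {
        let item = self.sources[i].next();
        if let Some((k, _)) = self.sources[i].peek() {
            self.heap.push(Reverse((k.clone(), i)));
        }
        item
    }
}

impl<I: Iterator<Item = Pair>> Iterator for MergingIter<I> {
    type Item = Pair;

    fn next(&mut self) -> Option<Pair> {
        while let Some(Reverse((key, win))) = self.heap.pop() {
            let (_, entry) = self.advance(win)?;
            // Older copies of the same key are shadowed: drain, do not yield.
            while self.heap.peek().map_or(false, |Reverse((k, _))| *k == key) {
                let Reverse((_, j)) = self.heap.pop().unwrap();
                self.advance(j);
            }
            if self.drop_tombstones && entry == Entry::Tombstone {
                continue;
            }
            return Some((key, entry));
        }
        None
    }
}
```

The heap never holds more than one key per source, so the working set stays at O(K) no matter how many entries flow through.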
Terminology
- Block: a fixed-ish-size chunk of an SSTable (typically 4 KiB) that is the unit of disk I/O and the unit of block-cache eviction.
- Cache hit / miss: was the requested key present in the cache?
- Eviction: removing an entry to make room. LRU picks the entry least-recently touched (read or written).
- MRU / LRU: most/least recently used end of the list.
- K-way merge: merging K already-sorted sequences into one sorted sequence. Optimal comparison cost is O(N log K) for N total entries.
- Tournament tree / min-heap: the data structure used to pick the next source to advance in O(log K).
- Tombstone: a marker that says "this key has been deleted"; it shadows any older value for the same key until it is dropped during compaction.
- Newest-wins: the LSM tie-break rule, where source i < j means i is newer.
Mental models
- The cache is a bounded hash map with a freshness order. The hash gives you O(1) lookup, the list gives you O(1) eviction of the stalest entry.
- The merging iterator is a tournament. K runners, each in their own lane; the heap is the leaderboard; every Next() advances the current leader by one step and re-runs the comparison between the new front of that lane and the rest of the heap.
- Tombstones are entries, not absences. They occupy a slot in the stream and only disappear during a compaction that is guaranteed to have seen all older versions of the same key.
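The first mental model, a bounded hash map with a freshness order, can be sketched without unsafe code by storing list nodes in a Vec and linking them by index. This is an illustrative sketch, not the lab's implementation: `BlockCache`, `Node`, and the `(u64, u64)` key standing for (file_id, block_offset) are assumptions, and capacity is counted in entries.

```rust
use std::collections::HashMap;

const NIL: usize = usize::MAX; // "no neighbor" marker for list links

struct Node {
    key: (u64, u64), // (file_id, block_offset)
    value: Vec<u8>,  // decoded block bytes
    prev: usize,
    next: usize,
}

pub struct BlockCache {
    cap: usize,
    map: HashMap<(u64, u64), usize>, // key -> slot in `nodes`
    nodes: Vec<Node>,                // arena backing the linked list
    head: usize,                     // most recently used
    tail: usize,                     // least recently used
}

impl BlockCache {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be at least one entry");
        BlockCache { cap, map: HashMap::new(), nodes: Vec::new(), head: NIL, tail: NIL }
    }

    /// Detach slot `i` from the recency list in O(1).
    fn unlink(&mut self, i: usize) {
        let (p, n) = (self.nodes[i].prev, self.nodes[i].next);
        if p != NIL { self.nodes[p].next = n } else { self.head = n }
        if n != NIL { self.nodes[n].prev = p } else { self.tail = p }
    }

    /// Make slot `i` the most recently used entry.
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = i } else { self.tail = i }
        self.head = i;
    }

    /// Hit: promote to the front and return the block. Miss: None.
    pub fn get(&mut self, key: (u64, u64)) -> Option<&[u8]> {
        let i = *self.map.get(&key)?;
        self.unlink(i);
        self.push_front(i);
        Some(self.nodes[i].value.as_slice())
    }

    /// Insert or overwrite; when full, evict the least recently used entry.
    pub fn insert(&mut self, key: (u64, u64), value: Vec<u8>) {
        if let Some(&i) = self.map.get(&key) {
            self.nodes[i].value = value;
            self.unlink(i);
            self.push_front(i);
            return;
        }
        let i = if self.map.len() == self.cap {
            let t = self.tail; // reuse the evicted slot
            self.unlink(t);
            self.map.remove(&self.nodes[t].key);
            self.nodes[t].key = key;
            self.nodes[t].value = value;
            t
        } else {
            self.nodes.push(Node { key, value, prev: NIL, next: NIL });
            self.nodes.len() - 1
        };
        self.map.insert(key, i);
        self.push_front(i);
    }
}
```

Index links play the role of the node pointers in the diagram: the hash map finds the slot in O(1), and `unlink` plus `push_front` update recency in O(1).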
Common misconceptions
- "LRU is just a list." No: a list alone is O(N) per lookup. The hash map gives O(1) lookup and a direct handle to the node, which is what makes promotion O(1) as well; the list only encodes the order.
- "A merging iterator deduplicates by buffering everything." No. It inspects only the front of each source. Total memory is O(K), regardless of how many entries flow through.
- "Newest-wins requires timestamps." Not in an LSM: source ordering already encodes recency (memtable > L0 > L1 > …). Timestamps are a separate concern for MVCC (db-13).
- "A block cache replaces the OS page cache." It complements it. The OS caches raw file bytes; the block cache caches decoded/decompressed blocks and shortcuts the verification step (CRC checks, decompression).
- "Tombstones can be dropped any time." Only during a compaction that includes the bottom level; otherwise an older live value could re-surface. See db-07 for the rule; db-08 lets the caller decide via a flag.
Talking points (interview-grade)
- Why bound the cache by entries vs bytes? Entry-bounded is simpler and fine when blocks are roughly uniform (e.g., RocksDB's default 4 KiB blocks). Production systems bound by bytes (block_cache_size_mb) because compressed block sizes vary widely; we use entry count here to keep the data structure the focus of the lab.
- Why a doubly-linked list and not a VecDeque? O(1) removal of an arbitrary node on hit-promotion. VecDeque only gives O(1) at the ends.
- Why heap of (key, src) and not heap of full entries? Comparator cost: keys are small and comparable; entries (which may hold large values) are not. Also lets us move the entry out of the source vector with a single std::move / mem::take, avoiding copies.
- Why does newest-wins also drain all older entries with that key? Otherwise the iterator would emit duplicates downstream, breaking the "exactly one entry per live key" contract that compaction and range scans depend on.
- What about thread safety? Our block cache is single-writer-single-reader by design. Real systems use sharded caches (RocksDB: 64 shards) so each shard has its own mutex and contention is 1/Nth.
Connections to the rest of the curriculum
- db-05 (memtable) is the newest source in every read-path merge.
- db-06 (SSTable format) produces the sorted entries the merger consumes, and the blocks the cache caches.
- db-07 (compaction) is itself a merging iterator with an optional drop_tombstones flag whose output is fed to an SSTable writer. db-08 generalizes that machinery so the read path can use it for point lookups and scans as well.
- db-09 (LevelDB-complete) wires this into a full Get/Scan path.
- db-13 (MVCC) layers per-key snapshot filtering on top of a merging iterator like this one.
db-08 — References
Block cache / LRU
- O'Neil, O'Neil, Weikum: "The LRU-K Page Replacement Algorithm For Database Disk Buffering" (SIGMOD 1993). The canonical "LRU is not the whole story" paper; explains why LRU under-performs LRU-K on database workloads.
- LevelDB util/cache.cc: the reference LRU cache used by LevelDB (one LRUCache per shard behind ShardedLRUCache). Doubly-linked list + hash table; reads update recency on hit. Worth reading end-to-end; ~300 lines. https://github.com/google/leveldb/blob/main/util/cache.cc
- RocksDB cache/lru_cache.{h,cc} and cache/clock_cache.{h,cc}: production-grade sharded LRU plus a clock-based variant. Demonstrates the shard-by-key-hash technique. https://github.com/facebook/rocksdb/tree/main/cache
- CockroachDB Pebble internal/cache/: a Go implementation with a modern API; useful for comparing language ergonomics. https://github.com/cockroachdb/pebble/tree/master/internal/cache
- Postgres src/backend/storage/buffer/: the canonical relational buffer pool, using clock-sweep replacement with usage counts. Different policy, same role.
Merging iterators
- Knuth, TAOCP Vol. 3 §5.4.1: "Multiway Merging and Replacement Selection". The original analysis of K-way merge using a tournament tree and a loser tree.
- LevelDB table/merger.cc and table/iterator.h: the canonical read-path merging iterator interface, plus the merger that combines memtable + level-0 + level-N+ iterators (it finds the smallest child with a linear scan rather than a heap). https://github.com/google/leveldb/blob/main/table/merger.cc
- RocksDB table/merging_iterator.{h,cc}: extended with range tombstones and pinned iterators. Shows how the interface evolves under production pressure.
- Pebble internal/manifest/level_iter.go + merging_iter.go: a Go flavor with explicit handling of range deletes.
Background reading
- Designing Data-Intensive Applications, Ch. 3 ("Storage and Retrieval"), pp. 70-89. Kleppmann's tour of LSM read amplification, bloom filters, and the role of the block cache.
- Petrov, "Database Internals", Ch. 7 ("Log-Structured Storage"). Covers caching, iterators, and tombstone semantics at the level we implement.
Lab-specific notes
- The canonical byte layout used by merge_iter is documented in src/rust/src/lib.rs on SerializeStream. It is deliberately minimal; its only job is to give us a byte-identical cross-language fingerprint for the sha256 check.
- The cross-test reuses the same newer.mt / older.mt feedstock as db-07 so the two labs can be compared side-by-side. Their sha256s will differ because db-07 emits a full SSTable (with index, footer, padding) while db-08 emits a flat entry-stream, but the underlying ordering is identical.
db-08 — Analysis
Problem statement
We need two read-path primitives that the rest of the LSM stack assumes exist:
- A bounded in-memory cache that lets us amortize the cost of decoding SSTable blocks across many lookups, with predictable memory usage and O(1) operations.
- A streaming K-way merging iterator that exposes K pre-sorted sources as a single sorted, deduplicated stream (newest wins on ties) without buffering all entries in memory.
Both must be small, dependency-light, and byte-deterministic when serialized (so the cross-language cross-test can detect any divergence).
Constraints
- Determinism. Given identical inputs, the merge stream's serialized bytes must be identical across Rust, Go, and C++. This is the cross-test's only gate.
- Bounded memory. The cache must cap at a user-supplied entry count; the iterator must use an O(K) working set regardless of the number of entries.
- No backtracking. The iterator is streaming: it must work on inputs that arrive lazily.
- Newest-wins is strict. Source index 0 always wins. There are no timestamps, generations, or sequence numbers; that complexity is deferred to db-13 (MVCC).
Decisions
- Cache eviction policy: LRU. Simple, predictable, well-understood. Not the best policy for all workloads (LRU-K, ARC, and CLOCK-Pro all beat it on scan-heavy workloads), but the correct teaching baseline.
- Cache capacity unit: entries. Production systems bound by bytes; we use entries to keep the data structure (rather than the accounting) the focus.
- Heap element shape: (key, source_index). Small and cheap to compare. Pulling the full entry into the heap would inflate comparator cost and force copies.
- No timestamps / sequence numbers. Newest-wins is by source index alone.
- Tombstone drop is opt-in. Callers pass drop_tombstones=true only when they have proven (via compaction rules; see db-07) that no older source could resurrect the deleted key.
Trade-offs
| Choice | Pros | Cons |
|---|---|---|
| LRU (vs LRU-K, ARC, CLOCK) | O(1) ops, simple to reason | Scan-pollutes — one big scan can flush hot entries |
| Doubly-linked list (vs VecDeque) | O(1) arbitrary removal | Heavier per-node memory (two pointers) |
| Heap of (key, src) (vs entry) | Cheap compares, no copies | Indirection back to source vector on every pop |
| Entry-bounded cap (vs byte) | Simple, no per-block sizing | Memory usage depends on block-size distribution |
| Drain-on-tie eagerly | Caller never sees duplicates | Slight extra work even when caller would dedupe |
Risks
- Heap ordering bug on tie. If the (key, src) comparator forgets to break ties on src ascending, the merger silently emits the older value on key collisions. The "newest-wins" test catches this on a 2-entry input.
- Cache eviction at boundary. Inserting into a full cache and then immediately calling Get on the just-evicted key must miss, not hit.
- Iterator reentrancy. Calling Next after end-of-stream must keep returning end-of-stream, not panic.
- Cross-language drift on serialization. Endianness or length-prefix width mismatches would invalidate the sha256. We pin to "u32 LE length + bytes + u8 type [+ u32 LE val_len + val]".
Out of scope
- Compression (RocksDB caches decompressed blocks; some configs cache both).
- Pin/unpin handle protocol for zero-copy reads.
- Snapshot/sequence-number-aware iteration (deferred to db-13).
- Range deletes / range tombstones (deferred to db-21).
- Block-cache statistics beyond hit/miss/evict.
db-08 — Execution
Build order
- Rust first: drives the canonical data-shape decisions (the Entry enum from db-06, the byte format of SerializeStream).
- Go second: ports the same algorithm with native data structures (container/list, container/heap).
- C++ third: same algorithm with std::list + std::priority_queue.
After all three pass their own unit tests, we run scripts/cross_test.sh
which builds canonical input SSTables via db-05 + db-06 and asserts that all
three merge_iter binaries produce the same sha256.
Per-language layout
Rust (src/rust)
- Cargo.toml pulls in db-06's sstable crate by path = "../../../db-06-sstable-format/src/rust".
- src/lib.rs defines BlockCache, MergingIterator, SerializeStream, and re-exports sstable::Entry. The cache uses a HashMap of slot indices plus a Vec<Node> arena with embedded prev/next indices and a free-list: an arena-based intrusive list, which beats LinkedList<T> on allocator pressure.
- src/bin/merge_iter.rs is the cross-test CLI.
Go (src/go)
- go.mod is module github.com/10xdev/dse/db08 with replace directives pointing to db-05 and db-06 on disk.
- lru.go uses container/list.List and map[BlockKey]*list.Element.
- iter.go uses container/heap with a []heapItem backing slice.
- cmd/merge_iter/main.go is the CLI.
C++ (src/cpp)
- CMakeLists.txt directly compiles db-06's sstable.cc into our sstable_lib rather than add_subdirectory-ing db-06, which would leak db-06's add_test registrations into our ctest.
- lru.{h,cc} uses std::list<Node> + std::unordered_map; Get uses list_.splice(begin, list_, it->second) for O(1) MRU promotion.
- iter.{h,cc} uses std::priority_queue<HeapEntry, std::vector<HeapEntry>, Greater>.
- src/merge_iter_bin.cc is the CLI.
Verification
- scripts/verify.sh builds + tests all three languages.
- scripts/cross_test.sh builds db-05 + db-06 input pipelines, generates the same newer.sst / older.sst used by db-07, runs merge_iter in all three languages, and asserts sha256 byte-identity for both drop_tombstones=false and drop_tombstones=true. It also spot-checks that "NEW-10", "OLD-50", "val99" appear in the stream and that the key5 tombstone framing (040000006b65793501) appears exactly when expected.
Reproducible cross-test sha256 (this lab's truth)
drop=false → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
The 9-byte difference is exactly the framing of one tombstone entry:
u32_le(4) + "key5" + u8(1) = 4 + 4 + 1 = 9 bytes.
What you should be able to do after this lab
- Sketch an LRU on a whiteboard in under three minutes and explain why both the hash map and the list are necessary.
- Explain why a K-way merge uses a heap and not nested merges, and quote the O(N log K) comparison bound.
- Identify, in any storage codebase, where the "newest-wins on tie" rule is enforced and where the "drain duplicates" step happens.
- Argue when it is safe to drop a tombstone during iteration vs when it is not.
db-08 — Observation
What we measured (functional)
- 11 Rust unit tests pass (cargo test): 4 LRU + 6 merger + 1 serializer.
- 11 Go unit tests pass (go test ./...): 4 LRU + 6 merger + 1 serializer.
- 2 C++ ctest binaries pass (test_lru covers 4 cases, test_iter covers 7).
- Cross-language sha256 match in both modes (see execution.md).
Anatomy of the output stream
For the canonical input (newer = bulk 50 + put key10=NEW-10 + del key5;
older = bulk 100 + put key50=OLD-50) the entry count is 100 with
drop=false and 99 with drop=true. Total byte-count for the output
stream:
- drop=false: 1874 bytes
- drop=true: 1865 bytes (delta 9 = exactly one tombstone frame)
Each value entry costs 4 (key_len) + len(key) + 1 (type) + 4 (val_len) + len(val) bytes. For our scenario, most keys are keyN (4-5 bytes) with
values valN (4-5 bytes), making the per-entry frame ~17-19 bytes.
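The frame arithmetic above can be checked by hand with a short calculation. The sketch below is illustrative only: it assumes the bulk loader writes keyI -> valI for I in 0..N (inferred from the val99 and key5 spot-checks, not confirmed by the lab sources), and the names frame_len and canonical_stream_len are hypothetical.

```rust
/// Size in bytes of one serialized entry; `value = None` models a tombstone.
fn frame_len(key: &str, value: Option<&str>) -> usize {
    let head = 4 + key.len() + 1; // u32 key_len + key bytes + u8 type
    match value {
        Some(v) => head + 4 + v.len(), // + u32 val_len + value bytes
        None => head,
    }
}

/// Total stream size for the canonical scenario (100 distinct keys).
/// ASSUMPTION: bulk data is keyI -> valI; key5 is deleted, key10 and key50
/// carry the overwritten values named in the text.
fn canonical_stream_len(drop_tombstones: bool) -> usize {
    let mut total = 0;
    for i in 0..100 {
        let key = format!("key{i}");
        total += match i {
            5 if drop_tombstones => 0,               // tombstone suppressed
            5 => frame_len(&key, None),              // 4 + 4 + 1 = 9 bytes
            10 => frame_len(&key, Some("NEW-10")),   // newer source wins
            50 => frame_len(&key, Some("OLD-50")),   // only in older source
            _ => frame_len(&key, Some(&format!("val{i}"))),
        };
    }
    total
}
```

Under that assumption the totals come out to 1874 bytes with the tombstone kept and 1865 bytes with it dropped, matching the measured sizes.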
Hit/miss behavior under repeated workloads
The lru_basic_hit_miss test demonstrates the basic counters: one Get on
a present key bumps hits to 1; one Get on an absent key bumps misses
to 1. The lru_evicts_lru_on_capacity test confirms that the eviction
counter increments exactly once when a fourth insert into a 3-slot cache
forces the LRU node out.
Tournament dynamics
With K = 2 sources in the cross-test, the heap has at most 2 entries; with
K = 7 (memtable + L0 file + 5 L1 files in a realistic LSM), the heap has at
most 7 entries regardless of the millions of entries flowing through. Heap
operations are O(log K) per Next(), so even at K = 1000 the per-entry
cost is ~10 comparisons.
Determinism
The serialize_is_deterministic_and_sized test in all three languages
constructs the same (key, entry) stream twice and confirms identical
serialized bytes. This is what the cross-test relies on — if any language
becomes non-deterministic (e.g., picks the wrong duplicate on a tie, or
serializes value lengths in big-endian), the sha256 mismatch will surface
immediately.
What surprised me
- The C++ std::priority_queue is a max-heap by default; it behaves as a min-heap only if you pass an explicit std::greater-style comparator. Forgetting this emits keys in reverse order.
- Rust's BinaryHeap is also a max-heap by default; we wrap in Reverse((key, src)) to flip it, which also automatically gives the correct tie-break on src ascending because Reverse(a) < Reverse(b) iff a > b and the derived tuple Ord compares lexicographically.
- Go's container/heap requires you to write Less yourself, so the tie-break is explicit and self-documenting.
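The Rust observation is easy to confirm in isolation. This is a small sketch (the helper name pop_order is hypothetical) that pushes (key, src) pairs wrapped in Reverse and records the order in which they pop.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pops every (key, src) pair from a BinaryHeap turned into a min-heap
/// via `Reverse`, returning them in pop order.
fn pop_order(items: Vec<(Vec<u8>, usize)>) -> Vec<(Vec<u8>, usize)> {
    let mut heap: BinaryHeap<Reverse<(Vec<u8>, usize)>> =
        items.into_iter().map(Reverse).collect();
    let mut out = Vec::new();
    // Tuples compare lexicographically: key first, then src ascending,
    // so on equal keys the smaller (newer) source index pops first.
    while let Some(Reverse(item)) = heap.pop() {
        out.push(item);
    }
    out
}
```

For the input [("b", 0), ("a", 1), ("a", 0)] the pop order is ("a", 0), ("a", 1), ("b", 0): keys ascend and the newer source leads on the tie.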
What did not surprise me
The hit/miss counts came out exactly as expected on first run for all three languages. The K-way merge produced a sorted stream on first run for Rust and Go.
db-08 — Verification
Cross-language byte identity (gating)
scripts/cross_test.sh is the gate. It builds canonical SSTable inputs and
runs each language's merge_iter binary, comparing sha256 of the serialized
merge stream.
Final results from the current run:
drop=false:
rust: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
go : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
cpp : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
match: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1
drop=true:
rust: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
go : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
cpp : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
match: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e
The 9-byte size delta between modes equals exactly one tombstone frame
(u32_le(4) + "key5" + u8(1)), confirming that the only entry dropped is
the expected one.
Stream-content spot-checks
The cross-test runs xxd -p | grep to confirm that:
- NEW-10 (hex 4e45572d3130) appears: the merged-write semantics worked.
- OLD-50 (hex 4f4c442d3530) appears: keys present only in the older source survive.
- val99 (hex 76616c3939) appears: the largest bulk key from older shows up.
- 040000006b65793501 (key5 tombstone framing) appears with drop=false and is absent with drop=true.
These are not redundant with the sha256 check: sha256 mismatch tells you something is wrong but not what; the framed-hex grep tells you which invariant broke.
Unit-test coverage matrix
| Behavior | Rust | Go | C++ |
|---|---|---|---|
| LRU basic hit/miss + counters | ✅ | ✅ | ✅ |
| LRU evicts LRU on capacity | ✅ | ✅ | ✅ |
| LRU re-insert overwrites + promotes | ✅ | ✅ | ✅ |
| LRU MRU-first key order after Get | ✅ | ✅ | ✅ |
| Merger: empty inputs → empty output | ✅ | ✅ | ✅ |
| Merger: single source passthrough | ✅ | ✅ | ✅ |
| Merger: two-source interleave (no duplicates) | ✅ | ✅ | ✅ |
| Merger: newest-wins on tie | ✅ | ✅ | ✅ |
| Merger: tombstone kept when drop=false | ✅ | ✅ | ✅ |
| Merger: tombstone dropped when drop=true | ✅ | ✅ | ✅ |
| SerializeStream deterministic & expected size | ✅ | ✅ | ✅ |
How to re-verify locally
cd db-08-block-cache-and-iterators
bash scripts/verify.sh # unit tests for all three languages
bash scripts/cross_test.sh # cross-language byte-identity test
What would invalidate this proof
- Changing SerializeStream's framing (lengths, endianness, type-byte encoding): sha256 would diverge immediately.
- Changing the (key, src) heap comparator to break ties on src descending: the newest-wins test fails before the cross-test runs.
- Changing the cache capacity unit from entries to bytes: the LRU tests would need recalibration but no other lab depends on the unit choice.
db-08 — Broader Ideas
What this lab teaches that goes beyond storage
The two primitives in this lab — bounded caches and tournament merges — are load-bearing in every layer of computing, not just databases.
Bounded caches
- CPU caches (L1/L2/L3) implement set-associative LRU/PLRU in hardware with the same hash-map-plus-recency-order shape, just expressed in gates.
- TLBs are caches over the page tables' virtual-to-physical mapping; they share LRU's vulnerability to large scans.
- HTTP caches (CDN edges, browser caches) cache responses keyed on URL with the same eviction problem and many of the same policies (LRU, LFU, TinyLFU, S3-FIFO).
- Compiler caches (ccache, sccache, Bazel's CAS) cache build outputs keyed on the content hash of inputs: same data structure, different key.
- JIT method caches in V8 and HotSpot cache compiled code; they too evict on capacity pressure.
The pattern is universal: bounded random-access store + recency or frequency order. Once you can implement and analyze LRU, you can swap in LFU, ARC, LRU-K, 2Q, CLOCK-Pro, TinyLFU, S3-FIFO, or W-TinyLFU by replacing the order without changing the index.
K-way merges
- External sort (sort -m, MapReduce shuffle, Spark's sort-shuffle) is literally a K-way merge of sorted runs, identical in structure to ours.
- Stream-stream joins (Flink, ksqlDB, Materialize) merge two ordered streams by key with a sliding-window predicate.
- Time-series databases (Prometheus, InfluxDB, VictoriaMetrics) merge sorted chunks across files, then deduplicate by timestamp: newest-wins, with timestamp as the tie-breaker instead of source index.
- Git's pack-objects merges sorted delta chains across pack files when serving a fetch.
- Snapshot iteration in MVCC databases is a merging iterator with a per-key filter that drops versions newer than the snapshot's commit timestamp, which is exactly what db-13 will build on top of db-08.
"When does this break?"
- LRU + scans. A long sequential scan pollutes the cache with entries the workload will never see again. Mitigations: scan-resistant policies (LRU-K, ARC), separate cache pools per access pattern, or O_DIRECT bypass.
- K-way merge with very large K. When K approaches thousands (e.g., a Cassandra node with many SSTables on disk), O(log K) per-entry cost starts to bite. The fix is not a better merger but a compaction policy that keeps K bounded (db-07's job).
- Tombstones outliving the keys they shadow. A delete-heavy workload produces tombstones faster than compaction can drop them; the merger spends increasing CPU skipping shadowed entries. Cassandra calls this "tombstone hell" and ships a tombstone_warn_threshold.
- Cache stampede. Many threads simultaneously missing on the same key hammer the underlying storage; production systems add per-key locks ("singleflight" in Go's groupcache).
Extensions worth attempting
- Sharded LRU. Replace the single cache with N independent shards keyed on hash(file_id, offset) % N.
- TinyLFU admission filter in front of the cache (a frequency sketch admits a new entry only when it is estimated to be accessed more often than the entry it would evict).
- Block-cache statistics beyond hits/misses/evicts: per-entry size, bytes resident, age histogram, top-N hot blocks.
- Bidirectional iterator. Add Prev() to support reverse range scans.
- Range-tombstone aware merger. Adding range deletes changes heap-pop semantics: a range tomb shadows a range of point keys.
- O(1) amortized doubly-linked list arena (Rust) that interns BlockKey to u32 indices to halve hash map memory.
Where this lab fits in the curriculum
After db-08, every later lab gets a free ride on these primitives:
- db-09 wires BlockCache and MergingIterator into the LevelDB-complete Get/Scan paths.
- db-13 (MVCC) layers snapshot-visibility filtering on top of a merging iterator just like this one.
- db-14 (indexes / query optimization) builds secondary merging iterators for index-scan-then-fetch plans.
- db-20 (distributed KV store) shards block caches across nodes and adds a network-aware admission policy.
Step 01 — LRU Block Cache
Goal
Implement a bounded, O(1) LRU cache keyed on (file_id, block_offset)
holding decoded block bytes, with hit/miss/eviction statistics.
Spec
API (Rust signature; the Go and C++ APIs mirror it):
pub struct BlockKey { pub file_id: u64, pub offset: u64 }
pub struct CacheStats { pub hits: u64, pub misses: u64, pub evictions: u64 }
impl BlockCache {
    pub fn new(capacity: usize) -> Self;                        // capacity > 0
    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>>;     // promotes to MRU on hit
    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool;  // returns true on overwrite
    pub fn len(&self) -> usize;
    pub fn capacity(&self) -> usize;
    pub fn stats(&self) -> CacheStats;
    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey>;             // test-only
}
Behavior contracts:
- get returns Some(v.clone()) on hit and moves that entry to MRU; bumps hits counter.
- get returns None on miss; bumps misses counter.
- insert on an existing key overwrites the value and promotes to MRU.
- insert on a full cache evicts the LRU entry first; bumps evictions.
- keys_mru_to_lru() returns the live keys in order; used by tests only.
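One way to satisfy these contracts is an arena-backed intrusive list, in the spirit of the layout the execution notes describe. What follows is a minimal sketch, not the lab's actual implementation: it omits the free-list (slots are only reused on eviction, since there is no remove API), and the private names Node, NIL, unlink, and push_front are illustrative choices.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct BlockKey { pub file_id: u64, pub offset: u64 }

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CacheStats { pub hits: u64, pub misses: u64, pub evictions: u64 }

const NIL: usize = usize::MAX; // sentinel for "no neighbour"

struct Node { key: BlockKey, val: Vec<u8>, prev: usize, next: usize }

pub struct BlockCache {
    cap: usize,
    map: HashMap<BlockKey, usize>, // key -> slot index in `nodes`
    nodes: Vec<Node>,              // arena holding the linked list
    head: usize,                   // most recently used slot
    tail: usize,                   // least recently used slot
    stats: CacheStats,
}

impl BlockCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        BlockCache { cap: capacity, map: HashMap::new(), nodes: Vec::new(),
                     head: NIL, tail: NIL, stats: CacheStats::default() }
    }

    // Detach slot `i` from the recency list in O(1).
    fn unlink(&mut self, i: usize) {
        let (p, n) = (self.nodes[i].prev, self.nodes[i].next);
        if p != NIL { self.nodes[p].next = n } else { self.head = n }
        if n != NIL { self.nodes[n].prev = p } else { self.tail = p }
    }

    // Make slot `i` the MRU entry.
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = i } else { self.tail = i }
        self.head = i;
    }

    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>> {
        match self.map.get(k).copied() {
            Some(i) => {
                self.stats.hits += 1;
                self.unlink(i);
                self.push_front(i);
                Some(self.nodes[i].val.clone())
            }
            None => { self.stats.misses += 1; None }
        }
    }

    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool {
        if let Some(&i) = self.map.get(&k) {
            self.nodes[i].val = v; // overwrite, then promote
            self.unlink(i);
            self.push_front(i);
            return true;
        }
        let slot = if self.map.len() == self.cap {
            let victim = self.tail; // evict the LRU entry, reuse its slot
            self.unlink(victim);
            self.map.remove(&self.nodes[victim].key);
            self.stats.evictions += 1;
            self.nodes[victim] = Node { key: k, val: v, prev: NIL, next: NIL };
            victim
        } else {
            self.nodes.push(Node { key: k, val: v, prev: NIL, next: NIL });
            self.nodes.len() - 1
        };
        self.map.insert(k, slot);
        self.push_front(slot);
        false
    }

    pub fn len(&self) -> usize { self.map.len() }
    pub fn stats(&self) -> CacheStats { self.stats }

    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey> {
        let mut out = Vec::new();
        let mut i = self.head;
        while i != NIL { out.push(self.nodes[i].key); i = self.nodes[i].next; }
        out
    }
}
```

The hash map gives O(1) lookup and the index-linked list gives O(1) promotion and eviction; neither structure alone provides both.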
Acceptance
cd src/rust && cargo test
cd src/go && go test
cd src/cpp && cmake -B build && cmake --build build && ctest --test-dir build
All four LRU tests pass in each language:
- lru_basic_hit_miss
- lru_evicts_lru_on_capacity
- lru_reinsert_overwrites_and_promotes
- lru_keys_order_mru_first
Discussion prompts
- Why does get need a &mut self and not just &self? (Because it mutates the recency order, even though it only "reads" the cached value.)
- What changes if you bound by total bytes instead of entries? (You need to weigh each entry on insert and evict in a loop until under cap.)
- How would you make this thread-safe with minimum contention? (Sharded by hash(key) % N, one mutex per shard.)
Step 02 — Merging Iterator
Goal
Implement a streaming K-way merging iterator over K pre-sorted (key, Entry)
sources, where source index 0 is newest and ties are won by smaller index.
Support an optional drop_tombstones flag.
Spec
API (Rust signature):
pub struct MergingIterator { /* … */ }
impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self;
}
impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);
    fn next(&mut self) -> Option<Self::Item>;
}
Behavior contracts:
- Each source is sorted strictly ascending by key with no within-source duplicates (caller's responsibility).
- The merged stream is sorted strictly ascending; each key appears at most once.
- On tie, source with smaller index wins; older sources are drained — i.e. their copies of that key are advanced past, not yielded.
- If drop_tombstones=true, winning entries whose type is Tombstone are not yielded; the iterator continues to the next key.
- The working set is O(K) regardless of N, where K is the number of sources and N is the total number of entries.
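The contracts above fit in a compact heap-based iterator. The sketch below is one possible shape, not the lab's code: Entry is a local stand-in for the db-06 type, and the heads field and advance helper are illustrative names. The heap holds only (key, source_index); each source's current entry body is parked in heads until it wins or is drained.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Stand-in for db-06's Entry type (assumed shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

pub struct MergingIterator {
    sources: Vec<std::vec::IntoIter<(Vec<u8>, Entry)>>,
    heads: Vec<Option<Entry>>, // entry body for each source's queued key
    heap: BinaryHeap<Reverse<(Vec<u8>, usize)>>, // min-heap of (key, src)
    drop_tombstones: bool,
}

impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self {
        let mut it = MergingIterator {
            heads: vec![None; sources.len()],
            sources: sources.into_iter().map(|s| s.into_iter()).collect(),
            heap: BinaryHeap::new(),
            drop_tombstones,
        };
        for src in 0..it.sources.len() {
            it.advance(src); // seed the heap with one key per source
        }
        it
    }

    // Pull the next item from `src` (if any) and queue its key.
    fn advance(&mut self, src: usize) {
        if let Some((key, entry)) = self.sources[src].next() {
            self.heads[src] = Some(entry);
            self.heap.push(Reverse((key, src)));
        }
    }
}

impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);

    fn next(&mut self) -> Option<Self::Item> {
        while let Some(Reverse((key, src))) = self.heap.pop() {
            // Smallest key; on ties the smallest src (newest) pops first.
            let entry = self.heads[src].take().expect("queued source has a head");
            self.advance(src);
            // Drain older sources' copies of the same key without yielding them.
            while let Some(Reverse((k, _))) = self.heap.peek() {
                if *k != key { break; }
                let Reverse((_, older)) = self.heap.pop().unwrap();
                self.heads[older].take();
                self.advance(older);
            }
            if self.drop_tombstones && entry == Entry::Tombstone {
                continue; // suppressed: move on to the next key
            }
            return Some((key, entry));
        }
        None // heap stays empty, so later calls keep returning None
    }
}
```

At most one heap element and one parked entry exist per source, which is where the O(K) working set comes from.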
Canonical serialization (cross-test contract)
For each yielded (key, entry):
u32_le(len(key)) || key // 4+|key| bytes
u8(type) // 1 byte; 0=Value, 1=Tombstone
if type == Value:
u32_le(len(value)) || value // 4+|value| bytes
This is what SerializeStream emits and what the cross-test sha256s.
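The framing translates almost line for line into code. This is a sketch under assumptions: the function name serialize_stream and the local Entry enum are stand-ins for the lab's SerializeStream and db-06's Entry, whose exact signatures are not shown here.

```rust
// Stand-in for db-06's Entry type (assumed shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

/// Serializes a merged stream using the canonical framing:
/// u32_le(len(key)) || key || u8(type) [|| u32_le(len(value)) || value].
pub fn serialize_stream<I>(entries: I) -> Vec<u8>
where
    I: IntoIterator<Item = (Vec<u8>, Entry)>,
{
    let mut out = Vec::new();
    for (key, entry) in entries {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(&key);
        match entry {
            Entry::Value(val) => {
                out.push(0); // type byte: Value
                out.extend_from_slice(&(val.len() as u32).to_le_bytes());
                out.extend_from_slice(&val);
            }
            Entry::Tombstone => out.push(1), // type byte: Tombstone, no payload
        }
    }
    out
}
```

Serializing a lone key5 tombstone yields the nine bytes 04 00 00 00 6b 65 79 35 01, the same hex string the cross-test greps for.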
Acceptance
cd src/rust && cargo test
cd src/go && go test
cd src/cpp && ctest --test-dir build
Six (Rust/Go) or seven (C++) merger tests must pass:
- empty inputs → empty output
- single source passthrough
- two-source interleave with no duplicates
- newest-wins on tie
- tombstone kept when drop=false
- tombstone dropped when drop=true
- (SerializeStream deterministic & expected size)
Discussion prompts
- Why not nested two-way merges? (Merging the sources one after another costs O(N · K) total work instead of O(N log K); for K=10 that is roughly 10 / log2(10) ≈ 3× worse, and the gap widens as K grows.)
- Why is "drain duplicates" eager rather than lazy? (Lazy would force the caller to dedupe, breaking the invariant that the merger's output is the single source of truth for "what's live at this key".)
- Where in real systems do you find tie-break-by-source-index? (LSM read path, time-series chunk merging, Kafka log compaction, anywhere "newer wins" without explicit timestamps.)
Step 03 — CLI and Cross-Language Test
Goal
Wrap the MergingIterator in a CLI binary (merge_iter) that reads N
SSTables (built by db-06), runs a merge, and writes the canonical serialized
stream to stdout. Then prove the three language implementations are
byte-identical with scripts/cross_test.sh.
CLI spec
merge_iter [--drop-tombstones] IN1.sst IN2.sst ...
- IN1 is the newest source; INk is the oldest.
- Reads each input via db-06's SstReader, converts to Vec<(Vec<u8>, Entry)>, feeds all into a MergingIterator, calls SerializeStream, and writes the bytes verbatim to stdout.
- Exit code: 0 success, 1 input error, 2 usage error.
Implementations:
- Rust: src/rust/src/bin/merge_iter.rs
- Go: src/go/cmd/merge_iter/main.go
- C++: src/cpp/src/merge_iter_bin.cc
Acceptance
Run the cross-test:
bash scripts/cross_test.sh
It must:
- Print match: lines with sha256s that are the same for all three languages (in both drop=false and drop=true modes).
- Confirm via hex spot-check that NEW-10, OLD-50, and val99 are present in the stream.
- Confirm the key5 tombstone framing (040000006b65793501) appears with drop=false and is absent with drop=true.
- End with CROSS-TEST OK.
Captured truth (current run):
drop=false → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
Discussion prompts
- Why pipe the binary stream into sha256sum rather than diff the entry list? (A bytewise hash catches all serialization differences with a single number; it is the strongest possible equivalence test.)
- The drop=true output is exactly 9 bytes shorter than drop=false. Where do those 9 bytes go? (u32_le(4) + "key5" + u8(1) = 4+4+1 = 9: one tombstone frame.)
- If you wanted to add a new entry kind (say, a "merge-add" delta), what would you change in the serialization? (Pick a new type byte (e.g. 2), decide its payload framing, document it, and update all three languages' SerializeStream and CLI in lockstep.)
db-09 — LevelDB Complete
What is it?
A small but end-to-end LSM-tree key-value store assembled from the labs we
have built so far. It is the smallest interesting "real database" we can ship:
opens a directory, durably accepts put/delete/batch writes, answers
get/scan queries, and survives crashes — using the WAL (db-03), MemTable
(db-05), SSTable format (db-06), and merging iterator (db-08) as its parts.
The engine deliberately stops short of automatic background compaction. That arrives in db-21; here we keep the focus on correctness of the read path across multiple immutable L0 SSTables and a live memtable.
Why does it matter?
This is the first lab where the labels start to look like the things you actually run in production:
- Db::open(dir) + MANIFEST: every LSM-shaped store (LevelDB, RocksDB, Pebble, Cassandra's SSTable subsystem, HBase HFile) has essentially this contract: a directory is the database, a manifest enumerates which files are live, and recovery rebuilds in-memory state by reading the manifest and replaying the WAL.
- The write path's three steps (encode batch → append+fsync WAL → apply to memtable) are the universal LSM commit. Nearly every LSM engine does these three things in this order. Once you internalize why (durability before visibility), you can read any LSM source tree.
- The read path — memtable then newest SSTable then older SSTables, with the first hit (Value or Tombstone) winning — is the core LSM invariant. Compaction in db-21 is just amortizing this work; it doesn't change the rule.
If you understand this lab, you understand the shape of LevelDB.
How does it work?
┌────── Db (one directory) ─────────────────────┐
│ │
write path │ WriteBatch ─► encode ─► WAL.append + fsync │
───────────► │ │ │
│ ▼ │
│ MemTable (in RAM) │
│ │ │
│ size/explicit Flush │
│ ▼ │
│ sst-NNNNNN.sst.tmp ─► fsync ─► rename │
│ │ │
│ prepend (id, SstReader) to L0 │
│ rewrite MANIFEST atomically │
│ close+delete+reopen WAL │
│ │
read path │ Get(k): MemTable → L0[0] → L0[1] → … │
───────────► │ first hit wins; Tombstone ⇒ None │
│ │
│ Scan: MergingIterator over │
│ [MemTable, L0[0], L0[1], …] │
│ drop_tombstones=true │
└────────────────────────────────────────────────┘
On-disk layout
<dir>/
MANIFEST text; one "L0 <id>" line per live SSTable, newest first
wal.log db-03 WAL of WriteBatch records (binary)
sst-000001.sst db-06 SSTable, one per flush, zero-padded 6-digit id
sst-000002.sst
...
Recovery (Db::open)
- mkdir -p the directory.
- Read MANIFEST line by line; each line is L0 <id>, newest first.
- Open each SSTable in that order; track max_id.
- Open the WAL with WalIter and replay every record (WriteBatch::decode then apply to a fresh memtable). Any torn tail is dropped by the WAL iterator (db-03 invariant).
- Open the WAL for writes; set next_id = max_id + 1.
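The MANIFEST half of recovery is small enough to sketch directly. The helpers below are hypothetical (parse_manifest and sst_file_name are not names from the lab), and two behaviours are assumptions: blank lines are skipped, and an empty MANIFEST is treated as max_id = 0 so the first flush gets id 1.

```rust
/// Parses MANIFEST text with one "L0 <id>" line per live SSTable, newest first.
/// Returns the ids in file order plus the next id to allocate (max_id + 1).
fn parse_manifest(text: &str) -> Result<(Vec<u64>, u64), String> {
    let mut ids = Vec::new();
    for (n, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // ASSUMPTION: tolerate blank lines
        }
        let id = line
            .strip_prefix("L0 ")
            .ok_or_else(|| format!("line {}: expected \"L0 <id>\"", n + 1))?
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("line {}: {e}", n + 1))?;
        ids.push(id);
    }
    // ASSUMPTION: an empty MANIFEST means max_id = 0, so next_id = 1.
    let next_id = ids.iter().max().map_or(1, |m| m + 1);
    Ok((ids, next_id))
}

/// File name for an SSTable id, zero-padded to six digits.
fn sst_file_name(id: u64) -> String {
    format!("sst-{id:06}.sst")
}
```

Keeping the ids in file order matters because that order is the read precedence: the first id listed is the newest L0 table.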
Write path
put(k, v) ≡ Write(WriteBatch{Put{k,v}})
del(k) ≡ Write(WriteBatch{Delete{k}})
Write(batch):
bytes = batch.encode()
wal.append(bytes); wal.sync() # durability first
apply(batch, memtable) # then visibility
The batch wire format is identical in-memory and on-WAL:
u32 LE count
for each op:
u8 type # 0 = Put, 1 = Delete
u32 LE klen
key bytes
if Put:
u32 LE vlen
value bytes
This is the same encoder/decoder Rust, Go, and C++ all use, which is what makes the cross-language byte-identity test possible.
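A round-trip encoder and decoder for this wire format might look like the following. It is a sketch only: the Op enum and the free functions encode, decode, take, and take_u32 are illustrative shapes, not the lab's WriteBatch API. Rejecting trailing bytes mirrors the trailing-byte test the Rust crate lists.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op { Put(Vec<u8>, Vec<u8>), Delete(Vec<u8>) }

/// Encodes a batch: u32 LE count, then per op a type byte,
/// a length-prefixed key, and (for Put) a length-prefixed value.
pub fn encode(ops: &[Op]) -> Vec<u8> {
    let mut out = (ops.len() as u32).to_le_bytes().to_vec();
    for op in ops {
        let (ty, key, val) = match op {
            Op::Put(k, v) => (0u8, k, Some(v)),
            Op::Delete(k) => (1u8, k, None),
        };
        out.push(ty);
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        if let Some(v) = val {
            out.extend_from_slice(&(v.len() as u32).to_le_bytes());
            out.extend_from_slice(v);
        }
    }
    out
}

// Splits `n` bytes off the front of `buf`, or returns None if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let cur: &'a [u8] = *buf;
    if cur.len() < n {
        return None;
    }
    let (head, rest) = cur.split_at(n);
    *buf = rest;
    Some(head)
}

fn take_u32(buf: &mut &[u8]) -> Option<usize> {
    let b = take(buf, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

/// Decodes a batch; returns None on truncation, unknown type, or trailing bytes.
pub fn decode(mut buf: &[u8]) -> Option<Vec<Op>> {
    let count = take_u32(&mut buf)?;
    let mut ops = Vec::new();
    for _ in 0..count {
        let ty = take(&mut buf, 1)?[0];
        let klen = take_u32(&mut buf)?;
        let key = take(&mut buf, klen)?.to_vec();
        ops.push(match ty {
            0 => {
                let vlen = take_u32(&mut buf)?;
                Op::Put(key, take(&mut buf, vlen)?.to_vec())
            }
            1 => Op::Delete(key),
            _ => return None,
        });
    }
    if buf.is_empty() { Some(ops) } else { None }
}
```

Because the WAL record body is exactly these bytes, replay is just decode followed by applying each op to the memtable.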
Flush
Flush():
if memtable empty: return
id = next_id++
build SstWriter from memtable.sorted()
write sst-id.sst.tmp; fsync; rename → sst-id.sst # crash-safe publish
prepend (id, SstReader) to ssts # newest first
rewrite MANIFEST atomically (tmp + rename)
wal.close(); remove(wal.log); wal = Wal::open(wal.log)
memtable = MemTable::new()
The order matters: the SST must be durably renamed before we rewrite MANIFEST, and MANIFEST must be durably renamed before we truncate the WAL. If we crash between any two steps, recovery is safe — either the WAL still has the records, or the SST is on disk and listed in MANIFEST.
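The publish ordering can be sketched with the standard library alone. This is an illustration under assumptions, not the lab's Flush: publish_atomically and flush_files are hypothetical helpers, the directory fsync assumes a Unix-like system (opening a directory as a File does not work on Windows), and the WAL reset is simplified to truncating wal.log rather than the close, remove, reopen sequence above.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `<dir>/<name>` via a temp file so a crash leaves
/// either the previous state or the complete new file, never a partial one.
fn publish_atomically(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // contents durable before the name becomes visible
    fs::rename(&tmp, dir.join(name))?; // atomic replace of the target name
    File::open(dir)?.sync_all() // persist the rename itself (Unix assumption)
}

/// Runs the three publish steps in the order the text requires.
fn flush_files(
    dir: &Path,
    id: u64,
    sst_bytes: &[u8],
    live_ids_newest_first: &[u64],
) -> io::Result<()> {
    // 1. SST first: until MANIFEST lists it, the WAL still covers the data.
    publish_atomically(dir, &format!("sst-{id:06}.sst"), sst_bytes)?;
    // 2. MANIFEST second, with the new table prepended (newest first).
    let mut manifest = format!("L0 {id}\n");
    for old in live_ids_newest_first {
        manifest.push_str(&format!("L0 {old}\n"));
    }
    publish_atomically(dir, "MANIFEST", manifest.as_bytes())?;
    // 3. WAL last. Stand-in: truncate in place instead of remove + reopen.
    File::create(dir.join("wal.log"))?.sync_all()
}
```

Each step only discards information (the temp name, the old MANIFEST, the WAL records) after the step that supersedes it is durable.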
Read path
Get(k) walks the in-RAM memtable first, then SSTables newest-first. The
first hit wins:
| Source hit returns | Result |
|---|---|
| Value(v) | Some(v) |
| Tombstone | None |
| miss | continue |
If nothing matches, return None.
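The lookup rule in the table reduces to a short loop. In this sketch a BTreeMap stands in for both the memtable and an opened SSTable (the Source alias and the get signature are assumptions; the real Db would consult SstReader instead).

```rust
use std::collections::BTreeMap;

// Stand-in for the shared Entry type (assumed shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

/// Stand-in for a searchable source: the memtable or one SSTable.
type Source = BTreeMap<Vec<u8>, Entry>;

/// Walks the memtable, then SSTables newest first; the first hit decides.
fn get(memtable: &Source, ssts_newest_first: &[Source], key: &[u8]) -> Option<Vec<u8>> {
    let sources = std::iter::once(memtable).chain(ssts_newest_first.iter());
    for src in sources {
        match src.get(key) {
            Some(Entry::Value(v)) => return Some(v.clone()),
            Some(Entry::Tombstone) => return None, // a delete is a hit: stop here
            None => continue,                      // miss: try the next older source
        }
    }
    None
}
```

Stopping at a tombstone is what keeps an older Value for the same key from resurfacing.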
ScanAll() and SerializeView() reuse db-08's MergingIterator. The
memtable is materialized into a KeyEntry vector (already sorted by BTreeMap
or std::map), then the iterator merges it with each SSTable's entries,
preferring the newer source on ties (memtable beats L0[0] beats L0[1] ...).
What's intentionally out of scope
- Compaction: db-21. Without it, repeated overwrites of the same key accumulate as more L0 SSTables. Reads stay correct (newest wins) but per-Get work grows linearly in flush count.
- Snapshots / MVCC: db-13.
- Sharding, replication, sequence numbers — db-16+.
- Bloom filters per SSTable — built in db-04; wiring is a db-21 optimization (skip SSTables whose Bloom rejects the key).
Cross-language invariant
All three implementations expose a dbctl --dir DIR CLI that reads a
script from stdin (PUT k v, DEL k, FLUSH, DUMP, DUMP_WITH_TOMBS).
scripts/cross_test.sh drives the same script through each, performs a
crash/recover cycle by closing and reopening the database, then compares
sha256(DUMP) and sha256(DUMP_WITH_TOMBS) across Rust, Go, and C++.
A byte-identical DUMP after recovery proves that all three implementations
agree on: the WAL record format, the SSTable format, the MANIFEST format,
the merge ordering, the tombstone semantics, and the recovery procedure.
db-09 — References
Primary sources
- Sanjay Ghemawat & Jeff Dean, LevelDB design document, Google, 2011. https://github.com/google/leveldb/blob/main/doc/index.md
- LevelDB implementation notes, doc/impl.md: describes the on-disk layout (MANIFEST, log files, table files) and the recovery procedure that db-09 mirrors almost verbatim. https://github.com/google/leveldb/blob/main/doc/impl.md
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree), Acta Informatica 33, 1996. The original paper; introduces the C0/C1 levels, the merge step, and the amortized write-cost analysis.
Production engines that share this shape
- LevelDB (Google) — direct ancestor of our design. https://github.com/google/leveldb
- RocksDB (Facebook/Meta) — adds column families, leveled compaction, bloom filters per file, and many tuning knobs but keeps the same WAL → memtable → flush → L0 → … shape. https://github.com/facebook/rocksdb/wiki
- Pebble (CockroachDB) — RocksDB-compatible engine in Go; very readable. https://github.com/cockroachdb/pebble
- HBase HFile / Cassandra SSTables — same on-disk philosophy.
Read-path correctness
- Mark Callaghan, Read, write & space amplification, 2018 — explains why the "newest source wins" rule is required and how compaction trades read amplification for write amplification. https://smalldatum.blogspot.com/2018/09/read-write-and-space-amplification.html
- Pebble's docs/rocksdb.md for an excellent diff-style walkthrough of how a modern engine differs from LevelDB while preserving the same correctness invariants.
Crash safety
- Pillai, Chidambaram, et al., All File Systems Are Not Created Equal, OSDI '14. The "fsync the file, then fsync the directory" rule we follow for SST publish and MANIFEST rewrite comes from this work.
Cross-lab dependencies
- db-03 (WAL): record framing, torn-tail tolerance, WalIter.
- db-05 (MemTable): sorted map with explicit tombstones.
- db-06 (SSTable): on-disk sorted-string file format with a footer and trailing checksum.
- db-08 (BlockCache + MergingIterator): k-way merge with newer-source-wins and optional tombstone dropping; canonical SerializeStream used by Db::serialize_view.
db-09 — Analysis
We are stitching together db-03/05/06/08 into the smallest engine that deserves the name database. The hard part is not any single component — we already have all of them — but choosing the smallest set of design decisions that yields crash safety and cross-language byte-identity.
Required invariants
- Durability of put: once put returns, a crash must not lose the write. Achieved by WAL append + fsync before applying to the memtable.
- Atomic publish of an SSTable: a recovering process must see either the complete SST or none of it. Achieved by write(.tmp) → fsync → rename. (POSIX rename replaces the target atomically, so a reader sees the old file or the new one, never a mix.)
- Atomic publish of a flush: a recovering process must not see an SST that MANIFEST does not list, and must not see MANIFEST listing an SST that does not exist. Achieved by ordering: write SST → rename SST → rewrite MANIFEST → rename MANIFEST → truncate WAL. A crash between SST-publish and MANIFEST-rewrite leaks an unlisted SST file (harmless and reusable on the next flush via a higher id; we keep it simple and never reuse). A crash between MANIFEST-rewrite and WAL-truncate replays records that are already in the SST; MemTable::put is idempotent for the same key, so this is safe (the duplicate disappears on next flush).
- Read precedence: for any key k, the answer must come from the most recent writer. Order: memtable first, then SSTables in newest-to-oldest order. Tombstones count as a hit.
- Cross-language determinism: given the same input script, all three languages must produce byte-identical DUMP. Achieved by sharing exactly the formats defined in db-03/05/06/08 plus the WriteBatch wire format defined in this lab.
Design decisions
Why MANIFEST is plain text
LevelDB's MANIFEST is a binary record log of edits ("add file X to level Y", "delete file X", "set next file number to N", ...). That makes log replay fast but is not byte-identity-friendly across languages because each edit record carries varint-encoded fields and an internal version-edit format.
For this lab the live set is small (one process, no concurrent writers, no
compaction) so we use the simpler representation: a text file rewritten on
every flush, atomic by tmp+rename. The cost is one extra O(n) write per
flush where n = number of live SSTables. For our small in-process loads,
this is invisible. db-21 will replace it with the LevelDB-style edit log
when compaction needs incremental atomic edits.
Why one SSTable per flush
LevelDB also writes one SST per flush; that's why they're "L0" files (level 0 is the only level where files may overlap). We keep the same property. "Newest L0 wins" then degenerates from a level-aware rule to a simple position-in-MANIFEST rule.
Why no compaction in db-09
Compaction is a separate concern: it's a background process whose only job is to reduce read amplification and reclaim space. Skipping it means:
- Read cost grows linearly with flush count for keys that miss everything.
- Disk usage grows monotonically — overwrites and deletes are never reclaimed.
Neither breaks correctness, and both are exactly what db-21 will fix. Splitting them keeps each lab small enough to fully verify.
Why the WriteBatch wire format is reused as the WAL record
Two formats are strictly worse than one: more surface area, more chances for a Rust/Go/C++ encoder to diverge. The batch encoder is the WAL serializer. The WAL framing (record-length + CRC32) is db-03's concern; the payload of each record is a single encoded batch.
Why three languages
The same reason as every lab from db-01 onward: the most direct way to
show that independent implementations of a binary protocol agree is to
compute sha256 of their output and compare. With three independent
implementations, the probability that unrelated bugs produce matching
sha256s is vanishingly small, so a match line is strong evidence of
correctness for the encode + flush + recover + read pipeline.
db-09 — Execution
What we built, in the order we built it.
1. Rust (src/rust)
- `Cargo.toml` declares crate `leveldb09` (lib) and a binary `dbctl`. `path` dependencies to `db-03-write-ahead-log`, `db-05-lsm-memtable`, `db-06-sstable-format`, and `db-08-block-cache-and-iterators`. No network-fetched deps.
- `src/lib.rs` defines `Db`, `WriteBatch`, `Op`, `OpType`, and re-uses the upstream types directly (`wal::Wal`, `memtable::{MemTable, EntryType}`, `sstable::{SstReader, SstWriter}`, `blockcache::{MergingIterator, SerializeStream, EntriesFromReader}`).
- 11 inline `#[cfg(test)]` tests covering: batch round-trip, batch trailing-byte rejection, memtable put/get, delete-shadows-value, flush+memtable cleared, flush+recovery, WAL replay, newest-SST-wins, scan dedupe and tombstone drop, deterministic `serialize_view`, recovery with both an SST and a non-empty WAL tail.
- `bin/dbctl.rs` is a stdin-driven CLI used by the cross-language script.
2. Go (src/go)
- `go.mod` module `github.com/10xdev/dse/db09` with `replace` directives pointing at the sibling labs' Go modules.
- `db.go` ports the Rust API one-for-one. The WriteBatch wire format is byte-for-byte identical (u32 LE count, then per op: type byte, u32 LE klen, key, optional u32 LE vlen + value).
- `db_test.go` mirrors all 11 Rust tests.
- `cmd/dbctl/main.go` is the matching CLI.
3. C++ (src/cpp)
- `CMakeLists.txt` compiles upstream `.cc` files directly into local static libraries (`wal_lib`, `memtable_lib`, `sstable_lib`, `blockcache_lib`). We do not `add_subdirectory(../../../db-NN)` because that would leak the upstream lab's `add_test` calls into our `ctest`.
- `src/db.h` and `src/db.cc` provide `db09::Db`, constructed via `Db::Open(dir) -> std::unique_ptr<Db>`. `Db` is non-copyable and non-movable (its `dse::wal::Wal` member is itself non-copyable, and exposing moves would require fiddly forwarding for little gain in a one-process toy).
- WAL move-assignment (`wal_ = dse::wal::Wal::Open(path)`) is what makes the post-flush WAL reset work; this required confirming the upstream header declares `Wal& operator=(Wal&&) noexcept`.
- `src/dbctl.cc` and `tests/test_db09.cc` mirror their Rust/Go siblings. The test file uses `#undef NDEBUG` before `<cassert>` to guarantee asserts fire under Release builds.
4. Scripts
- `scripts/verify.sh` builds and runs each implementation's unit tests.
- `scripts/cross_test.sh`:
  - Builds Rust/Go/C++ `dbctl` binaries.
  - Defines one canonical command script (`run.script`) covering multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
  - For each language: pipes `run.script` into `dbctl --dir db-LANG` (writes + close), then reopens the same dir and pipes `DUMP` and `DUMP_WITH_TOMBS` into separate files. Reopen forces a real WAL replay and SST reload path.
  - Computes sha256 of `DUMP` and `DUMP_WITH_TOMBS` for each language and asserts all three match.
  - Spot-checks the rust DUMP stream hex for the presence of the expected final key-value bytes (`b=222`, `e=5`) and the expected tombstone bytes (key `a` in DUMP_WITH_TOMBS).
What we deliberately didn't build
- Compaction — db-21.
- Block cache wiring inside `Db`: db-08 has the cache; db-09 doesn't need it because each SSTable reader already holds the file bytes in memory. We'll plug in the LRU during db-21 when SSTable I/O becomes cold.
- Bloom-filter probing: db-04 has bloom; db-21 will skip SSTables whose Bloom rejects the key.
db-09 — Observation
What the cross-language verification actually proves.
Output of scripts/cross_test.sh
=== compare (DUMP, drop_tombstones=true) ===
DUMP rust=7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
DUMP go =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
DUMP cpp =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
match(DUMP): 7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c
=== compare (DUMP_WITH_TOMBS) ===
DUMP_TOMBS rust=27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
DUMP_TOMBS go =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
DUMP_TOMBS cpp =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
match(DUMP_TOMBS): 27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d
=== spot-check stream contents ===
spot-checks ok
=== ALL OK ===
What the canonical script exercises
PUT a 1 # → memtable
PUT b 2 #
PUT c 3 #
FLUSH # → sst-000001.sst (a=1, b=2, c=3)
PUT b 22 # overwrite, lands in next SST
DEL a # tombstone, lands in next SST
PUT d 4
FLUSH # → sst-000002.sst (a=Tomb, b=22, d=4)
PUT e 5 # WAL only, never flushed
DEL c # WAL only
PUT b 222 # WAL only
Live set after replay = {b=222, d=4, e=5} (a deleted, c deleted). With tombstones = the live set plus tombstones for a and c.
Sizes
DUMP (drop_tombstones=true): 35 bytes
b=222 : 4(klen) + 1 + 1(type) + 4(vlen) + 3 = 13
d=4 : 4 + 1 + 1 + 4 + 1 = 11
e=5 : 4 + 1 + 1 + 4 + 1 = 11
---
35 ✓
DUMP_WITH_TOMBS: 47 bytes
35 (as above)
+ tombstone a: 4 + 1 + 1 = 6
+ tombstone c: 4 + 1 + 1 = 6
---
47 ✓
The arithmetic matches the canonical byte format and the observed file sizes, which means we are not only matching sha256s but matching them on the right content.
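The size arithmetic can be re-derived mechanically. The sketch below assumes the per-entry layout implied by the sums above (u32 klen, key bytes, one type byte, then u32 vlen and value bytes for live values only); `entry_size` and `dump_size` are hypothetical helper names, not part of the lab's API.

```rust
/// Bytes one entry occupies in the dump stream, assuming the layout implied
/// by the arithmetic above: u32 klen, key, one type byte, then (values only)
/// u32 vlen and the value. `value_len` is `None` for a tombstone.
fn entry_size(key_len: usize, value_len: Option<usize>) -> usize {
    let header = 4 + key_len + 1;
    match value_len {
        Some(v) => header + 4 + v,
        None => header, // tombstone: no vlen field, no value bytes
    }
}

/// Total size of a dump given `(key_len, value_len)` pairs.
fn dump_size(entries: &[(usize, Option<usize>)]) -> usize {
    entries.iter().map(|&(k, v)| entry_size(k, v)).sum()
}
```

Feeding it the live set (`b=222`, `d=4`, `e=5`) gives 35, and adding the two one-byte-key tombstones gives 47, matching the observed file sizes.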
What this proves
- WriteBatch encoder agrees: otherwise WAL records would differ and recovery would diverge.
- WAL framing + iterator agree: otherwise WAL replay would produce different memtables in the three languages.
- MemTable ordering + tombstone semantics agree: otherwise the merge would produce different streams.
- SSTable encoder agrees: otherwise SST files (and therefore the `Entries()` they yield) would differ.
- Recovery procedure agrees: the dump is taken after close and reopen, so any drift in MANIFEST parsing, SST id assignment, or replay order would surface as a sha256 mismatch.
- MergingIterator + SerializeStream agree: the same property db-08 verified, now exercised over a memtable+two-SST source set.

A bug in any of these six layers, in any one of the three languages, that changes the bytes emitted for this script would break the sha256 match. Matching is therefore strong evidence of pipeline correctness end-to-end, for the paths the canonical script exercises.
db-09 — Verification
How to reproduce the green status on a clean machine.
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
- `cmake` ≥ 3.20.
- Rust toolchain ≥ 1.74 (`rustup default stable`).
- Go ≥ 1.22.
- `shasum`, `xxd`, `awk` (default on macOS; `coreutils` on Linux).
One command
cd db-09-leveldb-complete
scripts/verify.sh # builds + unit tests in all three langs
scripts/cross_test.sh # cross-language sha256 match
Both should print === OK === / === ALL OK === and exit 0.
Per-language drill-down
Rust
cd db-09-leveldb-complete/src/rust
cargo test --quiet
cargo build --release
Expected: 11 passed; 0 failed. The dbctl binary lands in
target/release/dbctl.
Go
cd db-09-leveldb-complete/src/go
go test ./...
go build ./cmd/dbctl
Expected: ok github.com/10xdev/dse/db09 <duration>.
C++
cd db-09-leveldb-complete/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and the test_db09
binary prints OK.
What "green" means
A green run guarantees:
- All 33+ unit tests pass (11 each in Rust, Go, C++).
- The cross-language test produces byte-identical `DUMP` and `DUMP_WITH_TOMBS` after a close/reopen cycle.
- Spot-checked hex bytes for `b=222`, `e=5`, and the tombstone for `a` are present in the stream, guarding against accidental empty-output regressions.
When verification fails
- Cross-language sha256 mismatch: almost always a divergence in the WriteBatch wire format, MANIFEST line format, SST writer ordering, MergingIterator tie-break, or whether tombstones are dropped. The fix is almost never in db-09; it's in the upstream lab whose format drifted.
- Recovery test fails in one language only: that language's WAL truncation step is wrong. Pattern (all three use it): close WAL → remove file → reopen WAL.
- C++ ctest reports zero tests: you accidentally did `add_subdirectory(../db-NN)`. Compile upstream `.cc` directly instead.
db-09 — Broader Ideas
Where to take this engine next, and where it already touches the rest of distributed-systems engineering.
Immediate next labs
- db-10 — B-tree fundamentals. The "other half" of storage. LSMs optimize for write-heavy workloads with append-only files and amortized rewrites; B-trees optimize for in-place point updates and range scans. Both shapes appear in every production database (often side by side: Postgres heap + WAL is B-tree-like, with TOAST and rolled-back versions reaped by VACUUM; MySQL/InnoDB is B-tree primary + UNDO log).
- db-21 — Storage-engine advanced. Compaction, leveled compaction policy, block cache wiring, bloom-filter probing, snapshots, file garbage collection. Everything that "real" LevelDB/RocksDB has that we postponed in db-09.
How this lab's pieces show up in distributed systems
- MANIFEST as a tiny version-edit log is a microcosm of how distributed systems use a log to make state changes atomic. A Raft log is the same pattern at machine granularity: apply changes to a state machine only after they're durably appended to an append-only log.
- `rename` for atomic publish is the local-filesystem analogue of two-phase commit. The OS gives us a strong primitive (rename is atomic under crash) and we lean on it. Distributed systems have to build equivalent primitives (Paxos / Raft / 2PC) because no underlying layer provides them for free.
- Newest-source-wins under a total order on writes is exactly how a CRDT LWW-register, a multi-version concurrency control snapshot read, a Kafka log-compaction "last value wins" topic, and a Bigtable per-cell-with-timestamp work. The variable that changes between systems is what defines "newer" (file id here; sequence number in LevelDB proper; timestamp in Cassandra; vector clock in Dynamo).
Performance experiments worth running later
These are not required for the lab to be green; they are good Saturday afternoon explorations:
- Plot read-amplification growth as L0 grows: write N batches, flushing after each one and never compacting, then measure point-lookup latency vs N.
- Replace text MANIFEST with LevelDB-style version-edit log; measure flush latency improvement at large live-set counts.
- Add a block cache between `SstReader::Get` and the file bytes; measure hit rate on a Zipfian workload.
- Wire bloom filters (db-04) per SSTable; measure how many SSTs you can skip for a typical miss-heavy workload.
What "production-ready" would require beyond this lab
- Concurrent writers (a real `Mutex` on the write path, multiple readers via versioned snapshots).
- Group commit (batch many WAL appends behind one fsync).
- Direct I/O / `pwrite`-based SST writer to avoid double-buffering.
- Checksums on every block read, not only at SST footer level.
- A scheduler for background flush + compaction with admission control.
- `fsync(dir)` after every file create / rename to survive metadata-loss scenarios on certain filesystems.
None of these change the shape of the engine — they make the same shape faster and tougher.
db-09 step 01 — The write path
Goal
Implement Db::open(dir), put(k,v), delete(k), and Write(WriteBatch)
such that every successful return has been durably persisted to the WAL.
Tasks
- Pick the on-disk layout (`MANIFEST`, `wal.log`, `sst-NNNNNN.sst`).
- Define the WriteBatch wire format. Use a single encoder/decoder so the in-memory batch representation and the WAL record payload are identical bytes.
- On `open(dir)`:
  - `mkdir -p` the directory.
  - Read `MANIFEST` (if it exists) one line at a time; collect SST ids newest-first.
  - Open each SSTable in order; track `max_id`.
  - Replay `wal.log` with `WalIter`: decode each record as a batch and apply to a fresh memtable.
  - Open the WAL for writes; set `next_id = max_id + 1`.
- Implement `Write(batch)`:
  - Treat the empty batch as a no-op: return success without writing an empty WAL record.
  - `bytes = batch.encode(); wal.append(bytes); wal.sync();`
  - Apply the batch to the memtable.
- `put` and `delete` are thin wrappers that build a one-op batch.
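A minimal sketch of the single encoder/decoder pair, following the wire format stated in the Execution notes (u32 LE count, then per op: type byte, u32 LE klen, key, optional u32 LE vlen + value). The numeric type-byte values and all function names here are assumptions for illustration; the lab's real constants may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

// Assumed type-byte values; not taken from the lab's source.
const TYPE_DELETE: u8 = 0;
const TYPE_PUT: u8 = 1;

fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

fn encode(ops: &[Op]) -> Vec<u8> {
    let mut out = (ops.len() as u32).to_le_bytes().to_vec();
    for op in ops {
        match op {
            Op::Put(k, v) => {
                out.push(TYPE_PUT);
                put_bytes(&mut out, k);
                put_bytes(&mut out, v);
            }
            Op::Delete(k) => {
                out.push(TYPE_DELETE);
                put_bytes(&mut out, k);
            }
        }
    }
    out
}

/// Read `n` bytes at `*pos`, advancing it; `None` if the input is too short.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(n)?;
    let slice = buf.get(*pos..end)?;
    *pos = end;
    Some(slice)
}

fn take_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let b = take(buf, pos, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn take_bytes(buf: &[u8], pos: &mut usize) -> Option<Vec<u8>> {
    let len = take_u32(buf, pos)? as usize;
    Some(take(buf, pos, len)?.to_vec())
}

fn decode(buf: &[u8]) -> Option<Vec<Op>> {
    let mut pos = 0;
    let count = take_u32(buf, &mut pos)?;
    let mut ops = Vec::new();
    for _ in 0..count {
        let ty = take(buf, &mut pos, 1)?[0];
        let key = take_bytes(buf, &mut pos)?;
        ops.push(match ty {
            TYPE_PUT => Op::Put(key, take_bytes(buf, &mut pos)?),
            TYPE_DELETE => Op::Delete(key),
            _ => return None, // unknown op type
        });
    }
    // Trailing bytes mean the payload is corrupt or mis-framed.
    if pos == buf.len() { Some(ops) } else { None }
}
```

Rejecting trailing bytes in `decode` is what the `batch_rejects_trailing` acceptance test below checks: a record that decodes "successfully" with leftover bytes would hide framing bugs.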
Acceptance
Inline unit tests:
- `batch_roundtrip`: encode → decode round-trip preserves three representative ops (Put, Delete, Put-with-empty-key).
- `batch_rejects_trailing`: decoding rejects a payload corrupted by one extra trailing byte.
- `put_get_memtable`: `put("a","1")` followed by `get("a")` returns `Some("1")`; `get("missing")` returns `None`.
- `delete_shadows`: `put` then `delete` makes `get` return `None`.
All four green in Rust, Go, and C++.
Discussion prompts
- Why must `wal.sync()` happen before applying to the memtable, not after?
- What invariant would break if we let `Write` proceed for an empty batch?
- How would a group-commit optimization preserve the same durability guarantee while batching multiple `Write` calls behind a single `fsync`?
db-09 step 02 — Flush and recovery
Goal
Implement Flush() and recovery such that crashes between any two file
operations never produce an inconsistent live set.
Tasks
- Implement `Flush()` as the strict sequence:
  1. If memtable is empty, return.
  2. Allocate `id = next_id; next_id += 1`.
  3. Build an `SstWriter` from `memtable.sorted()`. For each entry, map `EntryType::Value` → Value (with bytes) and `EntryType::Tombstone` → Tombstone (empty value).
  4. Write `sst-<id>.sst.tmp` durably (open + write + fsync).
  5. `rename` it to `sst-<id>.sst`.
  6. Prepend `(id, SstReader)` to the in-memory `ssts` list (newest first).
  7. Rewrite `MANIFEST` atomically: write `MANIFEST.tmp` durably (one `L0 <id>` line per live SST, newest first), then `rename` to `MANIFEST`.
  8. Close the WAL, remove `wal.log`, reopen the WAL.
  9. Replace `memtable` with an empty one.

- Verify the recovery sequence implemented in step 01 still satisfies the crash matrix:

  | Crash between … | Effect on next open |
  |---|---|
  | step 4 and 5 | leftover `*.tmp` file, ignored on next open |
  | step 5 and 7 | leftover unlisted SST file, ignored on next open |
  | step 7 and 8 | replayed WAL re-applies writes that are also in the latest SST; idempotent because `MemTable::put` is overwrite |
  | step 8 and 9 | nothing to repair: the WAL is already reset and the memtable exists only in memory, so the next open sees the same state as after a completed flush |
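Steps 4, 5, and 7 all rely on the same write-tmp, fsync, rename pattern. A minimal sketch using only the Rust standard library is shown below; `write_atomic` is a hypothetical helper name, and the directory fsync at the end is an extra hardening step that the lab lists under "production-ready" rather than something the step requires.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Durably replace `dir/name` with `contents`: write a sibling `.tmp` file,
/// fsync it, then rename it over the final name.
fn write_atomic(dir: &Path, name: &str, contents: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{}.tmp", name));
    let dst = dir.join(name);

    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync the data before the file becomes visible
    drop(f);

    fs::rename(&tmp, &dst)?; // atomic replace on POSIX filesystems

    // Optional hardening: fsync the directory so the rename itself is
    // durable. Opening a directory as a File is a Unix-specific assumption.
    if cfg!(unix) {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

A reader that opens `dst` sees either the old contents or the new contents in full, never a partially written file, which is exactly the property the crash matrix depends on.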
Acceptance
Inline unit tests:
- `flush_creates_sst`: after `Flush()`, memtable empty and `LiveSstIds().len() == 1`; reads still work.
- `flush_then_recover`: `Flush()`, drop `Db`, reopen, reads still return the flushed values.
- `wal_replay`: without flushing, drop `Db`, reopen; memtable has the pre-crash writes.
- `newest_sst_wins`: two flushes with overlapping keys; the value from the newer flush is returned.
- `recovery_after_flush_plus_wal`: mix: flush, then write more (tombstones + puts) without flushing, drop, reopen; reads reflect both the flushed and the WAL-only writes correctly.
All five green in Rust, Go, and C++.
Discussion prompts
- Why prepend instead of append to the `ssts` list?
- Why is it safe to truncate the WAL even when the new MANIFEST may not yet be `fsync`'d to its parent directory?
- What would change if step 7 used an edit log (append a "+id" record) instead of rewriting the whole file?
db-09 step 03 — CLI and cross-language byte-identity
Goal
Build a dbctl --dir DIR CLI in all three languages that reads commands
from stdin, then assert via sha256 that all three produce byte-identical
output for the same canonical script — including after a crash/recover
cycle.
CLI contract
Each line of stdin is one of:
# comment (ignored)
PUT <key> <value> # whitespace-delimited (no spaces inside)
DEL <key>
FLUSH
DUMP # write serialize_view(drop_tombstones=true) to stdout
DUMP_WITH_TOMBS # write serialize_view(drop_tombstones=false) to stdout
Blank lines and lines starting with # are ignored.
DUMP and DUMP_WITH_TOMBS write raw bytes (no trailing newline) so that
sha256 over stdout is a pure function of the database state.
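The contract above is small enough to parse with a single `match`. The sketch below is one possible shape, with hypothetical names (`Command`, `parse_line`); it implements only what the contract states, so a `#` that appears after a command on the same line is treated as an extra token and rejected, not as a trailing comment.

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Put(String, String),
    Del(String),
    Flush,
    Dump,
    DumpWithTombs,
}

/// Parse one stdin line. `Ok(None)` means "skip" (blank line or comment).
fn parse_line(line: &str) -> Result<Option<Command>, String> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return Ok(None);
    }
    // Keys and values are whitespace-delimited, so splitting is enough.
    let tokens: Vec<&str> = line.split_whitespace().collect();
    let cmd = match tokens.as_slice() {
        ["PUT", k, v] => Command::Put(k.to_string(), v.to_string()),
        ["DEL", k] => Command::Del(k.to_string()),
        ["FLUSH"] => Command::Flush,
        ["DUMP"] => Command::Dump,
        ["DUMP_WITH_TOMBS"] => Command::DumpWithTombs,
        _ => return Err(format!("unrecognized command: {}", line)),
    };
    Ok(Some(cmd))
}
```

Keeping the parser strict about arity helps the cross-language goal: three CLIs that silently tolerate different malformed inputs could diverge without any engine bug.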
Tasks
- Build `dbctl` in Rust (`src/rust/src/bin/dbctl.rs`), Go (`src/go/cmd/dbctl/main.go`), and C++ (`src/cpp/src/dbctl.cc`).
- Write `scripts/cross_test.sh` that:
  - Builds all three binaries.
  - Creates one canonical command script that exercises multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
  - For each language: pipes the script into `dbctl --dir db-LANG` (which fully writes and closes), then reopens the directory and pipes `DUMP` (and separately `DUMP_WITH_TOMBS`) into a file.
  - Computes sha256 over each dump file; asserts all three match.
  - Spot-checks the rust DUMP stream hex for the expected post-recovery key-value bytes to guard against silent-empty regressions.
- Write `scripts/verify.sh` that runs unit tests in all three languages.
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
...
match(DUMP): 7d1568c7...
match(DUMP_TOMBS): 27e3d256...
spot-checks ok
=== ALL OK ===
A byte-identical DUMP after reopen is a near-proof of correctness for the entire encode → flush → MANIFEST → recover → merge → serialize pipeline across three independent implementations.
Discussion prompts
- Why force a close+reopen between the writes and the DUMP, instead of dumping from the same process?
- Why is `DUMP` (without tombstones) not sufficient on its own as a sound proof? What does `DUMP_WITH_TOMBS` add?
- If the three sha256s ever diverge, which lab's format is the most probable culprit, and why?
db-10 — B-Tree Fundamentals
The first lab of the B-tree track. Up to db-09 every persistent structure we built was an LSM (log + sorted runs + merge). Postgres, MySQL/InnoDB, SQLite, Oracle, SQL Server, and most embedded key-value engines you have never heard of are B-trees instead. This lab builds the in-memory kernel; db-11 wraps it in a pager so it can live on disk.
What is it?
A self-balancing search tree where every node holds up to 2T - 1
keys (and, if internal, up to 2T children). We pick the smallest
non-trivial degree T = 2, giving:
- 1 ≤ keys ≤ 3 per non-root node
- 2 ≤ children ≤ 4 per non-root internal node
- root may hold 1..3 keys (or 0 if the tree is empty)
The algorithms are the textbook CLRS B-tree: insert splits a child
proactively while descending if it is full; delete rebalances a
child proactively while descending if it would underflow. With this
discipline every operation is exactly one root-to-leaf traversal —
no second pass, no recursion back up to fix invariants.
Keys and values are arbitrary byte slices; comparison is lexicographic. Each node carries the value of every key it holds (this is a B-tree, not a B+-tree — values do not live exclusively in the leaves). db-11 will make the leaf-only choice when we introduce the pager and need to keep internal nodes small.
Why does it matter?
- Predictable depth. `log_T(n)` height with `T=2` gives a small, strictly bounded number of comparisons per lookup, no matter the insertion order. LSMs trade log writes for read amplification that grows with the number of levels; B-trees trade page rewrites for a tight bound.
- In-place update. A B-tree key update mutates exactly one node. LSMs append a new record and reclaim the old one during compaction. Which is better depends on workload; db-22 will measure it.
- The canonical study substrate. Every working storage engineer has implemented a B-tree at least once. Splits and merges are the microcosm of every concurrent, copy-on-write, or page-versioned variant that exists in production code.
How does it work?
Node layout
┌─────────────────────────── Node ────────────────────────────┐
│ is_leaf : bool │
│ keys : Vec<(key, value)> // 1..3 entries │
│ children: Vec<Box<Node>> // 0 if leaf, else nkeys+1│
└─────────────────────────────────────────────────────────────┘
Internal node with two keys (k0 < k1):
┌──────────┬──────────┐
│ k0,v0 │ k1,v1 │
└─┬──────┬─┴────────┬─┘
│ │ │
▼ ▼ ▼
 c0: keys < k0   c1: k0 < keys < k1   c2: keys > k1
Insert (proactive split)
Descend from the root. Before stepping into any full child
(nkeys == 3), split that child in place: promote its middle key to
the parent, drop the right sibling into the parent's child list at
position i+1, and let the new (now non-full) child take the
descent. If the root itself is full, grow upwards: create a new
parent with the old root as its only child, then split. This is the
only place tree height increases.
Before split (child too full): After split (middle promoted):
[ K ] [ K , k1 ]
│ │ │
▼ ▼ ▼
[k0, k1, k2] [k0] [k2]
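The split step can be sketched directly from the description above. This is an illustrative version that follows the CLRS shape, not a copy of the lab's `split_child`; the `Kv` alias is a hypothetical shorthand.

```rust
const T: usize = 2; // minimum degree; a full node holds 2T - 1 keys

type Kv = (Vec<u8>, Vec<u8>);

struct Node {
    is_leaf: bool,
    keys: Vec<Kv>,
    children: Vec<Box<Node>>,
}

/// Split the full child at `parent.children[i]` around its middle key.
/// Precondition: that child holds exactly 2T - 1 keys and `parent` is not full.
fn split_child(parent: &mut Node, i: usize) {
    let child = &mut parent.children[i];
    debug_assert_eq!(child.keys.len(), 2 * T - 1);

    // The upper T - 1 keys (and upper T children) move to a new right sibling.
    let right_keys = child.keys.split_off(T);
    let middle = child.keys.pop().expect("full child has a middle key");
    let right_children = if child.is_leaf {
        Vec::new()
    } else {
        child.children.split_off(T)
    };
    let right = Node {
        is_leaf: child.is_leaf,
        keys: right_keys,
        children: right_children,
    };

    // Promote the middle key; the right sibling lands at position i + 1.
    parent.keys.insert(i, middle);
    parent.children.insert(i + 1, Box::new(right));
}
```

With `T = 2`, a child holding `[k0, k1, k2]` ends up as `[k0]` on the left and `[k2]` on the right, with `k1` promoted into the parent, as in the diagram.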
Delete (proactive rebalance)
Descend from the root looking for the key. Before stepping into any
child that has only T - 1 = 1 key, ensure it has at least T = 2
keys by one of:
- Borrow from left sibling — rotate left sibling's last key up into the parent, parent's separator down into the child's front.
- Borrow from right sibling — symmetric.
- Merge with a sibling — pull the parent's separator down,
concatenate child + separator + sibling into a single node with
2T - 1 = 3keys.
If the root becomes an empty internal node (only one child, no keys) after the operation, collapse it: the root's only child becomes the new root. This is the only place tree height decreases.
Deletes that hit an internal key are handled by replacing the key with its in-order successor (or predecessor) and recursing the delete into that subtree, where the recursion terminates at a leaf.
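The merge case and the root collapse are the two places where nodes disappear, so they are worth sketching. The code below is a simplified illustration under the stated preconditions; `collapse_root` is a hypothetical helper name, and the lab's `merge_children` may be organized differently.

```rust
type Kv = (Vec<u8>, Vec<u8>);

struct Node {
    is_leaf: bool,
    keys: Vec<Kv>,
    children: Vec<Box<Node>>,
}

/// Merge `children[i]`, the separator `keys[i]`, and `children[i + 1]`
/// into a single node stored at `children[i]`.
/// Precondition: both children hold T - 1 keys, so the result holds 2T - 1.
fn merge_children(parent: &mut Node, i: usize) {
    let separator = parent.keys.remove(i);
    let right = *parent.children.remove(i + 1);
    let left = &mut parent.children[i];
    left.keys.push(separator);
    left.keys.extend(right.keys);
    left.children.extend(right.children);
}

/// If the root is an internal node left with no keys, its single child
/// becomes the new root. Otherwise the root is returned unchanged.
fn collapse_root(root: Box<Node>) -> Box<Node> {
    if !root.is_leaf && root.keys.is_empty() {
        let mut root = root;
        root.children
            .pop()
            .expect("a keyless internal root has exactly one child")
    } else {
        root
    }
}
```

Merging the only two children of a one-key root leaves that root keyless, which is the situation `collapse_root` handles and the only way the tree gets shorter.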
Canonical serialization
A preorder traversal of the tree emitting, per node:
u8 is_leaf (1 if leaf, 0 if internal)
u32 LE nkeys
nkeys * { u32 LE klen | klen bytes key | u32 LE vlen | vlen bytes val }
if !is_leaf:
(nkeys + 1) * recurse(child)
The empty tree therefore serializes as five bytes:
01 00 00 00 00 (one leaf node with zero keys).
This format captures structure, not just contents. Two trees with
the same {(key, value)} set but different splits / shapes produce
different byte sequences — so scripts/cross_test.sh would catch a
language whose insertion order or split rule diverged, even if the
externally-visible scan output still agreed.
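The format above translates almost line-for-line into a recursive function. The sketch below is an illustration of that format, not the lab's `serialize_tree`; the helper name `put_len_prefixed` is hypothetical.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Append `u32 LE length | bytes` to `out`.
fn put_len_prefixed(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// Preorder serialization: this node's header and entries, then each child.
fn serialize(node: &Node, out: &mut Vec<u8>) {
    out.push(if node.is_leaf { 1 } else { 0 });
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (key, value) in &node.keys {
        put_len_prefixed(out, key);
        put_len_prefixed(out, value);
    }
    if !node.is_leaf {
        for child in &node.children {
            serialize(child, out);
        }
    }
}
```

Serializing a leaf with zero keys yields the five bytes `01 00 00 00 00` described above, which makes a convenient first unit test.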
Deterministic workload
run_workload(scenario, seed, ops) drives a fresh tree using
SplitMix64(seed) to generate keys (8-byte big-endian indices
modulo a 200-entry key space) and values (4-byte big-endian). The
three scenarios:
| scenario | per-iteration behavior |
|---|---|
| `inserts` | always `insert(key, val)` |
| `deletes` | insert during the first half, `delete(key)` during the second |
| `mixed` | bits 62..63 of `r1` decide: insert (0,1), delete (2), no-op (3) |
Two PRNG outputs are consumed per iteration regardless of which branch is taken, so the key sequence is invariant under the scenario choice and only the operation kind differs. This makes the three scenarios easy to reason about: they all visit the same keys in the same order.
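A minimal sketch of the generator side, assuming the standard SplitMix64 constants. How the lab derives a key from the PRNG output is not spelled out here, so `key_for` (index modulo 200, 8-byte big-endian) is an assumption, as are the names `Action` and `mixed_action`.

```rust
/// SplitMix64: add the golden-ratio increment, then apply the standard mix.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Assumed key derivation: an index in a 200-entry key space, encoded as
/// 8 big-endian bytes so lexicographic order matches numeric order.
fn key_for(r: u64) -> [u8; 8] {
    (r % 200).to_be_bytes()
}

#[derive(Debug, PartialEq)]
enum Action {
    Insert,
    Delete,
    NoOp,
}

/// `mixed` scenario: bits 62..63 of r1 select insert (0,1), delete (2),
/// or no-op (3).
fn mixed_action(r1: u64) -> Action {
    match r1 >> 62 {
        0 | 1 => Action::Insert,
        2 => Action::Delete,
        _ => Action::NoOp,
    }
}
```

Every operation here is wrapping unsigned 64-bit arithmetic, which is what has to be reproduced exactly in Go (`uint64`) and C++ (`uint64_t`) for the hashes to match.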
The btreectl CLI
btreectl --seed N --ops M --scenario {inserts|deletes|mixed}
Runs the chosen workload and writes the serialized tree to stdout (raw bytes, no trailing newline).
Cross-language invariant
scripts/cross_test.sh runs the same (seed, ops, scenario) triple
through Rust, Go, and C++ btreectl binaries and asserts that all
three produce byte-identical output via sha256 for two scenarios:
| scenario | seed | ops | sha256 | size |
|---|---|---|---|---|
| A `inserts` | 42 | 500 | 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 | varies |
| B `mixed` | 7 | 500 | 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 | 2515 B |
A matching hash proves that all three implementations agree on: the PRNG, the lexicographic key compare, the proactive-split insertion, the proactive-rebalance deletion, and the precise tree shape after the workload. Any drift in any of these surfaces as a sha256 mismatch.
What's intentionally out of scope
- Persistence. db-11 introduces the pager and turns nodes into fixed-size disk pages.
- B+-tree leaves-only-values layout. Also db-11; it's the natural change once internal nodes need to fit one to a page.
- Concurrent / lock-coupling B-trees. db-13 (MVCC) and db-21 (storage-engine advanced) explore copy-on-write and latch protocols.
- Variable-length keys with prefix compression. SQLite and RocksDB both do this; we will revisit in db-15.
db-10 — References
Primary source
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 3rd or 4th ed., MIT Press. Chapter 18 (B-Trees) is the textbook treatment whose proactive-split / proactive-rebalance discipline we follow line-for-line. This is the single most important reference for the lab.
Original papers
- R. Bayer & E. McCreight, Organization and Maintenance of Large Ordered Indexes, Acta Informatica 1, 1972. The paper that introduced the B-tree.
- D. Comer, The Ubiquitous B-Tree, ACM Computing Surveys 11(2), 1979. The classic survey; explains B+, B*, and variants. A useful read before starting db-11 where the leaves-only layout enters.
Production engines that use B-trees or B+-trees
- SQLite — pure B+-tree pager-backed; the closest analog to what db-10 + db-11 will become by db-15. https://www.sqlite.org/arch.html
- Postgres — heap files plus B-tree indexes; index pages are very similar to SQLite's internal nodes. https://www.postgresql.org/docs/current/btree-implementation.html
- MySQL / InnoDB — clustered B+-tree per table. https://dev.mysql.com/doc/refman/8.0/en/innodb-physical-structure.html
- LMDB — B+-tree with copy-on-write pages; one of the cleanest open-source B-tree implementations to read. https://www.symas.com/lmdb
- BoltDB / bbolt — Go port of LMDB; readable and small. https://github.com/etcd-io/bbolt
Cross-lab dependencies
- None upstream. db-10 is the start of the B-tree track and imports no earlier labs.
- Downstream consumers: db-11 (pager) wraps each node in a fixed-size disk page; db-12 (SQL frontend) treats the tree as the table storage layer; db-13 (MVCC) snapshots node references rather than page bytes; db-14 (indexes) builds secondary B-trees over the primary tree's keys.
db-10 — Analysis
What had to be decided before any code was written, and why each decision shapes the next 5 labs.
Required invariants
- Search-tree order. For every internal node with keys `k0 < k1 < … < kn-1` and children `c0, c1, …, cn`, every key in `c_i` is `< k_i` and every key in `c_{i+1}` is `> k_i`.
- Bounded fanout. Non-root nodes hold between `T - 1` and `2T - 1` keys (`1..3` with `T = 2`). Only the root may hold fewer, and only when the tree is empty or the root is being collapsed.
- Uniform depth. All leaves sit at the same depth from the root. This is what makes the worst-case lookup guaranteed to be `O(log_T n)`, not merely expected.
- Proactive split / rebalance. The descent on insert never needs to back up to fix an overflow; the descent on delete never needs to back up to fix an underflow. Each mutating operation touches each level on the path exactly once.
- Canonical serialization. Two B-trees with the same shape and contents must serialize to the same bytes regardless of the insertion order that produced them; two B-trees with different shapes must serialize to different bytes even if they hold the same key-value set.
Design decisions
Why T = 2 (smallest non-trivial degree)
Larger T means flatter trees and more keys per page — what real
B-trees use to amortize disk I/O. But the algorithms are
identical at every T ≥ 2, and T = 2 makes splits and merges
frequent, which makes them easy to spot, easy to unit-test, and
easy to render in a hex dump. db-11 will bump T to something
realistic (e.g. matching a 4 KiB page) once nodes are pages.
Why B-tree, not B+-tree
A B+-tree puts values only in the leaves and threads the leaves
into a doubly-linked list for range scans. That's the right call
once nodes are disk pages — internal nodes shrink because they
don't carry values, so fanout (and therefore depth) wins.
In-memory, with T = 2, the values-in-every-node B-tree is
simpler and the savings would be invisible. db-11 swaps to a
B+-tree when the disk-page trade-off applies.
Why the wire format encodes structure, not just contents
Two trees with the same {(k, v)} set can have different shapes if
they were built by different insertion orders. A serializer that
only emits the in-order key list (essentially scan()) would let a
serious bug — say, swapping the left and right halves of a split —
hide forever, because the bug would manifest only as different
tree shapes, never different scan results.
By emitting the full preorder shape, byte-equality across languages is byte-equality of the trees' physical state. db-11 will reuse this property: the page-byte serialization of a B+-tree should be exactly reproducible across implementations.
Why the workload generator reads two PRNG outputs unconditionally
Each `run_workload` iteration consumes exactly `r1` and `r2`,
regardless of whether the chosen scenario inserts, deletes, or
does nothing. If a scenario consumed a variable number of PRNG draws,
the sequence of keys would diverge across scenarios for the same
seed, making the cross-scenario hashes incomparable and the bug
hunt much harder.
The cost: a small amount of wasted entropy on no-op iterations. The
gain: scenarios inserts, deletes, and mixed all visit the
same key-space in the same order for the same seed, so any
divergence is the operations' fault, not the keys'.
Why scenarios live in the library, not in the CLI
run_workload(...) is a library function that returns a BTree.
The btreectl binary is a one-liner around it. This means the
inline unit tests can call run_workload("mixed", 42, 500)
directly and assert determinism with no shell-out, no file I/O,
and no path-dependent flakiness. The same property lets
cross_test.sh trivially compare three independent CLI binaries.
Why three languages
- Forces the API to be small and explicit. The Rust `Box<Node>` recursion translates to Go's struct pointer recursion and C++'s `std::unique_ptr<Node>` recursion; if the algorithm needs language-specific cleverness, you've over-fit to one runtime.
- Pins integer arithmetic. `SplitMix64` uses wrapping unsigned multiplication; expressing it identically in three languages is a forcing function for the cross-language hash to match.
- Provides a deterministic conformance suite for the whole B-tree track. When db-11's pager produces a tree whose in-memory shape disagrees with the pure in-memory baseline, db-10's serializer is the comparison witness.
Tradeoffs worth flagging
- The serializer recurses on the call stack. For pathologically deep trees this could overflow. With `T = 2` and 64-bit keys drawn from a 200-key space, the worst-case height is roughly `log_2 200 ≈ 8` and the stack is never the bottleneck. db-11's paged variant will be even shallower and is fine to keep recursive.
- Keys and values are stored as owned `Vec<u8>` / `[]byte` / `std::vector<uint8_t>`. This is the simplest correct choice and it dominates allocation cost. db-22 (perf) will revisit whether to intern, slice, or arena-allocate.
- `delete` returns `bool` (was-present) rather than the removed value. Sufficient for testing; some real engines need the payload (e.g., to free its backing buffer). Easy to extend.
db-10 — Execution
What was built, in the order it was built.
1. Rust (src/rust)
- `Cargo.toml` declares crate `btree10` (lib) and a binary `btreectl`. Edition 2021, `lto = "thin"`, `codegen-units = 1` for release.
- `src/lib.rs` contains:
  - Constants `T = 2`, `MAX_KEYS = 3`, `MIN_KEYS = 1`.
  - `Node { is_leaf, keys: Vec<(Vec<u8>, Vec<u8>)>, children: Vec<Box<Node>> }`.
  - `BTree` with `new`, `get`, `insert`, `delete`, `serialize_tree`, `scan`, `len`, `is_empty`.
  - Free functions `split_child`, `insert_nonfull`, `delete_from`, plus the rebalance helpers `borrow_from_prev`, `borrow_from_next`, `merge_children`.
  - `SplitMix64` PRNG (the textbook wrapping-add + xor-mul mix).
  - `run_workload(scenario, seed, ops) -> BTree`.
  - Inline `#[cfg(test)]` tests: empty-tree shape, single insert+get, insert + scan ordered, delete-of-absent returns false, delete-then-get returns None, deterministic shape under the three scenarios, scenario-cross seed independence.
- `src/bin/btreectl.rs`: thin arg parser (`--seed`, `--ops`, `--scenario`), calls `run_workload`, writes `serialize_tree()` bytes to stdout.
2. Go (src/go)
- go.mod module github.com/10xdev/dse/db10, Go 1.22.
- btree.go ports the Rust API one-for-one. Pointer-based recursion: *node instead of Box<Node>. The serializer is byte-identical to Rust's: same preorder, same little-endian encodings.
- btree_test.go mirrors all Rust tests.
- cmd/btreectl/main.go is the matching CLI.
3. C++ (src/cpp)
- CMakeLists.txt builds:
  - btree10_lib (static library from src/btree.cc).
  - btreectl (binary linking btree10_lib).
  - test_btree10 (ctest target linking btree10_lib).
  - Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
- src/btree.h declares Node, BTree, run_workload, SplitMix64.
- src/btree.cc implements them. std::unique_ptr<Node> plays the role of Rust's Box<Node>.
- src/btreectl.cc is the CLI.
- tests/test_btree10.cc mirrors Rust's inline tests. Uses #undef NDEBUG before <cassert> so asserts fire under Release; never assert(side_effect).
4. Scripts
- scripts/verify.sh builds and runs unit tests in all three languages. Exits 0 only if all three are green; prints === OK ===.
- scripts/cross_test.sh:
  - Builds Rust/Go/C++ btreectl binaries.
  - Scenario A: btreectl --seed 42 --ops 500 --scenario inserts in each language; sha256 + size comparison.
  - Scenario B: btreectl --seed 7 --ops 500 --scenario mixed; sha256 + size comparison.
  - Spot-check on the Rust scenario-A output: assert a known key-prefix appears in the hex stream, guarding against silent-empty-output regressions.
  - Print === ALL OK ===.
What was deliberately not built
- Persistence. No file I/O, no page format. db-11.
- Range scans with iterator-style streaming. scan() returns the whole list; sufficient for tests, lazy for the spec.
- Bulk-loading from a sorted input. A real B-tree would offer a fast path that builds the tree bottom-up. db-15 may revisit.
- Concurrency control. No latches, no locks. Trees of T = 2 fit comfortably in a single thread's working set and the lab has no concurrent test harness.
db-10 — Observation
What the cross-language verification actually proves, and what the serialized stream looks like by hand.
Output of scripts/cross_test.sh
=== compare Scenario A (inserts seed=42 ops=500) ===
A rust=4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 ( ???? B)
A go =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 ( ???? B)
A cpp =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 ( ???? B)
match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
=== compare Scenario B (mixed seed=7 ops=500) ===
B rust=9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 ( 2515 B)
B go =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 ( 2515 B)
B cpp =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 ( 2515 B)
match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== spot-check stream contents ===
spot-checks ok
=== ALL OK ===
Reading the stream by hand
The empty tree is exactly five bytes:
01 is_leaf = 1
00 00 00 00 nkeys = 0
After one insert (key="a", val="1"):
01 is_leaf = 1
01 00 00 00 nkeys = 1
01 00 00 00 klen = 1
61 key = "a"
01 00 00 00 vlen = 1
31 val = "1"
After the fourth distinct key, the root must split:
00 is_leaf = 0 ← became internal
01 00 00 00 nkeys = 1
04 00 00 00 klen = 4 ← promoted middle key
… key bytes …
04 00 00 00 vlen = 4
… val bytes …
01 00 00 00 … left child (preorder)
01 00 00 00 … right child (preorder)
The is_leaf byte changes from 01 to 00 precisely at the moment
the root grows upwards. There is no other operation that flips this
byte for the root.
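The hand-decoding above can be mechanised. The following is a minimal sketch, not the lab's code: StreamStats, take, walk, and inspect are hypothetical names, and the only thing assumed about the stream is the layout shown above (is_leaf byte, little-endian u32 counts and lengths, children in preorder).

```rust
/// Summary of one pass over a serialized tree (hypothetical helper type).
#[derive(Debug, Default, PartialEq)]
pub struct StreamStats {
    pub nodes: usize,
    pub keys: usize,
    pub max_depth: usize,
}

/// Consume `n` bytes starting at `*pos`, or return None if the stream is short.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(n)?;
    let bytes = buf.get(*pos..end)?;
    *pos = end;
    Some(bytes)
}

/// Read one little-endian u32 length or count field.
fn read_u32_le(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let bytes = take(buf, pos, 4)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

/// Decode one node and, for internal nodes, its nkeys + 1 children in preorder.
fn walk(buf: &[u8], pos: &mut usize, depth: usize, stats: &mut StreamStats) -> Option<()> {
    let is_leaf = take(buf, pos, 1)?[0];
    if is_leaf > 1 {
        return None; // only 0 (internal) and 1 (leaf) are valid
    }
    let nkeys = read_u32_le(buf, pos)? as usize;
    stats.nodes += 1;
    stats.keys += nkeys;
    stats.max_depth = stats.max_depth.max(depth);
    for _ in 0..nkeys {
        let klen = read_u32_le(buf, pos)? as usize;
        take(buf, pos, klen)?; // key bytes
        let vlen = read_u32_le(buf, pos)? as usize;
        take(buf, pos, vlen)?; // value bytes
    }
    if is_leaf == 0 {
        for _ in 0..=nkeys {
            walk(buf, pos, depth + 1, stats)?;
        }
    }
    Some(())
}

/// Walk a whole stream; None means truncated, malformed, or trailing bytes.
pub fn inspect(buf: &[u8]) -> Option<StreamStats> {
    let mut pos = 0;
    let mut stats = StreamStats::default();
    walk(buf, &mut pos, 1, &mut stats)?;
    if pos == buf.len() { Some(stats) } else { None }
}
```

Feeding inspect the five-byte empty tree reports one node and zero keys; a max_depth of 2 or more means the root has split at least once.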
What the matching sha256 proves
A single matching match(...) line proves that all three
implementations agree on, at the byte level:
- The PRNG. Any drift in SplitMix64 would shuffle the key stream, and the first key bytes in the serialized tree would almost certainly change.
- The lexicographic byte compare. Different ordering would re-route the descent at every internal node from key 4 onward.
- The proactive-split rule. Different split rules would produce different children counts and nkeys fields at every level above the leaves.
- The proactive-rebalance rule (Scenario B). The mixed scenario hits both insert and delete paths; the matching hash proves the borrow/merge logic agrees across all three.
- The preorder serializer with little-endian length prefixes. Different endianness would flip every multi-byte field in the stream, and a different node order would rearrange the nodes themselves.
Any one of these going wrong, in any one of the three languages, makes the hashes diverge.
Sizes
Scenario B settles at exactly 2515 B for seed=7, ops=500, scenario=mixed. The Scenario A size varies but is also identical
across all three languages (see the script output).
Spot-check rationale
The script greps the Rust scenario-A output for a known key prefix
that must be inserted by SplitMix64(42)'s first few outputs.
This guards against the silent-success regression where every
language is "successfully" producing the same five-byte empty-tree
header and nothing else.
db-10 — Verification
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
- cmake ≥ 3.20.
- Rust toolchain ≥ 1.74.
- Go ≥ 1.22.
- shasum, xxd, awk (default on macOS; coreutils on Linux).
One command
cd db-10-btree-fundamentals
scripts/verify.sh # unit tests, all three languages
scripts/cross_test.sh # cross-language sha256 match
Both should print === OK === / === ALL OK === and exit 0.
Per-language drill-down
Rust
cd db-10-btree-fundamentals/src/rust
cargo test --quiet
cargo build --release
Expected: all inline tests pass. The btreectl binary lands in
target/release/btreectl.
Go
cd db-10-btree-fundamentals/src/go
go test ./...
go build ./cmd/btreectl
Expected: ok github.com/10xdev/dse/db10 <duration>.
C++
cd db-10-btree-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and the
test_btree10 target prints OK.
What "green" means
A green run guarantees:
- All inline unit tests pass in Rust, Go, and C++.
- The cross-language test produces byte-identical serialized trees for both canonical scenarios:

  | scenario | seed | ops | sha256 |
  |---|---|---|---|
  | A inserts | 42 | 500 | 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 |
  | B mixed | 7 | 500 | 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 |

  Matching sha256 across three independent implementations proves agreement on the PRNG, the lexicographic compare, the proactive-split insert, the proactive-rebalance delete, and the precise tree shape after the workload.
- The spot-check confirms the stream is non-empty and contains an expected key prefix, guarding against the regression where all three languages "successfully" produce the same five-byte empty-tree header.
When verification fails
- Serialized streams differ within the root node's first few bytes: SplitMix64 divergence, or a wrong is_leaf value on the root.
- Mismatch deep in the stream after matching headers: split or rebalance asymmetry; almost always a borrow-vs-merge decision that goes one way in two languages and the other in the third.
- One language's scenario A matches but scenario B does not: a delete-path bug specific to that language. The inserts scenario never invokes delete, so it would not exercise the faulty path.
- All three sha256s match each other but disagree with the baked-in expected hashes: the algorithm changed in all three implementations. Make sure it was intentional, then update cross_test.sh and the table above in the same commit.
db-10 — Broader Ideas
Where this in-memory B-tree fits in the rest of the track, and which real-world techniques live one or two steps beyond it.
Immediate next labs
- db-11: Pager system. Wraps each node in a fixed-size disk page. Trades the heap-allocated Box<Node> recursion for a PageId-indexed page cache plus a free-list. Introduces the B+-tree layout (values only in leaves; leaves doubly linked for range scans) because internal nodes must fit one to a page.
- db-12: SQL frontend. Parses a small SQL subset (CREATE TABLE, INSERT, SELECT, UPDATE, DELETE), plans it into a B+-tree-backed table, and exposes a REPL.
- db-13: Transactions and MVCC. Versioned B+-tree pages so readers do not block writers. Snapshots are root-page references at a given commit timestamp.
- db-14: Indexes and query optimization. Secondary B-trees whose keys are (indexed_column, primary_key) pairs. Plans index scans, index-only scans, and merge joins.
- db-15: SQLite-complete. Everything above stitched into one executable; the B-tree track's counterpart to db-09.
How this lab's pieces show up in real systems
- T = 2 "demo size" B-trees are exactly what every textbook uses as a teaching aid, including the one most engineers learn on. Production engines use T chosen to fit a 4 KiB / 8 KiB page, but the algorithms are unchanged.
- Proactive split / rebalance is the standard discipline; the alternative (descend, then walk back up to fix overflows) is textbook for binary search trees but rare in B-trees because it makes concurrency control much harder.
- Preorder canonical serialization serves the same purpose as SQLite's page-dump tooling and RocksDB's sst_dump for SSTables: a byte-exact view of the stored structure. Every storage engineer needs some byte-exact dump format; here we picked the simplest one that captures structure.
- SplitMix64 is a widely used mixing generator: it backs Java 8's SplittableRandom and is the usual seeder for the xoshiro / xoroshiro family. Using it for the workload generator means the keys we touch are well distributed, not pathologically biased.
Performance experiments worth running later
- Plot len() vs serialized size to see the per-key overhead at T = 2. Compare to T = 64 (db-11) to see how internal-node shrinkage from B+-tree leaves changes the breakdown.
- Sweep KEY_SPACE from 100 up to 100 000 and watch the insert-delete-insert workload's steady-state size oscillate.
- Replace the recursive serialize_tree with an explicit-stack iterative version and measure the wall-time gap. Useful prep before db-22.
What "production-quality" would require beyond this lab
- Variable-length keys with prefix compression on the page.
- Page-level checksums and a magic byte at offset 0 so a corrupted read fails loudly instead of returning random keys.
- Free-list management for reclaimed pages after deletes (db-11).
- Concurrent insert/delete protocols: latch coupling, optimistic lock coupling, or Lehman and Yao's B-link tree right-link technique for non-blocking traversal during split.
- Copy-on-write pages so readers see a consistent snapshot during writes (LMDB-style).
- A persistent WAL record per page mutation so the tree can be replayed on recovery (db-03 / db-11).
None of these change the shape of the in-memory algorithms; they add policies on top of the same proactive-split / proactive-rebalance kernel built here.
db-10 step 01 — Tree shape and get / scan
Goal
Define the in-memory B-tree's node representation and implement the
two read-only operations: point lookup get(k) and ordered
scan() -> Vec<(k,v)>. No mutation yet; this step is about pinning the
data structure.
Tasks
- Declare constants T = 2, MIN_KEYS = T - 1, MAX_KEYS = 2T - 1.
- Define a Node containing:
  - is_leaf: bool
  - keys: Vec<(Vec<u8>, Vec<u8>)>, sorted by key
  - children: Vec<Box<Node>>, empty for leaves; for internal nodes, children.len() == keys.len() + 1
- Define BTree { root: Box<Node> }. new() produces an empty leaf root.
- Implement get(&self, key: &[u8]) -> Option<Vec<u8>> by descending from the root: at each node, find the first key >= target; if equal, return its value; if leaf, return None; else recurse into children[i].
- Implement scan(&self) -> Vec<(Vec<u8>, Vec<u8>)> as the standard in-order traversal: for each i in 0..keys.len(), recurse into children[i], push keys[i]; finally recurse into the last child. Leaves skip the recursion and only push their keys.
- Implement len() and is_empty() as helpers.
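The two read paths in the task list can be sketched as free functions over a hand-built node. This is a minimal sketch under the task list's Node shape; the leaf helper is hypothetical and exists only so a tree can be assembled without insert, which the step has not introduced yet.

```rust
pub struct Node {
    pub is_leaf: bool,
    pub keys: Vec<(Vec<u8>, Vec<u8>)>, // sorted by key
    pub children: Vec<Box<Node>>,      // empty for leaves
}

/// Hypothetical test helper: build a leaf from already-sorted string pairs.
pub fn leaf(entries: &[(&str, &str)]) -> Node {
    Node {
        is_leaf: true,
        keys: entries
            .iter()
            .map(|(k, v)| (k.as_bytes().to_vec(), v.as_bytes().to_vec()))
            .collect(),
        children: Vec::new(),
    }
}

/// Point lookup: find the first key >= target, match it or descend.
pub fn get(node: &Node, key: &[u8]) -> Option<Vec<u8>> {
    let i = node
        .keys
        .iter()
        .position(|(k, _)| k.as_slice() >= key)
        .unwrap_or(node.keys.len());
    if i < node.keys.len() && node.keys[i].0.as_slice() == key {
        return Some(node.keys[i].1.clone());
    }
    if node.is_leaf {
        return None;
    }
    get(&node.children[i], key)
}

/// In-order traversal: child i, then key i, and finally the last child.
pub fn scan(node: &Node, out: &mut Vec<(Vec<u8>, Vec<u8>)>) {
    for (i, entry) in node.keys.iter().enumerate() {
        if !node.is_leaf {
            scan(&node.children[i], out);
        }
        out.push(entry.clone());
    }
    if !node.is_leaf {
        if let Some(last) = node.children.last() {
            scan(last, out);
        }
    }
}
```

The linear position search mirrors the first discussion prompt below; swapping it for a binary search changes nothing else in either function.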
Acceptance
Inline unit tests:
- get_on_empty_returns_none: BTree::new().get(b"k") == None.
- manual_build_get_returns_value: manually construct a 3-key leaf; get returns the right value for each key and None for misses.
- scan_is_sorted: manually construct an internal node with two leaf children; scan() returns the merged sorted sequence.
All three green in Rust, Go, and C++.
Discussion prompts
- Why does get use linear scan over keys rather than binary search? For T = 2 the answer is obvious; for T = 256 is it still?
- Why is is_leaf stored on each node rather than inferred from children.is_empty()?
- What goes wrong if scan recurses into the last child before pushing the last key?
db-10 step 02 — Insert and delete (split, borrow, merge)
Goal
Implement mutation: insert(k, v) and delete(k) -> bool. Both
operations must preserve the height invariant — every leaf at the same
depth, every node within [MIN_KEYS, MAX_KEYS] (except the root).
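The invariant in this goal is cheap to check mechanically after every mutation in a test. Below is a minimal sketch of such a checker; check_invariants is a hypothetical helper that is not part of the lab's stated API, and it deliberately leaves out the check that each child's keys fall between its parent's separators.

```rust
pub const T: usize = 2;
pub const MIN_KEYS: usize = T - 1;
pub const MAX_KEYS: usize = 2 * T - 1;

pub struct Node {
    pub is_leaf: bool,
    pub keys: Vec<(Vec<u8>, Vec<u8>)>,
    pub children: Vec<Box<Node>>,
}

/// Returns the height of the subtree (a lone leaf has height 1), or a
/// message describing the first violated invariant.
pub fn check_invariants(node: &Node, is_root: bool) -> Result<usize, String> {
    let n = node.keys.len();
    // Every node holds at most MAX_KEYS; only the root may drop below MIN_KEYS.
    if n > MAX_KEYS || (!is_root && n < MIN_KEYS) {
        return Err(format!("node holds {n} keys"));
    }
    if !node.keys.windows(2).all(|w| w[0].0 < w[1].0) {
        return Err("keys are not strictly ascending".to_string());
    }
    if node.is_leaf {
        if !node.children.is_empty() {
            return Err("leaf has children".to_string());
        }
        return Ok(1);
    }
    if node.children.len() != n + 1 {
        return Err(format!("{} children for {n} keys", node.children.len()));
    }
    // All children must report the same height, so all leaves share one depth.
    let mut height = None;
    for child in &node.children {
        let h = check_invariants(child, false)?;
        match height {
            None => height = Some(h),
            Some(prev) if prev != h => {
                return Err("leaves at different depths".to_string());
            }
            _ => {}
        }
    }
    Ok(height.unwrap_or(0) + 1)
}
```

Calling check_invariants(&root, true) after each insert or delete in a unit test turns a silent shape bug into a failure at the operation that caused it.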
Tasks
- Insert.
  - If root.keys.len() == MAX_KEYS, grow up: wrap the old root in a new internal root with one child, then split_child(new_root, 0). This is the only place height ever increases.
  - Then insert_nonfull(root, k, v).
- insert_nonfull(node, k, v).
  - If node.is_leaf: splice the entry into the sorted slot. If the key already exists, overwrite the value in place.
  - Else: find the smallest i such that key <= node.keys[i].0, or i = keys.len() if there is none (the child whose range covers the key). If key equals node.keys[i].0, overwrite that value in place and stop, because this B-tree keeps values in internal nodes too. If children[i] is full, split first (pre-emptive split), then compare against the promoted key now at node.keys[i]: overwrite on equality, advance i if key is greater. Recurse into children[i].
- split_child(parent, i). Precondition: parent.children[i].keys.len() == MAX_KEYS == 3. Effect:
  - Promote the middle key (index 1) into parent.keys[i].
  - Move the right half (key 2 plus children 2..=3) into a new sibling at parent.children[i + 1].
  - The left half (key 0 plus children 0..=1) remains in parent.children[i].
- Delete. Recursive delete_from(node, k) -> bool that maintains the invariant that the node we're descending into has ≥ T keys. Three cases:
  - Key in this leaf → splice out, return true.
  - Key in this internal node → replace with in-order predecessor or successor (drawn from whichever neighbor child has ≥ T keys), then recursively delete that pred/succ. If neither neighbor child has ≥ T keys, merge the two children around the key and recurse into the merged child.
  - Key not in this node (descending into children[i]):
    - If children[i].keys.len() < T, borrow from children[i-1] or children[i+1] if one of them has > MIN_KEYS. Prefer left. Otherwise merge children[i] with its left or right sibling (prefer right if it exists, else left), pulling the separating key from the parent.
    - Recurse into children[i] (which is now safe).
- After delete_from returns, if root became a keyless internal node, collapse: root = root.children.remove(0). This is the only place height ever decreases.
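The insert half of the task list can be sketched as follows. This is a minimal sketch for T = 2 only, not the lab's reference code: new_leaf is a hypothetical constructor, the functions are free functions rather than BTree methods, and the sketch assumes that a key matching an internal node's separator is overwritten in place (needed so that re-inserting a key never creates a duplicate).

```rust
use std::cmp::Ordering;

pub const MAX_KEYS: usize = 3; // 2T - 1 with T = 2

pub struct Node {
    pub is_leaf: bool,
    pub keys: Vec<(Vec<u8>, Vec<u8>)>,
    pub children: Vec<Box<Node>>,
}

/// Hypothetical constructor for an empty leaf root.
pub fn new_leaf() -> Box<Node> {
    Box::new(Node { is_leaf: true, keys: Vec::new(), children: Vec::new() })
}

/// Split the full child at index i: key 1 moves up, key 2 moves right.
pub fn split_child(parent: &mut Node, i: usize) {
    let child = &mut parent.children[i];
    debug_assert_eq!(child.keys.len(), MAX_KEYS);
    let right_keys = child.keys.split_off(2); // key 2
    let middle = child.keys.pop().expect("full child has a middle key");
    let right_children = if child.is_leaf {
        Vec::new()
    } else {
        child.children.split_off(2) // children 2..=3
    };
    let right = Box::new(Node {
        is_leaf: child.is_leaf,
        keys: right_keys,
        children: right_children,
    });
    parent.keys.insert(i, middle);
    parent.children.insert(i + 1, right);
}

fn insert_nonfull(node: &mut Node, key: &[u8], val: &[u8]) {
    // First slot whose key is >= the target, or keys.len() if there is none.
    let mut i = node
        .keys
        .iter()
        .position(|(k, _)| k.as_slice() >= key)
        .unwrap_or(node.keys.len());
    if i < node.keys.len() && node.keys[i].0.as_slice() == key {
        node.keys[i].1 = val.to_vec(); // overwrite in place
        return;
    }
    if node.is_leaf {
        node.keys.insert(i, (key.to_vec(), val.to_vec()));
        return;
    }
    if node.children[i].keys.len() == MAX_KEYS {
        split_child(node, i);
        // The promoted key now sits at keys[i]; re-aim the descent.
        match key.cmp(node.keys[i].0.as_slice()) {
            Ordering::Equal => {
                node.keys[i].1 = val.to_vec();
                return;
            }
            Ordering::Greater => i += 1,
            Ordering::Less => {}
        }
    }
    insert_nonfull(&mut node.children[i], key, val);
}

/// Grow the tree upward if the root is full, then insert top-down.
pub fn insert(root: &mut Box<Node>, key: &[u8], val: &[u8]) {
    if root.keys.len() == MAX_KEYS {
        let new_root = Box::new(Node { is_leaf: false, keys: Vec::new(), children: Vec::new() });
        let old_root = std::mem::replace(root, new_root);
        root.children.push(old_root);
        split_child(root, 0);
    }
    insert_nonfull(root, key, val);
}
```

Because every full child is split before the descent enters it, insert_nonfull never has to walk back up; delete_from follows the same top-down discipline with borrow and merge in place of split.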
Acceptance
Inline unit tests:
- insert_then_get_roundtrip: insert 50 keys, all of them retrievable.
- insert_overwrites: inserting ("k", "v1") then ("k", "v2") yields get("k") == "v2" and len() == 1.
- delete_existing_returns_true: insert "k", delete "k" returns true, get("k") returns None.
- delete_missing_returns_false: BTree::new().delete(b"k") is false.
- inserts_grow_tree: insert enough keys to force at least one grow-up; check len() matches insertions.
- deletes_shrink_tree: insert N keys then delete them all; len() goes to 0, tree is still well-formed (collapsed root).
All six green in Rust, Go, and C++.
Discussion prompts
- Why is pre-emptive split preferred over "descend, recurse, split on the way back"?
- For deletion, why must we ensure children[i].keys.len() >= T before descending, not after?
- What's the tie-break rule when both siblings have spare keys: borrow from left or right? What's the cost of getting it wrong?
- How would copy-on-write change split_child and delete_from?
db-10 step 03 — Serialize + CLI + cross-language byte-identity
Goal
Define a canonical wire format for the tree, build a btreectl CLI
that runs a deterministic workload and writes the serialized tree to
stdout, then prove via sha256 that all three implementations produce
identical bytes for two distinct scenarios.
Wire format
Preorder traversal. Per node, in this exact order:
u8 is_leaf (1 = leaf, 0 = internal)
u32 LE nkeys
nkeys * { u32 LE klen, key bytes, u32 LE vlen, val bytes }
if !is_leaf:
    (nkeys + 1) * recurse(child_j)
All length prefixes are little-endian (matches every other lab in
the workspace). The empty tree serializes as 01 00 00 00 00 (one
empty leaf).
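The format above maps onto a short recursive function. A minimal sketch, assuming the Node shape from step 01; the lab exposes this as BTree::serialize_tree, while the free function serialize here is only an illustration.

```rust
pub struct Node {
    pub is_leaf: bool,
    pub keys: Vec<(Vec<u8>, Vec<u8>)>,
    pub children: Vec<Box<Node>>,
}

/// Append the preorder encoding of `node` and its subtree to `out`.
pub fn serialize(node: &Node, out: &mut Vec<u8>) {
    out.push(u8::from(node.is_leaf)); // 1 = leaf, 0 = internal
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (k, v) in &node.keys {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    if !node.is_leaf {
        // Children follow the node's own entries, left to right.
        for child in &node.children {
            serialize(child, out);
        }
    }
}
```

Writing into a caller-supplied Vec<u8> keeps the whole tree in one allocation that grows as needed, instead of concatenating one buffer per node.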
Deterministic workload
KEY_SPACE = 200
run_workload(scenario, seed, ops):
    rng = SplitMix64(seed)
    tree = BTree::new()
    for i in 0..ops:
        r1 = rng.next_u64()
        r2 = rng.next_u64()                      # ALWAYS draw both
        key = (r1 % KEY_SPACE).to_be_bytes()     # u64 BE = 8 bytes
        val = (r2 as u32).to_be_bytes()          # u32 BE = 4 bytes
        match scenario:
            "inserts" : tree.insert(&key, &val)
            "deletes" : if i < ops/2: tree.insert(&key, &val) else: tree.delete(&key)
            "mixed"   : op = (r1 >> 62) & 0x3
                        0 | 1 -> insert ; 2 -> delete ; 3 -> skip
    return tree
Two PRNG draws per iteration is non-negotiable; if any implementation short-circuits the second draw on a skip branch, the seed → state mapping desyncs.
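One iteration of the mixed scenario can be sketched in isolation from the tree. This is a minimal sketch: the SplitMix64 body follows the standard published algorithm with the constants listed in the tasks below, while Op, Step, and next_mixed_step are hypothetical names introduced only for illustration.

```rust
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Standard SplitMix64 step: wrapping add, then two xor-shift-multiply rounds.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

pub const KEY_SPACE: u64 = 200;

#[derive(Debug, PartialEq)]
pub enum Op {
    Insert,
    Delete,
    Skip,
}

pub struct Step {
    pub key: [u8; 8], // u64 big-endian
    pub val: [u8; 4], // u32 big-endian
    pub op: Op,
}

/// Draw exactly two values, then derive key, value, and op for "mixed".
pub fn next_mixed_step(rng: &mut SplitMix64) -> Step {
    let r1 = rng.next_u64();
    let r2 = rng.next_u64(); // drawn even when the op turns out to be Skip
    let op = match (r1 >> 62) & 0x3 {
        0 | 1 => Op::Insert,
        2 => Op::Delete,
        _ => Op::Skip,
    };
    Step {
        key: (r1 % KEY_SPACE).to_be_bytes(),
        val: (r2 as u32).to_be_bytes(),
        op,
    }
}
```

Deriving the op from the top two bits of r1 and the key from r1 modulo KEY_SPACE means one draw feeds both decisions, so the second draw exists purely to keep the generator state in lockstep across languages.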
CLI contract
btreectl --seed N --ops M --scenario {inserts | deletes | mixed}
Writes the canonical wire-format bytes (no trailing newline) to stdout.
Tasks
- Add serialize_tree(&self) -> Vec<u8> to BTree. Pure function; does not mutate the tree.
- Implement the SplitMix64 PRNG with the standard constants (0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB).
- Implement run_workload per the spec above.
- Implement btreectl in Rust, Go, and C++.
- Write scripts/verify.sh that runs unit tests in all three langs.
- Write scripts/cross_test.sh that:
  - Builds all three binaries.
  - Scenario A: btreectl --seed 42 --ops 500 --scenario inserts → sha256 all three → assert match. Hash: 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088.
  - Scenario B: btreectl --seed 7 --ops 500 --scenario mixed → sha256 all three → assert match. Hash: 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718.
  - Spot-check that the stream contains an expected byte sequence (defensive against silent-empty regressions).
  - Print === ALL OK ===.
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
...
match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== ALL OK ===
A byte-identical hash across three independent implementations for both scenarios is strong evidence that the PRNG, key/value encoding, insert path, delete path, and serialization format are all spec-compliant.
Discussion prompts
- Why must we draw two PRNG outputs per iteration even when the scenario chooses to skip?
- Why is the wire format preorder rather than level-order or in-order? What property does preorder preserve that the others lose?
- If the Scenario-A hash matches but Scenario-B doesn't, what code paths are the prime suspects, and why?
- The sha256s are baked into cross_test.sh as constants. What is the benefit, and what is the maintenance cost when the wire format legitimately evolves?
db-11 — Pager System
The first lab of the B-tree track where bytes leave RAM. db-10 built
a B-tree out of Box<Node>s and proved three languages agreed on
shape; this lab builds the substrate that turns those shapes into
durable files. Every disk-backed engine in the series from here on
— SQLite (db-15), MVCC (db-13), the distributed KV store (db-20),
and the capstone (db-23) — sits on top of a pager. This is the
component most production databases share.
What is it?
A pager is the layer that:
- Carves a file into fixed-size pages (we use 4 KiB by default; tests run with 256 B to keep dumps readable).
- Hands out pages by 1-based page id; page 0 is reserved for a file header that nails down the format.
- Maintains an in-memory page cache of bounded capacity, evicts with LRU, and writes dirty pages back to disk on eviction and on explicit flush().
- Calls fsync exactly when the user asks for durability, never on every write.
The interface is intentionally minimal:
open(path, page_size, cache_capacity) -> Pager
Pager::allocate() -> PageId // grow file by one page
Pager::read(pid) -> Vec<u8> // page_size bytes
Pager::write(pid, bytes) // bytes.len() == page_size
Pager::flush() // write all dirty + fsync
Pager::close()
No B-tree nodes, no records, no keys. The B+-tree in db-15 will encode those structures into the page bytes; the pager neither knows nor cares what the bytes mean.
Why does it matter?
- The cache is the database. Every production engine spends most of its time hitting a buffer pool, not reading disk. The LRU policy, the dirty bit, and the eviction discipline are the difference between "fits in RAM, fast" and "thrashes, dead".
- The file layout is a binding contract. Once two implementations agree on byte 0 of every page, the database is portable across languages and platforms. db-15 will reuse this contract; the cross-language hash test in this lab proves it holds before the B+-tree code ever runs.
- fsync is the only thing that buys durability. Every other write is just a hint to the OS. Knowing exactly when fsync runs (and why) is what separates working systems from data-loss outages.
How does it work?
File layout
offset 0 offset = N * page_size
┌─────────────────────────┬─────────────┬─────────────┬─────┐
│ page 0 (header) │ page 1 │ page 2 │ ... │
│ magic | psz | npages │ user bytes │ user bytes │ │
│ + zero-pad to page_size│ │ │ │
└─────────────────────────┴─────────────┴─────────────┴─────┘
Page 0 is 24 bytes of header + zero-padding:
| offset | size | field | value |
|---|---|---|---|
| 0 | 16 | magic | "DSE-PAGER-v1\0\0\0\0" (ASCII + NULs) |
| 16 | 4 | page_size | u32 little-endian |
| 20 | 4 | num_pages | u32 little-endian (includes page 0) |
| 24 | rest | zeros | padding to page_size |
num_pages is the durable page count — what the file claims after
fsync. The in-memory pager may have allocated pages beyond that
which have not been flushed yet; close()/flush() reconcile them.
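The header table translates directly into an encoder and a decoder. A minimal sketch, assuming only the layout above; encode_header, decode_header, and page_offset are hypothetical names rather than the lab's API.

```rust
pub const MAGIC: [u8; 16] = *b"DSE-PAGER-v1\0\0\0\0";
pub const HEADER_LEN: usize = 24;

/// Build page 0: magic, page_size, num_pages, then zero padding.
pub fn encode_header(page_size: u32, num_pages: u32) -> Vec<u8> {
    assert!(page_size as usize >= HEADER_LEN, "page too small for header");
    let mut page = vec![0u8; page_size as usize];
    page[0..16].copy_from_slice(&MAGIC);
    page[16..20].copy_from_slice(&page_size.to_le_bytes());
    page[20..24].copy_from_slice(&num_pages.to_le_bytes());
    page
}

/// Parse page 0 and return (page_size, num_pages), rejecting a bad magic.
pub fn decode_header(page: &[u8]) -> Result<(u32, u32), String> {
    if page.len() < HEADER_LEN {
        return Err("header shorter than 24 bytes".to_string());
    }
    if page[0..16] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let page_size = u32::from_le_bytes(page[16..20].try_into().unwrap());
    let num_pages = u32::from_le_bytes(page[20..24].try_into().unwrap());
    Ok((page_size, num_pages))
}

/// Byte offset of a page inside the file.
pub fn page_offset(pid: u32, page_size: u32) -> u64 {
    u64::from(pid) * u64::from(page_size)
}
```

An open routine would typically compare the decoded page_size with the one the caller asked for and refuse to proceed on a mismatch; that check is left out of the sketch.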
Cache, in pictures
cache_capacity = 3, recent = [pid=5] [pid=2] [pid=7]
MRU LRU
read(5) → hit, promote 5 to head [5] [2] [7]
write(9) → miss, evict 7 (writeback) [9*] [5] [2] ← 9 dirty
read(2) → hit, promote 2 [2] [9*] [5]
flush() → write 9, fsync [2] [9 ] [5]
Each frame in the cache carries:
- pid: u32
- data: Vec<u8> of length page_size
- dirty: bool
- linked-list pointers (prev / next) into the LRU chain
The lookup table is a hashmap pid → frame_index (Rust) or pid → *list.Element (Go) or pid → list iterator (C++). All three give
O(1) lookup; promotion to MRU is O(1) doubly-linked-list splice.
Read path
read(pid):
    if pid == 0 or pid >= num_pages_in_memory: panic
    if pid in cache:
        promote cache[pid] to MRU
        cache_hits += 1
        return clone of cache[pid].data
    else:
        cache_misses += 1
        if cache is full:
            evict tail; if dirty, pwrite then mark clean
        buffer = pread(page_size bytes at offset pid * page_size)
        insert (pid, buffer, dirty=false) at MRU
        return clone
Write path
write(pid, bytes):
    assert bytes.len() == page_size
    if pid in cache:
        cache[pid].data = bytes
        cache[pid].dirty = true
        promote to MRU
    else:
        if cache is full: evict tail with write-back as above
        insert (pid, bytes, dirty=true) at MRU   ← no read!
The "write-without-read" path is the optimization that makes bulk loads cheap. A B+-tree splitting a leaf allocates a fresh page and writes the whole 4 KiB at once; reading the old (uninitialized) contents first would double I/O for nothing.
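The read and write paths above can be sketched with two simplifications that keep the example short, both of which are assumptions of this sketch and not the lab's design: recency is tracked in a VecDeque (O(n) promotion instead of the O(1) linked list), and the "disk" is an in-memory HashMap, so no file I/O or bounds checks on pid are involved. Cache and make_room are hypothetical names.

```rust
use std::collections::{HashMap, VecDeque};

struct Frame {
    data: Vec<u8>,
    dirty: bool,
}

pub struct Cache {
    page_size: usize,
    capacity: usize,
    frames: HashMap<u32, Frame>,
    order: VecDeque<u32>,            // front = MRU, back = LRU
    pub disk: HashMap<u32, Vec<u8>>, // in-memory stand-in for the file
    pub hits: u64,
    pub misses: u64,
}

impl Cache {
    pub fn new(page_size: usize, capacity: usize) -> Self {
        assert!(capacity >= 1);
        Self {
            page_size,
            capacity,
            frames: HashMap::new(),
            order: VecDeque::new(),
            disk: HashMap::new(),
            hits: 0,
            misses: 0,
        }
    }

    fn promote(&mut self, pid: u32) {
        self.order.retain(|&p| p != pid);
        self.order.push_front(pid);
    }

    /// Evict the LRU frame before admitting a new one; dirty frames are written back.
    fn make_room(&mut self) {
        if self.frames.len() < self.capacity {
            return;
        }
        if let Some(victim) = self.order.pop_back() {
            if let Some(frame) = self.frames.remove(&victim) {
                if frame.dirty {
                    self.disk.insert(victim, frame.data);
                }
            }
        }
    }

    pub fn read(&mut self, pid: u32) -> Vec<u8> {
        if self.frames.contains_key(&pid) {
            self.hits += 1;
        } else {
            self.misses += 1;
            self.make_room();
            // Pages never written read back as zeros in this stand-in.
            let data = self
                .disk
                .get(&pid)
                .cloned()
                .unwrap_or_else(|| vec![0; self.page_size]);
            self.frames.insert(pid, Frame { data, dirty: false });
        }
        self.promote(pid);
        self.frames[&pid].data.clone()
    }

    pub fn write(&mut self, pid: u32, bytes: &[u8]) {
        assert_eq!(bytes.len(), self.page_size);
        if !self.frames.contains_key(&pid) {
            self.make_room(); // write-without-read: no load of the old bytes
        }
        self.frames.insert(pid, Frame { data: bytes.to_vec(), dirty: true });
        self.promote(pid);
    }
}
```

make_room runs before the new frame is inserted, so the resident frame count never exceeds the capacity, even transiently.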
Allocate
allocate():
    pid = num_pages_in_memory
    num_pages_in_memory += 1
    return pid   (1-based; pid 0 is the header)
The file is extended lazily — only when the page is actually written
back (either via eviction or flush). This means a sequence of
allocate(); allocate(); allocate() without writes never touches
disk, which matters for transactions that roll back.
Flush
flush():
    rewrite page 0 with current num_pages
    for each cached page in ascending pid order:
        if dirty: pwrite at offset pid*page_size; mark clean
    fsync
Sorting by pid before write turns N scattered seeks into one
sequential pass on a spinning disk. On SSDs the win is smaller but
still real: ascending offsets let the kernel and the device merge neighbouring writes into larger requests.
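The flush loop can be sketched against any seekable writer. This is a minimal sketch under stated assumptions: DirtyPage, flush_pages, and flush_to_memory are hypothetical names, the caller supplies the already-encoded header page, and the generic writer has no fsync, so a real pager operating on std::fs::File would call sync_all after this returns.

```rust
use std::io::{self, Cursor, Seek, SeekFrom, Write};

pub struct DirtyPage {
    pub pid: u32,
    pub data: Vec<u8>,
}

/// Rewrite page 0, then write the dirty pages in ascending pid order.
/// Returns the pids in the order they were written.
pub fn flush_pages<W: Write + Seek>(
    file: &mut W,
    header: &[u8],
    page_size: u64,
    mut dirty: Vec<DirtyPage>,
) -> io::Result<Vec<u32>> {
    file.seek(SeekFrom::Start(0))?;
    file.write_all(header)?;
    dirty.sort_by_key(|p| p.pid); // deterministic, sequential pass
    let mut written = Vec::with_capacity(dirty.len());
    for page in &dirty {
        file.seek(SeekFrom::Start(u64::from(page.pid) * page_size))?;
        file.write_all(&page.data)?;
        written.push(page.pid);
    }
    file.flush()?; // drains user-space buffers only; not a durability barrier
    Ok(written)
}

/// Convenience wrapper that flushes into an in-memory buffer.
pub fn flush_to_memory(
    header: &[u8],
    page_size: u64,
    dirty: Vec<DirtyPage>,
) -> io::Result<Vec<u8>> {
    let mut file = Cursor::new(Vec::new());
    flush_pages(&mut file, header, page_size, dirty)?;
    Ok(file.into_inner())
}
```

Collecting the dirty pages into a Vec and sorting it is also what removes any dependence on hash-map iteration order from the write sequence.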
Determinism
The lab's verification depends on every operation being deterministic given the seed, the workload, and the cache capacity. Two things that look like they could leak nondeterminism but do not:
- HashMap iteration order. We never iterate the cache map; the flush loop sorts dirty frames by pid first.
- fsync timing. fsync does not change the byte contents of the file, only their visibility after a crash. The sha256 we compare is taken from the post-flush file, which is fully determined.
Where this fits
- Upstream: none directly; the pager is a from-scratch component.
- Downstream: db-12 (SQL frontend storage), db-13 (MVCC over snapshot page versions), db-14 (secondary index B+-trees over the pager), db-15 (SQLite-complete), db-21 (advanced storage variants), and every distributed lab from db-16 onwards stores state on a pager-backed file.
db-11 — References
Primary sources
- SQLite Pager design notes — the cleanest public description of a production pager, including how it interacts with rollback journals and WAL. The architecture of the db-11 pager is a deliberate simplification of this design. https://www.sqlite.org/atomiccommit.html https://www.sqlite.org/walformat.html
- LMDB / mdb design — Howard Chu, MDB: A Memory-Mapped Database and Backend for OpenLDAP. Describes a B+-tree pager whose write path is copy-on-write rather than write-back. Useful counterpoint to the LRU + dirty-bit approach we took. https://www.symas.com/symas-embedded-database-lmdb
- Goetz Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases 3(4), 2010. Chapter 2 covers buffer-pool management and the page-eviction policies real engines use.
Operating-systems background
- Andrew S. Tanenbaum, Modern Operating Systems, 4th ed., chapter on file systems and page caches. The OS's own page cache is conceptually our cache; understanding pread / pwrite / fsync at the kernel level explains why "writing" without fsync is not durable.
- fsync(2) man page: the canonical answer to "what does fsync actually guarantee?" Read this before assuming a write reached disk.
- Eduardo Pinheiro et al., Failure Trends in a Large Disk Drive Population, FAST 2007. Sobering reminder that the device under the pager does fail; durability is a probabilistic claim.
Replacement policies
- Elizabeth O'Neil, Patrick O'Neil, Gerhard Weikum, The LRU-K Page Replacement Algorithm For Database Disk Buffering, SIGMOD 1993. Why naive LRU thrashes on scan-heavy workloads, and the fix everyone borrowed.
- Theodore Johnson, Dennis Shasha, 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm, VLDB 1994. The 2Q policy, a variant of which PostgreSQL 8.0 shipped before later releases moved to clock sweep.
- The db-11 implementation deliberately uses textbook LRU. db-22 (performance) will measure when this hurts and what 2Q / CLOCK / ARC buy.
Production engines whose pager you can read
- SQLite (src/pager.c, src/pcache.c): heavy reading, but the comments are excellent. https://www.sqlite.org/src/file?name=src/pager.c
- BoltDB / bbolt (db.go, freelist.go): small enough to read in an afternoon. https://github.com/etcd-io/bbolt
- InnoDB (storage/innobase/buf/): large, but the buf_pool_t structure and buf0lru.cc are where the buffer-pool policy lives.
Cross-lab dependencies
- Upstream: none. The pager is a from-scratch component.
- Downstream: db-12 (SQL frontend), db-13 (MVCC), db-14 (indexes), db-15 (SQLite-complete), db-20 (distributed KV) all store state on top of a pager file in this format.
db-11 — Analysis
What had to be decided before any code was written, and why each choice locks in trade-offs the rest of the B-tree track will pay for or be paid by.
Required invariants
1. File layout is canonical. Bytes 0..15 of page 0 are the magic string DSE-PAGER-v1\0\0\0\0; bytes 16..19 are page_size LE; bytes 20..23 are num_pages LE; bytes 24..page_size-1 are zero. Any pager implementation that produces or consumes a file must agree on these bytes to the bit.
2. Cache capacity is hard. After every operation, the number of resident frames is <= cache_capacity. The eviction path maintains this invariant before admitting a new frame, never after.
3. Dirty pages survive eviction. If a page is evicted while dirty == true, its bytes are written to disk before the frame is reused. The cache may evict at any time; a dropped dirty page is a data-loss bug.
4. Determinism. Given (path, page_size, cache_capacity, seed, ops, scenario), the post-flush file bytes are a pure function of those inputs. Two languages running the same workload must produce sha256-identical files.
5. Page 0 is reserved. User code receives only pid >= 1 from allocate(). read(0) / write(0) is undefined behaviour (panic in Rust; documented but unenforced in Go/C++).
Design decisions
Why a 16-byte magic instead of 8
8 bytes (e.g. DSEPAGER) would have fit one register and saved
8 bytes per file. 16 bytes lets us include a version suffix and a
human-readable prefix that shows up in strings(1). The cost is
trivial; the debugging payoff (strings db.bin | grep DSE) is
immediate.
Why fixed page size at open() rather than per-page
A real engine fixes page size when the database file is created
and refuses to mount it under a different page size. We bake this
in by writing the page size into the header and re-reading it on
open. The cost: changing page size means rewriting the file. The
gain: no per-page metadata, no alignment surprises, page offsets
are just pid * page_size.
Why 1-based page ids
Page 0 is the header. Letting allocate() return 0 would force
every caller to remember the "0 is reserved" rule and to check it
on every dereference. By starting allocation at 1, the contract is
enforced by construction: any pid you legitimately hold is safe
to read.
Why LRU (and not CLOCK, 2Q, ARC, LFU, …)
LRU is the textbook policy and the easiest one to verify deterministic across three languages. Its weakness — sequential scans flush a hot working set — is real but invisible at the cache sizes our tests use (capacity 8 over 100 pages). db-22 will revisit and measure; until then, simplicity dominates.
Why a doubly-linked list, not a BTreeMap<LastUsed, PageId>
A balanced map gives O(log n) operations and self-orders by recency. A doubly-linked list plus a hashmap gives O(1) operations and the same eviction order, at the cost of two link pointers per frame. For a cache of 1000 frames that saves roughly log2(1000) ≈ 10 comparisons on every hit. Worth the boilerplate; SQLite's page cache, RocksDB's LRUCache, and InnoDB's buffer pool all use a list-plus-hash-table shape.
Why write-back, not write-through
Write-through (every write() synchronously persists) is simpler
but makes random updates ~100x slower because every dirty page
costs a seek and an fsync. Write-back lets us batch many writes to
the same page (db-10's B-tree insert may rewrite the same node
several times during a single workload) and amortize one disk
write per page per flush. The tax is the dirty-page accounting,
which is enforced by invariant 3 above.
Why fsync only on flush()
The pager's user owns the durability story. SQLite calls flush
at every COMMIT; an LSM (db-05) calls it after every WAL append;
an embedded counter store might call it once a minute. Pushing
the decision up keeps the pager honest: it never claims durability
it cannot deliver. The cross-test scenarios all call flush()
exactly once at the end, which is why their hashes are stable.
Why write-without-read on a cache miss
If write(pid, bytes) evicts a clean page and admits (pid, bytes, dirty=true) without first reading the old contents, the
disk's bytes are overwritten entirely on the next eviction or
flush. This is safe because write requires bytes.len() == page_size — the whole page is supplied. Reading the old contents
first would be a 4 KiB I/O for data we throw away immediately. A
proper engine extends this with "page allocation hints" so that
the OS can skip the readahead too; we don't bother.
Why the workload uses SplitMix64 (the same PRNG as db-10)
Three reasons:
- Identical implementation across languages. Three lines of wrapping-add and xor-mul; if any language gets it wrong the sha256 changes on the very first scenario.
- No external dependency. Crypto-quality PRNGs would need different libraries per language; SplitMix64 is purely arithmetic.
- Consistency across the track. Reusing the same PRNG as db-10 means a future cross-lab test can compare hashes from "B-tree built in RAM" against "B-tree built on the pager" using the same key sequence.
The PRNG draws exactly one u64 per iteration and uses specific bit-slices for op/byte/pid. A variable number of draws per iteration would make scenarios diverge in their key streams, which would defeat the third reason (consistency across the track).
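The one-draw-per-iteration rule can be sketched as a pure decode step over a single u64. The bit positions are the ones this lab documents (op in bits 62..63, the mixed-scenario selector in bit 60, the fill byte in bits 24..31); the struct and field names are illustrative assumptions, not the lab's API.

```rust
/// Fields sliced out of one 64-bit draw. Names are hypothetical.
#[derive(Debug, PartialEq)]
struct Draw {
    op: u8,          // bits 62..63: 0 or 1 = write, 2 = read, 3 = skip
    mixed_bit: bool, // bit 60: selector used only by the mixed scenario
    byte_val: u8,    // bits 24..31: fill byte for a write
}

/// Decode every field from the same draw, so each iteration consumes
/// exactly one PRNG output regardless of which op it turns out to be.
fn decode(r: u64) -> Draw {
    Draw {
        op: ((r >> 62) & 0b11) as u8,
        mixed_bit: ((r >> 60) & 1) == 1,
        byte_val: ((r >> 24) & 0xFF) as u8,
    }
}
```

Because all fields come from one value, a skip op still advances the PRNG by exactly one step, which keeps the draw sequence aligned across scenarios.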
Why the scenarios are sequential / random / mixed
- sequential stresses the predictable path: page ids walk in monotonic order and evictions are predictable. When the cycle is longer than the cache (Scenario A walks 32 pages over 8 frames), LRU evicts each page before it is revisited, so misses dominate.
- random stresses the eviction path: the steady-state hit ratio is roughly cache_capacity / num_pages, evictions happen on most writes, dirty pages move through the cache constantly.
- mixed is what real workloads look like: a hot subset (selected by (r>>60)&1) plus a long tail of cold pages.
These three together exercise the entire cache state machine. If any of them diverges across languages, the bug is localized (sequential bugs are accounting; random bugs are eviction; mixed bugs are interaction).
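Whether hits or misses dominate in the sequential scenario depends on whether the page cycle fits in the cache. The following is a small simulation sketch under simplifying assumptions: every iteration touches its page (the real workload also has skip ops), reads and writes are not distinguished, and the function name is hypothetical.

```rust
use std::collections::VecDeque;

/// Count LRU hits for the access pattern pid = 1 + (i % pages).
/// Uses a VecDeque with O(n) lookup, which is fine for a simulation.
fn sequential_lru_hits(pages: u32, capacity: usize, ops: u32) -> u32 {
    let mut lru: VecDeque<u32> = VecDeque::new(); // front = most recently used
    let mut hits = 0;
    for i in 0..ops {
        let pid = 1 + (i % pages);
        if let Some(pos) = lru.iter().position(|&p| p == pid) {
            lru.remove(pos); // hit: re-inserted at the front below
            hits += 1;
        } else if lru.len() == capacity {
            lru.pop_back(); // miss on a full cache: evict the LRU entry
        }
        lru.push_front(pid);
    }
    hits
}
```

With Scenario A's shape (32 pages, capacity 8) the simulation reports no hits at all, while a cycle that fits in the cache hits on every access after the first pass.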
Tradeoffs worth flagging
- No free-list, ever. allocate() only grows the file. Once a B+-tree splits a page and later coalesces it, the now-unused page id is leaked. db-21 (storage engine advanced) will reclaim via a free-list page; here it would just be unverifiable code.
- Vec<u8> per frame. Every cached page is its own allocation. A real engine packs frames into a single arena (the buffer pool) and indexes by offset. db-22 will measure the difference and likely arena-allocate.
- No checksums. A corrupted page returns its corrupted bytes silently. db-15 will add a CRC32 to the page footer when SQLite semantics demand it.
- No mmap. mmap-backed pagers (LMDB) are dramatically simpler but inherit the OS's page-replacement decisions, which we want to control here for testability. db-21 may explore the mmap variant.
- Single-threaded. No latching, no per-page reader/writer locks. db-13 (MVCC) and db-17 (Raft) will introduce concurrency on top of this layer.
db-11 — Execution
What was built, in the order it was built.
1. Rust (src/rust)
- Cargo.toml declares lib crate pager11 and binary pagerctl. Edition 2021; release profile lto = "thin", codegen-units = 1.
- src/lib.rs contains:
  - Constants MAGIC: &[u8;16] = b"DSE-PAGER-v1\0\0\0\0", HEADER_LEN = 24.
  - Frame { pid, data: Vec<u8>, dirty, prev: Option<usize>, next: Option<usize> } plus Pager { file, page_size, num_pages, capacity, frames: Vec<Frame>, free: Vec<usize>, map: HashMap<u32,usize>, head/tail: Option<usize>, hits, misses }.
  - Pager::open(path, page_size, capacity), ::allocate(), ::read(pid), ::write(pid, bytes), ::flush(), ::cache_hits(), ::cache_misses(), ::num_pages().
  - LRU helpers promote(frame_idx), evict_tail(), admit(...) operating on the indexed doubly-linked list.
  - Hand-rolled SHA-256 (FIPS 180-4) so the lib has no dependencies. sha256_hex(bytes) and sha256_file(path).
  - SplitMix64 PRNG and run_workload(path, page_size, capacity, pages, ops, seed, scenario) -> Pager.
  - 10 inline #[cfg(test)] tests: header round-trip, allocate monotonic, read-after-write within and across eviction, dirty survives eviction, flush is idempotent, hits/misses counts, scenario determinism (sequential), scenario determinism (random), scenario determinism (mixed), SHA-256 empty-string test vector.
- src/bin/pagerctl.rs: order-independent arg parser (args.windows(2) lookup). Subcommands init <path> [--page-size N] and workload <path> --seed S --ops N --pages P --cache C --scenario {sequential|random|mixed} [--page-size N]. Workload prints sha256_file(path) to stdout with no trailing newline.
2. Go (src/go)
- go.mod: module github.com/10xdev/dse/db11, Go 1.22.
- pager.go ports the Rust API one-for-one. Uses container/list for the LRU chain and map[uint32]*list.Element for lookup. SHA-256 via crypto/sha256 (stdlib is fine; the cross-language comparison is on the file bytes, not the digest algorithm).
- pager_test.go mirrors the 10 Rust tests plus an 11th, TestWorkloadMatchesCanonicalHashes, that bakes in the three canonical hashes (A/B/C) and runs all three scenarios in a loop. This is the test that catches "Go silently disagrees with Rust" regressions before the cross_test script even runs.
- cmd/pagerctl/main.go is the matching CLI. Custom flag parser (findFlag, firstPositional, mustU64, mustInt) because flag.Parse stops at the first non-flag argument and the shared script passes <path> before the flags.
3. C++ (src/cpp)
- CMakeLists.txt builds:
  - pager11 (static lib from src/pager.cc + src/sha256.cc).
  - pagerctl (executable linking pager11).
  - test_pager11 (ctest target linking pager11).
  - Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
- src/pager.h, src/pager.cc: factory function Pager::open(...) returning a std::unique_ptr<Pager>. std::list<Frame> for the LRU chain; std::unordered_map<uint32_t, std::list<Frame>::iterator> for O(1) lookup. std::list::splice for promotion.
- src/sha256.h, src/sha256.cc: FIPS 180-4 SHA-256 in ~120 lines.
- src/pagerctl.cc: matching CLI. Includes <unistd.h> for getpid() (used by tests for unique tmp paths); the omission of that header was the only build error during initial bring-up.
- tests/test_pager11.cc mirrors the Rust tests; uses #undef NDEBUG before <cassert> so asserts fire under Release. Prints OK 11 tests on success. Wired into ctest as a single test case.
4. Scripts
- scripts/verify.sh:
  - Rust: cargo test --quiet.
  - Go: go test ./....
  - C++: cmake -S … -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j && ctest --test-dir build --output-on-failure.
  - Exits 0 only if all three are green; prints === OK ===.
- scripts/cross_test.sh:
  - Builds Rust/Go/C++ pagerctl binaries (cargo release, go build, cmake+make).
  - Scenario A sequential, seed=42, pages=32, cache=8, ops=200, page_size=256: pagerctl workload <tmp> … per language; sha256 + size comparison against baked-in expected hash.
  - Scenario B random, seed=7, pages=64, cache=8, ops=500, page_size=256: same shape, different hash.
  - Scenario C mixed, seed=2024, pages=128, cache=16, ops=1000, page_size=512: same shape, different hash.
  - Spot-check: read the first 20 bytes of Scenario A's file and assert they equal 4453452d50414745522d76310000000000010000 (magic DSE-PAGER-v1\0\0\0\0 + 0x00000100 for page_size = 256 LE).
  - Print === ALL OK ===.
What was deliberately not built
- Free-list / page reclamation. allocate() only grows the file. db-21 (storage engine advanced) introduces a free-list page.
- Page checksums. No CRC32 footer. db-15 will add one when SQLite-compatibility demands it.
- mmap backend. All I/O goes through pread/pwrite. An mmap-based variant is a possible db-21 follow-up.
- Concurrency. No latches; the pager assumes a single thread. db-13 (MVCC) and db-17 (Raft) introduce concurrent access at higher layers.
- WAL. db-11's pager has no WAL; durability is via in-place write + fsync at flush(). db-03 already covered WAL and db-13 will add a transactional WAL on top of the pager.
- Compression / encryption. Out of scope; the page bytes are whatever the caller wrote.
db-11 — Observation
What the cross-language verification actually proves, and what the file looks like by hand.
Output of scripts/cross_test.sh
=== compare Scenario A (sequential seed=42 pages=32 cache=8 ops=200 ps=256) ===
A rust=cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 ( 8448 B)
A go =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 ( 8448 B)
A cpp =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 ( 8448 B)
match(A): cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428
=== compare Scenario B (random seed=7 pages=64 cache=8 ops=500 ps=256) ===
B rust=3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 ( 16640 B)
B go =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 ( 16640 B)
B cpp =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 ( 16640 B)
match(B): 3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023
=== compare Scenario C (mixed seed=2024 pages=128 cache=16 ops=1000 ps=512) ===
C rust=5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 ( 66048 B)
C go =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 ( 66048 B)
C cpp =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 ( 66048 B)
match(C): 5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460
=== spot-check header ===
spot-checks ok
=== ALL OK ===
File sizes are exactly (pages + 1) * page_size:
- A: (32 + 1) * 256 = 8448
- B: (64 + 1) * 256 = 16640
- C: (128 + 1) * 512 = 66048
The +1 is page 0 (header).
Reading the header by hand
For Scenario A (page_size = 256, num_pages = 33 including the
header):
xxd -l 24 /tmp/pager-A.rust.bin
00000000: 4453 452d 5041 4745 522d 7631 0000 0000 DSE-PAGER-v1....
00000010: 0001 0000 2100 0000 ....!...
Decoded:
| bytes | meaning |
|---|---|
| 44 53 45 2d 50 41 47 45 52 2d 76 31 00 00 00 00 | magic DSE-PAGER-v1\0\0\0\0 |
| 00 01 00 00 | page_size = 0x00000100 = 256 (LE) |
| 21 00 00 00 | num_pages = 0x00000021 = 33 (LE) |
Bytes 24..255 are zero (header padding to page_size).
The cross-test's spot-check confirms bytes 0..19 exactly equal
4453452d50414745522d76310000000000010000. A change to the header
encoding would surface here, would change the file's sha256 and so
break the match against the canonical hashes table, and, if it
altered the header or page length, would also change the file
size. One format bug therefore trips several checks at once.
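The header layout decoded above can be sketched as a pair of encode/decode helpers. MAGIC and HEADER_LEN come from the lab; the function names (encode_header, decode_header, hex) are illustrative assumptions rather than the crate's actual API.

```rust
const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";
const HEADER_LEN: usize = 24;

/// Build the 24-byte header: magic, then page_size and num_pages as u32 LE.
fn encode_header(page_size: u32, num_pages: u32) -> [u8; HEADER_LEN] {
    let mut h = [0u8; HEADER_LEN];
    h[..16].copy_from_slice(MAGIC);
    h[16..20].copy_from_slice(&page_size.to_le_bytes());
    h[20..24].copy_from_slice(&num_pages.to_le_bytes());
    h
}

/// Parse a header; returns (page_size, num_pages), or None if the buffer is
/// too short or the magic does not match.
fn decode_header(h: &[u8]) -> Option<(u32, u32)> {
    if h.len() < HEADER_LEN || h[..16] != MAGIC[..] {
        return None;
    }
    let page_size = u32::from_le_bytes([h[16], h[17], h[18], h[19]]);
    let num_pages = u32::from_le_bytes([h[20], h[21], h[22], h[23]]);
    Some((page_size, num_pages))
}

/// Lowercase hex, matching the form used by the spot-check string.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

Encoding (256, 33) and hex-dumping the first 20 bytes reproduces the spot-check string, which is a quick way to confirm an implementation's endianness before running the full cross-test.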
Reading a data page by hand
For Scenario A the workload writes a known byte value B = (r >> 24) & 0xFF to every byte of the chosen page. So any non-zero data
page in /tmp/pager-A.rust.bin should be 256 identical bytes:
xxd -s 256 -l 256 /tmp/pager-A.rust.bin | head -2
00000100: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c ................
00000110: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c ................
A run of one byte value repeated 256 times. Different pages
contain different fill bytes; the sha256 of the file rolls all of
them up. This makes hand-debugging a divergence between languages
straightforward: dump both files, diff -u <(xxd a) <(xxd b), and
the first non-matching page tells you exactly which (pid, byte)
the languages disagreed on.
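The same divergence hunt can be done programmatically once both files are loaded into memory. A minimal sketch, assuming both buffers have the same length (which the size check already enforces); the function name is hypothetical.

```rust
/// Locate the first differing byte between two pager files.
/// Returns (pid, offset_in_page, byte_in_a, byte_in_b), or None if the
/// common prefix of the two buffers is identical.
fn first_divergence(a: &[u8], b: &[u8], page_size: usize) -> Option<(usize, usize, u8, u8)> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x != y)
        .map(|i| (i / page_size, i % page_size, a[i], b[i]))
}
```

Because every data page holds a single repeated fill byte, the reported offset within the page is normally 0 and the two byte values identify which write the languages disagreed on.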
Cache statistics (informal)
Running Scenario B with cache = 8 over pages = 64:
hits = ~190
misses = ~310
Hit ratio ~38%. For uniformly random page ids the expected
baseline is roughly cache_capacity / num_pages (8 / 64 = 12.5%),
so these informal counts sit well above it and should be read as
indicative only. The Rust unit tests assert
hits + misses == ops but not the exact ratio, because writes that
bypass reads (write-on-miss admission) keep the absolute counts
implementation-defined enough that an exact check would be
fragile. The file bytes, however, are not implementation-defined,
and that is what we pin.
db-11 — Verification
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
- cmake ≥ 3.20.
- Rust toolchain ≥ 1.74.
- Go ≥ 1.22.
- shasum, xxd, awk (default on macOS; coreutils on Linux).
One command
cd db-11-pager-system
scripts/verify.sh # unit tests, all three languages
scripts/cross_test.sh # cross-language sha256 match
The first should print === OK ===, the second === ALL OK ===, and both should exit 0.
Per-language drill-down
Rust
cd db-11-pager-system/src/rust
cargo test --quiet
cargo build --release
Expected: all 10 inline tests pass; target/release/pagerctl is
built.
Go
cd db-11-pager-system/src/go
go test ./...
go build ./cmd/pagerctl
Expected: ok github.com/10xdev/dse/db11 <duration>. The
TestWorkloadMatchesCanonicalHashes test is the most important; it
fails loudly if Go disagrees with Rust on any of the three
scenarios.
C++
cd db-11-pager-system/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and the
test_pager11 target prints OK 11 tests.
What "green" means
A green run guarantees:
- All inline unit tests pass in Rust, Go, and C++.
- The cross-language test produces byte-identical files for all three canonical scenarios:

  | scenario | type | seed | pages | cache | ops | psz | sha256 |
  |---|---|---|---|---|---|---|---|
  | A | sequential | 42 | 32 | 8 | 200 | 256 | cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 |
  | B | random | 7 | 64 | 8 | 500 | 256 | 3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 |
  | C | mixed | 2024 | 128 | 16 | 1000 | 512 | 5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 |

  Matching sha256 across three independent implementations proves agreement on:
  - the file format (header magic, page_size, num_pages encoding),
  - the SplitMix64 PRNG (constants and bit-extraction layout),
  - the workload state machine (op/pid/byte selection),
  - the cache admission rule (write-on-miss admits without read),
  - the eviction rule (LRU tail, dirty pages written back),
  - the flush order (dirty pages sorted by pid before write),
  - and the final on-disk page layout.
- The spot-check confirms the first 20 bytes of Scenario A's file are 4453452d50414745522d76310000000000010000 (magic + page_size = 256 LE), guarding against the regression where all three languages "successfully" agree on a wrong header.
When verification fails
- Cross-language sha256 mismatch on Scenario A only: the sequential scenario exercises the simplest code path (no random pid selection, predictable evictions). A failure here is almost always either:
  - the magic / header encoding (check the spot-check first), or
  - the SplitMix64 PRNG (re-derive the first 5 outputs by hand and compare against 0xe220a8397b1dcdaf, …).
- Scenario A matches, B fails: the random scenario stresses eviction. Look for off-by-one in LRU tail selection or for a language whose unordered_map iteration leaks into the flush order (it should not; flush sorts by pid).
- A and B match, C fails: the mixed scenario uses a larger cache and a larger page size; suspect a page-size assumption baked into the implementation (e.g., a hard-coded 256 instead of reading from the header).
- All three sha256s match each other but disagree with the table above: a legitimate algorithm change. Make sure it was intentional, then update cross_test.sh, the Go canonical-hashes test, the C++ canonical-hashes assertion, and the table above in the same commit.
- One language's unit tests pass but cross_test fails: almost always a CLI bug, not a library bug. The unit tests drive the library directly; the cross_test drives the binary through the shell. Double-check argument parsing: in particular, that <path> may appear before the -- flags (this is the bug the Go port hit during bring-up, fixed by the custom findFlag / firstPositional parser).
db-11 — Broader Ideas
Where this disk-backed pager fits in the rest of the track, and which real-world techniques live one or two steps beyond it.
Immediate next labs
- db-12 (SQL frontend). The first consumer of the pager. A row in a table becomes some bytes inside some page; an INSERT is a Pager::write. The B+-tree layer that maps rows-to-pages is built in db-15 but its scaffolding starts here.
- db-13 (Transactions and MVCC). Each transaction sees a consistent snapshot of the pager. The simplest implementation is copy-on-write at the page level: a write conceptually allocates a new page rather than mutating the old, and snapshots hold roots pointing at the version they read. Our pager's monotonic allocate() is the right primitive for this.
- db-14 (Indexes). Secondary indexes are additional B+-trees living in the same pager file as the primary. Multiple trees, one pager, one buffer pool.
- db-15 (SQLite-complete). Stitches db-10..db-14 together. Will add page checksums, the rollback journal or WAL, and the free-list page so that deleted pages don't leak.
- db-21 (Storage engine advanced). Revisits this pager with CLOCK / 2Q eviction, a freelist, an mmap variant, and possibly a group-commit fsync scheduler.
- db-22 (Performance and benchmarking). Measures hit ratio, eviction rate, and fsync cost under realistic workloads; compares LRU against alternative policies.
How this lab's pieces show up in real systems
- The 4 KiB page is a common default rather than a universal one: SQLite's default page size and RocksDB's default SST block size are 4 KiB, while Postgres defaults to 8 KiB and InnoDB to 16 KiB. 4 KiB matches the typical filesystem block size and the usual Linux page-cache granule, so an aligned page read or write maps onto whole OS pages.
- The header-on-the-first-page trick is widespread: SQLite, BoltDB, InnoDB, and Berkeley DB all keep file-level metadata at the start of the file.
- Write-back caching in a bounded buffer pool is the classic design; Postgres sizes its pool with shared_buffers, InnoDB with innodb_buffer_pool_size, and SQLite calls its equivalent the page cache. Our implementation is the textbook version of that design.
- fsync-only-on-flush is the contract every transactional engine demands of its pager: the WAL or rollback journal layer above decides when, the pager just provides the primitive. This separation is what makes the "no-force" policy from the DBMS literature possible, where a commit does not have to write every dirty data page.
- The doubly-linked-list + hashmap LRU is the pattern behind InnoDB's buf_LRU and RocksDB's LRUCache. Postgres is the notable exception: it pairs a buffer-lookup hash table with clock-sweep replacement rather than a linked list.
Variants worth implementing later
- CLOCK replacement — a single circular array with a reference bit per frame. Approximates LRU at lower overhead because there's no list splice per access. PostgreSQL uses this.
- 2Q — two LRU lists, one for "seen once" and one for "seen twice or more". Resists scan-induced cache pollution. Cheap to implement on top of the existing LRU code.
- ARC (Adaptive Replacement Cache) — IBM's adaptive variant of 2Q. Patented but reimplementable.
- Copy-on-write pages (LMDB-style) — every write allocates a fresh page; old versions stay live for concurrent readers. Trades higher write amplification for free MVCC.
- mmap-backed pager: mmap the whole file, let the OS manage the page cache. Drastically simpler code; loses control over eviction and durability.
Performance experiments worth running later
- Plot hit ratio vs cache_capacity / num_pages for each scenario. Expect a knee around 25..50% for the mixed scenario.
- Measure the cost of flush() as a function of dirty-page count. On spinning disk, writes sorted by pid should be measurably cheaper than unsorted writes because they reduce seeking.
- Compare write-back vs write-through latency for a steady stream of small updates. The write-back win should be order-of-magnitude on any device with non-trivial fsync cost.
- Vary page_size from 256 B to 64 KiB. At a fixed cache size in bytes, the hit ratio improves with smaller pages (finer caching granule) but per-operation bookkeeping cost grows.
What "production-quality" would require beyond this lab
- Crash recovery. Right now, a crash in the middle of a flush leaves a half-written page on disk and no way to detect it. SQLite uses a rollback journal; Postgres uses WAL + a checkpointer. db-13 will introduce the simplest form of this.
- Checksums. A CRC32 footer per page so torn writes are detectable, not silently returned to the caller.
- A free-list page so deleted pages can be reused, otherwise files grow monotonically.
- Concurrent access. Reader-writer latching at the page level so the pager scales to multiple threads.
- Direct I/O / O_DIRECT to bypass the OS page cache and prevent double-buffering. Needed at high throughput; subtle to get right.
- Async I/O. io_uring on Linux, IOCP on Windows. The synchronous pread/pwrite we use is fine for teaching and for any workload where the database is not the bottleneck.
db-11 step 01 — Page I/O and file layout
Goal
Build the bottom half of the pager: the file format and the
uncached read / write / allocate path. No cache, no LRU, no
eviction. Every read is a pread; every write is a pwrite;
flush is just fsync.
Tasks
- Define MAGIC = b"DSE-PAGER-v1\0\0\0\0" (16 bytes) and HEADER_LEN = 24.
- Implement Pager::open(path, page_size, capacity):
  - If file does not exist or is empty, create it; write a fresh header page (magic + page_size + num_pages=1, zero-padded to page_size); fsync.
  - If file exists, read bytes 0..24, validate magic, parse page_size and num_pages. The caller-supplied page_size argument must match the on-disk value (or be supplied as the authoritative size on creation).
- Implement Pager::allocate() -> u32:
  - return num_pages, then num_pages += 1. The on-disk file is not yet extended; the next flush() will rewrite page 0 and the new page will materialise then.
- Implement Pager::read(pid) -> Vec<u8> (no caching yet):
  - validate 1 <= pid < num_pages.
  - pread(page_size bytes at offset pid * page_size).
- Implement Pager::write(pid, bytes) (no caching yet):
  - validate bytes.len() == page_size.
  - validate 1 <= pid < num_pages.
  - pwrite(bytes at offset pid * page_size).
- Implement Pager::flush():
  - rewrite page 0 with current num_pages (handles allocate-only transactions).
  - fsync.
- Implement Pager::close(): flush() then drop the file handle.
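The tasks above can be sketched as an uncached pager built on positioned reads and writes. This is a hedged sketch, not the lab's reference solution: the RawPager name is hypothetical, the capacity argument is omitted because nothing is cached yet, it is Unix-only (std::os::unix::fs::FileExt), it assumes page_size >= 24, and reading an allocated page that was never written fails with an unexpected-EOF error.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at / write_all_at (pread / pwrite)
use std::path::Path;

const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";

struct RawPager {
    file: File,
    page_size: u32,
    num_pages: u32,
}

impl RawPager {
    fn open(path: &Path, page_size: u32) -> io::Result<RawPager> {
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        let mut p = RawPager { file, page_size, num_pages: 1 };
        if p.file.metadata()?.len() == 0 {
            p.flush()?; // fresh file: write the header page and fsync
        } else {
            let mut h = [0u8; 24];
            p.file.read_exact_at(&mut h, 0)?;
            let on_disk = u32::from_le_bytes([h[16], h[17], h[18], h[19]]);
            if h[..16] != MAGIC[..] || on_disk != page_size {
                return Err(io::Error::new(io::ErrorKind::InvalidData, "bad header"));
            }
            p.num_pages = u32::from_le_bytes([h[20], h[21], h[22], h[23]]);
        }
        Ok(p)
    }

    /// Hand out the next page id; the file itself is not extended here.
    fn allocate(&mut self) -> u32 {
        let pid = self.num_pages;
        self.num_pages += 1;
        pid
    }

    /// Validate 1 <= pid < num_pages and return the page's byte offset.
    fn offset(&self, pid: u32) -> io::Result<u64> {
        if pid == 0 || pid >= self.num_pages {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "pid out of range"));
        }
        Ok(pid as u64 * self.page_size as u64)
    }

    fn read(&self, pid: u32) -> io::Result<Vec<u8>> {
        let off = self.offset(pid)?;
        let mut buf = vec![0u8; self.page_size as usize];
        self.file.read_exact_at(&mut buf, off)?;
        Ok(buf)
    }

    fn write(&mut self, pid: u32, bytes: &[u8]) -> io::Result<()> {
        let off = self.offset(pid)?;
        if bytes.len() != self.page_size as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "wrong page length"));
        }
        self.file.write_all_at(bytes, off)
    }

    /// Rewrite page 0 with the current num_pages, then fsync.
    fn flush(&mut self) -> io::Result<()> {
        let mut page0 = vec![0u8; self.page_size as usize];
        page0[..16].copy_from_slice(MAGIC);
        page0[16..20].copy_from_slice(&self.page_size.to_le_bytes());
        page0[20..24].copy_from_slice(&self.num_pages.to_le_bytes());
        self.file.write_all_at(&page0, 0)?;
        self.file.sync_all()
    }
}
```

Dropping the struct closes the file handle, so close() reduces to flush() followed by a drop in this sketch.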
Acceptance
Inline unit tests:
- header_round_trip: open new file, close, reopen, assert num_pages == 1 and the magic is intact.
- allocate_monotonic: three allocate() calls in a row return 1, 2, 3.
- write_then_read_same_pager: allocate, write a known byte pattern, read it back, assert equal.
- write_then_reopen_then_read: allocate, write, flush(), drop, reopen, read; bytes survived.
- flush_extends_file: after allocate + write + flush, file size equals (num_pages) * page_size.
All three green in Rust, Go, and C++.
Discussion prompts
- Why is num_pages stored on page 0 rather than inferred from the file size? (Hint: what happens between allocate() and flush() if the OS crashes?)
- What goes wrong if open() is called concurrently from two processes on the same file?
- Why does flush() rewrite page 0 even if no data page changed?
db-11 step 02 — LRU cache with write-back
Goal
Add the in-memory page cache on top of step 01. Bounded capacity,
LRU eviction, write-back on eviction, dirty bit per frame. After
this step the disk is touched only on cache misses, evictions, and
flush().
Tasks
- Define Frame { pid: u32, data: Vec<u8>, dirty: bool, prev, next } where prev/next are indices into a Vec<Frame> (Rust) or *list.Element (Go) or std::list<Frame>::iterator (C++).
- Add to Pager:
  - capacity: usize, set at open().
  - frames: Vec<Frame>, the storage backing the LRU chain.
  - free: Vec<usize>, reusable indices after eviction.
  - map: HashMap<u32, usize>, pid → frame index.
  - head, tail: Option<usize>, the MRU and LRU ends.
  - hits, misses: u64, accounting.
- Helpers:
  - promote(idx): unlink frame from current position, insert at head. No-op if already at head.
  - unlink(idx): remove frame from the list.
  - evict_tail(): pop the LRU frame; if dirty, pwrite before reusing the slot.
  - admit(pid, data, dirty): if at capacity, evict_tail first; allocate a frame (from free or push new); insert at head; update map.
- Rewrite read(pid):
  - if map[pid] exists: promote, hits += 1, clone, return.
  - else: misses += 1, pread, admit(pid, data, dirty=false), clone, return.
- Rewrite write(pid, bytes):
  - if map[pid] exists: overwrite data, set dirty = true, promote.
  - else: admit(pid, bytes, dirty=true), with no pread.
- Rewrite flush():
  - collect all (pid, frame_idx) where dirty == true.
  - sort by pid ascending.
  - for each, pwrite at pid * page_size, set dirty = false.
  - rewrite page 0 with current num_pages.
  - fsync.
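The cache state machine described by these tasks can be sketched with an index-linked list and a hashmap. This is a minimal sketch under loud assumptions: the Cache type is hypothetical, a HashMap named disk stands in for the file (so pread/pwrite become map lookups and inserts), page 0 and fsync are left out, and capacity is assumed to be at least 1.

```rust
use std::collections::HashMap;

struct Frame { pid: u32, data: Vec<u8>, dirty: bool, prev: Option<usize>, next: Option<usize> }

struct Cache {
    capacity: usize,
    frames: Vec<Frame>,
    free: Vec<usize>,
    map: HashMap<u32, usize>,
    head: Option<usize>, // most recently used
    tail: Option<usize>, // least recently used
    disk: HashMap<u32, Vec<u8>>, // in-memory stand-in for the file (assumption)
    hits: u64,
    misses: u64,
}

impl Cache {
    fn new(capacity: usize) -> Cache {
        Cache { capacity, frames: Vec::new(), free: Vec::new(), map: HashMap::new(),
                head: None, tail: None, disk: HashMap::new(), hits: 0, misses: 0 }
    }
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.frames[i].prev, self.frames[i].next);
        match prev { Some(p) => self.frames[p].next = next, None => self.head = next }
        match next { Some(n) => self.frames[n].prev = prev, None => self.tail = prev }
    }
    fn push_head(&mut self, i: usize) {
        self.frames[i].prev = None;
        self.frames[i].next = self.head;
        if let Some(h) = self.head { self.frames[h].prev = Some(i); }
        self.head = Some(i);
        if self.tail.is_none() { self.tail = Some(i); }
    }
    fn promote(&mut self, i: usize) {
        if self.head != Some(i) { self.unlink(i); self.push_head(i); }
    }
    fn evict_tail(&mut self) {
        if let Some(t) = self.tail {
            self.unlink(t);
            let pid = self.frames[t].pid;
            self.map.remove(&pid);
            if self.frames[t].dirty {
                let data = std::mem::take(&mut self.frames[t].data);
                self.disk.insert(pid, data); // write-back of a dirty victim
            }
            self.free.push(t); // slot becomes reusable
        }
    }
    fn admit(&mut self, pid: u32, data: Vec<u8>, dirty: bool) {
        if self.map.len() >= self.capacity { self.evict_tail(); }
        let f = Frame { pid, data, dirty, prev: None, next: None };
        let i = match self.free.pop() {
            Some(i) => { self.frames[i] = f; i }
            None => { self.frames.push(f); self.frames.len() - 1 }
        };
        self.push_head(i);
        self.map.insert(pid, i);
    }
    fn read(&mut self, pid: u32) -> Vec<u8> {
        if let Some(&i) = self.map.get(&pid) {
            self.hits += 1;
            self.promote(i);
            return self.frames[i].data.clone();
        }
        self.misses += 1;
        let data = self.disk.get(&pid).cloned().unwrap_or_default();
        self.admit(pid, data.clone(), false);
        data
    }
    fn write(&mut self, pid: u32, bytes: Vec<u8>) {
        if let Some(&i) = self.map.get(&pid) {
            self.frames[i].data = bytes;
            self.frames[i].dirty = true;
            self.promote(i);
        } else {
            self.admit(pid, bytes, true); // write-on-miss: old contents are not read
        }
    }
    fn flush(&mut self) {
        let mut dirty: Vec<(u32, usize)> = self.map.iter().map(|(&p, &i)| (p, i))
            .filter(|&(_, i)| self.frames[i].dirty).collect();
        dirty.sort(); // ascending pid, so the write order is deterministic
        for (pid, i) in dirty {
            self.disk.insert(pid, self.frames[i].data.clone());
            self.frames[i].dirty = false;
        }
    }
}
```

Using indices instead of pointers keeps the list safe Rust, at the cost of a free list for recycled slots. The Go and C++ ports described in this lab get the same behaviour from container/list and std::list.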
Acceptance
Inline unit tests:
- cache_hit_does_not_pread: write then read twice; second read produces a cache hit (cache_hits >= 1).
- eviction_writes_back_dirty: fill cache + 1, evict the oldest frame, drop the pager, reopen, read the evicted pid, bytes match the value written before eviction.
- eviction_skips_clean_pages: fill cache with only-reads, evict, reopen: file size unchanged (no spurious writes).
- flush_is_idempotent: flush twice in a row, file bytes identical, both succeed.
- hits_misses_accounting: for a known sequence of operations, cache_hits + cache_misses equals the number of read calls (writes that hit the cache are not counted as reads).
All three green in Rust, Go, and C++.
Discussion prompts
- Why does write on a miss not do a pread? When could this be wrong? (Answer: never, as long as the caller writes the whole page. Partial-page writes would need read-modify-write.)
- Why sort dirty pages by pid before writing them out?
- What is the worst-case eviction cost, and how could evict_tail amortize fsyncs across many evictions?
db-11 step 03 — Cross-language byte agreement
Goal
Pin the file format. After this step a workload run in Rust, Go, or C++ produces sha256-identical files for the same inputs. This is what makes the pager a real cross-language contract, not three loosely-related implementations.
Tasks
- Implement SplitMix64 exactly:

      next(state):
        state += 0x9E3779B97F4A7C15          // wrapping add
        z = state
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
        z = (z ^ (z >> 27)) * 0x94D049BB133111EB
        return z ^ (z >> 31)

  All multiplies are wrapping u64. Test against a known first-output table (seed = 0 yields 0xE220A8397B1DCDAF etc.).

- Implement run_workload(path, page_size, capacity, pages, ops, seed, scenario):

      pager = Pager::open(path, page_size, capacity)
      while pager.num_pages() < pages + 1:
        pager.allocate()
      rng = SplitMix64(seed)
      for iteration in 0..ops:
        r        = rng.next()
        op       = (r >> 62) & 0b11          // 0,1,2,3
        byte_val = (r >> 24) & 0xFF
        pid = match scenario:
          sequential -> 1 + (iteration % pages)
          random     -> 1 + (r as u64 % pages)
          mixed      -> if (r >> 60) & 1 then random_pid else sequential_pid
        match op:
          0 | 1 -> write a page of [byte_val; page_size]
          2     -> read pid (discard result)
          3     -> skip
      pager.flush()
      return pager

  Critical: each iteration consumes exactly one next() call. This is what keeps the three scenarios comparable for a given seed.

- Build a pagerctl CLI in each language with subcommands init and workload. workload runs the function above and prints sha256_file(path) in lowercase hex with no trailing newline to stdout. The CLI must accept <path> either before or after the -- flags: the cross-test passes path first; some contributors will pass it last.

- Write scripts/cross_test.sh:
  - build all three binaries (cargo release, go build, cmake+make).
  - for scenarios A (sequential), B (random), C (mixed): run each language, sha256 the resulting file, assert all three match each other and match the baked-in expected hash.
  - spot-check the first 20 bytes of one file equal the expected header bytes.

- Bake the canonical hashes into the Go and C++ test suites too, so a divergence is caught at go test / ctest time even without running the shell script.
Acceptance
- scripts/verify.sh exits 0; each language reports its tests green.
- scripts/cross_test.sh exits 0 with === ALL OK ===.
- The canonical hashes table in docs/verification.md matches the hashes hard-coded in:
  - scripts/cross_test.sh
  - src/go/pager_test.go::TestWorkloadMatchesCanonicalHashes
  - src/cpp/tests/test_pager11.cc (canonical hashes block)
Discussion prompts
- What happens to the sha256 of Scenario A if you swap the order of the two multiplies in SplitMix64?
- Why does the workload draw exactly one next() per iteration, even for the skip case? (See docs/analysis.md.)
- If we wanted to add a fourth scenario (e.g. "read-heavy"), what would have to change in this lab to keep the cross-test working?
db-12 — SQL Frontend
What is it?
A self-contained SQL frontend: a tokenizer + recursive-descent parser that turns a small but realistic SQL dialect into an Abstract Syntax Tree, plus a canonical byte serializer that hashes deterministically. There is no execution engine in this lab — the AST stops at bytes-on-disk and bytes-on- the-wire.
The supported dialect is the smallest one that's still interesting:
- CREATE TABLE name (col TYPE, …); with INT and TEXT columns.
- INSERT INTO name VALUES (…), (…), …; with single-row and multi-row form.
- SELECT * | col, col, … FROM name [WHERE col OP literal];
- DELETE FROM name [WHERE col OP literal];
- UPDATE name SET col = lit, col = lit, … [WHERE col OP literal];
with six comparison operators (=, !=, <, <=, >, >=), integer
and text literals ('pad''let' style escape), -- line comments, and
case-insensitive keywords. Identifiers are preserved verbatim.
Execution — the bytecode VM that walks the AST to actually run the statement — is deliberately deferred to db-13 (where it can share the transaction machinery it really needs). This lab stops at "the program parsed to this exact AST, and we can prove it byte-for-byte across three languages".
Why does it matter?
This is the lab where the project pivots from storage to language. Every SQL database in the world starts with the same three-stage front:
source text ──► tokens ──► AST ──► (planner / VM / executor)
What we are doing here is exactly steps one and two, plus a fourth step — serialize the AST to a canonical byte stream — that no production engine needs but the project needs as the only honest cross-language proof that three independent parsers agree on the meaning of the same SQL text.
If you've ever read SQLite's tokenize.c or Postgres's scan.l /
gram.y, this lab is the same shape, written by hand:
- Tokenizing is a single character-by-character pass with a handful of state branches (whitespace, comment, identifier, number, string, operator, single-char punct).
- Parsing is recursive descent: one function per non-terminal, peek at the next token, dispatch, recurse. No parser generators, no table-driven state machines, no lookahead arithmetic — the grammar is tiny enough that the code is almost a 1:1 transliteration of the BNF.
- The AST is a discriminated union (Rust enum, Go field-bag struct, C++ struct with a kind tag). Statements know their kind; literals know their type.
Once you've built one frontend by hand, every other one becomes a reading exercise.
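The single-pass tokenizer described above can be sketched as follows. Everything here is a hedged illustration: the Tok variants, the choice to emit keywords as plain identifiers (leaving case-insensitive keyword matching to the parser), the treatment of a bare - as punctuation, and the message wording are assumptions, and a string literal containing a newline would throw off the line counter in this simplified version.

```rust
#[derive(Debug, PartialEq)]
enum Tok { Ident(String), Int(i64), Text(String), Op(String), Punct(char) }

/// Tokenize `src`; each token carries its 1-based (line, col) start position.
fn tokenize(src: &str) -> Result<Vec<(Tok, usize, usize)>, String> {
    let cs: Vec<char> = src.chars().collect();
    let (mut i, mut line, mut col) = (0usize, 1usize, 1usize);
    let mut out = Vec::new();
    while i < cs.len() {
        let c = cs[i];
        let (l0, c0, start) = (line, col, i);
        let err = |m: String| format!("parse error at line {} col {}: {}", l0, c0, m);
        if c == '\n' { i += 1; line += 1; col = 1; continue; }
        if c.is_whitespace() { i += 1; col += 1; continue; }
        if c == '-' && cs.get(i + 1) == Some(&'-') {
            // Line comment: skip to the newline, which the branch above consumes.
            while i < cs.len() && cs[i] != '\n' { i += 1; }
            continue;
        }
        let tok = if c.is_ascii_alphabetic() || c == '_' {
            while i < cs.len() && (cs[i].is_ascii_alphanumeric() || cs[i] == '_') { i += 1; }
            Tok::Ident(cs[start..i].iter().collect())
        } else if c.is_ascii_digit() {
            while i < cs.len() && cs[i].is_ascii_digit() { i += 1; }
            let s: String = cs[start..i].iter().collect();
            Tok::Int(s.parse::<i64>().map_err(|_| err("integer out of range".to_string()))?)
        } else if c == '\'' {
            i += 1;
            let mut s = String::new();
            loop {
                match cs.get(i) {
                    None => return Err(err("unterminated string".to_string())),
                    // '' inside a string is an escaped single quote.
                    Some(&'\'') if cs.get(i + 1) == Some(&'\'') => { s.push('\''); i += 2; }
                    Some(&'\'') => { i += 1; break; }
                    Some(&ch) => { s.push(ch); i += 1; }
                }
            }
            Tok::Text(s)
        } else if "=<>!".contains(c) {
            i += 1;
            if c != '=' && cs.get(i) == Some(&'=') { i += 1; } // !=, <=, >=
            let s: String = cs[start..i].iter().collect();
            if s == "!" { return Err(err("unexpected character '!'".to_string())); }
            Tok::Op(s)
        } else if "(),;*-".contains(c) {
            i += 1;
            Tok::Punct(c)
        } else {
            return Err(err(format!("unexpected character '{}'", c)));
        };
        col += i - start;
        out.push((tok, l0, c0));
    }
    Ok(out)
}
```

Recording the start position before consuming a token is what lets every error point at the first character of the offending lexeme, in the one-line error format this lab uses.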
How does it work?
┌──────── source text (UTF-8 bytes) ────────┐
│ │
tokenize │ char loop: ws | -- comment | ident | │
─────────►│ number | string | op | punct │
│ → Vec<Token { kind, payload, line, col }>│
│ │
parse │ recursive descent: │
─────────►│ parse_program = stmt* EOF │
│ parse_stmt dispatches on kw │
│ parse_create/insert/select/delete/update│
│ → Vec<Statement> │
│ │
serialize│ walk AST, emit canonical bytes (see │
─────────►│ "wire format" below). Magic header lets │
│ decoders sanity-check. │
│ → Vec<u8> │
│ │
sha256 │ inline FIPS 180-4 (Rust + C++); │
─────────►│ crypto/sha256 (Go). Output hex matches │
│ in all three languages on any input. │
└────────────────────────────────────────────┘
Wire format
Magic header "DSESQL01" (8 ASCII bytes), then u32 LE statement count,
then that many statement records:
Statement record:
u8 kind 1=Create, 2=Insert, 3=Select, 4=Delete, 5=Update
u32 LE name_len
name bytes
if Create:
u32 LE col_count
repeat col_count: { u32 LE name_len, name bytes, u8 type (1=Int|2=Text) }
if Insert:
u32 LE row_count
repeat row_count:
u32 LE col_count
repeat col_count: literal
if Select:
u8 cols_kind 1 = *, 0 = named
if named: u32 LE n, repeat n: { u32 LE name_len, name bytes }
where
if Delete:
where
if Update:
u32 LE set_count
repeat set_count: { u32 LE name_len, name bytes, literal }
where
literal:
u8 tag 1 = Int, 2 = Text
if Int: i64 LE (two's-complement, little-endian)
if Text: u32 LE n, n bytes
where:
u8 has_where 0 = no WHERE, 1 = WHERE
if 1:
u32 LE col_name_len, col_name bytes
u8 op 1=Eq, 2=Ne, 3=Lt, 4=Le, 5=Gt, 6=Ge
literal
Every integer is unsigned-little-endian unless noted. Strings carry their own length prefix (no null-terminators, no escapes — the bytes are exactly what the parser saw between the unescaped quotes).
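The wire format above can be illustrated by serializing the smallest statement that exercises the where and literal records, a DELETE. This is a sketch under assumptions: the type and function names (Literal, Op, Where, put_*, serialize_delete_program) are hypothetical, and only the Delete record is covered.

```rust
enum Literal { Int(i64), Text(String) }

#[derive(Clone, Copy)]
enum Op { Eq = 1, Ne = 2, Lt = 3, Le = 4, Gt = 5, Ge = 6 }

struct Where { col: String, op: Op, lit: Literal }

/// Length-prefixed bytes: u32 LE length, then the raw bytes.
fn put_bytes(out: &mut Vec<u8>, b: &[u8]) {
    out.extend_from_slice(&(b.len() as u32).to_le_bytes());
    out.extend_from_slice(b);
}

/// literal: tag 1 + i64 LE, or tag 2 + length-prefixed text.
fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(v) => { out.push(1); out.extend_from_slice(&v.to_le_bytes()); }
        Literal::Text(s) => { out.push(2); put_bytes(out, s.as_bytes()); }
    }
}

/// where: has_where flag, then column name, operator code, and literal.
fn put_where(out: &mut Vec<u8>, w: Option<&Where>) {
    match w {
        None => out.push(0),
        Some(w) => {
            out.push(1);
            put_bytes(out, w.col.as_bytes());
            out.push(w.op as u8);
            put_literal(out, &w.lit);
        }
    }
}

/// A whole program holding one DELETE: magic, statement count, one record.
fn serialize_delete_program(table: &str, w: Option<&Where>) -> Vec<u8> {
    let mut out = b"DSESQL01".to_vec();
    out.extend_from_slice(&1u32.to_le_bytes()); // statement count
    out.push(4); // kind = Delete
    put_bytes(&mut out, table.as_bytes());
    put_where(&mut out, w);
    out
}
```

For DELETE FROM t WHERE id >= -2 this yields 35 bytes: 12 for the magic and count, 6 for the kind and table name, and 17 for the where record, whose last 8 bytes are the two's-complement encoding of -2.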
Error format
Every error message is one line of the form:
parse error at line L col C: <message>
Lines and columns are 1-based. tokenize errors report the position of
the bad character; parse errors report the position of the offending
token (or one past the last token's column if the input ended early).
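One way to keep the error line identical across phases is to route every error through a single type whose Display impl owns the format string. This is a sketch under that assumption; the field names are hypothetical.

```rust
use std::fmt;

/// One error type for both the tokenize and parse phases.
/// Field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub line: u32, // 1-based
    pub col: u32,  // 1-based
    pub msg: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The format string lives in exactly one place.
        write!(
            f,
            "parse error at line {} col {}: {}",
            self.line, self.col, self.msg
        )
    }
}

impl std::error::Error for ParseError {}
```

The Go and C++ ports can mirror this with a single formatting helper, which makes the cross-language comparison of error lines a plain string comparison.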
What's intentionally out of scope
- Execution. No VM, no query plan, no I/O. db-13.
- JOIN, GROUP BY, ORDER BY, LIMIT, expressions. Single-table predicates only. Future labs.
- Schema validation. A SELECT name FROM t referencing an undefined column parses cheerfully; that's the planner's job, not the parser's.
- Identifier case folding. SQLite folds Users and users together; Postgres folds them to lowercase. We do neither: identifiers are preserved verbatim, only keywords are case-insensitive. This makes the byte-identity test sharper.
- Quoted identifiers ("foo"), backticks, square brackets. One identifier syntax keeps the tokenizer trivial.
- Negative literals as expressions. A leading - before an integer literal in a value position is parsed as a sign on that literal; it is not a unary-minus operator. There is no general expression grammar.
Cross-language invariant
All three implementations expose sqlctl parse --file FOO.sql (or
--inline "..."). Stdout receives the canonical bytes; stderr receives
the sha256 hex (no trailing newline).
scripts/cross_test.sh runs both fixtures through all three binaries and
asserts:
- All three stderr-emitted sha256s match.
- The matching hash equals the frozen-in-tests value (so the wire format cannot silently drift even if all three implementations drift together).
- The bytes themselves are bit-identical (cmp -s), guarding against a hypothetical sha256 collision.
- The error path also matches: feeding "SELECT FROM t;" to all three binaries must produce a non-zero exit and an error line that mentions the column.
The frozen reference hashes are:
| Fixture | Bytes | sha256 |
|---|---|---|
| a_basic.sql | 181 | 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 |
| b_full.sql | 486 | e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 |
Any change to the AST shape, tokenizer behavior, or wire format must
update those numbers in scripts/cross_test.sh, the Go test
(sql_test.go), the C++ test (tests/test_sqlfront12.cc), and this
table — all in the same commit.
db-12 — References
Primary sources
- Crafting Interpreters, Robert Nystrom: chapters 4 ("Scanning") and 6 ("Parsing Expressions") map almost 1:1 onto what we built. The hand-rolled recursive-descent style and the "one function per non-terminal" discipline are taken straight from this book. https://craftinginterpreters.com/
- Modern Compiler Implementation in ML (or in C / Java), Andrew Appel: chapter 3 ("Parsing"). The clearest exposition of why recursive descent works for LL(1) grammars and what changes when you need lookahead or precedence climbing.
- The Dragon Book (Aho, Lam, Sethi, Ullman): chapters 3 and 4. The textbook source for lexical analysis (DFA construction, regular expressions to scanners) and predictive parsing. Read for theory; the practice is in Crafting Interpreters.
How real databases parse SQL
- SQLite: src/tokenize.c (hand-rolled DFA, conceptually the same as ours but written in C with a generated keyword-lookup function) and src/parse.y (a Lemon-parser grammar, which generates a table-driven bottom-up parser rather than our recursive-descent). https://github.com/sqlite/sqlite/blob/master/src/tokenize.c https://www.sqlite.org/lemon.html
- PostgreSQL: src/backend/parser/scan.l (flex) and src/backend/parser/gram.y (bison). A generator-based front; the grammar file alone is ~17k lines. Worth opening just to see the scale of the dialect they support. https://github.com/postgres/postgres/tree/master/src/backend/parser
- DuckDB: uses libpg_query, which is a stripped-down Postgres parser exposed as a library. A useful pattern when you want Postgres-compatible SQL without the rest of Postgres. https://github.com/duckdb/duckdb/tree/main/third_party/libpg_query
- CockroachDB: pkg/sql/parser/sql.y (goyacc). Like DuckDB above, another example of "real database, generated parser".
Recursive descent specifically
- Rob Pike, Lexical Scanning in Go, Sydney Google Technology User Group, 2011: the talk that popularized the "scanner emits tokens to a channel; parser reads from the channel" style. We don't use channels (we just return a Vec), but the state-machine framing of the scanner loop is the same. https://www.youtube.com/watch?v=HxaD_trXwRE
- Doug Crockford, Top Down Operator Precedence, 2007: the cleanest explanation of Pratt parsing, which is what you reach for next once recursive descent runs into expression-precedence pain. We don't need it here (no expression grammar) but it's the natural follow-up. https://crockford.com/javascript/tdop/tdop.html
Determinism / wire formats
- Federal Information Processing Standards Publication 180-4, NIST, 2015 — the SHA-256 specification we implemented inline in Rust and C++. https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
- Protobuf encoding docs — useful counterexample for what not to do if you want byte-identity. Protobuf's "unknown field" handling and optional canonicalization are exactly the corners that prevent stable hashes across implementations. Our format avoids those corners on purpose. https://protobuf.dev/programming-guides/encoding/
Cross-lab dependencies
None. db-12 is intentionally self-contained: there is nothing in earlier
labs (storage, WAL, LSM, B-tree) that the parser needs, and the AST
serializer uses no upstream wire format. The C++ build does not
add_subdirectory(../db-NN). That isolation is a feature, not an
oversight — it keeps the lab small enough to be reasoned about as a
self-contained compiler-front exercise.
db-12 feeds db-13 (execution + transactions), where the AST will finally be walked by a VM.
db-12 — Analysis
We are building a hand-written SQL frontend in three languages and proving agreement byte-for-byte. The hard part is not any single piece — tokenizing, parsing, and emitting bytes are all small, well-understood components — but holding all three implementations to a single set of design decisions tight enough that the output hashes match on every input.
Required invariants
- Deterministic encoding. Given the same input text, the serializer must produce exactly the same byte sequence on every run, in every language, on every machine. No iteration over hash-maps, no environment-dependent integer widths, no locale-sensitive case conversion. Iteration order of set/cols is insertion order (which is parse order, which is source order).
- Error reporting carries 1-based (line, col). Tokenizer errors point at the offending character; parser errors point at the offending token. The error string format is identical across languages (a cross_test.sh smoke test asserts this on SELECT FROM t;).
- Identifiers are preserved verbatim. Only keywords are case-insensitive. select * FROM uSeRs is legal; the table identifier uSeRs is emitted as the bytes u, S, e, R, s (not users, not USERS).
- String literals use SQL escape: doubled single-quote = one single-quote. 'pad''let' is the 7-byte string pad'let. No backslash escapes; no E'...' C-style escapes. The serializer emits the unescaped string contents.
- All five statement kinds round-trip identically. No statement is "almost canonical": every parsable input produces a byte-identical serialization to the same input parsed by the other two languages.
- Cross-language byte identity is the only acceptable proof. Equal AST shapes "by inspection" don't count; equal sha256 over the serialized bytes does.
Design decisions
Why a u8 tag in front of every variant
The wire format is a tagged union. Every statement, every literal, every WHERE-or-no-WHERE choice starts with a single byte that tells you what follows. The alternatives all fail:
- Implicit type from position: requires a schema, which the frontend has no access to.
- Self-describing JSON-like format: kills byte identity (key ordering, whitespace, escape choices).
- Protobuf-style encoding: its varints come bundled with "unknown field" / "default value" ambiguities. Two encoders that agree on the schema can still disagree on the bytes.
A fixed u8 tag with a tight numeric assignment (Create=1, Insert=2, Select=3, Delete=4, Update=5) plus length-prefixed strings gives us bytes that are trivially deterministic.
Why INT is i64 LE, not varint
i64 LE is the simplest thing that works in all three languages without
a helper library. Varint would save a few bytes on small literals but
costs a non-trivial encoder/decoder that we'd have to keep in lockstep
across Rust/Go/C++.
Why operators get a single byte, not a string
Same reason: a fixed numeric assignment (Eq=1..Ge=6) makes the byte
layout exact and language-agnostic. If we wrote "=", then someone in
some language would eventually decide to emit "==" and the hashes
would drift on the day the lab grew expressions.
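A small sketch of that fixed assignment in Rust. The enum values come from the wire format; the helper names from_symbol and tag are hypothetical.

```rust
/// Comparison operators with their wire-format tag bytes pinned in the type.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Eq = 1,
    Ne = 2,
    Lt = 3,
    Le = 4,
    Gt = 5,
    Ge = 6,
}

impl Op {
    /// Map operator source text to an Op; anything else is rejected
    /// (hypothetical helper, not necessarily the lab's API).
    pub fn from_symbol(sym: &str) -> Option<Op> {
        match sym {
            "=" => Some(Op::Eq),
            "!=" => Some(Op::Ne),
            "<" => Some(Op::Lt),
            "<=" => Some(Op::Le),
            ">" => Some(Op::Gt),
            ">=" => Some(Op::Ge),
            _ => None,
        }
    }

    /// The single byte written to the wire.
    pub fn tag(self) -> u8 {
        self as u8
    }
}
```

The spelling of an operator exists only on the input side; once parsed, only the numeric tag can reach the serializer, so there is no string for an implementation to spell differently.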
Why we keep the MAGIC header
"DSESQL01" is 8 bytes of self-description. It costs nothing, lets a
hypothetical decoder detect "this isn't a db-12 AST blob" before
mis-parsing, and pins the wire format version inside the bytes themselves
(01). If the format ever changes incompatibly, we bump to DSESQL02
and update the frozen hashes.
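For illustration, here is how a hypothetical decoder could use the header before touching any statement bytes. The lab itself ships no decoder; read_header and its error strings are assumptions.

```rust
/// The 8-byte magic, including the format version "01".
pub const MAGIC: &[u8; 8] = b"DSESQL01";

/// Validate the header and return (statement count, remaining bytes).
/// Hypothetical helper: the lab only defines the encoder side.
pub fn read_header(buf: &[u8]) -> Result<(u32, &[u8]), String> {
    if buf.len() < 12 {
        return Err("truncated header".to_string());
    }
    if buf[..8] != MAGIC[..] {
        return Err("bad magic: not a db-12 AST blob".to_string());
    }
    // Statement count is a u32 in little-endian order.
    let count = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
    Ok((count, &buf[12..]))
}
```

A blob written under a future DSESQL02 format fails the magic comparison here instead of being mis-parsed as version 01.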
Why we don't compute the AST length up front
A length prefix on each statement would force a two-pass serialize (size then write), or backpatching. We get away without it because the wire format is fully self-describing left-to-right; a decoder needs no random access. Keeping the encoder one-pass keeps all three implementations short and obviously equivalent.
Why the C++ build is self-contained
db-12 has no upstream lab dependencies. The C++ CMakeLists.txt does
not add_subdirectory(../db-NN). That keeps the lab's ctest output
clean (only one test target: test_sqlfront12) and avoids the trap from
db-09 where leaking upstream add_test calls polluted local runs. Each
lab's CMake should ask itself: do I genuinely need upstream code in this
binary? For db-12, the answer is no.
Why deferring execution is the right call
A VM that walks the AST is the natural next step, but it needs a storage backend (db-10/11 pager or db-05/06 LSM), a notion of column types and rows, and ideally a transaction layer. Bolting any of that into db-12 would either bind it to a specific storage shape too early or build a toy in-memory engine we'd throw away in db-13. Stopping at AST bytes keeps the lab small, scope-clean, and shippable.
Why three languages
The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three independent implementations all matching the same frozen reference hash, the probability that a bug in just one of them still produces a matching sha256 is vanishingly small. A matching hash on a non-trivial fixture is therefore strong evidence that the entire tokenize → parse → encode pipeline behaves identically in all three languages on that input. It does not rule out a misreading of the format shared by all three, which is why the unit tests check individual behaviors as well.
db-12 — Execution
What we built, in the order we built it.
1. Rust (src/rust) — the reference
- Cargo.toml declares crate sqlfront12 (lib) and a binary sqlctl. No external dependencies, no path dependencies: the lab is self-contained.
- src/lib.rs (~1100 lines) defines:
  - ParseError (one error type for both tokenize and parse phases).
  - TokKind + tokenize(src) -> Result<Vec<Token>, ParseError>. The tokenizer is a single character-by-character loop with branches for whitespace, -- line comment, identifier/keyword, integer literal, '...' string literal (with '' escape), comparison operator (=, !=, <, <=, >, >=), and single-char punctuation (the characters ( ) , ;).
  - ColType { Int, Text }, Literal { Int(i64), Text(String) }, Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 } (#[repr(u8)]), Where, SelectCols { Star, Named(Vec<String>) }, Statement enum with five variants.
  - Parser struct (token slice + cursor) with one method per non-terminal (parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal).
  - parse(src) -> Result<Vec<Statement>, ParseError> glues tokenize + Parser together.
  - serialize(stmts) -> Vec<u8> walks the AST and emits the canonical bytes described in CONCEPTS.md. Magic header b"DSESQL01" then u32 LE count, then per-statement records.
  - Inline sha256 + sha256_hex (FIPS 180-4) so the lab has no external crate dependencies.
- 11 inline #[cfg(test)] tests:
  - tokenize_happy: full coverage of all token kinds on a single mixed input.
  - tokenize_strings_and_errors: '' escape; unterminated string reports correct (line, col).
  - parse_create_table: CREATE TABLE with INT + TEXT columns.
  - parse_insert_multirow: multi-row VALUES, both literal types.
  - parse_select_variants_and_all_ops: SELECT *, SELECT col, col, each of the 6 comparison ops.
  - parse_update_and_delete: UPDATE multi-SET + WHERE; DELETE + WHERE.
  - parse_multi_with_comments_and_case: -- line comments, case-insensitive keywords, identifier case preserved.
  - parse_errors_report_column: missing identifier after SELECT reports line 1 col 8.
  - serialize_header_and_count: magic bytes + count field correct.
  - serialize_is_deterministic: two serialize calls on the same AST return equal bytes.
  - sha256_known_vectors: the FIPS 180-4 SHA-256("") and ("abc") vectors.
- bin/sqlctl.rs is the CLI used by the cross-language script.
2. Fixtures (scripts/fixtures)
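Two SQL files, frozen forever. The frozen hashes cover the serialized AST, so every identifier and literal byte matters. Whitespace and -- comment bytes (including the trailing newline and the en-dash — in the comment lines) are dropped by the tokenizer, but the tests inline the fixture text verbatim, so the files are kept byte-stable as well: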
- a_basic.sql: minimal smoke test. CREATE TABLE users, three-row INSERT, SELECT *, SELECT id, name WHERE id = 2. 181 bytes serialized.
- b_full.sql: full-coverage. Every statement kind, both literal types, the '' escape, every comparison operator. 486 bytes serialized.
The hashes were computed once from the Rust reference and then frozen
into the Go test, the C++ test, and scripts/cross_test.sh. If you
edit either fixture, all three of those locations must update in the
same commit.
3. Go (src/go)
- go.mod: module github.com/10xdev/dse/db12. No external deps, no replace directives: the module stands alone.
- sql.go ports the Rust types one-for-one:
  - TokKind int constants. Token, ColType (ColInt=1, ColText=2), LitKind, Literal, Op (OpEq=1..OpGe=6), Where, SelectColsKind, SelectCols, Column, Assign, StmtKind (KindCreate=1..KindUpdate=5).
  - One Statement struct holds the union (kind tag + every variant's fields). Go has no enums, so this is the idiomatic shape.
  - Tokenize, Parse, Serialize exported; an internal parser struct mirrors Rust's Parser.
- sql_test.go mirrors all 11 Rust tests. Two of them (TestFixtureAHash and TestFixtureBHash) inline the exact fixture text and assert both the byte length and the frozen sha256. These two tests are what lock the wire format permanently.
- cmd/sqlctl/main.go is the matching CLI.
Go matched Rust byte-for-byte on first run; no debugging needed.
4. C++ (src/cpp)
- CMakeLists.txt: self-contained. Targets sqlfront12_lib, sqlctl, test_sqlfront12. No add_subdirectory because db-12 has no upstream dependencies; a comment in the file explains why not, so future-me doesn't try to "wire it up like db-09".
- src/sqlfront12.h declares namespace sqlfront12: ParseError : std::runtime_error, TokKind, Token, the AST types, and entry points tokenize, parse, serialize, sha256_hex.
- src/sqlfront12.cc (~500 lines) implements them. Anonymous-namespace Parser class; std::vector<std::uint8_t> buffers for the serializer; inline SHA-256 with a hex lookup table.
- src/sqlctl.cc: the C++ CLI mirror. Writes bytes to stdout via std::cout.write(...), sha256 hex to stderr, catches ParseError and anything else, prints message, returns 1.
- tests/test_sqlfront12.cc: 11 tests, mirroring Rust + Go. The file starts with #undef NDEBUG followed by #include <cassert>, because Release builds otherwise turn assert into a no-op. Two of the tests inline the fixture content (including the — en-dashes, stored as UTF-8 in a C++ raw string literal) and assert the frozen hashes.
C++ matched Rust and Go on first build; ctest passed in ~0.2s.
5. Scripts (scripts/)
- verify.sh: cargo test + go test + cmake/ctest. Prints === OK === and exits 0.
- cross_test.sh: builds the three sqlctl binaries, runs each against both fixtures, and asserts:
  - all three stderr-emitted sha256s match each other and the frozen value, for each fixture;
  - the CLI-emitted sha256 equals shasum -a 256 of the stdout bytes (catches "CLI lies about its own hash" bugs);
  - the byte streams are bit-identical (cmp -s);
  - an inline-arg smoke test (sqlctl parse --inline 'SELECT * FROM t;') matches across the three languages;
  - an error-path smoke test (SELECT FROM t;) returns non-zero in all three and the error string mentions the column.

  Prints === ALL OK === on success.
6. Bash 3.2 portability
macOS ships bash 3.2, which lacks declare -A (associative arrays).
The first cut of cross_test.sh used declare -A WANT; WANT[a.sql]=...; want="${WANT[$fix]}", which ran fine under brew's bash 5.x and broke
under /bin/bash. The fix is a plain function:
want_hash() {
case "$1" in
a_basic.sql) echo "071b40fd..." ;;
b_full.sql) echo "e219f1ee..." ;;
*) echo ""; return 1 ;;
esac
}
...
want="$(want_hash "$fix")"
Both scripts now run cleanly under /bin/bash (verified end-to-end).
What we deliberately didn't build
- A bytecode VM. db-13.
- A query planner. db-13/14.
- Expressions richer than col OP literal. Future labs once we have a use for them.
- Schema validation, name resolution, type checking. All planner jobs.
- A pretty-printer / unparse function. Useful for round-trip fuzzing, irrelevant to the byte-identity proof.
db-12 — Observation
What the cross-language verification actually proves.
Output of scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
rust=071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 ( 181 B)
go =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 ( 181 B)
cpp =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 ( 181 B)
match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
rust=e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 ( 486 B)
go =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 ( 486 B)
cpp =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 ( 486 B)
match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
inline hash: 941f21252cdf88816e720c0e6877f3728eac3390355d0eb5a69febccbf470991
=== error-path smoke test ===
[rust] parse error at line 1 col 8: expected identifier
[go] parse error at line 1 col 8: expected identifier
[cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===
Where 181 bytes for a_basic.sql comes from
a_basic.sql parses to four statements: a CREATE TABLE, an INSERT
with three rows, a SELECT *, and a SELECT id, name WHERE id = 2.
The serialized bytes break down as:
Header 12 B
magic "DSESQL01" 8
u32 LE stmt_count = 4 4
CREATE TABLE users (id INT, name TEXT) 30 B
u8 kind=1 1
u32 name_len=5 + "users" 4+5
u32 col_count=2 4
col "id": u32 len=2 + bytes + u8 type=1 4+2+1
col "name": u32 len=4 + bytes + u8 type=2 4+4+1
----
30
INSERT INTO users VALUES (1,'a'), (2,'b'), (3,'c') 71 B
u8 kind=2 1
u32 name_len=5 + "users" 4+5
u32 row_count=3 4
per row (×3):
u32 col_count=2 4
lit Int(N): u8 tag=1 + i64 LE 1+8
lit Text(c): u8 tag=2 + u32 len=1 + 1 byte 1+4+1
per-row total = 4 + 9 + 6 = 19
3 rows × 19 57
----
71
SELECT * FROM users 12 B
u8 kind=3 1
u32 name_len=5 + "users" 4+5
u8 cols_kind=1 (Star) 1
u8 has_where=0 1
(no SELECT-cols list when Star, no WHERE)
----
12
SELECT id, name FROM users WHERE id = 2 46 B
u8 kind=3 1
u32 name_len=5 + "users" 4+5
u8 cols_kind=0 (Named) 1
u32 named_count=2 4
col "id": u32 len=2 + bytes 4+2
col "name": u32 len=4 + bytes 4+4
u8 has_where=1 1
u32 col_len=2 + "id" 4+2
u8 op=1 (Eq) 1
lit Int(2): u8 tag=1 + i64 LE 1+8
----
46
Total = 12 (header) + 30 + 71 + 12 + 46 = 171 B
The hand-walk lands at 171 B, 10 B short of the observed 181 B. Every per-field width above follows the wire format, so the gap most likely comes from the statement text: the SQL shown in this breakdown appears to abbreviate a_basic.sql (each extra byte in a TEXT literal or identifier adds exactly one byte to the output), but that is not verified here. The observed 181 B matches across Rust, Go, and C++ on every platform we've run them on, which is the claim that matters. The fact that all three independent implementations agree on both the byte count and the sha256 is what makes the result trustworthy; the per-statement byte arithmetic is a sanity check to build intuition, not a constraint.
(If you want the exact breakdown, hexdump the file written by sqlctl parse --file scripts/fixtures/a_basic.sql > /tmp/a.bin; xxd /tmp/a.bin
and read it linearly against the wire format in CONCEPTS.md.)
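The same hand-walk can be done mechanically. The sketch below sums field widths straight from the wire format; the function names are hypothetical, and the inputs assume the statements exactly as written in the breakdown (one-character TEXT literals), which may not match the real fixture byte for byte.

```rust
/// Length-prefixed string: u32 prefix + bytes.
fn str_size(s: &str) -> usize {
    4 + s.len()
}

/// Int literal: tag byte + i64.
const INT_LIT: usize = 1 + 8;

/// Text literal: tag byte + length-prefixed bytes.
fn text_lit(s: &str) -> usize {
    1 + str_size(s)
}

/// CREATE record: kind, table name, col_count, then (name, type byte) per column.
pub fn create_size(table: &str, cols: &[&str]) -> usize {
    1 + str_size(table) + 4 + cols.iter().map(|c| str_size(c) + 1).sum::<usize>()
}

/// INSERT record for rows shaped (Int, Text): kind, table name, row_count,
/// then per row a col_count plus the two literals.
pub fn insert_size(table: &str, rows: &[(i64, &str)]) -> usize {
    let per_rows: usize = rows.iter().map(|(_, t)| 4 + INT_LIT + text_lit(t)).sum();
    1 + str_size(table) + 4 + per_rows
}

/// SELECT record: kind, table name, column selector, optional WHERE on an Int.
pub fn select_size(table: &str, named: Option<&[&str]>, where_col: Option<&str>) -> usize {
    let cols = match named {
        None => 1, // cols_kind = Star
        Some(n) => 1 + 4 + n.iter().map(|c| str_size(c)).sum::<usize>(),
    };
    let wh = match where_col {
        None => 1, // has_where = 0
        Some(c) => 1 + str_size(c) + 1 + INT_LIT,
    };
    1 + str_size(table) + cols + wh
}
```

With those inputs the four records come to 30, 71, 12 and 46 bytes, or 171 B including the 12-byte header. Swapping in the real fixture's identifiers and literals is a quick way to see where the remaining bytes of the observed 181 B come from.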
What b_full.sql adds
- All five statement kinds, including the ones a_basic.sql omits (DELETE, UPDATE).
- Both literal kinds (Int and Text) in every position they can appear.
- The '' escape inside a TEXT literal.
- Every comparison operator in WHERE clauses (=, !=, <, <=, >, >=).
486 bytes, hash e219f1ee....
What this proves
- Tokenizers agree. Otherwise the token stream into the parser would differ and the AST would diverge.
- Parsers agree on grammar interpretation. Otherwise the AST shapes would differ: different statement kinds, different WHERE absence/presence, different operator assignment.
- AST type tags agree. A flipped Le/Lt (the canonical off-by-one) shows up as one wrong byte and a fully different hash.
- Literal encoding agrees. Integer endianness, string length-prefix vs null-termination, the '' escape semantics: all covered.
- The keyword set is identical across the three languages. Adding LIMIT to one tokenizer's reserved-word table without the others would cause the next fixture using limit as an identifier to break.
- Error-path behavior agrees. The error-line format parse error at line L col C: <msg> is identical, and the column number for SELECT FROM t; is 8 in all three. Different column-counting conventions would show up here.

Any single bug in any of those layers, in any one language, that the fixtures exercise would break the hash match. A match is therefore very strong evidence that the frontend is correct end-to-end on the constructs the fixtures cover.
What scripts/verify.sh adds
verify.sh does not exercise cross-language identity — it just runs
the per-language unit tests. The Go and C++ test suites each include
the two frozen-hash tests, so even without cross_test.sh a Go-only
or C++-only test run would catch a wire-format drift in that
language. cross_test.sh is the belt-and-suspenders check that all
three actually agree on the same input file (rather than three
languages agreeing with three different bug-compatible copies of the
fixtures).
db-12 — Verification
How to reproduce the green status on a clean machine.
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11 supporting C++20.
- cmake ≥ 3.20.
- Rust toolchain ≥ 1.74 (rustup default stable).
- Go ≥ 1.22.
- shasum, cmp, awk (default on macOS; coreutils on Linux).
- bash: the scripts are written to bash 3.2 (what macOS ships) on purpose, so /bin/bash works; bash 5.x is fine too.
No network access required. No external crates, modules, or libraries.
One command
cd db-12-sql-frontend
scripts/verify.sh # builds + unit tests in all three langs
scripts/cross_test.sh # cross-language sha256 match against fixtures
They should print === OK === and === ALL OK === respectively, and both should exit 0.
Per-language drill-down
Rust
cd db-12-sql-frontend/src/rust
cargo test --quiet
cargo build --release
Expected: 11 passed; 0 failed. The sqlctl binary lands in
target/release/sqlctl.
Go
cd db-12-sql-frontend/src/go
go test ./...
go build ./cmd/sqlctl
Expected: ok github.com/10xdev/dse/db12 <duration>. Eleven tests
pass, including TestFixtureAHash and TestFixtureBHash which assert
the frozen 181-byte / 486-byte sha256 values for the two fixtures.
C++
cd db-12-sql-frontend/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and
test_sqlfront12 prints OK. The single ctest target runs all 11
inline assertions, including the two frozen-hash fixture tests.
What "green" means
A green run guarantees:
- All 33 unit tests pass (11 each in Rust, Go, C++).
- The Rust serializer, the Go serializer, and the C++ serializer all agree on the frozen reference hashes:

| Fixture | Bytes | sha256 |
|---|---|---|
| a_basic.sql | 181 | 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 |
| b_full.sql | 486 | e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 |

- The CLIs in all three languages report the same sha256 as shasum -a 256 over their stdout: they aren't lying about their own hash.
- The error path is also identical: the three implementations all report parse error at line 1 col 8: expected identifier for the malformed input SELECT FROM t;.
When verification fails
- Cross-language sha256 mismatch on a fixture. The wire format drifted in exactly one language. Things to suspect, in order of likelihood:
  - New operator added to Op in one language only: it emits a new tag byte the others don't recognize.
  - String length prefix width changed (u32 → u64).
  - Endianness slip on i64 LE (someone used binary.BigEndian in Go, or htonl in C++).
  - Iteration order divergence, most likely on SELECT named-column lists or UPDATE SET assignments. Both must follow parse order (insertion order); a HashMap somewhere would break this.
- Cross-language sha256 match but mismatch against the frozen value. All three implementations drifted together: the wire format genuinely changed. Update the frozen hashes in scripts/cross_test.sh, src/go/sql_test.go, and src/cpp/tests/test_sqlfront12.cc in the same commit, and update the table in CONCEPTS.md.
- Rust passes, Go fails frozen-hash test. Most likely an encoding/binary byte-order slip (BigEndian vs LittleEndian) or a missing i64/int64 conversion on a literal.
- Rust + Go pass, C++ ctest reports zero tests. Confirm the add_test line is in CMakeLists.txt after the add_executable for test_sqlfront12. Do not add add_subdirectory(../db-NN): db-12 has no upstream lab dependencies, and any such call leaks upstream add_test calls into our ctest output.
- cross_test.sh fails with WANT: command not found or bad array subscript. The script is calling associative-array syntax under bash 3.2. The shipped cross_test.sh uses a want_hash() case function precisely to avoid this; if the failure recurs after an edit, search for declare -A or ${WANT[ and replace with the function form.
- Fixture hash changes after editing a .sql file. Any edit that reaches the AST (different identifier bytes, different literal contents, an added or removed statement) changes the serialized bytes and therefore the output hash. Edits confined to whitespace or -- comments (a trailing newline, or replacing the en-dash — in a comment line with a hyphen) are discarded by the tokenizer and should leave the hash unchanged, but they still desynchronize the file from the fixture text inlined in the Go and C++ tests. The fixtures are frozen for the lifetime of the lab; if you want to add coverage, add a new fixture and a new frozen hash rather than editing these.
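When two implementations disagree, the hash only says that they differ, not where. A tiny helper like the hypothetical first_diff below (not part of the lab's scripts) reports the first diverging offset, which can then be read against the wire format.

```rust
/// Return the offset of the first byte at which `a` and `b` differ,
/// or None if they are identical. A length mismatch counts as a
/// difference at the end of the shorter input.
pub fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    for i in 0..common {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}
```

An endianness slip shows up at the first byte of the first non-zero integer field, while an iteration-order bug shows up at the first reordered name.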
db-12 — Broader Ideas
Where to take this frontend next, and how the patterns generalize.
Immediate next labs
- db-13: Execution + transactions + MVCC. The bytecode VM that walks the AST we built here is the natural next step. db-13 needs:
  - A storage backend (we'll plug in the db-11 pager for B-tree-backed tables, or the db-09 LSM for log-structured tables).
  - A row representation (compact bytes per row).
  - A type checker that turns "the AST referenced column name" into "column index 1 of type TEXT".
  - A planner, even a trivial one, to convert SELECT … WHERE … into a scan-with-filter or an index lookup.
  - A transaction layer.

  Each of those is a small project. The reason db-12 stops where it does is so db-13 can spend its budget on those, not on re-parsing.
- db-14: Indexes + query optimization. Once we have a planner, an obvious next move is to add secondary indexes and a cost-based picker between scan and index-lookup. The AST shape from db-12 is rich enough to drive that without modification.
How this lab's patterns show up in real systems
- Recursive descent is what you actually read in most production database front-ends, even when the surface uses a parser generator. SQLite's parse.y (Lemon) generates a parser whose state machine looks nothing like recursive descent, but the hand-rolled tokenize.c is exactly the style we used, and expr.c walks expression trees recursively (for analysis and code generation rather than parsing) in the same one-function-per-node style. Postgres is similar: gram.y is bison, but analyze.c and the planner do recursive walks over the AST that look just like our serializer.
- AST → bytes is a primitive that quietly underlies a lot of database engineering:
  - Query fingerprinting: Postgres's pg_stat_statements derives its query ID by hashing a normalized form of the parse tree, so statements that differ only in constants share an ID.
  - Plan-hash matching: Oracle exposes a plan hash value (alongside a hash of the statement text) to recognize "this is a query or plan I've seen before."
  - Audit logs: serialize the AST instead of the raw SQL text so you can normalize whitespace, comments, and identifier case for diff-friendly storage.
  - Cross-version compatibility tests: serialize an AST in version N, deserialize in version N+1, and assert nothing changed. This is exactly the byte-identity discipline we use here, except across time instead of across languages.
- Cross-language byte identity is rare in industry (most teams ship in one language) but the same discipline appears in:
  - Compiler bootstrapping: GCC's bootstrap rebuilds the compiler with itself and compares the stage-2 and stage-3 outputs for bit identity; rustc is likewise built in stages by an earlier rustc.
  - Deterministic builds: Bazel/Nix/Reproducible Builds project all rely on the same "bytes out are a pure function of bytes in" property we exercise here.
  - Cryptographic protocol implementations: TLS test vectors, canonical CBOR (RFC 8949 §4.2), Ed25519 deterministic signatures.
Performance experiments worth running later
These don't affect lab status (which is green), but they're good Saturday-afternoon explorations:
- Replace the per-token String allocation in Rust with a slice into the source buffer (zero-copy tokens). Measure how much that buys on a 1 MB SQL script.
- Profile the C++ serializer on a 100k-statement input. The hand-written push_back loop is probably memory-bandwidth-bound; a single reserve(estimate) up front should help.
- Generate a 1k-fixture random-SQL fuzz corpus, parse it in all three languages, and assert sha256 match across languages on every input. This catches drift the two hand-written fixtures don't cover.
- Pratt-parse expressions: add col + col, col * literal, etc., to the WHERE grammar using Pratt's top-down operator precedence. The AST gets a recursive Expr node; the serializer gets one more branch.
What "production-ready" would require beyond this lab
- Lookahead beyond LL(1) in a handful of places (e.g., INSERT INTO t (col, col) VALUES ... vs INSERT INTO t VALUES ...).
- A real expression grammar (Pratt or precedence-climbing).
- JOIN, subqueries, CTEs, window functions, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET.
- Quoted identifiers ("foo bar") and the associated escape semantics.
- A separate semantic-analysis pass between parse and execute (name resolution, type checking, ambiguous-column detection).
- Error recovery in the parser: real SQL frontends report multiple errors per parse rather than bailing on the first one.
- Internationalized identifiers (Unicode identifier class, NFC normalization).
- Concurrent parsing for prepared-statement caches (lock-free hash lookup, AST interning).
None of these change the shape of the front-end — they make the same shape bigger.
db-12 step 01 — Tokenizer
Goal
Implement tokenize(src) -> Result<Vec<Token>, ParseError> such that any
character that cannot start a valid token produces an error pointing at
its 1-based (line, col), and the legal token kinds form a stream the
parser can consume left-to-right with no lookahead beyond the current token.
Tasks
- Define `TokKind` covering:
  - Keywords: `SELECT`, `FROM`, `WHERE`, `INSERT`, `INTO`, `VALUES`, `CREATE`, `TABLE`, `DELETE`, `UPDATE`, `SET`, `AND`, `INT`, `TEXT`.
  - Identifier (case-preserving).
  - Integer literal, text literal.
  - Punctuation: `,`, `;`, `(`, `)`, `*`.
  - Operators: `=`, `!=`, `<`, `<=`, `>`, `>=`.
- Implement `tokenize` as a single character-by-character pass over the source bytes:
  - Skip whitespace, tracking the line via `\n` and the column via the byte index since the last `\n`.
  - On `--`: skip to end of line.
  - On `[A-Za-z_]`: read an identifier; uppercase-fold it and look it up in the keyword table. If found, emit the keyword token; otherwise emit an identifier token with the verbatim bytes (no case folding).
  - On `[0-9]`: read an integer literal (an optional leading `-` is consumed in value position by the parser, not here in the tokenizer).
  - On `'`: read a string literal; `''` is a single embedded quote; a missing close quote is an error reporting the opening `(line, col)`.
  - On `<`, `>`, `!`: peek for `=` to form `<=`, `>=`, `!=`.
  - On `=`, `,`, `;`, `(`, `)`, `*`: emit a single-char token.
  - Anything else: error reporting the `(line, col)` of the bad byte.
- Every emitted `Token` carries its `(line, col)` (the start of the token), so parser errors can blame the right column even when the token is several characters long.
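The position-tracking loop above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the lab's reference tokenizer: it covers only a keyword subset, identifiers, integers, `--` comments, and single-character punctuation, and the names `Tok`, `Token`, `KEYWORDS`, and `tokenize_sketch` are hypothetical.

```rust
/// Reduced token kinds for illustration; the lab's `TokKind` has more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Keyword(String), // stored upper-cased
    Ident(String),   // stored verbatim
    Int(i64),
    Punct(char),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: Tok,
    pub line: usize, // 1-based
    pub col: usize,  // 1-based byte column of the token's first byte
}

// Hypothetical subset of the keyword table.
const KEYWORDS: [&str; 4] = ["SELECT", "FROM", "WHERE", "AND"];

pub fn tokenize_sketch(src: &str) -> Result<Vec<Token>, String> {
    let b = src.as_bytes();
    let (mut i, mut line, mut line_start) = (0usize, 1usize, 0usize);
    let mut out = Vec::new();
    while i < b.len() {
        let c = b[i];
        let col = i - line_start + 1;
        if c == b'\n' {
            line += 1;
            line_start = i + 1;
            i += 1;
        } else if c.is_ascii_whitespace() {
            i += 1;
        } else if c == b'-' && b.get(i + 1) == Some(&b'-') {
            // Line comment: skip up to (not including) the newline.
            while i < b.len() && b[i] != b'\n' {
                i += 1;
            }
        } else if c.is_ascii_alphabetic() || c == b'_' {
            let start = i;
            while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') {
                i += 1;
            }
            let word = &src[start..i];
            let upper = word.to_ascii_uppercase();
            let kind = if KEYWORDS.contains(&upper.as_str()) {
                Tok::Keyword(upper)
            } else {
                Tok::Ident(word.to_string())
            };
            out.push(Token { kind, line, col });
        } else if c.is_ascii_digit() {
            let start = i;
            while i < b.len() && b[i].is_ascii_digit() {
                i += 1;
            }
            let n = src[start..i].parse::<i64>().map_err(|_| {
                format!("parse error at line {} col {}: integer out of range", line, col)
            })?;
            out.push(Token { kind: Tok::Int(n), line, col });
        } else if matches!(c, b',' | b';' | b'(' | b')' | b'*' | b'=') {
            out.push(Token { kind: Tok::Punct(c as char), line, col });
            i += 1;
        } else {
            return Err(format!("parse error at line {} col {}: unexpected byte", line, col));
        }
    }
    Ok(out)
}
```

Each token records the column computed before any bytes are consumed, which is what lets a later parser error point at the start of a multi-character token.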
Acceptance
Inline unit tests (Rust names; mirror them in Go and C++):
- `tokenize_happy`: a single mixed input exercising every token kind. Assert the resulting `Vec<TokKind>` matches the expected sequence.
- `tokenize_strings_and_errors`: a `''` escape lexes to the unescaped contents; an unterminated string returns `parse error at line N col M: ...` with the correct `(N, M)`.
Both green in Rust, Go, and C++.
Discussion prompts
- Why fold keywords but not identifiers? What would change in our fixture hashes if we case-folded identifiers like SQLite does?
- The tokenizer recognizes 14 keywords. Which keyword would we add first if we wanted to parse `LIMIT 10`? Why does adding it require a parser change too?
- We chose to track `(line, col)` per token rather than per character offset. What's the trade-off?
db-12 step 02 — Parser and AST
Goal
Implement parse(src) -> Result<Vec<Statement>, ParseError> that
consumes the token stream from step 01 and produces a typed AST.
Parser errors carry 1-based (line, col) from the offending token.
Tasks
- Define the AST:
  - `ColType { Int, Text }`.
  - `Literal { Int(i64), Text(String) }` with explicit tag bytes `Int=1`, `Text=2` (matters for serialization).
  - `Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 }`: `#[repr(u8)]` in Rust, `OpEq=1..OpGe=6` constants in Go, `enum class Op : uint8_t` in C++.
  - `Where { col: String, op: Op, lit: Literal }`: present-or-absent via `Option` / pointer / `has_where` flag.
  - `SelectCols { Star, Named(Vec<String>) }`.
  - `Statement` enum with five variants: `Create { name, cols: Vec<(name, ColType)> }`, `Insert { name, rows: Vec<Vec<Literal>> }`, `Select { name, cols: SelectCols, where_: Option<Where> }`, `Delete { name, where_: Option<Where> }`, `Update { name, sets: Vec<(name, Literal)>, where_: Option<Where> }`.
- Implement `Parser` as `{ tokens: &[Token], pos: usize }` with one method per non-terminal: `parse_program`, `parse_stmt`, `parse_create`, `parse_insert`, `parse_select`, `parse_delete`, `parse_update`, `parse_where`, `parse_literal`. Each method consumes tokens left-to-right with single-token lookahead via `peek`.
- On any unexpected token, produce `parse error at line L col C: <message>`. Make sure the `<message>` and `(L, C)` are stable across the three languages; `cross_test.sh` asserts this.
- Preserve insertion order everywhere. SELECT column lists, UPDATE SET assignments, INSERT row lists, and CREATE column lists are all `Vec` / slice / `std::vector` (never `HashMap` / `map`).
- A leading `-` before an integer literal in value position (RHS of `WHERE col OP -1`, INSERT/UPDATE literals) parses as a negative integer literal. It is not a unary-minus operator; there is no expression grammar.
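A minimal Rust sketch of `parse_literal` with single-token lookahead, to make the negative-literal rule concrete. It assumes the tokenizer hands the parser a lone `-` as its own `Tok::Minus` token; that token, the reduced `Tok` enum, and the error wording are hypothetical and not taken from the lab's reference code.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Minus, // assumption: a lone `-` reaches the parser as its own token
    Int(i64),
    Text(String),
    Ident(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: Tok,
    pub line: usize,
    pub col: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Literal {
    Int(i64),
    Text(String),
}

pub struct Parser<'a> {
    tokens: &'a [Token],
    pos: usize,
}

impl<'a> Parser<'a> {
    pub fn new(tokens: &'a [Token]) -> Self {
        Parser { tokens, pos: 0 }
    }

    /// Single-token lookahead: inspect the next token without consuming it.
    fn peek(&self) -> Option<&'a Token> {
        self.tokens.get(self.pos)
    }

    fn err(&self, at: Option<&Token>, msg: &str) -> String {
        // Blame the offending token; at end of input fall back to the last token.
        let (l, c) = at
            .or(self.tokens.last())
            .map(|t| (t.line, t.col))
            .unwrap_or((1, 1));
        format!("parse error at line {} col {}: {}", l, c, msg)
    }

    pub fn parse_literal(&mut self) -> Result<Literal, String> {
        let first = self.peek();
        match first.map(|t| &t.kind) {
            Some(Tok::Minus) => {
                self.pos += 1;
                match self.peek().map(|t| &t.kind) {
                    // `-` followed by an integer folds into one negative literal.
                    Some(Tok::Int(n)) => {
                        self.pos += 1;
                        Ok(Literal::Int(-*n))
                    }
                    _ => Err(self.err(self.peek(), "expected integer after '-'")),
                }
            }
            Some(Tok::Int(n)) => {
                self.pos += 1;
                Ok(Literal::Int(*n))
            }
            Some(Tok::Text(s)) => {
                self.pos += 1;
                Ok(Literal::Text(s.clone()))
            }
            _ => Err(self.err(first, "expected literal")),
        }
    }
}
```

Because the fold happens only inside `parse_literal`, a `-` anywhere else is simply an unexpected token, which is the behaviour the "no expression grammar" rule asks for.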
Acceptance
Inline unit tests (Rust names; mirror in Go and C++):
- `parse_create_table`: `CREATE TABLE` with one INT column and one TEXT column.
- `parse_insert_multirow`: multi-row `INSERT VALUES (..), (..)`, exercising both literal kinds.
- `parse_select_variants_and_all_ops`: `SELECT *`, `SELECT col, col`, and each of the 6 comparison operators in WHERE.
- `parse_update_and_delete`: `UPDATE` with multi-column SET and WHERE; `DELETE` with WHERE.
- `parse_multi_with_comments_and_case`: `--` line comments, keywords in mixed case, identifiers preserved verbatim.
- `parse_errors_report_column`: `SELECT FROM t;` reports `parse error at line 1 col 8: expected identifier`.
All six green in Rust, Go, and C++.
Discussion prompts
- Recursive descent works because our grammar is `LL(1)`. What's the single most popular SQL construct that isn't `LL(1)`, and how would we extend the parser to handle it?
- We parse `INSERT INTO t VALUES (1, 'a')` but not `INSERT INTO t (a, b) VALUES (1, 'x')`. Which token's lookahead would tell us we're in the second form, and how would that change `parse_insert`?
- Why does the negative-literal-in-value-position decision live in the parser rather than the tokenizer? Hint: what would `WHERE a - b` mean if it were a tokenizer rule?
db-12 step 03 — Serializer and cross-language byte identity
Goal
Define a deterministic binary format for the AST, implement
serialize(stmts) -> Vec<u8> in all three languages, ship a sqlctl
CLI that prints the bytes, and prove via sha256 that all three
implementations agree on every legal input.
CLI contract
sqlctl parse --file <path>
sqlctl parse --inline "<sql>"
- Stdout receives the raw bytes from `serialize(parse(...))`: no framing, no trailing newline.
- Stderr receives the lowercase hex sha256 of stdout, with no trailing newline.
- On parse error, write `parse error at line L col C: <msg>\n` to stderr and exit 1. Stdout must be empty.
Tasks
- Implement `serialize` per the wire format in `CONCEPTS.md`. Magic header `b"DSESQL01"` then `u32 LE` count then per-statement records with `u8 kind` tags. Numbers are unsigned little-endian unless noted; `INT` literals are `i64 LE`; strings are `u32 LE length` + raw UTF-8 bytes.
- Inline a SHA-256 implementation (Rust `sha256` + `sha256_hex`; C++ `sha256_hex`). In Go, use `crypto/sha256` for brevity (stdlib is allowed; the digest is determined by the standard, so cross-language identity is preserved).
- Build `sqlctl` in Rust (`src/rust/src/bin/sqlctl.rs`), Go (`src/go/cmd/sqlctl/main.go`), and C++ (`src/cpp/src/sqlctl.cc`).
- Freeze the two fixtures `scripts/fixtures/a_basic.sql` and `scripts/fixtures/b_full.sql`, which exercise every statement kind, both literal types, the `''` escape, and every comparison operator. Compute their sha256 once from the Rust reference; freeze the values in:
  - `scripts/cross_test.sh` (as `want_hash` cases)
  - `src/go/sql_test.go` (`TestFixtureAHash`, `TestFixtureBHash`)
  - `src/cpp/tests/test_sqlfront12.cc` (`test_fixture_a_hash`, `test_fixture_b_hash`)
  - `CONCEPTS.md` (frozen-hash table)
- Write `scripts/verify.sh`: builds + unit-tests the three languages; prints `=== OK ===` on success.
- Write `scripts/cross_test.sh`:
  - Build the three `sqlctl` binaries.
  - For each fixture, run `sqlctl parse --file FIX` for all three; assert all three stderr hashes match each other and match the frozen value; assert the CLI hash equals `shasum -a 256` of stdout; assert the bytes are bit-identical (`cmp -s`).
  - Inline-arg smoke test: `sqlctl parse --inline 'SELECT * FROM t;'` must match across languages.
  - Error-path smoke test: feed `SELECT FROM t;` to all three; each must exit non-zero with a stderr line that mentions the column.
  - Print `=== ALL OK ===` on success.
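The encoding primitives named in the first task can be sketched in Rust as below. The per-statement record layout lives in `CONCEPTS.md` and is not reproduced here, so `serialize_sketch` takes already-encoded records as opaque byte vectors. The helper names, and the choice to write the literal tag byte immediately before the literal's payload, are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Literal {
    Int(i64),
    Text(String),
}

/// Unsigned 32-bit little-endian.
fn put_u32(out: &mut Vec<u8>, n: u32) {
    out.extend_from_slice(&n.to_le_bytes());
}

/// `u32 LE length` followed by the raw UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len() as u32);
    out.extend_from_slice(s.as_bytes());
}

/// Tag byte (Int=1, Text=2) then the payload. Placing the tag directly
/// before the payload is an assumption of this sketch.
fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(n) => {
            out.push(1);
            out.extend_from_slice(&n.to_le_bytes()); // i64 LE
        }
        Literal::Text(s) => {
            out.push(2);
            put_str(out, s);
        }
    }
}

/// Magic header, statement count, then one caller-supplied record per statement.
pub fn serialize_sketch(records: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSESQL01");
    put_u32(&mut out, records.len() as u32);
    for r in records {
        out.extend_from_slice(r);
    }
    out
}
```

Every integer goes through `to_le_bytes`, so the output does not depend on host endianness, which is the property the cross-language hash check relies on.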
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
rust=071b40fd... ( 181 B)
go =071b40fd... ( 181 B)
cpp =071b40fd... ( 181 B)
match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
rust=e219f1ee... ( 486 B)
...
match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
inline hash: 941f2125...
=== error-path smoke test ===
[rust] parse error at line 1 col 8: expected identifier
[go] parse error at line 1 col 8: expected identifier
[cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===
Inline unit tests (mirror across three languages):
- `serialize_header_and_count`: output starts with `"DSESQL01"` + the correct `u32 LE` statement count.
- `serialize_is_deterministic`: `serialize(ast) == serialize(ast)` byte-for-byte on a non-trivial AST.
- `sha256_known_vectors`: `sha256("")` and `sha256("abc")` match the FIPS 180-4 reference vectors.
Discussion prompts
- Why is the cross-language sha256 match a near-proof of correctness rather than an actual proof? What kind of bug could match anyway?
- The `b_full.sql` test is 486 bytes. Why is that more interesting than a 100k-byte randomly generated SQL file with the same hash check?
- If we wanted to add `LIMIT N` to the SELECT grammar tomorrow, what would the smallest backwards-compatible change to the wire format look like? Why does that question matter the first time we want to evolve the AST?
db-13 — Transactions and MVCC
What is it?
A multi-version concurrency control key-value store with snapshot isolation semantics, in pure memory, ported byte-identically across Rust, Go, and C++. There is no disk, no log, no recovery — only the core MVCC machinery: per-key version chains, a single timestamp oracle, optimistic write-set conflict detection at commit time, and a garbage collector that respects active snapshots.
Every key holds a `Vec<Version>` sorted ascending by `commit_ts`, where
a `Version` is `{ commit_ts: u64, payload: Option<Bytes> }` and an absent
payload (`None`) marks a committed tombstone. A transaction at `start_ts` reads
the newest version with commit_ts <= start_ts, ignoring everything
written after it began. On commit, the transaction's write-set is
checked against the chain — if any key has a committed version with
commit_ts > start_ts, the commit aborts with a write-write conflict;
otherwise the transaction's writes are appended under a freshly issued
commit_ts.
The lab's load-bearing artifact is a canonical byte serializer for the entire store and a deterministic multi-worker workload. The serialized bytes hash to the same SHA-256 in all three languages.
Why does it matter?
This is the lab where transactions become real. Every storage engine so far in the project has been single-writer or last-write-wins. The moment two transactions can race to update the same key, you need to decide what the database does about it, and that decision shapes everything from the API up to the failure model.
Snapshot isolation, or MVCC snapshot reads underneath a stronger level, is the dominant choice in modern engines:
- Postgres implements SI as its `REPEATABLE READ` level and a stricter serializable variant (SSI) for `SERIALIZABLE`; its default, `READ COMMITTED`, takes a fresh snapshot per statement.
- TiKV follows Percolator-style MVCC with snapshot reads and optimistic commit; CockroachDB and FoundationDB also read from MVCC snapshots but target serializable isolation.
- Microsoft Hekaton is a pure in-memory MVCC engine almost identical in shape to this lab.
- RocksDB's `Transaction` layer implements optimistic and pessimistic MVCC on top of LSM versions.
What MVCC buys you is the property that readers never block writers and writers never block readers. The cost is space (multiple versions per key) and a garbage-collection problem (when can the old versions be dropped without breaking some live snapshot?). This lab confronts both.
It also forces the engineer to internalize a precise statement of what SI does not give you: the write-skew anomaly. It is a staple of database interviews because snapshot isolation is so easily conflated with serializability.
How does it work?
┌──────────── timestamp oracle (atomic u64) ────────────┐
│ begin() → start_ts; commit() → commit_ts │
└───────────────────────────────────────────────────────┘
│ │
┌───────────▼──────────┐ ┌─────────▼─────────┐
│ Txn { start_ts, │ │ Store { │
│ writes: BTree, │ put │ chains: BTree< │
│ closed: bool } │ ────────►│ key → Vec< │
│ │ del │ Version>>, │
│ get(k): │ │ active_starts, │
│ 1. local writes │ get │ oracle │
│ 2. chain[k] newest │ ◄────── │ } │
│ with commit_ts │ │ │
│ ≤ start_ts │ │ │
│ │ commit │ conflict-check, │
│ commit(): │ ───────►│ then append at │
│ conflict-check │ │ commit_ts │
│ then publish │ │ │
└──────────────────────┘ └───────────────────┘
│
│ gc(below_ts)
▼
drop v[i] iff exists v[i+1]
with commit_ts ≤
min(below_ts, oldest_active)
The five operations
- `begin()`: atomically increments the oracle, calls the resulting number `start_ts`, and registers it in the active-starts multiset.
- `get(k)`: first checks the txn's local write-set (read-your-own-writes), then walks the chain for `k` from newest to oldest looking for the first version with `commit_ts <= start_ts`. Returns `None` if that version is a tombstone or no such version exists.
- `put(k, v)` / `delete(k)`: buffer into a per-txn `BTreeMap`. No store I/O.
- `commit()`: under the store mutex:
  1. for each key in the write-set, fail with `Conflict { key, conflicting_ts }` if the chain's newest version has `commit_ts > start_ts`;
  2. otherwise allocate `commit_ts` from the oracle;
  3. append each local write to the chain under `commit_ts`;
  4. remove `start_ts` from the active set.

  A read-only commit (`writes.is_empty()`) skips steps 1–3 and just retires from the active set.
- `abort()`: discards the write-set and retires from the active set. Idempotent. `Drop` / destructor calls `abort()` if neither `commit()` nor `abort()` ran.
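The operations above can be sketched as a single-threaded Rust store. This is an illustration under stated assumptions, not the lab's implementation: the mutex and `abort()`/`Drop` handling are omitted, `commit` returns `Ok(None)` for a read-only transaction, and a failed commit is treated as retiring the transaction. The type and method names beyond those in the text are hypothetical.

```rust
use std::collections::BTreeMap;

type Bytes = Vec<u8>;

#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub commit_ts: u64,
    pub payload: Option<Bytes>, // None = tombstone
}

#[derive(Default)]
pub struct Store {
    oracle: u64,                           // last timestamp issued
    chains: BTreeMap<Bytes, Vec<Version>>, // each chain ascending by commit_ts
    active: BTreeMap<u64, usize>,          // start_ts -> refcount
}

pub struct Txn {
    pub start_ts: u64,
    writes: BTreeMap<Bytes, Option<Bytes>>,
}

#[derive(Debug, PartialEq)]
pub struct Conflict {
    pub key: Bytes,
    pub conflicting_ts: u64,
}

impl Txn {
    pub fn put(&mut self, k: &[u8], v: &[u8]) {
        self.writes.insert(k.to_vec(), Some(v.to_vec()));
    }
    pub fn delete(&mut self, k: &[u8]) {
        self.writes.insert(k.to_vec(), None);
    }
}

impl Store {
    pub fn begin(&mut self) -> Txn {
        self.oracle += 1;
        *self.active.entry(self.oracle).or_insert(0) += 1;
        Txn { start_ts: self.oracle, writes: BTreeMap::new() }
    }

    pub fn get(&self, txn: &Txn, key: &[u8]) -> Option<Bytes> {
        if let Some(local) = txn.writes.get(key) {
            return local.clone(); // read-your-own-writes; None = local delete
        }
        let chain = self.chains.get(key)?;
        chain
            .iter()
            .rev() // newest to oldest
            .find(|v| v.commit_ts <= txn.start_ts)
            .and_then(|v| v.payload.clone())
    }

    pub fn commit(&mut self, txn: Txn) -> Result<Option<u64>, Conflict> {
        self.retire(txn.start_ts); // sketch assumption: a failed commit also retires
        if txn.writes.is_empty() {
            return Ok(None); // read-only: nothing to check or publish
        }
        for key in txn.writes.keys() {
            if let Some(newest) = self.chains.get(key).and_then(|c| c.last()) {
                if newest.commit_ts > txn.start_ts {
                    return Err(Conflict { key: key.clone(), conflicting_ts: newest.commit_ts });
                }
            }
        }
        self.oracle += 1;
        let commit_ts = self.oracle;
        for (key, payload) in txn.writes {
            self.chains.entry(key).or_default().push(Version { commit_ts, payload });
        }
        Ok(Some(commit_ts))
    }

    fn retire(&mut self, start_ts: u64) {
        let drop_entry = match self.active.get_mut(&start_ts) {
            Some(n) => {
                *n -= 1;
                *n == 0
            }
            None => false,
        };
        if drop_entry {
            self.active.remove(&start_ts);
        }
    }
}
```

Because the write-set is a `BTreeMap`, the conflict loop visits keys in sorted order, so a multi-key conflict reports the same key in every run.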
The active set and GC
The store keeps a refcount-multiset of currently-active start_ts
values. gc(below_ts) computes
cutoff = min(below_ts, oldest_active_start_ts)
and for every chain, drops every prefix version v[i] such that
v[i+1] exists with v[i+1].commit_ts <= cutoff. The newest version
is always retained — future readers may still need it (or its
tombstone).
The reasoning: any reader at `start_ts >= cutoff` will pick the newest
version with `commit_ts <= start_ts`, never `v[i]` from the dropped
prefix, because `v[i+1]` is also visible to them and is newer. Readers
with `start_ts < cutoff` cannot exist: every active transaction has
`start_ts >= oldest_active >= cutoff`, and any transaction that begins
later draws a `start_ts` above every existing `commit_ts`, so it reads a
retained version.
This is the same reasoning Postgres VACUUM uses with xmin/xmax
and OldestXmin, the same reasoning TiKV uses with its "safe point",
and the same reasoning Hekaton's GC uses with its "oldest active
transaction".
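The pruning rule can be written per chain in a few lines of Rust. This is a sketch: `gc_chain` is a hypothetical helper that takes the oldest active `start_ts` as an `Option` (no active transactions means the cutoff is just `below_ts`) and returns how many versions it dropped.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub commit_ts: u64,
    pub payload: Option<Vec<u8>>,
}

/// Prune one chain (ascending by commit_ts); returns the number of versions dropped.
pub fn gc_chain(chain: &mut Vec<Version>, below_ts: u64, oldest_active: Option<u64>) -> usize {
    let cutoff = oldest_active.map_or(below_ts, |a| a.min(below_ts));
    // Index of the newest version with commit_ts <= cutoff. Every version
    // before it has a successor that is also <= cutoff, so it is shadowed.
    let keep_from = chain
        .iter()
        .rposition(|v| v.commit_ts <= cutoff)
        .unwrap_or(0);
    chain.drain(..keep_from).count()
}
```

For a chain with commit timestamps `[2, 5, 9, 12]` and a cutoff of 10, versions 2 and 5 are dropped; 9 stays because a reader at timestamp 10 or 11 still resolves to it, and 12 stays because it is the newest.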
Snapshot isolation, not serializable
The commit-time check looks at the write-set only. It does not look at the read-set. This means:
- Two txns reading the same key and updating the same key → exactly one wins. (Lost-update is prevented.)
- Two txns reading the same key and updating different keys based on their reads → both can succeed. (Write skew is allowed.)
The classic write-skew anomaly:
T1: r(x); r(y); if x+y >= 0: w(x, -100)
T2: r(x); r(y); if x+y >= 0: w(y, -100)
Started with x=0, y=0, both txns observe x+y=0, both write, both
commit (different keys → no write-set overlap). The post-state is
x=-100, y=-100, which no serial schedule of T1 then T2 (or T2 then
T1) can produce. Snapshot isolation will allow this. Serializable SI
(Postgres SSI) catches it via dangerous-structure detection on read
dependencies. We deliberately do not implement that — db-13 is the
smallest faithful SI engine.
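The write-skew schedule can be replayed against a toy model of the commit rule. The Rust sketch below is self-contained and deliberately tiny: keys are `char`s, values are `i64`, and timestamps are passed in by hand rather than drawn from an oracle. All names are hypothetical.

```rust
use std::collections::BTreeMap;

/// key -> versions as (commit_ts, value), ascending by commit_ts.
type Chains = BTreeMap<char, Vec<(u64, i64)>>;

/// Snapshot read: newest version with commit_ts <= start_ts.
fn read(chains: &Chains, key: char, start_ts: u64) -> i64 {
    chains[&key]
        .iter()
        .rev()
        .find(|(ts, _)| *ts <= start_ts)
        .map(|(_, v)| *v)
        .expect("key has a visible version")
}

/// Conflict check on the write-set only; the read-set is never consulted.
fn commit(chains: &mut Chains, start_ts: u64, commit_ts: u64, writes: &[(char, i64)]) -> bool {
    let conflict = writes
        .iter()
        .any(|(k, _)| chains[k].last().map_or(false, |(ts, _)| *ts > start_ts));
    if conflict {
        return false;
    }
    for (k, v) in writes {
        chains.get_mut(k).unwrap().push((commit_ts, *v));
    }
    true
}

/// Returns (T1 committed, T2 committed, final x, final y).
pub fn write_skew_demo() -> (bool, bool, i64, i64) {
    let mut chains: Chains = BTreeMap::new();
    chains.insert('x', vec![(1, 0)]);
    chains.insert('y', vec![(1, 0)]);
    let (t1_start, t2_start) = (2, 3);
    // Both transactions read before either commits, so both see x + y = 0.
    let t1_sees = read(&chains, 'x', t1_start) + read(&chains, 'y', t1_start);
    let t2_sees = read(&chains, 'x', t2_start) + read(&chains, 'y', t2_start);
    let ok1 = t1_sees >= 0 && commit(&mut chains, t1_start, 4, &[('x', -100)]);
    let ok2 = t2_sees >= 0 && commit(&mut chains, t2_start, 5, &[('y', -100)]);
    (ok1, ok2, read(&chains, 'x', 6), read(&chains, 'y', 6))
}
```

Both commits succeed because the write-sets `{x}` and `{y}` are disjoint. If both transactions wrote `x` instead, the second commit would fail, which is the lost-update case SI does prevent.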
Cross-language invariant
mvccctl workload --seed S --ops N --keys K --writers W --readers R --scenario {writeheavy|mixed|conflicting} is the cross-language
contract:
- Identical SplitMix64 PRNG seeded with `S`.
- Each op draws three samples: `worker_idx = r1 % (W+R)`, `key_idx = r2 % K`, `payload = (u32)r3` big-endian.
- Workers `0..W` are writers (they `put` then commit every 4 ops); workers `W..W+R` are readers (they `get` then commit every 4 ops).
- Open transactions are drained at the end.
The store is then serialized via the canonical dump and SHA-256-hex'd to stdout (no trailing newline).
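A Rust sketch of the generator and the three-sample draw follows. The SplitMix64 constants are the ones from Vigna's public-domain reference code; the `Op` struct and `draw_op` function are hypothetical names, and the order in which the three samples are consumed (`r1`, then `r2`, then `r3`) is assumed from the order the text lists them.

```rust
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        // Wrapping arithmetic matches the unsigned overflow of the C reference.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

pub struct Op {
    pub worker_idx: u64,
    pub key_idx: u64,
    pub payload: [u8; 4], // low 32 bits of r3, big-endian
}

/// Draw one workload op: three consecutive samples from the same stream.
pub fn draw_op(rng: &mut SplitMix64, writers: u64, readers: u64, keys: u64) -> Op {
    let r1 = rng.next_u64();
    let r2 = rng.next_u64();
    let r3 = rng.next_u64();
    Op {
        worker_idx: r1 % (writers + readers),
        key_idx: r2 % keys,
        payload: (r3 as u32).to_be_bytes(),
    }
}
```

The explicit `wrapping_*` calls matter in Rust: plain `+` and `*` would panic on overflow in debug builds, while Go and C++ unsigned arithmetic wraps silently.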
Wire format
"DSEMVCC1" (8 ASCII bytes)
u64 LE next_ts ← oracle + 1
u32 LE key_count
per key (sorted ascending by raw key bytes):
u32 LE klen
key bytes
u32 LE version_count
per version (ascending by commit_ts):
u64 LE commit_ts
u8 has_value 0 = tombstone, 1 = value
if has_value:
u32 LE vlen
vbytes
All integers are unsigned little-endian. Keys and values are
length-prefixed; no null terminators, no escapes. `next_ts` is `oracle + 1`
to match the next value begin() would issue — this makes the dump
round-trippable: a future MvccStore::load can resume the oracle
exactly.
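The layout above maps onto a short Rust function. This is a sketch assuming the store exposes its oracle value and a `BTreeMap` of chains; the `dump` signature is hypothetical.

```rust
use std::collections::BTreeMap;

pub struct Version {
    pub commit_ts: u64,
    pub payload: Option<Vec<u8>>, // None = tombstone
}

pub fn dump(oracle: u64, chains: &BTreeMap<Vec<u8>, Vec<Version>>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEMVCC1");
    out.extend_from_slice(&(oracle + 1).to_le_bytes()); // next_ts
    out.extend_from_slice(&(chains.len() as u32).to_le_bytes());
    // BTreeMap<Vec<u8>, _> iterates in ascending raw-byte key order.
    for (key, versions) in chains {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(versions.len() as u32).to_le_bytes());
        for v in versions {
            out.extend_from_slice(&v.commit_ts.to_le_bytes());
            match &v.payload {
                None => out.push(0), // tombstone: no vlen, no bytes
                Some(bytes) => {
                    out.push(1);
                    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
                    out.extend_from_slice(bytes);
                }
            }
        }
    }
    out
}
```

An empty store therefore dumps to exactly 20 bytes: 8 for the magic, 8 for `next_ts`, and 4 for a zero `key_count`.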
Why these particular determinism guarantees
- Key iteration order: `std::map<Bytes,...>` (C++), sorted slice (Go), `BTreeMap` (Rust). Never raw map iteration in any port.
- Within-key version order: natural append order (we always append at the newest `commit_ts`), reinforced by the chain being a `Vec`, not a set.
- Per-txn write-set order at commit: `BTreeMap` / sorted keys. This is not visible in the dump itself (writes from a single commit share `commit_ts`), but it determines which key a multi-key conflict reports, which matters for the error tests.
- Workload PRNG: single-threaded SplitMix64 stream with the exact constants Sebastiano Vigna published. No `rand` crate, no `math/rand`, no `<random>`; those are NOT cross-implementation stable.
Frozen reference hashes
| Scenario | --seed --ops --keys --writers --readers --scenario | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 16 --writers 4 --readers 4 --scenario mixed | 67d65acae63d8612114131a679c02912b7f8f63df10bce30a2b0def810b7c547 |
| B | --seed 7 --ops 2000 --keys 4 --writers 8 --readers 2 --scenario conflicting | 11433ba130a81a092743c08791f9790c4f148607eef1e23c163a20e354c03824 |
Any change to the MVCC semantics, the workload generator, the wire
format, or any defaulting in the CLI must update those numbers in
scripts/cross_test.sh, the Go test (mvcc_test.go), the C++ test
(tests/test_mvcc13.cc), and this table — all in the same commit.
What's intentionally out of scope
- Durability. No WAL, no fsync, no crash recovery. The whole store vanishes on process exit. Adding a WAL on top is db-21 work.
- Serializability. Snapshot isolation only; we deliberately allow write skew. SSI is a follow-on lab.
- Read-set tracking. A txn does not remember which keys it read. Without that, SSI cannot detect anti-dependency cycles.
- Locks. The store uses a single coarse mutex for clarity. A real in-memory MVCC engine (Hekaton, MemSQL) uses lock-free version installation with CAS on the chain head; we leave that to db-21.
- Distributed timestamps. The oracle is a single atomic counter, not an HLC or TrueTime. Spanner / CRDB / TiKV-style distribution is db-16+ territory.
- Range scans, secondary indexes, predicates. Single-key get / put / delete only. db-14 layers indexes on top.
db-13 — References
Foundational textbooks
- Bernstein, Hadzilacos, Goodman. Concurrency Control and Recovery in Database Systems (Addison-Wesley, 1987). The canonical treatment of serialization theory: conflict-serializability, view-serializability, locking protocols, multi-version graphs. Chapter 5 ("Multiversion Concurrency Control") is the textbook derivation of the version-chain abstraction used in this lab. The whole book is freely available as a scanned PDF; the proofs of MVSR vs CSR equivalence are required reading for anyone who wants to know why SI is a thing.
- Weikum & Vossen. Transactional Information Systems (Morgan Kaufmann, 2002). Modernizes the Bernstein treatment with page-model vs object-model schedules and chapter-length coverage of optimistic CC, snapshot isolation, and recovery. The treatment of the generalized SI anomaly catalog is the cleanest in print.
Snapshot isolation: definitional papers
- Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil. "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995). The paper that defines snapshot isolation precisely, names the anomalies (lost-update, dirty-read, fuzzy-read, phantom, A5A read-skew, A5B write-skew), and shows the ANSI standard's English-prose definitions are inadequate. Every claim in our CONCEPTS.md about what SI does and does not give you traces directly to this paper.
- Fekete, Liarokapis, O'Neil, O'Neil, Shasha. "Making Snapshot Isolation Serializable" (TODS 2005). The dangerous-structure theorem that underpins Postgres's SSI. Required reading if you want to understand what the next lab over from this one would add.
Production MVCC engines
- PostgreSQL 16 documentation, chapter 13 ("Concurrency Control"). https://www.postgresql.org/docs/16/mvcc.html. Postgres's xmin/xmax hidden columns are exactly our `commit_ts` / tombstone scheme, just with the tombstone collapsed into the next row's `xmin`. Chapter 13.6 ("Caveats") names write-skew explicitly.
- PostgreSQL `src/backend/access/heap/heapam.c` and `src/backend/utils/time/snapmgr.c`. The C implementation of `HeapTupleSatisfiesMVCC`, `GetOldestXmin`, and `VACUUM`'s visibility logic. Our `gc(below_ts)` is a faithful (single-tenant, single-shard) port of `OldestXmin`-based pruning.
- Peng & Dabek. "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (OSDI 2010). The Google Percolator paper. Defines the two-timestamp (`start_ts`, `commit_ts`) protocol on top of Bigtable that became the template for TiKV, CockroachDB's earliest design, and YugabyteDB. Our single-node oracle is the trivial special case of the Percolator TSO.
- Diaconu, Freedman, Ismert, Larson, Mittal, Stonecipher, Verma, Zwilling. "Hekaton: SQL Server's Memory-Optimized OLTP Engine" (SIGMOD 2013). The deepest publicly available description of an in-memory MVCC engine. Section 3 ("Concurrency Control") describes their lock-free version installation, their GC ("oldest active transaction" again), and their decision to ship both optimistic and pessimistic SI variants. Our store is Hekaton with the locks added back and the latches removed.
- Wu, Arulraj, Lin, Xian, Pavlo. "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" (VLDB 2017). The paper that catalogues, benchmarks, and ranks the MVCC design decisions (storage layout, version-chain ordering, GC strategy, index pointer to head vs tail). It is the single most useful paper for anyone designing an MVCC engine from scratch.
- Kemper & Neumann. "HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots" (ICDE 2011). HyPer uses `fork()` for snapshots instead of version chains, a fascinating alternative that this lab does not implement but every engineer should know exists.
SI in distributed systems
- Sovran, Power, Aguilera, Li. "Transactional Storage for Geo-Replicated Systems" (SOSP 2011), the Walter paper. Defines parallel snapshot isolation (PSI), a weaker form of SI tractable across data centers. Useful framing if you ever wonder why CRDB doesn't just run plain SI.
- Bailis, Davidson, Fekete, Ghodsi, Hellerstein, Stoica. "Highly Available Transactions: Virtues and Limitations" (VLDB 2014). Maps the entire CAP / isolation landscape onto availability. SI cannot be provided by a system that stays highly available under network partitions; this paper tells you exactly where the line is.
Lecture material worth the read
- CMU 15-721 ("Advanced Database Systems") lectures by Andy Pavlo, Spring 2023. Lecture 04 "Multi-Version Concurrency Control" walks through Postgres / Hekaton / HyPer / Oracle in one hour. Slides + recording are on the CMU course page.
- Joe Hellerstein's Berkeley CS 186 notes, "Concurrency Control II". Undergraduate-level but the diagrams of conflict graphs and the worked write-skew example are the clearest I have seen.
Lab cross-references
- db-09 (LevelDB Complete): the storage engine these transactions could one day be layered on top of. The LSM's sequence numbers are essentially `commit_ts` in disguise.
- db-12 (SQL Frontend): produces the AST that this engine would execute. The natural db-13.5 lab would wire them together.
- db-14 (Indexes and Query Optimization): adds secondary indexes; under MVCC, indexes need their own version chains or a pointer-to-head + tuple-side timestamp scheme. See Wu et al. §4.
- db-16+ (Distributed Fundamentals, Raft, Paxos): replace the single-node oracle with a distributed timestamp service. The semantic model carries over unchanged.
Indexes and Query Optimization
1. What Is It
A secondary index is an auxiliary data structure that maps each value of a
non-primary-key column to the set of row-ids that contain it. A query planner
turns a logical query (predicates + projections) into a physical plan tree
(scan → filters → project) and picks an access path per predicate. A
rule-based planner uses fixed heuristics; a cost-based planner consults
statistics. db-14 implements the rule-based half end-to-end and keeps the
cost model deliberately tiny (rows / distinct_keys for =, (rows+2)/3
for ranges) so the byte-for-byte cross-language invariant is tractable.
2. Why It Matters
A SeqScan on N rows costs Θ(N) regardless of selectivity. A point lookup through a sorted (or hashed) index is O(log N) (or O(1) amortised). When predicates have selectivity ≪ 1, the normal case in OLTP, choosing the right access path is the single largest performance lever a database has. And once two physical operators are available, you need a planner. A planner working from bad cardinality estimates can be catastrophically slow on real workloads (see Leis et al., "How Good Are Query Optimizers, Really?", PVLDB 2015), but the planner is also the concept through which every later optimisation (joins, partitioning, parallelism) gets expressed.
3. How It Works
Query{projections, predicates}
│
▼
┌───────────────┐
│ Planner │ rule-based:
│ estimate per │ • Eq → rows/distinct
│ indexable pred│ • Lt/Le/Gt/Ge → (rows+2)/3
│ pick min │ • Ne / no-index → SeqScan
└──────┬────────┘
▼
Plan{ Pipeline[ scan, *Filter, Project? ] }
│
▼
┌───────────────┐
│ Executor │ Volcano-style: scan rows,
│ scan→filter │ retain on predicate,
│ →project │ rewrite columns at end.
└──────┬────────┘
▼
[]Row
Indexes are BTreeMap<Value, Vec<row_id>> in Rust, std::map in C++, and a
sorted slice of (Value, []row_id) in Go. Insertion order is preserved
inside each bucket, which (combined with ascending key traversal) gives a
total, deterministic output order shared across all three implementations.
4. Core Terminology
| Term | Definition |
|---|---|
| Secondary index | Sorted map from column value → list of row-ids. |
| Access path | Concrete way to read rows for a predicate (SeqScan vs IndexScan). |
| Selectivity | Fraction of rows a predicate keeps; 0 ≤ s ≤ 1. |
| Cardinality estimate | Predicted row count out of an operator. |
| Pipeline | Linear chain of operators evaluated row-at-a-time (Volcano model). |
| Rule-based optimizer | Picks plans from fixed heuristics; no statistics. |
| Cost-based optimizer | Searches plan space; uses statistics + a cost function. |
| Covering index | Index that includes every column the query needs (skip the row lookup). |
| Index-only scan | Read the index without touching the heap. |
| Tuple | A single row of a relation. |
5. Mental Models
- Index = sorted dictionary. Equality is dictionary lookup; range is dictionary `range()`. Everything else is a special case.
- Planner = predicate auctioneer. Each indexable predicate "bids" its estimated row count; the lowest bid wins the scan, the rest become Filters.
- Executor = pipeline. Each operator pulls rows from its child. Operators don't materialise unless they must (project, sort, hash).
- Wire format = correctness oracle. If three languages serialise the same plan and result bytes for the same query, they agree on planning + execution semantics. The SHA-256 collapses N MB of bytes into a 64-char string we can put in a `case` statement.
6. Common Misconceptions
- "Indexes always speed up reads." False for predicates that keep most rows (selectivity near 1): a SeqScan reads the heap once, while an IndexScan dereferences each matching row-id, which can be slower when most rows match.
- "More indexes is free." Every write must update every index, and indexes cost RAM/disk.
- "Rule-based is obsolete." Production planners still lean on fixed heuristics and rewrite rules alongside cost-based search, because cost-based planning has its own pathologies (bad stats → catastrophic plans).
- "Selectivity = 1 / distinct keys." Only under uniform-distribution assumption. Skewed data needs histograms or sampling.
- "Project is free." Wide projection through long pipelines materialises copies; columnar engines avoid this; row engines pay for it.
7. Interview Talking Points
- Explain how a B-tree index supports both point and range queries, and what changes for a hash index (point only, no ordering).
- Walk through plan selection: predicates → estimates → cheapest scan → remaining as filters → optional project.
- Why is index iteration order important? Determinism, merge-join inputs, ORDER BY elimination.
- Explain Volcano vs vectorised execution. Tradeoffs?
- What is a covering index, and when does it dominate?
- Discuss `EXPLAIN` output: how do you read a query plan?
- What can go wrong when the planner picks a SeqScan instead of an IndexScan (or vice versa)? Stale statistics, correlated predicates, type coercions that disable the index.
8. Connections to Other Labs
- db-02 (data structures) introduced sorted maps; this lab puts them to work.
- db-10 (B-tree) is the persistent counterpart of the in-memory index here.
- db-12 (SQL frontend) produces the `Query` struct planners consume.
- db-13 (transactions/MVCC) governs which rows the index sees per snapshot.
- db-15 (SQLite-complete) stitches all of the above into a real engine.
- db-22 (perf/benchmarking) measures the planner choices we make here.
9. Frozen Wire Format
Plan stream = 0x05 (PIPELINE) | u32 LE child_count | child*
Child node tags:
0x01 SeqScan | u32 LE table_id(=0)
0x02 IndexScan | u32 LE col_idx | u8 op | u8 val_tag | <val>
0x03 Filter | u32 LE col_idx | u8 op | u8 val_tag | <val>
0x04 Project | u32 LE col_count | (u32 LE col_idx)*
Op codes : Eq=1 Ne=2 Lt=3 Le=4 Gt=5 Ge=6
Val tags : Int=1 (i64 LE) ; Text=2 (u32 LE len | bytes)
Result stream = "DSEQR01" (7 bytes) | u32 LE row_count |
per row: u32 LE col_count | (u8 tag | <val>)*
Both streams are concatenated per op; the SHA-256 of that concatenation, across N ops, is the byte-identity oracle for the cross-language test.
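The plan-stream half of the format can be sketched as a Rust encoder. The byte layout follows the tables above; the `Node` and `Val` types and the `encode_plan` name are hypothetical stand-ins for whatever the lab's planner actually produces.

```rust
pub enum Val {
    Int(i64),
    Text(String),
}

pub enum Node {
    SeqScan,
    IndexScan { col_idx: u32, op: u8, val: Val },
    Filter { col_idx: u32, op: u8, val: Val },
    Project { cols: Vec<u32> },
}

fn put_u32(out: &mut Vec<u8>, n: u32) {
    out.extend_from_slice(&n.to_le_bytes());
}

/// Shared body of IndexScan (0x02) and Filter (0x03):
/// tag | u32 LE col_idx | u8 op | u8 val_tag | <val>
fn put_pred(out: &mut Vec<u8>, tag: u8, col_idx: u32, op: u8, val: &Val) {
    out.push(tag);
    put_u32(out, col_idx);
    out.push(op);
    match val {
        Val::Int(n) => {
            out.push(1);
            out.extend_from_slice(&n.to_le_bytes());
        }
        Val::Text(s) => {
            out.push(2);
            put_u32(out, s.len() as u32);
            out.extend_from_slice(s.as_bytes());
        }
    }
}

pub fn encode_plan(children: &[Node]) -> Vec<u8> {
    let mut out = vec![0x05u8]; // PIPELINE
    put_u32(&mut out, children.len() as u32);
    for child in children {
        match child {
            Node::SeqScan => {
                out.push(0x01);
                put_u32(&mut out, 0); // table_id is always 0
            }
            Node::IndexScan { col_idx, op, val } => put_pred(&mut out, 0x02, *col_idx, *op, val),
            Node::Filter { col_idx, op, val } => put_pred(&mut out, 0x03, *col_idx, *op, val),
            Node::Project { cols } => {
                out.push(0x04);
                put_u32(&mut out, cols.len() as u32);
                for c in cols {
                    put_u32(&mut out, *c);
                }
            }
        }
    }
    out
}
```

A bare `SeqScan` plan encodes to ten bytes: `05 01 00 00 00 01 00 00 00 00`.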
References
Papers
- Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., & Price, T. G. (1979). Access Path Selection in a Relational Database Management System. SIGMOD '79. The System R paper. Defines cost-based optimisation, dynamic programming over join orders, selectivity estimation. Every modern planner is a variation on this design.
- Graefe, G. (1994). Volcano: An Extensible and Parallel Query Evaluation System. IEEE TKDE 6(1). The iterator-based "open/next/close" execution model used here. db-14's Executor is a flattened Volcano.
- Graefe, G. (1995). The Cascades Framework for Query Optimization. Data Eng. Bulletin 18(3). Rule-based + cost-based search via memoisation on plan equivalence classes. Cascades is what SQL Server and CockroachDB derive from.
- Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases. The reference on B-trees: covers concurrent access, range scans, prefix compression, all relevant to "what an index is".
- Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How Good Are Query Optimizers, Really? PVLDB 9(3). Empirical study showing that cardinality estimation errors dwarf cost-model errors; motivates why even very simple planners can be competitive.
Books
- Bailis, P., Hellerstein, J. M. & Stonebraker, M. (eds, 2015). Readings in Database Systems (the "Red Book"), 5th edition. Chapters on query processing and optimisation. Free online.
- Garcia-Molina, H., Ullman, J. D., & Widom, J. (2008). Database Systems: The Complete Book, 2nd ed. Chapters 15 (query execution) and 16 (the query compiler, including optimisation).
- Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems, 3rd ed. Chapters 12–15 cover query evaluation, external sorting, and optimisation; chapters 8–11 cover indexing.
Production system docs
- SQLite, Query Planner Overview. https://www.sqlite.org/queryplanner.html The Next-Generation Query Planner doc (https://www.sqlite.org/queryplanner-ng.html) describes the "N Nearest Neighbors" heuristic SQLite uses. A clean read on rule vs cost planning in a real engine.
- PostgreSQL, Planner/Optimizer. https://www.postgresql.org/docs/current/planner-optimizer.html Authoritative on cost constants, statistics (`pg_statistic`), and the GEQO genetic optimiser for large join graphs.
Source code
- SQLite `where.c`: single-file implementation of the planner. ~10k LoC of cost-based reasoning over WHERE clauses. The reference.
- LevelDB `db/version_set.cc`: for a non-SQL planner-style scoring function on file-picking in compaction.
- CockroachDB `pkg/sql/opt/`: Cascades-style optimiser in Go.
Analysis
Goal
Build the smallest end-to-end query engine that nonetheless exercises the three concepts a real planner must get right:
- Access-path selection — choose between SeqScan and IndexScan.
- Predicate ordering — apply the most selective predicate first.
- Projection placement — only carry the columns the query asked for.
All three must be deterministic across Rust, Go, and C++, because the artifact under test is the SHA-256 of the serialised plan + result bytes.
Scope
Pure in-memory. No SQL parser (queries are constructed structurally). No
persistence. No transactions. One table, fixed three-column schema
(id INT, name TEXT, age INT). Three scenarios — scanonly, mixed
(index on age), indexheavy (indexes on age and name).
Design Decisions
Index physical form
A BTreeMap<Value, Vec<row_id>> per indexed column. Three reasons:
- Sorted iteration is required for range scans.
- The total order on keys is also a stable iteration order for the cross-language test; map randomisation (Go's default) would break that.
- Each bucket's Vec<row_id> is naturally ascending because rows are appended in row_id order, so no per-bucket sort is needed.
Planner cost model
Deliberately simple, frozen across languages:
| Predicate | Estimate |
|---|---|
| Eq, indexable | rows / distinct_keys |
| Lt/Le/Gt/Ge, indexable | (rows + 2) / 3 |
| Ne | not indexable → SeqScan |
| No matching index | not indexable → SeqScan |
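The (rows + 2) / 3 estimate is integer division that rounds rows / 3 up. It is the
classic "one-third selectivity" default for inequalities, which dates back to
System R and is still PostgreSQL's default when no statistics are available.
rows / distinct_keys for equality is the expected bucket size when values are
spread uniformly over the distinct keys.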
Tie-breaking
If two predicates produce the same estimate, the earlier one wins. This makes the choice deterministic without dragging in input-order-sensitive hashing.
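The cost table and the tie-breaking rule above can be sketched in a few lines of Rust. This is a minimal sketch, not the lab's actual code: the function names `estimate` and `choose`, the `Option<u64>` shape for "has an index", and the guard for an empty index are all assumptions made for illustration.

```rust
/// Comparison operators, mirroring the lab's predicate kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op { Eq, Ne, Lt, Le, Gt, Ge }

/// Estimated row count for one predicate; `None` means "not indexable".
/// `distinct_keys` is `None` when the column has no index (hypothetical shape).
pub fn estimate(op: Op, rows: u64, distinct_keys: Option<u64>) -> Option<u64> {
    let distinct = distinct_keys?; // no matching index -> SeqScan
    match op {
        Op::Ne => None,
        // Assumption: an empty index estimates to zero instead of dividing by zero.
        Op::Eq => Some(if distinct == 0 { 0 } else { rows / distinct }),
        Op::Lt | Op::Le | Op::Gt | Op::Ge => Some((rows + 2) / 3),
    }
}

/// Position of the predicate with the lowest estimate.
/// The strict `<` keeps the earlier predicate when two estimates tie.
pub fn choose(estimates: &[Option<u64>]) -> Option<usize> {
    let mut best: Option<(usize, u64)> = None;
    for (i, est) in estimates.iter().enumerate() {
        if let Some(e) = *est {
            let better = match best {
                None => true,
                Some((_, b)) => e < b,
            };
            if better {
                best = Some((i, e));
            }
        }
    }
    best.map(|(i, _)| i)
}
```

With 500 rows and 100 distinct keys, an equality predicate estimates to 5 rows and a range predicate to 167, so the equality predicate is chosen regardless of its position in the input.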
Plan shape
A Plan is always a single Pipeline. Children are, in order: one scan,
zero or more Filters (the non-chosen predicates), and an optional Project.
No nested pipelines, no joins. Keeps the wire format flat.
Executor model
Volcano-style pull, but materialised at each operator. With at most a few thousand rows per query, the simplicity of materialisation is worth more than the cost of allocation, and it makes the row-emission order trivial to reason about. The "true" pull pipeline is the same code in a streaming shape — the lab doesn't need that subtlety.
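A materialising executor of this kind can be sketched as a loop that replaces the current row vector at each operator. The sketch below is illustrative only: the `Operator` enum is a simplified stand-in (an equality-only `Filter`, no `IndexScan`), and none of these names are taken from the lab's source.

```rust
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Value {
    Int(i64),
    Text(Vec<u8>),
}

pub type Row = Vec<Value>;

/// Simplified operator set (assumption): equality-only filter, no IndexScan.
pub enum Operator {
    SeqScan,
    Filter { col: usize, equals: Value },
    Project { cols: Vec<usize> },
}

/// Runs the pipeline front to back, fully materialising each operator's output.
pub fn execute(table: &[Row], pipeline: &[Operator]) -> Vec<Row> {
    let mut current: Vec<Row> = Vec::new();
    for op in pipeline {
        current = match op {
            // The scan copies every row in storage order.
            Operator::SeqScan => table.to_vec(),
            // The filter keeps matching rows; relative order is preserved.
            Operator::Filter { col, equals } => current
                .into_iter()
                .filter(|row| &row[*col] == equals)
                .collect(),
            // The projection rewrites each row to the requested columns.
            Operator::Project { cols } => current
                .into_iter()
                .map(|row| cols.iter().map(|&c| row[c].clone()).collect())
                .collect(),
        };
    }
    current
}
```

Because every operator consumes a whole vector and produces a whole vector, the emission order is simply the order of the input vector, which is what makes the byte-identity argument easy.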
Failure Modes Considered
- Map randomisation breaks byte identity. Go's default map has a randomised iteration order. We use a parallel sorted slice; explicit sort.Slice is used everywhere a map could leak.
- i64 / u32 endianness. Always little-endian, encoded with explicit byte slicing, never unsafe casts.
- String collation. Text values are compared as raw bytes (std::memcmp / bytes.Compare / slice ==), never via locale-aware comparison.
- Wrong magic length. The result-row magic is exactly 7 bytes DSEQR01 (no NUL terminator). C++ uses std::memcmp(..., 7), never strcmp.
Execution
Tasks Performed
- Schema + Value + Row in all three languages. Value is a tagged union of Int(i64) and Text(Vec<u8>) with a frozen total order (Int < Text cross-kind; natural order within kind).
- Secondary index as BTreeMap-equivalent: std::collections::BTreeMap in Rust, std::map in C++, a parallel sorted slice in Go (since Go's maps are randomised).
- Planner with the cost model from analysis.md. Single pass over predicates, lowest estimate wins.
- Executor that scans, filters, projects in order.
- Wire format (dump_plan, dump_result) using only little-endian primitives so the SHA-256 lines up across all three implementations.
- Workload driver (qplan workload ...) that prints sha256_hex(concat(dump_plan(plan) ++ dump_result(rows))) over N ops.
- Tests: 10 Rust + 11 Go + 11 C++ unit tests covering the eight planner behaviours, plus dump determinism and SHA-256 known-answer vectors.
- Scripts: verify.sh (build + unit tests), cross_test.sh (build all three binaries, run scenarios A and B, assert sha256 identity and match the frozen golden hashes).
Order of Implementation
Rust first (the lab's reference language). Go next, debugged against the
Rust hashes. C++ last, debugged against both. Each language is
self-contained — no shared library, no FFI — so a divergence shows up
immediately as a different cross_test.sh hash.
Pitfalls Encountered
- Go map iteration. The first Go prototype iterated map[Value][]uint32 directly and produced a different hash on every run. Replaced with a sorted []indexEntry slice and a findKey binary search.
- C++ std::map<Value, ...>. Works only if Value::operator< is a strict weak order across kinds; the cross-kind Int < Text rule had to match Rust's PartialOrd derivation.
- Result magic length. The lab spec freezes the magic at 7 bytes (DSEQR01, no terminator). An early C++ port wrote 8 bytes (NUL included) and the cross-test hash diverged at byte 8 of every op. Discovered by diffing the first 16 bytes of each binary's output for op 0.
- u8 op byte for Pred. Rust's enum Op { Eq=1, ... } is #[repr(u8)]; Go and C++ mirror the constants explicitly. A missing #[repr(u8)] was the second source of byte divergence in the first iteration.
What's NOT Implemented
- Joins of any kind.
- Cost-based search over plan equivalence classes (Cascades).
- Histograms / cardinality estimation beyond uniform-distribution.
- Index updates on DELETE (rows are append-only).
- Index merge (combining two IndexScans on different columns).
- Persistence — see db-15 for the persistent counterpart.
Observation
Frozen golden hashes
Both scenarios produce SHA-256 hashes that are byte-identical across the
Rust, Go, and C++ implementations. These are burned into scripts/cross_test.sh;
scenario A's hash is also anchored in the Go and C++ test suites (test 11).
| ID | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 200 --rows 500 --scenario mixed | 3918bc6eca225f1c9c004fdcefa6551788282a4a2223fa98b002e8b54eb74a2e |
| B | --seed 7 --ops 500 --rows 2000 --scenario indexheavy | 9313fe694db38912a814abc16600d82f82ead7fc053e813af4bb3978c8fa9abd |
If either hash changes, the wire format has drifted — CONCEPTS.md
section 9 and all three test suites must be updated in lockstep.
Byte walkthrough — first op of scenario A
Scenario A drives 200 ops against a 500-row table with an index on column 2
(age). The first op draws (r3, r4, r5) from SplitMix64(42), which gives
kind = (r3 >> 60) & 3 = 0, that is, an EQ predicate on col = r4 % 3 with value
pick_val_for(col, r5, 500).
Concretely the first op produces a Plan of:
Pipeline 0x05 0x01 0x00 0x00 0x00 // 1 child
IndexScan col=2 op=Eq Int(v) 0x02 0x02 0x00 0x00 0x00 // col_idx=2
0x01 // Op::Eq
0x01 // VTAG_INT
<i64 LE v> // 8 bytes
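That is 20 bytes total for the plan dump: 5 (Pipeline tag + child count) + 5 (IndexScan tag + col_idx) + 1 (op) + 1 (value tag) + 8 (value). The result row stream is: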
"DSEQR01" 0x44 0x53 0x45 0x51 0x52 0x30 0x31
u32 LE row_count ....
per row: u32 col_count=3 | (tag|val) * 3
The 7-byte magic is deliberate — the length is part of the byte-identity contract.
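The walkthrough can be reproduced with explicit little-endian writes. The sketch below encodes only the one plan shape shown above (a Pipeline with a single Eq IndexScan on an Int value) plus the result header; the tag values are the ones listed in the walkthrough, while the function names `dump_eq_index_scan_plan` and `result_header` are hypothetical.

```rust
// Tag values as listed in the walkthrough.
pub const TAG_INDEX_SCAN: u8 = 0x02;
pub const TAG_PIPELINE: u8 = 0x05;
pub const OP_EQ: u8 = 0x01;
pub const VTAG_INT: u8 = 0x01;
// Exactly 7 bytes; the array type makes an accidental NUL a compile error.
pub const RESULT_MAGIC: &[u8; 7] = b"DSEQR01";

/// Hypothetical helper: encodes Pipeline[IndexScan(col_idx, Eq, Int(v))].
pub fn dump_eq_index_scan_plan(col_idx: u32, v: i64) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(TAG_PIPELINE);
    out.extend_from_slice(&1u32.to_le_bytes()); // child count
    out.push(TAG_INDEX_SCAN);
    out.extend_from_slice(&col_idx.to_le_bytes());
    out.push(OP_EQ);
    out.push(VTAG_INT);
    out.extend_from_slice(&v.to_le_bytes()); // 8 bytes, little-endian
    out
}

/// Hypothetical helper: the magic followed by the u32 LE row count.
pub fn result_header(row_count: u32) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(RESULT_MAGIC);
    out.extend_from_slice(&row_count.to_le_bytes());
    out
}
```

Summing the pieces of the plan dump gives 5 + 5 + 1 + 1 + 8 bytes, and the result header is 7 + 4 bytes before the first row.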
Per-scenario telemetry
scenario A — mixed, 200 ops, 500 rows, 1 index (col=age)
plan-kind distribution (theoretical, from (r3>>60)&3):
EQ ~50 % (kinds 0 and 1 both map to EQ)
range ~25 %
project ~25 %
→ planner emits IndexScan when col == 2 and op != Ne (≈ 1/3 of the EQ and
range ops, so ≈ 25 % of all ops); otherwise SeqScan + Filter.
scenario B — indexheavy, 500 ops, 2000 rows, 2 indexes (col 1 and col 2)
index coverage rises from ~33 % of predicates → ~66 %; the planner picks
IndexScan on the smaller-bucket column ~2 × more often, dropping the
total emitted-row count by an order of magnitude versus scanonly.
Unit test counts
rust 10 tests cargo test --release --quiet
go 11 tests go test ./...
cpp 11 tests ctest --output-on-failure
All three suites cover the same eight planner behaviours plus dump
determinism and SHA-256 known-answer vectors. Test 11 in Go and C++
anchors scenario A's hash directly so any drift fails at go test /
ctest time, not just at cross_test.sh time.
Verification
What we verify
- Single-language correctness. Each language has a test suite that covers the eight planner behaviours (insert layout, index bucket ordering, EQ → IndexScan, range → IndexScan, NE → SeqScan+Filter, projection-only collapse, deterministic row emission order, two-predicate selection of the most selective).
- Determinism within a language. run_workload(cfg) is pure; the same cfg produces identical bytes (test 10 in each suite).
- Cross-language byte identity. cross_test.sh builds all three qplan binaries, runs scenarios A and B, asserts SHA-256 equality across the three outputs and equality with the frozen reference hashes (3918bc6e… and 9313fe69…).
- SHA-256 implementation correctness. Rust and C++ ship their own SHA-256; the empty-string and "abc" known-answer vectors are checked in each unit-test suite. Go uses stdlib crypto/sha256.
How to run
bash scripts/verify.sh # → "=== OK ==="
bash scripts/cross_test.sh # → "=== ALL OK ==="
verify.sh runs cargo test, go test, and ctest in turn under set -euo pipefail,
so the first failing step aborts the script. cross_test.sh exits 1 on the first
mismatch or drift from the frozen golden hash.
Hand-checks before changing wire format
If dump_plan or dump_result is touched intentionally, the workflow is:
- Update both functions in all three languages in the same commit.
- Run cross_test.sh; the outputs across languages must still match.
- Capture the new SHA-256 for scenarios A and B.
- Update the want_hash A / want_hash B lines in cross_test.sh.
- Update the test-11 anchor strings in src/go/idx14_test.go and src/cpp/tests/test_idx14.cc.
- Update the hash table in docs/observation.md and the wire-format section in CONCEPTS.md.
Skipping any step makes a future "did the wire format silently drift?" audit unreliable.
What we deliberately do NOT verify
- Performance. db-22 owns benchmarking; this lab targets correctness only.
- Concurrency. The structures are not thread-safe by design.
- Large inputs. Scenarios A and B are sized so cross_test.sh finishes in well under a second on a laptop; the byte-identity property is size-invariant.
Broader Ideas
Beyond this lab
Cost-based optimisation
The cost model here estimates a single number per predicate. A real optimiser must:
- Estimate cardinality and CPU/IO cost per operator.
- Compose costs across operators (a Filter after an IndexScan costs scan-rows × predicate-eval-cost).
- Handle join ordering (System R / Selinger DP).
- Handle correlated predicates (x = 1 AND y = 1 where x and y are correlated; uniformity plus independence is the standard wrong assumption).
See Leis et al. 2015 — bad cardinality estimates dominate bad cost models.
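A small worked example shows how far the independence assumption can miss. The sketch below builds a synthetic table in which y always equals x, then compares the independence-based estimate for x = 1 AND y = 1 with the true count. The table contents and the helper names `count_where` and `independence_estimate` are made up for illustration.

```rust
/// Counts rows of a two-column table that satisfy `pred`.
pub fn count_where(table: &[(i64, i64)], pred: impl Fn(&(i64, i64)) -> bool) -> u64 {
    table.iter().filter(|row| pred(*row)).count() as u64
}

/// Independence assumption: sel(x AND y) = sel(x) * sel(y), so the estimated
/// row count is rows * (mx / rows) * (my / rows) = mx * my / rows.
pub fn independence_estimate(rows: u64, matches_x: u64, matches_y: u64) -> u64 {
    if rows == 0 {
        return 0;
    }
    matches_x * matches_y / rows
}

/// Synthetic data (assumption): 1000 rows where y is a copy of x, so the two
/// columns are perfectly correlated.
pub fn correlated_table() -> Vec<(i64, i64)> {
    (0..1000).map(|i| (i % 10, i % 10)).collect()
}
```

On this table each single predicate matches 100 rows, so independence predicts 100 * 100 / 1000 = 10 rows, while the conjunction actually matches all 100. A tenfold underestimate on one table is mild; across several joins such errors multiply.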
Cascades / Volcano-style search
Move from a single-pass rule-based planner to a search over plan-space:
- Represent equivalent plan trees with a memo table.
- Apply transformation rules (push filter below scan, merge filters).
- Score each candidate; pick lowest-cost.
- This is what SQL Server, CockroachDB, Apache Calcite do.
Index variants
- Hash index — O(1) point lookup, no range.
- Bitmap index — efficient AND/OR of many low-cardinality predicates; great for analytics, bad for OLTP.
- Covering index — include extra columns to skip the heap read entirely (index-only scan).
- Partial index: WHERE x > 100 predicate baked into the index, smaller but only usable when the predicate is implied by the query.
- Functional index: index on lower(name) rather than name.
- Multi-column index: order matters; left-most prefix rule.
Statistics
A real optimiser maintains:
- Histograms (equi-depth, equi-width, or compressed) per column for range selectivity.
- NDV (number of distinct values) per column for equality selectivity.
- Correlation metrics between column pairs.
- MCVs (most common values) for skewed distributions.
These need to be refreshed periodically: ANALYZE in PostgreSQL; in SQLite,
ANALYZE populates the sqlite_stat1 table.
Adaptive query execution
- Spark / Snowflake re-plan at operator boundaries based on observed row counts.
- PostgreSQL has parallel-plan re-decisions; Vertica re-optimises mid-query.
Hardware angles
- Pointer chasing through an in-memory B-tree is bound by L2 cache misses. Cache-oblivious B-trees and trie-based indexes (ART, HOT) reduce that.
- SSDs make random index reads competitive with a sequential scan up to surprisingly high selectivity (~10 % of the table).
- GPUs and vector instructions favour columnar storage + vectorised scans over row-at-a-time indexing.
What I'd build next
- Add a second index type, a hash index, and let the planner compare estimates across index families (O(1) hash beats O(log N) B-tree on pure equality, ties broken by index size).
- Add a nestedloop_join operator and extend the cost model so the planner picks the outer vs inner side.
- Add a tiny ANALYZE step that snapshots distinct_keys and lets Planner::estimate consult cached stats instead of walking the index each call.
Step 01 — Secondary Index
Goal
Build the Table structure: rows, schema, and a BTreeMap-equivalent
secondary index per indexed column.
Why
A secondary index is the smallest, most universal unit of query optimisation. Once you can map column-value → row-ids in sorted order, equality and range predicates become dictionary operations, and the planner has a real choice to make.
What to do
- Pick a tagged-union Value type: Int(i64) or Text(bytes). Implement a frozen total order: Int < Text across kinds; natural order within each kind. This is the byte-identity contract.
- Define Column { name, type } and Row = Vec<Value>.
- Implement Table::new(schema), Table::insert(row), Table::create_index(col_idx). The index is keyed by Value and maps to an ascending Vec<row_id>.
- Decide the physical form per language:
  - Rust: BTreeMap<Value, Vec<u32>>.
  - C++: std::map<Value, std::vector<std::uint32_t>>.
  - Go: sorted slice of (Value, []uint32) + binary search. Never iterate a Go map for cross-language tests; the order is randomised.
- insert must update every existing index before moving the row. create_index must iterate rows in order (so each bucket's row-id list is naturally ascending without per-bucket sort).
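The steps above can be sketched in Rust as follows. This is a reduced illustration, not the lab's reference code: the schema argument is omitted, `lookup_eq` is a hypothetical accessor, and the derived `Ord` on `Value` is used to get the Int < Text rule from the variant declaration order.

```rust
use std::collections::BTreeMap;

/// Derived ordering compares variants in declaration order (Int < Text),
/// then the payload: numeric for Int, byte-wise lexicographic for Text.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Value {
    Int(i64),
    Text(Vec<u8>),
}

pub type Row = Vec<Value>;

pub struct Table {
    rows: Vec<Row>,
    // col_idx -> (key -> ascending row ids)
    indexes: BTreeMap<usize, BTreeMap<Value, Vec<u32>>>,
}

impl Table {
    /// Simplification (assumption): no schema argument in this sketch.
    pub fn new() -> Self {
        Table { rows: Vec::new(), indexes: BTreeMap::new() }
    }

    /// Updates every existing index, then moves the row into storage.
    pub fn insert(&mut self, row: Row) -> u32 {
        let row_id = self.rows.len() as u32;
        for (col_idx, index) in self.indexes.iter_mut() {
            // Clone the key before the row is moved below.
            index.entry(row[*col_idx].clone()).or_default().push(row_id);
        }
        self.rows.push(row);
        row_id
    }

    /// Walks rows in row-id order, so each bucket ends up ascending.
    pub fn create_index(&mut self, col_idx: usize) {
        let mut index: BTreeMap<Value, Vec<u32>> = BTreeMap::new();
        for (row_id, row) in self.rows.iter().enumerate() {
            index.entry(row[col_idx].clone()).or_default().push(row_id as u32);
        }
        self.indexes.insert(col_idx, index);
    }

    /// Hypothetical accessor: row ids for an equality match, or empty.
    pub fn lookup_eq(&self, col_idx: usize, key: &Value) -> &[u32] {
        self.indexes
            .get(&col_idx)
            .and_then(|index| index.get(key))
            .map(|ids| ids.as_slice())
            .unwrap_or(&[])
    }
}
```

Rows inserted both before and after `create_index` land in the same bucket in ascending row-id order, which is the property the acceptance checks below rely on.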
Acceptance
- Inserting the same rows in the same order produces the same index contents byte-for-byte across languages.
- An EQ lookup on an indexed column returns rows in ascending row-id order.
- A range lookup walks the index in ascending key, ascending row-id within each bucket.
Common pitfalls
- Storing the wrong column in the bucket (off-by-one on col_idx).
- Copying the Value lazily and then losing the bytes when the row moves; clone before insert.
- Letting Go's range over map leak into any code path that touches the index; every iteration must be over the sorted slice.
Step 02 — Rule-Based Planner + Executor
Goal
Turn a Query { projections, predicates } into a Plan { Pipeline[scan, *Filter, Project?] }, then execute it.
Why
Once a table has indexes, every predicate has at least two physical implementations: scan the whole heap and filter, or jump into the index. Picking the right one is what a planner does. Even a 10-line rule-based planner outperforms "always SeqScan" by orders of magnitude on selective queries, and it teaches the same vocabulary (selectivity, access path, predicate pushdown) you need for cost-based work later.
What to do
- Estimate per predicate (return None if not indexable):
  - Op::Ne → not indexable.
  - No index on the column → not indexable.
  - Op::Eq → rows / distinct_keys.
  - Op::Lt / Le / Gt / Ge → (rows + 2) / 3.
- Pick the lowest estimate (strict <, so earlier predicate wins ties; determinism matters).
- Build the Plan:
  - First child: IndexScan{col, op, val} if a predicate was chosen, else SeqScan{table_id: 0}.
  - Remaining predicates become Filter nodes in input order.
  - If projections is non-empty, append Project{cols}; sort and dedupe the column list so the wire format is canonical.
- Execute the Plan (Volcano-style, but materialise at each operator for simplicity):
  - Scan emits all matching rows in (key, row-id) order.
  - Filter retains rows where eval_pred(row, predicate) is true.
  - Project rewrites each row to the requested column subset.
- IndexScan for each op:
  - Eq → fetch the single bucket (or empty).
  - Lt → iterate range(..val).
  - Le → iterate range(..=val).
  - Gt → iterate the keys strictly above val, i.e. an excluded lower bound (range(val+ε..) is only shorthand, since Text keys have no val+ε).
  - Ge → iterate range(val..).
  - Within each bucket, emit row-ids in ascending order.
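The per-op bounds map directly onto `std::ops::Bound` in Rust, which avoids any need for a "val + ε" value. The sketch below uses `i64` keys instead of the lab's `Value` type to stay short, and the function name `index_scan` is an assumption.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included, Unbounded};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op { Eq, Lt, Le, Gt, Ge }

/// Returns row ids in (key, row-id) order for an indexable predicate.
/// Simplification (assumption): keys are plain i64 rather than `Value`.
pub fn index_scan(index: &BTreeMap<i64, Vec<u32>>, op: Op, val: i64) -> Vec<u32> {
    let bounds = match op {
        Op::Eq => (Included(val), Included(val)),
        Op::Lt => (Unbounded, Excluded(val)),
        Op::Le => (Unbounded, Included(val)),
        // Strictly greater: an excluded lower bound, no "val + epsilon" needed.
        Op::Gt => (Excluded(val), Unbounded),
        Op::Ge => (Included(val), Unbounded),
    };
    index
        .range(bounds)
        // Buckets are already ascending, so flattening preserves the order.
        .flat_map(|(_, ids)| ids.iter().copied())
        .collect()
}
```

Note that the output is ordered by key first, so a range scan can return row ids that are not globally ascending (for example `[3, 0, 4]`), which is the order the wire format expects.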
Acceptance
- Given an indexed column and an Op::Eq predicate, the planner emits IndexScan, and the executor returns the matching rows in row-id order.
- Given a non-indexable predicate (Op::Ne, or no index), the planner emits SeqScan and a Filter.
- Given two predicates, the planner picks the one with the lower estimate; the other becomes a Filter.
- Projections with duplicates (e.g. [2, 0, 2]) end up as [0, 2].
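The acceptance cases above translate into a small plan-assembly function. In this sketch the `Pred` and `Node` shapes are simplified stand-ins (an `i64` value and a raw `u8` op code), and `build_plan` takes the already-chosen predicate position as input; all of these names are assumptions rather than the lab's API.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Pred {
    pub col: u32,
    pub op: u8,   // op code, e.g. 1 = Eq
    pub val: i64, // simplification (assumption): Int values only
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Node {
    SeqScan { table_id: u32 },
    IndexScan(Pred),
    Filter(Pred),
    Project { cols: Vec<u32> },
}

/// Assembles the flat pipeline: one scan, the non-chosen predicates as
/// Filters in input order, then an optional canonicalised Project.
pub fn build_plan(preds: &[Pred], chosen: Option<usize>, projections: &[u32]) -> Vec<Node> {
    let mut plan = Vec::new();
    match chosen {
        Some(i) => plan.push(Node::IndexScan(preds[i].clone())),
        None => plan.push(Node::SeqScan { table_id: 0 }),
    }
    for (i, pred) in preds.iter().enumerate() {
        if Some(i) != chosen {
            plan.push(Node::Filter(pred.clone()));
        }
    }
    if !projections.is_empty() {
        // Work on a copy so the caller's query is left untouched.
        let mut cols = projections.to_vec();
        cols.sort_unstable();
        cols.dedup();
        plan.push(Node::Project { cols });
    }
    plan
}
```

Copying the projection list before sorting is what keeps the input query unchanged, which matters if the same query object is planned more than once.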
Common pitfalls
- Forgetting to clone the predicate value when moving it into IndexScan; both the chosen and discarded predicates need a copy.
- Using <= instead of < for tie-breaking. Both are deterministic, but <= lets the later predicate win a tie, which contradicts the frozen "earlier wins" rule and changes the plan bytes.
- Ending an Le (<=) scan at the first key greater than or equal to val instead of the first key strictly greater than val, which drops the val bucket (off-by-one on bounds).
- Mutating the input Query.projections instead of cloning before sort/dedup.
Step 03 — Cross-Language Byte Identity
Goal
Make Rust, Go, and C++ produce byte-identical dump_plan(plan) ++ dump_result(rows) streams for the same workload, and verify it with
SHA-256.
Why
If three implementations of the same spec produce the same bytes on a randomly-seeded workload, they agree on every observable behaviour — plan choice, operator order, row emission order, value encoding. A single divergent byte is the difference between "we have a spec" and "we have three programs that happen to look similar".
What to do
- Freeze the wire format in CONCEPTS.md section 9. Plan tags 0x01..0x05, op codes 0x01..0x06, val tags 0x01..0x02, result magic "DSEQR01" (7 bytes, no terminator).
- Implement dump_plan / dump_result in each language. Use only little-endian primitives, never platform-native byte order. C++: std::memcpy of to_le_bytes-equivalent expressions; never reinterpret int*. Go: binary.LittleEndian.PutUint32 / PutUint64. Rust: to_le_bytes().
- Implement RunWorkload identically:
  - SplitMix64(seed) with the canonical constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB.
  - For each row i in 0..rows: draw r1, r2; insert (IntV(i), Text("n" + (r1 % 1000)), IntV(r2 % 100)).
  - After insertion, apply the scenario's indexes (none / col 2 / cols 2+1).
  - For each op: draw r3, r4, r5; derive kind = (r3 >> 60) & 3, col = r4 % 3. Build the query per kind (0/1 → EQ, 2 → range with op = ((r5>>1)&1) ? Lt : Gt, 3 → projection-only).
  - Plan, execute, append dump_plan ++ dump_result to the rolling output.
- CLI: qplan workload --seed S --ops N --rows R --scenario X prints sha256_hex of the rolling output with no trailing newline.
- Compare: scripts/cross_test.sh runs both scenarios across all three binaries and asserts the three hashes match each other and the frozen golden hashes.
Acceptance
- scripts/verify.sh ends with === OK === (unit tests pass in all three languages).
- scripts/cross_test.sh ends with === ALL OK === (cross-language bytes match; golden hashes match).
- Anchor tests (test 11 in Go and C++) verify scenario A's SHA-256 at unit-test time, so drift is caught even without running cross_test.sh.
Common pitfalls
- Trailing newline from println! / fmt.Println / std::cout << std::endl will change the binary's stdout. Use write_all / Write / fwrite and flush.
- Magic length. Writing "DSEQR01\0" (8 bytes) instead of 7 makes every op boundary off by one. The byte walkthrough in docs/observation.md is the canonical reference if in doubt.
- Map iteration order in Go. Use sorted slices for any structure whose iteration order ends up in the wire bytes.
- Explicit discriminants and #[repr(u8)] on Rust enums. op as u8 equals the constants 1..6 only if the enum declares Eq = 1 and so on; without explicit discriminants the numbering starts at 0. #[repr(u8)] additionally pins the in-memory size to one byte, which matters if the enum is ever written out as raw bytes.
- bool packing. Some C++ standard-library std::vector<bool> paths are surprising; never put a bool in the wire format, promote to std::uint8_t.
- SHA-256 final byte ordering. The output is big-endian per word; hex-encoding mistakes swap nibbles. The empty-string known answer (e3b0c442...) catches this immediately.
db-15 — Sqlite-shaped engine, end-to-end
Where this sits
This lab is the capstone for the SQLite-style track. Earlier labs (db-10 .. db-14) build the parts in isolation: B-tree (db-10), pager (db-11), SQL frontend (db-12), MVCC transactions (db-13), indexes (db-14). Here we fuse a deliberately small slice of all of them into one engine and prove the slice is reproducible across Rust, Go, and C++ down to the byte.
The goal is not feature parity with real SQLite — that would dwarf the lab. The goal is to exhibit, in code small enough to keep in your head, the join between:
- a primary index keyed by integer,
- a secondary index keyed by text,
- an MVCC tombstone scheme governed by a monotonic transaction id,
- a deterministic snapshot wire format that any of the three reference implementations can produce identically.
Data model
A single table:
kv(k INT primary key, v INT, tag TEXT)
Physical row:
Row { v: i64, tag: String, created_at: u64, deleted_at: u64 }
deleted_at == 0 means the row is live; anything else is the txid at
which it was tombstoned. Tombstoned rows stay in the primary map (they
appear in the snapshot dump so a verifier can audit historical state)
but they disappear from the secondary index immediately on delete.
In-memory layout:
- primary: ordered map<i64 -> Row>, sorted by key. Holds tombstones.
- secondary: ordered map<String -> sorted Vec<i64>>, live rows only. Each list is kept strictly ascending.
A single monotonically-increasing next_txid (starts at 1) governs
visibility. Read-only SELECT never bumps it. Write ops bump only
when they actually mutate state.
SQL surface
Only four ops, deliberately:
| op | semantics | txid bump? |
|---|---|---|
| INSERT(k, v, tag) | UPSERT: replaces an existing row (even a tombstoned one) with a fresh row whose created_at is the new txid. | always |
| UPDATE(k, v, tag) | live-only. If the row is missing or tombstoned, returns false and does not bump txid. Keeps original created_at. Maintains the secondary index across tag changes. | only if work was done |
| DELETE(k) | live-only. Marks deleted_at = txid, removes the row from the secondary index. | only if work was done |
| SELECT_BY_K(k) / SELECT_BY_TAG(tag) | read-only. | never |
The semantic gotcha for cross-language identity is the no-op rule on
UPDATE and DELETE. If any implementation bumps txid on a missing
key, every subsequent created_at / deleted_at will drift and the
snapshot diverges.
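The txid rules in the table can be made concrete with a small in-memory engine. This is a sketch under stated assumptions, not the lab's reference code: the type and method names are illustrative, and dropping a tag from the secondary map once its key list is empty is a guess at the intended behaviour.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Row {
    pub v: i64,
    pub tag: String,
    pub created_at: u64,
    pub deleted_at: u64, // 0 means live
}

pub struct Engine {
    pub primary: BTreeMap<i64, Row>,           // holds tombstones
    pub secondary: BTreeMap<String, Vec<i64>>, // live rows only
    pub next_txid: u64,
}

impl Engine {
    pub fn new() -> Self {
        Engine { primary: BTreeMap::new(), secondary: BTreeMap::new(), next_txid: 1 }
    }

    fn live_tag(&self, k: i64) -> Option<String> {
        self.primary.get(&k).filter(|r| r.deleted_at == 0).map(|r| r.tag.clone())
    }

    fn link(&mut self, tag: &str, k: i64) {
        let keys = self.secondary.entry(tag.to_string()).or_default();
        if let Err(pos) = keys.binary_search(&k) {
            keys.insert(pos, k); // keeps the list strictly ascending
        }
    }

    // Assumption: a tag whose key list becomes empty is removed entirely.
    fn unlink(&mut self, tag: &str, k: i64) {
        let now_empty = match self.secondary.get_mut(tag) {
            Some(keys) => {
                keys.retain(|&x| x != k);
                keys.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.secondary.remove(tag);
        }
    }

    /// UPSERT: always consumes a txid.
    pub fn insert(&mut self, k: i64, v: i64, tag: &str) {
        let txid = self.next_txid;
        self.next_txid += 1;
        if let Some(old) = self.live_tag(k) {
            self.unlink(&old, k);
        }
        self.primary.insert(k, Row { v, tag: tag.to_string(), created_at: txid, deleted_at: 0 });
        self.link(tag, k);
    }

    /// Live-only: a missing or tombstoned key returns false without a bump.
    pub fn update(&mut self, k: i64, v: i64, tag: &str) -> bool {
        let old = match self.live_tag(k) {
            Some(t) => t,
            None => return false,
        };
        self.next_txid += 1;
        if let Some(row) = self.primary.get_mut(&k) {
            row.v = v;
            row.tag = tag.to_string(); // created_at is kept
        }
        if old != tag {
            self.unlink(&old, k);
            self.link(tag, k);
        }
        true
    }

    /// Live-only: stamps deleted_at with the consumed txid.
    pub fn delete(&mut self, k: i64) -> bool {
        let old = match self.live_tag(k) {
            Some(t) => t,
            None => return false,
        };
        let txid = self.next_txid;
        self.next_txid += 1;
        if let Some(row) = self.primary.get_mut(&k) {
            row.deleted_at = txid;
        }
        self.unlink(&old, k);
        true
    }
}
```

The important detail is that both `update` and `delete` return before touching `next_txid` when the key is missing or tombstoned; moving the increment above that check is exactly the drift described above.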
Snapshot wire format
Magic = "DSESQL15" (8 bytes, ASCII).
magic[8] || next_txid:u64 LE || primary_row_count:u32 LE
for each k in ascending order:
k:i64 LE
v:i64 LE
tag_len:u32 LE
tag_bytes
created_at:u64 LE
deleted_at:u64 LE
secondary_distinct_keys:u32 LE
for each tag in ascending lexicographic order:
tag_len:u32 LE
tag_bytes
key_count:u32 LE
for each key in ascending order: i64 LE
Three properties this format is built for:
- Total order at every level. Both the primary and secondary sections iterate in a sort order that is well-defined regardless of the host hash map (a real bug we hit in early Go drafts: map iteration is randomised, so a for k, v := range without an explicit sort produces a different byte stream on every run).
- Tombstones are observable. Including tombstones in the primary dump means the snapshot reflects the visibility scheme, not just the live set, which is useful when comparing two implementations' MVCC behaviour.
- Self-delimiting. Every variable-length string is preceded by its length, so a parser does not have to guess.
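A serialiser for this layout is mostly a matter of walking two ordered maps. The following sketch follows the field order given above; the `Row` struct mirrors the physical row from the data model, while `dump_snapshot` and `put_str` are assumed names rather than the lab's actual functions.

```rust
use std::collections::BTreeMap;

pub struct Row {
    pub v: i64,
    pub tag: String,
    pub created_at: u64,
    pub deleted_at: u64,
}

/// Length-prefixed string: u32 LE byte length, then the raw bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Hypothetical serialiser for the snapshot layout described above.
pub fn dump_snapshot(
    next_txid: u64,
    primary: &BTreeMap<i64, Row>,
    secondary: &BTreeMap<String, Vec<i64>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSESQL15"); // 8-byte magic
    out.extend_from_slice(&next_txid.to_le_bytes());
    out.extend_from_slice(&(primary.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, tombstones included.
    for (k, row) in primary {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&row.v.to_le_bytes());
        put_str(&mut out, &row.tag);
        out.extend_from_slice(&row.created_at.to_le_bytes());
        out.extend_from_slice(&row.deleted_at.to_le_bytes());
    }
    out.extend_from_slice(&(secondary.len() as u32).to_le_bytes());
    // String keys compare byte-wise, so "t10" sorts before "t2".
    for (tag, keys) in secondary {
        put_str(&mut out, tag);
        out.extend_from_slice(&(keys.len() as u32).to_le_bytes());
        for key in keys {
            out.extend_from_slice(&key.to_le_bytes());
        }
    }
    out
}
```

In Rust the ordering comes for free from `BTreeMap`; a Go port would need to collect and sort the keys at the two loops, and a C++ port gets the same order from `std::map` with the default byte-wise `std::string` comparison.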
Deterministic workload
run_workload(seed, ops, keys, scenario) is the only entry point used
in cross-language testing. It draws three 64-bit words per op from a
splitmix64 seeded with seed:
r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 60) & 0x7
k = (i64) (r2 % keys)
v = (i64) (r3 % 10_000)
tag = "t" + ((r3 >> 32) % 16)
match kind {
0,1,2 => INSERT(k, v, tag) // 3/8 of ops
3,4 => UPDATE(k, v, tag) // 2/8
5 => DELETE(k) // 1/8
6 => SELECT_BY_K(k) // 1/8
7 => SELECT_BY_TAG(tag) // 1/8
}
Two non-obvious rules:
- Reads still draw all three rng words. Even though SELECT_BY_K only needs k, it still draws r3. Skipping the draw would shift the rng stream for every subsequent op and break byte identity with the other implementations.
- tag = "t" + decimal(n). Decimal string formatting, not hex; trivially easy to get wrong in C++, where std::ostringstream << std::hex is the default reflex.
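The decoding rules can be written as a pure function of the three drawn words, which makes the "always draw three words" rule hard to violate by accident. The `Op` enum and the `decode` function below are illustrative names, not the lab's API.

```rust
/// Hypothetical decoded form of one db-15 workload op.
#[derive(Debug, PartialEq, Eq)]
pub enum Op {
    Insert { k: i64, v: i64, tag: String },
    Update { k: i64, v: i64, tag: String },
    Delete { k: i64 },
    SelectByK { k: i64 },
    SelectByTag { tag: String },
}

/// Decodes one op from three already-drawn words. The caller draws all three
/// words for every op, even when the resulting op ignores some of them.
pub fn decode(r1: u64, r2: u64, r3: u64, keys: u64) -> Op {
    let kind = (r1 >> 60) & 0x7;
    let k = (r2 % keys) as i64; // unsigned modulo first, then the cast
    let v = (r3 % 10_000) as i64;
    let tag = format!("t{}", (r3 >> 32) % 16); // base-10, no padding
    match kind {
        0..=2 => Op::Insert { k, v, tag },
        3 | 4 => Op::Update { k, v, tag },
        5 => Op::Delete { k },
        6 => Op::SelectByK { k },
        _ => Op::SelectByTag { tag },
    }
}
```

Keeping the rng draws in the caller and the decoding in a pure function also makes the mapping easy to unit-test with hand-picked words, independently of the generator.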
Frozen golden hashes
Captured from the Rust release build. The cross-language test asserts these byte for byte.
| Scenario | CLI args | SHA-256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab |
| B | --seed 7 --ops 2000 --keys 128 --scenario default | dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69 |
Sources of cross-language divergence
A non-exhaustive checklist that we hit while building the three ports:
- Map iteration order. Go map iteration is randomised. Always collect keys then sort.Strings / sort.Slice before any side-effecting iteration that contributes to the wire stream.
- Signed vs unsigned k. r2 % keys is unsigned modular arithmetic in all three languages; we then cast to i64. A cast through int on 32-bit platforms would lose bits. C++ uses static_cast<int64_t>, Rust as i64, Go int64(...).
- Tag formatting. Use base-10 only. Padding, hex, or uppercase would all change the bytes silently.
- Splitmix64 constants. All three implementations use the same triple: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB. In C++, keep the ULL suffix and store them in std::uint64_t; a constant that passes through a 32-bit type is truncated and produces a different stream.
- SHA-256. Rust uses sha2, Go uses crypto/sha256, C++ ships an inline reference implementation in src/cpp/src/sql15.cc. A canonical test vector (SHA256("abc")) is asserted in every test suite to catch a broken implementation before it pollutes a scenario hash.
- No trailing newline from the CLI. The shell-level test compares "$RUST_BIN ..." with "$GO_BIN ..." as raw strings; an extra \n from one of them breaks the equality check. Rust uses print!, Go uses fmt.Print, C++ uses std::cout << ... with no << endl.
What this lab does not model
Listed up front so the reader does not look for them:
- No on-disk persistence, no WAL, no pager. The "snapshot" is an in-memory byte stream produced on demand.
- No concurrent transactions. MVCC visibility rules are implemented, but there is only one writer.
- No query planner; SELECT_BY_K and SELECT_BY_TAG are direct map lookups.
- No DDL. The schema is hard-coded.
Those are deliberately deferred to db-21 and the capstone (db-23).
References
Books
- Sippu, S., & Soisalon-Soininen, E. (2015). Transaction Processing: Management of the Logical Database and its Underlying Physical Structure. Springer. Chapter 6 ("Logical Database Updates") gives the cleanest treatment of the no-op-update / no-op-delete rule that governs txid allocation here.
- Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley. Chapter 5 on multiversion concurrency control is the source of the "tombstone with deleted-at txid" representation we use.
- Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. Foundations and Trends in Databases, 1(2). Provides the layering vocabulary (storage manager, access methods, query processor) we slice through here.
Papers
- Reed, D. P. (1978). Naming and Synchronization in a Decentralized Computer System. MIT/LCS/TR-205. The original MVCC paper.
- Bernstein, P. A., & Goodman, N. (1981). Concurrency Control in Distributed Database Systems. ACM Computing Surveys 13(2). Lays out the timestamp-ordering protocols that motivate our monotonic txid.
Source documentation
- SQLite VDBE specification — https://sqlite.org/opcode.html. We do not implement a VDBE in db-15, but the opcode list is the canonical decomposition of the four operations this lab supports.
- SQLite file format — https://sqlite.org/fileformat.html. The page-level layout we do not model here. Useful contrast for the wire format in CONCEPTS.md.
- Standard splitmix64 reference — https://prng.di.unimi.it/splitmix64.c. All three ports use these constants verbatim.
Cross-language byte-identity practice
- Google's protobuf canonical encoding spec — https://protobuf.dev/programming-guides/encoding/. The discipline of sorting map entries before serialisation comes from there.
- CBOR deterministic encoding (RFC 8949 §4.2). Same idea applied to a different format. Useful background for why we sort the secondary index lexicographically rather than by insertion order.
Earlier labs in this workspace
- db-10-btree-fundamentals/CONCEPTS.md
- db-11-pager-system/CONCEPTS.md
- db-12-sql-frontend/CONCEPTS.md
- db-13-transactions-and-mvcc/CONCEPTS.md
- db-14-indexes-and-query-optimization/CONCEPTS.md
analysis
The shape of the problem
We want the smallest engine that still demonstrably integrates the five things the SQLite-track labs build separately: a primary keyed container, a secondary index, an MVCC visibility scheme, a SQL surface, and a reproducible on-the-wire snapshot. "Smallest" here means: any feature we cut must be a feature that other labs already cover or labs after this will cover (db-21, db-23).
Three forces pull on the design:
- It has to be correct in three languages at once. Cross-language byte identity is the cheap, mechanical proof that the implementations agree. Anything that varies between language runtimes (hash map ordering, string formatting, integer width, signedness on casts) becomes a hazard.
- It has to be small enough to keep in your head. The whole engine is ~400 lines per language. That budget forced us to drop the pager, the on-disk format, and any kind of query planner.
- It has to actually test the integration. A no-op UPDATE that silently bumps the txid would not be caught by the unit tests in any one language; only the cross-language hash comparison would expose it.
Why MVCC over locking
A locking implementation would have been smaller, but it would not
have produced a visible artefact for the snapshot. With MVCC we have
the row-level created_at / deleted_at pair as observable state, and
the snapshot dump can carry it. That gives us something to compare.
Why a secondary index
Without one, the snapshot would be just a sorted map dump and the cross-language test would degenerate into "do all three languages sort ints the same way" (trivially yes). The secondary forces us to also sort strings deterministically, which is where Go's randomised map iteration would otherwise bite.
Where the test surface actually catches bugs
A pleasant surprise: most of the time the unit tests in any one language pass and only the cross-language script fails. That is diagnostic in itself — it almost always points at either:
- a missing sort.Strings / sort.Slice in Go,
- a static_cast<int> instead of static_cast<int64_t> in C++,
- a 0x9E3779B97F4A7C15 constant in C++ that ends up in a 32-bit type and is truncated (the compiler warns about it, but the warning is buried in a thousand-line build log).
The two frozen scenarios are deliberately sized:
- Scenario A (--ops 500 --keys 32): small enough to debug by re-running with a smaller op count and diffing the intermediate snapshots.
- Scenario B (--ops 2000 --keys 128): large enough to thrash the secondary index and the tombstone code path.
execution
Order of operations we actually used
- Pick the reference implementation. Rust first, because the type system catches the easiest mistakes (signed/unsigned, missing match arms) at compile time. Once 13 unit tests pass in Rust, freeze the golden hashes from the release build.
- Port to Go. Mirror the structure 1:1. The only language-shaped differences are: an explicit sort.Slice everywhere a Rust BTreeMap iteration is implicit, and fmt.Sprintf("t%d", n) in place of Rust's format!("t{}", n).
- Port to C++. Same structure again. Use std::map instead of std::unordered_map so iteration is sorted-by-key for free. Use std::ostringstream for the tag and keep it on the classic "C" locale; a stream that inherits a non-default global locale can apply locale-specific number formatting.
- Write the cross-language script last. Build all three CLI binaries, run both scenarios, assert pairwise equality and equality to the goldens.
Running it
$ cd db-15-sqlite-complete
$ bash scripts/verify.sh
=== Rust ===
... test result: ok. 13 passed; 0 failed
=== Go ===
ok github.com/10xdev/dse/db15 ...
=== C++ ===
OK 13 tests
=== OK ===
$ bash scripts/cross_test.sh
=== scenario A: --seed 42 --ops 500 --keys 32 ===
rust=e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
go =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
cpp =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
=== scenario B: --seed 7 --ops 2000 --keys 128 ===
rust=dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
go =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
cpp =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
=== ALL OK ===
Things that went wrong in development
- First Go run produced a different hash for scenario A. Cause: ranging directly over c.secondary instead of collecting keys and calling sort.Strings. The fix is in src/go/sql15.go; see DumpSnapshot.
- First C++ run also diverged. Cause: std::unordered_map instead of std::map. Same fix shape: switch container, or sort keys before iteration. We chose std::map for symmetry with Rust's BTreeMap.
- A test asserted SHA256("abc") and failed. Typo in the expected hex string (extra 3, missing trailing d). The canonical value is ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad. Worth pinning a known SHA-256 vector in every cross-language lab.
observation
What we measured
For each implementation, on every commit:
- All unit tests pass under release optimisation. (Release-only bugs are real; assert(side_effect) under -DNDEBUG is the classic, because the side effect disappears with the assert.)
- The CLI binary produces both golden hashes.
- The cross-language script produces the same hash from all three binaries.
What the bytes look like
The snapshot from scenario A is 7088 bytes. Roughly:
- 8 bytes magic
- 8 bytes next_txid (at most 501 for scenario A; some ops are no-ops)
- 4 bytes primary row count (≈ 28 of 32 possible keys are touched)
- Per row: 8 + 8 + 4 + len(tag) + 8 + 8 = 36 + len(tag) bytes
- 4 bytes secondary distinct tag count
- Per (tag, keys): 4 + len(tag) + 4 + 8*key_count bytes
The largest single section is the primary; the secondary is small because the tag alphabet is fixed at 16.
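The section sizes above can be sketched as a serializer. This is only an illustration under stated assumptions: the magic value is a placeholder, the field order inside a row (k, v, tag length, tag, created_at, deleted_at) and little-endian integers are guesses that happen to match the byte counts listed, and the authoritative layout is the one in CONCEPTS.md.

```rust
use std::collections::BTreeMap;

// Hypothetical row type: names follow the lab text, widths are assumed.
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live
}

// Placeholder magic, NOT the lab's real value.
const MAGIC: [u8; 8] = *b"SNAPSHOT";

fn dump_snapshot(
    next_txid: u64,
    primary: &BTreeMap<i64, Row>,
    secondary: &BTreeMap<String, Vec<i64>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&next_txid.to_le_bytes());
    out.extend_from_slice(&(primary.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, so no explicit sort is needed.
    for (k, row) in primary {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&row.v.to_le_bytes());
        out.extend_from_slice(&(row.tag.len() as u32).to_le_bytes());
        out.extend_from_slice(row.tag.as_bytes());
        out.extend_from_slice(&row.created_at.to_le_bytes());
        out.extend_from_slice(&row.deleted_at.to_le_bytes());
    }
    out.extend_from_slice(&(secondary.len() as u32).to_le_bytes());
    for (tag, keys) in secondary {
        out.extend_from_slice(&(tag.len() as u32).to_le_bytes());
        out.extend_from_slice(tag.as_bytes());
        out.extend_from_slice(&(keys.len() as u32).to_le_bytes());
        for k in keys {
            out.extend_from_slice(&k.to_le_bytes());
        }
    }
    out
}

fn main() {
    let mut primary = BTreeMap::new();
    primary.insert(7i64, Row { v: 1, tag: "t3".to_string(), created_at: 1, deleted_at: 0 });
    let mut secondary = BTreeMap::new();
    secondary.insert("t3".to_string(), vec![7i64]);
    let bytes = dump_snapshot(2, &primary, &secondary);
    // 20-byte header, one 36 + 2 byte row, 4-byte tag count, one 4 + 2 + 4 + 8 byte entry.
    println!("{} bytes", bytes.len());
}
```

Counting the bytes of a one-row snapshot by hand, as in the comment in main, is a cheap way to confirm a port agrees with the layout before any hashing is involved.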
Visibility of tombstones
Because tombstoned rows stay in the primary, you can read the
snapshot and recover the current visible state by filtering on
deleted_at == 0. That property let us write a test that asserts
the primary row count equals live_count + tombstone_count, which
caught a regression where exec_delete was removing the row from
the primary instead of marking it.
The shape of kind distribution
Across 2000 ops in scenario B, the empirical distribution of kind
matches the design 3:2:1:1:1 ratio within ~3%. That is consistent with
the three high bits taken from each splitmix64 word being uniform
enough that we do not need a rejection sampler.
Non-determinism we did not observe
- No flakes across 50 runs of scenario B.
- No drift between debug and release builds for any implementation.
- No drift between macOS arm64 and Linux x86_64 (sanity-checked once in a throwaway container).
All three of those properties are load-bearing for the cross-language test to be useful: if any of them fails, the script becomes a flaky test and people learn to ignore it.
verification
The verification ladder
- Unit tests inside each language. 13 tests per implementation, covering insert/update/delete semantics, the no-op rule on missing keys, secondary-index maintenance across tag changes, the tombstone-then-reinsert path, the wire format byte layout, and the two frozen scenarios.
- scripts/verify.sh runs all three suites end to end.
- scripts/cross_test.sh builds all three CLI binaries and asserts byte-identical SHA-256 across Rust/Go/C++ for both scenarios and equality with the frozen goldens.
What each layer protects against
| Layer | Catches |
|---|---|
| Unit tests | Wrong semantics within one language: e.g. UPDATE bumping txid when the row was missing. |
| Frozen golden in unit tests | Drift in one language only: e.g. someone "fixes" the splitmix64 constants. |
| Cross-language script | Cross-language drift: e.g. Go iterating a map without sorting. |
| Both goldens | Drift that happens to leave one scenario unchanged. Hitting two seeds at very different op counts is a cheap insurance policy. |
Test vectors we pinned
In every language:
- SHA256("") = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
- SHA256("abc") = ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
- splitmix64(0) = 0x8b57dafca0cee644
If any of those fail, every higher-level test is meaningless, so they run first.
How to debug a cross-language mismatch
If cross_test.sh reports a mismatch:
- Re-run with a smaller --ops (say 10) to find the smallest run that still diverges. SHA-256 only tells you equal or not equal, so you need to dump the actual bytes.
- Add a temporary print of the snapshot's hex before the SHA, in both the suspect language and the reference. Run xxd or od -An -tx1 on the two outputs and diff them.
- The first byte that differs almost always points at a section boundary. Decode the next_txid and primary row count by hand.
- The two most common causes by a wide margin are (a) unsorted map iteration and (b) a narrowed 64-bit constant in C++ (a missing ULL or a too-small variable type). Check those first.
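The "first byte that differs" step can also be automated once both snapshots are in memory. A minimal sketch, assuming the two dumps have already been read into byte slices; first_diff is a hypothetical helper name, not part of the lab's code.

```rust
/// Offset of the first differing byte, or None if the slices are identical.
/// If one slice is a strict prefix of the other, the shorter length is reported.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    for i in 0..common {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}

fn main() {
    let reference = [0x01u8, 0x02, 0x03, 0x04];
    let suspect = [0x01u8, 0x02, 0xFF, 0x04];
    match first_diff(&reference, &suspect) {
        Some(off) => println!("first difference at byte {}", off),
        None => println!("snapshots are byte-identical"),
    }
}
```

Comparing the reported offset against the section sizes listed in the observation notes tells you whether the divergence sits in the header, the primary, or the secondary.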
Coverage gaps we accept
We do not run property-based tests (no proptest in Rust, no
testing/quick in Go). The two seeded scenarios are dense enough
that we have not yet seen a real bug that they missed and proptest
would have caught, and adding proptest would make the test loop slower
and more prone to flakes.
broader ideas
What a "real" SQLite slice would add
If the goal is fidelity rather than pedagogy, the natural next steps, roughly in order of payoff:
- A pager backed by pwrite. This is db-11. Once you have a pager the snapshot becomes the file, not an ad-hoc byte stream.
- Page-level checksums. Even just XXH3 per page; turns silent corruption into noisy corruption.
- A WAL. Append-only journal of operations, replay on open. db-03 does the WAL; the fusion is in db-23.
- Schema and DDL. CREATE TABLE, multiple tables, column types. The single-table assumption hides almost all the catalog complexity.
- A query planner. Even just a cost-based decision between SELECT_BY_K and SELECT_BY_TAG would be educational. With one table and two indexes the planner is trivial; with joins it explodes.
What this engine could become with concurrency
The MVCC bookkeeping is already there — created_at and deleted_at.
What is missing for real read-mostly concurrency:
- A reader-visible snapshot timestamp, so SELECT reads consistently as of "the latest committed txid I saw".
- Write-set tracking and a commit barrier, so two writers cannot both bump txid without serialising.
- Garbage collection of tombstoned rows once no live reader could observe them. The current code holds tombstones forever, which is fine for a benchmark and disastrous for a real system.
The interesting thing is that the snapshot wire format would still work — you would just be dumping a consistent point rather than the literal in-memory state.
What this engine could never be
Without on-disk persistence, this is not a database; it is a test fixture. Adding the pager moves it to "embedded KV with SQL", which is roughly what SQLite is.
It will never be a server. Network protocols, connection management, client-side query plans, authentication — none of that is in scope for any lab in this series, by design.
Useful tangent: cross-language byte identity as a discipline
This lab is a microcosm of a discipline that pays off elsewhere:
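- Protobuf shows the cost of skipping it: its serialization is not canonical, so systems that hash or sign protobuf messages must sign the exact bytes they received rather than a re-encoding.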
- Git's object hashing depends on a canonical layout per object type.
- Bitcoin transactions are SHA-256-d in a canonical byte form.
Whenever you find yourself asking "is this implementation correct", producing a canonical byte stream from each implementation and hashing it is one of the cheapest mechanical proofs you can buy.
step 01 — pager-and-rows
Goal
Build the in-memory primary container. By the end of this step you should have:
- a Row { v, tag, created_at, deleted_at } value type,
- a Conn with next_txid starting at 1 and an ordered primary: i64 -> Row map,
- exec_insert (UPSERT) bumping next_txid every call,
- select_by_k returning live rows only.
Why "pager-and-rows" not just "rows"
Even though we are not implementing a real on-disk pager here, the discipline of treating the primary map as the single source of truth for both visible and tombstoned state mirrors what a pager gives you: a flat, ordered store you can walk in key order.
If you wanted, you could later swap the in-memory std::map for a
B-tree built on top of db-11's pager and not have to change anything
else in this lab.
Tasks
- Define Row and Conn in your chosen language.
- Implement exec_insert with UPSERT semantics. Make sure that inserting at the same key twice replaces the row and uses the new txid in created_at.
- Implement select_by_k. It must filter out tombstoned rows.
- Write a unit test that inserts two keys, selects them back, and asserts next_txid == 3.
Pitfalls
- If you use a hash map (Go map, C++ unordered_map), the wire test in step 03 will fail because iteration order will not be deterministic. Use an ordered map (BTreeMap, sorted-keys iteration, std::map).
- Use i64 for k. i32 will silently truncate when the workload in step 03 mods a u64 by keys and casts.
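The step 01 tasks can be sketched in Rust roughly as follows. The field types (i64 for v, u64 for the txid fields) are assumptions chosen to match the 8-byte widths in the snapshot layout; the lab text only fixes the names and the behaviour.

```rust
use std::collections::BTreeMap;

// Field types are assumptions; the lab text fixes only the names.
#[derive(Clone, Debug)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live, anything else is a tombstone
}

struct Conn {
    next_txid: u64,
    primary: BTreeMap<i64, Row>, // ordered, so iteration is deterministic
}

impl Conn {
    fn new() -> Self {
        Conn { next_txid: 1, primary: BTreeMap::new() }
    }

    /// UPSERT: always consumes a txid and replaces whatever row was there,
    /// including a tombstoned one.
    fn exec_insert(&mut self, k: i64, v: i64, tag: &str) {
        let txid = self.next_txid;
        self.next_txid += 1;
        let row = Row { v, tag: tag.to_string(), created_at: txid, deleted_at: 0 };
        self.primary.insert(k, row);
    }

    /// Live rows only: a tombstoned row is reported as absent.
    fn select_by_k(&self, k: i64) -> Option<&Row> {
        self.primary.get(&k).filter(|r| r.deleted_at == 0)
    }
}

fn main() {
    let mut c = Conn::new();
    c.exec_insert(10, 100, "t1");
    c.exec_insert(20, 200, "t2");
    let a = c.select_by_k(10).expect("key 10 is live");
    let b = c.select_by_k(20).expect("key 20 is live");
    println!("{} {} created_at={} {}", a.v, a.tag, a.created_at, b.v);
    println!("next_txid = {}", c.next_txid);
}
```

Keeping the liveness filter inside select_by_k, rather than in callers, means the tombstone rule added in step 02 needs no changes on the read path.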
step 02 — SQL surface and MVCC
Goal
Add exec_update, exec_delete, select_by_tag, and the secondary
index. By the end of this step:
- exec_update must be a no-op (and not bump next_txid) if the row is missing or tombstoned. If present, it keeps the original created_at and only mutates v and tag.
- exec_delete must be a no-op if the row is missing or tombstoned. If present, it sets deleted_at = next_txid and removes the row from the secondary index. The row stays in the primary.
- The secondary index tag -> sorted Vec<i64> is maintained on every mutating op. Only live rows are present.
- select_by_tag(tag) returns the secondary list, or empty.
Tasks
- Wire exec_update and exec_delete with the no-op-on-missing rule. Test it by calling each on a key that does not exist and asserting next_txid did not move.
- Implement secondary insertion as sorted insert (binary search + shift, or BTreeMap::entry().or_default() + sorted insert).
- Implement secondary removal as sorted lookup + erase. If the list becomes empty, drop the tag entirely (otherwise the snapshot will carry empty entries and diverge from the spec).
- Add a test that inserts three rows with the same tag in scrambled key order, then asserts select_by_tag returns them in ascending order.
- Add a test for the resurrection path: insert, delete, insert again on the same key. The new row must have a fresh created_at and deleted_at == 0.
Pitfalls
- The most common bug is bumping next_txid in exec_update even on a no-op. The unit tests in one language will pass; the cross-language hash will diverge after the first missing-key update.
- Forgetting to drop an empty tag from secondary after the last delete will add a zero-length entry to the snapshot dump and break cross-language byte equality.
- In C++, std::map::operator[] default-constructs missing entries silently. Use find for reads and [] only when you intend the insert.
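A possible Rust shape for the step 02 rules, as a sketch rather than the lab's actual code. It assumes that a successful update or delete consumes one txid (the text only states when they must not), and the helper names sec_add and sec_remove are hypothetical.

```rust
use std::collections::BTreeMap;

struct Row { v: i64, tag: String, created_at: u64, deleted_at: u64 }

struct Conn {
    next_txid: u64,
    primary: BTreeMap<i64, Row>,
    secondary: BTreeMap<String, Vec<i64>>, // tag -> ascending live keys
}

impl Conn {
    fn new() -> Self {
        Conn { next_txid: 1, primary: BTreeMap::new(), secondary: BTreeMap::new() }
    }

    fn sec_add(&mut self, tag: &str, k: i64) {
        let keys = self.secondary.entry(tag.to_string()).or_default();
        if let Err(pos) = keys.binary_search(&k) {
            keys.insert(pos, k); // sorted insert keeps the list ascending
        }
    }

    fn sec_remove(&mut self, tag: &str, k: i64) {
        let mut now_empty = false;
        if let Some(keys) = self.secondary.get_mut(tag) {
            if let Ok(pos) = keys.binary_search(&k) {
                keys.remove(pos);
            }
            now_empty = keys.is_empty();
        }
        if now_empty {
            self.secondary.remove(tag); // never leave an empty tag behind
        }
    }

    fn exec_insert(&mut self, k: i64, v: i64, tag: &str) {
        let txid = self.next_txid;
        self.next_txid += 1;
        let old_tag = self.primary.get(&k).filter(|r| r.deleted_at == 0).map(|r| r.tag.clone());
        if let Some(old) = old_tag {
            self.sec_remove(&old, k);
        }
        self.primary.insert(k, Row { v, tag: tag.to_string(), created_at: txid, deleted_at: 0 });
        self.sec_add(tag, k);
    }

    /// Returns false, and leaves next_txid alone, on a missing or tombstoned row.
    fn exec_update(&mut self, k: i64, v: i64, tag: &str) -> bool {
        let row = match self.primary.get_mut(&k) {
            Some(r) => r,
            None => return false,
        };
        if row.deleted_at != 0 {
            return false;
        }
        row.v = v; // created_at is deliberately untouched
        let old_tag = std::mem::replace(&mut row.tag, tag.to_string());
        self.next_txid += 1; // assumed: a successful update consumes a txid
        self.sec_remove(&old_tag, k);
        self.sec_add(tag, k);
        true
    }

    /// Tombstones the row in place; it stays in the primary.
    fn exec_delete(&mut self, k: i64) -> bool {
        let row = match self.primary.get_mut(&k) {
            Some(r) => r,
            None => return false,
        };
        if row.deleted_at != 0 {
            return false;
        }
        row.deleted_at = self.next_txid;
        let tag = row.tag.clone();
        self.next_txid += 1; // assumed: a successful delete consumes a txid
        self.sec_remove(&tag, k);
        true
    }

    fn select_by_tag(&self, tag: &str) -> Vec<i64> {
        self.secondary.get(tag).cloned().unwrap_or_default()
    }
}

fn main() {
    let mut c = Conn::new();
    for k in [30, 10, 20] {
        c.exec_insert(k, 0, "t5");
    }
    c.exec_delete(20);
    let row = &c.primary[&20];
    println!("{:?} v={} created_at={} deleted_at={}", c.select_by_tag("t5"), row.v, row.created_at, row.deleted_at);
}
```

Routing every index change through the two small helpers keeps the "drop empty tags" rule in one place, which is the rule most likely to be forgotten in a port.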
step 03 — cross-language snapshot
Goal
Produce the canonical snapshot byte stream defined in ../CONCEPTS.md, run the deterministic workload in each language, and assert byte-identical SHA-256 across Rust, Go, and C++.
By the end of this step:
- dump_snapshot exists in every language and produces bytes that match the spec section-for-section.
- A run_workload(seed, ops, keys, scenario) function exists in every language and is bit-exact.
- The CLI prints the hex SHA-256 with no trailing newline.
- scripts/verify.sh ends with === OK ===.
- scripts/cross_test.sh ends with === ALL OK === and reports both golden hashes for scenarios A and B.
Tasks
- Implement dump_snapshot. Build it incrementally: write the magic + header first, get a single-row dump matching by hand, then add the secondary section.
- Implement splitmix64 and a stateful SplitMix64::next(). Pin a test for splitmix64(0) == 0x8b57dafca0cee644 to guard against constant typos.
- Implement run_workload per the rules in CONCEPTS.md. Pay special attention to: drawing all three rng words even for read ops; the kind decoding (r1 >> 60) & 0x7; the modulo casts to i64.
- Implement sha256_hex. In Rust use the sha2 crate. In Go use crypto/sha256 + encoding/hex. In C++ inline the reference implementation (FIPS 180-4); keep it in the same translation unit as the engine to avoid a third-party dependency. Pin SHA256("") and SHA256("abc") in tests.
- Wire up the CLI: sqlitectl workload --seed N --ops N --keys N --scenario S. Print the hex with print! / fmt.Print / std::cout, with no newline.
- Run scripts/verify.sh then scripts/cross_test.sh. Iterate until both end with their success markers.
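The PRNG and the "always draw three words" rule might look like the sketch below. It uses the constants from Vigna's published splitmix64.c and a state-advancing next(); whether the lab's pinned splitmix64(0) value refers to this exact convention is defined in CONCEPTS.md, so the sketch deliberately pins no output value. draw_op and its return shape are hypothetical.

```rust
/// Stateful splitmix64 with the constants from Vigna's reference C code.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next(&mut self) -> u64 {
        // All arithmetic is explicitly wrapping, matching unsigned overflow in C.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical op draw: three words are consumed for every op, even when
/// the op kind ends up ignoring some of them, so the stream never desyncs.
fn draw_op(rng: &mut SplitMix64, keys: u64) -> (u8, i64, u64) {
    let r1 = rng.next();
    let r2 = rng.next();
    let r3 = rng.next();
    let kind = ((r1 >> 60) & 0x7) as u8; // 0..=7, mapped to op kinds by the spec
    let k = (r2 % keys) as i64;          // unsigned modulo first, then widen to i64
    (kind, k, r3)
}

fn main() {
    let mut rng = SplitMix64::new(42);
    for _ in 0..3 {
        let (kind, k, r3) = draw_op(&mut rng, 32);
        println!("kind={} k={} r3={:#018x}", kind, k, r3);
    }
}
```

Doing the modulo on the u64 before the cast matters: casting first and then taking a signed remainder can yield negative keys in Rust and C++, which is one of the wrong-width bugs the cross-language hash exposes.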
Debugging a divergence
If cross_test.sh shows different hashes between languages, follow
the ladder in ../docs/verification.md:
shrink the op count, dump the raw snapshot bytes with xxd, diff,
and look for the first differing byte. It almost always points at a
section boundary that exposes either map-iteration order or a
wrong-width cast.
Acceptance
- All three unit suites pass under release optimisation.
- Both === OK === and === ALL OK === markers appear.
- Scenario A hash: e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab.
- Scenario B hash: dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69.
db-16 — Distributed-Fundamentals
This lab builds the vocabulary the rest of the distributed track (db-17 Raft, db-18 Paxos, db-19 ZAB, db-20 distributed-kv) will speak in: logical clocks, vector clocks, the happens-before relation, and a deterministic discrete-event simulator that produces a byte-identical event log across three independent implementations (Rust, Go, C++).
If you cannot write a simulator whose output is bit-stable across runs and across languages, you cannot run reproducible distributed-systems experiments. Every other lab in the track will reuse the discipline established here.
What is it?
A distributed system is a collection of nodes that exchange messages over an asynchronous, lossy network. Three primitives let us reason about such systems without having a wall-clock everyone agrees on:
- Lamport clock: a single integer per node that is incremented on every local event, stamped onto each outgoing message, and bumped to max(self, incoming) + 1 on receive. Lamport (1978) showed that this discipline satisfies the clock condition: if event a happens-before event b, then ts(a) < ts(b). Breaking ties by node id extends it to a total order consistent with causality. The reverse implication does not hold: ts(a) < ts(b) does not mean that a happened before b.
- Vector clock: one counter per node, packaged into a map. Local event increments the owner's counter; receive does pointwise max(self, incoming) then increments the owner's counter. The resulting partial order is the happens-before relation: two events are concurrent iff neither clock dominates the other.
- Deterministic discrete-event simulator: a single-threaded loop that drives sim time forward in integer ticks, delivering messages whose delivery_time == t before letting nodes act. With a seeded PRNG and canonical message ordering, the same (seed, nodes, rounds) triple must always produce the same event log, in any language.
Why does it matter?
- Raft (db-17), Paxos (db-18), ZAB (db-19) all rely on causality: a leader can only commit an entry after it has been replicated to a quorum of followers. Vector clocks give us the language to prove that a particular log entry could not have been committed before a prerequisite was acknowledged.
- Reproducibility is the difference between "I think my consensus algorithm is correct" and "I have an event log I can re-run on someone else's machine and get the same answer." When db-17 develops a leader-election bug under network partition, the first thing you reach for is a deterministic replay of the failure.
- Three independent implementations forces clarity. Any ambiguity in the spec ("when do you read the clock vs. increment it?") will show up as a byte diff in scripts/cross_test.sh. Pinning the wire format and the scheduling rule is the lab.
How does it work?
Lamport rule
local event : self += 1
send : self += 1 ; stamp message with self
recv(m) : self = max(self, m.ts) + 1
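The three Lamport rules translate almost line for line. A minimal sketch; the Lamport type name matches the primitive named later in the lab, while the method names and the u64 width are assumptions.

```rust
/// A Lamport clock: one integer per node.
#[derive(Debug, Default)]
struct Lamport {
    ts: u64,
}

impl Lamport {
    /// Local event: self += 1.
    fn tick(&mut self) -> u64 {
        self.ts += 1;
        self.ts
    }

    /// Send: self += 1, and the returned value is the stamp on the message.
    fn send(&mut self) -> u64 {
        self.ts += 1;
        self.ts
    }

    /// Receive: self = max(self, incoming) + 1.
    fn recv(&mut self, incoming: u64) -> u64 {
        self.ts = self.ts.max(incoming) + 1;
        self.ts
    }
}

fn main() {
    let mut a = Lamport::default();
    let mut b = Lamport::default();
    let stamp = a.send(); // a moves to 1 and stamps the message with 1
    b.tick();
    b.tick();
    b.tick();             // b is at 3 after three local events
    let after = b.recv(stamp); // max(3, 1) + 1 = 4
    println!("stamp={} after_recv={}", stamp, after);
}
```

Each method returns the value after the step, which is the value the wire format records for the event.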
Vector-clock rule
local event(i) : vc[i] += 1
send(i) : vc[i] += 1 ; stamp message with snapshot of vc
recv(i, m) : for k in m.vc : vc[k] = max(vc[k], m.vc[k])
vc[i] += 1
partial order : vc_a < vc_b iff (∀k) vc_a[k] ≤ vc_b[k] AND vc_a ≠ vc_b
vc_a || vc_b iff neither < nor > nor =
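The vector-clock rules and the partial-order test can be sketched the same way. VectorClock and VcOrd are names the lab uses; the variant names Before, After, Equal, and Concurrent, and the u32/u64 widths, are assumptions made for this example. A missing entry is treated as counter 0.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, Default)]
struct VectorClock {
    counters: BTreeMap<u32, u64>, // ordered map: ascending node id for free
}

#[derive(Debug, PartialEq)]
enum VcOrd {
    Before,
    After,
    Equal,
    Concurrent,
}

impl VectorClock {
    /// Local event: vc[owner] += 1.
    fn tick(&mut self, owner: u32) {
        *self.counters.entry(owner).or_insert(0) += 1;
    }

    /// Send: tick, then return a snapshot to stamp on the message.
    fn send(&mut self, owner: u32) -> VectorClock {
        self.tick(owner);
        self.clone()
    }

    /// Receive: pointwise max with the incoming clock, then tick.
    fn recv(&mut self, owner: u32, incoming: &VectorClock) {
        for (&node, &c) in &incoming.counters {
            let e = self.counters.entry(node).or_insert(0);
            *e = (*e).max(c);
        }
        self.tick(owner);
    }

    fn get(&self, node: u32) -> u64 {
        self.counters.get(&node).copied().unwrap_or(0)
    }

    fn compare(&self, other: &VectorClock) -> VcOrd {
        let (mut le, mut ge) = (true, true);
        for node in self.counters.keys().chain(other.counters.keys()) {
            let (a, b) = (self.get(*node), other.get(*node));
            if a > b { le = false; }
            if a < b { ge = false; }
        }
        match (le, ge) {
            (true, true) => VcOrd::Equal,
            (true, false) => VcOrd::Before,
            (false, true) => VcOrd::After,
            (false, false) => VcOrd::Concurrent,
        }
    }
}

fn main() {
    let (mut a, mut b, mut c) = (VectorClock::default(), VectorClock::default(), VectorClock::default());
    let m = a.send(0); // {0:1}
    b.recv(1, &m);     // {0:1, 1:1}
    c.tick(2);         // {2:1}
    println!("{:?} {:?}", m.compare(&b), m.compare(&c));
}
```

In main, the message clock is Before the receiver's clock, and Concurrent with node 2's clock, because node 2 never heard from node 0.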
Simulator loop
for t in 0 .. rounds + MAX_DELAY:
    # 1. deliver: strict (delivery_time, sender_id, seq) order
    while heap is not empty and heap.top().delivery_time == t:
        msg = heap.pop()
        node[msg.dest].recv(msg)
        emit Recv
    # 2. send: only during the active window
    if t < rounds:
        for s in 0 .. nodes:
            r       = splitmix64(seed ^ (t<<32) ^ (s+1))
            pre     = (r & 0xFFFF) % (nodes - 1)
            dest    = pre + 1 if pre >= s else pre    # skip self
            delay   = 1 + (((r>>16) & 0xFFFF) % 3)
            payload = (r>>32) & 0xFF
            node[s].send_to(dest, delay, payload)
            emit Send
The two phases (deliver-then-send) per tick, the strict heap ordering, and the splitmix64 PRNG together guarantee determinism.
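The scheduling skeleton of that loop, without the clocks, fits in a short Rust sketch. Assumptions are labeled in the comments: MAX_DELAY is taken to be 3 (the largest delay the formula can produce), splitmix64 is the "add the constant, then mix" form with Vigna's constants, and the trace tuple is a stand-in for the real event type.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const MAX_DELAY: u64 = 3; // assumed: delays are 1..=3

// Assumed stateless form: one step of Vigna's splitmix64 starting from x.
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// (dest, delay, payload) for sender `s` at tick `t`. Requires nodes >= 2.
fn decide(seed: u64, t: u64, s: u64, nodes: u64) -> (u64, u64, u8) {
    let r = splitmix64(seed ^ (t << 32) ^ (s + 1));
    let pre = (r & 0xFFFF) % (nodes - 1);
    let dest = if pre >= s { pre + 1 } else { pre }; // skip self
    let delay = 1 + ((r >> 16) & 0xFFFF) % 3;
    let payload = ((r >> 32) & 0xFF) as u8;
    (dest, delay, payload)
}

/// Trace entries are (kind, sim_time, node, peer, payload); 1 = Send, 2 = Recv.
fn simulate(seed: u64, nodes: u64, rounds: u64) -> Vec<(u8, u64, u64, u64, u8)> {
    // Min-heap keyed by (delivery_time, sender, seq, dest, payload).
    let mut heap: BinaryHeap<Reverse<(u64, u64, u64, u64, u8)>> = BinaryHeap::new();
    let mut seq = 0u64; // global monotonic tie-break
    let mut log = Vec::new();
    for t in 0..rounds + MAX_DELAY {
        // 1. deliver everything due at this tick, in heap order
        while let Some(&Reverse((due, sender, _seq, dest, payload))) = heap.peek() {
            if due != t {
                break;
            }
            heap.pop();
            log.push((2u8, t, dest, sender, payload));
        }
        // 2. send, only during the active window
        if t < rounds {
            for s in 0..nodes {
                let (dest, delay, payload) = decide(seed, t, s, nodes);
                heap.push(Reverse((t + delay, s, seq, dest, payload)));
                seq += 1;
                log.push((1u8, t, s, dest, payload));
            }
        }
    }
    log
}

fn main() {
    let log = simulate(42, 3, 20);
    println!("{} events", log.len());
}
```

Because every delay is at least 1 and at most MAX_DELAY, each of the nodes × rounds sends is delivered before the loop ends, which is where the 2 × nodes × rounds event count comes from.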
Canonical wire format
file := magic[4="DSE6"] u32_le(event_count) event*
event :=
u8 kind # 1 = Send, 2 = Recv
u64_le sim_time
u32_le node # sender for Send, receiver for Recv
u32_le peer # dest for Send, source for Recv
u64_le lamport # value AFTER the local step
u32_le vc_len
[u32_le node, u64_le counter] * vc_len # sorted ASC by node
u32_le payload_len
u8 payload[payload_len]
All multi-byte numbers are little-endian. Vector-clock entries must be serialized in ascending order by node-id; this is the single most common source of byte-diff bugs.
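As an illustration of the grammar above, an encoder might look like this. The Event struct and the function names are hypothetical; the field order and widths follow the grammar. Storing the vector clock in a BTreeMap makes the ascending-by-node rule hold by construction.

```rust
use std::collections::BTreeMap;

// Hypothetical in-memory event; field order mirrors the wire grammar.
struct Event {
    kind: u8, // 1 = Send, 2 = Recv
    sim_time: u64,
    node: u32,
    peer: u32,
    lamport: u64,
    vc: BTreeMap<u32, u64>, // iterates ascending by node id
    payload: Vec<u8>,
}

fn encode_event(e: &Event, out: &mut Vec<u8>) {
    out.push(e.kind);
    out.extend_from_slice(&e.sim_time.to_le_bytes());
    out.extend_from_slice(&e.node.to_le_bytes());
    out.extend_from_slice(&e.peer.to_le_bytes());
    out.extend_from_slice(&e.lamport.to_le_bytes());
    out.extend_from_slice(&(e.vc.len() as u32).to_le_bytes());
    for (node, counter) in &e.vc {
        out.extend_from_slice(&node.to_le_bytes());
        out.extend_from_slice(&counter.to_le_bytes());
    }
    out.extend_from_slice(&(e.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&e.payload);
}

fn encode_log(events: &[Event]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSE6");
    out.extend_from_slice(&(events.len() as u32).to_le_bytes());
    for e in events {
        encode_event(e, &mut out);
    }
    out
}

fn main() {
    let mut vc = BTreeMap::new();
    vc.insert(0u32, 1u64);
    // peer = 1 and payload = 0xAB are made-up values for the example.
    let send = Event { kind: 1, sim_time: 0, node: 0, peer: 1, lamport: 1, vc, payload: vec![0xAB] };
    let bytes = encode_log(&[send]);
    // 8-byte header + 1 + 8 + 4 + 4 + 8 + 4 + 12 + 4 + 1 = 54 bytes
    println!("{} bytes", bytes.len());
}
```

A Send from a node that has only sent carries a single vector-clock entry, so it encodes to 46 bytes after the 8-byte header.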
Cross-language invariants
| Invariant | Why it matters |
|---|---|
| splitmix64 mix seed ^ (t<<32) ^ (s+1) | identical PRNG stream |
| dest skip-self: if pre >= s then pre+1 | identical destination choice |
| heap order (delivery_time, sender, seq) | identical delivery order |
| seq is global monotonic | deterministic tie-break across nodes |
| VC entries sorted by node-id on the wire | byte-identical serialization |
| all integers little-endian | byte-identical on every host |
If any one of these drifts, scripts/cross_test.sh will fail at the
sha256 compare and cmp -l will print the byte offset of the first
divergence.
Files
- src/rust/: distfund16 crate + simctl binary.
- src/go/: module github.com/10xdev/dse/db16 + cmd/simctl.
- src/cpp/: db16_lib static library + simctl binary + test_db16.
- scripts/verify.sh: runs the unit tests for all three.
- scripts/cross_test.sh: proves the three binaries produce byte-identical event logs for two seeded scenarios.
See docs/ for the longer write-up, and steps/ for the staged
implementation path.
db-16 — References
Primary sources
- Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 21(7), 1978. The original paper. Defines happens-before, the logical clock, and (in §4) the construction of a total order consistent with causality. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
- Colin Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering, 11th ACSC, 1988. Introduces vector clocks and proves they characterize the happens-before relation exactly.
- Friedemann Mattern, Virtual Time and Global States of Distributed Systems, 1989. The companion vector-clock paper; reads more approachably than Fidge.
- Sebastiano Vigna, splitmix64 — a small, fast, well-distributed
64-bit mixer used as the seeder for
xoroshiro. https://prng.di.unimi.it/splitmix64.c
Determinism and reproducibility
- Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson, Eraser: A Dynamic Data Race Detector for Multithreaded Programs, SOSP 1997. Not directly cited here, but the motivation, "if you cannot replay a bug deterministically you cannot debug it", is the entire reason this lab exists.
- FoundationDB's simulation testing (Apple/Snowflake) — a production example of deterministic discrete-event simulation at scale. https://apple.github.io/foundationdb/testing.html
- Jepsen — Kyle Kingsbury's distributed-systems testing harness. Not deterministic itself (it injects real faults), but the methodology of "generate events, observe a history, check it against a model" is the vocabulary db-16 sets up. https://jepsen.io/
Production engines that use these primitives
- Riak / Dynamo — vector clocks for sibling reconciliation.
- CRDTs (Shapiro, Preguiça, Baquero, Zawirski, 2011) — vector clocks and version vectors are the substrate for state-based merge functions.
- TLA+ — Lamport's specification language; ordering events by Lamport clock is the mental model behind every TLA+ refinement proof.
Cross-lab dependencies
- This lab has no upstream dependencies. It is the bedrock for the distributed track.
- db-17 Raft consumes the simulator: leader-election scenarios and log-replication invariants will be expressed as scripted event sequences run against a deterministic transport built on top of db-16.
- db-18 Paxos, db-19 ZAB, db-20 distributed-kv reuse the same vocabulary (Lamport/VC for causality assertions, deterministic scheduler for fault-injection replay).
db-16 — Analysis
Required invariants
- Lamport monotonicity. For any node n, the sequence of Lamport values produced by its successive tick/send/recv calls is strictly monotonic.
- Lamport consistency with happens-before. If a → b (happens-before), then ts(a) < ts(b). The converse does not hold; that is the cost of compressing a partial order into a single integer.
- Vector-clock characterization. With vector clocks the biconditional holds: a → b iff vc(a) < vc(b) componentwise (and vc(a) ≠ vc(b)).
- Send-precedes-receive. Every Recv event in the simulator is paired with exactly one Send event from (peer → node) whose sim_time is strictly less than the Recv's sim_time and whose vector clock is strictly less than the Recv's.
- Byte determinism. For every (seed, nodes, rounds), the three binaries produce identical bytes on stdout. This is the single property scripts/cross_test.sh checks; if it ever drifts, all downstream labs lose reproducibility.
Design decisions
- Two-phase tick (deliver-then-send). Each integer tick first drains all in-flight messages whose delivery_time has arrived, then runs every node's send. Doing deliver first means a single tick can witness a message being received and a response being sent, capturing causal flow without needing finer time resolution.
- Heap ordered by (delivery_time, sender, seq). The third field (seq, a global monotonic counter) gives an unconditional tie-break even when one sender has two messages due at the same tick, for example one sent at tick t with delay 2 and one sent at tick t+1 with delay 1.
- splitmix64 seeded per (seed, t, s). A single splitmix64 call produces all three random fields (dest, delay, payload) for one (t, s) decision. This avoids the question "whose RNG state advances first": there is no shared RNG state at all.
- Vector-clock entries sorted on the wire. BTreeMap in Rust, sorted-key iteration in Go, std::map in C++ all produce ascending order naturally. If you ever switch the Rust side to HashMap you will get byte diffs.
- Lib + thin CLI. All three implementations expose the same trio of primitives (Lamport, VectorClock, simulate/Simulate) as a library. The CLI is ten lines that call serialize(simulate(...)) and write to stdout. Downstream labs will link the library, not shell out to the CLI.
Why three languages
- Forces the spec to be unambiguous. A Rust BTreeMap and a C++ std::map both happen to iterate in key order; the moment you reach for Go's map you discover the language does not and you must sort explicitly. That kind of discovery only happens with multiple implementations.
- Pins endianness, integer overflow semantics (wrapping), and signed-vs-unsigned modulo. Splitmix64 in particular depends on unsigned wrapping multiplication; expressing it identically in three languages is a forcing function.
- Future-proofs the track. db-17 onwards will pick one host language per experiment; having a reference implementation in three independent languages means a port bug in db-17's Raft simulator can be cross-checked against the db-16 baseline.
Tradeoffs worth flagging
- Sim time is integer ticks, not floating-point seconds. This trades realism for determinism. Real networks have continuous-time jitter; capturing that would require an event-priority structure keyed by a rational/decimal time, which is not worth the complexity for a study lab.
- All sends are unicast and always succeed. We do not model drops, reorderings beyond delay-based interleaving, or partitions. db-17 will add a partition primitive on top of this simulator; doing it here would mean adding --drop-rate to the CLI and changing the wire format, which would lock in a poor abstraction.
- Each node sends exactly one message per tick during the active window. That is a fixed-load workload. Variable-load (silent nodes, bursty senders) would be a strict extension; it is intentionally omitted to keep the spec small enough to verify by hand.
db-16 — Execution
One-shot: prove the lab works
cd db-16-distributed-fundamentals
./scripts/verify.sh # all unit tests in Rust, Go, C++
./scripts/cross_test.sh # byte-identical event logs across all three
A green run of cross_test.sh ends with the literal line:
=== ALL OK ===
Per-language workflows
Rust
cd src/rust
cargo test # 7 tests
cargo build --release # produces target/release/simctl
./target/release/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_rust.bin
Go
cd src/go
go test ./... # 7 tests
go build -o /tmp/simctl_go ./cmd/simctl
/tmp/simctl_go --seed 42 --nodes 3 --rounds 20 > /tmp/log_go.bin
C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build # test_db16 → "db-16 C++ tests: 7 passed"
./build/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_cpp.bin
CLI
All three binaries accept the same flags:
| flag | default | meaning |
|---|---|---|
| --seed N | 0 | splitmix64 seed |
| --nodes K | 3 | number of nodes; must be ≥ 2 |
| --rounds R | 20 | number of send-rounds (sim time runs for R + MAX_DELAY ticks) |
The output is the binary wire format described in CONCEPTS.md. Pipe to
a file; do not display on a terminal.
Canonical scenarios
scripts/cross_test.sh runs two scenarios; their sha256s are checked
into the lab's verification path:
| label | args | sha256 |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 20 | 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797 |
| B | --seed 7 --nodes 5 --rounds 50 | 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4 |
If you change any of: PRNG, scheduler order, wire format, or VC entry ordering — these hashes will change and you must update both the script and this table in the same commit. That synchronization step is the forcing function that keeps the spec honest.
Sanity checks
# magic bytes
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | xxd -l 8
# expect: 00000000: 4453 4536 7800 0000 DSE6x...
# 0x78 = 120 = events: 60 Sends + 60 Recvs for nodes=3 rounds=20
# event count
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | \
python3 -c 'import sys,struct; d=sys.stdin.buffer.read(); print(struct.unpack("<I", d[4:8])[0])'
# → 120
db-16 — Observation
What does the simulator's output actually look like, and how do you read it by hand?
Header
offset 0x00 : 44 53 45 36 "DSE6" (magic)
offset 0x04 : 78 00 00 00 120 (event_count, u32 LE)
For --seed 42 --nodes 3 --rounds 20 the event count is 3 nodes × 20 rounds × 2 (send + recv) = 120.
A single Send event
Every Send is the start of a causal arc; every Recv is its endpoint. The first event in scenario A is a Send from node 0 at sim_time 0:
01 kind = 1 = Send
00 00 00 00 00 00 00 00 sim_time = 0
00 00 00 00 node = 0 (sender)
?? 00 00 00 peer = ? (destination, computed from PRNG)
01 00 00 00 00 00 00 00 lamport = 1 (Send rule: self += 1, then stamp)
01 00 00 00 vc_len = 1
00 00 00 00 01 00 00 00 00 00 00 00 (node=0, counter=1)
01 00 00 00 payload_len = 1
?? payload byte
Note the vector clock for a node that has only sent has a single entry (its own counter). Receivers' vector clocks grow as they merge incoming clocks.
A single Recv event
Recvs look identical except kind = 2 and peer is the source node:
02 kind = 2 = Recv
?? ?? ?? ?? ?? ?? ?? ?? sim_time = original send time + delay
01 00 00 00 node = 1 (receiver)
00 00 00 00 peer = 0 (sender of paired Send)
?? ?? ?? ?? ?? ?? ?? ?? lamport = max(self_before, incoming) + 1
02 00 00 00 vc_len = 2
00 00 00 00 01 00 00 00 00 00 00 00 merged entry for node 0
01 00 00 00 ?? 00 00 00 00 00 00 00 own counter, incremented
01 00 00 00 payload_len = 1
?? payload byte (copied from send)
The number of VC entries grows as a node hears from new peers; in a 3-node, 20-round run each receiver will eventually have all 3 entries.
Hex walkthrough
./simctl --seed 42 --nodes 3 --rounds 20 | xxd | head
Read column-by-column:
00000000: 44 53 45 36 78 00 00 00                header: "DSE6", event_count=120
00000008: 01                                     first Send: kind=1
00000009: 00 00 00 00 00 00 00 00                sim_time=0
00000011: 00 00 00 00                            node=0
00000015: ?? 00 00 00                            peer
00000019: 01 00 00 00 00 00 00 00                lamport=1
00000021: 01 00 00 00                            vc_len=1
00000025: 00 00 00 00 01 00 00 00 00 00 00 00    vc entry (0 → 1)
00000031: 01 00 00 00                            payload_len=1
00000035: ??                                     payload byte
00000036: 01 ...                                 next event (the Send from node 1 at t=0)
The whole file for scenario A is 8156 bytes; scenario B is 45592 bytes.
What to learn from looking at it
- Lamport values strictly increase within a node but may regress from one node to the next in the log. That is healthy: nodes 0 and 1 can be ahead of node 2 if 2 hasn't sent or received yet.
- The vector-clock entry for node i in node i's own events is strictly monotonic.
- For any Send/Recv pair, the Recv's VC must dominate the Send's VC (> in VcOrd). This is exactly what check_causality asserts.
- If you sort all events by sim_time you get a globally consistent "tape", but events at the same sim_time on different nodes are concurrent and have no inherent ordering: the minimum delay is one tick, so nothing sent at tick t can arrive at tick t. Deliveries are scheduled before sends within a tick by simulator policy, not by physics.
Cross-language reading
scripts/cross_test.sh prints the hex of the first 8 bytes (44 53 45 36 78 00 00 00 for scenario A). If three implementations agree on those 8
bytes but disagree on the rest, the suspect is almost always either
(a) VC-entry order on the wire, or (b) heap tie-break by sender id.
db-16 — Verification
How to reproduce the green status on a clean machine.
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
- cmake ≥ 3.20.
- Rust toolchain ≥ 1.74.
- Go ≥ 1.22.
- shasum, xxd, awk (default on macOS; coreutils on Linux).
One command
cd db-16-distributed-fundamentals
scripts/verify.sh # builds + unit tests in all three langs
scripts/cross_test.sh # cross-language sha256 match
Both should print === OK === / === ALL OK === and exit 0.
Per-language drill-down
Rust
cd db-16-distributed-fundamentals/src/rust
cargo test --quiet
cargo build --release
Expected: 7 passed; 0 failed. The simctl binary lands in
target/release/simctl.
Go
cd db-16-distributed-fundamentals/src/go
go test ./...
go build ./cmd/simctl
Expected: ok github.com/10xdev/dse/db16 <duration>.
C++
cd db-16-distributed-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and test_db16
prints "db-16 C++ tests: 7 passed".
What "green" means
A green run guarantees:
- All 21 unit tests pass (7 each in Rust, Go, C++) covering Lamport monotonicity, vector-clock partial order including the Concurrent case, simulator determinism on a fixed seed, and causality of the generated event log.
- The cross-language test produces byte-identical event logs for both canonical scenarios:

  | scenario | sha256 | size |
  |---|---|---|
  | A --seed 42 --nodes 3 --rounds 20 | 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797 | 8 156 B |
  | B --seed 7 --nodes 5 --rounds 50 | 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4 | 45 592 B |

  Matching sha256s prove that all three implementations agree on the PRNG, the scheduling rule, the Lamport / vector-clock update rules, the VC entry ordering on the wire, and the integer endianness.
- The spot-check in cross_test.sh confirms the magic header 44 53 45 36 and the expected u32 LE event count, guarding against the regression where all three implementations agree on producing empty output.
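The header spot-check from the last bullet can be sketched in Rust. This is only an illustration: cross_test.sh itself is a shell script, and the helper name check_header and its error strings are assumptions made for this example. The magic bytes and the 2 × nodes × rounds count formula come from this lab.

```rust
/// Validate the 8-byte header of a db-16 event log:
/// 4 magic bytes "DSE6" followed by a little-endian u32 event count.
/// `check_header` is a hypothetical helper, not part of the lab's API.
fn check_header(bytes: &[u8], nodes: u32, rounds: u32) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err(format!("log too short: {} bytes", bytes.len()));
    }
    if &bytes[0..4] != b"DSE6" {
        return Err(format!("bad magic: {:02x?}", &bytes[0..4]));
    }
    let count = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    // Every send becomes exactly one recv, so the log holds 2 * nodes * rounds events.
    let expected = 2u64 * u64::from(nodes) * u64::from(rounds);
    if u64::from(count) != expected {
        return Err(format!("event count {count}, expected {expected}"));
    }
    Ok(count)
}
```

For scenario A (3 nodes, 20 rounds) the header bytes 44 53 45 36 78 00 00 00 pass, because 0x78 = 120 = 2 × 3 × 20. An all-zero count fails, which is the "everyone agrees on empty output" regression the spot-check exists to catch.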
When verification fails
- Cross-language mismatch within the first 8 bytes: magic / count drift. Almost always a count formula bug (2 × nodes × rounds).
- Mismatch past byte 8 but matching on a smaller --rounds: the PRNG or the scheduler diverges as soon as a recv-in-flight overlaps with a send. Inspect splitmix64 and the heap tie-break.
- Causality test fails in one language only: that language's recv does not bump its own counter, or bumps before the merge. Read the Vector-clock rule in CONCEPTS again.
- One language passes locally but the cross-test diverges: most often, VC entries are serialized in insertion order rather than sorted by node-id. Switch to BTreeMap / std::map / explicit sort.Slice.
db-16 — Broader Ideas
Where the primitives in this lab show up in real systems, and what to build on top of them in the rest of the distributed track.
Immediate next labs
- db-17 (Raft). Reuses the deterministic simulator wholesale. Adds a Role ∈ {Follower, Candidate, Leader}, election timeouts, AppendEntries RPCs, and a commit index. Every step's safety argument ultimately reduces to "this state could not have been reached without a quorum acknowledgement", which is a happens-before statement on the log, exactly what vector clocks make precise.
- db-18 (Paxos). Same harness, different message types (Prepare/Promise/Accept/Accepted). Paxos's invariants are notoriously hard to reason about by hand; a deterministic simulator that can replay a counterexample seed is the difference between "I think it's correct" and "I have evidence".
- db-19 (ZAB). Adds a strict total order on broadcasts and a recovery phase. The Lamport clock generalizes to the ZAB epoch + counter pair.
- db-20 (Distributed KV). Wraps a quorum-replicated key-value store around a chosen consensus engine. Now the simulator's "payload" is a client command, and the event log is auditable per-replica state.
How this lab's pieces map to real systems
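- Lamport clocks are the conceptual kernel of the scalar ordering stamps used in real systems. The closest direct descendant is the hybrid logical clock (as in CockroachDB), which pairs a physical timestamp with a Lamport-style counter. Kafka offsets, Spanner's TrueTime commit timestamps, and Cassandra's per-cell write timestamps answer the same "one scalar that orders events" question, but they are leader-assigned sequence numbers or physical-clock timestamps rather than Lamport clocks proper.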
- Vector clocks are the kernel of Amazon Dynamo's conflict detection, Riak's siblings, and the CRDT literature's "stable causal delivery" layer.
- Deterministic discrete-event simulation is how FoundationDB developed and continues to harden its storage and replication code (Will Wilson's Testing Distributed Systems with Deterministic Simulation talk at Strange Loop 2014 is the canonical reference). It is also how TigerBeetle, Polar Signals, and Antithesis test their production code paths.
- The (time, sender, seq) heap tie-break is the same trick used by discrete-event simulators from simpy to game-engine fixed-timestep loops: append a monotonic sequence number so that equal-time events still have one defined order.
Performance experiments worth running later
- Crank --nodes and --rounds and plot wall time vs. event count for each language. With the current canonical serializer this should be close to linear in events for a fixed --nodes; any quadratic growth means the wire format or the heap is doing something dumb.
- Replace the unicast splitmix64 destination with a broadcast and measure the explosion in VC entries per receive (each broadcast forces every other node's VC to grow by one entry).
- Try a HashMap-based VC in Rust and observe cross_test.sh failing. This is the cheapest possible lesson on why deterministic iteration order matters; do it once and you will never forget it.
What "production-quality" would require beyond this lab
- A real network layer (TCP or QUIC), with retries, timeouts, and application-level acks rather than the simulator's deliver-and-forget.
- Lossy / reordering channels and partition primitives. db-17 will add partitions as a Network::partition(a, b) toggle; this lab deliberately omits them so the determinism story is small.
- Persistent storage for clocks (so a crash-restart doesn't replay Lamport from 0). The Raft lab in db-17 will need this; the WAL we built in db-03 is the obvious substrate.
- Compact vector clocks (interval tree clocks, dotted version vectors) for systems with thousands of nodes or more; the naive map-based VC here becomes a bandwidth problem at that scale.
None of these change the shape of the primitives — they make the same primitives faster, leaner, and tolerant of real-world failures.
db-16 step 01 — Logical clocks
Goal
Implement Lamport and vector clocks as first-class types in all three languages, with identical semantics under a small, well-defined API.
Tasks
- Lamport clock. A wrapper over a u64 counter exposing:
  - tick(): bump the counter, return the new value.
  - send() -> u64: equivalent to tick: bump, then return the stamp for the outgoing message.
  - recv(incoming: u64): self = max(self, incoming) + 1.
  - value() -> u64.
- Vector clock. A wrapper over Map<u32, u64> exposing:
  - tick(self_id): vc[self_id] += 1.
  - send(self_id) -> Self: bump own counter, return a clone of the full VC (the snapshot that gets stamped onto a message).
  - recv(self_id, incoming: &Self): pointwise vc[k] = max(vc[k], incoming[k]) for every key k in incoming, then bump the receiver's own counter.
  - partial_cmp(other) -> {Less, Equal, Greater, Concurrent}: a pure function over the two maps.
- Wire serialization for the VC. Entries on the wire MUST be sorted ascending by node_id. This is non-negotiable: unsorted entries are the single biggest source of byte-diff bugs across languages.
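The two clock types above can be sketched in Rust as follows. The method names follow the task list; the byte layout in to_bytes (a u32 LE entry count followed by (u32 LE id, u64 LE counter) pairs) is an assumption made for illustration, since the authoritative layout lives in CONCEPTS.md.

```rust
use std::collections::BTreeMap;

/// Lamport clock: a single monotonically increasing counter.
#[derive(Debug, Default, Clone, PartialEq)]
struct Lamport(u64);

impl Lamport {
    fn tick(&mut self) -> u64 { self.0 += 1; self.0 }
    fn send(&mut self) -> u64 { self.tick() }
    fn recv(&mut self, incoming: u64) { self.0 = self.0.max(incoming) + 1; }
    fn value(&self) -> u64 { self.0 }
}

/// Vector clock keyed by node id. BTreeMap iterates in ascending key
/// order, which yields the sorted wire order without an explicit sort.
#[derive(Debug, Default, Clone, PartialEq)]
struct VectorClock(BTreeMap<u32, u64>);

impl VectorClock {
    fn tick(&mut self, self_id: u32) {
        *self.0.entry(self_id).or_insert(0) += 1;
    }

    /// Bump own counter, then return the snapshot stamped onto the message.
    fn send(&mut self, self_id: u32) -> Self {
        self.tick(self_id);
        self.clone()
    }

    /// Pointwise max over the incoming keys first, own bump second.
    fn recv(&mut self, self_id: u32, incoming: &Self) {
        for (&k, &v) in &incoming.0 {
            let slot = self.0.entry(k).or_insert(0);
            *slot = (*slot).max(v);
        }
        self.tick(self_id);
    }

    /// ASSUMED layout (illustrative only): u32 LE count, then
    /// (u32 LE id, u64 LE counter) pairs in ascending id order.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + 12 * self.0.len());
        out.extend_from_slice(&(self.0.len() as u32).to_le_bytes());
        for (&id, &counter) in &self.0 {
            out.extend_from_slice(&id.to_le_bytes());
            out.extend_from_slice(&counter.to_le_bytes());
        }
        out
    }
}
```

Because the ordering lives in the container rather than in the serializer, inserting entries as (2, 0) or as (0, 2) produces the same bytes. A HashMap would need an explicit sort before writing.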
Acceptance
Inline unit tests in each language:
- lamport_tick_monotonic: three ticks produce 1, 2, 3.
- lamport_recv_jumps: recv(10) after value 3 yields 11.
- vc_partial_order_less: {0:1} < {0:1, 1:1}.
- vc_partial_order_concurrent: {0:2, 1:0} and {0:0, 1:2} are concurrent.
- vc_recv_merges_then_ticks: recv(self=1, {0:5, 1:0}) from initial {1:2} yields {0:5, 1:3}.
- vc_serialize_sorted: the bytes are identical no matter what order entries were inserted in the map.
All six green in Rust, Go, and C++.
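The partial-order tests above exercise a comparison that can be sketched like this. The enum name VcOrdering and the free function vc_partial_cmp are illustrative names, and treating a missing key as 0 is an assumption of this sketch.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum VcOrdering { Less, Equal, Greater, Concurrent }

/// Compare two vector clocks pointwise. A key missing from one map is
/// treated as 0 (an assumption of this sketch).
fn vc_partial_cmp(a: &BTreeMap<u32, u64>, b: &BTreeMap<u32, u64>) -> VcOrdering {
    let (mut some_less, mut some_greater) = (false, false);
    // Visit every key that appears in either clock.
    for k in a.keys().chain(b.keys()) {
        let x = a.get(k).copied().unwrap_or(0);
        let y = b.get(k).copied().unwrap_or(0);
        if x < y { some_less = true; }
        if x > y { some_greater = true; }
    }
    match (some_less, some_greater) {
        (false, false) => VcOrdering::Equal,
        (true, false) => VcOrdering::Less,
        (false, true) => VcOrdering::Greater,
        // Each side is ahead on at least one component: neither happened-before the other.
        (true, true) => VcOrdering::Concurrent,
    }
}
```

Concurrent is the only outcome that tells a caller "these two events have no causal relationship", which is why it needs its own variant instead of being folded into Equal or Less.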
Discussion prompts
- Why does recv bump the receiver's own counter after the merge rather than before?
- Why is the "Concurrent" outcome of partial_cmp necessary; what goes wrong if you collapse it into Equal or Less?
- For a system with one million nodes, is a map-keyed VC still practical? What data structures replace it (hint: interval tree clocks, dotted version vectors)?
db-16 step 02 — Deterministic simulator
Goal
Build a discrete-event simulator whose (seed, nodes, rounds) triple
completely determines its event log, and produce a canonical
serialization of that log.
Tasks
- PRNG. Implement splitmix64 in each language with unsigned wrapping multiplication. Seed it per-decision with seed ^ (t << 32) ^ (s + 1) so that no shared mutable PRNG state crosses a (t, s) boundary. This eliminates the "whose turn is it to read the RNG?" ambiguity that bites every multi-language implementation.
- Per-tick decision. For each (t < rounds, s ∈ 0..nodes), draw r from the per-decision PRNG and compute:
  - dest_pre = (r & 0xFFFF) % (nodes - 1), then skip-self: dest = dest_pre + (1 if dest_pre >= s else 0).
  - delay = 1 + ((r >> 16) & 0xFFFF) % 3.
  - payload = (r >> 32) & 0xFF.
- Scheduler. Maintain a min-heap of in-flight messages keyed on (delivery_time, sender_id, global_seq). global_seq is a single monotonic counter incremented every time a message is enqueued. This guarantees a total order even when two messages have identical (delivery_time, sender_id).
- Tick loop. For t in 0 .. rounds + MAX_DELAY:
  - Drain all heap entries with delivery_time == t: for each, run recv on the destination node, emit a Recv event.
  - If t < rounds: for each s in 0..nodes, compute the decision, enqueue the message, run send on the sender, emit a Send event.
- Wire format. As documented in CONCEPTS.md. Magic "DSE6", u32 LE event count, then event_count events. Each event uses little-endian integers and serializes its vector clock with entries sorted ascending by node id.
Acceptance
Inline unit tests:
- splitmix64_known_values: for seed=0, the first three outputs are 0xE220A8397B1DCDAF, 0x6E789E6AA1B965F4, 0x06C45D188009454F.
- sim_deterministic_one_node: --nodes 2 --rounds 3 --seed 1 produces a fixed event count and a fixed first-event byte sequence.
- sim_event_count_formula: for any (nodes ≥ 2, rounds ≥ 1), total events = 2 * nodes * rounds (every send becomes exactly one recv).
- causality_holds: after running simulate(...), walk the event log: every Recv from peer has a strictly-greater VC than the paired Send.
- byte_round_trip: serializing the same event log twice yields identical bytes (no nondeterminism in the serializer itself).
All five green in Rust, Go, and C++.
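The scheduler from the task list, a min-heap keyed on (delivery_time, sender_id, global_seq), might look like this in Rust. The Network type and its method names are illustrative assumptions, and the payload is reduced to a single byte to keep the sketch short.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Heap key: (delivery_time, sender_id, global_seq). Tuples compare
/// lexicographically; `Reverse` turns Rust's max-heap into a min-heap.
type Key = (u64, u32, u64);

struct Network {
    heap: BinaryHeap<Reverse<(Key, u8)>>, // (key, payload)
    next_seq: u64,                        // one counter for all senders
}

impl Network {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn enqueue(&mut self, delivery_time: u64, sender: u32, payload: u8) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse(((delivery_time, sender, seq), payload)));
    }

    /// Pop every message due at tick `t`, in ascending key order.
    fn drain_due(&mut self, t: u64) -> Vec<(Key, u8)> {
        let mut due = Vec::new();
        while self.heap.peek().map_or(false, |Reverse((k, _))| k.0 == t) {
            due.push(self.heap.pop().unwrap().0);
        }
        due
    }
}
```

With messages enqueued for tick 5 by sender 2, then sender 0, then sender 0 again, drain_due(5) yields both sender-0 messages in enqueue order followed by the sender-2 message, regardless of how the heap happens to be laid out internally.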
Discussion prompts
- Why deliver before send within a single tick?
- What breaks if global_seq is per-sender instead of global?
- The simulator never drops or reorders messages beyond delay-based interleaving. What new wire-format field would --drop-rate p need, and would it break the cross-language hash if defaulted to 0?
db-16 step 03 — CLI and cross-language byte-identity
Goal
Build a simctl CLI in all three languages, then prove via sha256 that
all three produce byte-identical event logs for the same
(seed, nodes, rounds) triple — for at least two distinct scenarios.
CLI contract
simctl --seed N --nodes K --rounds R
Writes the canonical wire-format bytes (no trailing newline) to stdout.
Tasks
- Build simctl in Rust (src/rust/src/bin/simctl.rs), Go (src/go/cmd/simctl/main.go), and C++ (src/cpp/src/simctl.cc).
- Write scripts/verify.sh that runs unit tests in all three langs.
- Write scripts/cross_test.sh that:
  - Builds all three binaries.
  - Scenario A: simctl --seed 42 --nodes 3 --rounds 20 → sha256 all three outputs → assert all three match.
  - Scenario B: simctl --seed 7 --nodes 5 --rounds 50 → sha256 all three → assert all three match.
  - Spot-check the first 8 bytes of scenario A's output equal the magic "DSE6" plus the u32 LE count 120.
  - Print === ALL OK ===.
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
...
match(A): 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797
match(B): 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4
=== ALL OK ===
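A byte-identical hash across three independent implementations is strong evidence that the PRNG, scheduler, clock-update rules, and wire format agree with each other. Combined with the unit tests, it is also strong evidence that they agree with the spec (a bug shared by all three would still hash identically). Any divergence, even on a single byte, will surface here.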
Discussion prompts
- Why two scenarios instead of one? What property would slip through with a single scenario that two catch?
- If the scenario-A hash matches but scenario B does not, where in the codebase would you start looking?
- The sha256 hashes are baked into the script as constants. What's the benefit, and what's the maintenance cost when the wire format legitimately evolves (e.g., adding a new event kind)?
db-17 — Raft
This lab implements Raft consensus in Rust, Go, and C++, all three
producing a byte-identical sha256 of a canonical cluster dump for any
(seed, nodes, rounds, proposals, partition) configuration. It builds
directly on the deterministic-simulator discipline from db-16: same
splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break,
same "sorted iteration on the wire" rule.
If db-16 taught you to keep an event log bit-stable across three languages, db-17 teaches you to keep an entire replicated state machine's persistent state bit-stable across three languages and across network partitions. Every later distributed lab (db-18 Paxos, db-19 ZAB, db-20 distributed-kv) is a variation on this skeleton.
What is it?
Raft (Ongaro & Ousterhout, USENIX ATC 2014) is a consensus algorithm that keeps an ordered replicated log consistent across a cluster of nodes despite crashes, message reorderings, and arbitrary partitions of the network. Safety holds under all of these; progress additionally needs a majority of nodes that can reach each other. It is the consensus core inside etcd, Consul, TiKV, CockroachDB, and many more, and MongoDB's replica-set protocol is derived from it.
Raft decomposes consensus into three sub-problems:
- Leader election. Each node is one of {Follower, Candidate, Leader}. Followers run an election timeout; on timeout a follower becomes a candidate, bumps its current_term, votes for itself, and broadcasts RequestVote. A candidate that receives a majority of vote_granted=true replies in the same term becomes leader.
- Log replication. The leader accepts client proposals and appends them to its log. It broadcasts AppendEntries RPCs carrying the new entries plus a prev_log_index / prev_log_term consistency check. On a mismatch the follower rejects; the leader decrements next_index[follower] and retries. Once a majority's match_index covers entry N and log[N].term == current_term, the leader advances commit_index to N.
- Safety. Election restriction (a candidate only earns a vote if its log is at least as up-to-date as the voter's), the commit-only-current-term rule, and the log-matching property (identical (index, term) ⇒ identical entries) together imply state machine safety: once an entry at index i is applied at one node, no other node will ever apply a different entry at i.
This lab implements the algorithm as it appears in Figure 2 of the
paper, minus snapshots and minus membership changes. The simulator
drives sim time forward in integer ticks; messages are scheduled into a
heap with a deterministic (delivery_time, sender, seq) order; an
optional partition set drops messages in one direction between named
pairs.
Why does it matter?
- Raft is the production consensus algorithm of the 2010s. Knowing exactly how prev_log_index works, why commit advance is gated on log[N].term == current_term, and why the election restriction exists is the difference between operating etcd and understanding etcd.
- Three byte-identical implementations forces the spec to be unambiguous. Anywhere Raft "depends on the implementation" (RPC scheduling, election timer jitter, tiebreak for "which leader gets a proposal", iteration order of peer ids) has to be pinned down. The cross-language sha256 makes drift loud.
- Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leadership changes, and committed entries that triggered a bug, on any machine, in any of the three languages.
- Foundation for the rest of the track. db-18 Paxos and db-19 ZAB will reuse the simulator harness; db-20 distributed-kv will plug a consensus engine into a real key-value store.
How does it work?
State (per node)
persistent  : current_term      : u64
              voted_for         : Option<u32>         # None == -1 on the wire
              log               : Vec<LogEntry>       # 1-indexed in Figure 2; 0-indexed here
volatile    : role              : Follower | Candidate | Leader
              commit_index      : u64                 # highest log index known committed
              last_applied      : u64                 # we apply lazily; rarely diverges from commit_index
leader-only : next_index        : Map<peer_id, u64>   # index of next entry to send to each peer
              match_index       : Map<peer_id, u64>   # highest entry known replicated on each peer
timers      : election_deadline : u64                 # sim-time tick
              heartbeat_due     : u64                 # next time leader must send AE
Election timer
reset_election_timer(t):
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
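A 150-tick base plus up to 149 ticks of seeded jitter puts every deadline in [t + 150, t + 299], which makes the classic split-vote loop unlikely. Heartbeats fire every 50 ticks, so absent message loss a follower hears from a healthy leader at least twice before its timer can expire.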
RequestVote handling
on RequestVote(term, candidate, last_log_index, last_log_term):
    if term > current_term:            # newer term seen
        become_follower(term)
    grant = (term == current_term)
            && (voted_for is None or voted_for == candidate)
            && candidate_log_is_at_least_as_up_to_date()
    if grant:
        voted_for = candidate
        reset_election_timer()
    reply RequestVoteReply(current_term, grant)
Up-to-date is defined as: last_log_term > my_last_term, or
(last_log_term == my_last_term && last_log_index >= my_last_index).
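The handler and the up-to-date rule can be sketched in Rust as follows. The Node struct is trimmed to the fields the vote needs, and become_follower is assumed to adopt the new term and clear voted_for (the usual Raft behaviour); both are assumptions of this sketch rather than the lab's actual types.

```rust
#[derive(Debug, Default)]
struct Node {
    current_term: u64,
    voted_for: Option<u32>,
    log_terms: Vec<u64>, // term of each log entry; commands omitted for brevity
}

impl Node {
    /// (last_log_index, last_log_term); index counts entries, (0, 0) for an empty log.
    fn last_log(&self) -> (u64, u64) {
        (self.log_terms.len() as u64, self.log_terms.last().copied().unwrap_or(0))
    }

    /// Returns (current_term, vote_granted).
    fn on_request_vote(
        &mut self,
        term: u64,
        candidate: u32,
        last_log_index: u64,
        last_log_term: u64,
    ) -> (u64, bool) {
        if term > self.current_term {
            // ASSUMED become_follower: adopt the term, forget the old vote.
            self.current_term = term;
            self.voted_for = None;
        }
        let (my_index, my_term) = self.last_log();
        let up_to_date = last_log_term > my_term
            || (last_log_term == my_term && last_log_index >= my_index);
        let grant = term == self.current_term
            && (self.voted_for.is_none() || self.voted_for == Some(candidate))
            && up_to_date;
        if grant {
            self.voted_for = Some(candidate);
            // A full node would also reset its election timer here.
        }
        (self.current_term, grant)
    }
}
```

Note that a request with a newer term always raises current_term, even when the vote is then refused because the candidate's log is behind.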
AppendEntries handling
on AppendEntries(term, leader, prev_idx, prev_term, entries, leader_commit):
    if term > current_term: become_follower(term)
    if term < current_term: reply (current_term, false); return
    reset_election_timer()
    if prev_idx > 0 && (log too short OR log[prev_idx-1].term != prev_term):
        reply (current_term, false); return          # consistency mismatch
    # truncate any conflicting suffix, then append
    for (i, e) in enumerate(entries):
        idx = prev_idx + i
        if idx < log.len() && log[idx].term != e.term:
            log.truncate(idx)
        if idx >= log.len():
            log.push(e)
    if leader_commit > commit_index:
        commit_index = min(leader_commit, log.len())
    reply (current_term, true, match_index = prev_idx + len(entries))
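A Rust rendering of the follower side of the handler above. The Follower struct is a trimmed-down stand-in for the lab's node type, timer handling is left out, and returning match_index = 0 on rejection is an assumption of this sketch (the pseudocode leaves it unspecified).

```rust
#[derive(Debug, Clone, PartialEq)]
struct LogEntry { term: u64, command: Vec<u8> }

#[derive(Debug, Default)]
struct Follower { current_term: u64, log: Vec<LogEntry>, commit_index: u64 }

impl Follower {
    /// Returns (current_term, success, match_index).
    /// `prev_idx` counts entries (0 means "nothing precedes"); `log` is 0-indexed.
    fn on_append_entries(
        &mut self,
        term: u64,
        prev_idx: u64,
        prev_term: u64,
        entries: &[LogEntry],
        leader_commit: u64,
    ) -> (u64, bool, u64) {
        if term > self.current_term { self.current_term = term; }
        if term < self.current_term { return (self.current_term, false, 0); }
        // Consistency check: the entry just before the new ones must match.
        if prev_idx > 0 {
            match self.log.get(prev_idx as usize - 1) {
                Some(e) if e.term == prev_term => {}
                _ => return (self.current_term, false, 0),
            }
        }
        // Truncate any conflicting suffix, then append what is missing.
        for (i, e) in entries.iter().enumerate() {
            let idx = prev_idx as usize + i;
            if idx < self.log.len() && self.log[idx].term != e.term {
                self.log.truncate(idx);
            }
            if idx >= self.log.len() {
                self.log.push(e.clone());
            }
        }
        if leader_commit > self.commit_index {
            self.commit_index = leader_commit.min(self.log.len() as u64);
        }
        (self.current_term, true, prev_idx + entries.len() as u64)
    }
}
```

Entries that already match are left alone, so a duplicated or reordered AppendEntries is idempotent: only a term conflict triggers truncation.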
Commit advance (leader only)
advance_commit():
    for N in (commit_index + 1 ..= log.len()).rev():      # highest candidate first
        if log[N-1].term != current_term: continue        # Figure 8 safety
        replicated = 1 + count(p != self : match_index[p] >= N)   # 1 = the leader itself
        if 2 * replicated > nodes:
            commit_index = N; break
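The commit rule can be sketched in Rust as below. The Leader struct is a stand-in that keeps only log terms, and it counts the leader once explicitly while skipping its own match_index entry. That bookkeeping is this sketch's choice for avoiding a double count, not necessarily how the lab's code does it.

```rust
use std::collections::BTreeMap;

struct Leader {
    id: u32,
    nodes: usize,
    current_term: u64,
    log_terms: Vec<u64>, // term of each entry; index N lives at log_terms[N-1]
    commit_index: u64,
    match_index: BTreeMap<u32, u64>,
}

impl Leader {
    fn advance_commit(&mut self) {
        let last = self.log_terms.len() as u64;
        // Try the highest index first; stop at the first one a majority holds.
        for n in ((self.commit_index + 1)..=last).rev() {
            // Figure 8 rule: only count replicas for entries of the current term.
            if self.log_terms[n as usize - 1] != self.current_term {
                continue;
            }
            let acks = self
                .match_index
                .iter()
                .filter(|&(&p, &m)| p != self.id && m >= n)
                .count();
            if 2 * (1 + acks) > self.nodes {
                self.commit_index = n;
                break;
            }
        }
    }
}
```

With nodes == 1 the check reduces to 2 × 1 > 1, so a lone leader commits immediately. That is the single-node case the propose() section discusses.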
Propose (leader only)
propose(cmd):
    log.push(LogEntry{ term: current_term, command: cmd })
    match_index[self] = log.len()
    broadcast_append_entries()
    advance_commit()          # NB: required for n == 1, harmless for n > 1
The advance_commit() call inside propose is the one non-obvious
detail. In a single-node cluster the leader has no peers, so no
AppendEntriesReply will ever arrive to trigger a commit — but a
majority is already satisfied (the leader alone is the majority). All
three implementations call advance_commit() at the end of propose
for byte-identical behaviour.
Simulator loop (per tick t in 0..rounds)
1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader : pick (max term, min id) among Leaders; call propose
3. deliver in-flight : pop heap entries with delivery_time == t
4. tick all nodes : iterate in ascending id; on_tick may fire election or heartbeat
Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Deterministic, evenly spread, and independent
of cluster behaviour.
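The closed-form schedule is small enough to state directly in Rust; the function name proposal_schedule is illustrative.

```rust
/// Ticks at which the K proposals are enqueued:
/// schedule[i] = (i + 1) * rounds / (K + 1), using integer division.
fn proposal_schedule(rounds: u64, k: u64) -> Vec<u64> {
    (0..k).map(|i| (i + 1) * rounds / (k + 1)).collect()
}
```

For rounds = 1000 and K = 5 this gives ticks 166, 333, 500, 666, 833. The multiplication happens before the division, so the truncation error does not accumulate from one proposal to the next.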
Wire format (Rpc)
Four variants; all field widths fixed; little-endian:
RequestVote { term: u64, candidate: u32, last_log_index: u64, last_log_term: u64 }
RequestVoteReply { term: u64, granted: bool (u8) }
AppendEntries { term: u64, leader: u32, prev_idx: u64, prev_term: u64,
entries: [LogEntry], leader_commit: u64 }
AppendEntriesReply{ term: u64, success: bool (u8), match_index: u64 }
The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. Only the canonical dump is serialized, and that is what gets hashed.
Canonical dump format
file := magic[8 = "DSERAFT1"] u32_le(node_count) node*
node := u32_le id
u64_le current_term
i64_le voted_for # -1 if None (two's complement little-endian)
u8 role # Follower=0, Candidate=1, Leader=2
u64_le commit_index
u32_le log_len
entry * log_len
entry := u64_le term
u32_le cmd_len
u8 cmd[cmd_len]
Nodes appear in ascending id order. All multi-byte numbers are
little-endian. The dump is hashed with SHA-256; the lowercase hex
digest is what raftctl prints (no trailing newline).
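The grammar above maps onto a short encoder. This is a sketch under assumptions: the NodeState struct and the canonical_dump signature are illustrative stand-ins, and the caller is expected to pass nodes already sorted by ascending id. The field widths, magic, and role numbering come from the format description.

```rust
#[derive(Clone, Copy)]
enum Role { Follower = 0, Candidate = 1, Leader = 2 }

struct LogEntry { term: u64, command: Vec<u8> }

struct NodeState {
    id: u32,
    current_term: u64,
    voted_for: Option<u32>,
    role: Role,
    commit_index: u64,
    log: Vec<LogEntry>,
}

/// Encode nodes (supplied in ascending id order) into the canonical dump.
fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSERAFT1");
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.current_term.to_le_bytes());
        // None is written as i64 -1, i.e. eight 0xFF bytes.
        let voted: i64 = n.voted_for.map_or(-1, i64::from);
        out.extend_from_slice(&voted.to_le_bytes());
        out.push(n.role as u8);
        out.extend_from_slice(&n.commit_index.to_le_bytes());
        out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
        for e in &n.log {
            out.extend_from_slice(&e.term.to_le_bytes());
            out.extend_from_slice(&(e.command.len() as u32).to_le_bytes());
            out.extend_from_slice(&e.command);
        }
    }
    out
}
```

A single leader holding five 3-byte commands encodes to 12 header bytes, 33 bytes of fixed node fields, and 5 × 15 bytes of entries, 120 bytes in total.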
Cross-language invariants
| Invariant | Why it matters |
|---|---|
| splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output |
| election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
| delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
| heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
| peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++) | identical broadcast order |
| leader-pick for proposal injection: (max term, min id) | identical client routing |
| proposal schedule: (i+1) * rounds / (K+1) integer division | identical pending queue contents |
| propose() calls advance_commit() | identical commit_index for n=1 |
| voted_for = None encoded as i64 LE -1 | identical dump bytes |
| Role enum order Follower=0, Candidate=1, Leader=2 | identical dump bytes |
If any one of these drifts, scripts/cross_test.sh will fail, and running cmp on the two raw dumps will print the byte offset of the first
divergence (cmp -l lists every differing byte instead).
Files
- src/rust/: raft17 crate + raftctl binary.
- src/go/: module github.com/10xdev/dse/db17 + cmd/raftctl.
- src/cpp/: db17_lib static library + raftctl binary + test_db17.
- scripts/verify.sh: runs the unit tests for all three.
- scripts/cross_test.sh: proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.
See docs/ for the long-form write-up and steps/ for the staged
implementation path.
db-17 — References
Primary sources
- Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014. The Raft paper. Figure 2 is the spec this lab implements; Figure 8 is the motivation for the "commit only entries of the current term" rule. https://raft.github.io/raft.pdf
- Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD dissertation, 2014. The book-length treatment. Chapter 3 covers what's in this lab; chapters 4 and 5 cover membership changes and log compaction, snapshots included (deferred to db-21 / db-23). https://github.com/ongardie/dissertation
- raft-tla — the TLA+ specification of the algorithm, also by Ongaro. Useful when you want a second, machine-checked statement of the same rules implemented here. https://github.com/ongardie/raft.tla
Implementations to read alongside
- etcd/raft (Go) — the most-studied production Raft. Same Figure 2 structure; adds pre-vote, leader leases, learner replicas, ReadIndex, joint consensus. https://github.com/etcd-io/raft
- hashicorp/raft (Go) — Consul's engine. Easier to read than etcd's because it carries less production scar tissue. https://github.com/hashicorp/raft
- tikv/raft-rs (Rust) — TiKV's port of etcd's algorithm. Useful as a counterpoint to this lab's stdlib-only Rust version. https://github.com/tikv/raft-rs
Determinism and simulation
- db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here.
- Hermitian (CockroachDB) and Antithesis are commercial deterministic simulators for distributed databases; the spirit is the same as cross_test.sh.
Background reading worth doing
- Heidi Howard et al., Flexible Paxos: Quorum intersection revisited, OPODIS 2016. Helps see Raft as a specialization of Paxos with a fixed quorum intersection rule.
- Lamport's Paxos Made Simple — for the db-18 transition.
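- André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice (Aalto University, 2012), for the db-19 transition. The original protocol paper is Junqueira, Reed, and Serafini, Zab: High-performance broadcast for primary-backup systems (DSN 2011).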
Cross-lab dependencies
- Upstream: db-16 distributed-fundamentals (Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale).
- Downstream:
- db-18 Paxos — reuses the heap-and-tick simulator; different RPC structure; weaker leader assumption.
- db-19 ZAB — leader-based atomic broadcast; same election + log-replication skeleton.
- db-20 Distributed KV — wraps a chosen consensus engine (probably this one) around a key-value state machine.
- db-23 Capstone — joint membership changes and snapshots get added on top of this code.
db-17 — Analysis
Required invariants
- Election safety. At most one leader per term. Enforced by majority voting: a candidate only becomes leader after collecting votes from a strict majority, and each voter only grants one vote per term (the voted_for field, persisted in the canonical dump).
- Leader append-only. A leader never overwrites or deletes entries from its own log; it only appends. Followers may truncate on an AppendEntries consistency mismatch, but the leader's local log only grows.
- Log matching property. If two logs contain an entry with the same (index, term), then the logs are identical in all entries up through that index. Enforced by the prev_log_index / prev_log_term check in AppendEntries and the truncate-on-conflict rule.
- Leader completeness. If an entry is committed in term T, that entry is present in the log of every leader for all later terms. Enforced by the election restriction (a vote is only granted if the candidate's log is at least as up-to-date as the voter's).
- State machine safety. If a node has applied an entry at index i, no other node will ever apply a different entry at i. This follows from log matching + leader completeness + the commit-only-current-term rule.
- Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes, hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.
Design decisions
- propose() calls advance_commit() at the end. The non-obvious one. In a single-node cluster the "leader" has no peers, so no AppendEntriesReply will ever arrive to drive advance_commit(). But a single-node cluster is its own majority, so the entry should commit the moment it is appended. Without this call, scenario D (--nodes 1) ends with commit_index = 0 despite five proposals in the log. Calling advance_commit() is harmless for n > 1 (the loop's majority check rejects until replies actually arrive).
- Sorted iteration on every wire-affecting loop. Rust uses BTreeMap<u32, u64> for next_index / match_index; C++ uses std::map; Go uses explicit for p := uint32(0); p < n; p++ loops. HashMap would compile and pass single-language tests but fail cross_test.sh immediately. db-16's analysis.md called this out; db-17 enforces it across more code surface.
- In-flight heap ordered by (delivery_time, sender, seq). seq is a global monotonic counter incremented every time a message is enqueued. It exists only to break ties when two messages with the same (delivery_time, sender) would otherwise be ambiguously ordered. Without seq you would see byte diffs on dense traffic at the same delivery tick.
- Leader-pick for proposal injection is (max term, min id) among role == Leader nodes. During leadership churn there may be no leader, or there may be multiple stale leaders that have not yet stepped down. The (max term, min id) rule produces a deterministic routing decision no matter which language's iteration order you start from.
- Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). This places K proposals evenly through the rounds window, independent of cluster behaviour. A schedule derived from cluster state ("propose whenever there's a leader") would couple proposal timing to incidental scheduling choices and produce noisy byte diffs.
- Splitmix64 constants are explicit. 0x9E3779B97F4A7C15 (γ / golden-ratio fractional, the seeder constant), 0xBF58476D1CE4E7B5 and 0x94D049BB133111EB (Vigna's two mixer constants). All three implementations copy them as literals; nobody computes them.
- Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls. Downstream labs (db-18 Paxos, db-20 distributed-kv) will link the library, not shell out.
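The (max term, min id) leader-pick rule from the list above can be written so that the input order does not matter. The tuple representation (id, role, current_term) and the name pick_leader are assumptions made for this sketch.

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Role { Follower, Candidate, Leader }

/// Given (id, role, current_term) per node, return the id of the node
/// that should receive the next proposal, or None if nobody is Leader.
fn pick_leader(nodes: &[(u32, Role, u64)]) -> Option<u32> {
    nodes
        .iter()
        .filter(|(_, role, _)| *role == Role::Leader)
        // Smallest (Reverse(term), id): highest term first, then lowest id.
        .min_by_key(|(id, _, term)| (Reverse(*term), *id))
        .map(|(id, _, _)| *id)
}
```

Because the key is a total order over distinct ids, the same node wins whether the slice is walked in ascending, descending, or hash order. That is what keeps proposal routing identical across the three languages during leadership churn.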
Tradeoffs worth flagging
- No snapshots, no log compaction. Logs grow without bound across the run. For --rounds 2000 --proposals 20 you end up with at most ~20 entries per node; the canonical dump stays small. For production Raft you would add an InstallSnapshot RPC and a last_included_index / last_included_term; deferred to db-21 (storage-engine-advanced).
- No pre-vote, no leader lease. A network-partitioned candidate will repeatedly increment its term, then on heal will force the legitimate leader to step down. Mitigated by tight election timeouts in this simulator but a real cluster needs the pre-vote optimization (Ongaro thesis §9.6).
- No membership changes. The node count is fixed at Cluster::new time. Joint consensus (and the safer learner-then-promote alternative) is a major chapter on its own; deferred to db-23 capstone.
- Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real Raft must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.
- No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated or lost, and is never re-proposed. The simulator's pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.
Why three languages
Same reasoning as db-16, plus one new lesson: Raft has many places
where "iterate over peers" appears. Each one is a chance for a byte
diff. C++'s std::map and Rust's BTreeMap are ordered by default;
Go's map is explicitly randomized at iteration time. The Go
implementation has explicit for p := uint32(0); p < n; p++ loops
everywhere a peer iteration appears. Discovering this discipline by
forcing the cross-language test to pass is more durable than reading
"don't use HashMap" in a style guide.
db-17 — Execution
One-shot: prove the lab works
cd db-17-raft
./scripts/verify.sh # all unit tests in Rust, Go, C++
./scripts/cross_test.sh # byte-identical sha256 across all three, six scenarios
A green run of cross_test.sh ends with the literal line:
=== ALL OK ===
Per-language workflows
Rust
cd src/rust
cargo test --release # ~10 tests
cargo build --release # produces target/release/raftctl
./target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
Go
cd src/go
go test ./... # ~12 tests
go build -o /tmp/raftctl_go ./cmd/raftctl
/tmp/raftctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5
C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build # test_db17 → "db-17 C++ tests: 10 passed"
./build/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
CLI
All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:
| flag | default | meaning |
|---|---|---|
| --seed N | 0 | splitmix64 seed mixed into election timers and message delays |
| --nodes K | 3 | number of Raft nodes (1 is legal; majority is then 1) |
| --rounds R | 1000 | number of simulator ticks to run |
| --proposals P | 0 | number of client commands to inject during the run |
| --partition s,d,... | none | comma-separated pairs (src, dst) to drop in that direction |
--partition 0,1,1,0 drops both directions between nodes 0 and 1
(complete split); --partition 0,1 drops only 0 → 1 (asymmetric).
Proposals are spaced as schedule[i] = (i+1) * rounds / (K+1); with
--rounds 1000 --proposals 5 they fire at ticks 166, 333, 500, 666,
833.
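One way to parse the --partition value into a set of directed drops, shown as a sketch. The function names and the choice of a BTreeSet are assumptions made for this example; rejecting an odd number of ids is this sketch's own policy, since the flag's error handling is not specified here.

```rust
use std::collections::BTreeSet;

/// Parse "s,d,s,d,..." into directed (src, dst) pairs whose messages are dropped.
fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    if arg.is_empty() {
        return Ok(BTreeSet::new());
    }
    let ids: Vec<u32> = arg
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {tok:?}: {e}"))
        })
        .collect::<Result<_, _>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("expected an even number of ids, got {}", ids.len()));
    }
    Ok(ids.chunks(2).map(|pair| (pair[0], pair[1])).collect())
}

/// True if a message from `src` to `dst` must be dropped.
fn is_dropped(drops: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    drops.contains(&(src, dst))
}
```

Parsing 0,1 drops only 0 → 1, while 0,1,0,2,1,0,2,0 (scenario E) cuts node 0 off in both directions and leaves the 1 ↔ 2 link untouched.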
Canonical scenarios
scripts/cross_test.sh runs six scenarios; their sha256s are listed in
docs/observation.md. If any change, cross_test.sh will diff the
raw dumps and exit non-zero.
| label | args |
|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 |
D exercises the single-node-leader code path that motivated the
propose() → advance_commit() call. E isolates node 0 completely; the
other two must elect a leader and commit the remaining proposals. F is
an asymmetric partition (only 0 → 1 is dropped) that causes term churn
while still letting replication recover.
Sanity checks
# magic bytes of the canonical dump (use the lib directly; the CLI hashes it)
cat <<'EOF' | cargo run --quiet --example dump_magic
EOF
# or just trust the test: TestCanonicalDumpMagic in raft_test.go
# or for C++: test_db17 prints "canonical dump magic OK" among its asserts
# pick any scenario and round-trip:
./src/rust/target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect: a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9
db-17 — Observation
What the cross-language test produces and how to read it by hand.
Expected sha256s
scripts/cross_test.sh runs six scenarios and asserts the three
binaries (Rust, Go, C++) all print the same hex digest. The current
canonical hashes are:
| label | args | sha256 |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | b6dc06aee72e595f51bd5045ea7c92ffcbe7f6fda3198985f9ded1eca2671c4b |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | f9db9ea7e6c1ca2b3a911b42b2431e964a4ee7c5e40e27efd29b41e747958838 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | ce8b8e05d6ad0b4a243753a934b2f052c2363e97beca0c175586677d1a489408 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | b1689eb48b209187b7cd82a24b1a6a2d19b0be4b481ac1a5b4f1ac9e23a6ae05 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | fcc70ecabe37509133bb27155f5bd7d74981c3f98e79719e2b47077acca6a31f |
If any of these change, cross_test.sh will fail; either you have a
bug, or you have intentionally changed the spec (timer constants,
schedule formula, dump layout) and you must update this table in the
same commit.
What the canonical dump looks like (scenario D — single node)
--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into a
single-node cluster — leader is itself the majority, so every proposal
commits immediately.
offset 0x00 : 44 53 45 52 41 46 54 31 "DSERAFT1" magic
offset 0x08 : 01 00 00 00 1 node_count
offset 0x0c : 00 00 00 00 0 node id
offset 0x10 : ?? ?? ?? ?? ?? ?? ?? ?? current_term (~1, the first self-election)
offset 0x18 : 00 00 00 00 00 00 00 00 voted_for = 0 (voted for self in term 1)
offset 0x20 : 02 role = Leader (2)
offset 0x21 : 05 00 00 00 00 00 00 00 commit_index = 5
offset 0x29 : 05 00 00 00 log_len = 5
offset 0x2d : XX XX XX XX XX XX XX XX log[0].term (== current_term)
offset 0x35 : 03 00 00 00 log[0].cmd_len (3 bytes: "p00")
offset 0x39 : 70 30 30 "p00" payload
...
Each entry is 8 + 4 + 3 = 15 bytes (term + cmd_len + "pNN"), and the
first entry starts at offset 0x2d = 45. Total dump for D is therefore
45 + 5 * 15 = 120 bytes (0x78). The size is fixed because every field is
fixed-width; only the term values vary, depending on how many election
cycles --seed 1 produces before the first self-vote.
A multi-node dump (scenario C — quiet cluster)
--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the
cluster elects a leader, sends heartbeats, and that is it. Every
node's log is empty:
44 53 45 52 41 46 54 31 magic
03 00 00 00 node_count = 3
00 00 00 00 node id 0
XX XX XX XX XX XX XX XX current_term (1 if 0 elected itself, otherwise higher)
XX XX XX XX XX XX XX XX voted_for (0 for the leader, otherwise the leader id)
XX role (Leader or Follower; never Candidate at quiescence)
00 00 00 00 00 00 00 00 commit_index = 0
00 00 00 00 log_len = 0
01 00 00 00 node id 1
... same shape ...
02 00 00 00 node id 2
... same shape ...
Total dump: 8 + 4 + 3 * (4 + 8 + 8 + 1 + 8 + 4) = 111 bytes.
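The byte counts above can be checked with a small encoder. This is a sketch that follows the field order described in this section; the NodeDump struct and its field values are illustrative assumptions, not the lab's real types.

```rust
/// One node's fields as they appear in the canonical dump (hypothetical struct).
struct NodeDump {
    id: u32,
    current_term: u64,
    voted_for: Option<u32>,
    role: u8, // Follower=0, Candidate=1, Leader=2
    commit_index: u64,
    log: Vec<(u64, Vec<u8>)>, // (term, command bytes)
}

/// Serialize nodes (assumed already in ascending id order), all little-endian.
fn canonical_dump(nodes: &[NodeDump]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSERAFT1");
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.current_term.to_le_bytes());
        // None is the sentinel -1 as i64 LE; Some(id) is id widened to i64.
        let voted: i64 = n.voted_for.map_or(-1, |id| id as i64);
        out.extend_from_slice(&voted.to_le_bytes());
        out.push(n.role);
        out.extend_from_slice(&n.commit_index.to_le_bytes());
        out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
        for (term, cmd) in &n.log {
            out.extend_from_slice(&term.to_le_bytes());
            out.extend_from_slice(&(cmd.len() as u32).to_le_bytes());
            out.extend_from_slice(cmd);
        }
    }
    out
}

fn main() {
    // Quiet 3-node cluster shaped like scenario C (field values are made up).
    let quiet: Vec<NodeDump> = (0..3)
        .map(|id| NodeDump {
            id,
            current_term: 1,
            voted_for: Some(0),
            role: if id == 0 { 2 } else { 0 },
            commit_index: 0,
            log: vec![],
        })
        .collect();
    println!("quiet dump: {} bytes", canonical_dump(&quiet).len());
}
```

With empty logs each node block is 4 + 8 + 8 + 1 + 8 + 4 = 33 bytes, so only the 12-byte header and the node count determine the size of a quiet dump.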
How to debug a divergence
If cross_test.sh fails, the script captures the raw dump from each
language into /tmp/raft_<label>_<lang>.bin and prints which two
languages diverged. Then:
cmp -l /tmp/raft_A_rust.bin /tmp/raft_A_go.bin | head
xxd /tmp/raft_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/raft_A_go.bin | sed -n '<line>,+2p'
The first divergence offset tells you what to look at:
| offset range | likely culprit |
|---|---|
| 0x00–0x07 | magic (typo: DSERAFT1 not DESRAFT1) |
| 0x08–0x0b | node_count (impossible if all three accept --nodes correctly) |
| inside a node block, on current_term | election timer or heap-order bug |
| inside a node block, on voted_for | None encoding (must be i64 LE -1) |
| inside a node block, on role | enum mapping (Follower=0, Candidate=1, Leader=2) |
| inside a node block, on commit_index | propose() not calling advance_commit(), or quorum count wrong |
| inside a log entry | AppendEntries truncate-on-conflict bug, or peer iteration order |
In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
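If cmp is not at hand, the first step of that runbook is easy to reproduce in code. The sketch below is an assumption-laden helper (first_divergence and header_field are hypothetical names) that finds the first differing offset and classifies it against the fixed 12-byte header only.

```rust
/// Offset of the first differing byte, or the shorter length if one dump
/// is a strict prefix of the other. None means the dumps are identical.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Classify an offset against the fixed header; node blocks are variable
/// length, so anything past 0x0b needs a real parse to attribute.
fn header_field(offset: usize) -> &'static str {
    match offset {
        0x00..=0x07 => "magic",
        0x08..=0x0b => "node_count",
        _ => "inside a node block",
    }
}

fn main() {
    let good = b"DSERAFT1\x03\x00\x00\x00";
    let typo = b"DESRAFT1\x03\x00\x00\x00";
    if let Some(off) = first_divergence(good, typo) {
        println!("first divergence at {:#04x}: {}", off, header_field(off));
    }
}
```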
Tick-level scope (Rust REPL trick)
To watch a scenario from the inside, add this temporary print at the
top of the simulator loop body in Cluster::run (it reads the tick variable t):
if std::env::var("RAFT_TRACE").is_ok() {
    eprintln!(
        "t={} leader={:?} terms={:?}",
        t,
        self.nodes.iter().find(|n| n.role == Role::Leader).map(|n| n.id),
        self.nodes.iter().map(|n| n.current_term).collect::<Vec<_>>()
    );
}
then run RAFT_TRACE=1 raftctl --seed 42 --nodes 3 --rounds 1000 ... 2>&1 | head -50. The trace goes to stderr, so it is not part of the canonical dump and does not
affect the sha256. Remove before commit.
db-17 — Verification
How to reproduce the green status on a clean machine.
Prerequisites
- macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
- cmake ≥ 3.20.
- Rust toolchain ≥ 1.74.
- Go ≥ 1.22.
- shasum, cmp, awk (default on macOS; coreutils on Linux).
One command
cd db-17-raft
scripts/verify.sh # builds + unit tests in all three langs
scripts/cross_test.sh # cross-language sha256 match across six scenarios
Both should print === OK === / === ALL OK === and exit 0.
Per-language drill-down
Rust
cd db-17-raft/src/rust
cargo test --release --quiet
cargo build --release
Expected: ~10 tests pass. The raftctl binary lands at
target/release/raftctl. The release profile uses LTO.
Go
cd db-17-raft/src/go
go test ./...
go build -o /tmp/raftctl_go ./cmd/raftctl
Expected: ok github.com/10xdev/dse/db17 <duration> and a working
binary. Tests include TestSha256HexKnownVectors (validates the
SHA-256 wrapper against published vectors) and
TestVotedForNegativeEncoding (validates the -1 sentinel byte
layout).
C++
cd db-17-raft/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Expected: 100% tests passed, 0 tests failed out of 1 and test_db17
prints "db-17 C++ tests: 10 passed". The test source #undef NDEBUG
before <cassert> so assert() fires in Release builds too.
What "green" means
A green run guarantees:
- Per-language unit tests pass. Each implementation independently exercises splitmix64, election-timer reset, RequestVote granting, AppendEntries truncation, single-node commit, multi-node commit, canonical dump magic + node_count + log_len framing, and the SHA-256 implementation against a known test vector.
- All six scenarios produce byte-identical canonical dumps across Rust, Go, and C++. cross_test.sh actually compares the raw dump bytes (cmp -s) before comparing the sha256 hex, so a divergence is caught with an exact byte offset rather than just "the hashes don't match".
- The sha256s match the table in docs/observation.md. If you change the algorithm or the dump format, both the dumps and the table must change in the same commit. The mismatch between code and docs is itself a verification failure.
What "green" does NOT guarantee
- No production safety. There is no fsync; in-memory state is considered durable by construction.
- No coverage of snapshot / membership / pre-vote / lease code. Those features are deferred to db-21, db-23, and possibly never (this lab is a study lab, not a production engine).
- No client-facing API. Proposals are injected into the simulator via a fixed schedule; there is no Propose RPC for an external client.
- No performance characterization. The lab is sized to run in under a second per scenario; multi-thousand-round runs work but are not the goal.
Invariant assertions in code
The C++ test file in particular makes the invariants concrete:
| assertion | invariant |
|---|---|
| dump.size() >= 12 and starts with "DSERAFT1" | dump-format magic |
| read_u32_le(dump, 8) == nodes | node_count framing |
| cluster.run(...) does not panic for any tested (seed, nodes, rounds, P) | no out-of-bounds / no UB |
| sha256(empty) == e3b0c44298... | SHA-256 padding boundary case |
| n.commit_index <= n.log.len() for every node after run | no over-commit |
| propose on a single-node leader yields commit_index == proposals | majority-of-one rule |
The Rust and Go tests assert the same set in their respective testing idioms.
db-17 — Broader Ideas
Where Raft and the choices in this lab show up in real systems, and what to build on top of the same simulator harness in the rest of the distributed track.
Immediate next labs
- db-18: Paxos. Same heap-and-tick harness, different RPC structure (Prepare / Promise / Accept / Accepted), no fixed leader. Paxos's invariants are notoriously hard to reason about by hand; byte-deterministic replay of a counterexample seed is the difference between "I think it's correct" and "I have evidence". Raft was designed as the more understandable alternative, so implementing both in this order is the recommended path.
- db-19: ZAB. ZooKeeper's atomic broadcast protocol. Similar leader-based skeleton to Raft, but the recovery phase is more involved (NEWLEADER / NEWEPOCH / SYNC / BROADCAST). The Lamport scalar of db-16 generalizes to the (epoch, counter) pair that ZAB calls a "zxid".
- db-20: Distributed KV. Wrap a quorum-replicated key-value store around a chosen consensus engine. The state machine is the only thing that changes: the consensus log feeds Put(k, v) / Delete(k) commands that get applied in commit_index order.
- db-23: Capstone. Adds snapshots, joint-consensus membership changes, and a multi-Raft "shards across regions" deployment on top of this code.
How this lab's pieces map to real systems
- The Raft skeleton implemented here is the same core loop that Raft-based systems such as etcd, Consul, TiKV, and CockroachDB run. They each add the extensions deferred from this lab (pre-vote, snapshots, learners, joint consensus), but the core RequestVote / AppendEntries loop is unchanged.
- The (delivery_time, sender, seq) heap tie-break is the kind of deterministic-ordering trick FoundationDB's simulator relies on to drive every commit-proxy / storage-server interaction; TigerBeetle, Antithesis, and Hermitian all reach for similar techniques independently.
- The "leader picks max-term, min-id" convention surfaces as the split-brain resolution rule in production systems: when a partition heals and you see two leaders, the one with the higher term wins unconditionally (the id break is academic: different terms imply different elections).
- The voted_for = None encoded as -1 is a common sentinel convention for fixed-width formats. Some implementations reserve a different sentinel value or encode the field as an optional / nullable type in a richer wire format (protobuf has optional), but in any fixed-width binary log a sentinel value is the right answer.
Performance experiments worth running later
- Crank --rounds to 100k and watch the dump size grow. The dump is linear in committed entries; if you ever see super-linear growth something is appending entries that don't get committed (a sign of partition oscillation).
- Replace splitmix64 with a per-node rand_chacha::ChaCha20Rng. The simulator will still be deterministic (RNGs are seeded), but cross-language byte equivalence will break unless you also port the ChaCha core identically. Useful exercise in what exactly portability requires.
- Try injecting one heavy proposal vs. many small proposals into a 3-node cluster and measure the cluster dump size vs. the bytes actually committed. The difference is the steady-state replication overhead.
- Vary the election timeout. The 150 + jitter(0..150) ticks chosen here keep churn low; halve it and you'll see term numbers climb rapidly under any partition, especially scenario F.
What "production-quality" would require beyond this lab
- A real disk-backed persistence layer with fsync semantics and crash recovery. The canonical dump pretends current_term, voted_for, and log are durable on every change; a real Raft must fsync before sending any reply that depends on the new state, or risk violating election safety on a power cut.
- Network I/O. The simulator hands typed structs across an in-process heap; production uses gRPC or a custom framing protocol with at-least-once delivery and connection-level back-pressure.
- Pre-vote and leader leases. Without them, a partitioned candidate bumps its term repeatedly; on heal the legitimate leader steps down unnecessarily. Easy to add as a wrapper on RequestVote; deferred here because it would obscure the core algorithm.
- Snapshots and log compaction. Without them, the log grows forever and a slow follower can't catch up over the wire. The canonical dump tolerates this only because the lab's rounds is bounded.
- Membership changes. The fixed nodes count at Cluster::new time is fine for a lab but useless in production. Joint consensus or the safer learner-then-promote protocol are major additions; covered in db-23.
- Observability. A real Raft cluster exposes per-node term, commit_index, match_index[*], leader_id, election counts, and message rates as metrics. The canonical dump is a post-mortem view; runtime observability is a separate problem.
db-17 step 01 — Leader election
Goal
A cluster of n followers (n = nodes), started cold, must elect exactly one
leader in a bounded number of ticks, and that leader must remain
stable as long as it can deliver heartbeats. The election protocol
must be byte-deterministic across Rust, Go, and C++.
Tasks
- Persistent state. Each RaftNode carries current_term: u64, voted_for: Option<u32>, and log: Vec<LogEntry>. The dump encodes voted_for = None as the signed integer -1 (i64 LE); Some(id) becomes id as i64.
- Election timer. reset_election_timer(t) sets election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150. Heartbeat-due is t + 50.
- on_tick(t). Followers and candidates that hit election_deadline start a new election: bump current_term, vote for self, broadcast RequestVote to all peers, transition to Candidate. Leaders that hit heartbeat_due broadcast an empty AppendEntries (heartbeat).
- RequestVote handling. Grant a vote iff (a) term == current_term, (b) voted_for is None or equal to the candidate, and (c) the candidate's log is at least as up-to-date as ours (the standard last_log_term / last_log_index lex compare). Grant resets the election timer.
- RequestVoteReply handling. A candidate that collects a majority of granted replies in the same term transitions to Leader, initializes next_index[p] = log.len() and match_index[p] = 0 for every peer p, and immediately broadcasts AppendEntries (initial heartbeat).
- become_follower(term). Used whenever a node sees term > current_term (in any RPC). Sets current_term = term, clears voted_for, resets the election timer, transitions to Follower.
Acceptance
Inline unit tests in each language:
- splitmix64_known_vectors: splitmix64(0) == 0xE220A8397B1DCDAF (the value Vigna's reference C produces).
- election_timer_in_range: 1000 consecutive resets all land in [t+150, t+300).
- request_vote_grants_first_only: vote for candidate A, then a RequestVote from B in the same term is denied.
- become_leader_from_majority: 3-node cluster, two RequestVoteReply with granted=true transition the candidate to Leader.
- term_bump_demotes_leader: a Leader receiving any RPC with term > current_term becomes Follower and clears voted_for.
All five green in Rust, Go, and C++.
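The grant rule behind request_vote_grants_first_only can be sketched as follows. The Voter struct is a hypothetical stand-in for RaftNode; it folds the higher-term become_follower step into the handler, omits the election-timer reset, and assumes an empty log is represented as (last_log_term, last_log_index) = (0, 0).

```rust
#[derive(Debug, Default)]
struct Voter {
    current_term: u64,
    voted_for: Option<u32>,
    last_log_term: u64,
    last_log_index: u64,
}

impl Voter {
    /// Returns true if the vote is granted.
    fn on_request_vote(&mut self, term: u64, candidate: u32,
                       cand_last_term: u64, cand_last_index: u64) -> bool {
        // A higher term always demotes us and clears the old vote.
        if term > self.current_term {
            self.current_term = term;
            self.voted_for = None;
        }
        let same_term = term == self.current_term;
        let free = self.voted_for.map_or(true, |v| v == candidate);
        // Tuples compare lexicographically: term first, then index.
        let up_to_date =
            (cand_last_term, cand_last_index) >= (self.last_log_term, self.last_log_index);
        if same_term && free && up_to_date {
            self.voted_for = Some(candidate);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut v = Voter::default();
    println!("A in term 1: {}", v.on_request_vote(1, 0, 0, 0));
    println!("B in term 1: {}", v.on_request_vote(1, 1, 0, 0));
    println!("B in term 2: {}", v.on_request_vote(2, 1, 0, 0));
}
```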
Discussion prompts
- Why is voted_for persistent (in the canonical dump) but commit_index volatile (also dumped, but only because the dump is a debug oracle, not a recovery file)?
- What goes wrong if you reset the election timer on send of RequestVote instead of on grant of someone else's vote? (Hint: split-vote loops.)
- Why must "majority" be computed against nodes, not against nodes that have replied?
db-17 step 02 — Log replication
Goal
A leader accepts client proposals, replicates them to followers via
AppendEntries, and advances commit_index once a majority's
match_index covers the entry and the entry is from the leader's
current term. Followers truncate any conflicting suffix and append the
leader's entries. The result must be byte-deterministic across all
three languages.
Tasks
- LogEntry. { term: u64, command: Vec<u8> }. Logs are 0-indexed in this implementation; Figure 2 of Ongaro's paper uses 1-indexed logs, so adjust mentally when reading the paper.
- propose(cmd). Leader-only:
  - push LogEntry { term: current_term, command: cmd } onto own log,
  - set match_index[self] = log.len(),
  - broadcast AppendEntries to all peers,
  - call advance_commit() (so n=1 commits immediately).
- broadcast_append_entries(). For each peer in ascending id order, send AppendEntries { term, leader, prev_idx, prev_term, entries: log[next_index[p]..], leader_commit }, where prev_idx = next_index[p] and prev_term = log[prev_idx-1].term (or 0 if prev_idx == 0).
- AppendEntries handling on follower.
  - if term > current_term: become_follower(term);
  - if term < current_term: reply success=false;
  - reset election timer (we heard from a leader);
  - if prev_idx > 0 && (log too short || log[prev_idx-1].term != prev_term): reply success=false, match_index=0;
  - else: walk each incoming entry; truncate own log at the first (index, term) conflict; append remaining entries; advance commit_index = min(leader_commit, log.len()); reply success=true, match_index=prev_idx+entries.len().
- AppendEntriesReply handling on leader.
  - if term > current_term: become_follower(term);
  - if success: next_index[from] = reply.match_index; match_index[from] = reply.match_index; advance_commit(). (match_index is a count of matched entries, so with 0-indexed logs it is also the index of the next entry to send.)
  - if !success and term == current_term: decrement next_index[from] (clamped at 0); next heartbeat / propose will retry with an earlier prev_idx.
- advance_commit(). For N from log.len() down to commit_index + 1:
  - if log[N-1].term != current_term: continue (Figure 8 safety);
  - if 1 + count(p : match_index[p] >= N) > nodes / 2: set commit_index = N and break.
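The advance_commit() rule above can be sketched in isolation. This version assumes match_index is tracked for peers only, so the leading 1 stands for the leader itself; the Leader struct and its field names are hypothetical, and entries are reduced to their terms.

```rust
struct Leader {
    current_term: u64,
    commit_index: usize,    // number of committed entries
    log_terms: Vec<u64>,    // term of each log entry, 0-indexed
    peer_match: Vec<usize>, // match_index of every peer except self
}

impl Leader {
    fn advance_commit(&mut self) {
        let nodes = self.peer_match.len() + 1;
        // Scan from the newest entry down to the first uncommitted one.
        for n in (self.commit_index + 1..=self.log_terms.len()).rev() {
            // Figure 8 safety: only entries of the current term commit by counting.
            if self.log_terms[n - 1] != self.current_term {
                continue;
            }
            let acks = 1 + self.peer_match.iter().filter(|&&m| m >= n).count();
            if acks > nodes / 2 {
                self.commit_index = n;
                break;
            }
        }
    }
}

fn main() {
    // 3-node cluster: one follower has acked the single current-term entry.
    let mut l = Leader { current_term: 5, commit_index: 0, log_terms: vec![5], peer_match: vec![1, 0] };
    l.advance_commit();
    println!("commit_index = {}", l.commit_index);
}
```

Once a current-term entry commits, every earlier entry commits with it, because commit_index is a prefix length.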
Acceptance
Inline unit tests in each language:
- propose_single_node_commits: --nodes 1, propose 3 entries, every entry's term is the leader term, commit_index == 3.
- append_entries_rejects_term_mismatch: a follower with an empty log receives AE with prev_idx=5, prev_term=1 and returns success=false.
- append_entries_truncates_conflict: follower with log=[(t=1), (t=1), (t=2)] receives AE with prev_idx=2, prev_term=1, entries=[(t=3)]; resulting log is [(t=1), (t=1), (t=3)].
- commit_requires_current_term: leader at term=5 replicates an old term=3 entry to all followers; commit_index does NOT advance past it until the leader appends a term=5 entry that also reaches majority.
- quorum_commit_three_nodes: 3-node cluster, leader proposes one entry, one follower acks; commit_index advances (2 of 3 is a majority including the leader).
All five green in Rust, Go, and C++.
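The follower-side consistency check and truncation exercised by these tests can be sketched on a log of bare terms. The function below is illustrative only: it drops the command payloads, the term comparison against current_term, the timer reset, and the commit_index update.

```rust
/// Apply an AppendEntries payload to a follower log of entry terms.
/// prev_idx is the number of entries preceding the payload (0-indexed logs).
/// Returns (success, match_index).
fn append_entries(log: &mut Vec<u64>, prev_idx: usize, prev_term: u64, entries: &[u64]) -> (bool, usize) {
    // Consistency check: the entry just before the payload must match.
    if prev_idx > 0 && (log.len() < prev_idx || log[prev_idx - 1] != prev_term) {
        return (false, 0);
    }
    for (k, &term) in entries.iter().enumerate() {
        let idx = prev_idx + k;
        if idx < log.len() {
            if log[idx] == term {
                continue; // already have this entry
            }
            log.truncate(idx); // first conflict: drop it and everything after
        }
        log.push(term);
    }
    (true, prev_idx + entries.len())
}

fn main() {
    let mut log = vec![1, 1, 2];
    let reply = append_entries(&mut log, 2, 1, &[3]);
    println!("reply={:?} log={:?}", reply, log);
}
```

Truncating only at a real conflict, rather than unconditionally at prev_idx, is what keeps a delayed or duplicated AppendEntries from shortening a log that is already ahead.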
Discussion prompts
- The Figure 8 commit restriction ("commit only entries of the current term") is famously subtle. Construct a 3-node scenario where omitting it lets a leader commit an entry that a future leader's election can erase.
- Why does the leader update match_index[self] after propose? (Otherwise the majority check would always exclude the leader.)
- What happens if two leaders coexist briefly (network partition that has not yet healed)? Specifically: which leader can advance commit_index, and why is this safe?
db-17 step 03 — Cross-test and partition
Goal
A Cluster that drives an n-node simulation forward by integer
ticks, a --partition CLI flag that drops messages in named
directions, and a cross-language scripts/cross_test.sh proving the
canonical dump's sha256 is byte-identical across Rust, Go, and C++ for
six seeded scenarios including partitions.
Tasks
- Cluster::new(seed, nodes). Holds:
  - nodes: Vec<RaftNode> (ids 0..nodes);
  - drop: BTreeSet<(u32, u32)> (directional message-drop set);
  - heap: BinaryHeap<InFlight> ordered by (delivery_time, sender, seq); InFlight implements Ord such that BinaryHeap behaves as a min-heap;
  - seq: u64 (global monotonic);
  - pending_proposals: VecDeque<Vec<u8>>.
- Cluster::run(rounds, n_proposals). For each tick t in 0..rounds:
  - Enqueue scheduled proposals. schedule[i] = (i+1) * rounds / (n_proposals + 1); if t == schedule[i], push payload "p<i:02>" onto pending_proposals.
  - Inject pending into current leader. Find leader as the (max current_term, min id) node with role == Leader; while pending_proposals is non-empty and a leader exists, drain one payload and call leader.propose(payload). The propose pushes RPCs onto the heap with delivery times computed from splitmix64(seed ^ src ^ dst ^ t) % 3 + 1.
  - Deliver. Pop every InFlight whose delivery_time == t. For each, if (sender, dest) is in drop, discard. Otherwise call nodes[dest].handle(rpc, t) and enqueue any reply RPCs the handler produces.
  - Tick. Iterate nodes in ascending id; call node.on_tick(t) on each; enqueue any RPCs produced.
- canonical_dump(&cluster) -> Vec<u8>. As specified in CONCEPTS.md: magic "DSERAFT1" (8 bytes), u32_le(node_count), then for each node in id order: id, current_term, voted_for (i64 LE, -1 for None), role (u8), commit_index, log_len, and each entry's (term, cmd_len, cmd_bytes).
- raftctl CLI. Parses --seed, --nodes, --rounds, --proposals, --partition s,d,s,d,.... Calls Cluster::new, inserts every (s, d) pair into cluster.drop, runs, dumps, sha256s, prints lowercase hex with no trailing newline.
- scripts/cross_test.sh. For each of the six scenarios (A–F in docs/observation.md), invoke all three binaries with the same args, compare raw dumps with cmp -s, then compare hex hashes. Print the scenario label and OK on success, or the diverging offset and the three hashes on failure. End with === ALL OK ===.
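The heap ordering in the first task can be sketched with the standard library alone. Note the swap: instead of a hand-written Ord that inverts the comparison, this version derives Ord in field order and wraps items in std::cmp::Reverse to get min-heap behavior. The InFlight fields beyond the three ordering keys are assumptions, and the RPC payload is left out.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Field order matters: derived Ord compares delivery_time, then sender,
// then seq. dest comes last and never decides because seq is unique.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct InFlight {
    delivery_time: u64,
    sender: u32,
    seq: u64,
    dest: u32,
}

/// Pop every message due at or before tick t, in deterministic order.
/// If every tick is drained, "<= t" pops the same set as "== t".
fn pop_due(heap: &mut BinaryHeap<Reverse<InFlight>>, t: u64) -> Vec<InFlight> {
    let mut due = Vec::new();
    while heap.peek().map_or(false, |m| m.0.delivery_time <= t) {
        due.push(heap.pop().unwrap().0);
    }
    due
}

fn main() {
    let mut heap = BinaryHeap::new();
    let sends = vec![(5u64, 1u32), (3, 2), (3, 0), (3, 0)];
    for (seq, (delivery_time, sender)) in sends.into_iter().enumerate() {
        heap.push(Reverse(InFlight { delivery_time, sender, seq: seq as u64, dest: 0 }));
    }
    for m in pop_due(&mut heap, 3) {
        println!("t={} sender={} seq={}", m.delivery_time, m.sender, m.seq);
    }
}
```

The global seq is what makes the order total: two messages from the same sender due at the same tick are delivered in the order they were enqueued, in every language.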
Acceptance
- cargo test --release ⇒ ~10 tests pass.
- go test ./... ⇒ ~12 tests pass.
- ctest --test-dir build ⇒ 100% tests passed.
- ./scripts/verify.sh ⇒ === OK ===.
- ./scripts/cross_test.sh ⇒ all six scenarios OK, final === ALL OK ===.
- The exact sha256s match docs/observation.md's table. Specifically scenario A is a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9.
Discussion prompts
- The proposal-injection step picks the leader by (max term, min id). Why not "first leader found in iteration order"? (Hint: Go's map iteration is randomized; (max term, min id) is content-defined.)
- Scenario E (--partition 0,1,0,2,1,0,2,0) drops every message into or out of node 0. What is the only way the resulting log can contain committed entries? Trace which two-node sub-cluster achieves quorum.
- Scenario F is an asymmetric partition (0 → 1 only). Why doesn't this cause permanent leadership churn? (Hint: node 1 can still reach node 0 via AppendEntriesReply.)
- If you swap BTreeSet for HashSet in Cluster::drop (Rust), the hashes still match. Why? But if you swap BTreeMap for HashMap in RaftNode::next_index, they don't. Articulate the rule.
db-18 — Paxos
This lab implements Multi-Paxos consensus in Rust, Go, and C++,
all three producing a byte-identical sha256 of a canonical cluster
dump for any (seed, nodes, rounds, proposals, partition)
configuration. It is the sibling of db-17 (Raft) and reuses db-16's
deterministic simulator discipline: same splitmix64 seeding, same
(delivery_time, sender, seq) heap tie-break, same "sorted iteration
on the wire" rule, same closed-form proposal schedule.
If db-17 taught you that one consensus algorithm can be expressed identically in three languages, db-18 teaches you that another consensus algorithm — built on different primitives, with no built-in leader concept, and capable of arbitrary concurrent proposers — can be held to the very same bit-level discipline. The two implementations share zero algorithmic code but share all of the determinism machinery, and that is the point.
What is it?
Paxos (Lamport, "The Part-Time Parliament" 1998 / "Paxos Made Simple"
2001) is the original asynchronous consensus algorithm: a family of
acceptors collectively decides on a single value per slot despite
crashes, message loss, and message reordering. Unlike Raft, Paxos has
no first-class leader and no current_term. Its only ordering
primitive is the ballot — a lexicographic pair (round, proposer_id) that acceptors monotonically promise to honor.
Single-decree (one-slot) Paxos has two phases:
- Phase 1: Prepare / Promise. A proposer picks a fresh ballot b and broadcasts Prepare(b). An acceptor whose previously promised ballot is ≤ b updates promised := b and replies with every prior accept it holds (each (slot, accepted_ballot, value) triple). On collecting promises from a majority, the proposer enters Phase 2.
- Phase 2: Accept / Accepted. For each slot, the proposer picks the value to propose: if any promise returned a prior accept for that slot, it must re-propose the value with the highest accepted_ballot (Lamport's P2c invariant); otherwise it is free to propose its own client value. It broadcasts Accept(b, slot, v). An acceptor whose promised ballot is ≤ b records accepted[slot] := (b, v) and replies Accepted(b, slot). On collecting accepts from a majority, the proposer declares the slot decided and broadcasts Decided(slot, v) to anyone who didn't get the accept.
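The acceptor side of the two phases fits in a few lines for the single-slot case. This is a minimal sketch, not the lab's node type: the Acceptor struct, the use of a plain tuple for the ballot, and the Result-shaped reply are all assumptions made for brevity.

```rust
type Ballot = (u32, u32); // (round, proposer_id); tuples compare lexicographically

#[derive(Default)]
struct Acceptor {
    promised: Ballot,
    accepted: Option<(Ballot, Vec<u8>)>,
}

impl Acceptor {
    /// Phase 1. Ok(prior accept, if any) is a promise; Err carries the
    /// higher ballot that was already promised.
    fn on_prepare(&mut self, b: Ballot) -> Result<Option<(Ballot, Vec<u8>)>, Ballot> {
        if self.promised <= b {
            self.promised = b;
            Ok(self.accepted.clone())
        } else {
            Err(self.promised)
        }
    }

    /// Phase 2. Returns true if the value was accepted under ballot b.
    fn on_accept(&mut self, b: Ballot, v: &[u8]) -> bool {
        if self.promised <= b {
            self.promised = b;
            self.accepted = Some((b, v.to_vec()));
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut a = Acceptor::default();
    println!("{:?}", a.on_prepare((1, 0)));
    println!("{}", a.on_accept((1, 0), b"x"));
    println!("{:?}", a.on_prepare((2, 1)));
    println!("{}", a.on_accept((1, 0), b"y"));
}
```

The second Prepare returns the value accepted under the older ballot, which is exactly what forces the new proposer to re-propose it.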
Multi-Paxos amortizes Phase 1 across many slots. The proposer who
"wins" Phase 1 acts as a distinguished proposer (lab-locally we call
this role Leader) and reuses its promised ballot to drive Phase 2
for every subsequent slot, paying the Phase-1 cost only once per
ballot. Liveness is preserved by election timeouts: an acceptor that
hasn't heard from a leader for ≥ ELECTION_TIMEOUT_MIN + jitter ticks
starts its own Phase 1 with a higher round.
This lab implements Multi-Paxos end-to-end. Paxos is the algorithm family behind Google Chubby, the Paxos groups in Google Spanner, Cassandra lightweight transactions, and (in spirit) Apache ZooKeeper's ZAB.
Why does it matter?
- Paxos is the historical and theoretical root of asynchronous consensus. Raft, ZAB, Viewstamped Replication, and EPaxos are all reactions to or refinements of Paxos. Reading the paper is easier when you have made the algorithm bit-deterministic with your own hands.
- No fixed leader means no "single term" to lean on. Raft's safety flows largely from "at most one leader per term". Paxos has neither a fixed leader nor a term. Its safety flows from the much weaker quorum-intersection argument: any two majorities of an n-node cluster share at least one acceptor, and that acceptor's promised-ballot ordering serializes every accept that could possibly decide a slot. Writing the algorithm in three languages, watching the same sha256 fall out, and then deliberately breaking the quorum (scenario E) is the most visceral way to internalize quorum intersection.
- Concurrent proposers are first-class. Paxos lets every node attempt Phase 1 at any time. Dueling proposers are not an error case; they are the normal case during leadership churn. The deterministic simulator lets you replay the exact tick at which two proposers tied, see which ballot won, and confirm the safety invariants held without any "leader lease" magic.
- Foundation for the rest of the distributed track. db-19 (ZAB) layers epoch+counter on top of a paxos-ish core; db-20 (distributed KV) feeds Paxos accept-decisions into a key-value state machine; db-23 (capstone) introduces snapshots and reconfiguration on top of whichever consensus engine the student picks (Raft, Paxos, or both).
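The quorum-intersection claim in the list above is easy to check by brute force for small clusters. The sketch below enumerates node subsets as bitmasks; quorums_intersect is a hypothetical helper written for this illustration.

```rust
/// True if every pair of node subsets with at least `size` members shares
/// a node. Subsets are bitmasks over n nodes, so keep n small.
fn quorums_intersect(n: u32, size: u32) -> bool {
    let quorums: Vec<u32> = (0u32..(1u32 << n))
        .filter(|m| m.count_ones() >= size)
        .collect();
    quorums.iter().all(|a| quorums.iter().all(|b| (a & b) != 0))
}

fn main() {
    for n in 1..=7u32 {
        println!("n={} majorities intersect: {}", n, quorums_intersect(n, n / 2 + 1));
    }
    // Half of an even cluster is not a majority: {0,1} and {2,3} are disjoint.
    println!("n=4, quorum size 2: {}", quorums_intersect(4, 2));
}
```

The n = 4 case with quorum size 2 shows why "more than half" cannot be relaxed to "at least half".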
How does it work?
State (per node)
acceptor : promised_ballot     : Ballot                        # global, not per-slot
           accepts             : Map<slot, (Ballot, Vec<u8>)>
learner  : learned             : Map<slot, Vec<u8>>
proposer : role                : Follower | Candidate | Leader
           my_ballot           : Ballot                        # the ballot this node is driving
           prepare_promises    : Set<acceptor_id>              # accumulated this election
           prepare_accepted    : Map<slot, (Ballot, Vec<u8>)>  # recovered during Phase 1
           accept_count        : Map<slot, Set<acceptor_id>>
           next_slot           : u64                           # next fresh slot to propose
           pending             : Deque<Vec<u8>>                # queued client values
timers   : election_deadline   : u64                           # sim-time tick
           last_heartbeat_sent : u64
promised_ballot is global per node (covers every slot, present and
future) — this is the standard Multi-Paxos optimization. accepts is
per-slot, because each slot is its own single-decree instance.
learned is the per-slot decision; once set it never changes.
Ballot ordering
#[derive(Clone, Copy, Eq, PartialEq)]
struct Ballot { round: u32, proposer_id: u32 }
Lex order on (round, proposer_id). Ballot::ZERO = (0, 0) means
"no ballot" and compares less than every other ballot. Promotion of
promised_ballot is monotonic: once an acceptor has promised b, it
will never accept any RPC carrying a strictly lower ballot.
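One way to get the lexicographic order in Rust is to derive it: derived Ord compares fields in declaration order, so round is compared before proposer_id. This is a sketch of that option, not necessarily how the lab implements it; the promote helper is a hypothetical name for the monotonic update.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,       // compared first
    proposer_id: u32, // tie-break within a round
}

impl Ballot {
    const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };
}

/// Monotonic promotion: accept b only if it is not lower than the promise.
fn promote(promised: &mut Ballot, b: Ballot) -> bool {
    if b >= *promised {
        *promised = b;
        true
    } else {
        false
    }
}

fn main() {
    let mut promised = Ballot::ZERO;
    println!("{}", promote(&mut promised, Ballot { round: 2, proposer_id: 0 }));
    println!("{}", promote(&mut promised, Ballot { round: 1, proposer_id: 9 }));
    println!("{:?}", promised);
}
```

Reordering the two fields in the struct would silently change the ordering, so the declaration order is part of the cross-language contract.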
Election timer (liveness)
reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
Identical to db-17's election timer. Heartbeats fire every 50 ticks from the current leader to keep follower timers refreshed.
Phase 1 — Prepare / Promise
start_election(t):
    role = Candidate
    new_round = max(promised_ballot.round, my_ballot.round) + 1
    my_ballot = Ballot { round: new_round, proposer_id: self.id }
    prepare_promises = { self.id }                 # self-promise
    prepare_accepted = { slot: (ab, v) | (slot, (ab, v)) in self.accepts }
    if my_ballot >= promised_ballot:
        promised_ballot = my_ballot                # we promise ourselves too
    broadcast(Prepare { ballot: my_ballot })
    if |prepare_promises| >= quorum():             # n = 1 cluster
        become_leader(t)
on Prepare(b) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)                           # higher proposer takes over
        reset_election_deadline(t)
        send Promise(b, accept_ok=true,
                     accepted = sorted_by_slot(accepts),
                     acceptor_id = self.id) → b.proposer_id
    else:
        send Promise(b, accept_ok=false, acceptor_id=self.id) → b.proposer_id
on Promise(b, ok, accepted, from) at candidate:
    if role != Candidate or b != my_ballot: drop
    if not ok: step_down(t); return                # someone outranks us
    prepare_promises.insert(from)
    for (slot, ab, v) in accepted:
        if slot not in prepare_accepted or ab > prepare_accepted[slot].ballot:
            prepare_accepted[slot] = (ab, v)       # recover highest-ballot value
    if |prepare_promises| >= quorum():
        become_leader(t)
The recovery rule (take the reported value if ab > current.ballot) is the operational
form of Lamport's P2c: across any majority of acceptors, the value
with the highest accepted ballot for a slot is the only value that
could already be decided in that slot, so the new leader must keep
proposing it (or anything if no acceptor reports a prior accept).
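The recovery rule can be sketched as a merge over the accepts reported in each Promise. The types here are simplified assumptions: ballots are plain tuples and merge_promise is a hypothetical name for the loop in the Promise handler.

```rust
use std::collections::BTreeMap;

type Ballot = (u32, u32); // (round, proposer_id)
type Recovered = BTreeMap<u64, (Ballot, Vec<u8>)>;

/// Fold one Promise's accepted list into the candidate's recovery map,
/// keeping for each slot the value with the highest accepted ballot.
fn merge_promise(recovered: &mut Recovered, accepted: &[(u64, Ballot, Vec<u8>)]) {
    for (slot, ab, v) in accepted {
        let replace = match recovered.get(slot) {
            None => true,
            Some((cur, _)) => ab > cur,
        };
        if replace {
            recovered.insert(*slot, (*ab, v.clone()));
        }
    }
}

fn main() {
    let mut r = Recovered::new();
    merge_promise(&mut r, &[(0, (1, 0), b"a".to_vec()), (1, (1, 0), b"b".to_vec())]);
    merge_promise(&mut r, &[(0, (2, 1), b"c".to_vec()), (1, (0, 9), b"d".to_vec())]);
    for (slot, (ballot, value)) in &r {
        println!("slot {} -> {:?} {:?}", slot, ballot, String::from_utf8_lossy(value));
    }
}
```

Because the comparison is strict, the result does not depend on the order in which promises with equal ballots arrive, which matters for byte-determinism.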
Phase 2 — Accept / Accepted
become_leader(t):
    role = Leader
    # Re-issue Accepts under our ballot for every recovered slot.
    for slot in sorted(prepare_accepted.keys):
        if slot in learned: continue
        value = prepare_accepted[slot].value
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
    next_slot = 1 + max(any seen slot in accepts ∪ learned, or -1)
    last_heartbeat_sent = t
    broadcast(Heartbeat { ballot: my_ballot })
    drain_pending()
drain_pending():
    while pending is non-empty:
        value = pending.pop_front()
        slot = next_slot; next_slot += 1
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
        try_decide(slot)                           # n=1 cluster
on Accept(b, slot, v) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        accepts[slot] = (b, v)
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)
        reset_election_deadline(t)
        send Accepted(b, slot, ok=true, self.id) → b.proposer_id
    else:
        send Accepted(b, slot, ok=false, self.id) → b.proposer_id
on Accepted(b, slot, ok, from) at leader:
    if role != Leader or b != my_ballot: drop
    if not ok: step_down(t); return
    accept_count[slot].insert(from)
    try_decide(slot)
try_decide(slot):
    if role != Leader or slot in learned: return
    if |accept_count[slot]| >= quorum():
        v = accepts[slot].value
        learned[slot] = v
        broadcast(Decided { slot, value: v })
on Decided(slot, v) at any node:
    learned[slot] = v
    reset_election_deadline(t)
on Heartbeat(b) at node:
    if b >= my_ballot and role in {Candidate, Leader} and b.proposer_id != self.id:
        step_down(t)
    if b >= promised_ballot or (promised_ballot != ZERO and b == promised_ballot):
        reset_election_deadline(t)
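The decide step is worth isolating because it depends on set semantics: a duplicated Accepted from the same acceptor must not count twice. This is a minimal sketch under stated assumptions: quorum(n) = n / 2 + 1 is assumed for the quorum() helper, and the Decider struct is a hypothetical slice of the leader's state.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Assumed definition of quorum(): a strict majority of n nodes.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

struct Decider {
    nodes: usize,
    accept_count: BTreeMap<u64, BTreeSet<u32>>, // slot -> acceptors that acked
    learned: BTreeMap<u64, Vec<u8>>,            // slot -> decided value
}

impl Decider {
    /// Record a successful Accepted from `from`; returns true if this ack
    /// is the one that decided the slot.
    fn on_accepted(&mut self, slot: u64, from: u32, value: &[u8]) -> bool {
        if self.learned.contains_key(&slot) {
            return false; // a decided slot never changes
        }
        let acks = self.accept_count.entry(slot).or_default();
        acks.insert(from); // a set, so duplicates are harmless
        if acks.len() >= quorum(self.nodes) {
            self.learned.insert(slot, value.to_vec());
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut d = Decider { nodes: 3, accept_count: BTreeMap::new(), learned: BTreeMap::new() };
    println!("{}", d.on_accepted(0, 0, b"v")); // leader's own ack
    println!("{}", d.on_accepted(0, 0, b"v")); // duplicate, still 1 of 3
    println!("{}", d.on_accepted(0, 1, b"v")); // 2 of 3: decided
}
```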
Simulator loop (per tick t in 0..rounds)
1. enqueue scheduled proposals — schedule[i] = (i+1) * rounds / (K+1)
2. drain cluster-pending values into the current leader (if any)
3. pop every in-flight msg with delivery_time <= t and dispatch handle()
4. tick all nodes in ascending id; on_tick may fire election or heartbeat
The leader-pick rule for proposal injection is "lowest-id node with
role == Leader". During leadership churn there may be no leader (in
which case the value waits in cluster_pending) or even two stale
leaders (in which case the lowest id wins). The deterministic choice
is what keeps the byte hash stable.
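The injection rules above can be sketched in a few lines of Rust. This is a minimal illustration, not the lab's code: the `roles` slice indexed by node id and the function names `pick_leader` and `proposal_tick` are hypothetical.

```rust
/// Roles in dump order (Follower=0, Candidate=1, Leader=2).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

/// Lowest-id node whose role is Leader, or None while there is no leader.
/// Assumption: `roles[i]` holds the role of node id `i`.
pub fn pick_leader(roles: &[Role]) -> Option<u32> {
    roles
        .iter()
        .position(|r| *r == Role::Leader)
        .map(|i| i as u32)
}

/// Tick at which proposal `i` (0-based) is enqueued:
/// (i+1) * rounds / (k+1), with integer division.
pub fn proposal_tick(i: u64, rounds: u64, k: u64) -> u64 {
    (i + 1) * rounds / (k + 1)
}
```

With `rounds = 1000` and `k = 5`, the first proposal lands on tick 166 and the last on tick 833. When two stale leaders coexist, `pick_leader` returns the lower id, matching the rule described above.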
Wire format (Rpc)
Six variants, expressed as a tagged-union struct in Go, an enum in Rust, and
std::variant-backed types in C++. All integer fields are fixed-width, little-endian:
Prepare { ballot: (round: u32, proposer_id: u32) }
Promise { ballot, accept_ok: bool, acceptor_id: u32,
accepted: [(slot: u64, accepted_ballot, value: Vec<u8>)] }
Accept { ballot, slot: u64, value: Vec<u8> }
Accepted { ballot, slot: u64, accept_ok: bool, acceptor_id: u32 }
Decided { slot: u64, value: Vec<u8> }
Heartbeat { ballot }
The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. The only thing that is serialized is the canonical dump, and that is what gets hashed.
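As a rough sketch, the six variants could be written as a Rust enum like the one below. The field names follow the listing above; the derive lists and the helper `ballot_of` are assumptions added for illustration and may not match the lab's crate.

```rust
/// Lexicographic (round, proposer_id) pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// In-memory message type passed between simulated nodes.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Rpc {
    Prepare { ballot: Ballot },
    Promise {
        ballot: Ballot,
        accept_ok: bool,
        acceptor_id: u32,
        /// (slot, accepted_ballot, value), sorted by ascending slot.
        accepted: Vec<(u64, Ballot, Vec<u8>)>,
    },
    Accept { ballot: Ballot, slot: u64, value: Vec<u8> },
    Accepted { ballot: Ballot, slot: u64, accept_ok: bool, acceptor_id: u32 },
    Decided { slot: u64, value: Vec<u8> },
    Heartbeat { ballot: Ballot },
}

/// Hypothetical helper: the ballot a message carries, if any.
/// Decided is the only variant without one.
pub fn ballot_of(rpc: &Rpc) -> Option<Ballot> {
    match rpc {
        Rpc::Prepare { ballot }
        | Rpc::Heartbeat { ballot }
        | Rpc::Promise { ballot, .. }
        | Rpc::Accept { ballot, .. }
        | Rpc::Accepted { ballot, .. } => Some(*ballot),
        Rpc::Decided { .. } => None,
    }
}
```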
Canonical dump format
file := magic[8 = "DSEPAX01"] u32_le(node_count) node*
node := u32_le id
u32_le promised_ballot.round
u32_le promised_ballot.proposer_id
u8 role # Follower=0, Candidate=1, Leader=2
u32_le my_ballot.round
u32_le my_ballot.proposer_id
u32_le accept_count
accept * accept_count
u32_le learned_count
learned * learned_count
accept := u64_le slot
u32_le accepted_ballot.round
u32_le accepted_ballot.proposer_id
u32_le value_len
u8 value[value_len]
learned := u64_le slot
u32_le value_len
u8 value[value_len]
Nodes appear in ascending id order; inside each node, both accepts
and learned are emitted in ascending slot order. All multi-byte
integers are little-endian. The dump is hashed with SHA-256 and the
lowercase hex digest is what paxosctl prints to stdout (no trailing
newline).
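The grammar above translates fairly directly into an encoder. The following is a sketch under stated assumptions: the `NodeState` struct and helper names are hypothetical, the caller is assumed to pass nodes already sorted by id, and the SHA-256 step is omitted because it needs a hashing implementation outside the standard library.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// Hypothetical per-node view containing only the dumped fields.
pub struct NodeState {
    pub id: u32,
    pub promised_ballot: Ballot,
    pub role: u8, // Follower=0, Candidate=1, Leader=2
    pub my_ballot: Ballot,
    pub accepts: BTreeMap<u64, (Ballot, Vec<u8>)>,
    pub learned: BTreeMap<u64, Vec<u8>>,
}

fn put_ballot(out: &mut Vec<u8>, b: Ballot) {
    out.extend_from_slice(&b.round.to_le_bytes());
    out.extend_from_slice(&b.proposer_id.to_le_bytes());
}

fn put_value(out: &mut Vec<u8>, v: &[u8]) {
    out.extend_from_slice(&(v.len() as u32).to_le_bytes());
    out.extend_from_slice(v);
}

/// Encode the dump; `nodes` must already be in ascending id order.
/// BTreeMap iteration supplies the ascending slot order.
pub fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEPAX01");
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        put_ballot(&mut out, n.promised_ballot);
        out.push(n.role);
        put_ballot(&mut out, n.my_ballot);
        out.extend_from_slice(&(n.accepts.len() as u32).to_le_bytes());
        for (slot, (ab, v)) in &n.accepts {
            out.extend_from_slice(&slot.to_le_bytes());
            put_ballot(&mut out, *ab);
            put_value(&mut out, v);
        }
        out.extend_from_slice(&(n.learned.len() as u32).to_le_bytes());
        for (slot, v) in &n.learned {
            out.extend_from_slice(&slot.to_le_bytes());
            put_value(&mut out, v);
        }
    }
    out
}
```

A node with no accepts and no learned values encodes to 29 bytes, so an empty single-node dump is 41 bytes including the 12-byte header.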
Cross-language invariants
| Invariant | Why it matters |
|---|---|
| `splitmix64` constants `0x9E3779B97F4A7C15`, `0xBF58476D1CE4E7B5`, `0x94D049BB133111EB` | identical PRNG output across languages |
| `election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150` | identical election firing times |
| `delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3` | identical message scheduling |
| heap order `(delivery_time, sender, seq)`; `seq` global monotonic | identical delivery sequence |
| peers iterated in ascending id (`BTreeMap` / `std::map` / explicit `for p:=0;p<n;p++`) | identical broadcast order |
| acceptor's Promise lists prior accepts in ascending slot order | identical Promise payload bytes |
| candidate's Phase-1 recovery rule: keep `(ab, v)` with strictly greater `ab` | identical recovered value per slot |
| `next_slot = 1 + max(seen accept slot ∪ seen learned slot)` after winning Phase 1 | identical first fresh slot |
| `try_decide` quorum check uses `≥ n/2 + 1` (strict majority, leader counted) | identical decide tick |
| leader-pick for proposal injection: lowest-id `Leader` | identical client routing |
| proposal schedule: `schedule[i] = (i+1) * rounds / (K+1)` integer division | identical pending queue contents |
| `Role` enum order `Follower=0, Candidate=1, Leader=2` | identical dump bytes |
| dump emits accepts and learned in ascending slot order; nodes in ascending id order | identical dump bytes |
Drift in any one of these and scripts/cross_test.sh fails. The
companion cmp -l workflow in docs/observation.md walks you from
"the hashes differ" to "this exact byte differs" in three commands.
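The two PRNG-derived formulas in the table can be sketched as follows. The three constants come from the table; treating the XOR-mixed input as the state for a single stateless splitmix64 step is an assumption about how the lab applies the generator, and the function names are illustrative.

```rust
/// One splitmix64 output step, treating `x` as the generator state.
pub fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
pub fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ u64::from(node_id) ^ t) % 150
}

/// delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3
pub fn delivery_delay(seed: u64, src: u32, dst: u32, t: u64) -> u64 {
    1 + splitmix64(seed ^ u64::from(src) ^ u64::from(dst) ^ t) % 3
}
```

Both functions are pure, so every implementation that uses the same constants, wrapping arithmetic, and shift widths produces the same deadline in the range `t+150 .. t+299` and the same delay in `1..=3`.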
Multi-Paxos vs. Raft (the comparison the labs exist to make)
| Dimension | Raft (db-17) | Multi-Paxos (db-18) |
|---|---|---|
| ordering primitive | current_term: u64 (single integer, persisted, monotonic) | Ballot { round, proposer_id } lex pair |
| leader concept | first-class; exactly one leader per term | emergent; "leader" = whoever last won Phase 1 |
| concurrent proposers | forbidden by election safety | allowed (and routine during churn) |
| consistency check | prev_log_index / prev_log_term per AppendEntries | per-slot accepted_ballot carried in Promise |
| Phase-1 cost amortization | none needed (single leader) | Multi-Paxos (one Prepare covers all future slots) |
| safety from | log matching + election restriction + commit-only-current-term | quorum intersection + Promise reports prior accepts |
| understandability | designed for clarity (Ongaro 2014) | famously subtle (P2c, dueling proposers) |
The lab implementations make these dimensions concrete: scenario A in db-17 takes ~166 ticks to commit a proposal (election + AE round trip); the equivalent scenario A here takes ~150 ticks for Phase 1 plus ~3 ticks per Accept, then the leader runs at Phase-2-only cost until somebody bumps it.
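The "lex pair" ordering in the first table row is worth seeing in code. In this sketch (field names taken from the wire format, helper name `may_accept` hypothetical), Rust's derived `Ord` compares fields in declaration order, so `round` dominates and `proposer_id` only breaks ties.

```rust
/// Derived ordering is lexicographic: round first, then proposer_id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// An acceptor may act on a message only if its ballot is at least
/// the ballot it has already promised.
pub fn may_accept(incoming: Ballot, promised: Ballot) -> bool {
    incoming >= promised
}
```

Unlike a Raft term, two proposers can hold distinct ballots with the same `round`; the `proposer_id` component is what makes every ballot unique and totally ordered.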
Files
- `src/rust/`: `paxos18` crate + `paxosctl` binary.
- `src/go/`: module `github.com/10xdev/dse/db18` + `cmd/paxosctl`.
- `src/cpp/`: `db18_lib` static library + `paxosctl` binary + `test_db18`.
- `scripts/verify.sh`: builds + runs the unit tests for all three.
- `scripts/cross_test.sh`: proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.
See docs/ for the long-form write-up and steps/ for the staged
implementation path.
db-18 — References
Primary sources
- Leslie Lamport, The Part-Time Parliament, ACM TOCS 1998. The original Paxos paper. Famously hard to read (the Parliament of Paxos allegory hides the algorithm). The mathematics in §2 is the spec; the rest is narrative. https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
- Leslie Lamport, Paxos Made Simple, ACM SIGACT News 2001. The paper to read first. The whole algorithm — single-decree and the Multi-Paxos extension — is on four pages. The P1a / P1b / P2a / P2b / P2c invariants in this paper are the ones whose operational forms the simulator enforces. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
- Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live — An Engineering Perspective, PODC 2007. Google's Chubby team's writeup of what it took to turn the algorithm into a production system: leader leases, snapshots, group membership, disk corruption, the works. This lab implements roughly §2–§3 of that paper. https://research.google/pubs/paxos-made-live-an-engineering-perspective/
- Robbert van Renesse & Deniz Altinbuken, Paxos Made Moderately Complex, ACM CSUR 2015. The most readable end-to-end derivation of Multi-Paxos. Pseudocode in §3 maps almost line-for-line onto this lab's `start_election`, `become_leader`, `try_decide`. https://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
- Heidi Howard, Distributed Consensus Revised, PhD dissertation, Cambridge 2019 (also A Generalised Solution to Distributed Consensus, 2020). Reframes Paxos as one point in a design space parameterised by quorum-intersection requirements; explains why Flexible Paxos works and how Raft, EPaxos, and Vertical Paxos all fit into the same picture. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf
Variants worth knowing
- Leslie Lamport, Fast Paxos, Distributed Computing 2006. Allows a single round-trip happy path when only one proposer is active, at the cost of larger fast-path quorums (commonly stated as needing 3f+1 acceptors to tolerate f failures, versus 2f+1 for classic Paxos). EPaxos generalises this.
- Iulian Moraru, David Andersen, Michael Kaminsky, There Is More Consensus in Egalitarian Parliaments (EPaxos), SOSP 2013. Drops the leader entirely; each command picks its own dependency graph. Production-relevant in geo-distributed systems where any-leader latency is uneven.
- Lamport, Malkhi, Zhou, Vertical Paxos, PODC 2009. Decouples reconfiguration from the consensus protocol — the answer to "how do you change the acceptor set without stopping the world".
- Lamport, Generalized Paxos, MSR-TR-2005-33. Lets commutative commands be ordered concurrently; precursor to EPaxos.
Reference implementations to read alongside
- etcd/raft (Go): included for comparison; etcd uses Raft, but its testing harness (`raftpb` deterministic replay) is in the spirit of this lab's cross-language test. https://github.com/etcd-io/raft
- Apache ZooKeeper (Java): ZAB is a Paxos-family protocol with primary order; useful counterpoint when reading db-19. https://github.com/apache/zookeeper
- Apache Cassandra Lightweight Transactions — production Multi-Paxos in the read/write path. Cassandra picks a fresh ballot per LWT, so it pays the Phase-1 cost every time and skips the Multi-Paxos amortization. Worth reading for what not to do if you care about per-decree latency. https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/service/paxos
- Google Spanner (paper, not source) — Spanner uses Paxos groups per shard with leader leases plus TrueTime for external consistency. The algorithm core is what you build here; everything else is layered above. https://research.google/pubs/spanner-googles-globally-distributed-database/
- TigerBeetle (Zig): Viewstamped Replication, a near-Paxos cousin. Deterministic simulator that does almost exactly what this lab's `cross_test.sh` does, but in one language with thousands of seeds. https://github.com/tigerbeetledb/tigerbeetle
Background reading worth doing
- Diego Ongaro & John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014. Read alongside db-17. Section 10 (related work) is the cleanest published comparison of Raft to Paxos. https://raft.github.io/raft.pdf
- Junqueira, Reed, Serafini, Zab: High-performance Broadcast for Primary-Backup Systems, DSN 2011. ZAB derivation; see db-19. https://marcoserafini.github.io/papers/zab.pdf
- Henry Robinson, Consensus Protocols: Paxos, Cloudera blog 2009. Short, blog-length walkthrough; useful sanity-check after the primary papers. https://blog.cloudera.com/paxos-made-easy-yes-no-maybe/
Cross-lab dependencies
- Upstream:
  - db-16 distributed-fundamentals (Lamport and vector clocks, plus the deterministic simulator harness whose discipline this lab inherits wholesale).
  - db-17 raft (sibling consensus algorithm; same harness, same canonical-dump discipline, different RPCs and safety arguments).
- Downstream:
  - db-19 ZAB: leader-based atomic broadcast; the `zxid = (epoch, counter)` pair generalises this lab's `Ballot`.
  - db-20 Distributed KV: wraps a chosen consensus engine around a key-value state machine. Paxos and Raft are interchangeable plug-ins at that layer.
  - db-23 Capstone: adds snapshots, reconfiguration (Vertical Paxos or joint consensus), and multi-shard deployment.
db-18 — Analysis
Required invariants
If any of these is violated, scripts/cross_test.sh will fail, and
in the worst case the algorithm itself is unsafe. They are stated in
the order it is easiest to reason about them.
- Promise monotonicity (P1). For every acceptor and every sim-time tick `t`, `promised_ballot[t] >= promised_ballot[t-1]`. The simulator enforces this with a single comparison on each of `Prepare` and `Accept`: the message's ballot must be `>= promised_ballot` before any state mutation. The Promise reply's `accept_ok` bit is the operational form of P1b.
- Accept respects promise (P2a). No acceptor ever stores `accepts[slot] = (ab, v)` with `ab < promised_ballot`. The Accept handler short-circuits with `accept_ok=false` when `b < promised_ballot`; the leader interprets that bit and steps down instead of advancing `accept_count`.
- Per-slot accept uniqueness under a ballot (P2b). For a fixed slot `s` and a fixed ballot `b`, the value `v` that any acceptor stores under `(s, b)` is the same value. This holds trivially here because only the leader of ballot `b` ever sends `Accept(b, s, v)`, and its `accepts[s]` is set once and never overwritten under its own ballot.
- P2c (the safety lemma that needs work). Suppose value `v` is chosen at slot `s` under ballot `b`. Then for any ballot `b' > b` issued by any proposer, the value field of any `Accept(b', s, v')` will satisfy `v' == v`. The mechanism: to issue an Accept at all, the proposer must have collected promises from a quorum at ballot `b'`. That quorum intersects with the quorum that chose `v` at `b`. The intersecting acceptor saw `v` accepted under `b`, so its Promise carries `(s, b, v)`. The proposer's recovery rule (take the value whose accepted_ballot is highest) therefore takes `v` (or a later value accepted under some `b'' > b`, but inductively that value is also `v`). So `v' == v`. QED. The simulator implements this rule in `start_election`'s init of `prepare_accepted` and in the Promise-handler's `if ab > prepare_accepted[s].ballot` update.
- Decided-once / monotonic learn. Once `learned[s]` is set on any node, it never changes value. Locally enforced by reading before writing; globally guaranteed by P2c.
- Byte-determinism of the dump. Two runs with the same `(seed, nodes, rounds, proposals, partition)` produce identical canonical dump bytes on every language. This requires every iteration order (peers, slots, accepted-list inside Promise, heap pops on identical `(time, sender, seq)`) to be fixed. Drift here is what `cross_test.sh` catches.
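The recovery rule behind P2c is small enough to sketch. The code below is an illustration under assumptions, not the lab's implementation: `Recovered`, `fold_promise`, and `first_fresh_slot` are hypothetical names standing in for the `prepare_accepted` bookkeeping described above.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// Per-slot best (accepted_ballot, value) seen across Promise replies.
pub type Recovered = BTreeMap<u64, (Ballot, Vec<u8>)>;

/// Fold one Promise's accepted list into the recovery map, keeping an
/// entry only when its accepted ballot is strictly greater than the
/// one already held for that slot.
pub fn fold_promise(recovered: &mut Recovered, accepted: &[(u64, Ballot, Vec<u8>)]) {
    for (slot, ab, v) in accepted {
        let replace = match recovered.get(slot) {
            Some((cur, _)) => *ab > *cur,
            None => true,
        };
        if replace {
            recovered.insert(*slot, (*ab, v.clone()));
        }
    }
}

/// 1 + the highest slot seen in recovered accepts or learned values,
/// or 0 when nothing has been seen.
pub fn first_fresh_slot(recovered: &Recovered, learned: &BTreeMap<u64, Vec<u8>>) -> u64 {
    let a = recovered.keys().next_back().copied();
    let l = learned.keys().next_back().copied();
    match a.max(l) {
        Some(m) => m + 1,
        None => 0,
    }
}
```

Because the comparison depends only on the accepted ballot, folding Promises in any arrival order yields the same map, which is the property a receive-order tie-break would lose.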
Design decisions worth highlighting
- `promised_ballot` is global per node, not per-slot. This is the Multi-Paxos optimization. A per-slot promised-ballot map would be more general (closer to single-decree Paxos per slot) but would cost a Phase-1 per slot. The global ballot lets one Phase 1 cover every present and future slot.
- Phase-1 recovery walks every prior accept, not just the latest per slot. The Promise reply contains all of the acceptor's `accepts` (sorted by slot). The candidate folds them into `prepare_accepted` with `take if strictly greater accepted_ballot`. Per the proof of P2c this is the only correct rule; a "latest by receive order" tie-break would lose safety the moment Promises arrived out of order.
- `my_ballot.round` is bumped to `max(promised, my_ballot).round + 1` when starting an election, not just `promised.round + 1`. If this node previously won a higher ballot and stepped down due to a partition heal, it would otherwise re-issue its old ballot and immediately lose to its own historical promise. The `max` makes forward progress under churn deterministic.
- Leader-pick rule: lowest-id `Leader`. When the simulator must inject a client proposal, it picks the lowest-id node currently in role `Leader`. There may be zero (queue the proposal in `cluster_pending`) or, briefly, two stale Leaders (the lower id wins; the other's `Accept` will fail at acceptors that have already promised the new ballot). Determinism > realism here.
- `drain_pending` runs on every Accepted, not just every tick. In single-node mode (`--nodes 1`) the leader becomes its own quorum and decides slots inside the broadcast loop. Doing the drain in `become_leader` and in `try_decide` means scenario D's hash is independent of how the simulator orders ticks.
- Heap key `(delivery_time, sender, seq)`. db-16's invariant. Without the `seq` tiebreak, Promise messages from two acceptors arriving on the same tick from the same sender (impossible by construction, but the type system doesn't know that) would be reorder-able across languages.
- `Role` enum order. `Follower=0, Candidate=1, Leader=2` was chosen to match db-17; any change would propagate into the role byte, which sits 12 bytes into each node record (file offset `12 + 12 = 24`, i.e. `0x18`, for the first node), and would silently invalidate scenario A's canonical hash.
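The ballot-bump rule from the list above fits in one function. This is a sketch only; `next_ballot` is a hypothetical name, and it assumes `max` is taken over whole ballots in lexicographic order before reading `round`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// Ballot to use when starting an election:
/// round = max(promised, my_ballot).round + 1, proposer_id = self.
pub fn next_ballot(promised: Ballot, my_ballot: Ballot, self_id: u32) -> Ballot {
    let base = promised.max(my_ballot);
    Ballot {
        round: base.round + 1,
        proposer_id: self_id,
    }
}
```

The result is strictly greater than both inputs, so a node that once led with a high ballot cannot re-issue a ballot at or below one it has already used or promised.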
Tradeoffs worth flagging
- Concurrent proposers cost throughput, not safety. Two proposers in dueling Phase 1 can ping-pong each other forever in principle. The lab dodges this in two ways: (a) the deterministic simulator can't sustain a livelock because election timeouts are PRNG-jittered per node-id, and (b) once a leader is elected, the election-timer reset on Heartbeat keeps it elected. Production systems add leader leases (Chubby, Spanner) to push the worst case down further.
- No `commit-only-current-term` subtlety. Raft has Figure 8: a newly-elected leader must commit something in its own term before it can ack older entries, otherwise they can be silently overwritten. Paxos sidesteps the problem because P2c forces a new leader to re-Accept any recoverable value under its own ballot; there is no "shadow commit" to retract. The price is the Phase-1-on-every-startup cost.
- No native log compaction. This lab's `accepts` and `learned` grow unboundedly. A real Multi-Paxos system snapshots a state machine and discards accepts below the snapshot index (see Spanner, Chubby, db-23). Adding snapshots here would require exposing a `committed_through` watermark in the dump.
- No membership change. `n` is fixed at `Cluster` construction time. Vertical Paxos (Lamport/Malkhi/Zhou 2009) is the textbook way to add this. db-23 covers it.
- Three languages is more work than two. Two languages prove the spec is unambiguous. Three rules out the case where you and your collaborator have committed the same misreading. C++'s `std::map` and Rust's `BTreeMap` agreeing with Go's explicit `sort.Slice` was the only thing that caught a misordered Promise payload in scenario B during development.
Why three languages
Same answer as db-17: the constraint forces the spec to be a spec and not a habit. Sorted-iteration discipline, fixed enum order, little-endian fixed-width integers, and no map iteration on the wire are all rules you can break unnoticed in any single language, and the only way to surface a violation is to ask "would another implementation make the same choice without being told?". For Paxos the question matters even more: the algorithm is sensitive to whether the highest-ballot prior accept is chosen during recovery, and a sort-order bug would make safety stochastic, which is the worst possible failure mode.
db-18 — Execution
One-shot: prove it works
cd db-18-paxos
bash scripts/verify.sh # all three languages' unit tests
bash scripts/cross_test.sh # 6 scenarios × 3 binaries × byte-identical hash
verify.sh must end with === verify OK ===. cross_test.sh must
end with === ALL OK === and the six per-scenario hashes must match
the table in docs/observation.md.
Per-language workflows
Rust
cd src/rust
cargo build --release # builds paxos18 lib + paxosctl bin
cargo test --release # 12 unit tests (see verification.md)
./target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
Crate layout:
- `src/lib.rs`: the `paxos18` library (ballot, RPCs, `PaxosNode`, `Cluster`, canonical dump, sha256 helper, inline `#[cfg(test)]` module).
- `src/bin/paxosctl.rs`: CLI entry; parses flags, runs the cluster, emits the sha256 hex digest on a single line with no newline.
Go
cd src/go
go build ./...                            # compiles package + cmd/paxosctl (no binary is kept)
go build -o paxosctl_bin ./cmd/paxosctl   # writes the CLI binary used below
go test ./...                             # 11 unit tests
./paxosctl_bin --seed 42 --nodes 3 --rounds 1000 --proposals 5
Module layout:
- `paxos.go`: package `db18`; same surface as the Rust crate.
- `paxos_test.go`: `go test` suite.
- `cmd/paxosctl/main.go`: CLI binary.
- `go.mod`: module `github.com/10xdev/dse/db18`, go 1.22.
C++
cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
./test_db18 # 11 unit tests
./paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
Source layout:
- `include/db18/paxos.hpp` + `src/paxos.cpp`: the `db18` namespace library.
- `src/paxosctl_main.cpp`: CLI entry.
- `tests/test_db18.cpp`: gtest-style assertions (no framework dependency; pure asserts + `main`).
- `CMakeLists.txt`: exposes `db18_lib`, `paxosctl`, `test_db18`.
CLI reference
paxosctl has the same flags in all three languages. Anything else
on the command line is rejected.
| Flag | Type | Default | Meaning |
|---|---|---|---|
| `--seed` | u64 | required | Seeds splitmix64 for the cluster, every node's election jitter, and every message's delivery delay. |
| `--nodes` | u32 (1..=8) | required | Number of acceptor/proposer nodes; quorum = `nodes/2 + 1`. |
| `--rounds` | u64 | required | Number of sim-time ticks to run. |
| `--proposals` | u32 | required | Number of client values to inject. Value `i` is `"val-{i}"`, scheduled at tick `(i+1)*rounds/(proposals+1)`. |
| `--partition` | comma list of `src,dst` pairs (even-length) | (none) | Drop every message with the listed `(src, dst)` ordered pairs. Asymmetric: `0,1` blocks 0→1 but not 1→0. Pass `0,1,1,0` for a symmetric link cut. |
Output: a single line of lowercase hex (64 chars), no trailing newline. Exit code 0 on success; non-zero with a stderr message on parse error.
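To make the `--partition` semantics concrete, here is one way the flag value could be parsed and applied. It is a hedged sketch: `parse_partition` and `is_dropped` are hypothetical names, and the real binaries may report errors differently.

```rust
use std::collections::BTreeSet;

/// Parse "0,1,1,0" into the ordered pairs {(0,1), (1,0)}.
/// Rejects non-numeric ids and odd-length lists.
pub fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids: Vec<u32> = arg
        .split(',')
        .map(|s| {
            s.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {s:?}: {e}"))
        })
        .collect::<Result<_, _>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("expected an even number of ids, got {}", ids.len()));
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// A message is dropped only if its exact (src, dst) direction is listed.
pub fn is_dropped(blocked: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    blocked.contains(&(src, dst))
}
```

Parsing `0,1` blocks only the 0→1 direction, while `0,1,1,0` blocks both, which is the asymmetry the table describes.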
Sample invocations
# Single-node "consensus" — leader is itself, every proposal decides instantly.
paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5
# Three-node happy path.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# Symmetric partition between 0 and 1 plus 0 and 2 — node 0 is isolated.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 3 \
--partition 0,1,0,2,1,0,2,0
Canonical scenarios
These are the six configurations that cross_test.sh runs. Each
combination is a known-stable byte fingerprint; if any of them
changes, you have changed semantics. Expect the cross-test to fail
while the three implementations disagree, and expect the hashes to
stop matching the table in docs/observation.md until you understand why.
| Name | Flags | Notes |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | happy path, 3-node, no partition |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | 5-node, longer schedule, more decisions |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | leader election only; no proposals |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | single-node; quorum = self |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | node 0 isolated symmetrically; {1,2} retain quorum |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | asymmetric link cut; minor degradation |
Sanity checks
If you only have ten seconds:
( cd src/rust && cargo build --release ) >/dev/null && \
( cd src/go && go build -o paxosctl_bin ./cmd/paxosctl ) >/dev/null && \
( cd src/cpp/build && cmake --build . --target paxosctl ) >/dev/null && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
<(src/go/paxosctl_bin --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
<(src/cpp/build/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
echo OK
Silence + OK = green. Any diff = divergence; jump to
docs/observation.md § Divergence runbook.
db-18 — Observation
Expected canonical hashes
Six configurations are pinned in scripts/cross_test.sh. The lab is
green iff all three binaries (Rust release, Go release, C++ Release)
emit exactly these strings on stdout (no trailing newline):
| Name | Flags | SHA-256 of canonical dump |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | 0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | 3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | 7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5 |
These are the contract. Edit any production code such that one of these strings changes and you have changed semantics; reverify end-to-end before you ship.
Walking the wire: scenario D byte-by-byte
Scenario D is the shortest possible dump (one node, five proposals,
all decided locally). Use it as a Rosetta Stone before debugging the
multi-node hashes. The layout is magic || u32 node_count || node[],
and the node payload starts at offset 12.
00..07 4453 4550 4158 3031 "DSEPAX01" magic
08..0b 01 00 00 00 node_count = 1
0c..0f 00 00 00 00 node.id = 0
10..13 rr rr 00 00 node.promised_ballot.round (round it won at)
14..17 00 00 00 00 node.promised_ballot.proposer_id (= self.id = 0)
18 02 node.role = Leader (2)
19..1c rr rr 00 00 node.my_ballot.round
1d..20 00 00 00 00 node.my_ballot.proposer_id
21..24 05 00 00 00 accept_count = 5
... 5 × {u64 slot, u32 ab.round, u32 ab.proposer_id, u32 value_len, value bytes}
... then u32 learned_count = 5 and 5 × {u64 slot, u32 value_len, value bytes}
Run:
src/rust/target/release/paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5
# e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139
To dump the raw bytes (skip the sha256 step) hack the binary to print
canonical_dump instead of sha256_hex(&canonical_dump); do it
locally only — the canonical CLI output is the sha256.
Walking the wire: scenario C (no proposals)
Scenario C runs three nodes for 500 ticks with --proposals 0.
Exactly one of them will be elected leader; nobody decides anything.
The dump therefore has accept_count == 0 and learned_count == 0
for every node. The bytes that do change between languages if you
have an iteration-order bug are the per-node promised_ballot.round
values (the elected leader's round depends on whether some other
proposer almost-elected first). If C is the failing scenario, you
have an election-timer determinism bug, not a Phase-2 bug.
Divergence runbook
If cross_test.sh prints MISMATCH scenario X, follow this script:
# 1. Capture the raw bytes from each binary. Patch paxosctl locally
# to print `canonical_dump` raw instead of sha256 hex, run once,
# then revert the patch. Save to rust.bin, go.bin, cpp.bin.
cmp -l rust.bin go.bin | head
cmp -l rust.bin cpp.bin | head
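cmp -l prints one line per differing byte: the 1-based byte offset in
decimal, then the two byte values (here rust, then go) in octal.
Subtract 1 to get the 0-based offsets used below, then map the first
offset to the field it belongs to. Node records are variable-length, so
`k*node_size` in the table is shorthand for the combined encoded size of
nodes 0..k-1: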
| Offset | Field | Likely culprit |
|---|---|---|
| 0..7 | magic "DSEPAX01" | wrong magic literal |
| 8..11 | node_count | wrong u32_le writer, wrong endianness |
| 12 + k*node_size + 0..3 | node.id | iterating nodes in wrong order (not ascending id) |
| 12 + k*node_size + 4..11 | promised_ballot | election-timer drift or wrong PRNG seed mix |
| 12 + k*node_size + 12 | role (1 byte) | enum reordered (must be Follower=0, Candidate=1, Leader=2) |
| 12 + k*node_size + 13..20 | my_ballot | step-down logic differs (e.g., resetting my_ballot to zero or not) |
| 12 + k*node_size + 21..24 | accept_count | one acceptor accepted a slot the others did not — Phase-2 message ordering bug |
| inside an accept tuple | slot | accepts iterated in receive order, not sorted by slot |
| inside an accept tuple | accepted_ballot | Phase-1 recovery used a wrong rule (e.g., last-write-wins instead of highest-ballot) |
| inside an accept tuple | value_len / value | wrong proposal scheduled at this slot — proposal-injection rule or leader-pick rule differs |
| inside the learned section | slot / value | the difference is downstream of an accept-section difference; fix that first |
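A small helper can take the guesswork out of reading `cmp -l` output. The sketch below is illustrative only (`parse_cmp_line` and `field_at` are hypothetical names); it converts one output line into a 0-based offset plus byte values, and names the fixed-position fields of the first node record.

```rust
/// Parse one `cmp -l` line: 1-based decimal offset, then two octal bytes.
/// Returns (0-based offset, left byte, right byte).
pub fn parse_cmp_line(line: &str) -> Option<(usize, u8, u8)> {
    let mut it = line.split_whitespace();
    let off = it.next()?.parse::<usize>().ok()?;
    let a = u8::from_str_radix(it.next()?, 8).ok()?;
    let b = u8::from_str_radix(it.next()?, 8).ok()?;
    Some((off.checked_sub(1)?, a, b))
}

/// Name the field at a 0-based offset, for the header and the fixed
/// prefix of node 0 only; later bytes depend on accept and value sizes.
pub fn field_at(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=11 => "node_count",
        12..=15 => "node[0].id",
        16..=23 => "node[0].promised_ballot",
        24 => "node[0].role",
        25..=32 => "node[0].my_ballot",
        33..=36 => "node[0].accept_count",
        _ => "variable-length section (accepts, learned, or a later node)",
    }
}
```

For example, a line reading `25   2   1` means 0-based offset 24, which is node 0's role byte: one binary wrote Leader (2) and the other Candidate (1).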
Tick-level diff
If cmp -l flags a divergence inside the accepts of node 1, add
eprintln!/fmt.Fprintln(os.Stderr, ...)/std::cerr lines in each
implementation at the boundaries of the suspect ticks:
// after handle() and after on_tick():
eprintln!("t={} id={} promised={:?} role={:?} my={:?} accepts={:?} learned={:?}",
          t, id, n.promised_ballot, n.role, n.my_ballot, n.accepts, n.learned);
Run all three, diff -u rust.log go.log. The first differing tick is
the bug.
Most common culprits in practice
- Forgetting to sort the Promise payload by slot. Go's `map` iteration order is randomized; you must `sort.Slice` before appending to the wire.
- Reading `next_slot` before recovering from `prepare_accepted`. If recovery doesn't update `next_slot = max + 1`, the leader will double-allocate a slot that already has a recovered accept, silently overwriting it.
- Letting `step_down` clear `promised_ballot`. Promises are forever; only `my_ballot` is candidate-state.
- Counting yourself twice in `accept_count`. Both `become_leader` and `try_decide` insert self; the second one is a no-op only if `accept_count` is a set, not a multiset.
- Iterating peers as `for p in nodes.iter()` on a `HashMap`. Use `BTreeMap` in Rust, `std::map` in C++, and explicit `for p := uint32(0); p < n; p++` in Go.
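The double-counting pitfall in the list above disappears when votes are kept in a set. A minimal sketch, with `quorum` and `record_accept` as hypothetical names:

```rust
use std::collections::BTreeSet;

/// Strict majority: n/2 + 1 with integer division.
pub fn quorum(n: u32) -> usize {
    (n / 2 + 1) as usize
}

/// Record an Accepted from `from` and report whether the slot now has
/// a quorum. Inserting the same id twice leaves the set unchanged.
pub fn record_accept(votes: &mut BTreeSet<u32>, from: u32, n: u32) -> bool {
    votes.insert(from);
    votes.len() >= quorum(n)
}
```

In a three-node cluster the leader inserting its own id twice still leaves one vote, so no decision happens until a second, distinct acceptor replies. With `n = 1` the leader's own vote is already a quorum.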
db-18 — Verification
Prerequisites
- Rust ≥ 1.74 with `cargo` on `PATH`.
- Go ≥ 1.22 (module declares `go 1.22`).
- CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
- An external `sha256sum` is not required; each binary computes its own sha256 in-process.
One command
cd db-18-paxos
bash scripts/verify.sh && bash scripts/cross_test.sh
Green is === verify OK === followed by === ALL OK ===. Anything
else is a regression.
What verify.sh does
- Rust: `cargo build --release` then `cargo test --release`. Builds `paxos18` lib + `paxosctl` binary; runs the 12 inline tests in `src/rust/src/lib.rs`. Expected output ends with `test result: ok. 12 passed`.
- Go: `go build ./...` then `go test ./...`. Builds `cmd/paxosctl` + package; runs the 11 tests in `src/go/paxos_test.go`. Expected output ends with `PASS` and `ok github.com/10xdev/dse/db18`.
- C++: `cmake -DCMAKE_BUILD_TYPE=Release ..`, `make -j`, then `./test_db18`. Builds `db18_lib`, `paxosctl`, `test_db18`; the test binary prints one line per assertion-group and ends with `ALL 11 TESTS PASSED`.
If any of these three blocks fails, the script exits non-zero and the rest does not run.
What cross_test.sh does
For each of the six canonical scenarios (A–F), it invokes the three
release binaries with identical flags, captures stdout, and asserts
rust == go == cpp byte-for-byte. The output prints the matching
hash on success; on mismatch it prints all three hashes and exits.
The script does not trust the canonical hashes from this repo
to be correct — it only enforces consistency among the three
implementations. The "is the hash also the historical fingerprint"
check happens by comparing the script's output against
docs/observation.md § Expected canonical hashes.
What green guarantees
If both scripts pass:
- Safety in the modeled environment. For every seed × scenario in the suite, no acceptor stored a decided value that contradicts another node's decided value for the same slot. The unit tests include cases for dueling proposers, partitions, and Phase-1 recovery; the cross-test sweeps the same scenarios across three independent implementations.
- Determinism. Same inputs ⇒ same canonical dump bytes, across languages and across machines (modulo endianness — all targets are little-endian).
- Liveness in the modeled environment. Scenarios A, B, D, F all include proposals and run long enough to elect a leader and decide them. Scenarios C and E exist to confirm we don't decide when we shouldn't (C has no proposals; E isolates node 0 so it must not influence the chosen value while {1,2} still carry the load).
What green does not guarantee
- Behavior outside the canonical scenarios. The state space of three-process Multi-Paxos is exponential; six fingerprints are an acceptance test, not a model checker. A real Paxos audit needs TLA+ (see `references.md` § Background reading).
- Performance. No latency or throughput is checked. Scenario A takes ~150 ticks of simulated time to decide; that is a function of the configured `ELECTION_TIMEOUT_MIN`, not a wall-clock SLA.
- Snapshotting, membership change, log compaction. None of these exist in this lab; the dump grows unboundedly in `accepts` and `learned`. db-23 covers the rest.
- Production safety primitives: leader leases, fsync barriers, on-disk checksums, recovery from torn writes, Byzantine actors. All deliberately out of scope.
Invariant assertions in code
Each implementation re-checks the lab's invariants where the cost is near-zero. The most load-bearing assertions are listed below; their firing means the test that triggered them is reporting a symptom of a Phase-1 / Phase-2 bug, not a flaky test.
| Where | Assertion | What it catches |
|---|---|---|
| `Handle::Promise` (all 3 langs) | leader ignores Promise if `b != my_ballot` | stale Promise replies from a previous Phase 1 (would inflate the quorum count and decide too early) |
| `Handle::Accepted` (all 3 langs) | leader ignores Accepted if `b != my_ballot` | same, for Phase 2 |
| `try_decide` (all 3 langs) | only the current Leader can mark a slot learned | a stepped-down node attempting to declare a decision (would split-brain `learned`) |
| Promise payload serialization (all 3 langs) | accepts iterated in ascending slot order | undetected map-iteration drift between languages |
| `canonical_dump` writer (all 3 langs) | nodes in ascending id; per-node accepts and learned in ascending slot | drift between three independent dump writers |
| Rust unit `single_node_in_three_node_partition_does_not_decide` | isolated minority must have empty `learned` | a quorum-counting bug that lets a single node decide |
| Go unit `TestMajorityRequiredToDecide` | 1-of-3 cannot decide | same, Go side |
| C++ unit `cannot_decide_in_minority` | 1-of-3 cannot decide | same, C++ side |
db-18 — Broader Ideas
The lab implements textbook Multi-Paxos with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.
Variants and refinements
Fast Paxos (Lamport 2006)
Removes the hop through the leader on the happy path by letting
any proposer broadcast Accept directly to the acceptors. The cost:
the fast-path quorum must be ⌈3n/4⌉ instead of ⌊n/2⌋ + 1, so 4-of-5
instead of 3-of-5. When two proposers collide on the fast path, the
system falls back to classic Paxos. Worth implementing as db-18b once
you are fluent in the classic version: it reuses the entire wire
format and only changes the proposer-side state machine.
EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013)
Drops the leader entirely. Each command records its dependencies on recently issued, conflicting commands and commits in one RTT if no conflict is found, two RTTs otherwise. The "deterministic simulator + three implementations" discipline you build here is what makes EPaxos's notoriously subtle conflict-detection logic testable at all. Production deployments are much rarer than for Multi-Paxos or Raft.
Generalized Paxos (Lamport, MSR-TR-2005-33)
Allows commutative commands to be partially ordered concurrently, not totally ordered serially. The state-machine layer must explicitly declare command commutativity. Precursor to EPaxos. Operationally similar to CRDTs at the storage layer (db-21) but with hard consensus underneath.
Vertical Paxos (Lamport, Malkhi, Zhou, PODC 2009)
Separates the "agree on the value at slot S" problem from the "agree on the membership of the acceptor set at slot S" problem, by delegating reconfiguration to an auxiliary master. Cleaner than joint-consensus (Raft's approach) and Lamport's preferred way to do membership changes. db-23 will revisit.
Flexible Paxos (Howard 2016, dissertation 2019)
Observation: the two quorums in Paxos don't have to be majorities.
They only have to intersect. So Phase-1 quorum + Phase-2 quorum
just have to sum to more than n. Production payoff: you can run
with a smaller Phase-2 quorum (lower latency on the common path)
in exchange for a larger Phase-1 quorum (higher cost during
leadership churn). A great teaching variant to layer on top of
this lab once the canonical hashes are stable.
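The intersection rule above reduces to one inequality for simple counting quorums. A minimal sketch, assuming quorums are defined purely by size; the function names quorums_intersect and min_phase1_quorum are hypothetical and not part of the lab's code:

```rust
/// Counting-quorum check: any Phase-1 quorum of size `q1` overlaps any
/// Phase-2 quorum of size `q2` among `n` acceptors exactly when q1 + q2 > n.
pub fn quorums_intersect(n: u32, q1: u32, q2: u32) -> bool {
    let in_range = |q: u32| q >= 1 && q <= n;
    // Widen to u64 so the sum cannot overflow for large n.
    in_range(q1) && in_range(q2) && u64::from(q1) + u64::from(q2) > u64::from(n)
}

/// Smallest Phase-1 quorum that stays safe for a chosen Phase-2 quorum,
/// or None when `q2` is not a valid quorum size for `n` acceptors.
pub fn min_phase1_quorum(n: u32, q2: u32) -> Option<u32> {
    if q2 == 0 || q2 > n {
        return None;
    }
    Some(n - q2 + 1)
}
```

For a 5-node cluster, shrinking the Phase-2 quorum to 2 forces a Phase-1 quorum of 4, which is the latency-for-churn trade described above.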
Production systems to study
Google Chubby
Five-replica Paxos lock service powering Google's lookup infrastructure (DNS, leader election for other services). Chandra et al.'s Paxos Made Live (PODC 2007) is the canonical writeup of what it took to turn the algorithm into a system: leader leases, snapshots every few minutes, master-side group membership, three generations of disk-corruption handling. Read alongside this lab once green.
Google Spanner
Multi-Paxos per shard. Spanner's contribution above Multi-Paxos is TrueTime — a clock API with bounded uncertainty that lets the system serve external-consistency-preserving reads without a Paxos round. The Paxos layer itself is exactly the algorithm you've implemented, plus production hardening.
Apache Cassandra LWT
Lightweight Transactions use Paxos to give linearizable CAS-style updates on top of Cassandra's eventually-consistent replication. Cassandra picks a fresh ballot per request, so it pays the Phase-1 cost every time and never amortizes it: a clean illustration of the Multi-Paxos tradeoff in reverse.
Microsoft Azure Service Fabric
Uses a Paxos variant (Smart Actors) under the hood for ring-leader election and replicated state services. Less publicly documented; the architectural papers are paywalled behind ASE/SOSP, but worth chasing for an industrial counterpoint.
Apache ZooKeeper (ZAB)
Not strict Paxos but in the same family. ZAB layers epoch+counter
on top of a primary-order protocol; the zxid pair is the direct
analogue of this lab's Ballot. db-19 builds it.
Performance experiments worth running
The deterministic simulator is too clean for real performance work, but the simulator's ticks are a fine unit of cost for comparative experiments:
- Phase-1 amortization sweep. For nodes ∈ {3,5,7,9}, run proposals = 50 and count the number of ticks to decide the last slot. The expected curve is linear in nodes for the first decision (Phase 1 costs a broadcast round-trip per acceptor) and constant per slot thereafter (Phase 2 RTT).
- Election-timer jitter sensitivity. Vary ELECTION_TIMEOUT_SPAN and measure how often dueling proposers ping-pong before someone wins. The textbook answer is "wider jitter = fewer collisions = fewer ballot bumps", and the simulator lets you confirm it without networking.
- Quorum recomputation latency. For Flexible Paxos configurations, plot Phase-2 latency against Phase-1 quorum size. Howard 2016 has the analytical curve; you can ground-truth it.
- Comparison to Raft (db-17). Same flags, same scenarios, same measurement. The lab structure is identical on purpose.
What "production-quality" would require beyond this lab
- Disk durability. A real acceptor fsyncs promised_ballot, accepts, and (depending on design) learned before sending a Promise or Accepted reply. Without that, a crash-restart cycle can silently retract a promise and break safety.
- Snapshotting. accepts and learned grow forever in this lab. A real system periodically snapshots the state machine and garbage-collects entries below the snapshot index. The snapshot itself must be agreed on by Paxos (or by a separate snapshot coordinator), which is a whole other lab.
- Membership reconfiguration. Adding/removing acceptors safely is non-trivial: you must either run two configurations in parallel during the transition (Raft's joint consensus) or delegate the membership decision to a higher layer (Vertical Paxos). db-23 picks this up.
- Leader leases. Production Paxos systems give the current leader a time-bounded lease to serve reads without consulting acceptors. This requires a clock model with bounded drift (Spanner's TrueTime, or weaker lease-renewal protocols), which is orthogonal to consensus per se but tightly coupled in real deployments.
- Witness / arbiter nodes. Some deployments allow a third node to hold no data but break tie-vote symmetry. Implementing this while keeping safety proofs sound requires care.
- Recovery from disk corruption. Real-world failure modes include silent bit-rot of promised_ballot. The defensive posture is to checksum every persisted record and treat a checksum failure as "I may have promised or accepted anything": the acceptor stops voting until it has been re-synced as a fresh member, which costs liveness during recovery. Treating the failure as "I've never voted for anything" is not safe, because it can silently retract a promise.
- Observability. Live systems need per-slot decision latency histograms, per-acceptor promise rejection counters, leader flap detection. The canonical dump is the right shape of observability but the real one runs continuously rather than on-demand.
db-18 step 01 — Single-decree Paxos
Goal
Build the two-phase Paxos protocol for one slot. A proposer must be able to drive a value to a decision in the presence of competing proposers, and an acceptor's recorded state must be exactly enough for the next proposer to recover any value that might have already been chosen. The byte layout of acceptor state must be identical across Rust, Go, and C++.
Tasks
- Ballot. Define Ballot { round: u32, proposer_id: u32 } with lexicographic ordering (round first, then proposer_id as tie-break). Provide a Ballot::ZERO constant equal to (0, 0). Every comparison in the rest of the protocol uses this order.
- PaxosNode skeleton. Each node carries:
  - id: u32, n: u32 (cluster size), quorum = n/2 + 1.
  - role: Role (Follower / Candidate / Leader).
  - promised_ballot: Ballot, the highest ballot ever promised (one value, shared across all slots in this Multi-Paxos style).
  - my_ballot: Ballot, this proposer's current attempt.
  - accepts: BTreeMap<Slot, (Ballot, Vec<u8>)>, holding for each slot the highest-ballot accept observed.
  - learned: BTreeMap<Slot, Vec<u8>>, the decided values in slot order.
- on_prepare(ballot). If ballot >= promised_ballot, set promised_ballot = ballot and reply Promise { accept_ok = true, accepted = [(slot, ab, value) for every entry in accepts] }. Otherwise reply Promise { accept_ok = false, accepted = [] }. The full walk over accepts is what makes Phase 1 the recovery step.
- on_promise. A proposer collects promises until it has a quorum. For each slot mentioned in any promise, it adopts the value of the highest-ballot accept (Paxos safety property P2c). For slots with no prior accept, the proposer is free to use its own pending value. The proposer then transitions to Leader and broadcasts Accept for every slot in its working set.
- on_accept(ballot, slot, value). If ballot >= promised_ballot, update promised_ballot = ballot, overwrite accepts[slot] = (ballot, value), reply Accepted { accept_ok = true }. Otherwise reply Accepted { accept_ok = false }. Note that an accepted value is never unaccepted, only superseded by a higher-ballot accept on the same slot.
- on_accepted. A proposer that collects a quorum of accept_ok = true for the same (slot, ballot) learns the value and broadcasts Decided { slot, value }. Learners (every node) record learned[slot] = value on Decided.
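The acceptor half of the tasks above fits in a few lines. This is a minimal sketch rather than the lab's reference code: the Acceptor struct, the Slot = u64 alias (chosen to match the u64 slot in the canonical dump), and the Option / bool return types standing in for the Promise and Accepted replies are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Derived Ord compares fields in declaration order:
/// round first, then proposer_id as the tie-break.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

impl Ballot {
    pub const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };
}

pub type Slot = u64; // assumption: matches u64_le(slot) in the dump

/// Hypothetical acceptor-only view of a PaxosNode.
pub struct Acceptor {
    pub promised_ballot: Ballot,
    pub accepts: BTreeMap<Slot, (Ballot, Vec<u8>)>,
}

impl Acceptor {
    pub fn new() -> Self {
        Acceptor { promised_ballot: Ballot::ZERO, accepts: BTreeMap::new() }
    }

    /// Phase 1b. Some(prior accepts, ascending slot) stands for
    /// accept_ok = true; None stands for a nack.
    pub fn on_prepare(&mut self, ballot: Ballot) -> Option<Vec<(Slot, Ballot, Vec<u8>)>> {
        if ballot < self.promised_ballot {
            return None;
        }
        self.promised_ballot = ballot;
        Some(self.accepts.iter().map(|(s, (b, v))| (*s, *b, v.clone())).collect())
    }

    /// Phase 2b. Returns accept_ok. A stale ballot leaves state untouched.
    pub fn on_accept(&mut self, ballot: Ballot, slot: Slot, value: Vec<u8>) -> bool {
        if ballot < self.promised_ballot {
            return false;
        }
        self.promised_ballot = ballot;
        self.accepts.insert(slot, (ballot, value));
        true
    }
}
```

Because BTreeMap iterates in key order, the Promise payload comes out in ascending slot order without an explicit sort.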
Acceptance
Inline unit tests in each language. Names below are the Rust form; Go uses TestSha256KnownVectors style, C++ uses test_sha256_known_vectors:
- sha256_known_vectors: empty, "abc", and the lazy-dog vector all hash to the well-known constants. Locks the SHA-256 implementation to RFC 6234.
- dueling_proposers_higher_ballot_wins: acceptor promises (1,1), then (1,2) arrives and is promised; a stale Accept at (1,1) is nacked. Verifies promised_ballot monotonicity.
- promise_carries_prior_accept_for_recovery: acceptor with a prior accept at ballot b1 on slot 0 receives a Prepare at higher ballot b2; the Promise must include the (0, b1, value) tuple so the new leader can re-propose the value. This is P2c.
- majority_required_to_decide: proposer in a 5-node cluster with only 2 of 5 accepts must not call the slot decided; the third accept tips it over the threshold.
- ballot_ordering_is_lexicographic: (1,9) < (2,0), (1,0) < (1,1), ZERO < (0,1). Locks the comparator.
All five green in Rust, Go, and C++.
Discussion prompts
- Quorum intersection. Why must any two quorums share at least one acceptor? Walk through what breaks if a 4-node cluster used quorum size 2 instead of 3.
- Why P2c. Suppose Phase 1 returned just accept_ok without the list of prior accepts. Construct a 3-node run where a value v is chosen, then a higher-ballot proposer chooses a different value w. Why does carrying prior accepts forward in the Promise prevent this?
- Ballots vs terms. Raft's term is a single u64. Paxos's ballot is (round, proposer_id). What does the proposer_id tie-break buy you that a single counter would not, and why does Raft not need it?
db-18 step 02 — Multi-Paxos and the replicated log
Goal
Generalise single-decree Paxos into a log. A stable leader runs Phase 1 once, then drives a sequence of slots through Phase 2 only — that is the entire point of Multi-Paxos. Newly elected leaders must recover any partially-accepted slots before issuing new proposals, so the log stays contiguous and every committed prefix is identical on every replica.
Tasks
- Leader election trigger. A Follower or Candidate whose election_deadline elapses bumps my_ballot.round = max(my_ballot.round, promised_ballot.round) + 1, sets my_ballot.proposer_id = self.id, transitions to Candidate, and broadcasts Prepare { ballot: my_ballot }. Election deadline is reset with the same splitmix64 jitter formula as Raft: t + 150 + splitmix64(seed ^ id ^ t) % 150.
- become_leader. On collecting quorum promises for my_ballot, transition to Leader, then:
  - Compute next_slot = max(slot in any promise.accepted) + 1, defaulting to max(learned.keys()) + 1 if no accepts were reported.
  - For every slot in [0, next_slot) that appears in any promise: adopt the value of the highest-ballot accept and broadcast Accept { my_ballot, slot, value } (this is the recovery sweep: it re-proposes potentially-chosen values under the new ballot).
  - Call drain_pending to attach pending client values to the next free slots, broadcasting Accept for each.
- Heartbeat. A Leader whose heartbeat_due elapses broadcasts Heartbeat { ballot: my_ballot }. Followers reset their election timers on any inbound RPC from the current leader. This is what makes Multi-Paxos amortise Phase 1: as long as heartbeats arrive, no one starts a new ballot.
- Decided broadcast. When a leader's try_decide(slot) sees a quorum of accept_ok=true for the slot's ballot, it marks learned[slot] = value and broadcasts Decided { slot, value } to every node. Learners record the value in learned; the leader does not need to re-decide on receipt.
- Lowest-id leader rule. When tests inspect "the" leader of a cluster, they pick the Leader with the lowest id. This is a deterministic tie-break for the (rare) case where two nodes briefly both believe themselves leader during a flap; the safety invariants do not depend on at-most-one Leader, only on at-most-one chosen value per slot.
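The recovery sweep in become_leader is a fold over the promise payloads. A minimal sketch under stated assumptions: recovery_plan is a hypothetical helper, each promise is represented as a plain Vec of (slot, ballot, value) tuples, and an empty learned map is taken to mean next_slot = 0.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

pub type Slot = u64;
pub type Accepted = (Slot, Ballot, Vec<u8>);

/// Merges the `accepted` lists of a quorum of promises.
/// Returns (values to re-propose, keyed by slot; next free slot).
pub fn recovery_plan(
    promises: &[Vec<Accepted>],
    learned: &BTreeMap<Slot, Vec<u8>>,
) -> (BTreeMap<Slot, Vec<u8>>, Slot) {
    // Per slot, keep only the accept with the highest ballot.
    let mut best: BTreeMap<Slot, (Ballot, Vec<u8>)> = BTreeMap::new();
    for promise in promises {
        for (slot, ballot, value) in promise {
            let replace = match best.get(slot) {
                Some((seen, _)) => ballot > seen,
                None => true,
            };
            if replace {
                best.insert(*slot, (*ballot, value.clone()));
            }
        }
    }
    // next_slot: one past the highest reported accept, else one past the
    // highest learned slot, else 0 (assumption for an empty log).
    let next_slot = match best.keys().next_back() {
        Some(max) => max + 1,
        None => learned.keys().next_back().map_or(0, |max| max + 1),
    };
    let sweep = best.into_iter().map(|(slot, (_, value))| (slot, value)).collect();
    (sweep, next_slot)
}
```

The leader would then broadcast Accept under my_ballot for every entry in the returned map before draining pending client values into slots from next_slot upward.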
Acceptance
Inline unit tests in each language:
- single_node_decides_every_proposal: a 1-node cluster (quorum 1) with proposals = 3 ends with learned = [(0, "val-0"), (1, "val-1"), (2, "val-2")]. Degenerate case but verifies the leader path.
- three_node_elects_single_leader: Cluster::new(42, 3) after 500 ticks with zero proposals has exactly one node in role Leader.
- three_node_replicates_proposals: Cluster::new(42, 3) after 1000 ticks with proposals = 5 has every node's learned of length 5 and byte-identical to node 0's.
- multi_slot_log_is_contiguous: 10 proposals on a 3-node cluster yield slot keys 0..10 on every node, no gaps.
- partition_heals_progress_resumes: drop all traffic between node 0 and the other two; the surviving pair {1, 2} still elects a leader and decides 4 proposals. Demonstrates that Multi-Paxos liveness depends on some quorum being connected, not on the original leader being reachable.
All five green in Rust, Go, and C++.
Discussion prompts
- Amortisation. Why is the Phase 1 cost paid only at leader change in Multi-Paxos but on every decision in single-decree Paxos? What is the steady-state message count per decision on a 5-node cluster?
- Leader leases. Real systems (Spanner, Chubby) layer a lease on top of Multi-Paxos so the leader can serve linearizable reads without quorum. What changes in the safety argument if you serve reads off the leader without a lease?
- Recovery cost. A new leader must walk every acceptor's full accepts map for the recovery sweep. What is the message size in bytes for a log with 1M slots and 256-byte values? What optimisations (truncation, snapshots, min_slot exchange) would you add for production?
db-18 step 03 — Cross-language determinism
Goal
The Rust, Go, and C++ implementations must, given the same
(seed, nodes, rounds, proposals, partition) CLI inputs, produce
the byte-identical canonical dump and therefore the same
SHA-256. This is the third leg of the lab: protocol correctness
plus simulator determinism plus serialisation discipline.
Tasks
- Discrete-event simulator. A Cluster owns a min-heap of pending RPCs keyed (delivery_time, src, seq). seq is a monotonically increasing per-cluster counter assigned at send time, breaking ties when two RPCs from the same sender become deliverable on the same tick. Every send pushes onto the heap; every tick pops everything due, dispatches it via node.handle(rpc, src, t, &mut out), and pushes any reply RPCs back onto the heap with delivery = t + 1 + splitmix64(seed ^ src ^ dst ^ seq) % 3.
- Iteration discipline. All iteration over collections of nodes, slots, or peers must be in sorted order. Rust uses BTreeMap / BTreeSet exclusively. Go uses sort.Slice / sort.Ints before every loop over a map's keys. C++ uses std::map / std::set. A single iteration over a hash map anywhere in the protocol path is enough to make the dumps diverge across languages within roughly 2000 ticks.
- Partition modelling. The Cluster carries a Drop: Set<(u32, u32)> of dropped unidirectional edges. The CLI flag --partition s,d,s,d,... parses pairs and inserts them. Asymmetric partitions are intentional: --partition 0,1 only drops 0→1 traffic, not 1→0. Scenario F exercises this.
- Canonical dump. canonical_dump(&cluster) writes:

      "DSEPAX01"                                   (8 bytes magic)
      u32_le(node_count)
      for each node in ascending id:
        u32_le(id)
        u32_le(promised_ballot.round)
        u32_le(promised_ballot.proposer_id)
        u8(role)                                   (Follower=0, Candidate=1, Leader=2)
        u32_le(my_ballot.round)
        u32_le(my_ballot.proposer_id)
        u32_le(accepts_len)
        for each (slot, (ballot, value)) in accepts, ascending slot:
          u64_le(slot)
          u32_le(ballot.round)
          u32_le(ballot.proposer_id)
          u32_le(value_len)
          value bytes
        u32_le(learned_len)
        for each (slot, value) in learned, ascending slot:
          u64_le(slot)
          u32_le(value_len)
          value bytes

  Hash the bytes with SHA-256, print lowercase hex, no trailing newline.
- CLI: paxosctl. Each language ships a binary that accepts --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition s,d,...], runs the cluster for rounds ticks with proposals scheduled at tick = (i+1) * rounds / (proposals+1), value = b"val-" + itoa(i), dumps canonical bytes, prints the hex SHA-256.
- scripts/cross_test.sh. Builds all three binaries, runs the 6 scenarios A–F against each, compares the three hashes to the canonical table, and exits non-zero on mismatch. The script ends with === ALL OK === on success.
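The (delivery_time, src, seq) heap order from the simulator task can be had from the standard library alone. A minimal sketch, assuming the caller has already computed delivery_time; the Network type and its methods are hypothetical, and a real simulator would carry the RPC payload where this sketch carries only dst.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Heap entry ordered as (delivery_time, src, seq); `dst` rides along.
/// seq is unique, so the trailing field never influences the order.
type Entry = (u64, u32, u64, u32);

pub struct Network {
    seq: u64,
    heap: BinaryHeap<Reverse<Entry>>, // Reverse turns the max-heap into a min-heap
}

impl Network {
    pub fn new() -> Self {
        Network { seq: 0, heap: BinaryHeap::new() }
    }

    /// Schedules a message; seq is assigned at send time and never reused.
    pub fn send(&mut self, delivery_time: u64, src: u32, dst: u32) {
        let seq = self.seq;
        self.seq += 1;
        self.heap.push(Reverse((delivery_time, src, seq, dst)));
    }

    /// Pops every (src, dst) due at or before tick `t`, in heap-key order.
    pub fn pop_due(&mut self, t: u64) -> Vec<(u32, u32)> {
        let mut due = Vec::new();
        while let Some(&Reverse((when, src, _, dst))) = self.heap.peek() {
            if when > t {
                break;
            }
            self.heap.pop();
            due.push((src, dst));
        }
        due
    }
}
```

Tuple ordering is lexicographic in all three languages' standard libraries, which is why pinning the key to a plain tuple keeps the delivery sequence identical across implementations.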
Acceptance
Inline unit tests in each language:
- dump_deterministic_across_runs: two independent Cluster::new(42, 3) instances each run 1000 ticks with 5 proposals produce byte-identical dumps. Confirms intra-language determinism.
- Scenario A: --seed 42 --nodes 3 --rounds 1000 --proposals 5 → 0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63
- Scenario B: --seed 7 --nodes 5 --rounds 2000 --proposals 20 → 3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead
- Scenario C: --seed 99 --nodes 3 --rounds 500 --proposals 0 → f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0
- Scenario D: --seed 1 --nodes 1 --rounds 200 --proposals 5 → e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139
- Scenario E: --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 → 674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096
- Scenario F: --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 → 7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5
All six match across Rust, Go, and C++; bash scripts/cross_test.sh
exits 0 with === ALL OK ===.
Discussion prompts
- Sort discipline. Find the language-default hash map in your language. What is its iteration order? What is the cost of replacing it with the language's ordered map for the canonical dump path only versus everywhere?
- SplitMix64. Why is splitmix64 a good fit for a deterministic simulator clock when something like rand::thread_rng() is not? Walk through the three constants: what are they and why?
- Three languages. What classes of bug does the cross-language test catch that a single-language test cannot? (Hint: think signed-vs-unsigned overflow, default hash randomisation, iteration order, integer-promotion rules in comparisons.)
db-19 — ZAB (ZooKeeper Atomic Broadcast)
This lab implements ZAB — the atomic broadcast protocol that drives
Apache ZooKeeper — in Rust, Go, and C++, all three producing a
byte-identical sha256 of a canonical cluster dump for any
(seed, nodes, rounds, proposals, partition) configuration. It inherits
the deterministic-simulator discipline of db-16 and db-17: same
splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break,
same "sorted iteration on the wire" rule.
Where db-17 Raft taught you that one consensus algorithm can be pinned
down to a single byte sequence across three languages, db-19 ZAB does
the same exercise for a different algorithm with a meaningfully
different recovery story: an explicit Discovery / Synchronization phase
between leader election and steady-state broadcast, and a transaction
identifier (zxid) that pairs an epoch with a counter rather than
Raft's single monotonic term + index.
What is it?
ZAB (Reed & Junqueira, LADIS 2008; Junqueira, Reed & Serafini, DSN 2011) is the primary-backup atomic broadcast protocol that ZooKeeper uses to keep its replicated state machine consistent. It is not a generic consensus library; it was designed specifically for ZooKeeper's workload: a small, well-known cluster (3, 5, 7 nodes), a small in-memory state machine, and a strong primary-order guarantee that arbitrary client requests served by the same primary are delivered in the order the primary chose.
ZAB decomposes into four phases. Phase 0 is the original FastLeaderElection; later papers fold it into Phase 1.
- Leader election (FastLeaderElection). Every node starts in Looking. Each node broadcasts its current vote, initially for itself, carrying (last_zxid, server_id). On receiving a peer vote, a Looking node updates its own vote to that peer if (peer.last_zxid, peer.id) > (own.last_zxid, own.id) lexicographically, then re-broadcasts. When a quorum of voters agree on the same target, that node is elected: it transitions to Leading, everyone else who voted for it transitions to Following.
- Discovery. The new prospective leader picks a fresh new_epoch = max(accepted_epoch, current_epoch) + 1, sets its own accepted_epoch = new_epoch, and broadcasts NewEpoch(new_epoch). Each follower that accepts updates its accepted_epoch and replies with AckEpoch(current_epoch, last_zxid). Once a quorum of AckEpochs arrives, the leader knows the highest (current_epoch, last_zxid) in the surviving quorum; that node's history is the one that must survive.
- Synchronization. The leader bumps its current_epoch = new_epoch, resets the per-epoch counter, and broadcasts NewLeader(new_epoch, history), the whole history that this epoch will start from. Followers replace their local history with the leader's, set current_epoch = new_epoch, and reply AckLeader(new_epoch). On a quorum of AckLeaders the leader declares itself synced and broadcasts Commit(last_zxid_of_history) so followers can advance last_committed to the end of the synced tail.
- Broadcast (steady state). Structurally close to Raft's replication phase, modulo names. For each client proposal, the leader assigns zxid = (current_epoch, ++counter), appends to its history, and broadcasts Propose(txn). Followers append in zxid order and reply Ack(zxid). On Ack quorum the leader broadcasts Commit(zxid). Heartbeats are implemented as periodic re-sends of the last Commit (or NewLeader during pre-sync); receiving one from the current leader refreshes the follower's election timer.
The simulator drives sim time forward in integer ticks; messages are
scheduled into a heap with deterministic (delivery_time, sender, seq)
order; an optional partition set drops messages in named directions.
Why does it matter?
- ZAB is the algorithm under ZooKeeper, and ZooKeeper is the coordination kernel under Kafka (pre-KRaft), HBase, Hadoop YARN, Mesos, Druid, and a long list of other production systems. Knowing exactly how the NewLeader / Sync handshake works is the difference between operating ZooKeeper and understanding it.
- ZAB and Raft cover the same problem with meaningfully different shapes. ZAB has an explicit recovery handshake that Raft folds into the AppendEntries consistency check; ZAB's zxid = (epoch, counter) is essentially Raft's (term, index), but the role each plays is subtly different. Implementing both back-to-back makes the contrast concrete instead of conceptual.
- Three byte-identical implementations force the spec to be unambiguous. Anywhere ZAB "depends on the implementation" (election tie-break, vote rebroadcast on update, AckEpoch idempotency, heap scheduling) has to be pinned down. The cross-language sha256 makes drift loud.
- Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leader churn, and committed transactions that triggered a bug, on any machine, in any of the three languages.
- Foundation for the rest of the track. db-20 distributed-kv will plug a consensus engine into a real key-value store; db-23 capstone composes the simulator harness across multiple replicated shards.
How does it work?
State (per node)
persistent : current_epoch : u32 # epoch of the leader we accepted into sync
accepted_epoch : u32 # epoch we've ack'd via NewEpoch (>= current_epoch)
history : Vec<Txn>
last_committed : ZxId
volatile : role : Looking | Following | Leading
leader_id : Option<u32>
election : vote_target_id : u32 # who we currently vote for
vote_target_zxid: ZxId # the (last_zxid) we voted on
vote_view : Map<voter_id, leader_id> # tally
leader-only : pending_new_epoch : u32
epoch_acks : Set<follower_id> # AckEpoch quorum tracker
leader_acks : Set<follower_id> # AckLeader quorum tracker
synced : bool
next_counter : u32 # zxid counter under current_epoch
ack_set : Map<ZxId, Set<follower_id>>
timers : election_deadline : u64 # sim-time tick
last_heartbeat_sent : u64
Election timer
reset_election_deadline(t):
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
A 150-tick base plus up to 149 ticks of seeded jitter keeps nodes from timing out in lockstep, which makes split-vote loops unlikely. Heartbeats fire every 50 ticks once a leader is synced.
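The timer formula can be sketched directly. The three constants are the ones listed under the cross-language invariants; the exact mixing steps (add the first constant, then two xor-shift-multiply rounds and a final xor-shift) follow the commonly published SplitMix64 output function and are an assumption about how the lab's splitmix64 is written.

```rust
/// Stateless SplitMix64 mix of a 64-bit input (assumed variant).
/// wrapping_* keeps the arithmetic modulo 2^64, matching unsigned
/// overflow in Go and C++.
pub fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
pub fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ u64::from(node_id) ^ t) % 150
}
```

The deadline is a pure function of (seed, node_id, t), so replaying a run with the same seed reproduces every election firing time.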
FastLeaderElection (Phase 0)
on entering Looking:
vote_target_id = self.id
vote_target_zxid = self.last_zxid()
vote_view.clear(); vote_view[self.id] = self.id
broadcast LookForLeader { self.id, self.last_zxid, current_epoch }
broadcast Vote { self.id, self.last_zxid, current_epoch, leader=self.id }
check_election()
on Vote(voter, peer_zxid, _, leader_chosen) while Looking:
if (peer_zxid, voter) > (vote_target_zxid, vote_target_id):
vote_target_id = voter
vote_target_zxid = peer_zxid
vote_view.clear(); vote_view[self.id] = voter
broadcast Vote { self.id, self.last_zxid, current_epoch, leader=voter }
vote_view[voter] = leader_chosen
check_election()
check_election():
target = vote_target_id
if count(v in vote_view.values() : v == target) >= quorum:
if target == self.id: become_leading()
else: become_following(target)
LookForLeader is structurally a Vote for the sender: it lets a
late-arriving node bootstrap a tally without waiting for the next
broadcast cycle. Non-Looking peers reply to a Vote with their own
current vote (which points at the live leader), so isolated nodes
converge fast on heal.
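The two decisions inside the election pseudocode, "does this peer's vote beat mine?" and "has my target reached quorum?", are small enough to sketch in isolation. The function names peer_wins and elected are hypothetical; ZxId ordering relies on derived Ord comparing epoch before counter.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

/// Lexicographic (last_zxid, id) tie-break: tuples compare field by field.
pub fn peer_wins(peer: (ZxId, u32), own: (ZxId, u32)) -> bool {
    peer > own
}

/// True when at least a majority of the `n` nodes currently vote for `target`.
/// vote_view maps voter id to the leader id that voter last announced.
pub fn elected(vote_view: &BTreeMap<u32, u32>, target: u32, n: usize) -> bool {
    let quorum = n / 2 + 1;
    vote_view.values().filter(|&&v| v == target).count() >= quorum
}
```

With equal zxids the higher server id wins, so a freshly started cluster with empty histories converges on the node with the largest id.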
Discovery & Synchronization (Phases 1–2)
become_leading():
role = Leading
pending_new_epoch = max(accepted_epoch, current_epoch) + 1
accepted_epoch = pending_new_epoch
epoch_acks = {self.id}
broadcast NewEpoch(pending_new_epoch)
try_finish_discovery() # handles the n=1 case immediately
on NewEpoch(e) from L:
if e > accepted_epoch:
accepted_epoch = e
if role != Following: become_following(L)
reply AckEpoch(current_epoch, last_zxid)
elif e == accepted_epoch:
reply AckEpoch(current_epoch, last_zxid) # idempotent
reset_election_deadline()
on AckEpoch from F (only if Leading):
epoch_acks += F
try_finish_discovery()
try_finish_discovery():
if |epoch_acks| < quorum: return
current_epoch = pending_new_epoch
next_counter = 0
leader_acks = {self.id}
broadcast NewLeader(current_epoch, history.clone())
try_finish_sync()
on NewLeader(e, hist) from L:
if e >= accepted_epoch:
accepted_epoch = e
current_epoch = e
history = hist # follower truncates / extends to leader's history
if role != Following: become_following(L)
reset_election_deadline()
reply AckLeader(e)
on AckLeader(e) from F (only if Leading and e == current_epoch):
leader_acks += F
try_finish_sync()
try_finish_sync():
if synced or |leader_acks| < quorum: return
synced = true
if last_zxid() > last_committed:
last_committed = last_zxid()
broadcast Commit(last_committed)
Broadcast (Phase 3)
propose(payload):
require role == Leading and synced
next_counter += 1
zxid = (current_epoch, next_counter)
history.push(Txn { zxid, payload })
ack_set[zxid] = {self.id}
broadcast Propose(Txn { zxid, payload })
try_commit(zxid) # single-node case
on Propose(txn) from L (only if Following and L == leader_id):
if txn.zxid > last_zxid():
history.push(txn)
reset_election_deadline()
reply Ack(txn.zxid)
on Ack(zxid) from F (only if Leading):
ack_set[zxid] += F
try_commit(zxid)
try_commit(zxid):
if zxid <= last_committed: return
if |ack_set[zxid]| >= quorum:
last_committed = zxid
broadcast Commit(zxid)
on Commit(zxid) from L:
if L is current leader:
reset_election_deadline()
if last_committed < zxid <= last_zxid():
last_committed = zxid
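The leader-side commit rule above can be sketched as a small ack tracker. The LeaderAcks type and its method names are hypothetical; the Option return value stands in for "broadcast Commit(zxid)".

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

/// Hypothetical leader-only slice of the node state.
pub struct LeaderAcks {
    self_id: u32,
    quorum: usize,
    pub last_committed: ZxId,
    ack_set: BTreeMap<ZxId, BTreeSet<u32>>,
}

impl LeaderAcks {
    pub fn new(self_id: u32, n: usize) -> Self {
        LeaderAcks {
            self_id,
            quorum: n / 2 + 1,
            last_committed: ZxId { epoch: 0, counter: 0 },
            ack_set: BTreeMap::new(),
        }
    }

    /// propose(): the leader counts itself, then tries to commit so the
    /// single-node cluster commits without any follower Ack.
    pub fn on_propose(&mut self, zxid: ZxId) -> Option<ZxId> {
        let me = self.self_id;
        self.on_ack(zxid, me)
    }

    /// Records an Ack; a set (not a counter) makes duplicate Acks harmless.
    pub fn on_ack(&mut self, zxid: ZxId, from: u32) -> Option<ZxId> {
        self.ack_set.entry(zxid).or_default().insert(from);
        self.try_commit(zxid)
    }

    /// Returns Some(zxid) exactly once, when the quorum is first reached.
    fn try_commit(&mut self, zxid: ZxId) -> Option<ZxId> {
        if zxid <= self.last_committed {
            return None;
        }
        let acks = self.ack_set.get(&zxid).map_or(0, |s| s.len());
        if acks < self.quorum {
            return None;
        }
        self.last_committed = zxid;
        Some(zxid)
    }
}
```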
Simulator loop (per tick t in 0..rounds)
1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader : pick (Leading and synced, lowest id); call propose
3. deliver in-flight : pop heap entries with delivery_time <= t
4. tick all nodes : iterate in ascending id; on_tick may fire election or heartbeat
Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Each payload is the byte string "zab-<i>"
(plain decimal, no padding). Deterministic, evenly spread, and
independent of cluster behaviour.
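The schedule formula is worth computing once by hand to confirm the integer division. A minimal sketch; proposal_schedule is a hypothetical helper that returns (tick, payload) pairs.

```rust
/// schedule[i] = (i+1) * rounds / (K+1) with integer division;
/// payload i is the ASCII bytes of "zab-<i>" in unpadded decimal.
pub fn proposal_schedule(rounds: u64, k: u64) -> Vec<(u64, Vec<u8>)> {
    (0..k)
        .map(|i| ((i + 1) * rounds / (k + 1), format!("zab-{}", i).into_bytes()))
        .collect()
}
```

For rounds = 1000 and K = 5 the ticks are 166, 333, 500, 666, 833: the division truncates, so the proposals are evenly spread but not symmetric around the midpoint.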
Wire format (Rpc)
Nine variants. The simulator never serializes RPCs — it passes them as typed values in memory. The only bytes that ever get hashed are the canonical dump.
LookForLeader { src_id, last_zxid, peer_epoch }
Vote { voter_id, last_zxid, peer_epoch, leader_id }
NewEpoch { new_epoch }
AckEpoch { current_epoch, last_zxid }
NewLeader { new_epoch, history: Vec<Txn> }
AckLeader { new_epoch }
Propose { txn: Txn }
Ack { zxid }
Commit { zxid }
Canonical dump format
file := magic[8 = "DSEZAB01"] u32_le(node_count) node*
node := u32_le id
u8 role # Looking=0, Following=1, Leading=2
u32_le current_epoch
u32_le accepted_epoch
u32_le last_zxid.epoch
u32_le last_zxid.counter
u32_le last_committed.epoch
u32_le last_committed.counter
u32_le history_len
txn * history_len
txn := u32_le zxid.epoch
u32_le zxid.counter
u32_le payload_len
u8 payload[payload_len]
Nodes appear in ascending id order. All multi-byte numbers are
little-endian. The dump is hashed with SHA-256; the lowercase hex
digest is what zabctl prints (no trailing newline).
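A writer for the layout above is mostly a sequence of little-endian pushes. This is a sketch under assumptions: NodeState is a hypothetical flat view of a node, role is passed as the already-encoded u8, last_zxid is taken from the last history entry and assumed to be (0, 0) for an empty history, and the SHA-256 step is left out because the Rust standard library has no hash for it.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

pub struct Txn {
    pub zxid: ZxId,
    pub payload: Vec<u8>,
}

/// Hypothetical dump-facing view of one node.
pub struct NodeState {
    pub id: u32,
    pub role: u8, // Looking=0, Following=1, Leading=2
    pub current_epoch: u32,
    pub accepted_epoch: u32,
    pub last_committed: ZxId,
    pub history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

pub fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    // Sort by id so the caller's container order cannot leak into the bytes.
    let mut sorted: Vec<&NodeState> = nodes.iter().collect();
    sorted.sort_by_key(|n| n.id);

    let mut out = Vec::new();
    out.extend_from_slice(b"DSEZAB01");
    put_u32(&mut out, sorted.len() as u32);
    for n in sorted {
        // Assumption: an empty history reports last_zxid = (0, 0).
        let last = n.history.last().map_or(ZxId { epoch: 0, counter: 0 }, |t| t.zxid);
        put_u32(&mut out, n.id);
        out.push(n.role);
        put_u32(&mut out, n.current_epoch);
        put_u32(&mut out, n.accepted_epoch);
        put_u32(&mut out, last.epoch);
        put_u32(&mut out, last.counter);
        put_u32(&mut out, n.last_committed.epoch);
        put_u32(&mut out, n.last_committed.counter);
        put_u32(&mut out, n.history.len() as u32);
        for txn in &n.history {
            put_u32(&mut out, txn.zxid.epoch);
            put_u32(&mut out, txn.zxid.counter);
            put_u32(&mut out, txn.payload.len() as u32);
            out.extend_from_slice(&txn.payload);
        }
    }
    out
}
```

A node with an empty history contributes 33 bytes, and each transaction adds 12 bytes plus its payload, which gives a quick length check before comparing hashes.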
Primary-order property
ZAB's defining guarantee, distinct from generic atomic broadcast, is
primary order: if a primary (leader) broadcasts proposals p then
q in that order, every follower that delivers both delivers p
before q. This is enforced trivially by the leader's monotonically
increasing next_counter and the follower's txn.zxid > last_zxid()
gate on Propose. Primary order is a per-primary property; across
leadership changes the guarantee is provided by the Discovery / Sync
handshake that explicitly chooses the surviving primary's history.
Cross-language invariants
| Invariant | Why it matters |
|---|---|
| splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output |
| election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
| delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
| heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
| peers iterated in ascending id (BTreeMap / std::map / explicit loop) | identical broadcast order |
| vote_view keyed by voter id, iterated in ascending id | identical election tally |
| election tie-break: lexicographic (last_zxid, voter_id) | identical leader choice |
| leader-pick for proposal injection: Leading && synced && min id | identical client routing |
| proposal schedule (i+1)*rounds/(K+1); payload "zab-<i>" unpadded decimal | identical pending queue contents |
| propose() calls try_commit() | identical last_committed for n=1 |
| Role enum order Looking=0, Following=1, Leading=2 | identical dump bytes |
| dump magic "DSEZAB01"; role is u8, all other integers u32 LE; nodes in ascending id | identical dump bytes |
If any one of these drifts, scripts/cross_test.sh will fail and
cmp on the two raw dumps will print the byte offset of the first
divergence (cmp -l lists every differing byte).
Files
- src/rust/: zab19 crate + zabctl binary.
- src/go/: module github.com/10xdev/dse/db19 + cmd/zabctl.
- src/cpp/: db19_lib static library + zabctl binary + test_db19.
- scripts/verify.sh: runs the unit tests for all three.
- scripts/cross_test.sh: proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.
See docs/ for the long-form write-up and steps/ for the staged
implementation path.
db-19 — References
Primary sources
- Benjamin Reed and Flavio P. Junqueira, A simple totally ordered broadcast protocol, LADIS 2008. The original ZAB paper — short, workshop-length, and the only place that describes the algorithm in the exact "Phase 0 / 1 / 2 / 3" shape it took inside ZooKeeper. https://dl.acm.org/doi/10.1145/1529974.1529978
- Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini, Zab: High-performance broadcast for primary-backup systems, DSN 2011. The peer-reviewed, formal treatment. Defines the primary order property, gives the proof obligations, and folds the original Phase 0 into Phase 1. This is the paper to cite when arguing the correctness of any particular handshake decision. https://marcoserafini.github.io/papers/zab.pdf
- Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed, ZooKeeper: Wait-free coordination for Internet-scale systems, USENIX ATC 2010. Describes the system (znodes, sessions, watches, the wait-free API) that ZAB exists to support. Useful for understanding why ZAB was designed with primary order rather than as a generic consensus library. https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf
Implementations to read alongside
- Apache ZooKeeper (Java): the canonical implementation. The classes worth reading first are `FastLeaderElection`, `Leader`, `Follower`, `Learner`, `LeaderZooKeeperServer`, and `FollowerRequestProcessor`. The election logic lives in `FastLeaderElection.lookForLeader()`; the discovery/sync handshake in `Leader.lead()` and `Follower.followLeader()`. https://github.com/apache/zookeeper
- Kafka KRaft (KIP-500): Kafka's replacement for ZooKeeper-based metadata. KRaft is Raft, not ZAB; reading the KIP is useful for understanding why ZooKeeper's biggest user finally moved off of it (operational complexity, not algorithm correctness). https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
Determinism and simulation
- db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here. The `(delivery_time, sender, seq)` heap and the splitmix64-seeded jitter are the same discipline.
- The ZooKeeper test suite (`zookeeper/src/java/test/.../quorum/`) uses scripted scenarios but is not deterministic in the cross-language sense this lab aims for. Worth reading as an example of how the production team tests the algorithm.
Background reading worth doing
- Heidi Howard, Distributed consensus revised, Cambridge PhD dissertation, 2019; the 2019 paper A Generalised Solution to Distributed Consensus (with Richard Mortier) unifies Paxos, Raft, and ZAB under a single quorum-intersection framework. Helps see ZAB as one point in a design space rather than as an oddball. https://www.cl.cam.ac.uk/~hh360/
- Leslie Lamport, Paxos Made Simple, 2001. The contrast with ZAB is illuminating: Paxos picks a value per slot; ZAB streams a totally ordered log under a primary. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
- Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 — the Raft paper. Read this before the ZAB papers if you have not already; the comparison in db-17's CONCEPTS.md is the recommended on-ramp. https://raft.github.io/raft.pdf
- André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice, Aalto University seminar notes, 2012. A 14-page treatment of ZAB-vs-implementation gotchas; useful when the papers feel terse. https://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf
Cross-lab dependencies
- Upstream:
  - db-16 (distributed-fundamentals): Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale.
  - db-17 (Raft): same simulator skeleton; reading Raft first makes ZAB's discovery/sync handshake feel like the explicit version of Raft's implicit AppendEntries consistency check.
  - db-18 (Paxos): the other consensus reference point; ZAB's `(epoch, counter)` is the streaming-log analog of Paxos's `(ballot, slot)` numbering.
- Downstream:
  - db-20 (Distributed KV): wraps a consensus engine (could be Raft, ZAB, or Paxos from this track) around a key-value state machine.
  - db-21 (Storage-engine-advanced): snapshots and log compaction on top of the canonical history laid down here.
  - db-23 (Capstone): composes the simulator harness across multiple replicated shards.
db-19 — Analysis
Required invariants
- Election agreement. At most one node finishes a successful election cycle with `role == Leading && synced` per epoch. Enforced by majority voting in `vote_view` plus the strictly increasing `pending_new_epoch = max(accepted_epoch, current_epoch) + 1` rule: any competing prospective leader sees a higher `accepted_epoch` and steps down (via `NewEpoch` rejection) before it can sync.
- Primary order. If a single primary broadcasts proposals `p` then `q`, every follower that delivers both delivers `p` before `q`. Enforced by the leader's monotonically increasing `next_counter` (no gaps, no reuse within an epoch) plus the follower's `txn.zxid > last_zxid()` gate on `Propose` (out-of-order proposals are silently dropped rather than re-ordered into the log).
- Integrity. The leader only proposes once it is `synced`, and followers only append once `current_epoch` has been adopted via `NewLeader`. Followers will not append a `Propose` whose `zxid <= last_zxid()`, so a stale leader's late `Propose` for an already-superseded epoch cannot corrupt a follower's history.
- Agreement on committed prefix. If a follower has `last_committed = z`, every other follower's history contains every txn with `zxid <= z`. This holds because `Commit(z)` is only broadcast after a quorum has appended every txn up through `z`, and a future leader must include any quorum's committed prefix: the Discovery `AckEpoch(last_zxid)` reports make it adopt the surviving history.
- Total order. All followers deliver committed transactions in the same order (the leader-assigned zxid order). This follows directly from primary order + agreement on committed prefix.
- Byte determinism. For every `(seed, nodes, rounds, proposals, partition)` tuple, the three binaries produce identical `canonical_dump` bytes, hence identical sha256 hex on stdout. `scripts/cross_test.sh` checks six scenarios.
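The follower-side gate behind primary order and integrity can be sketched as follows. This is a minimal illustration under assumed type shapes (`ZxId`, `Txn`, and the free functions `last_zxid` / `on_propose` are stand-ins, not the lab's actual signatures):

```rust
/// Field order matters: deriving Ord compares epoch first, then counter.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Txn {
    pub zxid: ZxId,
    pub payload: Vec<u8>,
}

/// zxid of the last history entry, or (0, 0) for an empty history.
pub fn last_zxid(history: &[Txn]) -> ZxId {
    history
        .last()
        .map_or(ZxId { epoch: 0, counter: 0 }, |t| t.zxid)
}

/// Follower-side gate: append only if the proposal is strictly newer than
/// the tail. Returns true when appended, false when silently dropped.
pub fn on_propose(history: &mut Vec<Txn>, txn: Txn) -> bool {
    if txn.zxid > last_zxid(history) {
        history.push(txn);
        true
    } else {
        false
    }
}
```

Because the gate is a strict comparison against the tail, a duplicate or late `Propose` can never reorder or overwrite an existing entry; it can only be dropped.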
Design decisions
- `propose()` calls `try_commit()` at the end. Same single-node argument as db-17 Raft: a one-node cluster is its own majority, no Ack will ever arrive to drive the commit, so the leader must run the quorum check inline. Harmless for `n > 1` because the majority check rejects until acks actually arrive.
- Sorted iteration on every wire-affecting loop. Rust uses `BTreeMap` / `BTreeSet`; C++ uses `std::map` / `std::set`; Go uses explicit `for p := uint32(0); p < n; p++` loops. The Go code also sorts before iterating wherever a `map[uint32]...` is read for output (the canonical dump and broadcast loops). HashMap would compile and pass single-language tests but fail `cross_test.sh` immediately.
- `LookForLeader` is structurally a `Vote` for the sender. The Rust `handle` arm folds `LookForLeader` directly into the `Vote` arm. This avoids a separate `LookForLeaderReply` and gives a late-arriving Looking node the ability to tally an immediate self-vote from the source. The Go and C++ implementations do the same fold.
- Non-Looking nodes reply to a `Vote` with their own current vote. An isolated Looking node sending a `Vote` to peers who are already Following gets back a `Vote` pointing at the live leader; combined with the lex-update rule, the isolated node converges on the existing leader in O(1) round-trips after partition heal, rather than starting a new election.
- `Vote` lex comparison is `(last_zxid, voter_id)`, not `(last_zxid, leader_id)`. The voter's id is the tie-breaker when histories are equal; this is what makes the highest-id node win a cold-start election. Using `leader_id` instead would create a fixed-point where any node can vote for any leader and the tally never makes progress.
- `pending_new_epoch = max(accepted_epoch, current_epoch) + 1`. The `max` covers the case where this node has previously acknowledged a `NewEpoch` for a leader that then failed before reaching sync. Without the `max`, the new leader could pick an epoch that some follower has already rejected, leaving sync stalled forever.
- `AckEpoch` is idempotent. A follower that has already adopted `accepted_epoch = e` replies again on a re-sent `NewEpoch(e)`. This keeps the discovery handshake robust against the heartbeat-driven re-send loop in `on_tick` while the leader is still gathering acks.
- `NewLeader` ships the whole history, not a diff. Following the ZAB paper. For a study lab this is fine; production ZooKeeper uses `SNAP` / `DIFF` / `TRUNC` variants to avoid bulk transfer. Replacing this with a diff would be a substantial change to the RPC layer and is out of scope.
- Heartbeats re-broadcast the last `Commit`. Once synced, the leader re-broadcasts `Commit(last_committed)` every 50 ticks. This doubles as the "leader is alive" signal that resets the follower's election timer. Sending a dedicated `Heartbeat` RPC would be one more wire variant for no behavioural gain.
- Proposal schedule is closed-form. `schedule[i] = (i+1) * rounds / (K+1)` (integer division). Same rationale as db-17: decoupling proposal timing from cluster scheduling decisions keeps the dump bytes from depending on incidental tick alignment.
- Library + thin CLI. The lab exposes `Cluster::new`, `run`, `canonical_dump`, and `sha256` as a library. The CLI is a few dozen lines of arg parsing plus four function calls.
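The first decision in the list, running the quorum check inside `propose()`, can be illustrated with a small sketch. The `Outstanding` struct and these free-function signatures are hypothetical simplifications; the lab's real `propose` / `try_commit` live on the node and carry more state:

```rust
use std::collections::BTreeSet;

/// Majority size for an n-node cluster.
pub fn quorum(n_nodes: u32) -> usize {
    (n_nodes / 2 + 1) as usize
}

/// Hypothetical leader-side record for one outstanding proposal.
pub struct Outstanding {
    pub acks: BTreeSet<u32>, // ids of nodes that have appended the txn
    pub committed: bool,
}

/// Quorum check: commit once a majority (leader included) has acked.
pub fn try_commit(p: &mut Outstanding, n_nodes: u32) -> bool {
    if !p.committed && p.acks.len() >= quorum(n_nodes) {
        p.committed = true;
    }
    p.committed
}

/// propose() seeds the ack set with the leader's own id, then runs the
/// quorum check inline so a one-node cluster commits without any Ack RPC.
pub fn propose(leader_id: u32, n_nodes: u32) -> Outstanding {
    let mut p = Outstanding {
        acks: BTreeSet::from([leader_id]),
        committed: false,
    };
    try_commit(&mut p, n_nodes);
    p
}
```

With `n_nodes = 1` the proposal returns already committed; with `n_nodes = 3` it stays pending until a second id lands in `acks` and `try_commit` runs again.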
Tradeoffs worth flagging
- No snapshots, no SNAP/DIFF/TRUNC. The leader sends the full history on every `NewLeader`. For the bounded `rounds` of this lab the cost is trivial; for production ZooKeeper it would be prohibitive on large datasets. Snapshots are deferred to db-21.
- No client sessions, no znodes, no watches. ZAB exists to serve ZooKeeper, but this lab implements ZAB in isolation. The "state machine" is the history vector itself. Anything ZooKeeper-API-shaped (sessions, ephemerals, watches, ACLs) is downstream of the consensus core and lives in a different lab.
- Crash semantics are stylized. Crashes are simulated only via the `partition` flag (drop all messages in one direction). A real ZooKeeper must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.
- No `Observer` role. Production ZooKeeper has non-voting Observer servers that learn from the leader but do not participate in quorum. They are pure read-fanout and add no algorithmic complexity, so they were left out of the simulator.
- No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's `cluster_pending` queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.
- Follower truncation is by replacement, not by prefix-match. When a follower receives `NewLeader(e, hist)`, it adopts `hist` wholesale, even if its own history shares a prefix. This is correct (the leader's history is authoritative for the new epoch) but heavier than necessary; a real implementation would diff.
Why three languages
Same reasoning as db-16 and db-17, plus one new lesson specific to
ZAB: the algorithm has three quorum-tracking sets that are easy to
get subtly wrong (epoch_acks for discovery, leader_acks for sync,
and the per-zxid ack_set for broadcast). Each set must be iterated
in a stable order for the dump, and each must include the leader's
own id on initialization. The cross-language test catches both
mistakes immediately: forgetting to add self.id to epoch_acks
costs a tick of discovery time that perturbs every downstream
delivery and changes the dump bytes.
db-19 — Execution
One-shot: prove the lab works
cd db-19-zab
./scripts/verify.sh # all unit tests in Rust, Go, C++
./scripts/cross_test.sh # byte-identical sha256 across all three, six scenarios
A green run of cross_test.sh ends with the literal line:
=== ALL OK ===
Per-language workflows
Rust
cd src/rust
cargo test --release # ~10 tests
cargo build --release # produces target/release/zabctl
./target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
Go
cd src/go
go test ./... # ~9 tests
go build -o /tmp/zabctl_go ./cmd/zabctl
/tmp/zabctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5
C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build # test_db19 — 10 assertions
./build/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
CLI
All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:
| flag | default | meaning |
|---|---|---|
| --seed N | 0 (Go) / 42 (Rust) / 0 (C++) | splitmix64 seed mixed into election timers and message delays |
| --nodes K | 3 | number of ZAB nodes (1 is legal; majority is then 1) |
| --rounds R | 0/1000 | number of simulator ticks to run |
| --proposals P | 0 | number of client commands to inject during the run |
| --partition s,d,... | none | comma-separated pairs (src, dst) to drop in that direction |
(Flag defaults drift between langs because the cross-test script always passes every flag explicitly. Only behavior under explicit flags is part of the cross-language contract.)
--partition 0,1,1,0 drops both directions between nodes 0 and 1
(complete split); --partition 0,1 drops only 0 → 1 (asymmetric).
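As an illustration of the pair semantics, here is a hedged sketch of how the `--partition` argument could be parsed into a set of one-way drops. `parse_partition` and `is_dropped` are hypothetical names; the lab's actual argument handling may differ:

```rust
use std::collections::BTreeSet;

/// Parse "s,d,s,d,..." into a set of one-way (src, dst) drops.
/// Returns None on a non-numeric field or an odd number of fields.
pub fn parse_partition(arg: &str) -> Option<BTreeSet<(u32, u32)>> {
    let ids: Vec<u32> = arg
        .split(',')
        .map(|s| s.trim().parse::<u32>().ok())
        .collect::<Option<Vec<u32>>>()?;
    if ids.len() % 2 != 0 {
        return None;
    }
    Some(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// A message src -> dst is dropped iff that directed pair is in the set.
pub fn is_dropped(drops: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    drops.contains(&(src, dst))
}
```

Keeping the pairs directed is what lets `0,1` model an asymmetric link while `0,1,1,0` models a full split between the same two nodes.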
Proposals are spaced as schedule[i] = (i+1) * rounds / (K+1) with
integer division, where K here is the --proposals count (not the node
count); with --rounds 1000 --proposals 5 they fire at ticks 166, 333,
500, 666, 833 with payloads "zab-0" through "zab-4".
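The schedule arithmetic is small enough to check directly. A minimal sketch, with `schedule` and `payload` as illustrative helper names rather than the lab's API:

```rust
/// Tick at which proposal i (0-based) of k is injected:
/// (i + 1) * rounds / (k + 1), using integer division.
pub fn schedule(rounds: u64, k: u64) -> Vec<u64> {
    (0..k).map(|i| (i + 1) * rounds / (k + 1)).collect()
}

/// Payload for proposal i: "zab-<i>" with unpadded decimal.
pub fn payload(i: u64) -> String {
    format!("zab-{}", i)
}
```

For example, `schedule(1000, 5)` yields `[166, 333, 500, 666, 833]`, matching the ticks quoted above.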
Canonical scenarios
scripts/cross_test.sh runs six scenarios; their sha256s are listed
in docs/observation.md. If any change, cross_test.sh will exit
non-zero.
| label | args |
|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 |
D exercises the single-node-leader code path that motivated the
propose() → try_commit() call. E isolates node 0 completely; the
other two must elect a leader and commit the remaining proposals
(the surviving quorum's history is what ends up in node 1 and 2's
dump). F is an asymmetric partition that causes epoch churn but
recoverable replication.
Sanity checks
# Pick any scenario and round-trip — the hash is content-defined.
./src/rust/target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect: 16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d
# Magic of the canonical dump (use the lib directly; the CLI hashes it):
# - Rust: TestDumpDeterministicAcrossRuns asserts da.starts_with("DSEZAB01").
# - Go: TestDumpDeterministicAndMagic asserts the same.
# - C++: test_dump_deterministic_and_magic in tests/test_db19.cc.
Tunables (CONCEPTS.md cross-reference)
- `HEARTBEAT_INTERVAL = 50`: leader re-broadcasts last `Commit` every 50 ticks.
- `ELECTION_TIMEOUT_MIN = 150`, `ELECTION_TIMEOUT_SPAN = 150`: base + jitter for follower election deadline.
- `DELIVERY_DELAY_SPAN = 3`: message delivery delay is `1 + splitmix64(seed ^ src ^ dst ^ t) % 3` ticks.
Changing any of these changes every canonical hash. The intent is that the lab is a fixed-point study object: the values are part of the contract.
db-19 — Observation
What the cross-language test produces and how to read it by hand.
Expected sha256s
scripts/cross_test.sh runs six scenarios and asserts the three
binaries (Rust, Go, C++) all print the same hex digest. The current
canonical hashes are:
| label | args | sha256 |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | 16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | b60388e978a9b98792edb00c8d33217da8bff9945a89d2c0c18b5f69520b91cf |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | 8aef7604639fe0f2b349b38d74e10b6da8ac252b626976563bba69c722426296 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | d4dbb92f91f9a0adf0c4c0b91fa46b2a5145907450897cd6473a02a6279604fd |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 5e4dbddb605e469c99fb682c00256445dcb2ed07e984f673d4296ef19719979a |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | c9df583bd714534c488aac710e6cc6e57e4b21d2fe96ec17068bd1c7525bc1b3 |
If any of these change, cross_test.sh will fail. Either you have a
bug, or you have intentionally changed the spec (timer constants,
schedule formula, dump layout) and you must update this table in the
same commit.
What the canonical dump looks like (scenario D — single node)
--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into
a single-node cluster — the leader is itself the majority, so every
proposal commits immediately and discovery/sync collapse to a no-op
(quorum reached on self.id).
offset 0x00 : 44 53 45 5A 41 42 30 31 "DSEZAB01" magic
offset 0x08 : 01 00 00 00 1 node_count
offset 0x0c : 00 00 00 00 0 node id
offset 0x10 : 02 role = Leading (2)
offset 0x11 : XX XX XX XX current_epoch (== 1 if no churn)
offset 0x15 : XX XX XX XX accepted_epoch (== current_epoch)
offset 0x19 : XX XX XX XX last_zxid.epoch (== current_epoch)
offset 0x1d : 05 00 00 00 last_zxid.counter = 5
offset 0x21 : XX XX XX XX last_committed.epoch
offset 0x25 : 05 00 00 00 last_committed.counter = 5
offset 0x29 : 05 00 00 00 history_len = 5
offset 0x2d : XX XX XX XX history[0].zxid.epoch
offset 0x31 : 01 00 00 00 history[0].zxid.counter
offset 0x35 : 05 00 00 00 history[0].payload_len = 5
offset 0x39 : 7A 61 62 2D 30 "zab-0" payload
...
Each history entry, including the first, is 4 + 4 + 4 + 5 = 17 bytes
(epoch + counter + len + "zab-N"). Total dump for D is therefore
0x2d + 5 * 17 = 0x82 = 130 bytes. Exact bytes depend on whatever
epoch the leader has bumped through by the time the run ends; the
single-node case is nearly always current_epoch = 1.
A multi-node dump (scenario C — quiet cluster)
--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the
cluster elects a leader, runs through discovery + sync, then
heartbeats for the rest of the run. Every node's history is empty:
44 53 45 5A 41 42 30 31 magic
03 00 00 00 node_count = 3
00 00 00 00 node id 0
XX role (Following or Leading)
XX XX XX XX current_epoch (1 if first election succeeded clean)
XX XX XX XX accepted_epoch
00 00 00 00 00 00 00 00 last_zxid (0, 0)
00 00 00 00 00 00 00 00 last_committed (0, 0)
00 00 00 00 history_len = 0
01 00 00 00 node id 1
... same shape ...
02 00 00 00 node id 2
... same shape ...
Total dump: 8 + 4 + 3 * (4 + 1 + 4 + 4 + 4 + 4 + 4 + 4 + 4) = 111 bytes. (33 bytes per node with empty history.)
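The two layouts above can be cross-checked with a small encoder. This is a sketch under assumed struct shapes (`NodeState` and `Txn` here are simplified stand-ins for the lab's types, with zxids as plain tuples); it follows the field order shown in the byte listings:

```rust
pub struct Txn {
    pub epoch: u32,
    pub counter: u32,
    pub payload: Vec<u8>,
}

pub struct NodeState {
    pub id: u32,
    pub role: u8, // Looking=0, Following=1, Leading=2
    pub current_epoch: u32,
    pub accepted_epoch: u32,
    pub last_zxid: (u32, u32),      // (epoch, counter)
    pub last_committed: (u32, u32), // (epoch, counter)
    pub history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

/// Encode nodes (assumed already sorted by ascending id): magic, node
/// count, then per node a fixed 33-byte header followed by its history.
pub fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = b"DSEZAB01".to_vec();
    put_u32(&mut out, nodes.len() as u32);
    for n in nodes {
        put_u32(&mut out, n.id);
        out.push(n.role);
        let fields = [
            n.current_epoch,
            n.accepted_epoch,
            n.last_zxid.0,
            n.last_zxid.1,
            n.last_committed.0,
            n.last_committed.1,
            n.history.len() as u32,
        ];
        for v in fields {
            put_u32(&mut out, v);
        }
        for t in &n.history {
            put_u32(&mut out, t.epoch);
            put_u32(&mut out, t.counter);
            put_u32(&mut out, t.payload.len() as u32);
            out.extend_from_slice(&t.payload);
        }
    }
    out
}
```

Under this layout a single Leading node holding five "zab-N" entries encodes to 130 bytes, and three nodes with empty histories encode to 111 bytes.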
How to debug a divergence
If cross_test.sh fails, write the raw dumps to disk (the CLI prints
only the hash; you'll need a one-liner that calls canonical_dump
directly, or modify zabctl.rs / main.go / zabctl.cc to dump the
raw bytes instead of the hash). Then:
cmp -l /tmp/zab_A_rust.bin /tmp/zab_A_go.bin | head
xxd /tmp/zab_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/zab_A_go.bin | sed -n '<line>,+2p'
The first divergence offset tells you what to look at:
| offset range | likely culprit |
|---|---|
| 0x00–0x07 | magic (typo: DSEZAB01 not DSEZAB1 or DSEZAB02) |
| 0x08–0x0b | node_count (impossible if all three accept --nodes correctly) |
| inside a node block, on role | enum mapping (Looking=0, Following=1, Leading=2) |
| inside a node block, on current_epoch / accepted_epoch | discovery handshake bug; the leader's pending_new_epoch likely didn't max() against current_epoch |
| inside a node block, on last_zxid | counter reset on epoch change wrong (must reset to 0; first new proposal has counter 1) |
| inside a node block, on last_committed | try_commit quorum count wrong, or propose() not calling try_commit (n=1 case) |
| inside history_len | follower Propose filter wrong (out-of-order zxid not dropped), or NewLeader replacement not adopting leader's history |
| inside a history entry | broadcast loop iteration order — must be ascending peer id |
In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
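If the raw dumps are already in memory (for example inside a test), the first step of the runbook can be done without shelling out to `cmp`. A minimal sketch; `first_divergence` and `header_field` are hypothetical helpers, and `header_field` only classifies the fixed header of node 0, because later offsets depend on history lengths:

```rust
/// Offset (0-based) of the first differing byte, or the shorter dump's
/// length if one is a strict prefix of the other. None means identical.
pub fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Name the field an offset falls in, for the first 0x2d bytes only.
pub fn header_field(offset: usize) -> &'static str {
    match offset {
        0x00..=0x07 => "magic",
        0x08..=0x0b => "node_count",
        0x0c..=0x0f => "node[0].id",
        0x10 => "node[0].role",
        0x11..=0x14 => "node[0].current_epoch",
        0x15..=0x18 => "node[0].accepted_epoch",
        0x19..=0x20 => "node[0].last_zxid",
        0x21..=0x28 => "node[0].last_committed",
        0x29..=0x2c => "node[0].history_len",
        _ => "history or a later node",
    }
}
```

Note that `cmp -l` reports 1-based decimal byte numbers, whereas this helper is 0-based, so subtract one before comparing against the offsets in the table.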
Tick-level scope (Rust REPL trick)
To watch a scenario from the inside, add this temporary print at the
start of the per-tick loop body in Cluster::run (it needs the tick
variable t in scope):

    if std::env::var("ZAB_TRACE").is_ok() {
        eprintln!(
            "t={} roles={:?} epochs={:?} commits={:?}",
            t,
            self.nodes.iter().map(|n| n.role).collect::<Vec<_>>(),
            self.nodes.iter().map(|n| n.current_epoch).collect::<Vec<_>>(),
            self.nodes.iter().map(|n| n.last_committed.counter).collect::<Vec<_>>(),
        );
    }

then run ZAB_TRACE=1 zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5 2>&1 | head -50.
The trace goes to stderr, so the 2>&1 is needed for head to see it; the
canonical dump's sha256 is still printed to stdout unchanged. Remove
before commit.
Reading the hashes themselves
The hashes are arbitrary — they are SHA-256 of a binary blob whose
bytes encode every node's state at the end of the run. There is no
way to look at 16af5aa6... and infer anything about the cluster.
What matters is that the same input produces the same output in three
languages and that the table above doesn't drift unintentionally.
For human-readable insight, dump canonical_dump(&c) to a file and
run xxd over it, or print individual node states in a test rather
than at the CLI surface.
db-19 — Verification
Prerequisites
- Rust ≥ 1.74 with `cargo` on `PATH`.
- Go ≥ 1.22 (module declares `go 1.22`).
- CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
- A POSIX `sha256sum` is not required: each binary computes its own sha256 in-process.
One command
cd db-19-zab
bash scripts/verify.sh && bash scripts/cross_test.sh
Green is === db-19 :: ALL UNIT TESTS GREEN === followed by
=== ALL OK ===. Anything else is a regression.
What verify.sh does
- Rust: `cargo test --release --quiet` over `src/rust/`. Builds `db19` lib + `zabctl` binary; runs the inline tests in `src/rust/src/lib.rs`. Expected: `test result: ok`.
- Go: `go test ./...` over `src/go/`. Builds the `db19` package + `cmd/zabctl`; runs `src/go/zab_test.go`. Expected: `PASS` and `ok github.com/10xdev/dse/db19`.
- C++: `cmake -DCMAKE_BUILD_TYPE=Release -B build`, `cmake --build build --target test_db19 zabctl`, then `ctest --test-dir build --output-on-failure`. The test binary ends with `Test #1: test_db19 ........ Passed`.
If any of these three blocks fails, the script exits non-zero and the rest does not run.
What cross_test.sh does
For each of the six canonical scenarios (A–F) it invokes the three
release zabctl binaries with identical flags, captures the
lowercase-hex sha256 of the canonical cluster dump, and asserts
rust == go == cpp byte-for-byte. The scenarios are:
| label | args | what it exercises |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | basic 3-node, 5 proposals, clean network |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | bigger cluster, longer horizon |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | election convergence only |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | degenerate single node (instant leader) |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 3-node with churn |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | 5-node, asymmetric one-way drop |
Canonical hashes are listed in docs/observation.md. The script
asserts consistency among the three ports; it is docs/observation.md
that pins them to the historical fingerprint.
What green guarantees
- Determinism. Same flags ⇒ same canonical dump bytes across languages and runs (modulo endianness; all targets are little-endian). The simulator advances in integer ticks; all map/set iteration is over `BTreeMap` / sorted Go slices / `std::map` so the dump order is fixed.
- Safety in the modeled environment. No two nodes commit different histories. For every scenario in the suite, after the final tick:
  - `last_committed.epoch` is monotonic per node.
  - Where two nodes' `history` overlap by zxid, the bytes match.
  - No follower has committed past `last_committed` reported by the leader of its current epoch.
- Liveness in the modeled environment. Scenarios A, B, D, F include proposals and run long enough to elect a leader and commit them. Scenarios C and E confirm we don't commit what we shouldn't (C has zero proposals; E partitions away the would-be leader so the alternative path must take over).
What green does not guarantee
- Behavior outside the canonical scenarios. ZAB's state space is large; six fingerprints are an acceptance test, not a model checker. Real validation needs TLA+ (see `references.md`).
- Performance. No latency or throughput is measured. Tick count is simulation cost, not wall-clock SLA.
- Snapshotting / log compaction. Histories grow unboundedly; ZooKeeper truncates via snapshots, which is out of scope here.
- Production safety primitives: fsync barriers, on-disk checksums, recovery from torn writes, byzantine actors. All deliberately deferred.
- Real network. Partitions are modeled as a `BTreeSet` of one-way drops applied at delivery; reordering happens through the simulator's priority queue, not a lossy or out-of-order network. There is no actual socket.
Invariant assertions in code
The implementations carry inline assertions where they are nearly free. The load-bearing ones:
| Where | Assertion | What it catches |
|---|---|---|
| Leader on_ack | refuse acks for zxids not in our outstanding set | duplicate / replayed acks inflating quorum |
| update_vote (election) | only adopt votes with greater (last_zxid, id) | non-monotone vote drift |
| handle_new_epoch | followers must reply only if new_epoch >= accepted_epoch (equality covers a re-sent NewEpoch, since AckEpoch is idempotent) | accepting a stale epoch from a deposed leader |
| handle_new_leader | followers replace history only if new_epoch > current_epoch | losing already-committed entries |
| canonical_dump writer (all 3 langs) | nodes in ascending id, per-node history in ascending zxid | dump-writer drift between languages |
The unit tests assert each of these on at least one path.
db-19 — Broader Ideas
The lab implements textbook ZAB (epoch + counter, leader-driven broadcast, discovery + sync on leadership change) with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.
Variants and refinements
ZAB-with-snapshots
Production ZooKeeper periodically truncates history by snapshotting
the in-memory state machine and dropping txns whose zxid is below
the snapshot's. Followers that fall behind the leader's snapshot are
fast-forwarded with SNAP (whole-state copy) rather than DIFF
(replay tail). Worth implementing as db-19b — it reuses the wire
format and adds a Snap { zxid, state_bytes } RPC alongside the
existing NewLeader payload.
Fast Leader Election (production form)
Real ZooKeeper's FLE has tie-breaking by peer epoch (the highest
epoch this voter has ever seen) before falling back to (last_zxid, id). The lab uses just (last_zxid, id) which is enough for safety
but loses an optimization: a node that just lost leadership often
still has the highest peer-epoch and should regain leadership
quickly. Worth a db-19c.
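A hedged sketch of what the extended comparison could look like in a db-19c: the vote is ordered by peer epoch first, then `last_zxid`, then id. `FleVote` and `supersedes` are hypothetical names introduced for illustration and are not part of the lab:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FleVote {
    pub peer_epoch: u32,
    pub last_zxid: (u32, u32), // (epoch, counter)
    pub id: u32,
}

/// True if `new` should replace `cur`: compare peer epoch first, then
/// last_zxid, then id, all lexicographically via tuple ordering.
pub fn supersedes(new: FleVote, cur: FleVote) -> bool {
    (new.peer_epoch, new.last_zxid, new.id) > (cur.peer_epoch, cur.last_zxid, cur.id)
}
```

Dropping the `peer_epoch` component from both tuples recovers the `(last_zxid, id)` rule the lab uses today.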
Observer mode
Observers receive Commit but never vote in elections or quorums.
ZooKeeper added them at scale to push read traffic past the
voter-set throughput ceiling without inflating quorum sizes. The
simulator extension is small: add a Role::Observer, exclude it
from quorum counts, still deliver every Commit.
Read-only mode (RO clients during partition)
When a quorum dies but some nodes remain, ZooKeeper exposes those survivors in a read-only mode that serves last-known committed state. This is a useful failure-mode case for the simulator: drop into RO when no quorum responds within an election cycle.
Cross-epoch zxid ordering
Production ZAB stuffs (epoch, counter) into one u64 (32 bits
each). The lab uses a struct for clarity; switching to the packed
form is a one-line change and would let zxid live in atomic
operations on real hardware. Worth a benchmark in db-22.
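A minimal sketch of the packed form, assuming the epoch occupies the high 32 bits so that plain integer comparison agrees with the lexicographic struct ordering (`pack` and `unpack` are illustrative names):

```rust
/// Pack (epoch, counter) into one u64 with the epoch in the high 32 bits.
pub fn pack(epoch: u32, counter: u32) -> u64 {
    (u64::from(epoch) << 32) | u64::from(counter)
}

/// Inverse of `pack`: split a u64 back into (epoch, counter).
pub fn unpack(z: u64) -> (u32, u32) {
    ((z >> 32) as u32, z as u32)
}
```

Because the epoch sits in the high bits, `pack(1, 0) > pack(0, u32::MAX)` holds, which is the same answer the `(epoch, counter)` struct comparison gives.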
Production systems to study
Apache ZooKeeper
The canonical implementation. Read the DSN 2011 ZAB paper (Junqueira,
Reed, Serafini: Zab: High-performance broadcast for primary-backup
systems) alongside the source in org.apache.zookeeper.server.quorum.
The simulator in this lab maps directly onto Leader.java,
Follower.java, and FastLeaderElection.java.
Kafka KRaft (Raft replacement for ZooKeeper)
Confluent's argument against keeping ZooKeeper as a dependency was operational: two consensus systems (ZAB for metadata, Kafka's own ISR for log replication) doubled the failure-modes and runbooks. KIP-500 replaces ZAB with a Raft-style log inside Kafka itself. A good real-world counterpoint to read alongside db-17 (Raft).
Curator / Recipes
Apache Curator's "recipes" (locks, leader latches, distributed queues) are layered on top of ZooKeeper. They are a clinic in how to avoid misusing a primary-order primitive: every recipe pins its watch semantics and retry policy explicitly because ZK ephemeral nodes are not ACID transactions.
Etcd v2 vs v3
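Etcd is the Raft-based counterpart to ZooKeeper: it has used Raft,
not a ZAB-like broadcast, since its early releases. What changed
between v2 and v3 was the API and storage model (v3 introduced a
flat, MVCC key space served over gRPC), not the consensus family.
Comparing v2's raft.go (gone, but in git history) with v3's raft/ is
still instructive: same problem, different state machine on top.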
Chubby (Google)
Chubby is Multi-Paxos-based, not ZAB, but the lease + session model in ZooKeeper traces directly back to Chubby. Burrows's OSDI 2006 paper is the canonical writeup; read it after this lab and before db-23.
Performance experiments worth running
The simulator's ticks are a unit of cost for comparative experiments:
- Quorum-size sweep. For `nodes ∈ {3, 5, 7, 9}`, run `proposals = 50` and count ticks to commit the last proposal. Expected: commit cost rises slowly with quorum size (one extra round-trip per added node), election cost rises sharply (vote table doubles).
- Discovery+sync cost on leadership churn. Vary the partition schedule's `--partition` density. The lab's E scenario has 4 churn events in 1000 ticks; the more churn, the higher the ratio of `NewEpoch` / `NewLeader` bytes to `Propose` / `Commit` bytes in the dump. Plot that ratio.
- Comparison to Raft (db-17) and Paxos (db-18). Same flag surface (`--seed --nodes --rounds --proposals --partition`) and same scenarios; lab structure is identical on purpose. Compare scenario-A commit latency across the three protocols.
What "production-quality" would require beyond this lab
- Durable storage. `history`, `current_epoch`, `accepted_epoch` must survive `kill -9` and power loss. Real ZooKeeper writes a WAL (see db-03) and snapshots every N transactions.
- Real network. Sockets, TCP retransmits, framing, TLS, auth. The simulator's `OutMsg` collapses all of that.
- Client sessions. ZooKeeper's session-id ↔ ephemeral-node binding is a major protocol surface in its own right; not modeled here.
- Watches. The pub/sub layer on top of read-paths. Adds a fan-out table and a per-session notification queue.
- Cluster reconfiguration. Adding/removing voters safely is its own protocol (joint quorum on the membership txn). Out of scope.
- Recovery from torn writes. Per-page checksums on the WAL.
- Adversarial inputs. ZAB assumes crash (non-Byzantine) failures only. Byzantine fault tolerance needs a different protocol family (BFT state machine replication, e.g. BFT-SMaRt), which is a separate code base entirely.
db-19 step 01 — Epoch, zxid, and Fast Leader Election
Goal
Build the persistent state every ZAB node carries and the election protocol that picks the next leader when no one is currently broadcasting. Election must converge in bounded ticks for any quorum-available network, and the chosen leader must always be the node with the highest (last_zxid, id) pair in the surviving quorum.
Tasks
- ZxId { epoch: u32, counter: u32 } with lexicographic ordering (epoch first, then counter). Provide a ZxId::ZERO constant. Every zxid comparison in the rest of the protocol uses this ordering. Never compare the u64 representations directly, because the lab keeps them as a struct for clarity.
- Persistent node state. A ZabNode carries:
  - id: u32, n_nodes: u32, quorum = n_nodes/2 + 1.
  - role: Role (Looking / Following / Leading).
  - current_epoch: u32, the epoch of the leader we last followed. Bumped on NewLeader.
  - accepted_epoch: u32, the epoch we promised on NewEpoch. Always >= current_epoch.
  - history: Vec<Txn>, committed and uncommitted txns in zxid order.
  - last_committed: ZxId, the high-water mark; entries <= this have been applied.

  The last four (current_epoch, accepted_epoch, history, last_committed) are the values that would survive a crash in a real implementation. Everything else (vote tables, ack tables) is transient and rebuilt on the next election.
- Rpc::LookForLeader / Vote. A Looking node broadcasts its current vote each tick. On receiving a peer's vote, update via update_vote(peer.last_zxid, peer.id):
  - Adopt peer as our vote target if (peer.last_zxid, peer.id) > (current_vote.last_zxid, current_vote.id).
  - Record the peer's choice in vote_view[voter_id] = leader_id.
- Quorum detection. Walk vote_view and count entries whose value equals each candidate id. The first candidate (in id order) whose count >= quorum becomes the elected leader. If that leader is us, transition to Leading; otherwise transition to Following with leader_id = Some(...).
- Election timeout. Track election_deadline per node. If now > election_deadline and we're still Looking, reset the vote table and broadcast a fresh LookForLeader from our current (last_zxid, id). Reseed the deadline with ELECTION_TIMEOUT_MIN + splitmix64(seed) % ELECTION_TIMEOUT_SPAN.
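The comparator, the vote update, and the quorum walk described in the tasks above can be sketched in a few lines of Rust. This is a minimal sketch under stated assumptions, not the lab's reference code: the Vote struct, the free-function form of update_vote, and the elected_leader helper are names made up here for illustration.

```rust
use std::collections::BTreeMap;

/// Transaction id. Deriving `Ord` compares fields in declaration order,
/// which yields the lexicographic (epoch, counter) ordering the lab asks for.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

impl ZxId {
    pub const ZERO: ZxId = ZxId { epoch: 0, counter: 0 };
}

/// The vote a Looking node currently holds (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Vote {
    pub last_zxid: ZxId,
    pub leader_id: u32,
}

/// Adopt the peer's candidate only if it is strictly greater under
/// (last_zxid, id). Returns true when the vote changed.
pub fn update_vote(current: &mut Vote, peer_last_zxid: ZxId, peer_id: u32) -> bool {
    if (peer_last_zxid, peer_id) > (current.last_zxid, current.leader_id) {
        *current = Vote { last_zxid: peer_last_zxid, leader_id: peer_id };
        true
    } else {
        false
    }
}

/// Tally vote_view (voter id -> chosen leader id) and return the first
/// candidate, in ascending id order, whose count reaches quorum.
pub fn elected_leader(vote_view: &BTreeMap<u32, u32>, quorum: u32) -> Option<u32> {
    let mut tally: BTreeMap<u32, u32> = BTreeMap::new();
    for leader_id in vote_view.values() {
        *tally.entry(*leader_id).or_insert(0) += 1;
    }
    tally.into_iter().find(|&(_, n)| n >= quorum).map(|(id, _)| id)
}
```

Because both maps are BTreeMap, the tally is walked in ascending candidate id, so the "first candidate in id order" rule falls out of the container choice rather than an explicit sort.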
Acceptance
Inline unit tests in each language. Names below are the Rust form;
Go uses TestZxIdOrdering style, C++ uses test_zxid_ordering:
- zxid_ordering_is_lexicographic: ZxId{0,9} < ZxId{1,0}, ZxId{1,0} < ZxId{1,1}, ZERO < ZxId{0,1}. Locks the comparator.
- vote_adopts_higher_last_zxid: node 0 with last_zxid=(1,5) votes for itself; receives a vote from node 2 with (2,0); adopts node 2. Then receives from node 1 with (2,0) and does not re-adopt (tie on zxid, lower id loses).
- quorum_of_votes_elects_highest: in a 3-node cluster all voting for node 2, node 2 transitions to Leading after the second matching Vote arrives.
- election_does_not_decide_in_minority: partition isolates node 0 from {1,2}; node 0 must never leave Looking regardless of how many ticks elapse.
All four green in Rust, Go, and C++.
db-19 step 02 — Discovery, sync, and atomic broadcast
Goal
Layer the steady-state ZAB protocol on top of the elected leader
from step 01. The leader must bring every follower's history up
to its own before accepting new proposals; once synced, the leader
assigns a monotone zxid to each payload and commits it on majority
ack. The dump bytes must match across Rust, Go, and C++.
Tasks
- Discovery (NewEpoch/AckEpoch). The fresh leader picks new_epoch = max(self.accepted_epoch, max-peer-accepted) + 1 and broadcasts NewEpoch { new_epoch }. Each follower:
  - Asserts new_epoch > accepted_epoch (refuse stale leaders).
  - Sets accepted_epoch = new_epoch.
  - Replies AckEpoch { current_epoch, last_zxid }: the follower's own committed epoch plus the tail of its history.

  The leader waits for a quorum of AckEpoch (counting itself). At quorum, it knows the highest zxid that any majority node has committed; that becomes the new initial history.
- Sync (NewLeader/AckLeader). Leader broadcasts NewLeader { new_epoch, history } where history is the leader's own log (which must include everything any quorum member has acked, by the contract of step 1). Each follower:
  - Asserts new_epoch > current_epoch (refuse historical leaders).
  - Replaces history with the leader's payload.
  - Sets current_epoch = new_epoch.
  - Replies AckLeader { new_epoch }.

  On quorum of AckLeader, the leader broadcasts Commit for every zxid in the synced history that has not yet been committed. The cluster is now in steady state.
- Broadcast (Propose/Ack/Commit). Each step tick, if there are queued proposals and we are the leader:
  - Assign zxid = ZxId { epoch: current_epoch, counter: ++last_counter }.
  - Append Txn { zxid, payload } to local history.
  - Broadcast Propose { txn } to all followers.

  Each follower asserts txn.zxid.epoch == current_epoch and txn.zxid > history.last().zxid, then appends and replies Ack { zxid }. The leader tracks acks per zxid in a BTreeMap<ZxId, BTreeSet<u32>>. On quorum (counting itself), it broadcasts Commit { zxid } and advances last_committed.
- Apply on commit. Followers receiving Commit { zxid } advance last_committed = max(last_committed, zxid) and (in a real system) apply the txn to the state machine. The lab leaves the state machine implicit: last_committed and history are the only observable surface.
- Canonical dump. dump_cluster(nodes) = magic("DSEZAB01") || u32 node_count || dump_node(0) || dump_node(1) || ... where dump_node = id u32 || role u8 || current_epoch u32 || accepted_epoch u32 || last_zxid (epoch,counter) || last_committed (epoch,counter) || history_len u32 || [zxid, payload_len u32, payload bytes] * history_len. All integers little-endian. hash = lowercase hex sha256 of the full byte string, no trailing newline.
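The leader-side ack bookkeeping from the Broadcast task can be sketched as follows. This is an illustration only: AckTracker and its method names are assumptions, and the sketch assumes acks arrive in zxid order (FIFO channels), so it does not handle a later zxid reaching quorum before an earlier one.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

/// Leader-side ack table for the broadcast phase (hypothetical struct).
pub struct AckTracker {
    leader_id: u32,
    quorum: usize,
    acks: BTreeMap<ZxId, BTreeSet<u32>>,
    pub last_committed: ZxId,
}

impl AckTracker {
    pub fn new(leader_id: u32, n_nodes: u32) -> Self {
        AckTracker {
            leader_id,
            quorum: (n_nodes / 2 + 1) as usize,
            acks: BTreeMap::new(),
            last_committed: ZxId { epoch: 0, counter: 0 },
        }
    }

    /// The leader counts itself: record its own ack when it proposes.
    pub fn on_propose(&mut self, zxid: ZxId) {
        self.acks.entry(zxid).or_default().insert(self.leader_id);
    }

    /// Record a follower ack. Returns Some(zxid) exactly once, at the
    /// moment quorum is first reached; late acks return None.
    pub fn on_ack(&mut self, zxid: ZxId, from: u32) -> Option<ZxId> {
        if zxid <= self.last_committed {
            return None; // already committed, ignore the late ack
        }
        let set = self.acks.entry(zxid).or_default();
        set.insert(from);
        if set.len() >= self.quorum {
            self.acks.remove(&zxid);
            self.last_committed = zxid;
            Some(zxid)
        } else {
            None
        }
    }
}
```

Storing voter ids in a BTreeSet rather than a counter means a duplicated Ack from the same follower cannot push a proposal over quorum.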
Acceptance
Inline unit tests in each language:
- discovery_bumps_accepted_epoch: leader elected at accepted_epoch=0 broadcasts NewEpoch{1}; followers reach accepted_epoch=1.
- sync_replaces_follower_history_with_leader_history: follower with stale history receives NewLeader { history: leader_log } and ends with history == leader_log.
- propose_commits_on_quorum_ack: leader in 3-node cluster proposes one txn; commits after 1 follower ack (leader + 1 = 2 = quorum). The other follower's late ack does not double-commit.
- propose_does_not_commit_without_quorum: leader in 5-node cluster proposes, 1 follower acks; last_committed stays at ZxId::ZERO.
- zxid_counter_is_monotone_per_epoch: three proposals get counter 1, 2, 3; if the leader's epoch bumps (next election), counter resets to 1 under the new epoch.
- canonical_dump_is_byte_stable: same input scenario → same dump → same sha256 across two calls in the same process.
All six green in Rust, Go, and C++.
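The canonical dump layout from the last task can be sketched as a pair of byte writers. Several details here are assumptions rather than facts from the lab: the NodeView struct name, the numeric value of the role byte, encoding each zxid as epoch u32 followed by counter u32, and deriving last_zxid from the tail of history. Hashing is left out because sha256 is not in the Rust standard library.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

pub struct Txn {
    pub zxid: ZxId,
    pub payload: Vec<u8>,
}

/// The fields of a node that reach the dump (hypothetical view struct).
pub struct NodeView {
    pub id: u32,
    pub role: u8, // assumed encoding, e.g. 0 = Looking, 1 = Following, 2 = Leading
    pub current_epoch: u32,
    pub accepted_epoch: u32,
    pub last_committed: ZxId,
    pub history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

fn put_zxid(out: &mut Vec<u8>, z: ZxId) {
    put_u32(out, z.epoch);
    put_u32(out, z.counter);
}

pub fn dump_node(out: &mut Vec<u8>, n: &NodeView) {
    put_u32(out, n.id);
    out.push(n.role);
    put_u32(out, n.current_epoch);
    put_u32(out, n.accepted_epoch);
    // Assumption: last_zxid is the zxid of the final history entry, or zero.
    let last_zxid = n.history.last().map_or(ZxId { epoch: 0, counter: 0 }, |t| t.zxid);
    put_zxid(out, last_zxid);
    put_zxid(out, n.last_committed);
    put_u32(out, n.history.len() as u32);
    for t in &n.history {
        put_zxid(out, t.zxid);
        put_u32(out, t.payload.len() as u32);
        out.extend_from_slice(&t.payload);
    }
}

pub fn dump_cluster(nodes: &[NodeView]) -> Vec<u8> {
    let mut out = b"DSEZAB01".to_vec();
    put_u32(&mut out, nodes.len() as u32);
    for n in nodes {
        dump_node(&mut out, n);
    }
    out
}
```

Routing every integer through to_le_bytes keeps the output independent of host word size and byte order, which is what lets the same digest come out of all three ports.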
db-19 step 03 — Cross-language determinism
Goal
Lock the byte-level output of all three implementations
(Rust, Go, C++) to the same sha256 for every canonical scenario in
scripts/cross_test.sh. This is the difference between "ZAB works
in my language" and "ZAB is this exact state machine".
Tasks
- Deterministic RNG. splitmix64(u64) -> u64 per the spec:

    x += 0x9E3779B97F4A7C15
    z = (x ^ (x >> 30)) * 0xBF58476D1CE4E7B5
    z = (z ^ (z >> 27)) * 0x94D049BB133111EB
    out = z ^ (z >> 31)

  Every random choice in the simulator (election timeout, delivery delay, partition schedule index) consumes one splitmix64 call on a per-node counter. No language may use its own rand or math/rand or <random> defaults.
- Stable iteration. Every map iteration in election, ack tracking, and dump emission is over BTreeMap (Rust), std::map (C++), or a sorted []uint32 (Go). No HashMap / unordered_map / map[uint32] may appear in any code path that affects bytes-on-the-wire or bytes-in-the-dump.
- Delivery order. OutMsges enqueued the same tick are delivered in FIFO order per-destination and in source-id ascending order across destinations. Implement with a BinaryHeap<(deliver_at, src_id, seq_no, msg)> (Rust; wrap the key in std::cmp::Reverse, because BinaryHeap is a max-heap) and the equivalent in Go (container/heap with the same key) and C++ (std::priority_queue). The seq_no tie-breaks duplicates within the same tick.
- Partition modelling. --partition a,b,c,d,... is a list of (src, dst) one-way drops. Store as a BTreeSet<(u32, u32)>. At delivery, drop the message if (src, dst) ∈ partition_set. Symmetric partitions are expressed as 0,1,1,0. Single-arg list length must be even (no half-drop); reject odd-length input with exit code 2.
- zabctl CLI surface. All three binaries accept:

    zabctl --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition a,b,c,d,...]

  Print the lowercase-hex sha256 of dump_cluster(...) with no trailing newline. Exit code 2 on any bad flag.
- Wire-format magic. First 8 bytes of the dump are the ASCII string "DSEZAB01". Bump to "DSEZAB02" if the layout ever changes (and update docs/observation.md in the same commit).
Acceptance
scripts/cross_test.sh succeeds end-to-end on a clean checkout:
=== ALL OK ===
Each of the six scenarios A–F prints the same hex digest for Rust,
Go, and C++. The canonical hashes are pinned in
docs/observation.md — if any scenario changes you must update the
table in the same commit, with a one-line note on what shifted
(timer constant, schedule formula, dump layout).
Optional but valuable: rebuild on a second machine with a different compiler and OS (Linux/gcc vs macOS/clang) and confirm the hashes match. Endianness is not the variable in that experiment: all targets in this study track are little-endian, and the dump assumes that.
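The delivery-order and partition-modelling tasks of this step can be sketched together in Rust. This is a rough sketch, not the lab's simulator: the Network struct, the string-bodied OutMsg, and parse_partition are hypothetical names, and a real zabctl would map the Err case to exit code 2.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, BinaryHeap};

/// Hypothetical message envelope; the payload is a plain string for brevity.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct OutMsg {
    pub dst: u32,
    pub body: String,
}

pub struct Network {
    // Reverse turns the std max-heap into a min-heap on
    // (deliver_at, src_id, seq_no, msg).
    queue: BinaryHeap<Reverse<(u64, u32, u64, OutMsg)>>,
    partition: BTreeSet<(u32, u32)>,
    next_seq: u64,
}

/// Parse "a,b,c,d" into one-way (src, dst) drops.
/// Odd-length or non-numeric input is an error.
pub fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let mut ids = Vec::new();
    for part in arg.split(',') {
        let id = part
            .trim()
            .parse::<u32>()
            .map_err(|e| format!("bad id {:?}: {}", part, e))?;
        ids.push(id);
    }
    if ids.len() % 2 != 0 {
        return Err("partition list must have even length".to_string());
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

impl Network {
    pub fn new(partition: BTreeSet<(u32, u32)>) -> Self {
        Network { queue: BinaryHeap::new(), partition, next_seq: 0 }
    }

    pub fn send(&mut self, deliver_at: u64, src: u32, msg: OutMsg) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.queue.push(Reverse((deliver_at, src, seq, msg)));
    }

    /// Pop every message due at or before `now`, in key order,
    /// dropping those that travel over a partitioned (src, dst) link.
    pub fn deliver_due(&mut self, now: u64) -> Vec<(u32, OutMsg)> {
        let mut out = Vec::new();
        while let Some(Reverse(head)) = self.queue.peek() {
            if head.0 > now {
                break;
            }
            let Reverse((_, src, _, msg)) = self.queue.pop().unwrap();
            if !self.partition.contains(&(src, msg.dst)) {
                out.push((src, msg));
            }
        }
        out
    }
}
```

The sequence number is unique per message, so the OutMsg component of the key never decides the order; it only has to implement Ord so the tuple can live in the heap.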
db-20 — Distributed KV Store (Concepts)
This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.
Where it sits
| Lab | Topic | Provides |
|---|---|---|
| db-16 | distributed fundamentals | failure models, CAP, FLP |
| db-17 | Raft | a real consensus implementation |
| db-18 | Paxos | a contrast |
| db-19 | ZAB | another contrast |
| db-20 | distributed KV | integration: log + state machine |
The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.
Simplifications (vs. real Raft / db-17)
| Concept | db-17 | db-20 |
|---|---|---|
| Networking | message-driven | direct in-process broadcast |
| Elections | randomised timeouts | fixed leader, current_term == 1 |
| Followers' acks | RPC reply | function return |
| Log replication | next_index walk on conflict | one-shot snapshot push (truncate_and_replay) |
| Partition | network simulation | Cluster::partition({ids}) drops messages |
| Heal / catchup | next-index probes | full log copy + replay |
| Persistence | log file + fsync | none (in-memory Vec<LogEntry>) |
The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.
Data model
Op = NoOp | Put(key, value) | Del(key)
LogEntry = { term: u64, idx: u64, op: Op }
Replica = { id, log: Vec<LogEntry>, commit_index, current_term,
state_machine: BTreeMap<String, Vec<u8>> }
Cluster = { replicas[5], leader_idx=0, partitions, next_log_idx }
state_machine is the applied projection of the committed log
prefix. We do not store tombstones — Del actually removes the entry.
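The data model above maps onto Rust types fairly directly. The following is a minimal sketch, not a copy of src/rust/src/lib.rs: the u32 id type, the derive lists, and the exact body of advance_commit_to are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op {
    NoOp,
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LogEntry {
    pub term: u64,
    pub idx: u64,
    pub op: Op,
}

#[derive(Clone, Debug, Default)]
pub struct Replica {
    pub id: u32,
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub current_term: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

/// Apply one committed op. Del removes the key outright: no tombstone is kept.
pub fn apply(state_machine: &mut BTreeMap<String, Vec<u8>>, op: &Op) {
    match op {
        Op::NoOp => {}
        Op::Put(k, v) => {
            state_machine.insert(k.clone(), v.clone());
        }
        Op::Del(k) => {
            state_machine.remove(k);
        }
    }
}

impl Replica {
    /// Apply every entry with commit_index < idx <= target, in log order,
    /// then advance commit_index. The target is clamped to the log tail.
    pub fn advance_commit_to(&mut self, target: u64) {
        let last = self.log.last().map_or(0, |e| e.idx);
        let target = target.min(last);
        for e in &self.log {
            if e.idx > self.commit_index && e.idx <= target {
                apply(&mut self.state_machine, &e.op);
            }
        }
        if target > self.commit_index {
            self.commit_index = target;
        }
    }
}
```

Keeping apply as the single mutation point for the state machine makes "applied before committed" bugs easy to spot: any other write to state_machine is suspect.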
Propose / commit cycle
- The leader allocates the next log index and appends LogEntry{term, idx, op} to its own log.
- For each follower, in id order:
  - If the follower is partitioned, drop the message.
  - Else call try_append(prev_idx, entry). If the follower's last_idx matches prev_idx, the entry is appended (1 ack). If not, snapshot push: truncate_and_replay(leader.log, leader.commit_index) replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
- If acks ≥ quorum (3/5), commit on the leader by advancing commit_index and applying to the state machine. Then in id order, advance every reachable follower's commit_index too.
- Otherwise the entry stays in the leader's log uncommitted; a future successful proposal or a heal() will retro-commit it.
The total order of commits is the log order: idx 1, 2, 3, ....
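The ack-counting shape of this cycle can be sketched with a stripped-down cluster. To keep it short the sketch makes several simplifying assumptions: the log stores entry indices only, there is no state machine, the leader is never partitioned, and the snapshot push is a plain log copy instead of truncate_and_replay.

```rust
use std::collections::BTreeSet;

/// Minimal replica: the log holds entry indices only (no ops, no state machine).
#[derive(Clone, Default)]
pub struct Replica {
    pub log: Vec<u64>,
    pub commit_index: u64,
}

impl Replica {
    pub fn last_idx(&self) -> u64 {
        self.log.last().copied().unwrap_or(0)
    }

    /// Append only if our tail matches the leader's previous index.
    pub fn try_append(&mut self, prev_idx: u64, idx: u64) -> bool {
        if self.last_idx() != prev_idx {
            return false;
        }
        self.log.push(idx);
        true
    }
}

pub struct Cluster {
    pub replicas: Vec<Replica>,
    pub leader_idx: usize,
    pub partitions: BTreeSet<usize>,
    pub next_log_idx: u64,
}

impl Cluster {
    pub fn new(n: usize) -> Self {
        Cluster {
            replicas: vec![Replica::default(); n],
            leader_idx: 0,
            partitions: BTreeSet::new(),
            next_log_idx: 1,
        }
    }

    pub fn quorum(&self) -> usize {
        self.replicas.len() / 2 + 1
    }

    /// Returns true if the new entry was committed.
    pub fn propose(&mut self) -> bool {
        let idx = self.next_log_idx;
        self.next_log_idx += 1;
        let prev_idx = self.replicas[self.leader_idx].last_idx();
        self.replicas[self.leader_idx].log.push(idx);

        let mut acks = 1; // the leader counts itself exactly once
        for i in 0..self.replicas.len() {
            if i == self.leader_idx || self.partitions.contains(&i) {
                continue; // message dropped, zero acks from this replica
            }
            if !self.replicas[i].try_append(prev_idx, idx) {
                // Gap detected: push the leader's whole log (snapshot push).
                let snapshot = self.replicas[self.leader_idx].log.clone();
                self.replicas[i].log = snapshot;
            }
            acks += 1;
        }
        if acks < self.quorum() {
            return false; // entry stays in the leader's log, uncommitted
        }
        for i in 0..self.replicas.len() {
            if !self.partitions.contains(&i) {
                self.replicas[i].commit_index = idx;
            }
        }
        true
    }
}
```

With five replicas, partitioning two of them still leaves 3 acks and the proposal commits; partitioning three leaves 2 acks, so propose returns false while the leader's last_idx still advances.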
Partition + heal
Cluster::partition({3,4}) adds replica ids 3 and 4 to the
partitions set. Subsequent proposals do not message them and do not
count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the
cluster keeps committing. If not, every proposal returns false and
the leader's tail grows uncommitted.
Cluster::heal() clears the set and, for each healed follower in
ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's
next_index-walk conflict resolution: simpler, deterministic, and good
enough for the cross-language exam because the final state is
identical.
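A sketch of the heal path described above, under stated assumptions: heal is written as a free function over a replica slice rather than a Cluster method, the partition set is a BTreeSet<usize>, and the Replica here carries only the fields the replay needs.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op {
    NoOp,
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LogEntry {
    pub term: u64,
    pub idx: u64,
    pub op: Op,
}

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct Replica {
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

impl Replica {
    /// Replace the log wholesale, wipe the state machine, and re-apply
    /// only the committed prefix (idx <= leader_commit).
    pub fn truncate_and_replay(&mut self, leader_log: &[LogEntry], leader_commit: u64) {
        self.log = leader_log.to_vec();
        self.state_machine.clear();
        for e in self.log.iter().filter(|e| e.idx <= leader_commit) {
            match &e.op {
                Op::NoOp => {}
                Op::Put(k, v) => {
                    self.state_machine.insert(k.clone(), v.clone());
                }
                Op::Del(k) => {
                    self.state_machine.remove(k);
                }
            }
        }
        self.commit_index = leader_commit;
    }
}

/// Snapshot the leader's log and commit index first, clear the partition
/// set, then catch up each healed follower in ascending id order.
pub fn heal(replicas: &mut [Replica], leader_idx: usize, partitions: &mut BTreeSet<usize>) {
    let log = replicas[leader_idx].log.clone();
    let commit = replicas[leader_idx].commit_index;
    let healed = std::mem::take(partitions); // leaves the set empty
    for i in healed {
        if i != leader_idx {
            replicas[i].truncate_and_replay(&log, commit);
        }
    }
}
```

Cloning the leader's log before touching any follower sidesteps both the borrow checker and the "read after mutate" class of bug; the BTreeSet iteration gives the ascending id order for free.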
Cross-language byte identity — the exam
Wire format for one replica's snapshot
magic "DSEDKV20" (8 bytes)
u64 LE commit_index
u64 LE current_term (== 1 in this lab)
u32 LE entry_count
for each (k, v) in ascending k order:
u32 LE k_len | k_bytes
u32 LE v_len | v_bytes
- Iteration order is ascending key. Rust uses BTreeMap, C++ uses std::map; both are naturally ascending. Go's map iteration is randomised, so the Go implementation does an explicit sort.Strings before serialising.
- Deleted keys are not in the dump (Del removes the entry, so there are no tombstones to serialise).
- All integers little-endian.
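The snapshot layout above can be written out as a small serialiser. This is a sketch: the function signature (taking the three fields separately instead of a Replica) is an assumption made to keep the block self-contained.

```rust
use std::collections::BTreeMap;

pub const MAGIC: &[u8; 8] = b"DSEDKV20";

/// Serialise one replica's snapshot: magic, commit_index, current_term,
/// entry_count, then every (key, value) pair in ascending key order.
pub fn dump_state(
    commit_index: u64,
    current_term: u64,
    state_machine: &BTreeMap<String, Vec<u8>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&commit_index.to_le_bytes());
    out.extend_from_slice(&current_term.to_le_bytes());
    out.extend_from_slice(&(state_machine.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, so no explicit sort is needed.
    for (k, v) in state_machine {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

For a single Put of a one-byte key and a one-byte value the output is 8 + 8 + 8 + 4 + (4 + 1) + (4 + 1) = 38 bytes, which makes a convenient fixed-offset test.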
Workload spec
splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB
setup:
Cluster::new(5)
if scenario == "partition":
at op = ops/4 → partition([3, 4])
at op = ops*3/4 → heal()
for op in 0..ops:
r1, r2, r3 = rng.next() ×3
kind = (r1 >> 62) & 0x3 # 0,1,2 → Put; 3 → Del
k = "k" + (r2 % keys).to_string()
v = u64_le(r3 % 10000) # 8 bytes
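The generator and the per-op decoding in the spec can be sketched as below. Two things are assumptions here: the SplitMix64 struct with a next_u64 method (the spec only lists the constants), and seeding the state directly from --seed. The partition and heal triggers are omitted.

```rust
/// Stateful splitmix64 generator using the three constants from the spec.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        // wrapping_* keeps the arithmetic modulo 2^64 in debug builds too.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Op {
    Put(String, Vec<u8>),
    Del(String),
}

/// Draw r1, r2, r3 and decode one workload op.
/// All three draws happen for every op, even when r3 is unused by Del.
pub fn next_op(rng: &mut SplitMix64, keys: u64) -> Op {
    let r1 = rng.next_u64();
    let r2 = rng.next_u64();
    let r3 = rng.next_u64();
    let kind = (r1 >> 62) & 0x3; // 0, 1, 2 => Put; 3 => Del
    let key = format!("k{}", r2 % keys);
    if kind == 3 {
        Op::Del(key)
    } else {
        // Fixed 8-byte little-endian value, independent of host word size.
        Op::Put(key, (r3 % 10000).to_le_bytes().to_vec())
    }
}
```

Drawing all three values unconditionally matters for cross-language identity: a port that skips r3 on Del would desynchronise the stream after the first delete.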
Frozen golden hashes
| Scenario | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b |
| B | --seed 7 --ops 2000 --keys 128 --scenario partition | 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc |
These are baked into scripts/cross_test.sh, src/go/dkv20_test.go,
and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format /
op decoding / partition timing / snapshot push will break them — which
is exactly the point.
Where to look next
- src/rust/src/lib.rs: the reference implementation. Read it first.
- src/go/dkv20.go: port. Note the explicit sort.Strings before writing wire bytes.
- src/cpp/src/dkv20.cc: port. Note the manual little-endian writers and pure-stdlib sha256.
- docs/: the long-form study notes (analysis, execution, observation, verification, broader ideas).
- steps/: the three-step study plan if you are walking the lab fresh.
db-20 — References
Distributed-system foundations and the specific consensus / replication ideas that informed this lab.
Consensus
- Ongaro, D. & Ousterhout, J. In Search of an Understandable Consensus Algorithm (Extended Version). USENIX ATC 2014. https://raft.github.io/raft.pdf
- Lamport, L. Paxos Made Simple. ACM SIGACT News, 2001. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
- Howard, H. Distributed consensus revised. PhD thesis, Cambridge 2018. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf
CAP / consistency models
- Brewer, E. Towards Robust Distributed Systems (PODC 2000 keynote).
- Gilbert, S. & Lynch, N. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT 2002.
- Vogels, W. Eventually Consistent. CACM 2009.
Transactional storage
- Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993. Chapter 7 on replicated data.
- Mohan, C. et al. ARIES. ACM TODS 17(1), 1992 — background on why our log is append-only.
State-machine replication
- Schneider, F. B. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comp. Surveys 22(4), 1990.
Production systems for comparison
- etcd — https://etcd.io/docs/v3.5/learning/design-learner/
- TiKV — https://tikv.org/docs/dev/reference/architecture/raftstore/
- CockroachDB — https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html
Self-references in this repo
- db-16-distributed-fundamentals/: failure models, CAP/FLP intuitions.
- db-17-raft/: the underlying consensus algorithm.
- db-09-leveldb-complete/: the storage-engine quality bar this lab matches.
db-20 — Analysis
What is the question?
Given Raft-shaped consensus semantics, can we build a replicated state machine that produces a byte-identical snapshot across three language ecosystems? "Byte-identical" is the strongest possible test of cross-language conformance — strings, integers, map iteration order, and op semantics all have to line up.
Why is this an interesting study?
Raft on its own (db-17) tells you nothing about how a real key/value store is layered on top of it. Production systems (etcd, TiKV, CockroachDB) all answer the same questions:
1. What does the leader send to followers? (log entries)
2. When does a follower apply an entry? (when its commit_index advances)
3. How does a partitioned follower catch up? (next-index probe / install snapshot)
4. What invariants does the state machine maintain across replicas?
db-20 strips out the network and timer noise so we can focus on questions 2–4 alone. The simplification turns out to be the whole point: once you stop worrying about elections, the integration story fits in ~400 lines of Rust.
Design choices and trade-offs
Snapshot push instead of next-index walk
Raft's real conflict resolution is "decrement next_index, retry". For our purposes that produces the same final state as a one-shot snapshot push, but it forces us to model RPC round-trips. We pick the snapshot push because:
- it converges in a single step (deterministic), and
- it makes heal() trivial to write: just truncate and replay.
The cost: we cannot study log-divergence scenarios where two leaders both append. That's fine: this lab is single-leader by construction.
State machine is BTreeMap<String, Vec<u8>>
A sorted map gives free deterministic iteration in Rust and C++. Go's
map has randomised iteration, so the Go implementation explicitly
sorts before serialising. This is the single most common source of
non-determinism in cross-language ports — every wire-format-aware
function in the Go code does sort.Slice or sort.Strings.
Op encoding inside LogEntry is not wire-stable
The log is in-memory only; we never serialise LogEntry itself.
Cross-language byte identity is only required at the snapshot
boundary. This separation of "internal" and "wire" structures is
cheap discipline that scales to real systems.
current_term is in the snapshot but is always 1
We expose current_term in the wire format anyway, plumbed through to
all three implementations. This makes it cheap to add elections later
(e.g. as an extension exercise) without having to bump the magic.
Failure-mode catalogue (what we covered, what we did not)
| Failure | db-17 covers? | db-20 covers? |
|---|---|---|
| Single follower crash + catchup | yes | yes (heal) |
| Network partition isolating minority | yes | yes |
| Leader crash + new election | yes | no (fixed leader) |
| Split-brain after partition heal | yes | no (no elections) |
| Log compaction / snapshot install | scratched the surface | no |
| Disk-loss / log truncation | no | no |
| Byzantine behaviour | no | no |
Where to take this next
- broader-ideas.md lists the explicit extensions: linearizable reads, log compaction, multi-region replicas, learner replicas, snapshot install over the wire, gossip-style cluster membership.
- The exam in cross_test.sh doubles as a regression net for any of those extensions: break the snapshot bytes, you break the build.
db-20 — Execution Plan
Stage 1 — Single replica, no replication
Implement Replica and apply() for Op::{NoOp, Put, Del} in Rust.
Verify that a Cluster::new(1) (1-replica cluster — trivially has its
own quorum) can propose(Put("a", b"v")) and the state machine sees
a → b"v". Test cases 1 and 3 in the Rust suite.
Stage 2 — Five replicas, no failures
Add Cluster, propose, quorum = N/2 + 1. Verify that a single
propose on a 5-replica cluster applies to all five state
machines because all 5 follow the leader. Tests 2 and 6.
Stage 3 — Partitions
Add Cluster::partition(ids) and is_partitioned. Drop messages to
and from partitioned replicas. Test that:
- 3/5 reachable still commits (test 4),
- 2/5 reachable does not commit (test 3),
- partitioned followers have commit_index == 0 after one proposal (test 4).
Stage 4 — Heal + catchup
Implement Cluster::heal and Replica::truncate_and_replay. Verify
that after a sequence of mutations on healthy replicas, calling
heal() brings the partitioned ones back to byte-identical snapshots.
Tests 5 and 13.
Stage 5 — Canonical snapshot
Decide the wire format (see CONCEPTS.md), implement dump_state,
write the byte-format test that pins every field offset (test 8). The
test fails loudly if a future refactor changes endianness or field
order.
Stage 6 — Workload driver
Port splitmix64 (mix and stateful generator). Decode each r1
high-bit pair into Put/Del. Encode r3 % 10000 as a fixed 8-byte LE
value so the byte width is independent of host word size. Tests 9, 10.
Stage 7 — Cross-language exam
Build the Rust binary, capture the actual hash for scenarios A and B,
bake those hashes into src/go/dkv20_test.go,
src/cpp/tests/test_dkv20.cc, and scripts/cross_test.sh. Port Go.
Port C++. Run bash scripts/cross_test.sh and watch all three values
align.
Stage 8 — Verification + docs
scripts/verify.sh runs all three test suites. scripts/cross_test.sh
runs all three binaries on both scenarios. Write the five docs (analysis,
execution, observation, verification, broader-ideas) plus the three
steps/ study files.
Pitfalls to expect
| Symptom | Likely cause |
|---|---|
| Go scenario hash doesn't match Rust | unsorted map iteration in DumpState |
| C++ scenario hash doesn't match Rust | endian / size mismatch in put_u32_le / put_u64_le |
| C++ tests pass in Debug, fail in Release | assert(side_effect) — Release strips it |
| Wrong commit_index after partition heal | snapshot push not clearing state_machine |
| Build error: duplicate package declaration | create_file leftover from a stub |
| Subagent left half-built ports | resume manually, hash will tell you if it works |
db-20 — Observation
Frozen exam hashes
| Scenario | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b |
| B | --seed 7 --ops 2000 --keys 128 --scenario partition | 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc |
These three statements are all asserted by scripts/cross_test.sh:
- Rust, Go, and C++ each produce the hash above for the given scenario.
- All five replicas in the cluster produce the identical snapshot after the scenario completes (TestScenarioBReplicasConverge in Go, test_scenario_b_frozen in C++, scenario_b_partitioned_replicas_converge_after_heal in Rust).
- The cluster has no live partitions when the driver returns.
Quantitative observations
| metric | scenario A | scenario B |
|---|---|---|
| ops | 500 | 2000 |
| keys parameter | 32 | 128 |
| committed-on-leader entries | 500 | 2000 |
| approximate Put / Del fraction | 3/4 vs 1/4 | 3/4 vs 1/4 |
| live keys in final state (approx) | < 32 | < 128 |
Every committed entry executes exactly once on the leader; partitioned
followers see all of them after heal() because truncate_and_replay
replays the entire log.
Behavioural observations
- Convergence is deterministic. No timeouts, no clocks. Running the workload twice with the same seed always yields the same bytes (workload_determinism in Rust, TestWorkloadDeterminism in Go, test_workload_deterministic in C++).
- Sub-quorum proposals leave uncommitted tails. The leader's last_idx advances every propose, but commit_index only advances on quorum acks. This is observable in the test sub_quorum_does_not_commit: the leader sees last_idx == 1 and commit_index == 0.
- Heal is a one-shot. In scenario B, after the heal call at ops*3/4, all five replicas have byte-identical state machines. There is no period of eventual consistency: convergence is instantaneous and deterministic by construction.
- Delete is real. Del removes the key from the state machine. A later Put reusing the same key is a fresh entry, not a "revive". This is asserted by test_del_removes_key (C++) and friends.
Performance notes (this lab is not a perf study)
The reference implementations are single-threaded, in-memory, with no
I/O. Scenario B runs in ~5 ms in Release Rust on an M-series Mac;
the snapshot push during heal() is O(log_size) per partitioned
follower, which is the dominant cost.
The lab deliberately optimises for clarity and byte-identity, not throughput. Real systems (db-09 leveldb-complete is a good adjacent reference) batch and pipeline replication; here every propose is synchronous.
db-20 — Verification
Three layers of test
1. Per-language unit tests
| File | Tests | Covers |
|---|---|---|
| src/rust/src/lib.rs mod tests | 14 | splitmix64, sha256, single replica, quorum, sub-quorum, partition, heal, convergence, del, byte format, determinism, scenarios A/B, snapshot push, NoOp |
| src/go/dkv20_test.go | 15 | same set + an extra stdlib sha256 cross-check |
| src/cpp/tests/test_dkv20.cc | 13 | same set |
Run with:
( cd src/rust && cargo test --release )
( cd src/go && go test ./... )
( cd src/cpp/build && cmake --build . && ctest --output-on-failure )
scripts/verify.sh is the one-shot wrapper for all three and ends
with === OK ===.
2. Cross-language byte-identity exam
scripts/cross_test.sh builds three clusterctl binaries (Rust, Go,
C++) and runs:
clusterctl workload --seed 42 --ops 500 --keys 32 --scenario default
clusterctl workload --seed 7 --ops 2000 --keys 128 --scenario partition
The script asserts: rust_hash == go_hash == cpp_hash == golden_hash
for each scenario. Ends with === ALL OK ===. Failure on any line
exits non-zero and prints the diverging hashes.
3. Frozen golden hashes baked into source
The golden values are duplicated in three places on purpose:
- scripts/cross_test.sh
- src/go/dkv20_test.go (hashA, hashB constants)
- src/cpp/tests/test_dkv20.cc (string-literal in test_scenario_a_frozen and test_scenario_b_frozen)
A change to the wire format or workload spec must update all three to keep verify + cross_test green. The redundancy makes silent drift impossible.
Sanity-check invariants (asserted by the tests)
- Sha256Hex(empty) and Sha256Hex("abc") match the canonical SHA-256 test vectors.
- Splitmix64Mix(0) == 0x8b57dafca0cee644 in all three languages.
- DumpState for one Put produces exactly 38 bytes whose layout is pinned byte-by-byte.
- NoOp advances commit_index but leaves the state machine empty.
What this exam does NOT verify
- Real persistence (no log file, no fsync).
- Real elections (leader is fixed; current_term == 1).
- Real RPC failure injection (we model partitions only).
- Linearizable read paths (reads are direct map lookups).
Those are deliberate scope cuts — see analysis.md and
broader-ideas.md.
db-20 — Broader Ideas
The lab is deliberately small. Here is the menu of extensions that keep the same cross-language exam structure intact.
Elections
Add a term bump path. Replace fixed leader_idx = 0 with a leader
elected by randomised timeout (or a deterministic priority list, if
you want to keep cross-language byte identity). The snapshot already
serialises current_term, so the wire format does not need to
change. New invariant to test: after a leader change, every healthy
replica still converges to the same snapshot.
Linearizable reads
Currently reads are direct state_machine.get(k). To make them
linearizable, gate every read through a "read index" — leader confirms
it is still leader by exchanging heartbeats with a quorum, then
returns the value at commit_index. The byte-identity exam stays the
same; you add a TestReadIndexBlocksUntilQuorum-style scenario.
Log compaction / snapshot install
Today heal() ships the entire log. For long-running clusters that
is unbounded. Add Replica::compact(up_to_idx) that drops the prefix
and records a CompactedSnapshot at the head; change try_append to
also accept "follower has snapshot_idx == prev_idx - delta". The
exam scenarios still pass because the applied state is unchanged.
Multi-key transactions
Replace Op::Put with Op::Txn(Vec<KeyOp>) and apply atomically.
This is a small, well-scoped extension that nudges the lab toward
db-13 (transactions and MVCC) territory.
Membership changes
Add a JointConsensus op (Raft §6) that switches the cluster's
quorum during a configuration change. Trickier — the snapshot needs
to include the active config — but a worthy follow-on if you want to
see why "just add a node" is a real problem.
Disk persistence
Persist the log to a file (use db-01's pwrite-and-fsync pattern).
Test crash recovery by tearing down a replica and reconstructing it
from the log file. The snapshot bytes do not change.
Learner replicas
Add a replica role that receives entries but does not count toward quorum. Useful for read scaling. The snapshot bytes do not change.
Gossip-style membership
Replace the static replica list with a SWIM-style gossip layer that discovers and evicts replicas. Far more invasive — at this point you are building etcd.
Bridges to other labs in the repo
| Extension | Builds on which other lab? |
|---|---|
| Disk persistence | db-01 (storage primitives), db-03 (WAL) |
| Linearizable reads | db-16 (distributed fundamentals) |
| Multi-key transactions | db-13 (transactions and MVCC) |
| Compaction / snapshot | db-09 (leveldb), db-21 (storage advanced) |
| Real elections + RPCs | db-17 (raft) |
| Multi-region / quorum mix | db-22 (perf & benchmarking) |
Step 01 — Cluster and Replica
Goal: in ~30 minutes, build a single-replica Cluster::new(1) that
accepts Put / Del and returns a state machine you can inspect.
What to read first
- CONCEPTS.md § "Data model" and § "Propose / commit cycle".
- db-17-raft/CONCEPTS.md for the words log entry, commit index, quorum if they are not yet second nature.
Concrete tasks
- Define OpKind, Op, LogEntry, Replica, Cluster in the language of your choice. Match the field layout from src/rust/src/lib.rs.
- Implement Replica::last_idx, Replica::try_append, Replica::advance_commit_to. Note that apply(state_machine, op) is the only place where state_machine mutates outside of truncate_and_replay.
- Implement Cluster::new(n), quorum, propose. For now, treat every reachable follower as a successful append (no NACK path yet).
Definition of done
let mut c = Cluster::new(1);
assert!(c.propose(Op::Put("a".into(), vec![1,2,3])));
assert_eq!(c.leader().state_machine.get("a"), Some(&vec![1,2,3]));
The equivalents must pass in Go and C++. Run cargo test single_replica_put_commits
to confirm.
Common bugs at this stage
- Forgetting to bump next_log_idx so two proposals get the same idx.
- Applying op before the entry is committed.
- Iterating an unsorted map somewhere (Go). Even at this stage, start the habit of sort.Strings(keys) before any deterministic output.
Step 02 — Quorum Replication
Goal: extend the cluster to 5 replicas and make the leader commit only when a majority acks.
What to read first
- docs/execution.md Stages 2 and 3.
- The propose loop in src/rust/src/lib.rs (the part that calls try_append and counts acks).
Concrete tasks
- Implement Cluster::partition(ids) and is_partitioned. Store partitioned ids in a sorted set so iteration order is stable.
- In propose, count acks only from non-partitioned, non-leader replicas (plus one for the leader if the leader itself is not partitioned). If acks < quorum, return false and leave the entry uncommitted.
- Write the three quorum tests:
  - 5/5 reachable → commit on all.
  - 3/5 reachable → commit on the reachable three.
  - 2/5 reachable → no commit anywhere.
Definition of done
let mut c = Cluster::new(5);
c.partition(&[2, 3, 4]);
assert!(!c.propose(Op::Put("k".into(), b"v".to_vec())));
assert_eq!(c.leader().last_idx(), 1);
assert_eq!(c.leader().commit_index, 0);
This must pass. The Go and C++ ports must match.
Common bugs at this stage
- Counting partitioned followers' acks anyway (a follower in the partitions set must contribute zero acks).
- Counting the leader twice (once for i == leader_idx, once for acks = 1).
- Advancing commit_index on the leader but not on the followers.
- Mutating state_machine before the commit check passes.
Step 03 — Partitions and Catchup
Goal: implement heal() and truncate_and_replay so a partitioned
follower can rejoin and converge. Then ship the cross-language exam.
What to read first
- `CONCEPTS.md` § "Partition + heal".
- `docs/execution.md` Stages 4–7.
- `src/rust/src/lib.rs`: the `truncate_and_replay` and `Cluster::heal` bodies.
Concrete tasks
- Implement `Replica::truncate_and_replay(leader_log, leader_commit)`: replace own log, wipe state machine, replay committed prefix.
- Implement `Cluster::heal()`:
  - clone leader log + commit index up front (avoid use-after-mutate),
  - clear `partitions`,
  - for each previously-partitioned follower in ascending id order, call `truncate_and_replay`.
- In `propose`, when `try_append` returns `false` (gap), do a snapshot push immediately and count the ack.
- Implement `dump_state` per the wire format in `CONCEPTS.md`. Pin every byte offset in a test (`test_snapshot_byte_format`).
- Port the workload driver (`run_workload`) to all three languages. The byte-decoding rules (`kind = (r1 >> 62) & 0x3`, `k = "k" + …`, `v = u64_le(r3 % 10000)`) must be identical across all three.
- Build the Rust binary, run scenarios A and B, capture the hashes, and bake them into the Go test, the C++ test, and `scripts/cross_test.sh`.
- Bring Go and C++ green: run `scripts/cross_test.sh`. It must end with `=== ALL OK ===`.
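The first task (replace the log, wipe the state machine, replay the committed prefix) might look like the sketch below. The field layout of `Replica`, the `Op::Delete` variant, and the free function `apply` are assumptions made for illustration; only `truncate_and_replay`, `commit_index`, and `state_machine` are names taken from the text.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
enum Op {
    Put(String, Vec<u8>),
    Delete(String), // assumed variant; the text only shows Put
}

#[derive(Default)]
struct Replica {
    log: Vec<Op>,        // entry at position i has log index i + 1
    commit_index: usize, // number of committed entries
    state_machine: BTreeMap<String, Vec<u8>>,
}

/// Apply one committed op to a state machine (hypothetical helper).
fn apply(sm: &mut BTreeMap<String, Vec<u8>>, op: &Op) {
    match op {
        Op::Put(k, v) => {
            sm.insert(k.clone(), v.clone());
        }
        Op::Delete(k) => {
            sm.remove(k);
        }
    }
}

impl Replica {
    /// Adopt the leader's log wholesale, then rebuild state from scratch by
    /// replaying only the committed prefix.
    fn truncate_and_replay(&mut self, leader_log: &[Op], leader_commit: usize) {
        self.log = leader_log.to_vec();
        self.state_machine.clear();
        // Clamp defensively so a bad commit index cannot index past the log.
        self.commit_index = leader_commit.min(self.log.len());
        for op in &self.log[..self.commit_index] {
            apply(&mut self.state_machine, op);
        }
    }
}
```

Rebuilding from an empty state machine is the simple choice here: it trades replay time for not having to reason about which stale entries the follower had already applied.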
Definition of done
bash scripts/verify.sh # → "=== OK ==="
bash scripts/cross_test.sh # → "=== ALL OK ==="
Both scenarios produce:
- A: `1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b`
- B: `272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc`
Common bugs at this stage
- `heal()` reads `leader.log` after mutating a follower. Fix: use a snapshot variable.
- `dump_state` in Go iterates the map directly → randomised hash. Fix: sort the keys.
- `dump_state` in C++ uses `strcpy(magic_buf, "DSEDKV20")` and copies 9 bytes including the NUL. Fix: `std::memcpy(buf, MAGIC.data(), 8)`.
- C++ test passes in Debug, fails in Release because `assert(c.propose(...))` got stripped. Fix: assign to a `bool ok = ...` first, then `assert(ok)`.
- CLI prints a trailing newline. The exam compares full lines; a trailing `\n` breaks the hash comparison.
Advanced Storage Engine
Lab status: complete. All unit tests pass; `scripts/cross_test.sh` proves three independent implementations (Rust, Go, C++) produce byte-identical canonical wire dumps for three fixed workloads.
1. What Is It
A standalone study of two pieces that turn a textbook LSM tree into something closer in strength to RocksDB or LevelDB:
- Range tombstones: a single record that logically deletes every key in a half-open interval `[start, end)`, instead of writing one Delete per key.
- Compaction policies: size-tiered (Cassandra/Scylla heritage) and Universal (RocksDB's tiered-style alternative to its default leveled compaction), expressed as deterministic, side-effect-free functions over the sequence of SSTs.
Everything else (block cache, bloom layout, manifest, WAL, MVCC) is held at its simplest possible form so the two ideas above are studied in isolation.
2. Why Care
- Range tombstones make `DELETE FROM t WHERE id BETWEEN ? AND ?` and TTL-style "drop everything older than X" affordable. Without them, a one-million-key range delete writes one million Delete entries and, worse, every subsequent read or scan over that range has to step past those tombstones until compaction finally drops them at the bottom level.
- Compaction policies are the single biggest knob in any LSM. Size-tiered minimises write amplification at the cost of read amplification; Universal bounds the SST count while preserving recency. Choosing one is choosing the workload shape you'll be good at.
3. Core Data Structures
| Type | Purpose |
|---|---|
| `Entry { Put(k,v) \| Delete(k) }` | The point-write unit. |
| `RangeTomb { start, end }` | Half-open interval delete; `start ≤ k < end`. |
| `SimpleBloom (u64)` | 64-bit single-word bloom; two FNV-1a positions. |
| `Sst { smallest, largest, entries, range_tombstones, bloom }` | An immutable run. |
| `LsmTreeAdvanced { ssts (newest first), ratio }` | The whole tree. |
Sst::size() = entries.len() + 2 * range_tombstones.len(). The ×2 weight on
tombstones is deliberate: it makes compaction more eager when tombstones pile
up, which matches real-world tuning advice.
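The `SimpleBloom` row can be made concrete with a short sketch. The FNV-1a offset and prime are the standard 64-bit parameters; how the two bit positions are derived from the hash is not specified in this section, so the choice below (low 6 bits, and 6 bits taken from bit 32 upward) is an assumption for illustration only and will not reproduce the lab's bloom bytes unless the lab happens to use the same rule.

```rust
const FNV_OFFSET: u64 = 0xCBF2_9CE4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;

/// 64-bit FNV-1a: XOR the byte in first, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// A bloom filter that is exactly one 64-bit word.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct SimpleBloom(u64);

impl SimpleBloom {
    /// ASSUMPTION (not from the lab): both positions come from one FNV-1a
    /// hash, using two different 6-bit slices of it.
    fn mask(key: &[u8]) -> u64 {
        let h = fnv1a64(key);
        let p1 = h & 63;
        let p2 = (h >> 32) & 63;
        (1u64 << p1) | (1u64 << p2)
    }

    fn insert(&mut self, key: &[u8]) {
        self.0 |= Self::mask(key);
    }

    /// False means "definitely absent"; true means "possibly present".
    fn may_contain(&self, key: &[u8]) -> bool {
        let m = Self::mask(key);
        (self.0 & m) == m
    }
}
```

A one-word filter saturates quickly, which is acceptable here because the lab wants bloom misses and false positives to both show up on tiny inputs.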
4. The Six Algorithms in One Page
1. Build SST. Walk pending entries right-to-left, marking each key's last occurrence as `keep`. Then walk left-to-right emitting kept entries (this preserves the insertion order of survivors). Compute smallest/largest and the bloom in the same left-to-right pass. Range tombstones are copied verbatim.
2. Get(key). Walk SSTs newest → oldest, accumulating active tombstones into a vector as we go. For each SST:
   - append its tombs to `active`,
   - if any tomb in `active` covers `key`, return `None` (early exit),
   - if the bloom misses, `continue` (bounds and entries are skipped, but older SSTs may still contain covering tombs, so the walk continues),
   - if `key < smallest` or `key > largest`, `continue` for the same reason,
   - linear-scan the entries; the first hit returns `Some(v)` for Put, `None` for Delete.
3. Size-tiered compaction. Pick the longest prefix length `L ∈ [2, n-1]` such that `Σ size(ssts[0..L]) ≤ ratio · size(ssts[L])`. Merge that prefix into one SST and insert it at the newest position (index 0). If no such `L` exists, return `false`.
4. Universal compaction. Pick the longest contiguous run `[i, i+L)` with `L ≥ 3` such that `Σ size(run) ≤ ratio · size(ssts[i+L])` (i.e. the run is followed by something at least `1/ratio` times as big; twice as big at the fixture ratio of 0.5). Ties are broken by smaller `i`. Merge the run and replace it in place.
5. Merge. Walk the run newest → oldest. For each SST:
   - copy its range tombs into `out_tombs` (preserved verbatim),
   - for each entry: skip if `seen[key]` (newer wins), skip if covered by any previously active tomb, otherwise keep it; mark `seen`,
   - append the SST's range tombs to `active`.

   Finally sort `out_entries` by key for determinism and recompute the bloom and bounds.
6. Dump (canonical wire format). A length-prefixed binary blob, little-endian throughout. Magic `"DSEADV21"` ‖ `f64 ratio` ‖ `u32 sst_count` ‖ per SST: lenpref(smallest) ‖ lenpref(largest) ‖ `u32 nE` ‖ entries (`u8 kind` ‖ lenpref(key) ‖ if Put: lenpref(val)) ‖ `u32 nT` ‖ tombs (lenpref(start) ‖ lenpref(end)) ‖ `u64 bloom`.
5. What's Deliberately Not Here
- No WAL: recovery is out of scope; the tree is in-memory.
- No block cache, no separate index/filter blocks: the bloom is one `u64`.
- No level structure: SSTs are a flat list, newest first.
- No snapshots / MVCC: `Get` only ever sees the current state.
- No concurrency: everything is single-threaded; SSTs are immutable so reads-with-merges would be trivially safe.
These omissions are why the lab fits in three files per language while still exercising the two ideas (range tombstones, compaction policy) at the depth where their subtleties bite.
6. Pointers to Cross-Language Equivalence
The whole point of the lab is that three independent implementations agree
byte-for-byte, not just at API level. The shared deterministic workload
(SplitMix64 PRNG, three draws per op, fixed flush/compact cadence) and the
canonical wire format (Section 4.6) are the two halves of that contract.
scripts/cross_test.sh enforces it with three hard-coded sha256 fixtures.
See docs/execution.md for the format spec, docs/verification.md for the
expected output of the verification scripts, and docs/analysis.md for the
design forces behind both range tombstones and the two compaction policies.
References — db-21
The lab is intentionally self-contained, but the ideas are not original.
Range Tombstones
- RocksDB blog, "DeleteRange: A New Native RocksDB Operation" https://rocksdb.org/blog/2018/11/21/delete-range.html
- CockroachDB blog: "DeleteRange and the importance of tombstones in a distributed SQL database."
- "Bringing Modern Hierarchical Memory Systems Into Main-Memory Databases" (Bortnikov et al.) — discusses interval-deletion structures.
Compaction Policies
- "The Log-Structured Merge-Tree (LSM-Tree)", O'Neil et al., 1996. The canonical paper; introduces the size-tiered idea.
- RocksDB wiki, "Universal Compaction Style" https://github.com/facebook/rocksdb/wiki/Universal-Compaction
- RocksDB wiki, "Leveled Compaction" https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
- ScyllaDB docs, "Size-tiered compaction strategy (STCS)" — the Cassandra heritage of size-tiered.
SplitMix64
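- Steele, Lea, Flood, "Fast Splittable Pseudorandom Number Generators", OOPSLA 2014. The increment `0x9E37...` (the "golden gamma") comes from this paper; the multipliers `0xBF58...` and `0x94D0...` are the mixing constants used by the widely copied SplitMix64 reference implementation.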
FNV-1a
- Glenn Fowler, Landon Curt Noll, Phong Vo, "FNV non-cryptographic hash." The 64-bit offset `0xCBF29CE484222325` and prime `0x100000001B3` are the standard FNV-1a parameters.
Cross-Language Byte Equivalence as a Methodology
- TigerBeetle's "Tigerstyle" notes on deterministic simulation.
- FoundationDB's flow-based testing — the closest commercial analogue to "spec-by-hash-of-canonical-dump".
Analysis — db-21 Advanced Storage Engine
1. Problem Statement
Two engineering questions, studied in isolation:
- Range deletes. How does an LSM delete a key range `[a, b)` cheaply, without writing one Delete per key, and without losing correctness if a newer Put falls inside the same range?
- Compaction policy. How do size-tiered and Universal compaction actually differ, not as marketing words but as deterministic functions over the current SST sequence?
The lab refuses to answer these with prose alone. It demands an executable specification that three language ports must agree on byte-for-byte.
2. Why Three Languages
Cross-language byte agreement is the cheapest sanity check that survives
refactoring. If Rust drifts from Go on fixture A, the failure tells you
exactly which side broke: the diff between the two dump_state() blobs is
a structured binary, decodable by eye.
It also forces the design through three different idiom sets:
- Rust keeps `Option<Vec<u8>>` for `Get`, `enum Entry { Put, Delete }` for the entry kind, and uses `Vec<u8>` everywhere for keys/values. Slices for bounds; no copies in `merge_run`'s hot path.
- Go uses `[]byte` plus `bytes.Compare`. A `map[string]struct{}` stands in for the dedupe set. `math.Float64bits` for the ratio encoding.
- C++ uses `std::string` as a byte container (avoids the `char_traits` trap), `std::optional<std::string>` for `Get`, and an inline 64-line SHA-256 in `lsmctl.cc` to keep the dependency surface at zero.
If you can read the same algorithm in all three and they line up at the byte level, the algorithm is unambiguous. That's the deliverable.
3. Range Tombstones — The Subtlety
The single non-obvious rule is:
A range tombstone hides keys older than it, but is itself hidden by Puts newer than it.
Both halves matter. Test range_tomb_respects_newer_put exists because a
naive implementation that consults all tombstones before walking entries
will silently drop the fresh value.
The implementation enforces this by walking SSTs newest → oldest and accumulating active tombstones as the walk descends. A Put in a newer SST is checked against the (then still empty) active set, so it survives. A Put in an older SST is checked against the (by then populated) active set, so it is hidden.
This also explains why a bloom miss must continue instead of return None:
the SST we just skipped might have zero matching keys, but it could still
contribute a range tombstone that shadows something below it. The active
set must be allowed to grow.
4. Size-Tiered vs Universal — The Real Difference
Both are "merge several SSTs into one". The difference is which several.
- Size-tiered asks: "is there a prefix `[0..L)` of new, small SSTs that together fit within `ratio · size(ssts[L])`?" It picks the longest qualifying prefix, merges them, and inserts the result at the newest position. This is greedy from the top of the tree.
- Universal asks the same shape of question, but over a contiguous run anywhere in the list, with a minimum run length of 3. It picks the longest run; ties go to the leftmost. The merged run replaces itself in place.
The minimum lengths (≥ 2 prefix for tiered, ≥ 3 run for universal) are deliberate, both to keep work amortised and to make the two policies distinguishable on small inputs. Without them, both would degenerate to "merge whenever you can" and the fixtures wouldn't separate them.
5. Why the Wire Format Looks Like That
Five choices, each with a reason:
- Magic `"DSEADV21"`: eight bytes, no length prefix. A mismatch shows up in the first 16 hex characters of a hex dump of the blob, which is easy to spot.
- `f64` ratio: encoded via the IEEE 754 bit pattern, not as a string. This is why all three languages route through `f64::to_bits`, `math.Float64bits`, and `memcpy(&u64, &double, 8)`. Strings would force a formatter choice (`"0.5"` vs `"0.5000000000000000"`).
- Length-prefixed keys/values: `u32 LE` lengths, raw bytes. No terminator, no escaping. Decoding is a one-pass scan.
- Entries newest-SST-first: matches the in-memory layout; reversing it in the dump would obscure the actual data structure.
- Bloom as raw `u64 LE`: not a list of positions. The bitmap is the bloom; nothing else needs to be portable.
6. Trade-offs Not Taken
- We did not implement snapshot reads. Every Get is "as of right now".
- We did not deduplicate range tombstones across SSTs at merge time. A range that fully contains an older range still leaves both in the merged output. Real engines coalesce; we don't, because the canonical-bytes test would then depend on a chosen normalisation policy.
- We did not gate compaction on actual work performed; size-tiered may pick a length-2 prefix even when the merge produces zero entries (after tombstones erase everything). That's a feature for the study lab — it exercises the merge code; in production you'd skip empty merges.
Execution — db-21 Wire Format and Workload
This document is the single source of truth for the canonical wire format and the deterministic workload. Anything ambiguous here is a bug; fix the doc, not the implementations.
1. SplitMix64 PRNG
state += 0x9E3779B97F4A7C15
z = state
z = (z XOR (z >> 30)) * 0xBF58476D1CE4E5B9
z = (z XOR (z >> 27)) * 0x94D049BB133111EB
return z XOR (z >> 31)
All multiplications are unsigned 64-bit, wrapping on overflow. The PRNG is
seeded with the user-supplied 64-bit seed. Three draws happen per op,
even when only one or two are used — keep them in order (r1, r2, r3).
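As a sketch, the step function above translates to Rust almost line for line. The struct name `SplitMix64` and the `draw3` helper are illustrative names, not necessarily those used in the lab's source.

```rust
/// SplitMix64 with the constants listed above. All arithmetic wraps mod 2^64.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    /// The state starts at the user-supplied seed, unmodified.
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Draw the three words for one op, always in the order (r1, r2, r3),
    /// whether or not the op ends up using all of them.
    fn draw3(&mut self) -> (u64, u64, u64) {
        let r1 = self.next_u64();
        let r2 = self.next_u64();
        let r3 = self.next_u64();
        (r1, r2, r3)
    }
}
```

Binding `r1`, `r2`, `r3` with separate `let` statements makes the draw order explicit, which is the property the cross-language contract depends on.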
2. Operation Selection
op = (r1 >> 62) & 0b11
| op | Action |
|---|---|
| 0, 1 | `Put(k = "k" + (r2 mod keys), v = u32_be(r3 as u32))` |
| 2 | `Delete(k = "k" + (r2 mod keys))` |
| 3 | `RangeTomb(start = "k" + a, end = "k" + (a + 1 + (r3 mod (keys-a))))` where `a = r2 mod keys` |
In scenario ptonly, op 3 is rewritten to op 0 before the action runs.
The three draws still happen.
The value bytes are the big-endian 32-bit representation of r3 truncated
to 32 bits. (Big-endian because it produces visually distinct bytes across
fixtures; the format is otherwise little-endian.)
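The selection rules can be sketched as one pure function from the three draws to an operation. The enum `WorkOp` and the function `decode_op` are hypothetical names, and the key suffix is assumed to be the unpadded decimal rendering of the number, which this section does not state explicitly.

```rust
#[derive(Debug, PartialEq)]
enum WorkOp {
    Put { key: String, value: [u8; 4] },
    Delete { key: String },
    RangeTomb { start: String, end: String },
}

/// Decode one op from its three draws. `keys` must be non-zero.
/// ASSUMPTION: "k" + n means "k" followed by n in unpadded decimal.
fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64, ptonly: bool) -> WorkOp {
    let mut op = (r1 >> 62) & 0b11;
    if ptonly && op == 3 {
        op = 0; // the ptonly scenario rewrites range tombstones to puts
    }
    let a = r2 % keys;
    match op {
        0 | 1 => WorkOp::Put {
            key: format!("k{}", a),
            // truncate r3 to 32 bits, then emit big-endian bytes
            value: (r3 as u32).to_be_bytes(),
        },
        2 => WorkOp::Delete { key: format!("k{}", a) },
        _ => {
            // end index lies in a+1 ..= keys, so the range is never empty
            let b = a + 1 + (r3 % (keys - a));
            WorkOp::RangeTomb {
                start: format!("k{}", a),
                end: format!("k{}", b),
            }
        }
    }
}
```

Because the function takes the draws as arguments instead of pulling from the PRNG itself, the caller stays responsible for always drawing three words per op.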
3. Flush and Compact Cadence
- Every 8 ops (i.e. when `(op_idx + 1) % 8 == 0`): flush all pending entries and tombstones into a new SST at the newest position.
- Every 16 ops (i.e. when `(op_idx + 1) % 16 == 0`): run one compaction pass appropriate to the scenario (size-tiered, universal, or no-op).
- No residue flush at end. If the loop ends with non-zero pending entries, they are discarded. This is intentional: it keeps the cross-language hashes stable regardless of `ops mod 8`.
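A small counting sketch shows what the cadence implies for a given `ops` value. The function `cadence` is a hypothetical helper that only counts events; it assumes every op leaves exactly one pending record and that, on a multiple of 16, the flush runs before the compaction.

```rust
/// Returns (flushes, compactions, discarded_pending_ops) for `ops` operations.
fn cadence(ops: u64) -> (u64, u64, u64) {
    let (mut flushes, mut compactions, mut pending) = (0u64, 0u64, 0u64);
    for op_idx in 0..ops {
        pending += 1; // the op just executed is now waiting for a flush
        if (op_idx + 1) % 8 == 0 {
            flushes += 1;
            pending = 0;
        }
        if (op_idx + 1) % 16 == 0 {
            compactions += 1; // assumed to run after the flush above
        }
    }
    // Whatever is still pending here is dropped: there is no residue flush.
    (flushes, compactions, pending)
}
```

For example, `cadence(500)` yields 62 flushes, 31 compaction passes, and 4 discarded ops, since 500 is not a multiple of 8.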
4. Canonical Wire Format
All integers little-endian. `lenpref(b)` means `u32 LE len(b) ‖ b`.

    "DSEADV21"                       (8 bytes, ASCII, no terminator)
    f64 LE ratio                     (IEEE 754 bit pattern, not a string)
    u32 LE sst_count
    for each SST (newest first):
        lenpref(smallest_key)
        lenpref(largest_key)
        u32 LE entry_count
        for each entry:
            u8 kind                  (Put = 1, Delete = 2)
            lenpref(key)
            if kind == Put: lenpref(value)
        u32 LE range_tomb_count
        for each range tombstone:
            lenpref(start_key)
            lenpref(end_key)
        u64 LE bloom_bitmap
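An encoder for this layout is short enough to sketch in full. The struct and function shapes below are assumptions for illustration (the lab's `dump_state` is presumably a method on the tree); the byte layout follows the spec above.

```rust
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

struct RangeTomb {
    start: Vec<u8>,
    end: Vec<u8>,
}

struct Sst {
    smallest: Vec<u8>,
    largest: Vec<u8>,
    entries: Vec<Entry>,
    range_tombstones: Vec<RangeTomb>,
    bloom: u64,
}

/// u32 LE length followed by the raw bytes.
fn put_lenpref(out: &mut Vec<u8>, b: &[u8]) {
    out.extend_from_slice(&(b.len() as u32).to_le_bytes());
    out.extend_from_slice(b);
}

/// Encode the tree (SSTs newest first) into the canonical blob.
fn dump_state(ratio: f64, ssts: &[Sst]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEADV21");
    out.extend_from_slice(&ratio.to_bits().to_le_bytes()); // IEEE 754 bits
    out.extend_from_slice(&(ssts.len() as u32).to_le_bytes());
    for sst in ssts {
        put_lenpref(&mut out, &sst.smallest);
        put_lenpref(&mut out, &sst.largest);
        out.extend_from_slice(&(sst.entries.len() as u32).to_le_bytes());
        for e in &sst.entries {
            match e {
                Entry::Put(k, v) => {
                    out.push(1);
                    put_lenpref(&mut out, k);
                    put_lenpref(&mut out, v);
                }
                Entry::Delete(k) => {
                    out.push(2);
                    put_lenpref(&mut out, k);
                }
            }
        }
        out.extend_from_slice(&(sst.range_tombstones.len() as u32).to_le_bytes());
        for t in &sst.range_tombstones {
            put_lenpref(&mut out, &t.start);
            put_lenpref(&mut out, &t.end);
        }
        out.extend_from_slice(&sst.bloom.to_le_bytes());
    }
    out
}
```

An empty tree with ratio 0.5 encodes to exactly 20 bytes: the magic, then `00 00 00 00 00 00 E0 3F`, then a zero SST count.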
5. The Three Canonical Fixtures
Captured from the Rust reference and pinned in scripts/cross_test.sh:
| Fixture | seed | ops | keys | scenario | sha256 of dump |
|---|---|---|---|---|---|
| A | 42 | 200 | 32 | tieredcompact | fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6 |
| B | 7 | 500 | 64 | universalcompact | 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb |
| C | 99 | 300 | 16 | withrange | 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4 |
If you change anything about the workload or the wire format, these hashes change. That's the contract: the hashes are intentional padlocks on behavioural drift.
6. lsmctl CLI
lsmctl workload --seed S --ops N --keys K --scenario {ptonly|withrange|tieredcompact|universalcompact}
Prints the lowercase hex sha256 of dump_state() followed by a newline.
Exit code is 0 on success, 2 on argument errors. All three ports must
agree on stdout byte-for-byte for the same arguments.
7. Reproducing the Hashes
cd db-21-storage-engine-advanced
./scripts/verify.sh # all unit tests
./scripts/cross_test.sh # cross-language byte equivalence
Expected last line: === ALL OK ===.
Observation — db-21
1. The Three Hashes
A seed=42 ops=200 keys=32 tieredcompact fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
B seed=7 ops=500 keys=64 universalcompact 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
C seed=99 ops=300 keys=16 withrange 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
All three languages produce all three hashes on the first run after each clean build. This was not a happy accident — it required keeping every sneaky source of nondeterminism out of the merge step:
- HashSet iteration order doesn't leak (we sort `out_entries` by key after the merge, and we never serialise the `seen` set).
- Map ordering doesn't leak (Go uses a `map[string]struct{}` for dedupe but never iterates it; entries come out of a slice).
- Floating-point comparison doesn't leak (the ratio is `0.5` exactly, which is a representable `f64`; `Σ size ≤ ratio · size` is integer-vs-rational with no rounding ambiguity at this scale).
2. What Bit Us During Development
- Two-pass size-tiered. An early draft computed `prefix_sum` once to pick `chosen`, then recomputed it inside the merge call. The two passes drifted under refactoring. Fixed by collapsing to a single pass that updates `prefix_sum` inline.
- Go `math.Float64bits`. The initial Go draft tried to avoid the `math` import by writing a wrapper chain (`float64bits` → `float64bitsFallback` → `math_Float64bits`). The chain was broken (no `math` import to define the leaf). Lesson: don't fight the standard library for ceremony.
- C++ `std::optional<std::string>` for `Get`. Worth the friction versus a sentinel value: a Put of the empty string is distinguishable from absent, which is testable in `dedup_keeps_last`.
3. What We Didn't Observe (and why that's good)
- No platform endianness surprises. macOS arm64 produced the same hashes the canonical fixtures pin. The explicit `LE` encoding in every put-int helper means we'd survive a big-endian port too.
- No `f64` rounding drift. The ratio is `0.5` and the sizes are small integers; nothing forces denormals or transcendental math.
- No SHA-256 mismatch. The Rust port uses an inline impl in `lsmctl.rs`; the Go port uses `crypto/sha256`; the C++ port uses the 64-line public-domain reference at the bottom of `adv.cc`. Three independent SHA-256 implementations agreeing on three hashes is the cheapest possible end-to-end test.
4. Resource Profile
Each cargo build --release takes ~5s cold. go build ~1s. cmake --build
~3s. cross_test.sh from cold runs in ~10s including all three builds. No
external network, no Docker, no system packages beyond a working C++20
toolchain, Go ≥ 1.22, and Rust stable.
Verification — db-21
1. What "Verified" Means Here
Two distinct claims:
- Per-language correctness: unit tests in each language pass.
- Cross-language byte equivalence: three independent implementations produce identical canonical wire dumps for three fixed workloads, proven by sha256.
Both must hold. (1) without (2) lets each port drift independently into
a "self-consistent but wrong" state.
2. Per-Language Unit Tests
Ten tests, mirrored across all three ports:
| # | Name | Asserts |
|---|---|---|
| 1 | bloom_hit_miss | Bloom positive case + a definite negative |
| 2 | bounds_short_circuit | Get skips SST when key outside [smallest, largest] |
| 3 | range_tomb_hides_older_put | Newer range tomb shadows older Put |
| 4 | range_tomb_respects_newer_put | Older range tomb does not shadow newer Put |
| 5 | tiered_picks_prefix | compact_size_tiered picks ≥2 prefix |
| 6 | universal_picks_run | compact_universal picks ≥3 contiguous run |
| 7 | noop_compaction | Returns false when no eligible group |
| 8 | dump_determinism | Two dumps of the same state are equal; magic is DSEADV21 |
| 9 | workload_all_scenarios | All four scenarios produce non-empty dumps with correct magic |
| 10 | dedup_keeps_last | build_sst keeps the last Put per key |
./scripts/verify.sh
# == Rust ==
# 10 passed; 0 failed
# == Go ==
# ok github.com/10xdev/dse/db21
# == C++ ==
# 1/1 Test #1: test_adv ......................... Passed
# === OK ===
3. Cross-Language Byte Equivalence
./scripts/cross_test.sh
# == build Rust ==
# == build Go ==
# == build C++ ==
# ok fixture=A impl=rust fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok fixture=A impl=go fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok fixture=A impl=cpp fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok fixture=B impl=rust 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok fixture=B impl=go 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok fixture=B impl=cpp 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok fixture=C impl=rust 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok fixture=C impl=go 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok fixture=C impl=cpp 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# === ALL OK ===
4. What Would Falsify The Claim
A non-exhaustive list of bugs the cross test would catch but a per-language test wouldn't:
- Forgetting to encode the bloom bitmap as little-endian on a big-endian port.
- Using the host integer width for length prefixes instead of `u32`.
- Iterating a hash map at any point in `merge_run` (non-deterministic order across languages and across runs).
- Encoding the ratio as `"0.5"` instead of the IEEE bit pattern.
- Compacting via "longest run found so far that satisfies threshold at the time of finding", instead of evaluating all runs and picking the global longest.
- Off-by-one in `b = a + 1 + (r3 mod (keys-a))` for the range tombstone end key.
5. Reproducibility Bar
- macOS arm64, AppleClang 16, Go 1.22, Rust stable (`rustc 1.7x`).
- No external dependencies (no `sha2` crate, no `golang.org/x/...`, no OpenSSL): every implementation is self-contained, so the verification step is reproducible offline.
- All three hashes are pinned in `scripts/cross_test.sh` and reproduced in this document for paper-trail purposes.
Broader Ideas — db-21
A short scrapbook of "what would I add next if this were a real engine?"
1. Tombstone Garbage Collection
Right now a range tombstone lives forever: it survives every compaction and is copied verbatim into the merged output. A production engine drops a tomb once it is certain no shadowed Puts remain below it. A sufficient check: every SST below it lies entirely outside `[start_key, end_key)`, i.e. `end_key ≤ smallest_key` or `start_key > largest_key`. Implementing this would require tracking the "sequence number" or generation of each record, which we deliberately omitted.
2. Coalescing Overlapping Tombstones
Two tombs [k0, k5) and [k3, k7) are equivalent to [k0, k7). Merging
them at compaction time shrinks the per-Get cost (the active vector stays
smaller). We didn't do it because the canonical-bytes test would then need
to specify a normalisation policy (sort by start? coalesce overlaps?
coalesce adjacencies?). Each choice is fine, but the choice itself becomes
part of the wire format.
3. Multi-Level Layout
The lab keeps SSTs as a flat list. RocksDB has L0 (overlapping ranges allowed, newest writes land here) plus L1..Ln (each level non-overlapping, ratio'd in size). Universal compaction roughly corresponds to a degenerate "L0 only" mode, while leveled compaction is its own beast (each compaction picks one L_i SST and the L_{i+1} SSTs that overlap it). A natural follow-up would implement leveled compaction and re-run the same three fixtures with new hashes.
4. Bloom Quality
A 64-bit single-hash bloom is intentionally bad — it exists to make the test for "bloom misses still must walk older SSTs for range tombstones" trigger reliably on tiny inputs. Real engines use per-SST blooms sized to ~10 bits per key with k≈7 hash functions, giving a false-positive rate ~1%. The change is purely numeric; the wire format would absorb a longer bitmap as a length-prefixed byte string.
5. Snapshot Reads / MVCC
If each entry carried a seq: u64, Get(key, at_seq) would walk SSTs the
same way but only consider entries with entry.seq ≤ at_seq. Range
tombstones would gain a seq too. The merge step would need to keep older
versions until they're below the oldest live snapshot.
6. Why Not Implement These Now?
Each one would multiply the size of the wire format and the surface area of the cross-language tests. The lab's claim is that two ideas (range tombstones, two compaction policies) are enough to stress-test cross- language byte equivalence to the point of being convincing. Adding a third without first writing it down somewhere else would dilute the lesson.
Step 01 — Range Tombstones
Goal
Implement a single record that logically deletes every key in [start, end)
without writing one Delete per key, and prove the priority rules with two
adversarial tests.
What to Build
- A `RangeTomb { start_key, end_key }` value type with a `covers(key)` predicate (`key >= start && key < end`).
- An `Sst` that carries a `Vec<RangeTomb>` alongside its `Vec<Entry>`.
- `LsmTreeAdvanced::get`, which walks SSTs newest → oldest, accumulating active tombstones into a local vector as it goes.
The Two Rules That Matter
- A range tombstone hides keys older than it — i.e. in SSTs that appear later in the newest-first walk.
- A range tombstone does not hide keys newer than it — i.e. in SSTs that appear earlier in the walk.
The Two Tests That Pin Them
- `range_tomb_hides_older_put`: newer SST has tomb `[k0, k5)`, older SST has `Put(k3, "hello")`. `get("k3")` must return `None`.
- `range_tomb_respects_newer_put`: newest SST has `Put(k3, "fresh")`, middle SST has tomb `[k0, k5)`, oldest SST has `Put(k3, "stale")`. `get("k3")` must return `Some("fresh")`.
Subtlety: Bloom Misses
When the bloom of an SST says "key not here", you cannot return early
from get. The skipped SST might contribute a range tombstone that would
shadow something below. So a bloom miss continues the walk; only a
range-tombstone match early-exits with None.
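The walk, including the bloom-miss rule, can be sketched as below. This is an illustration under stated assumptions, not the lab's code: `bloom_mask` is a deliberately trivial stand-in for the real FNV-1a bloom, and `build_sst` is a hypothetical helper that only computes bounds and the bloom word.

```rust
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

impl Entry {
    fn key(&self) -> &[u8] {
        match self {
            Entry::Put(k, _) | Entry::Delete(k) => k,
        }
    }
}

struct RangeTomb {
    start: Vec<u8>,
    end: Vec<u8>,
}

impl RangeTomb {
    /// Half-open interval: start <= key < end.
    fn covers(&self, key: &[u8]) -> bool {
        key >= self.start.as_slice() && key < self.end.as_slice()
    }
}

struct Sst {
    smallest: Vec<u8>,
    largest: Vec<u8>,
    entries: Vec<Entry>,
    range_tombstones: Vec<RangeTomb>,
    bloom: u64,
}

/// Stand-in bloom (ASSUMPTION): one bit chosen by the byte sum of the key.
fn bloom_mask(key: &[u8]) -> u64 {
    let sum: u32 = key.iter().map(|&b| u32::from(b)).sum();
    1u64 << (sum % 64)
}

/// Hypothetical helper: derive bounds and bloom from the entries.
fn build_sst(entries: Vec<Entry>, range_tombstones: Vec<RangeTomb>) -> Sst {
    let mut smallest: Vec<u8> = Vec::new();
    let mut largest: Vec<u8> = Vec::new();
    let mut bloom = 0u64;
    for (i, e) in entries.iter().enumerate() {
        let k = e.key();
        if i == 0 || k < smallest.as_slice() {
            smallest = k.to_vec();
        }
        if i == 0 || k > largest.as_slice() {
            largest = k.to_vec();
        }
        bloom |= bloom_mask(k);
    }
    Sst { smallest, largest, entries, range_tombstones, bloom }
}

/// `ssts` is ordered newest first.
fn get(ssts: &[Sst], key: &[u8]) -> Option<Vec<u8>> {
    let mut active: Vec<&RangeTomb> = Vec::new();
    for sst in ssts {
        active.extend(sst.range_tombstones.iter());
        if active.iter().any(|t| t.covers(key)) {
            return None; // a tombstone at least as new as this SST hides the key
        }
        if (sst.bloom & bloom_mask(key)) == 0 {
            continue; // bloom miss: skip entries, but keep walking older SSTs
        }
        if key < sst.smallest.as_slice() || key > sst.largest.as_slice() {
            continue; // out of bounds: same reasoning as a bloom miss
        }
        for e in &sst.entries {
            match e {
                Entry::Put(k, v) if k.as_slice() == key => return Some(v.clone()),
                Entry::Delete(k) if k.as_slice() == key => return None,
                _ => {}
            }
        }
    }
    None
}
```

Note that the tombstones of an SST are added to `active` before its own entries are scanned, mirroring the order described in the text.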
Done When
- Both tests above pass in all three languages.
- The `range_tombstones` are present in `dump_state` per Section 4 of `docs/execution.md`, and the three canonical fixtures still match.
Step 02 — Tiered and Universal Compaction
Goal
Implement two compaction policies as deterministic functions of the
current SST sequence and the configured ratio, so that the same input
list of SSTs always produces the same output list.
Size-Tiered
Pick the longest prefix L ∈ [2, n-1] of ssts such that
Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]).
    chosen = 0
    prefix_sum = 0
    for L in 1..=n-1:
        prefix_sum += size(ssts[L-1])
        if L >= 2 and prefix_sum <= ratio * size(ssts[L]):
            chosen = L
    if chosen < 2: return false
    merged = merge_run(ssts[0..chosen])
    ssts = [merged] ++ ssts[chosen..]
    return true
The merged SST goes at the newest position, because that's where the newly-merged data conceptually lives.
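The selection half of size-tiered compaction can be sketched as a pure function over SST sizes, which is enough to unit-test the policy without building real SSTs. `pick_size_tiered` is a hypothetical name; the merge and splice steps are left out.

```rust
/// Returns the chosen prefix length L (>= 2), or None when no prefix
/// qualifies. `sizes` is ordered newest first, like `ssts`.
fn pick_size_tiered(sizes: &[usize], ratio: f64) -> Option<usize> {
    let n = sizes.len();
    let mut chosen = 0;
    let mut prefix_sum = 0usize;
    // l ranges over 1..=n-1 so that sizes[l] (the SST after the prefix) exists.
    for l in 1..n {
        prefix_sum += sizes[l - 1];
        if l >= 2 && (prefix_sum as f64) <= ratio * (sizes[l] as f64) {
            chosen = l; // keep going: a longer qualifying prefix wins
        }
    }
    if chosen >= 2 {
        Some(chosen)
    } else {
        None
    }
}
```

With `ratio = 0.5`, sizes `[1, 1, 1, 100]` select `L = 3`, and `[10, 1, 1, 100]` also select `L = 3` even though the length-2 prefix does not qualify, because the loop never stops at the first failure.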
Universal
Pick the longest contiguous run [i, i+L) with L ≥ 3 such that
Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run must be followed by
something at least 1/ratio times its total size). Ties broken by smaller
i.
    best_i, best_L = -1, 0
    for i in 0..n:
        if i + 3 >= n: break
        run_sum = 0
        for L in 1..=n-1-i:
            run_sum += size(ssts[i+L-1])
            follow = i + L
            if follow >= n: break
            if L >= 3 and run_sum <= ratio * size(ssts[follow]):
                if L > best_L: best_i, best_L = i, L
    if best_L == 0: return false
    merged = merge_run(ssts[best_i..best_i+best_L])
    ssts = ssts[..best_i] ++ [merged] ++ ssts[best_i+best_L..]
    return true
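As with size-tiered, the run selection can be sketched as a pure function over sizes. `pick_universal` is a hypothetical name and returns the `(start, length)` of the chosen run; merging and splicing are omitted.

```rust
/// Returns (i, L) for the longest qualifying run [i, i+L) with L >= 3,
/// preferring the smallest i on ties. `sizes` is ordered newest first.
fn pick_universal(sizes: &[usize], ratio: f64) -> Option<(usize, usize)> {
    let n = sizes.len();
    let mut best: Option<(usize, usize)> = None;
    for i in 0..n {
        if i + 3 >= n {
            break; // no room left for a run of 3 plus a follower
        }
        let mut run_sum = 0usize;
        // l <= n-1-i guarantees the follower index i + l is in bounds.
        for l in 1..(n - i) {
            run_sum += sizes[i + l - 1];
            let follow = i + l;
            if l >= 3 && (run_sum as f64) <= ratio * (sizes[follow] as f64) {
                // Strict '>' keeps the earliest start among equal lengths.
                if best.map_or(true, |(_, best_l)| l > best_l) {
                    best = Some((i, l));
                }
            }
        }
    }
    best
}
```

Every `(i, L)` pair is evaluated before a winner is reported, which is the "global longest" behaviour the verification notes call out.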
Merge Semantics (shared by both)
Walk the run newest → oldest:
- Append the SST's range tombs to `out_tombs`.
- For each entry:
  - skip if `seen[key]` (newer wins),
  - skip if covered by any tomb in `active`,
  - otherwise emit; mark `seen`.
- Append the SST's range tombs to `active` (so they apply to older SSTs in the run).
After the walk, sort out_entries by key (for determinism across hash-map
iteration orders) and recompute smallest, largest, bloom.
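The merge walk can be sketched as follows. The `Run` struct is a hypothetical stand-in for an SST that carries only entries and tombstones; recomputing `smallest`, `largest`, and the bloom is omitted to keep the sketch focused on the newer-wins and tombstone rules.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

impl Entry {
    fn key(&self) -> &[u8] {
        match self {
            Entry::Put(k, _) | Entry::Delete(k) => k,
        }
    }
}

#[derive(Clone, Debug, PartialEq)]
struct RangeTomb {
    start: Vec<u8>,
    end: Vec<u8>,
}

impl RangeTomb {
    fn covers(&self, key: &[u8]) -> bool {
        key >= self.start.as_slice() && key < self.end.as_slice()
    }
}

/// Hypothetical stand-in for an SST without bounds or bloom.
struct Run {
    entries: Vec<Entry>,
    range_tombstones: Vec<RangeTomb>,
}

/// `run` is ordered newest first.
fn merge_run(run: &[Run]) -> Run {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut active: Vec<RangeTomb> = Vec::new();
    let mut out_entries: Vec<Entry> = Vec::new();
    let mut out_tombs: Vec<RangeTomb> = Vec::new();
    for sst in run {
        out_tombs.extend(sst.range_tombstones.iter().cloned());
        for e in &sst.entries {
            let k = e.key();
            if seen.contains(k) {
                continue; // a newer SST already supplied this key
            }
            if active.iter().any(|t| t.covers(k)) {
                continue; // hidden by a tombstone from a newer SST
            }
            seen.insert(k.to_vec());
            out_entries.push(e.clone());
        }
        // Only now do this SST's tombstones start applying to older SSTs.
        active.extend(sst.range_tombstones.iter().cloned());
    }
    // The set is never iterated; sorting the output keeps bytes deterministic.
    out_entries.sort_by(|a, b| a.key().cmp(b.key()));
    Run { entries: out_entries, range_tombstones: out_tombs }
}
```

The `HashSet` is only ever probed, never iterated, so its randomised order cannot leak into the output.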
Why the Minimum Lengths
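- Tiered's `L ≥ 2` keeps it from being "merge one SST with nothing", which would just rewrite the SST.
- Universal's `L ≥ 3` is this lab's choice; RocksDB exposes the equivalent knob as `min_merge_width`, which defaults to 2. Smaller runs occur too often to amortise the I/O, and the stricter minimum keeps the two policies distinguishable on small inputs.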
Done When
- `tiered_picks_prefix` passes (size-tiered selects the 3-small-SST prefix in front of a big SST and produces a 2-SST result).
- `universal_picks_run` passes (universal selects the 3-small run between two big SSTs).
- `noop_compaction` passes (both policies return `false` on a 2-SST tree).
- All three canonical fixtures still match after this step.
Step 03 — Cross-Language Byte Equivalence
Goal
Prove that the Rust, Go, and C++ implementations produce byte-identical canonical wire dumps for three fixed workloads.
Why This Is The Whole Point
API-level test parity is cheap and weak. "Same input → same hash of a canonical binary dump" is strong: any per-language drift (endian, integer width, map-iteration order, float formatting) surfaces as a hash mismatch on the next run.
The Format (one canonical source)
See docs/execution.md Section 4. Two-line summary:
- Magic `"DSEADV21"` ‖ `f64 LE ratio` ‖ `u32 LE sst_count`.
- Per SST (newest first): bounds (lenpref) ‖ entries (`u8 kind` + lenpref key + maybe lenpref value) ‖ range tombs ‖ `u64 LE` bloom bitmap.
The Workload (one canonical source)
See docs/execution.md Sections 1-3. Two-line summary:
- SplitMix64 PRNG, 3 draws per op, `(r1 >> 62) & 3` chooses Put / Put / Delete / RangeTomb.
- Flush every 8 ops, compact every 16. No residue flush at end.
The Three Fixtures
| Fixture | seed | ops | keys | scenario |
|---|---|---|---|---|
| A | 42 | 200 | 32 | tieredcompact |
| B | 7 | 500 | 64 | universalcompact |
| C | 99 | 300 | 16 | withrange |
Hashes are pinned in scripts/cross_test.sh and reproduced in
docs/execution.md Section 5 and docs/verification.md Section 3.
Done When
./scripts/cross_test.sh
# ... ends with ...
=== ALL OK ===
If it doesn't, the diff between two implementations' dumps is the debugging artefact. Decode the first ~16 bytes to confirm magic + ratio, then walk SSTs one at a time — each SST is self-delimiting.
What To Do When A Hash Drifts
- Recapture from Rust. If you intentionally changed semantics, the Rust reference dictates the new canonical hashes; update both `scripts/cross_test.sh` and `docs/execution.md` Section 5.
- Hunt the drift. If you didn't intend to change anything, diff the raw `dump_state` bytes between the failing pair. The first differing byte tells you where in the format the bug lives. Common culprits: forgot LE, used `usize` instead of `u32`, iterated a hash map.
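Hunting the drift starts with finding the first differing byte. A minimal sketch, assuming the two dumps are already in memory as byte slices; `first_diff` and `region` are hypothetical helper names, and `region` only classifies the fixed 20-byte header described in `docs/execution.md` Section 4.

```rust
/// Index of the first byte where the two dumps disagree, or None if equal.
/// A length mismatch counts as a difference at the shorter length.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    for i in 0..common {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}

/// Name the header field an offset falls into; everything from byte 20 on
/// belongs to the variable-length SST records.
fn region(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "ratio",
        16..=19 => "sst_count",
        _ => "sst body",
    }
}
```

A difference inside `ratio` usually points at float encoding, one inside `sst_count` at compaction selection, and anything later requires walking the SST records one at a time.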
db-22 — Performance and Benchmarking
Why this lab exists
Benchmarks lie. They mostly lie because a benchmark answers a different question than the one you thought you were asking. db-22 is a small, self-contained system whose only purpose is to be measured: a keyed in-memory counter store driven by a deterministic synthetic workload. We freeze a wire format and a workload, hash the resulting state across three implementations (Rust, Go, C++), and use the resulting binary identity as the load-bearing definition of "the same program."
Once correctness is cross-language identical, we can compare throughput of the three implementations on the same hardware honestly — and we can talk about what does and does not constitute a fair comparison.
The data structure: CounterStore
A CounterStore is a mapping i64 -> u64 (counters) plus a single
total_ops: u64 running counter. Three operations:
- `incr(k, by)`: `total_ops += 1`, `counters[k] += by` (entry created with value `by` if missing).
- `decr(k, by)`: `total_ops += 1`. If `k` is missing the call has no further effect (`total_ops` was already incremented). Otherwise:
  - if `current <= by`, the entry is removed (saturating decrement);
  - else `counters[k] = current - by`.
- `get(k) -> Option<u64>`: live lookup, returns `None` if absent.
There are no tombstones. Removed counters leave no trace in the snapshot. This is intentional and matters: it makes the snapshot a pure function of the final live state, not of the history of operations.
The semantic that total_ops is bumped on every call (including no-op
decr on missing) is the simplest invariant and is the one against which
all golden hashes were computed. Changing it would change every hash.
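The three operations can be sketched in Rust as follows. This is an illustrative sketch of the semantics described above, not the lab's reference source; in particular, wrapping on u64 overflow in incr is our assumption, since the text does not say what happens there.

```rust
use std::collections::BTreeMap;

/// Sketch of the store described above: ordered counters plus an op counter.
#[derive(Default)]
struct CounterStore {
    counters: BTreeMap<i64, u64>,
    total_ops: u64,
}

impl CounterStore {
    fn incr(&mut self, k: i64, by: u64) {
        self.total_ops += 1;
        // Assumption: overflow past u64::MAX wraps; the text does not specify it.
        let slot = self.counters.entry(k).or_insert(0);
        *slot = slot.wrapping_add(by);
    }

    fn decr(&mut self, k: i64, by: u64) {
        self.total_ops += 1; // bumped even when `k` is missing
        if let Some(current) = self.counters.get(&k).copied() {
            if current <= by {
                self.counters.remove(&k); // saturating: no tombstone left behind
            } else {
                self.counters.insert(k, current - by);
            }
        }
    }

    fn get(&self, k: i64) -> Option<u64> {
        self.counters.get(&k).copied()
    }
}

fn main() {
    let mut s = CounterStore::default();
    s.incr(1, 5);
    s.decr(1, 3); // 5 - 3 = 2
    s.decr(1, 100); // 2 <= 100, entry removed
    s.decr(42, 1); // missing key: only total_ops changes
    println!("get(1)={:?} total_ops={}", s.get(1), s.total_ops);
}
```

The subtraction only runs in the branch where current > by has been established, so it cannot underflow.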
Wire format: dump_snapshot
The snapshot is a function CounterStore -> bytes whose output must be
byte-identical across all three implementations.
offset size field
------ ---- ---------------------------------------------
0 8 magic "DSEBENCH" (ASCII)
8 8 total_ops (u64 little-endian)
16 4 distinct_keys (u32 little-endian)
20+ 16 one row per key, ascending by key:
- 8 bytes: key (i64 little-endian)
- 8 bytes: count (u64 little-endian)
Ordering is the only subtle bit. Rust uses BTreeMap, whose iteration is
naturally ascending. Go uses a plain map and explicitly sorts the keys
before emitting. C++ uses std::map, also ascending. All three converge
on the same byte sequence.
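A minimal Rust sketch of a serializer for this layout is shown below. It is written as a free function for brevity; that signature and the MAGIC constant name are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

const MAGIC: &[u8; 8] = b"DSEBENCH";

/// Serializes (total_ops, counters) using the layout in the table above.
/// Free function for brevity; a hypothetical stand-in for a method on the store.
fn dump_snapshot(total_ops: u64, counters: &BTreeMap<i64, u64>) -> Vec<u8> {
    let mut out = Vec::with_capacity(20 + 16 * counters.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&total_ops.to_le_bytes());
    // Assumption: the key count fits in u32; fail loudly rather than truncate.
    assert!(counters.len() <= u32::MAX as usize, "more than u32::MAX keys");
    out.extend_from_slice(&(counters.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, so no explicit sort is needed.
    for (k, v) in counters {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let mut counters = BTreeMap::new();
    counters.insert(2_i64, 20_u64); // inserted out of order on purpose
    counters.insert(1_i64, 10_u64);
    let bytes = dump_snapshot(2, &counters);
    println!("{} bytes", bytes.len()); // 20-byte header + 2 rows of 16 bytes
}
```

Writing the length as an explicit u32 rather than the platform's usize keeps the header at 20 bytes on every target.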
The workload: deterministic by construction
We need a workload that:
- Is identical across languages.
- Exercises a mix of insert / mutate / delete to produce a non-trivial end state.
- Can be scaled in both ops and keys.
We use SplitMix64 for randomness. It is small, fast, has trivially
portable arithmetic (just u64 adds, shifts, multiplies, and xors), and
needs no library. The constants and step function are well-known:
state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)
Each workload iteration draws exactly three u64 words. Drawing the
same number every iteration is what keeps the RNG stream identical across
languages even when one branch is short and another is long.
r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 62) & 0x3 # 0,1,2 → Incr ; 3 → Decr (3:1 ratio)
k = (r2 % keys) as i64
by = (r3 % 100) + 1 # 1..=100
Three-to-one incr:decr means the counter map grows in expectation, but
with keys small relative to ops the map fills up and the decrement
path actually deletes entries — both code paths get exercised in any
non-trivial run.
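The mapping from three RNG words to one operation is a pure function, which makes it easy to test on its own. The Rust sketch below takes the three words as arguments rather than drawing them; the Op enum and decode_op name are hypothetical.

```rust
/// One decoded workload operation.
#[derive(Debug, PartialEq)]
enum Op {
    Incr { k: i64, by: u64 },
    Decr { k: i64, by: u64 },
}

/// Maps three already-drawn RNG words to an operation, following the rule above.
/// `keys` must be non-zero. Assumption: `keys` is at most i64::MAX, so the
/// `u64 -> i64` cast never wraps.
fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64) -> Op {
    let kind = (r1 >> 62) & 0x3; // top two bits: 0, 1, 2 -> Incr; 3 -> Decr
    let k = (r2 % keys) as i64; // modulus on u64 first, then a single cast
    let by = (r3 % 100) + 1; // 1..=100
    if kind == 3 {
        Op::Decr { k, by }
    } else {
        Op::Incr { k, by }
    }
}

fn main() {
    // Hand-picked words rather than real RNG output, so the result is easy to check.
    let op = decode_op(3 << 62, 37, 199, 32);
    println!("{op:?}"); // top bits 0b11 -> Decr; 37 % 32 = 5; 199 % 100 + 1 = 100
}
```

Because all three words are parameters, the caller has to draw them before the branch is known, which is exactly the discipline the text asks for.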
Two frozen scenarios
| Scenario | seed | ops | keys | SHA-256 of snapshot |
|---|---|---|---|---|
| A | 42 | 500 | 32 | 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015 |
| B | 7 | 5000 | 256 | 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1 |
scripts/cross_test.sh builds all three binaries and asserts that the
hashes match each other and these golden values.
Common cross-language divergence sources (and how we avoid them)
- Map iteration order. We never iterate HashMap. We sort keys explicitly (Go) or use BTreeMap / std::map (Rust, C++).
- Integer promotion in shifts. u64-only arithmetic. No mixed signed/unsigned shifts.
- % semantics for negative operands. r2 is u64; modulus and cast to i64 happen exactly once and in the same order.
- size_t width. We only put u32 / u64 on the wire, never size_t directly.
- Trailing whitespace / newlines in CLI output. hash prints the hex with no trailing newline. bench writes its line to stderr so it can never pollute stdout that a script might be hashing.
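For the first item, this is roughly what "sort keys explicitly" looks like when the backing store is a hash map. The sketch is in Rust for consistency with the other examples; it mirrors the approach described for the Go implementation rather than reproducing it, and sorted_rows is a hypothetical name.

```rust
use std::collections::HashMap;

/// Returns the rows of a hash map in ascending key order.
/// `HashMap` iteration order is unspecified, so the keys are sorted explicitly.
fn sorted_rows(counters: &HashMap<i64, u64>) -> Vec<(i64, u64)> {
    let mut rows: Vec<(i64, u64)> = counters.iter().map(|(&k, &v)| (k, v)).collect();
    rows.sort_unstable_by_key(|&(k, _)| k); // keys are unique, so unstable sort is safe
    rows
}

fn main() {
    let mut counters = HashMap::new();
    for (k, v) in [(7_i64, 70_u64), (-3, 30), (0, 1)] {
        counters.insert(k, v);
    }
    // Sorting compares keys as signed integers, so -3 comes first.
    println!("{:?}", sorted_rows(&counters));
}
```

The sort is on the signed key value, not on its encoded bytes; sorting little-endian byte strings would give a different order for negative keys.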
Bench methodology
benchctl bench runs a short warm-up (ops/10 + 1) to pull pages and
populate caches, then a single timed pass over ops calls. It prints
ops, keys, elapsed_us, ops_per_sec, and distinct (the number
of live counters at the end) to stderr.
This is intentionally crude — the workload is a single thread doing
in-memory map operations. It is good enough for "is the Rust build
twice as fast as the Go build?" type questions; it is not a
microbenchmark replacement for criterion / go test -bench / Google
Benchmark. The references in references.md cover the deeper rabbit
hole.
What you actually learn from this lab
- Why a benchmark needs to fix a deterministic workload before it fixes a metric.
- Why "the same program in two languages" needs a binary equality test, not a "looks the same" code review.
- Why bench harnesses must warm up, isolate stdout/stderr, and avoid hidden allocations inside the timed region.
- Why HashMap iteration order is a footgun for portable wire formats.
References — db-22
Primary sources on benchmarking
- Brendan Gregg. Systems Performance: Enterprise and the Cloud, 2nd ed., Addison-Wesley, 2020. The canonical modern reference. Chapter 12 ("Benchmarking") is required reading; the "active benchmarking" methodology and the catalog of common mistakes (cold-cache effects, the wrong saturation point, the wrong unit) frame the entire lab.
- Brendan Gregg. BPF Performance Tools, Addison-Wesley, 2019. Less directly relevant here but the right book if you want to observe what your benchmark is actually doing on a Linux box.
- Gil Tene. "How NOT to Measure Latency." Strange Loop 2015. The "coordinated omission" talk. Even on an in-memory benchmark like this one, the principle generalizes: the metric you report has to match the question the user is asking. We intentionally report ops_per_sec, not p99 latency, because a single-threaded synchronous loop does not have an interesting tail.
- Bryant & O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2015. Chapter 5 ("Optimizing Program Performance") and Chapter 9 ("Virtual Memory") supply the "always measure one level deeper" instinct used throughout the docs.
Determinism, RNGs, and reproducible benchmarks
- Sebastiano Vigna. "An experimental exploration of Marsaglia's xorshift generators, scrambled." ACM TOMS, 2014. SplitMix64 and friends. Justification for using SplitMix64 here: it has trivially portable arithmetic and a well-defined byte-identical output across languages.
- Guy Steele, Doug Lea, Christine Flood. "Fast Splittable Pseudorandom Number Generators." OOPSLA 2014. The paper that introduced SplitMix.
Microbenchmarking pitfalls (per-language)
- Andrey Akinshin. Pro .NET Benchmarking, Apress, 2019. Despite the .NET framing, chapters 1–4 are language-agnostic gold: warm-up, steady state, the dead-code-elimination trap, JIT vs AOT timing.
- Aleksey Shipilëv. "JMH samples" and his "Nanotrusting the Nanotime" blog post. Java-specific but the lessons are universal, particularly the discussion of System.nanoTime resolution traps, which apply equally to std::chrono::steady_clock and Go's time.Now().
- Rust: criterion documentation, especially the section on outlier detection.
- Go: the testing package's Benchmark docs and Dave Cheney's "Five things that make Go fast".
- C++: Google Benchmark and Chandler Carruth's CppCon talk "Tuning C++".
Cross-language byte-equality engineering
- The Cap'n Proto encoding spec. A worked example of a wire format designed for cross-language stability. We do not use Cap'n Proto here, but its constraints (fixed-width little-endian, no sentinel ordering ambiguity, no implicit string normalization) are the same constraints we impose on dump_snapshot.
- Go issue #7986: map iteration is intentionally randomized. Read the issue and the surrounding discussion; this is the canonical worked example of why a portable wire format may never iterate a hash map without an explicit sort.
Background reading on what "fast" means
- Latency Numbers Every Programmer Should Know (the Peter Norvig / Jeff Dean table). Internalize the ratios. The point of the bench harness is to put your numbers somewhere on this chart.
- Ulrich Drepper. "What Every Programmer Should Know About Memory." Long and old but still the right tour of the memory hierarchy your bench is actually hitting.
Analysis — db-22
What we are actually trying to do
The brief was "performance and benchmarking." That is a topic, not a problem statement. The first design pass turned it into a problem statement:
Build a tiny system that has one knob (a deterministic workload) and one measurable property (throughput on that workload), then implement it in three languages and use byte-identical correctness as the precondition for any speed claim.
Everything else in the lab follows from that constraint.
Constraints I started with
- Three languages must produce the same bytes for the same inputs. Without this, "Rust is faster than Go on this workload" is unfalsifiable — they might just be doing different work.
- No external dependencies for the core data structure. SHA-256 has to be reimplementable from scratch in C++ (no OpenSSL), SplitMix64 has to be reimplemented in all three. This is the same discipline used in db-15 and is the only way to guarantee bit-identity.
- The workload must be small enough to brute-force-test for determinism, but large enough that a 1% difference in implementation efficiency shows up in the bench numbers. The two scenarios (500 ops / 32 keys and 5000 ops / 256 keys) bracket this.
- The bench harness must not contaminate the correctness harness. Throughput numbers go to stderr; the hex hash goes to stdout with no trailing newline. A shell script can capture it cleanly with $(benchctl hash ...).
Data structure choice: counter store, not KV store
I considered an MVCC KV store (like db-15), a small B-tree, or even
reusing db-20's distributed KV. All three are overkill for what we want
to measure here. An i64 -> u64 counter store:
- Is small enough to fit in ~400 LOC per language.
- Is big enough to exercise the map implementations of each language (BTreeMap, map, std::map).
- Has interesting cross-language failure modes (HashMap iteration order, signed/unsigned subtraction, integer-width truncation in serialization).
- Has a workload that genuinely creates and destroys entries, so the map's resize / rebalance / erase code paths all execute.
The saturating-decrement decision
The choice about what decr(k, by) does when by > current or when k
is missing is the most consequential semantic decision in the lab. I
considered three options:
- No-op on missing, total_ops unchanged. Cleaner in some ways but makes total_ops a partial counter: you cannot replay the operation stream and recover the same total_ops. Rejected.
- Underflow / panic on negative. Would force the workload generator to remember which keys are live, defeating the determinism. Rejected.
- Saturating: bump total_ops, then either remove the entry or subtract. total_ops always tracks the operation stream. Snapshots are pure functions of the final state, not the operation history. This is what we picked.
The cost is that "decrement past zero" is silently lossy. For a benchmark workload that is the right trade.
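At the value level, the chosen rule collapses to a single expression. The Rust sketch below uses checked_sub; after_decr is a hypothetical helper name, with None standing for "remove the entry".

```rust
/// Value-level view of the chosen rule: `Some(rest)` keeps the entry,
/// `None` means the entry is removed. `after_decr` is a hypothetical helper name.
fn after_decr(current: u64, by: u64) -> Option<u64> {
    // checked_sub is None when by > current; the filter also drops an exact zero,
    // which together matches "remove if current <= by".
    current.checked_sub(by).filter(|&rest| rest != 0)
}

fn main() {
    for (current, by) in [(5_u64, 3_u64), (5, 5), (5, 100)] {
        println!("current={current} by={by} -> {:?}", after_decr(current, by));
    }
}
```

Note that by == current and by > current are indistinguishable afterwards, which is the "silently lossy" cost mentioned above.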
Why three RNG draws per iteration
A subtle correctness footgun: if some branches consume fewer RNG words
than others, the RNG stream diverges from a different implementation
that happens to evaluate the branches in a different order. Drawing all
three words before branching means every iteration consumes the same
number of RNG bytes regardless of kind. This makes the workload trivially
portable.
SplitMix64 over xoshiro / pcg / etc.
SplitMix64 has the smallest state (one u64) and the simplest step
(one add, two multiplies, three xors, three shifts). Its only
operations are 64-bit integer ops that all three languages handle
identically with no surprises. Anything fancier is a footgun for
cross-language byte-equality with no upside for a synthetic workload.
Wire format design notes
Little-endian everywhere. ASCII magic so a hexdump -C is human-readable.
Length prefix (distinct_keys) so a reader could in principle parse the
snapshot incrementally — we never actually do this in the lab, but the
format is forward-compatible.
We do not embed keys or ops or seed in the snapshot. The
snapshot is purely about the resulting state; the workload that produced
it is metadata.
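To make the length-prefix point concrete, here is a minimal Rust sketch of a reader for the format. The lab itself only hashes the bytes, so parse_snapshot and its error strings are hypothetical.

```rust
use std::convert::TryInto;

/// Parsed form of a snapshot: total_ops plus (key, count) rows in file order.
type Parsed = (u64, Vec<(i64, u64)>);

/// Reads a snapshot front to back, using `distinct_keys` as the row count.
/// `parse_snapshot` is a hypothetical reader; the lab itself only hashes the bytes.
fn parse_snapshot(bytes: &[u8]) -> Result<Parsed, String> {
    if bytes.len() < 20 || &bytes[0..8] != b"DSEBENCH" {
        return Err("bad header".into());
    }
    let total_ops = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
    let n = u32::from_le_bytes(bytes[16..20].try_into().unwrap()) as usize;
    let body = &bytes[20..];
    if body.len() != n * 16 {
        return Err(format!("expected {} row bytes, found {}", n * 16, body.len()));
    }
    let rows = body
        .chunks_exact(16)
        .map(|row| {
            let k = i64::from_le_bytes(row[0..8].try_into().unwrap());
            let v = u64::from_le_bytes(row[8..16].try_into().unwrap());
            (k, v)
        })
        .collect();
    Ok((total_ops, rows))
}

fn main() {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"DSEBENCH");
    bytes.extend_from_slice(&2_u64.to_le_bytes()); // total_ops
    bytes.extend_from_slice(&1_u32.to_le_bytes()); // distinct_keys
    bytes.extend_from_slice(&(-4_i64).to_le_bytes());
    bytes.extend_from_slice(&9_u64.to_le_bytes());
    println!("{:?}", parse_snapshot(&bytes));
}
```

The reader rejects any input whose body length disagrees with distinct_keys, which is the "no trailing bytes" rule seen from the other side.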
Bench harness design
Four decisions:
- One pass, one timing region. No statistical machinery. The exercises that need percentiles or distributions go to criterion / go test -bench / Google Benchmark, not this harness.
- Warm-up sized as ops/10 + 1. A small warm-up proportional to the run (the + 1 handles ops < 10) that pulls cache lines and triggers allocator first-touch. Empirically this stabilizes the second-pass timing to within a few percent run-to-run.
- Bench output to stderr. Lets benchctl bench and benchctl hash use the same flag layout and lets shell scripts redirect them independently.
- distinct in the output. It's a sanity check: if the bench reports distinct=0, your workload is collapsing entries faster than it creates them and your throughput number is measuring deletes, not inserts. (See observation.md for the actual numbers.)
What I'd do differently with more time
- Add a third scenario that intentionally has heavy contention on a single key (small keys, large ops) to make the bench numbers more sensitive to allocator behavior.
- Wire the bench harness to also produce a flamegraph hint (elapsed_us bucketed per operation kind).
- Add a --profile flag that runs the workload twice and reports the ratio, as a cheap "is this stable?" check.
Execution — db-22
How to run everything
# from db-22-performance-and-benchmarking/
bash scripts/verify.sh # runs Rust, Go, and C++ unit tests
bash scripts/cross_test.sh # builds 3 binaries, asserts byte-identical hashes
Both scripts end with === OK === or === ALL OK === respectively.
They exit non-zero on any mismatch.
Per-language invocation
Rust
cd src/rust
cargo test --release --lib tests # 9 tests
cargo build --release
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
The --release profile is important: the debug build is substantially
slower because Cargo's dev profile compiles at opt-level 0 with overflow
checks on, so the SplitMix64 and map code is neither inlined nor optimized.
Go
cd src/go
go test ./... # 9 tests
go build -o /tmp/benchctl_go ./cmd/benchctl
/tmp/benchctl_go hash workload --seed 42 --ops 500 --keys 32 --scenario default
/tmp/benchctl_go bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
C++
cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./test_db22 # 9 tests
./benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
The CMake file defaults to Release with -O3 -DNDEBUG. The test
binary #undefs NDEBUG so its assertions are not stripped.
What the scripts do, step by step
scripts/verify.sh
- cd into the lab root.
- Run the Rust unit tests under cargo test --release --lib tests. --lib restricts the run to the library crate's unit tests, and tests is a name filter that matches the tests module; without --lib, cargo also runs the binary and doc-test targets, each of which reports "running 0 tests".
- Run the Go unit tests with go test ./....
- Configure and build the C++ project under src/cpp/build and run ./test_db22.
- Print === OK === on success.
scripts/cross_test.sh
- Build all three release binaries.
- For each of two frozen scenarios, run benchctl hash workload … in all three implementations, capture stdout (no trailing newline), and compare:
  - The three implementations must agree with each other.
  - They must agree with the golden hash committed in the script.
- Print === ALL OK === on success.
CLI shape
benchctl hash workload --seed N --ops N --keys N --scenario S
benchctl bench workload --seed N --ops N --keys N --scenario S
- hash prints the SHA-256 hex digest of the final snapshot, with no trailing newline, on stdout.
- bench writes one line to stderr describing the run; its stdout is empty.
Both commands accept identical flags. --scenario is currently a
documentation tag — it does not change behavior but is reserved for
future workload variants.
Reproducing the frozen hashes
$ ./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
$ ./target/release/benchctl hash workload --seed 7 --ops 5000 --keys 256 --scenario default
5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
If you ever see a different hash:
- Did you change MAGIC, the wire format, the workload mixing rule, or SplitMix64? Any of those will move every hash.
- Did you change the decrement semantics? See analysis.md.
- Are you iterating a HashMap or unordered_map instead of a sorted structure? That gives a hash that varies run to run (Rust HashMap, Go map) or that is typically repeatable but not in key order (std::unordered_map).
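When two snapshots differ, the byte offset of the first difference already narrows the search. The Rust sketch below maps an offset onto the wire format's fields; describe_offset is a hypothetical debugging aid, not part of benchctl.

```rust
/// Names the snapshot field that contains byte `offset`, per the wire format
/// (8-byte magic, 8-byte total_ops, 4-byte distinct_keys, then 16-byte rows).
/// `describe_offset` is a hypothetical debugging aid, not part of benchctl.
fn describe_offset(offset: usize) -> String {
    match offset {
        0..=7 => "magic".to_string(),
        8..=15 => "total_ops".to_string(),
        16..=19 => "distinct_keys".to_string(),
        _ => {
            let row = (offset - 20) / 16;
            let field = if (offset - 20) % 16 < 8 { "key" } else { "count" };
            format!("row {row} {field}")
        }
    }
}

fn main() {
    for off in [3, 12, 17, 20, 45] {
        println!("{off}: {}", describe_offset(off));
    }
}
```

As a rough guide, a first difference inside total_ops points at operation counting, while one inside a row's key field usually points at ordering or key derivation.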
Observation — db-22
Cross-language hash check
All three implementations agree on the bytes:
=== scenario A ===
rust: 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
go : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
cpp : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
match + golden ok
=== scenario B ===
rust: 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
go : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
cpp : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
match + golden ok
Throughput probe (single representative run)
ops=100000 keys=1024 elapsed_us=7242 ops_per_sec=13806910 distinct=1024
About 13.8 million ops/sec for the Rust release build on a single thread,
single core, no contention, on an Apple Silicon laptop. distinct=1024
tells us the map is fully populated at the end of the run — the
increment-heavy mix means decrements rarely empty a slot at this
keys cardinality.
Read this as: each op costs roughly 70 nanoseconds, of which a chunk is
three SplitMix64 draws, a couple of map lookups, and the per-iteration
loop overhead. It is in the right ballpark for an in-memory
BTreeMap<i64, u64> workload.
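The per-op figure follows directly from the bench line. A small Rust sketch of the arithmetic, using the numbers from the sample run (the function names are ours):

```rust
/// Converts a bench line's `ops` and `elapsed_us` into nanoseconds per operation.
fn ns_per_op(ops: u64, elapsed_us: u64) -> f64 {
    (elapsed_us as f64 * 1_000.0) / ops as f64
}

/// Converts the same pair into operations per second.
fn ops_per_sec(ops: u64, elapsed_us: u64) -> f64 {
    ops as f64 / (elapsed_us as f64 / 1_000_000.0)
}

fn main() {
    // Numbers from the sample run above.
    let (ops, elapsed_us) = (100_000, 7_242);
    println!("{:.1} ns/op", ns_per_op(ops, elapsed_us));
    println!("{:.2} M ops/s", ops_per_sec(ops, elapsed_us) / 1e6);
}
```

Recomputing from the rounded elapsed_us gives a slightly higher rate than the reported ops_per_sec=13806910. That would be consistent with the harness dividing by an elapsed time measured at finer than microsecond resolution, though that is our inference rather than something the output states.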
What we are not measuring (and why that matters)
- No allocator pressure beyond the initial map growth. The map reaches steady state after ~keys distinct entries are touched, and the rest of the run is in-place mutation.
- No I/O, no syscalls, no real memory pressure. The whole working set fits in L2.
- No latency distribution. We report a single throughput number. For a single-threaded synchronous loop, p99 latency would just be a rephrasing of throughput plus a small jitter from the OS scheduler.
- No cross-language throughput numbers in this doc. You can collect them yourself with benchctl bench, but be honest about what you've measured (one machine, one moment, one workload).
Why the bench number is stable but not authoritative
The bench subcommand runs a small warm-up pass (ops/10 + 1) before
the timed pass. For a 100k-op run the warm-up is about 10k
operations, which is enough to pull all the map slots and K256 SHA
constants into the right caches. Without the warm-up the first pass is
~30% slower; with the warm-up, second-pass timings repeat to within a
few percent run-to-run.
This is still a crude harness. We are not collecting CPU counters, we are not pinning to a CPU, we are not disabling turbo, we are not controlling for thermal state. Use these numbers for ordering ("did this change make it faster or slower?") and not for absolute claims ("Rust does N nanoseconds per op on this machine").
Sanity checks that fire if you break things
- scenario_a_frozen / scenario_b_frozen: any change to wire format, mixing rule, or RNG step breaks both of these immediately.
- splitmix64_known: guards against accidental constant-swap in the SplitMix64 mixing function.
- sha256_vectors: guards against accidental damage to the SHA implementation in any language.
- snapshot_layout_two_keys: pins the exact byte layout of a trivial 2-key snapshot, so a wire-format change shows a tightly localized failure (not just "scenario A differs").
- workload_determinism: same seed/ops/keys gives the same bytes on two consecutive runs.
Verification — db-22
What "verified" means here
For a perf-and-bench lab, "verified" means three things at once:
- All three implementations pass their own unit tests (Rust 9, Go 9, C++ 9).
- All three implementations produce byte-identical snapshot hashes for both frozen scenarios.
- The frozen hashes match the golden values committed in source.
Anything less and the bench numbers are meaningless. You can't claim "Rust does X ops/sec on this workload" if it is not doing the same work as the Go and C++ versions.
How to verify
bash scripts/verify.sh
bash scripts/cross_test.sh
Each script exits non-zero on failure and prints either === OK === or
=== ALL OK === on success.
Expected last lines:
$ bash scripts/verify.sh
…
=== OK ===
$ bash scripts/cross_test.sh
…
=== ALL OK ===
What each unit test pins
| Test | Pins |
|---|---|
| sha256_vectors | SHA-256 against known empty and "abc" vectors |
| splitmix64_known | splitmix64(0) == 0x8b57dafca0cee644 |
| incr_accumulates | incr adds to existing entries, creates new ones, bumps total_ops |
| decr_saturates_and_removes | decrement past zero removes the entry |
| decr_on_missing_is_visible_op | decr on a missing key bumps total_ops but does not create the entry |
| snapshot_layout_two_keys | exact wire bytes of a 2-key snapshot |
| workload_determinism | same seed twice → same snapshot bytes |
| scenario_a_frozen / scenario_b_frozen | frozen golden hashes per scenario |
The frozen-scenario tests are the highest-value tests in the lab. Any silent change to the wire format, the workload, or SplitMix64 breaks both of them with a clear "got X, want Y" message in the failing language's test output.
Manual sanity checks
# bytes of the smallest meaningful snapshot
./target/release/benchctl hash workload --seed 0 --ops 0 --keys 1 --scenario default
# expected: sha256 of MAGIC || 0_u64 || 0_u32 = the empty-store hash
# determinism
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
# should print the same hex twice
What is not verified by these tests
- That bench reports the correct throughput. It is impossible to verify a wall-clock number from a test. The bench harness has a distinct= field as a structural sanity check, but the numeric throughput is left to the operator to inspect.
- That the implementations are equally fast; we only check they are equally correct. The whole point of the lab is to make speed comparisons honest by first making correctness identical.
- That the implementations would still match on a 32-bit or big-endian platform. The wire format pins little-endian; on a hypothetical big-endian build we'd need a byte-swap in put_u64_le etc.
Broader Ideas — db-22
The lab as it stands is a deliberately minimal harness. These are extensions that would build naturally on top of it.
A. Percentile-aware bench harness
Replace the single-pass timer with a per-operation timing loop that
collects a histogram (HDR-style) of per-op latencies. Then bench
reports p50 / p90 / p99 / p99.9 in addition to throughput. This is
where the Gil Tene "How NOT to Measure Latency" talk earns its keep —
even on a synchronous single-thread loop, a long-tail GC pause in Go or
a page fault in C++ will move the tail dramatically.
Trap to avoid: the cost of taking a timestamp per op (time.Now() /
std::chrono::steady_clock::now()) is itself ~30 ns on most boxes,
which is comparable to one workload op. You'd need to time batches
of ops and divide.
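A batch-timed variant might look like the Rust sketch below. It assumes a nearest-rank percentile over plain sorted samples rather than an HDR histogram, and the workload op is a stand-in; time_batches and percentile are hypothetical names.

```rust
use std::time::Instant;

/// Nearest-rank percentile of an ascending-sorted, non-empty slice; `p` in 0..=100.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Times `batches` batches of `batch_size` calls to `op` and returns the
/// per-op nanoseconds of each batch, sorted ascending.
fn time_batches(batches: usize, batch_size: u32, mut op: impl FnMut()) -> Vec<f64> {
    let mut samples = Vec::with_capacity(batches);
    for _ in 0..batches {
        let start = Instant::now(); // one timestamp pair per batch, not per op
        for _ in 0..batch_size {
            op();
        }
        samples.push(start.elapsed().as_nanos() as f64 / f64::from(batch_size));
    }
    samples.sort_by(|a, b| a.total_cmp(b));
    samples
}

fn main() {
    let mut acc = 0_u64;
    // Stand-in workload op; a real harness would call into the counter store.
    let samples = time_batches(200, 1_000, || {
        acc = std::hint::black_box(acc.wrapping_mul(6364136223846793005).wrapping_add(1));
    });
    for p in [50.0, 90.0, 99.0] {
        println!("p{p}: {:.2} ns/op", percentile(&samples, p));
    }
}
```

The trade-off is that each sample is a batch mean, so a single slow op is averaged over the whole batch; smaller batches sharpen the tail at the price of more timestamp overhead.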
B. Allocator pressure scenario
Add a third scenario whose workload is deliberately allocator-heavy:
short-lived strings as values (move from u64 to String), or a
churn pattern that constantly creates and removes keys so the map is
forced to resize. The cross-language throughput delta for this
scenario would be much larger than for the existing one, and the
results would speak to the maturity of each language's allocator.
C. Multi-threaded variant
Wrap CounterStore in a sync primitive and run N workers. The point
is not to demonstrate scaling — Mutex<BTreeMap<…>> won't scale —
but to demonstrate the difference between coarse locking, sharded
locking, and lock-free updates. Each language has different idioms here
(parking_lot vs std::sync, sync.Map vs atomic, std::shared_mutex
vs std::atomic), and the cross-language comparison becomes a
language-features comparison.
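As a starting point, the coarse-locking baseline could be sketched in Rust like this. It only exercises increments, and run_coarse and its key-selection rule are assumptions made for illustration.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;

/// Coarse-locking baseline: every worker takes one global lock per increment.
/// Returns the final map. `run_coarse` is a hypothetical name for illustration.
fn run_coarse(workers: u64, incrs_per_worker: u64, keys: u64) -> BTreeMap<i64, u64> {
    let store = Mutex::new(BTreeMap::<i64, u64>::new());
    thread::scope(|s| {
        for w in 0..workers {
            let store = &store;
            s.spawn(move || {
                for i in 0..incrs_per_worker {
                    let k = ((w + i) % keys) as i64;
                    // The lock is held only for the map update itself.
                    *store.lock().unwrap().entry(k).or_insert(0) += 1;
                }
            });
        }
    });
    store.into_inner().unwrap()
}

fn main() {
    let map = run_coarse(4, 10_000, 8);
    let total: u64 = map.values().sum();
    println!("distinct={} total={}", map.len(), total);
}
```

Increments commute, so the final map here does not depend on thread interleaving. Once saturating decrements are mixed in, the final state depends on scheduling, and the byte-identical hash check no longer applies without a deterministic schedule.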
D. Snapshot replay / log shipping
Right now dump_snapshot produces bytes that are only used for hashing.
Add a restore_snapshot and a small "log" of operations (just the
sequence of (op, k, by) triples), and you have a tiny replicated
store. Connect three nodes via a deterministic schedule and you have a
toy version of db-23.
E. Energy and not-time metrics
On Apple Silicon, powermetrics --samplers cpu_power can give you
energy per op. The relative energy of the Rust / Go / C++ implementations
on the same workload is a more honest "which is more efficient" claim
than throughput, because it folds in stalls, branch mispredictions, and
memory bandwidth.
F. Comparison with off-the-shelf benchmark frameworks
Run the same workload under criterion (Rust), go test -bench, and
Google Benchmark (C++). Compare:
- Their reported throughput vs ours.
- Their reported variance.
- The shape of their output.
The lab's homegrown harness will look crude in comparison, and that's the point — the exercise of measuring the difference is more educational than the difference itself.
G. Worst-case scenario discovery
Use coverage-guided fuzzing on the workload generator (with the saturating-decrement invariant as the asserted property) to find a seed/ops/keys combination that maximizes either throughput or memory pressure. This connects perf work to the fuzz/property-test discipline used in db-13 and db-15.
H. Cross-architecture verification
Run the existing scripts/cross_test.sh under qemu-user-static for
aarch64 / x86_64 / riscv64 and confirm the hashes still match. They
should — the wire format is little-endian and the arithmetic is
all 64-bit — but the only way to be sure is to actually do it.
I. Cache-aware redesign of CounterStore
std::map and BTreeMap are node-based trees, and the Go version keeps a
hash map that it sorts at snapshot time. A flat sorted array with binary search would be slower for
insert but dramatically faster for the iteration step (which is the
critical path in dump_snapshot). For a workload that touches each
key only a handful of times before snapshotting, the array would be
worth measuring.
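A flat layout could be sketched in Rust as follows. FlatCounters is a hypothetical type that implements only incr and get, and it ignores u64 overflow for brevity.

```rust
/// Flat alternative to a tree: rows kept sorted by key in one contiguous Vec.
#[derive(Default)]
struct FlatCounters {
    rows: Vec<(i64, u64)>,
}

impl FlatCounters {
    fn incr(&mut self, k: i64, by: u64) {
        match self.rows.binary_search_by_key(&k, |&(key, _)| key) {
            Ok(i) => self.rows[i].1 += by,
            // O(n) shift on insert: the cost paid for cheap, linear iteration.
            Err(i) => self.rows.insert(i, (k, by)),
        }
    }

    fn get(&self, k: i64) -> Option<u64> {
        self.rows
            .binary_search_by_key(&k, |&(key, _)| key)
            .ok()
            .map(|i| self.rows[i].1)
    }
}

fn main() {
    let mut c = FlatCounters::default();
    for (k, by) in [(5_i64, 1_u64), (-2, 3), (5, 4)] {
        c.incr(k, by);
    }
    // Iteration for a snapshot is a linear walk over contiguous memory.
    println!("{:?} get(5)={:?}", c.rows, c.get(5));
}
```

Because rows is always sorted, a snapshot writer can emit it front to back with no extra sort and no pointer chasing.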
J. The "ten percent rule"
A small operational rule we picked up doing this lab: any perf change worth claiming must move the bench number by more than ten percent. Below that, run-to-run noise on a laptop dominates. Above that, you can usually attribute the change to a specific code path. The harness is deliberately not precise enough to defend a 2% claim, and that's a feature.
Step 01 — Counter Store
Goal
Implement a CounterStore in each of three languages with byte-identical
semantics for incr, decr, and get. The data structure is intentionally
small — three operations, two pieces of state — so we can focus on the
edge cases that make cross-language byte-identity hard.
What to build
A type/struct/class CounterStore with:
- An ordered map i64 -> u64 (BTreeMap, sorted-keys map, std::map).
- A u64 running counter total_ops.
- incr(k, by): total_ops += 1; add by to (or create) counters[k].
- decr(k, by): total_ops += 1; if k is missing, stop. Otherwise remove the entry if by >= current, else subtract.
- get(k) -> Option<u64> / (u64, bool) / std::optional<u64>.
Tests this step should pass
- incr_accumulates: three incrs across two keys leave the right per-key values and total_ops == 3.
- decr_saturates_and_removes: incr(1, 5); decr(1, 3); decr(1, 100) leaves the map empty with total_ops == 3.
- decr_on_missing_is_visible_op: decr(42, 1) on an empty store leaves total_ops == 1 and no entry for 42.
Things to watch for
- u64 underflow: never compute current - by without the current <= by check first.
- Go's map: a missing key reads back as the zero value with ok=false. Use the comma-ok form explicitly.
- C++ std::map::operator[]: avoid it on the read path, since it inserts a zero entry as a side effect. Use find.
Acceptance
cargo test --release --lib tests::incr_accumulates and the matching
Go / C++ tests all pass.
Step 02 — Snapshot and Workload
Goal
Pin a wire format for CounterStore and a deterministic workload
generator so that, given identical (seed, ops, keys), all three
implementations produce the same bytes — and therefore the same
SHA-256 digest.
What to build
dump_snapshot
A byte serializer with this exact layout:
"DSEBENCH" (8 bytes, ASCII)
total_ops (u64 little-endian)
distinct_keys (u32 little-endian)
for each key in ascending order:
key (i64 little-endian)
count (u64 little-endian)
Critical details:
- Ascending iteration order. BTreeMap / std::map are already sorted; Go must call sort.Slice on the keys explicitly.
- No padding, no separators, no trailing bytes.
SplitMix64
Implement the standard one-state-word SplitMix64:
state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)
Also implement the stateless splitmix64(x) (without the state +=
step) for the canonical test vector check.
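In Rust, the stateful generator is a direct transcription of the step function. The sketch below uses wrapping arithmetic to make the mod-2^64 behavior explicit; the next_u64 method name is our choice, and no output values are asserted here beyond determinism.

```rust
/// One-word SplitMix64 state, following the step function above.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // wrapping_* makes the mod-2^64 arithmetic explicit; plain `+` and `*`
        // would panic on overflow in a debug build.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut a = SplitMix64::new(42);
    let mut b = SplitMix64::new(42);
    let same = (0..1_000).all(|_| a.next_u64() == b.next_u64());
    println!("same seed, same stream: {same}");
}
```

In Go, uint64 arithmetic wraps by default, and in C++ unsigned overflow is defined to wrap, so only the Rust version needs the explicit wrapping_ calls.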
run_workload(seed, ops, keys, scenario)
rng = SplitMix64(seed)
store = empty CounterStore
repeat ops times:
r1 = rng.next()
r2 = rng.next()
r3 = rng.next()
kind = (r1 >> 62) & 0x3 # 0,1,2 → incr, 3 → decr
k = i64(r2 % keys)
by = (r3 % 100) + 1
if kind == 3 -> store.decr(k, by) else store.incr(k, by)
return store.dump_snapshot()
The scenario argument is reserved and ignored for now.
Tests this step should pass
- sha256_vectors: empty and "abc" SHA-256 vectors.
- splitmix64_known: splitmix64(0) == 0x8b57dafca0cee644.
- snapshot_layout_two_keys: incr keys 2 and 1, snapshot is 52 bytes with magic, total_ops=2, distinct_keys=2, then the row for key 1 before the row for key 2.
- workload_determinism: two runs of the same workload produce byte-identical snapshots.
- scenario_a_frozen / scenario_b_frozen: hashes match the golden values in CONCEPTS.md.
Things to watch for
- Always draw three RNG words per iteration, even if a branch only needs two. The RNG stream must be identical across languages.
- Never iterate a hash map for serialization. Sort first.
- Don't put size_t or usize on the wire; always serialize as u32 or u64.
Acceptance
scripts/cross_test.sh reports === ALL OK ===.
Step 03 — Bench Harness
Goal
Add a bench subcommand to benchctl in each language that runs the
same workload as the hash subcommand and reports a throughput number.
The harness should be small enough to read end-to-end but disciplined
enough not to lie.
What to build
A bench workload --seed N --ops N --keys N --scenario S subcommand
that:
- Runs a warm-up pass of ops/10 + 1 operations and discards the result.
- Captures a high-resolution start timestamp.
- Runs the full ops workload and keeps the resulting CounterStore so we can read distinct from it.
- Captures a high-resolution end timestamp.
- Writes one line to stderr in this format:
  ops=<N> keys=<N> elapsed_us=<N> ops_per_sec=<N> distinct=<N>
- Writes nothing to stdout.
The CLI's hash subcommand must remain unchanged: stdout-only, no
trailing newline, no diagnostic noise.
Timing primitives by language
- Rust: std::time::Instant.
- Go: time.Now() / time.Since().
- C++: std::chrono::steady_clock.
steady_clock / Instant are the right choice: they are monotonic and
not subject to wall-clock adjustments mid-run. Go's time.Now() also
records a monotonic reading, which time.Since uses.
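Put together, the harness is only a few lines. The Rust sketch below takes the workload as a closure so it stays self-contained; the closure signature (operation count in, live-counter count out) is an assumption, and the toy workload in main is not the lab's.

```rust
use std::time::Instant;

/// Runs `workload(n)` once as a discarded warm-up and once timed, then formats
/// the stderr line described above. `workload` returns the number of live
/// counters; it is a hypothetical stand-in for the real run-workload call.
fn bench(ops: u64, keys: u64, mut workload: impl FnMut(u64) -> usize) -> String {
    let _ = workload(ops / 10 + 1); // warm-up, result discarded

    let start = Instant::now();
    let distinct = workload(ops); // the only work inside the timed region
    let elapsed = start.elapsed();

    // Formatting happens after the clock has stopped.
    let elapsed_us = elapsed.as_micros();
    let ops_per_sec = (ops as f64 / elapsed.as_secs_f64().max(1e-9)) as u64;
    format!("ops={ops} keys={keys} elapsed_us={elapsed_us} ops_per_sec={ops_per_sec} distinct={distinct}")
}

fn main() {
    let keys = 1024_u64;
    // Toy workload: bumps slots in a Vec instead of driving a real store.
    let line = bench(100_000, keys, |n| {
        let mut slots = vec![0_u64; keys as usize];
        for i in 0..n {
            slots[(i % keys) as usize] += 1;
        }
        slots.iter().filter(|&&c| c > 0).count()
    });
    eprintln!("{line}"); // stderr only; stdout stays empty
}
```

Returning the formatted line instead of printing inside bench keeps all I/O out of the timed region and makes the output shape easy to check.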
Tests this step should pass
There are no automated tests for bench (timing values can't be
asserted), but the structural sanity check is:
./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
# expect on stderr:
# ops=100000 keys=1024 elapsed_us=<some number> ops_per_sec=<some number> distinct=1024
# expect on stdout: nothing
Things to watch for
- Don't put printf inside the timed region. Allocating a string is ~hundreds of nanoseconds and will dominate small workloads.
- Don't take a timestamp per op. The cost of Now() is comparable to the cost of one workload op.
Now()is comparable to the cost of one workload op. - Don't forget the warm-up. The first pass is dominated by cold-cache effects and first-touch allocator behavior.
- Don't claim numbers across machines without describing the machine.
Acceptance
Running bench against a 100k-op, 1024-key workload produces a
throughput line on stderr and an empty stdout. verify.sh and
cross_test.sh continue to pass.
db-23 — Capstone: distributed replicated KV database
This is the final lab. It synthesizes everything from db-01 through db-22 into a single tiny but real distributed key/value database whose state is byte-identical across Rust, Go, and C++ for two frozen scenarios.
What this lab builds
A 3-node replicated KV cluster:
| Node | Role |
|---|---|
| 0 | Leader. The only node that originates writes. |
| 1 | Follower. Can be taken down mid-run. |
| 2 | Follower. Always up. |
Each write Op (Put or Del) is:
- Drawn deterministically from a SplitMix64 stream (see db-04, db-22).
- Appended to the leader's log at index `log.len() + 1`.
- Replicated synchronously to every live follower.
- Counted as ack'd by every live node whose log already contains that index (plus the leader itself).
- Committed on every reachable node when the ack count reaches a majority of 3 (= 2).
- Applied: each newly-committed entry mutates the local `BTreeMap<i64, i64>` state machine in commit-index order.
A catch_up operation lets a recovering follower copy any missing log
entries from the leader and advance its commit/apply watermark.
Two scenarios — frozen hashes
The cluster snapshot is the canonical encoding of all three nodes concatenated. We hash it with SHA-256.
| Scenario | seed | ops | keys | fault? | SHA-256 |
|---|---|---|---|---|---|
| normal | 42 | 200 | 16 | no | 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296 |
| fault | 7 | 2000 | 128 | follower 1 down on [ops/2, 3·ops/4) | d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792 |
Both hashes are frozen as constants in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and cross-checked by scripts/cross_test.sh.
Deterministic workload
For op i the RNG draws three u64s regardless of branch outcome, so
the stream is identical no matter which kind of op gets generated:
r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 62) & 0x3 // 0,1,2 -> Put, 3 -> Del
k = i64(r2 % keys)
v = i64(r3 % 1000)
The fault schedule is purely a function of ops:
down_start = ops / 2
down_end = (ops * 3) / 4
At i == down_start follower 1 is marked down; at i == down_end it
comes back up and we immediately catch_up. If the loop happens to end
while follower 1 is still down, we catch it up once more at the end so
all three nodes always converge.
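The fault schedule can be sketched as a pure function of `ops`. The `Event` enum and the `schedule` function below are hypothetical names used only for illustration; they record what the driver would do at each step rather than touching a real cluster. Note that with integer division `down_end < ops` whenever `ops > 0`, so in this sketch the terminal catch-up is a safety net that neither frozen scenario actually exercises.

```rust
/// Half-open window [down_start, down_end) during which follower 1 is down.
fn fault_window(ops: u64) -> (u64, u64) {
    (ops / 2, (ops * 3) / 4)
}

/// What the driver does at a given op index, in the order it happens.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    Down,
    UpAndCatchUp,
    Submit,
}

/// Replay the driver loop and record (op index, event) pairs.
fn schedule(ops: u64, with_fault: bool) -> Vec<(u64, Event)> {
    let (down_start, down_end) = fault_window(ops);
    let mut up = true;
    let mut events = Vec::new();
    for i in 0..ops {
        if with_fault && i == down_start {
            up = false;
            events.push((i, Event::Down));
        }
        if with_fault && i == down_end {
            up = true;
            events.push((i, Event::UpAndCatchUp));
        }
        events.push((i, Event::Submit));
    }
    // Terminal catch-up, recorded at index `ops`, if the loop ended mid-window.
    if with_fault && !up {
        events.push((ops, Event::UpAndCatchUp));
    }
    events
}
```

For the frozen fault scenario, `fault_window(2000)` gives `(1000, 1500)`, so follower 1 misses 500 entries before it catches up.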
Per-node canonical encoding
magic : 8 bytes = "DSEDIST2"
node_id : u8
term : u64 LE
commit_index : u64 LE
log_len : u32 LE
log[log_len] of:
term : u64 LE
index : u64 LE
op_kind : u8 (0 = Put, 1 = Del)
key : i64 LE
value : i64 LE (0 for Del)
kv_len : u32 LE
kv[kv_len] of (ascending by key):
key : i64 LE
value : i64 LE
The cluster snapshot is just node0.encode() || node1.encode() || node2.encode(). last_applied is not serialized — after a write
loop completes (with terminal catch-up) it always equals
commit_index, so it carries no extra information.
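A minimal Rust sketch of the per-node layout above, assuming a `Node` shaped roughly like the one built in Step 01. The struct and field names here are illustrative, and the lab's actual types may differ; the byte layout follows the table.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum OpKind {
    Put,
    Del,
}

struct LogEntry {
    term: u64,
    index: u64,
    kind: OpKind,
    key: i64,
    value: i64,
}

struct Node {
    id: u8,
    term: u64,
    commit_index: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    /// Serialize the node field by field, every integer little-endian.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(b"DSEDIST2"); // 8-byte magic
        out.push(self.id);
        out.extend_from_slice(&self.term.to_le_bytes());
        out.extend_from_slice(&self.commit_index.to_le_bytes());
        out.extend_from_slice(&(self.log.len() as u32).to_le_bytes());
        for e in &self.log {
            out.extend_from_slice(&e.term.to_le_bytes());
            out.extend_from_slice(&e.index.to_le_bytes());
            let (kind, value) = match e.kind {
                OpKind::Put => (0u8, e.value),
                OpKind::Del => (1u8, 0), // Del always encodes value 0
            };
            out.push(kind);
            out.extend_from_slice(&e.key.to_le_bytes());
            out.extend_from_slice(&value.to_le_bytes());
        }
        out.extend_from_slice(&(self.kv.len() as u32).to_le_bytes());
        // BTreeMap iterates in ascending key order, so no explicit sort is needed.
        for (k, v) in &self.kv {
            out.extend_from_slice(&k.to_le_bytes());
            out.extend_from_slice(&v.to_le_bytes());
        }
        out
    }
}
```

An empty node encodes to 33 bytes (8 + 1 + 8 + 8 + 4 + 4); each log entry adds 33 bytes and each kv pair adds 16.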
Sources of cross-language divergence — avoided
| Risk | How we eliminate it |
|---|---|
| Map iteration order | Sort i64 keys ascending in Go (sort.Slice); BTreeMap/std::map already ordered in Rust/C++. |
| Endianness | All multi-byte ints written little-endian by hand. |
| RNG branch-skew | Always draw 3 words per op regardless of kind. |
| 32/64-bit int | All wire types are u8/u32/u64/i64; sizes are explicit. |
| Apply order under fault | Apply is gated by a single monotonic commit-index counter, and catch_up is called at well-defined points. |
| 0 value for Del | C++/Go fill v=0; Rust matches with explicit value() returning 0 for Del. |
What this synthesizes from prior labs
| Earlier lab(s) | Used here as |
|---|---|
| db-01 storage primitives | Manual byte-level LE encoding. |
| db-02 data structures | Sorted map state machine. |
| db-03 write-ahead log | The per-node log is the WAL. |
| db-04 hashing | SHA-256 + SplitMix64 PRNG. |
| db-05/06/07/08 LSM stages | Replaced here by a simpler in-memory state machine, but the apply-log-then-mutate pattern is the same. |
| db-13 transactions | Atomic apply per committed entry (no partial state). |
| db-16 distributed fundamentals | Replication, majority quorum, follower catch-up. |
| db-17 raft | Leader-only writes, log indexing, commit watermark. |
| db-22 perf & bench | Deterministic workload + canonical snapshot pattern. |
How to verify
bash scripts/verify.sh # runs all 9 tests in all 3 languages
bash scripts/cross_test.sh # confirms cross-lang + golden equality
Both must end with === OK === and === ALL OK === respectively.
References — db-23 capstone
Replication and consensus
- Diego Ongaro & John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version). ATC 2014. The Raft paper — the leader/log/commit-index model used by this lab is a direct simplification of it.
- Leslie Lamport. Paxos Made Simple. 2001. A short restatement of Paxos, the original majority-quorum consensus algorithm.
- Flavio Junqueira, Benjamin Reed, Marco Serafini. ZAB: High-performance broadcast for primary-backup systems. DSN 2011. Used by ZooKeeper; closest in spirit to the leader-only single-quorum model here.
Theory
- Fischer, Lynch, Paterson. Impossibility of Distributed Consensus with One Faulty Process. JACM 1985. Why deterministic consensus needs failure detectors / partial synchrony.
- Eric Brewer. Towards Robust Distributed Systems. PODC 2000 keynote (CAP conjecture). Gilbert & Lynch later proved it.
- Seth Gilbert & Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT 2002.
Practitioner material
- MIT 6.824 Distributed Systems lectures (esp. lectures 5–8 on Raft).
- Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly 2017. Chs. 5, 8, 9 on replication, consistency, and consensus.
- Kyle Kingsbury. Jepsen reports (https://jepsen.io). Practical examples of how real systems violate the guarantees their READMEs claim.
Isolation testing
- Martin Kleppmann. Hermitage: concrete tests that expose what isolation levels really mean (https://github.com/ept/hermitage).
What this lab does not model
- Leader election (we hardcode node 0 as leader forever).
- Log truncation / divergent suffixes (we use synchronous in-process replication, so followers never have entries the leader lacks).
- Membership changes, log compaction, snapshots, network partitions beyond a single follower being marked down.
Those are the natural follow-on projects after this capstone — see docs/broader-ideas.md.
Analysis — db-23 capstone
Goal restated
Build the smallest possible thing that is honestly a replicated KV database, port it to three languages with byte-identical state, and prove convergence under a deterministic failure scenario.
Design choices
Why synchronous in-process replication?
A real Raft cluster uses goroutines/threads, network RPC, election timeouts, and randomized jitter, all of which are sources of nondeterminism. For a capstone whose entire point is cross-language byte-equality, that nondeterminism would defeat the purpose.
So instead the "network" is a function call. The Cluster's submit
synchronously appends on the leader, appends on every live follower,
and commits if a quorum is reached. This behaves like a Raft
cluster running in lockstep with no message loss or reordering.
Why majority = 2?
3 nodes, so a quorum is 2. The leader counts itself. As long as
either follower is up, the cluster commits. When follower 1 is
marked down, follower 2 + leader still form a quorum. If both
followers were down simultaneously, writes would block — but our fault
schedule never does that, so submit never wedges.
Why a single deterministic leader?
Leader election adds randomness (timeouts) and protocol surface (terms, RequestVote). We pin node 0 as the perpetual leader. The lab still shows the replication half of Raft faithfully; election is left as a follow-on (see broader-ideas.md).
Why three RNG draws per op, including for Del?
If we drew fewer words on Del branches, the RNG stream would advance
differently for runs that happen to produce more Dels, and frozen
hashes would depend on the kind distribution. By always consuming
exactly three words we ensure the stream depends only on seed and
ops, not on what kinds happened.
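The fixed three-draw rule can be illustrated with a small Rust sketch. The generator below uses the constants of the widely published SplitMix64 algorithm; whether the lab's db-04 version seeds its state in exactly the same way is an assumption, and the `Op` enum here is a simplified stand-in for the lab's type.

```rust
/// SplitMix64 as commonly published; the seeding convention is an assumption.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Put { k: i64, v: i64 },
    Del { k: i64 },
}

/// Generate one op. Always consumes exactly three words, even for Del,
/// so the RNG position after N ops never depends on the Put/Del mix.
fn step_op(rng: &mut SplitMix64, keys: u64) -> Op {
    let (r1, r2, r3) = (rng.next(), rng.next(), rng.next());
    let k = (r2 % keys) as i64;
    let v = (r3 % 1000) as i64;
    if (r1 >> 62) & 0x3 == 3 {
        Op::Del { k } // r3 was still drawn; it is simply unused here
    } else {
        Op::Put { k, v }
    }
}
```

After 100 calls to `step_op` the generator is in the same state as after 300 raw `next()` calls, whatever kinds were produced.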
Why drop last_applied from the wire format?
After the final catch_up (which runs unconditionally if follower 1
ended down), every node satisfies last_applied == commit_index.
Including it in the encoding would waste bytes and risk a Rust/Go/C++
divergence if one of them computed it slightly differently mid-run.
It is a derived quantity, so we omit it.
Failure model
The only fault we inject is a single follower going down for one quarter of the run:
[0, ops/2) all three nodes replicate
[ops/2, 3·ops/4) follower 1 down; quorum is {0, 2}
[3·ops/4, ops) follower 1 up + caught up; all three replicate
end final catch_up if still mid-down (handles ops%4)
This produces a clean, hashable post-condition: every node has the same log, the same commit_index, and the same kv map.
Why two scenarios?
- normal (no fault) shows the happy path and stresses the commit path under a small workload.
- fault (with the follower window) stresses replication under partial availability and the catch-up code path. The 2000-op size makes the fault window long enough to accumulate hundreds of entries that the recovering follower must replay.
Both must produce the same hash in all three languages.
Execution — db-23 capstone
Build & test
# everything
bash scripts/verify.sh
# cross-language identity check
bash scripts/cross_test.sh
verify.sh runs the 9-test suite in each of Rust, Go, and C++ and
ends with === OK ===. cross_test.sh builds three dbctl binaries,
runs both scenarios in each language, asserts equality across the
three languages, and asserts each matches the frozen golden hash, then
ends with === ALL OK ===.
Per-language one-liners
# Rust
( cd src/rust && cargo test --release --lib tests )
( cd src/rust && cargo run --release --bin dbctl -- \
hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )
# Go
( cd src/go && go test ./... )
( cd src/go && go run ./cmd/dbctl \
hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )
# C++
( cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j )
src/cpp/build/test_db23
src/cpp/build/dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo
CLI shape
dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault>
- Prints the SHA-256 hex of the cluster snapshot to stdout.
- Writes no trailing newline (matches db-22 convention so shell comparisons stay simple).
- Exits 2 on bad arguments.
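One way to parse this CLI shape in Rust without external crates is sketched below. `Args`, `parse_args`, and `exit_code` are hypothetical names for this example, and rejecting `--keys 0` is an assumption made here (it would otherwise cause a division by zero in `r2 % keys`), not something the lab specifies.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Args {
    seed: u64,
    ops: u64,
    keys: u64,
    fault: bool,
}

/// Parse `hash workload --seed N --ops N --keys N --scenario <normal|fault>`.
/// `argv` excludes the program name. Returns None on any malformed input.
fn parse_args(argv: &[&str]) -> Option<Args> {
    let rest = match argv {
        ["hash", "workload", rest @ ..] => rest,
        _ => return None,
    };
    // Flags come as (name, value) pairs, so an odd count is an error.
    if rest.len() % 2 != 0 {
        return None;
    }
    let (mut seed, mut ops, mut keys, mut fault) = (None, None, None, None);
    for pair in rest.chunks(2) {
        match pair[0] {
            "--seed" => seed = Some(pair[1].parse::<u64>().ok()?),
            "--ops" => ops = Some(pair[1].parse::<u64>().ok()?),
            "--keys" => keys = Some(pair[1].parse::<u64>().ok()?),
            "--scenario" => {
                fault = Some(match pair[1] {
                    "normal" => false,
                    "fault" => true,
                    _ => return None,
                })
            }
            _ => return None,
        }
    }
    // Assumption of this sketch: keys == 0 is treated as a bad argument.
    let keys = keys.filter(|&k| k > 0)?;
    Some(Args { seed: seed?, ops: ops?, keys, fault: fault? })
}

/// Exit status the binary would use: 0 when arguments parse, 2 otherwise.
fn exit_code(argv: &[&str]) -> i32 {
    if parse_args(argv).is_some() { 0 } else { 2 }
}
```

A `main` built on this would collect `std::env::args().skip(1)`, call `parse_args`, and on `None` call `std::process::exit(2)`; on success it would write the hash with `print!` rather than `println!` to avoid the trailing newline.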
Frozen scenarios
| Scenario | Command |
|---|---|
| normal | dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal |
| fault | dbctl hash workload --seed 7 --ops 2000 --keys 128 --scenario fault |
Expected outputs:
normal: 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
fault : d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
Observation — db-23 capstone
What we measured during development
1. Log + commit_index advance in lock-step on happy path
Three submits with no fault:
after submit Put(1,100): log=[1] commit=1 kv={1:100} (all 3 nodes)
after submit Put(2,200): log=[1,2] commit=2 kv={1:100,2:200}
after submit Del(1): log=[1,2,3] commit=3 kv={2:200}
Each submit returns synchronously with all three nodes already in
the post-state. This is the put_then_del_replicates test.
2. Quorum still progresses with one follower down
Take follower 1 down between submits. Leader + follower 2 still form a quorum:
follower 1 down.
submit Put(2,2): leader.commit=2 follower2.commit=2 follower1.commit=1
submit Put(3,3): leader.commit=3 follower2.commit=3 follower1.commit=1
This is the fault_window_then_catchup_converges test. After
catch_up(1):
follower1.log.len = 3, follower1.commit = 3, follower1.kv = {2:2, 3:3}
3. The snapshot is byte-identical across languages
For both frozen scenarios:
[normal] rust=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] go =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] cpp =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] gold=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[fault] rust=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault] go =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault] cpp =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault] gold=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
This is what scripts/cross_test.sh prints on success.
4. Snapshot size for a 1-write cluster
8 magic + 1 id + 8 term + 8 commit + 4 log_len
+ 1 entry of (8+8+1+8+8) = 33
+ 4 kv_len + 1 kv of (8+8) = 20
= 82 bytes per node × 3 nodes = 246 bytes total
Verified in snapshot_layout_smoke tests in all three languages.
What we did not observe
- Any divergence between languages, ever, on either scenario.
- Any nondeterminism within a single language (each scenario run twice in the determinism tests).
- Any case where a follower's log moved ahead of the leader — by construction, only the leader appends new entries; followers only ever copy.
Caveat
The cluster is in-process. We cannot observe real network behavior: no message loss, reordering, or partial partitions. The lab models replication semantics under controlled failures, not network robustness. The latter is left to the follow-on projects in docs/broader-ideas.md.
Verification — db-23 capstone
Acceptance criteria
| # | Property | Where checked |
|---|---|---|
| 1 | SHA-256 implementation matches NIST vectors. | sha256_vectors test, all 3 langs. |
| 2 | SplitMix64 matches the known value splitmix64(0). | splitmix64_known test, all 3 langs. |
| 3 | Happy-path Put/Del fully replicates and applies on every node. | put_then_del_replicates test, all 3 langs. |
| 4 | After a fault window + catch_up, all three nodes converge. | fault_window_then_catchup_converges test, all 3 langs. |
| 5 | Per-node snapshot layout is exactly 82 bytes for a 1-op cluster. | snapshot_layout_smoke test, all 3 langs. |
| 6 | The normal scenario is deterministic (two runs hash-equal). | workload_is_deterministic test, all 3 langs. |
| 7 | The fault scenario is deterministic. | fault_scenario_is_deterministic test, all 3 langs. |
| 8 | Normal scenario hashes to the frozen golden. | scenario_normal_frozen test, all 3 langs + cross_test.sh. |
| 9 | Fault scenario hashes to the frozen golden. | scenario_fault_frozen test, all 3 langs + cross_test.sh. |
Each language runs its own copy of the 9 tests, so the suite total is
27 test runs (9 properties × 3 languages), plus 6 hash-equality checks
across languages (3 langs × 2 scenarios) in cross_test.sh.
How to run
bash scripts/verify.sh # ends with === OK ===
bash scripts/cross_test.sh # ends with === ALL OK ===
Failure-mode triage
| Symptom | Likely cause |
|---|---|
| Rust passes, Go/C++ fails frozen test | Map iteration order — confirm Go sorts keys, confirm std::map (not std::unordered_map). |
| All three languages disagree on the same scenario | RNG-stream drift — check that step_op draws exactly 3 words regardless of kind. |
| Determinism test fails within one language | Some hidden non-determinism (HashMap, address ordering). Switch to ordered map. |
| Snapshot length wrong | Off-by-one in log_len/kv_len u32, or wrong endian. |
| Fault test fails only in C++ | Probably unsigned char vs char in MAGIC comparison, or signed arithmetic on i64. |
Frozen hashes (locked)
HASH_NORMAL = 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
HASH_FAULT = d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
These constants live in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and are also hard-coded in scripts/cross_test.sh. Changing any byte of the wire format requires regenerating every copy in lock-step.
Broader ideas — what to build next
This capstone is a deliberately minimal replicated KV. Here are the natural follow-on projects, in roughly increasing scope:
1. Leader election
Replace "node 0 is leader forever" with a Raft-style election: randomized timeouts, terms, RequestVote, log-completeness check. Determinism becomes hard the moment timers exist, so frozen-hash testing must be replaced with invariant-style testing (e.g. "every successful read returns a value from the leader's committed log").
2. Real network
Move from synchronous function calls to in-memory channels first, then to TCP RPC, then to UDP with retransmission. At each layer add the corresponding failure injection (drop, reorder, duplicate, delay) and re-verify safety invariants.
3. Log compaction & snapshots
Today catch_up replays the entire leader log. For a long-running
cluster this is infeasible; add Raft-style snapshots: leader sends a
full kv state plus the index it represents, follower installs that,
then resumes from lastIncludedIndex + 1.
4. Membership changes
Add a Reconfigure op that mutates the cluster set. Use the
joint-consensus or single-server membership change algorithms.
5. Read consistency levels
- Stale read: any follower answers from its local kv.
- Read-your-writes: client reads from leader.
- Linearizable read: leader confirms it is still leader via a heartbeat to a quorum before answering, or uses Raft's ReadIndex / lease read.
6. Multi-shard / sharded KV
Use a hash of the key to pick a shard; each shard is its own 3-node Raft group. Add a meta-shard that owns the shard map. TiKV and CockroachDB follow this multi-group design (sharding by key range rather than by hash), and Spanner does the same with Paxos groups.
7. Transactions across shards
Layer 2PC (with a transaction coordinator log) over the shard groups. Or do Percolator-style snapshot isolation. Or go full Spanner with TrueTime.
8. Jepsen-style testing
Property-based testing with random clients, random faults (partitions, clock skew, node kills), and a linearizability checker (Knossos or Porcupine).
9. Replace the in-process state machine
Plug in the LSM from db-09 or the B-tree from db-15 as the underlying KV store. The replication layer (this lab) shouldn't have to change.
10. Geo-replication
A second tier of replication across regions, with the per-region cluster acting as a single logical replica. Conflict resolution becomes the central question.
Step 01 — Cluster and log
Goal
Define the three core types — Op, LogEntry, Node — and the
container Cluster that holds three nodes. No replication yet; the
leader appends to its own log only.
Tasks
- Define `OpKind` as `Put | Del` and `Op { kind, k: i64, v: i64 }`.
- Define `LogEntry { term: u64, index: u64, op: Op }`.
- Define `Node { id: u8, term: u64, commit_index: u64, last_applied: u64, log: Vec<LogEntry>, kv: Map<i64,i64> }`.
- Implement `Node::append` requiring `entry.index == log.len() + 1`, with idempotent re-acceptance of an already-present index (used later by `catch_up`).
- Implement `Node::apply_committed`: while `last_applied < commit_index`, apply `log[last_applied]` to `kv` and increment.
- Implement `Node::encode` with the canonical wire format from CONCEPTS.md.
- Implement `Cluster::new` with three nodes (ids 0, 1, 2) all marked up, and `Cluster::encode_snapshot` = concat of all three encodings.
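The core types and the two log operations from the tasks above might look like this in Rust. It is a sketch under stated assumptions: `append` returning `bool` is a choice made for this example, and treating an already-present index as a silent no-op (without comparing the entry's contents) is one plausible reading of "idempotent re-acceptance".

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpKind {
    Put,
    Del,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Op {
    kind: OpKind,
    k: i64,
    v: i64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LogEntry {
    term: u64,
    index: u64,
    op: Op,
}

#[derive(Debug, Default)]
struct Node {
    id: u8,
    term: u64,
    commit_index: u64,
    last_applied: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    /// Accept the next contiguous entry, or re-accept an index already held.
    /// Returns false if the entry would leave a gap in the log.
    fn append(&mut self, e: LogEntry) -> bool {
        let next = self.log.len() as u64 + 1;
        if e.index == next {
            self.log.push(e);
            true
        } else {
            // Already present: idempotent no-op. Index 0 or a gap is rejected.
            e.index >= 1 && e.index < next
        }
    }

    /// Apply every committed-but-unapplied entry, in index order.
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            // Indexes are 1-based, so log[last_applied] holds entry last_applied + 1.
            let op = self.log[self.last_applied as usize].op;
            match op.kind {
                OpKind::Put => {
                    self.kv.insert(op.k, op.v);
                }
                OpKind::Del => {
                    self.kv.remove(&op.k);
                }
            }
            self.last_applied += 1;
        }
    }
}
```

Keeping `commit_index` and `last_applied` separate means `apply_committed` can be called any number of times without re-applying an entry.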
Acceptance
- `snapshot_layout_smoke` test passes in all three languages.
- An empty cluster's snapshot has length `3 × (8+1+8+8+4+4) = 99` bytes.
Pitfalls
- Go map iteration order is undefined, so sort keys before encoding.
- Use `std::map` (ordered) in C++, NOT `std::unordered_map`.
- All multi-byte ints are little-endian.
- `v` for a `Del` op encodes as `0`.
Step 02 — Replication and commit
Goal
Wire Cluster::submit so that one Op propagates from the leader to
every live follower, advances commit_index on majority, and applies
into the local kv state.
Tasks
- In `submit(op)`:
  - Compute `leader_idx = leader.log.len() + 1`.
  - Build `LogEntry { term: leader.term, index: leader_idx, op }`.
  - Append on the leader (must succeed).
  - For each follower id 1, 2: if `up[fid]`, append on that follower.
- Count acks: start at 1 (leader), then `+1` for each `up` follower whose `log.len() >= leader_idx`.
- If `acks >= 2` (majority of 3):
  - Set `leader.commit_index = leader_idx`; call `leader.apply_committed`.
  - For each follower whose `log.len() >= leader_idx`, set its `commit_index` to `leader_idx` and call `apply_committed`.
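The submit path described in these tasks can be sketched in Rust as below. The types are deliberately simplified (a two-variant `Op`, no node ids), `submit` returning `bool` is an assumption of this example, and a follower only accepts an entry that is contiguous with its own log, mirroring the `append` rule from Step 01.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum Op {
    Put(i64, i64),
    Del(i64),
}

#[derive(Clone, Copy)]
struct LogEntry {
    term: u64,
    index: u64,
    op: Op,
}

#[derive(Default)]
struct Node {
    term: u64,
    commit_index: u64,
    last_applied: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            match self.log[self.last_applied as usize].op {
                Op::Put(k, v) => {
                    self.kv.insert(k, v);
                }
                Op::Del(k) => {
                    self.kv.remove(&k);
                }
            }
            self.last_applied += 1;
        }
    }
}

struct Cluster {
    nodes: [Node; 3], // nodes[0] is the leader
    up: [bool; 3],
}

impl Cluster {
    fn new() -> Self {
        Cluster { nodes: [Node::default(), Node::default(), Node::default()], up: [true; 3] }
    }

    /// Append on the leader, replicate to live followers, commit on majority.
    /// Returns true if the entry was committed.
    fn submit(&mut self, op: Op) -> bool {
        let idx = self.nodes[0].log.len() as u64 + 1;
        let entry = LogEntry { term: self.nodes[0].term, index: idx, op };
        self.nodes[0].log.push(entry);
        let mut acks = 1; // the leader counts itself
        for fid in 1..3 {
            let is_up = self.up[fid];
            let f = &mut self.nodes[fid];
            // A live follower accepts only the next contiguous index.
            if is_up && f.log.len() as u64 + 1 == idx {
                f.log.push(entry);
            }
            if is_up && f.log.len() as u64 >= idx {
                acks += 1;
            }
        }
        if acks < 2 {
            return false; // no majority of 3: entry stays uncommitted
        }
        for id in 0..3 {
            let reachable = id == 0 || self.up[id];
            let n = &mut self.nodes[id];
            // Bump commit_index first, then apply; never commit a missing entry.
            if reachable && n.log.len() as u64 >= idx {
                n.commit_index = idx;
                n.apply_committed();
            }
        }
        true
    }
}
```

With follower 1 marked down, the leader and follower 2 keep committing while follower 1's `commit_index` stays where it was, which is the behavior the fault scenario depends on.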
Acceptance
- `put_then_del_replicates` test passes in all three languages.
- After three submits in a row to a fresh cluster, every node has `log.len() == 3` and `commit_index == 3`.
Pitfalls
- Don't advance `commit_index` on a follower that hasn't received the entry; that's how silent divergence happens.
- The leader always advances on a majority, even if a follower hasn't ack'd, because the leader itself counts.
- `apply_committed` must be called after `commit_index` is bumped, not before.
Step 03 — Fault injection and catch-up
Goal
Add the failure-injection schedule, the catch_up operation, and the
top-level run_cluster workload driver — completing the lab.
Tasks
- Implement `Cluster::set_follower_up(fid, up)` (assert `fid` is 1 or 2, never 0).
- Implement `Cluster::catch_up(fid)`:
  - Snapshot the leader's `log` and `commit_index`.
  - While the follower's `log.len()` is less than the leader's, append `leader_log[fol.log.len()]` to the follower.
  - If the follower's `commit_index` is below the leader's, set it to the leader's and `apply_committed`.
- Implement `step_op(rng, keys)`:
  - Draw `r1, r2, r3 = rng.next()` (always three).
  - `kind = (r1 >> 62) & 0x3`; `0,1,2 → Put`, `3 → Del`.
  - `k = i64(r2 % keys)`, `v = i64(r3 % 1000)`.
- Implement `run_cluster(seed, ops, keys, scenario)`:
  - `down_start = ops/2`, `down_end = (ops*3)/4`, `with_fault = (scenario == "fault")`.
  - For `i in 0..ops`:
    - If `with_fault && i == down_start`: set follower 1 down.
    - If `with_fault && i == down_end`: set follower 1 up, then `catch_up(1)`.
    - `submit(step_op(rng, keys))`.
  - After the loop: if `with_fault && !up[1]`, set follower 1 up and `catch_up(1)`. (Handles `ops % 4 != 0`.)
- Write a `dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault>` CLI that prints the SHA-256 hex of `run_cluster(...).encode_snapshot()` with no trailing newline.
- Freeze the two scenario hashes as named constants and assert them in two tests per language. Cross-check with `scripts/cross_test.sh`.
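The `catch_up` task is the one most likely to trip the Rust borrow checker, so here is a minimal sketch of it. To stay short it uses a stripped-down `Node` whose log entries are `(key, Some(value))` for Put and `(key, None)` for Del; that representation, and passing the node array directly instead of a `Cluster`, are simplifications made for this example only.

```rust
use std::collections::BTreeMap;

/// Minimal node for this sketch: (key, Some(value)) = Put, (key, None) = Del.
#[derive(Default, Clone)]
struct Node {
    commit_index: u64,
    last_applied: u64,
    log: Vec<(i64, Option<i64>)>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            match self.log[self.last_applied as usize] {
                (k, Some(v)) => {
                    self.kv.insert(k, v);
                }
                (k, None) => {
                    self.kv.remove(&k);
                }
            }
            self.last_applied += 1;
        }
    }
}

/// Bring follower `nodes[fid]` level with the leader at `nodes[0]`.
fn catch_up(nodes: &mut [Node; 3], fid: usize) {
    assert!(fid == 1 || fid == 2, "only followers can catch up");
    // Copy the leader's state first, so no borrow of nodes[0] is alive
    // while nodes[fid] is being mutated.
    let leader_log = nodes[0].log.clone();
    let leader_commit = nodes[0].commit_index;
    let fol = &mut nodes[fid];
    // Copy missing entries one at a time, always the next contiguous index.
    while fol.log.len() < leader_log.len() {
        let next = leader_log[fol.log.len()];
        fol.log.push(next);
    }
    // Advance the commit watermark, then replay the newly committed suffix.
    if fol.commit_index < leader_commit {
        fol.commit_index = leader_commit;
        fol.apply_committed();
    }
}
```

Cloning the whole leader log is the simplest way to satisfy the borrow checker; `split_at_mut` would avoid the copy, at the cost of a little more index bookkeeping.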
Acceptance
- `verify.sh` ends with `=== OK ===`.
- `cross_test.sh` ends with `=== ALL OK ===`.
- The two frozen hashes match across Rust, Go, and C++:
  - `5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296` (normal, seed=42 ops=200 keys=16)
  - `d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792` (fault, seed=7 ops=2000 keys=128)
Pitfalls
- Drawing fewer RNG words on the `Del` branch will silently desync hashes, so always draw three.
- The post-loop catch-up matters: if the run ends inside the down window, follower 1 still needs to converge.
- `catch_up` must clone the leader's log first; mutating both at once in Rust requires careful borrow handling.
- The "ack on `up[fid]` only" rule is essential: a down follower contributes zero acks regardless of its log length.